Hands-On: Building a Vector Database Pipeline with OCI and Open-Source Embeddings

Introduction

Vector databases are rapidly becoming a central element in AI workflows: storing embeddings (numeric vector representations of text, images or other data) and enabling semantic similarity search. In this post you’ll walk through a hands-on example of building a vector-db pipeline on Oracle Database 23 ai (or Autonomous/AI Database on Oracle Cloud Infrastructure) that covers:

  1. Generating embeddings with an open-source model.
  2. Loading embeddings into the vector-enabled database.
  3. Constructing vector indexes and performing similarity queries.
  4. Integrating with metadata to produce hybrid search.
  5. Discussing performance, scalability, maintenance and best practices.

I’ve reviewed the articles on Osama’s blog—while he covers vector search in theory (data type, index, RAG) you’ll find this one emphasises step-by-step code, pipeline creation and hybrid-search use-case, so it should not overlap.

1. Pipeline Overview

Here’s the architecture of the pipeline we’ll build:

  • Data source: A set of documents (in this example, internal knowledge articles).
  • Embedding generation: Use an open-source sentence-transformer (e.g., all-MiniLM-L12-v2) to convert each document text → a vector of dimension 384.
  • Storage: Use Oracle’s VECTOR data type in a table that also holds metadata (title, date, department).
  • Indexing: Create a vector index (approximate nearest-neighbour) for fast similarity search.
  • Querying: Accept a search query (text), embed it, and run a similarity search among documents. Combine vector similarity with metadata filters (e.g., department = “Legal”).
  • Serving: Return top K results ranked by semantic similarity and metadata weight.

Here is a conceptual diagram:

Text documents → embedding model → store (id, metadata, vector) → build index  
Search query → embedding → query vector + metadata filter → results  

2. Setup & Embedding Generation

Prerequisites

  • Provision Oracle Database 23 ai / AI Database on OCI (or a sharded/VM setup supporting VECTOR type).
  • Ensure the database supports the VECTOR column type and vector indexing.
  • Python environment with sentence-transformers and cx_Oracle or oracledb driver.

Embedding generation (Python)

from sentence_transformers import SentenceTransformer
import oracledb

# Load model
model = SentenceTransformer('all-MiniLM-L12-v2')

# Sample documents
docs = [
    {"id": 1, "title": "Employee onboarding policy", "dept": "HR", "text": "..."},
    {"id": 2, "title": "Vendor contract guidelines", "dept": "Legal", "text": "..."},
    # … more rows
]

# Generate embeddings
for doc in docs:
    vec = model.encode(doc['text']).tolist()
    doc['embed'] = vec

# Connect to Oracle DB
conn = oracledb.connect(user="vector_usr", password="pwd", dsn="your_dsn")
cursor = conn.cursor()

# Create table
cursor.execute("""
  CREATE TABLE kb_documents (
    doc_id     NUMBER PRIMARY KEY,
    title      VARCHAR2(500),
    dept       VARCHAR2(100),
    content    CLOB,
    doc_vector VECTOR
  )
""")
conn.commit()

# Insert rows
for doc in docs:
    cursor.execute("""
      INSERT INTO kb_documents(doc_id, title, dept, content, doc_vector)
      VALUES(:1, :2, :3, :4, :5)
    """, (doc['id'], doc['title'], doc['dept'], doc['text'], doc['embed']))
conn.commit()

Why this matters

  • You store both business metadata (title, dept) and embedding (vector) in the same table — enabling hybrid queries (metadata + similarity).
  • Using a stable, open-source embedding model ensures reproducible vectors; you can later upgrade model version and re-embed to evolve.

3. Vector Indexing & Similarity Querying

Create vector index

Once vectors are stored, you create a vector index for fast search.

CREATE INDEX idx_kb_vector 
  ON kb_documents(doc_vector)
  INDEXTYPE IS vector_ann 
  PARAMETERS('distance_metric=cosine, dimension=384');

Running a query: semantic search + metadata filter

Suppose you want to search: “vendor termination risk” but only within dept = “Legal”.

query = "vendor termination risk"
query_vec = model.encode([query]).tolist()[0]

cursor.execute("""
  SELECT doc_id, title, dept, vector_distance(doc_vector, :qv) AS dist
  FROM kb_documents
  WHERE dept = 'Legal'
  ORDER BY vector_distance(doc_vector, :qv)
  FETCH FIRST 5 ROWS ONLY
""", {"qv": query_vec})

for row in cursor:
    print(row)

Explanation

  • vector_distance computes similarity (lower = more similar, for cosine-distance variant).
  • We combine a standard filter WHERE dept = 'Legal' with the vector search.
  • The result returns the closest (by meaning) documents among the “Legal” department.

4. Enhancements & Production Considerations

Chunking & embedding size

  • For large documents (e.g., whitepapers), chunk them into ~512 token segments before embedding; store each segment as a separate row with parent document id.
  • Maintain model_version column so you can know which embedding model was used.

Hybrid ranking

You may want to combine semantic similarity + recency or popularity. For example:

SELECT doc_id, title,
       vector_distance(doc_vector, :qv) * 0.7 + (extract(day from (sysdate - created_date))/365)*0.3 AS score
FROM kb_documents
WHERE dept = 'Legal'
ORDER BY score
FETCH FIRST 5 ROWS ONLY

Here you give 70% weight to semantic distance, 30% to longer-living documents (older documents get scored higher in this case). Adjust weights based on business logic.

Scaling

  • With millions of vectors, approximate nearest-neighbour (ANN) indexing is crucial; tune index parameters such as ef_search, nlist.
  • Monitor latency of vector_distance queries, and monitor index size/maintenance cost.
  • Consider sharding or partitioning the embedding table (by dept, date) if usage grows.

Maintenance

  • When you retrain or change model version: re-compute embeddings, drop and rebuild indexes.
  • Monitor performance drift: track metrics like top-K retrieval relevance, query latency, user feedback.
  • Maintain metadata hygiene: e.g., ensure each row has a valid dept, tag, creation date.

Regards
Osama

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.