Introduction
Vector databases are rapidly becoming a central element in AI workflows: they store embeddings (numeric vector representations of text, images, or other data) and enable semantic similarity search. In this post you’ll walk through a hands-on example of building a vector-database pipeline on Oracle Database 23ai (or Autonomous/AI Database on Oracle Cloud Infrastructure) that covers:
- Generating embeddings with an open-source model.
- Loading embeddings into the vector-enabled database.
- Constructing vector indexes and performing similarity queries.
- Integrating with metadata to produce hybrid search.
- Discussing performance, scalability, maintenance and best practices.
1. Pipeline Overview
Here’s the architecture of the pipeline we’ll build:
- Data source: A set of documents (in this example, internal knowledge articles).
- Embedding generation: Use an open-source sentence-transformer (e.g., all-MiniLM-L12-v2) to convert each document’s text into a vector of dimension 384.
- Storage: Use Oracle’s VECTOR data type in a table that also holds metadata (title, date, department).
- Indexing: Create a vector index (approximate nearest-neighbour) for fast similarity search.
- Querying: Accept a search query (text), embed it, and run a similarity search among documents. Combine vector similarity with metadata filters (e.g., department = “Legal”).
- Serving: Return top K results ranked by semantic similarity and metadata weight.
Here is a conceptual diagram:
Text documents → embedding model → store (id, metadata, vector) → build index
Search query → embedding → query vector + metadata filter → results
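Before touching the database, the whole flow can be sketched in memory. The hash-based fake_embed below is only a stand-in for the real sentence-transformer, and the dict stands in for the Oracle table; everything else mirrors the ingest → index → query flow above:

```python
import math

def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for the real sentence-transformer: deterministic pseudo-embedding."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-normalised

# "store": doc_id -> (metadata, vector); stands in for the Oracle table
store = {}

def ingest(doc_id, title, dept, text):
    store[doc_id] = ({"title": title, "dept": dept}, fake_embed(text))

def search(query, dept=None, k=3):
    qv = fake_embed(query)
    hits = [(1.0 - sum(a * b for a, b in zip(qv, vec)), doc_id, meta)  # cosine distance
            for doc_id, (meta, vec) in store.items()
            if dept is None or meta["dept"] == dept]
    return sorted(hits, key=lambda h: h[0])[:k]

ingest(1, "Onboarding policy", "HR", "employee onboarding steps")
ingest(2, "Vendor contracts", "Legal", "vendor contract termination clauses")
print(search("contract termination", dept="Legal"))
```

The rest of the post replaces fake_embed with a real model and the dict with a VECTOR-typed table.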
2. Setup & Embedding Generation
Prerequisites
- Provision Oracle Database 23ai / AI Database on OCI (or a sharded/VM setup supporting the VECTOR type).
- Ensure the database supports the VECTOR column type and vector indexing.
- Python environment with sentence-transformers and the cx_Oracle or oracledb driver.
Embedding generation (Python)
from sentence_transformers import SentenceTransformer
import array
import oracledb

# Load model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L12-v2')

# Sample documents
docs = [
    {"id": 1, "title": "Employee onboarding policy", "dept": "HR", "text": "..."},
    {"id": 2, "title": "Vendor contract guidelines", "dept": "Legal", "text": "..."},
    # … more rows
]

# Generate embeddings
for doc in docs:
    doc['embed'] = model.encode(doc['text']).tolist()

# Connect to Oracle DB
conn = oracledb.connect(user="vector_usr", password="pwd", dsn="your_dsn")
cursor = conn.cursor()

# Create table -- declare the dimension and element format explicitly
cursor.execute("""
    CREATE TABLE kb_documents (
        doc_id NUMBER PRIMARY KEY,
        title VARCHAR2(500),
        dept VARCHAR2(100),
        content CLOB,
        doc_vector VECTOR(384, FLOAT32)
    )
""")

# Insert rows (python-oracledb accepts a float32 array.array for VECTOR binds;
# plain lists may also work depending on driver version)
for doc in docs:
    cursor.execute("""
        INSERT INTO kb_documents(doc_id, title, dept, content, doc_vector)
        VALUES(:1, :2, :3, :4, :5)
    """, (doc['id'], doc['title'], doc['dept'], doc['text'],
          array.array('f', doc['embed'])))
conn.commit()
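For larger document sets, per-row execute() becomes a bottleneck; cursor.executemany() batches far better. A small helper that prepares bind rows, converting each embedding to the float32 array.array that recent python-oracledb versions bind to VECTOR columns (the sample docs structure here is illustrative):

```python
import array

def to_vec32(values):
    """Convert a list of floats to a float32 array for VECTOR binding."""
    return array.array("f", values)

docs = [
    {"id": 1, "title": "Employee onboarding policy", "dept": "HR",
     "text": "...", "embed": [0.1, 0.2, 0.3]},
]

# Prepare one bind tuple per document
rows = [(d["id"], d["title"], d["dept"], d["text"], to_vec32(d["embed"]))
        for d in docs]

# cursor.executemany(
#     "INSERT INTO kb_documents(doc_id, title, dept, content, doc_vector) "
#     "VALUES(:1, :2, :3, :4, :5)", rows)
```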
Why this matters
- You store both business metadata (title, dept) and embedding (vector) in the same table — enabling hybrid queries (metadata + similarity).
- Using a stable, open-source embedding model ensures reproducible vectors; you can later upgrade the model version and re-embed as the corpus evolves.
3. Vector Indexing & Similarity Querying
Create vector index
Once vectors are stored, you create a vector index for fast search.
CREATE VECTOR INDEX idx_kb_vector
ON kb_documents (doc_vector)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;
This creates an IVF-style (neighbor-partitions) index; for an in-memory HNSW graph, use ORGANIZATION INMEMORY NEIGHBOR GRAPH instead.
Running a query: semantic search + metadata filter
Suppose you want to search: “vendor termination risk” but only within dept = “Legal”.
query = "vendor termination risk"
query_vec = model.encode([query]).tolist()[0]
cursor.execute("""
SELECT doc_id, title, dept, vector_distance(doc_vector, :qv) AS dist
FROM kb_documents
WHERE dept = 'Legal'
ORDER BY vector_distance(doc_vector, :qv)
FETCH FIRST 5 ROWS ONLY
""", {"qv": query_vec})
for row in cursor:
print(row)
Explanation
- vector_distance computes the cosine distance (lower = more similar).
- We combine a standard filter WHERE dept = 'Legal' with the vector search.
- The result returns the closest (by meaning) documents within the “Legal” department.
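The “lower = more similar” behaviour is easy to verify outside the database: cosine distance is 1 minus cosine similarity, which is what VECTOR_DISTANCE(…, COSINE) computes. A pure-Python re-implementation for illustration (not the database function itself):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for same direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

v = [0.2, 0.9, 0.1]
print(cosine_distance(v, v))                    # same direction: ~0.0
print(cosine_distance(v, [-x for x in v]))      # opposite direction: ~2.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 1.0
```

Note that cosine distance ignores vector magnitude, which is why it suits embeddings that encode meaning in direction.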
4. Enhancements & Production Considerations
Chunking & embedding size
- For large documents (e.g., whitepapers), chunk them into ~512-token segments before embedding; store each segment as a separate row with the parent document id.
- Maintain a model_version column so you know which embedding model produced each vector.
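A minimal chunker along those lines, using word count as a rough proxy for tokens (a production pipeline would use the model’s own tokenizer) and keeping the parent document id on each chunk:

```python
def chunk_document(doc_id: int, text: str, max_words: int = 350, overlap: int = 50):
    """Split text into overlapping word-window chunks, each tagged with its parent id."""
    words = text.split()
    chunks, start, seq = [], 0, 0
    while start < len(words):
        piece = " ".join(words[start:start + max_words])
        chunks.append({"parent_id": doc_id, "chunk_seq": seq, "text": piece})
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # overlap preserves context across boundaries
        seq += 1
    return chunks

# An 800-word document becomes three overlapping chunks
chunks = chunk_document(42, "word " * 800, max_words=350, overlap=50)
```

Each chunk row then gets its own embedding, and search results can be grouped back by parent_id.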
Hybrid ranking
You may want to combine semantic similarity + recency or popularity. For example:
SELECT doc_id, title,
       vector_distance(doc_vector, :qv, COSINE) * 0.7
       + ((SYSDATE - created_date) / 365) * 0.3 AS score
FROM kb_documents
WHERE dept = 'Legal'
ORDER BY score
FETCH FIRST 5 ROWS ONLY
Here 70% of the weight goes to semantic distance and 30% to document age (this assumes a created_date DATE column on the table; SYSDATE - created_date yields the age in days). Because results are ordered by ascending score, older documents are penalised — flip the sign on the age term if you want to favour them instead. Adjust weights to match your business logic.
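The same blending can also be done client-side, which makes it easier to normalise distance and age onto the same 0–1 scale before weighting. A sketch over hypothetical query results (not live data):

```python
def hybrid_rank(rows, w_sim=0.7, w_age=0.3, max_age_days=365.0):
    """rows: list of (doc_id, cosine_distance, age_days). Lower blended score = better."""
    def score(row):
        _, dist, age = row
        # cosine distance lives in [0, 2]; halve it onto [0, 1], cap age at 1.0
        return w_sim * (dist / 2.0) + w_age * min(age / max_age_days, 1.0)
    return sorted(rows, key=score)

rows = [
    (1, 0.30, 400.0),  # close match, old
    (2, 0.35, 10.0),   # slightly worse match, fresh
    (3, 0.90, 5.0),    # poor match, very fresh
]
ranked = hybrid_rank(rows)
print([r[0] for r in ranked])
```

With these weights the fresh near-match (doc 2) outranks the old close match (doc 1), showing how strongly the recency term can reshuffle results.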
Scaling
- With millions of vectors, approximate nearest-neighbour (ANN) indexing is crucial; tune index parameters such as ef_search (HNSW) or nlist-style partition counts (IVF).
- Monitor the latency of vector_distance queries, as well as index size and maintenance cost.
- Consider sharding or partitioning the embedding table (by dept, date) if usage grows.
Maintenance
- When you retrain or change model version: re-compute embeddings, drop and rebuild indexes.
- Monitor performance drift: track metrics like top-K retrieval relevance, query latency, user feedback.
- Maintain metadata hygiene: e.g., ensure each row has a valid dept, tag, creation date.
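A model upgrade does not have to be a single big-bang rebuild; you can stage it by selecting only stale rows. A sketch of the selection logic, assuming the model_version column suggested earlier (rows here are hypothetical (doc_id, model_version) pairs):

```python
CURRENT_MODEL = "all-MiniLM-L12-v2@2"  # hypothetical version tag

def needs_reembedding(rows, current=CURRENT_MODEL):
    """Return ids of rows embedded with anything other than the current model."""
    return [doc_id for doc_id, version in rows if version != current]

rows = [
    (1, "all-MiniLM-L12-v2@1"),  # stale
    (2, CURRENT_MODEL),          # up to date
    (3, None),                   # never tagged -- treat as stale
]
stale = needs_reembedding(rows)
```

Re-embed the stale batch, update model_version, and rebuild the vector index once the whole table is on the new model.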
Regards
Osama