Building Generative AI Applications with Vector Databases on AWS

A few months ago, I was helping a team that had just integrated an LLM into their product. The use case was straightforward: users ask questions, the LLM answers. They had it running. The demos looked great. Then they went to production.

The model kept confidently making things up. It had no idea about the company’s internal documentation, the latest product specs, or anything that happened after its training cutoff. The team was frustrated. They had the right model, the right infrastructure, but the wrong architecture.

The fix was not fine-tuning. Fine-tuning is expensive, slow, and you have to redo it every time your data changes. The fix was Retrieval Augmented Generation, or RAG. And at the heart of RAG is something called a vector database.

In this article, I will walk you through building a production-grade RAG architecture on AWS. We will cover what vector databases actually are, when to use Aurora pgvector versus OpenSearch versus Amazon Bedrock Knowledge Bases, and how to wire everything together with real code.

What Is a Vector Database and Why Does It Matter

Before writing any infrastructure code, let me explain what problem we are actually solving.

When you work with text, images, or audio in AI systems, the raw data is not what gets compared. Instead, you pass the data through an embedding model, which converts it into a list of numbers called a vector. That vector captures the semantic meaning of the content.

Two sentences that mean the same thing will have vectors that are close to each other in vector space, even if they use completely different words. “The server is down” and “the system is not responding” will be closer to each other than “the server is down” and “I had pasta for lunch.”

A vector database is optimized for one specific operation: given a query vector, find me the N closest vectors in the collection. This is called approximate nearest neighbor search, and it is fundamentally different from SQL WHERE clauses or text search.

In a RAG architecture, the flow looks like this:

  1. You chunk your documents and generate embeddings for each chunk
  2. You store those embeddings in a vector database
  3. When a user asks a question, you generate an embedding for the question
  4. You query the vector database to retrieve the most semantically similar chunks
  5. You pass the question plus those chunks to your LLM as context
  6. The LLM answers based on actual, grounded information

The result is a model that knows your data, stays current as your data changes, and is far less likely to hallucinate, because the relevant facts are right there in the prompt.
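The retrieval step at the heart of this flow can be sketched with a toy in-memory example. Everything here is illustrative: the hard-coded vectors stand in for real embedding model output, and a real system would use a vector database rather than a Python list.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, chunks, k=2):
    """Steps 3-4 of the flow: rank stored chunks by similarity to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in chunks]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Steps 1-2: pretend these embeddings came from a real model and a real store
chunks = [
    ("The server is down",           [0.9, 0.1, 0.0]),
    ("The system is not responding", [0.8, 0.2, 0.1]),
    ("I had pasta for lunch",        [0.0, 0.1, 0.9]),
]

# Step 3: embed the question (hard-coded here), then retrieve the top chunks
context = retrieve([0.85, 0.15, 0.05], chunks, k=2)
# Step 5 would prepend these chunks to the LLM prompt as grounding context
```

Note how the two sentences about the outage rank together despite sharing almost no words; that is the property a vector database exploits at scale.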

Options on AWS

AWS gives you three serious paths for vector storage, and choosing the wrong one will cost you performance and money.

Amazon Aurora PostgreSQL with pgvector

pgvector is an open source PostgreSQL extension that adds native vector storage and similarity search. If you already run Aurora PostgreSQL, this is often the right starting point.

The extension supports three distance metrics: L2 (Euclidean), inner product, and cosine similarity. For most text embedding use cases, cosine similarity is what you want.

Here is a minimal setup to get you started:

-- Enable the extension on your Aurora instance
CREATE EXTENSION vector;

-- Create a table for your document chunks
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    source_url  TEXT,
    embedding   vector(1536),  -- must match your embedding model (1536 for text-embedding-3-small)
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor search
-- lists = sqrt(number of rows) is a good starting point
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Retrieve the 5 chunks most similar to a query embedding ($1)
SELECT
    chunk_text,
    source_url,
    1 - (embedding <=> $1::vector) AS similarity_score
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;

The <=> operator computes cosine distance. One minus that gives you similarity.

For production, tune the ivfflat.probes parameter at query time. Higher probes means more accuracy but slower queries. For most use cases, a value between 10 and 20 is a reasonable balance.
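For example, you can scope the setting to a single transaction with SET LOCAL so it does not leak into other sessions (10 here is a starting point to tune, not a recommendation):

```sql
BEGIN;
-- Scan 10 of the 100 lists for this query only
SET LOCAL ivfflat.probes = 10;
SELECT chunk_text
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;
COMMIT;
```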

Aurora pgvector is the right choice when your team already knows PostgreSQL, you want to join vector search results with relational data in the same query, or you have an existing Aurora cluster and want to avoid managing another service.

The limitation is scale. Once you push past 10 to 20 million vectors, or you need sub-10ms latency at high concurrency, you will start to feel the ceiling.

Amazon OpenSearch Service with Vector Engine

OpenSearch’s vector engine is built for scale. It uses the HNSW (Hierarchical Navigable Small World) algorithm, which delivers excellent recall and latency even at hundreds of millions of vectors.

Setting up an index for vector search:

PUT /documents
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 512
    }
  },
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "chunk_text": { "type": "text" },
      "source_url": { "type": "keyword" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      }
    }
  }
}

The ef_construction and m parameters control the index build quality. Higher values give better recall but increase memory usage and indexing time. For most production workloads, m=16 and ef_construction=512 is a solid baseline.

Indexing a document:

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = "us-east-1"
service = "es"

# Sign requests with the credentials from the current session
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, service, session_token=credentials.token)

client = OpenSearch(
    hosts=[{"host": your_opensearch_endpoint, "port": 443}],  # your domain endpoint
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# generate_embedding() is your embedding call (e.g. Bedrock); it must return
# a list of floats with the same dimension as the index mapping (1536 here)
document = {
    "doc_id": "product-manual-v3-page-42",
    "chunk_text": "The power button is located on the right side of the device...",
    "source_url": "s3://your-bucket/manuals/product-v3.pdf",
    "embedding": generate_embedding("The power button is located..."),
}
client.index(index="documents", body=document)

Querying for semantic similarity:

query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": generate_embedding(user_question),
                "k": 5
            }
        }
    },
    "_source": ["chunk_text", "source_url"]
}
response = client.search(index="documents", body=query)

OpenSearch also lets you combine vector search with traditional filters, which is something pgvector struggles with at scale:

hybrid_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": generate_embedding(user_question),
                            "k": 20
                        }
                    }
                }
            ],
            "filter": [
                { "term": { "product_line": "enterprise" } },
                { "range": { "doc_date": { "gte": "2024-01-01" } } }
            ]
        }
    }
}

Retrieving a wider candidate set via vector search (k=20 here) and then narrowing it with metadata filters is called post-filtering, and it is critical when your knowledge base spans multiple products, teams, or access tiers.

Amazon Bedrock Knowledge Bases

If you want the fastest path to production and do not want to manage chunking, embedding, or indexing yourself, Bedrock Knowledge Bases handles all of it.

You point it at an S3 bucket. It crawls your documents, chunks them, generates embeddings using your chosen model, and stores them in an OpenSearch Serverless collection. When you query it, it handles the retrieval and optionally the generation too.

resource "aws_bedrockagent_knowledge_base" "product_docs" {
  name     = "product-documentation-kb"
  role_arn = aws_iam_role.bedrock_kb_role.arn

  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }

  storage_configuration {
    type = "OPENSEARCH_SERVERLESS"
    opensearch_serverless_configuration {
      collection_arn    = aws_opensearchserverless_collection.kb_vectors.arn
      vector_index_name = "bedrock-knowledge-base-default-index"
      field_mapping {
        vector_field   = "bedrock-knowledge-base-default-vector"
        text_field     = "AMAZON_BEDROCK_TEXT_CHUNK"
        metadata_field = "AMAZON_BEDROCK_METADATA"
      }
    }
  }
}

resource "aws_bedrockagent_data_source" "s3_docs" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.product_docs.id
  name              = "s3-product-documentation"

  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = aws_s3_bucket.documentation.arn
    }
  }

  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "SEMANTIC"
      semantic_chunking_configuration {
        max_token                       = 300
        buffer_size                     = 0
        breakpoint_percentile_threshold = 95
      }
    }
  }
}

Querying it from your application:

import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent.retrieve_and_generate(
    input={
        "text": user_question
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"
                }
            }
        }
    }
)

answer = response["output"]["text"]
citations = response["citations"]

The HYBRID search type combines vector similarity with keyword search under the hood, which improves recall for queries that contain specific product names, version numbers, or technical terms that embeddings alone sometimes miss.

Chunking Strategy: The Part Everyone Gets Wrong

The quality of your RAG system depends more on how you chunk your documents than on which vector database you choose. I have seen teams spend weeks optimizing their similarity search while their chunking strategy was destroying recall.

A few rules that hold up in practice:

Chunk size matters. Too small and you lose context. Too large and you dilute the semantic signal. For most document types, 300 to 500 tokens with a 50-token overlap between chunks is a reasonable starting point. The overlap ensures that sentences that fall on chunk boundaries are still retrievable.
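That sliding window is easy to sketch. The chunker below splits on whitespace words as a stand-in for model tokens; a production pipeline should count real tokens with a tokenizer that matches your embedding model. The numbers mirror the ones above.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into word-based chunks with a fixed overlap.

    Words stand in for model tokens here; swap in a real tokenizer
    for production use.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the document
    return chunks

# A 1000-word document yields three chunks; each chunk after the first
# repeats the last 50 words of the previous one
doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap=50)
```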

Chunk by structure when you can. If your documents have headers, sections, or natural breaks, use those as chunk boundaries rather than fixed token counts. A section about “Troubleshooting Network Errors” should stay together rather than getting split at 400 tokens.

Store metadata with every chunk. The chunk text alone is not enough. You need the source document, the section title, the creation date, the product version. This metadata enables the filtering patterns we covered in OpenSearch and prevents your model from citing a three-year-old document when a current one exists.

Test with real queries. The only way to validate your chunking strategy is to run the queries your users will actually ask and check whether the right chunks are being retrieved. Build a small evaluation set early, before you optimize anything else.

Embedding Model Selection

For AWS workloads, you have two main options through Bedrock:

Amazon Titan Text Embeddings V2 produces 1024-dimensional vectors. It is fast, cheap, and fine for general English text. If you are building an internal knowledge base over English documents, this is the right default.

Cohere Embed v3 supports multilingual embeddings and produces 1024-dimensional vectors with better performance on technical and domain-specific text. If your documents cover specialized subject matter (legal, medical, engineering), Cohere will typically outperform Titan on retrieval quality.

A critical point that is easy to overlook: you must use the same embedding model at indexing time and query time. If you indexed your documents with Titan and query with Cohere, the vectors live in different spaces and your similarity scores will be meaningless. Build this constraint into your infrastructure from day one.
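One cheap way to enforce that constraint is to record the model identifier alongside the index at ingestion time and check it before every query. The metadata record below is a hypothetical shape; the point is that the check lives in code rather than in a runbook.

```python
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"  # fixed at indexing time

def check_embedding_model(index_metadata: dict) -> None:
    """Refuse to query an index built with a different embedding model."""
    indexed_with = index_metadata.get("embedding_model_id")
    if indexed_with != EMBEDDING_MODEL_ID:
        raise RuntimeError(
            f"Index was built with {indexed_with!r} but queries use "
            f"{EMBEDDING_MODEL_ID!r}; the vectors are not comparable."
        )

# Stored next to the index when documents were ingested (hypothetical record)
index_metadata = {"embedding_model_id": "amazon.titan-embed-text-v2:0"}
check_embedding_model(index_metadata)  # passes; a mismatch would raise
```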

Architecture Summary

For a production RAG system on AWS, here is the architecture that has worked well for teams I have worked with.

Document ingestion: an S3 bucket triggers a Lambda function, or Step Functions for large files. The function chunks the document, generates embeddings via Bedrock, and writes to your vector store with metadata.

Vector storage: Aurora pgvector for under 5 million vectors with heavy relational joins. OpenSearch for everything larger, or when you need metadata filtering at scale. Bedrock Knowledge Bases when you want fully managed infrastructure and your team does not want to own the pipeline.

Query path: API Gateway triggers a Lambda function that embeds the user query, retrieves top-k chunks from the vector store, builds a context-enriched prompt, and calls Claude or another Bedrock model for the final response.
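The prompt-building step in that path is simple enough to show in full. build_prompt is a hypothetical helper; the embedding, retrieval, and Bedrock calls around it are elided.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a context-enriched prompt from retrieved chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "Where is the power button?",
    ["The power button is located on the right side of the device."],
)
# The numbered [1], [2], ... markers make it easy for the model to cite sources
```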

Observability: CloudWatch captures embedding latency, retrieval similarity scores, and end-to-end response time. Set alerts if retrieval quality drops since that is usually a signal that something changed in your document pipeline.
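One concrete signal to alert on is the average top-1 similarity over a sliding window of recent queries; a sustained drop usually means ingestion broke or the corpus drifted. The threshold below is an assumption to tune per corpus, and the CloudWatch emit is left as a comment.

```python
from collections import deque

class RetrievalQualityMonitor:
    """Track top-1 similarity over recent queries and flag sustained drops."""

    def __init__(self, window: int = 100, threshold: float = 0.7):
        self.scores = deque(maxlen=window)
        self.threshold = threshold  # tune per corpus; 0.7 is an assumption

    def record(self, top1_similarity: float) -> None:
        self.scores.append(top1_similarity)
        # In production, also emit the raw value, e.g.:
        # cloudwatch.put_metric_data(Namespace="rag", MetricData=[...])

    def degraded(self) -> bool:
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = RetrievalQualityMonitor(window=3, threshold=0.7)
for score in (0.9, 0.85, 0.4):
    monitor.record(score)
# One bad query does not trip the alarm; a run of them will
```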

Regards
Osama