Large Language Models (LLMs) are powerful, but they come with an important limitation: they only know what was included during training. They cannot natively reason over private documents, internal knowledge bases, or freshly updated content.
This is where Retrieval-Augmented Generation (RAG) becomes essential.
This post walks through a minimal, production-ready RAG example using LangChain, OpenAI, and Pinecone, with a focus on clarity and correctness.
What This Project Demonstrates
This demo shows how to:
- Load and chunk documents
- Generate vector embeddings using OpenAI
- Store embeddings in Pinecone
- Retrieve relevant context at query time
- Generate answers strictly grounded in retrieved data
This approach dramatically reduces hallucinations and makes LLM responses explainable and auditable.
High-Level Architecture
The system follows a clean two-phase flow:
1️⃣ Indexing Phase (Offline / One-Time)
- Load a text document
- Split it into chunks
- Generate embeddings
- Store vectors in Pinecone
2️⃣ Query Phase (Runtime)
- Convert a user question into an embedding
- Retrieve the most relevant chunks
- Inject retrieved context into the prompt
- Generate an answer using an LLM
Project Structure
langchain-rag-demo/
├── load_file_to_vector_db.py # Indexing step
├── main.py # Query + generation
├── example_pinecone.txt # Sample document
├── pyproject.toml # Dependencies
├── .env # API keys (local only)
└── README.md

Step 1: Loading Data into the Vector Database
The indexing script performs four key steps:
- Load the document
- Split text into chunks
- Create embeddings
- Store vectors in Pinecone
Key design choices:
- Fixed chunk size for predictable retrieval
- Zero overlap for simplicity
- Environment-based configuration for security
Once this script runs, the document becomes searchable via semantic similarity.
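For reference, a minimal sketch of the indexing script is shown below. It assumes the langchain-community, langchain-text-splitters, langchain-openai, langchain-pinecone, and python-dotenv packages, an already-created Pinecone index, and an illustrative chunk size; the actual script in the repository may differ in details.

```python
# load_file_to_vector_db.py -- minimal indexing sketch (illustrative, not the exact repo code).
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import CharacterTextSplitter

load_dotenv()  # pulls OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME from .env

# 1. Load the document
docs = TextLoader("example_pinecone.txt").load()

# 2. Split into fixed-size chunks with zero overlap (sizes here are illustrative)
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = splitter.split_documents(docs)

# 3 + 4. Embed each chunk and store the vectors in the Pinecone index
PineconeVectorStore.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(),
    index_name=os.environ["PINECONE_INDEX_NAME"],
)
```

Because chunks have a fixed size and zero overlap, every stored vector maps to exactly one slice of the source document, which keeps retrieval results easy to trace back.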
Step 2: Querying with RAG
At runtime, the application:
- Retrieves the top-K most relevant chunks
- Injects them into a strict prompt template
- Forces the model to answer only from retrieved context
- Returns "I don't know" if the answer is missing
This constraint is critical for enterprise and compliance-sensitive workloads.
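A corresponding query-time sketch, under the same assumptions as above (the question, top-K value, and prompt wiring are illustrative):

```python
# main.py -- minimal query + generation sketch (illustrative, not the exact repo code).
import os

from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()

# Reconnect to the index built in Step 1 and expose it as a retriever
vector_store = PineconeVectorStore(
    index_name=os.environ["PINECONE_INDEX_NAME"],
    embedding=OpenAIEmbeddings(),
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})  # top-K chunks

# Strict prompt: answer only from retrieved context (see the guardrail section below)
prompt = ChatPromptTemplate.from_template(
    "Use ONLY the following context to answer the question.\n"
    'If the answer is not in the context, say "I don\'t know".\n\n'
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model=os.environ["OPENAI_MODEL"])

question = "What is Pinecone used for in this demo?"  # example question
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
answer = llm.invoke(prompt.format_messages(context=context, question=question))
print(answer.content)
```

If the retrieved chunks do not contain the answer, the prompt instructs the model to reply "I don't know" instead of guessing, which is exactly the behavior the next section focuses on.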
Prompt Guardrails (Why They Matter)
Use ONLY the following context to answer the question.
If the answer is not in the context, say "I don't know".
This single rule:
- Prevents hallucinations
- Makes failures explicit
- Improves trustworthiness of responses
⚙️ Configuration via Environment Variables
All sensitive configuration is externalized:
OPENAI_API_KEY=...
OPENAI_MODEL=...
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=...
This keeps the codebase:
- Portable
- CI/CD-friendly
- Safe for public repositories
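Reading these values at startup takes only a couple of lines with python-dotenv (assumed here; any secrets manager or plain environment variables work the same way):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env locally; in CI/CD the variables are injected directly

openai_model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # illustrative default
index_name = os.environ["PINECONE_INDEX_NAME"]           # fail fast if missing
```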
Why This Minimal Example Matters
Despite its simplicity, this project already matches real-world RAG patterns used in:
- Internal documentation assistants
- Customer support bots
- Knowledge base search
- Developer tooling
- Compliance-aware AI systems
It provides a clean foundation that can later be extended with:
- Metadata filtering
- Hybrid search
- Reranking
- Streaming responses
- Tool calling
Final Thoughts
RAG is not an advanced optimization—it is a baseline requirement for reliable LLM systems.
This demo intentionally avoids unnecessary abstractions to make the core ideas clear and reusable across projects.
Source code available on GitHub: langchain-rag-demo
