Build Your Own
AI Memory System
Learn to build RAG — the technology that lets AI answer questions about your documents. Free tools, no API keys, runs on your laptop.
What is RAG?
Imagine you hired a brilliant researcher. Every time someone asks them a question, they run to the library, pull out only the relevant pages, read them, and give you an answer based on the actual text — not from memory. That is RAG. Retrieval-Augmented Generation.
Why it matters
Your Private Knowledge
AI models are trained on public internet data. RAG lets them answer questions about your private documents — contracts, reports, manuals — that they have never seen before.
Stops Hallucination
Instead of guessing from memory, the AI reads the actual text before answering. If the answer is not in the documents, it says so.
Always Up to Date
AI training is frozen in time. With RAG, you just add new documents — no retraining needed. Yesterday's report can be searched today.
Key Terms
| Term | Plain English | One-line definition |
|---|---|---|
| RAG | Look-it-up before you answer | Retrieval-Augmented Generation |
| LLM | A text-predicting brain | Large Language Model (Llama, GPT, Claude etc.) |
| Embedding | GPS coordinates for meaning | A list of numbers representing text meaning |
| Vector DB | A library organised by meaning | Database optimised for similarity search |
| Chunk | A sticky note torn from your document | A small segment of text (200-600 chars) |
How RAG Works
Phase 1 (Index) — happens once: You take all your documents, slice them into pieces, convert each piece into GPS coordinates (embeddings), and store them in a special database. Like building a well-organised filing system.
Phase 2 (Query) — happens every search: A user asks a question. You convert their question into GPS coordinates, find the text pieces closest to those coordinates, and hand them to the AI to form an answer.
Free Setup — No API Keys
Forget OpenAI. Forget credit cards. This course uses tools that are completely free and run entirely on your own computer. Ollama is like having a private ChatGPT that never phones home. sentence-transformers runs locally too. Chroma is a database that lives in a folder on your laptop.
Install everything in one step
Load Your Documents
Before AI can read your documents, we need to extract the raw text from them — like photocopying pages from a book and turning them into a typed transcript. PDFs store text in a special format that Python cannot read directly, so we use a free library called PyMuPDF to pull the words out, page by page.
What file types can you load?
| File type | Library | Free? | Best for |
|---|---|---|---|
| pymupdf (fitz) | Yes | Research papers, reports, contracts | |
| Word .docx | python-docx | Yes | Office documents, memos |
| Plain .txt | Built-in Python | Yes | Logs, notes, simple text |
| Webpage | requests + BeautifulSoup | Yes | Articles, documentation sites |
| Markdown | Built-in Python | Yes | GitHub readmes, wikis |
Chunking — Slicing Your Text
You cannot hand 200 pages to an AI and say "answer this question" — models have a limited reading window. So we slice documents into small pieces, like cutting a long letter into sticky notes. When someone asks a question, we only hand them the 5 most relevant sticky notes, not the whole letter.
Try It: Chunking Visualiser
Adjust the sliders to see how each strategy slices the same text differently. No code running here — this is a real-time simulation.
Embeddings — GPS for Meaning
An embedding turns words into GPS coordinates. Every piece of text becomes a list of about 400 numbers that represent what it means. Words with similar meanings get similar coordinates. "Dog" and "puppy" land close together. "Dog" and "rocket" land far apart. This lets us search by meaning, not just by matching exact words.
Try It: Word Similarity Demo
Type two words or phrases and see their similarity score. This simulates what an embedding model computes — words with shared meaning score close to 1.0.
Vector Database — Your AI Filing Cabinet
A normal database organises files alphabetically or by date. A vector database organises them by meaning. When you search it, you say "find me everything that means something similar to this question" — and it returns the closest matches in milliseconds across millions of documents, using a clever algorithm called HNSW.
HNSW: How it finds things fast
HNSW is like a map with three zoom levels. At the highest zoom (layer 2) you see only major cities — navigate quickly to the right region. Zoom to layer 1 for neighbourhoods. Zoom to layer 0 for every house. Instead of checking all houses from the start, you jump: city, neighbourhood, house. That is why it is fast even with millions of vectors.
Querying — Turning Questions Into Vectors
This is the moment a user types a question and your system springs into action. Everything you built in modules 4-7 now runs in reverse: a question comes in, gets embedded, and is matched against your stored vectors.
Think of it like a library catalogue. You describe what you're looking for, the librarian converts your description into a search code, and the system finds the books whose codes are most similar. It never reads every book — it just compares codes. That's vector search.
Understanding Similarity Scores
The scores you see (0.93, 0.89, 0.81) are cosine similarity values. A score of 1.0 means identical, 0.0 means completely unrelated. In practice, scores above 0.75 are relevant, below 0.5 is noise.
| Score Range | Meaning | What to do |
|---|---|---|
| 0.90 – 1.00 | Nearly identical content | Always include in context |
| 0.75 – 0.89 | Highly relevant | Include in context |
| 0.55 – 0.74 | Somewhat related | Include only if k is small |
| Below 0.55 | Likely noise | Filter out with threshold |
Generation — Getting Answers from a Free Local LLM
Retrieval finds the right paragraphs. Generation turns those paragraphs into a fluent, accurate answer. We use Ollama running Llama 3.2 locally — completely free, no API keys, no usage costs.
Think of the retrieved chunks as sticky notes you hand to a very smart friend (the LLM). You say: "I found these relevant passages from the document. Now, using only these notes, please answer my question." The friend doesn't guess — it sticks to the notes. That's what makes RAG trustworthy.
Installing Ollama (takes 5 minutes)
Download Ollama
Go to ollama.com and download the installer for Mac, Windows, or Linux. It runs as a background service.
Pull the model
Run ollama pull llama3.2 in your terminal. It downloads about 2GB. One-time setup.
Test it works
Run ollama run llama3.2. You should see a chat prompt. Type anything. Press Ctrl+D to exit.
Use it from Python
Install the OpenAI SDK: pip install openai. Point base_url at localhost. No API key needed.
!curl -fsSL https://ollama.com/install.sh | sh in a cell, then !ollama pull llama3.2 &.Build the API — Wrapping RAG in a FastAPI Endpoint
You have all the pieces. Now package them into a real HTTP API so any app, website, or chatbot can talk to your RAG system with a single request.
An API is like a waiter at a restaurant. Your app (the customer) tells the waiter what it wants. The waiter goes to the kitchen (your RAG pipeline), gets the result, and brings it back. The app never has to know how the kitchen works — it just sends a request and gets an answer.
The Request and Response Shape
Your API accepts a JSON body and returns a structured JSON response. This is what callers send and what they get back:
Full Pipeline — Connecting Every Piece End-to-End
This is the complete picture. Every module you have worked through slots into one cohesive pipeline. Two phases, run in sequence: build the index once, then query it as many times as you like.
Phase 1 is like building a library: you collect books, cut them into chapters, label each chapter, and file everything. You do this once. Phase 2 is like consulting that library: you ask a question, the system finds the most relevant chapters, and a smart assistant reads them and gives you a direct answer. The library stays the same unless you add new books.
The complete pipeline.py file
This single file ties together every module — load, chunk, embed, store, query, and generate. Run it and you have a working RAG system.
pip install chromadb sentence-transformers openai pymupdf fastapi uvicorn and Ollama running with ollama pull llama3.2. Total install time: about 3 minutes.Test and Improve — Making Your RAG System Actually Good
Getting a RAG pipeline running is step one. Getting it to give good answers consistently is the real work. This module shows you the most common failure modes and how to fix each one.
Building a RAG system is like hiring a researcher. A bad researcher misreads your question, finds the wrong files, or makes things up. A good one understands exactly what you asked, finds the most relevant sections, and cites their sources. This module turns your researcher from "pretty good" to "reliable enough to trust."
The 5 Most Common RAG Failures
Wrong chunks retrieved
The most common problem. Fix by improving your chunking strategy or using a better embedding model. Semantic chunking often helps here.
LLM hallucination
The model invents facts not in the chunks. Fix by lowering temperature (try 0.0), making the system prompt stricter, and filtering low-score chunks.
Chunks too big or too small
400-word chunks are a starting point. If your docs are dense, go smaller (200-300 words). If they are narrative, go larger (600 words). Always test both.
Duplicate chunks in results
Add MMR (Maximal Marginal Relevance) — it penalizes redundant results. Chroma does not support MMR natively; implement it with a simple re-ranking loop.
Context window overflow
Sending too many chunks exceeds the model's context limit. Cap at 3-5 chunks for 4K-context models. Use larger models (llama3.1:8b) for longer contexts.
Multi-document confusion
When querying many docs, use metadata filters. Always include the source document name in the chunk metadata and surface it in the answer.
A Simple Eval Loop
The fastest way to improve is to build a small test set: 20-30 questions with known correct answers. Then measure retrieval recall (did the right chunk appear in the results?) and answer faithfulness (is every claim in the answer traceable to a retrieved chunk?).
| Metric | What it measures | Good score |
|---|---|---|
| Retrieval Recall@k | Did the correct chunk appear in top-k? | > 0.80 |
| Answer Faithfulness | No invented facts | > 0.90 |
| Answer Relevance | Does the answer address the question? | > 0.85 |
| Latency (p95) | End-to-end response time | < 4 s local |