Build Your Own
AI Memory System

Learn to build RAG — the technology that lets AI answer questions about your documents. Free tools, no API keys, runs on your laptop.

🆓
100% Free Stack: This course uses Ollama (free, local LLM), sentence-transformers (free embeddings), and Chroma (free vector database). No API keys. No credit cards. Everything runs on your own machine.
PHASE 1 — INDEX (run once) PHASE 2 — QUERY (every request) 📄 Your Documents PDF, Word, TXT, HTML any text-based file ✂️ Chunk Slice into small pieces 200-600 chars each 🔢 Embed sentence-transformers text to numbers 🗄️ Chroma Vector Database stores vectors Stored with Metadata doc: "report.pdf" page: 12 chunk: "text here..." Index complete Repeat only when documents change 💬 User Question "What does the contract say about X?" 🔢 Embed Query same model as index phase 🔍 Search Chroma HNSW finds top-5 nearest vectors by cosine similarity vector similarity search bridges both phases Top-5 Retrieved Chunks [0.94] "Section 12: Cancellation requires 30 days written notice..." [0.87] "Either party may terminate the agreement by providing..." [0.81] "Notice period for service termination is defined as..." 🤖 Ollama (Llama 3.2) Grounded Answer "Based on Section 12, cancellation needs 30 days..."
What is RAG
How it works
Free setup
Load docs
Chunking
Embeddings
Vector DB
Querying
Generation
Build API
Full pipeline
Testing

What is RAG?

Imagine you hired a brilliant researcher. Every time someone asks them a question, they run to the library, pull out only the relevant pages, read them, and give you an answer based on the actual text — not from memory. That is RAG. Retrieval-Augmented Generation.

WITHOUT RAG 🧠 Only training data frozen in 2024 "What does our Q3 report say?" AI guesses or makes things up (hallucination) WITH RAG YOUR DOCS retrieve 🧠 + real facts Accurate from your own docs "What does our Q3 report say?" AI answers from actual document text, with source citation

Why it matters

📚

Your Private Knowledge

AI models are trained on public internet data. RAG lets them answer questions about your private documents — contracts, reports, manuals — that they have never seen before.

🎯

Stops Hallucination

Instead of guessing from memory, the AI reads the actual text before answering. If the answer is not in the documents, it says so.

🔄

Always Up to Date

AI training is frozen in time. With RAG, you just add new documents — no retraining needed. Yesterday's report can be searched today.

Key Terms

TermPlain EnglishOne-line definition
RAGLook-it-up before you answerRetrieval-Augmented Generation
LLMA text-predicting brainLarge Language Model (Llama, GPT, Claude etc.)
EmbeddingGPS coordinates for meaningA list of numbers representing text meaning
Vector DBA library organised by meaningDatabase optimised for similarity search
ChunkA sticky note torn from your documentA small segment of text (200-600 chars)
Quick Check
Why can a regular AI not "know" what is in your company's private documents?

How RAG Works

Phase 1 (Index) — happens once: You take all your documents, slice them into pieces, convert each piece into GPS coordinates (embeddings), and store them in a special database. Like building a well-organised filing system.

Phase 2 (Query) — happens every search: A user asks a question. You convert their question into GPS coordinates, find the text pieces closest to those coordinates, and hand them to the AI to form an answer.

PHASE 1 — INDEX (one time setup) PHASE 2 — QUERY (runs on every question) 📄LoadPDF to text ✂️ChunkSlice text 🔢EmbedTo numbers 🗄️StoreChroma 💬QuestionUser query 🔢EmbedQuery to nums 🔍RetrieveTop-k chunks 🧠GenerateLLM answers What happens in plain English Read all docs, slice, convert to GPS coords, file away. Done. Only redo this when documents change. What happens in plain English Convert question to GPS coords. Find closest text pieces. Hand them to the AI. Get a grounded answer in milliseconds. Timing: slow (minutes for large docs) But only runs once per document. Worth the wait. Timing: fast (under 500ms per question) This is the path every user experiences, every time.
Quick Check
You add 500 new pages to your document library. What do you need to redo?

Free Setup — No API Keys

Forget OpenAI. Forget credit cards. This course uses tools that are completely free and run entirely on your own computer. Ollama is like having a private ChatGPT that never phones home. sentence-transformers runs locally too. Chroma is a database that lives in a folder on your laptop.

🆓
The Free Stack: Ollama (local LLM) + sentence-transformers (local embeddings) + Chroma (local vector DB). Total cost: $0. Runs offline. No accounts needed.
PAID STACK FREE STACK (this course) Embeddings OpenAI text-embedding-3-small $0.02 / 1M tokens Embeddings sentence-transformers (runs locally) FREE LLM (Generation) GPT-4o or Claude Sonnet $5-15 / 1M tokens LLM (Generation) Ollama with Llama 3.2 (runs locally) FREE Vector Database Pinecone or Weaviate Cloud $70+ / month Vector Database Chroma (local folder, no server needed) FREE

Install everything in one step

setup.sh
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.
💻
Minimum specs: 8GB RAM works for Llama 3.2 (3B model). 16GB RAM is comfortable. For embeddings, even a 5-year-old laptop handles sentence-transformers without issues.

Load Your Documents

Before AI can read your documents, we need to extract the raw text from them — like photocopying pages from a book and turning them into a typed transcript. PDFs store text in a special format that Python cannot read directly, so we use a free library called PyMuPDF to pull the words out, page by page.

📄 your_paper.pdf 50 pages complex layout tables, headers fitz Extracted Pages (Python list of dicts) {"page": 1, "text": "Transformer models have revolutionized NLP..."} [12 more paragraphs...] {"page": 2, "text": "The attention mechanism allows the model to weigh..."} [9 more paragraphs...] You get a clean Python list with page number + text for each page. 50 pages = 50 dict objects. Ready to chunk in the next step.

What file types can you load?

File typeLibraryFree?Best for
PDFpymupdf (fitz)YesResearch papers, reports, contracts
Word .docxpython-docxYesOffice documents, memos
Plain .txtBuilt-in PythonYesLogs, notes, simple text
Webpagerequests + BeautifulSoupYesArticles, documentation sites
MarkdownBuilt-in PythonYesGitHub readmes, wikis
load_documents.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Chunking — Slicing Your Text

You cannot hand 200 pages to an AI and say "answer this question" — models have a limited reading window. So we slice documents into small pieces, like cutting a long letter into sticky notes. When someone asks a question, we only hand them the 5 most relevant sticky notes, not the whole letter.

Too Small under 80 characters Problem: no context "is a technique" means nothing without context Just Right 200 to 600 characters Full sentences, clear topic, strong signal for retrieval Too Large over 2000 characters Covers 5 different topics. Embedding is an average of everything — weak match. Problem: diluted meaning

Try It: Chunking Visualiser

Adjust the sliders to see how each strategy slices the same text differently. No code running here — this is a real-time simulation.

120
20
chunking.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Embeddings — GPS for Meaning

An embedding turns words into GPS coordinates. Every piece of text becomes a list of about 400 numbers that represent what it means. Words with similar meanings get similar coordinates. "Dog" and "puppy" land close together. "Dog" and "rocket" land far apart. This lets us search by meaning, not just by matching exact words.

EMBEDDING SPACE (simplified to 2D) meaning dimension 2 dog puppy kitten cat pet ANIMALS car truck bus van vehicle VEHICLES atom quantum physics SCIENCE far = different meaning close = similar your query nearest chunks returned

Try It: Word Similarity Demo

Type two words or phrases and see their similarity score. This simulates what an embedding model computes — words with shared meaning score close to 1.0.

Cosine Similarity Score
0.91
Very similar — these would be close in embedding space
embeddings.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Vector Database — Your AI Filing Cabinet

A normal database organises files alphabetically or by date. A vector database organises them by meaning. When you search it, you say "find me everything that means something similar to this question" — and it returns the closest matches in milliseconds across millions of documents, using a clever algorithm called HNSW.

Normal Database organised alphabetically Row 1: id=1, text="Apple tree grows..." Row 2: id=2, text="Banana splits are..." Row 3: id=3, text="Canine behaviour..." Row 4: id=4, text="Dog training tips..." Query "puppy care" must scan every row VS Vector Database (Chroma) organised by meaning in n-dimensional space dog puppy canine car truck query Query jumps directly to nearest cluster

HNSW: How it finds things fast

HNSW is like a map with three zoom levels. At the highest zoom (layer 2) you see only major cities — navigate quickly to the right region. Zoom to layer 1 for neighbourhoods. Zoom to layer 0 for every house. Instead of checking all houses from the start, you jump: city, neighbourhood, house. That is why it is fast even with millions of vectors.

vector_store.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Querying — Turning Questions Into Vectors

This is the moment a user types a question and your system springs into action. Everything you built in modules 4-7 now runs in reverse: a question comes in, gets embedded, and is matched against your stored vectors.

Think of it like a library catalogue. You describe what you're looking for, the librarian converts your description into a search code, and the system finds the books whose codes are most similar. It never reads every book — it just compares codes. That's vector search.

STEP 1 💬 User Query "What are the payment terms?" plain text string STEP 2 🔢 Embed Query same MiniLM model used at index time 384-dim float vector STEP 3 🔍 HNSW Search Chroma scans graph finds top-k nearest cosine similarity scores STEP 4 📄 Retrieved Chunks [0.93] "Net-30 payment required..." [0.89] "Late fees apply after..." [0.81] "Invoice must be submitted..." ranked by relevance score

Understanding Similarity Scores

The scores you see (0.93, 0.89, 0.81) are cosine similarity values. A score of 1.0 means identical, 0.0 means completely unrelated. In practice, scores above 0.75 are relevant, below 0.5 is noise.

Score RangeMeaningWhat to do
0.90 – 1.00Nearly identical contentAlways include in context
0.75 – 0.89Highly relevantInclude in context
0.55 – 0.74Somewhat relatedInclude only if k is small
Below 0.55Likely noiseFilter out with threshold
💡
Pro tip: Always add a minimum score threshold (e.g. 0.6) so your LLM never receives completely irrelevant context. One bad chunk can derail the whole answer.
Quick Check
Your query embedding and a stored chunk embedding have a cosine similarity of 0.97. What does that mean?
query.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Generation — Getting Answers from a Free Local LLM

Retrieval finds the right paragraphs. Generation turns those paragraphs into a fluent, accurate answer. We use Ollama running Llama 3.2 locally — completely free, no API keys, no usage costs.

Think of the retrieved chunks as sticky notes you hand to a very smart friend (the LLM). You say: "I found these relevant passages from the document. Now, using only these notes, please answer my question." The friend doesn't guess — it sticks to the notes. That's what makes RAG trustworthy.

Prompt Assembly Combine context + question System: You are a helpful assistant. Answer using ONLY the context below. If unsure, say "I don't know." Context: [Chunk 1] Net-30 payment required... [Chunk 2] Late fees apply after 30... Question: What are the payment terms? Full Prompt String Ollama + Llama 3.2 🤖 Runs 100% on your machine. No internet. No API key. No cost. Compatible with OpenAI Python SDK base_url="http://localhost:11434/v1" Grounded Answer Cites only what was retrieved "Based on the contract, payment is due within 30 days of invoice (Net-30). Late fees are applied after the 30-day period. See Section 4.2 for exact amounts." No hallucination — every claim is traceable to a retrieved chunk

Installing Ollama (takes 5 minutes)

1

Download Ollama

Go to ollama.com and download the installer for Mac, Windows, or Linux. It runs as a background service.

2

Pull the model

Run ollama pull llama3.2 in your terminal. It downloads about 2GB. One-time setup.

3

Test it works

Run ollama run llama3.2. You should see a chat prompt. Type anything. Press Ctrl+D to exit.

4

Use it from Python

Install the OpenAI SDK: pip install openai. Point base_url at localhost. No API key needed.

🆓
Completely free: Llama 3.2 on Ollama costs $0. It runs on your CPU (or GPU if you have one). For Colab, use !curl -fsSL https://ollama.com/install.sh | sh in a cell, then !ollama pull llama3.2 &.
Quick Check
Why do we tell the LLM "answer using ONLY the context below" in the system prompt?
generate.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Build the API — Wrapping RAG in a FastAPI Endpoint

You have all the pieces. Now package them into a real HTTP API so any app, website, or chatbot can talk to your RAG system with a single request.

An API is like a waiter at a restaurant. Your app (the customer) tells the waiter what it wants. The waiter goes to the kitchen (your RAG pipeline), gets the result, and brings it back. The app never has to know how the kitchen works — it just sends a request and gets an answer.

📱 Your App web / mobile / script POST /ask FastAPI Server validates request calls query() calls generate() returns JSON 🗄️ Chroma DB top-k chunks 🤖 Ollama LLM generates answer JSON response: {answer, sources, scores}

The Request and Response Shape

Your API accepts a JSON body and returns a structured JSON response. This is what callers send and what they get back:

POST
Generated curl command:
curl -X POST http://localhost:8000/ask \ -H "Content-Type: application/json" \ -d '{"question":"What are the payment terms?","k":5}'
api.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Full Pipeline — Connecting Every Piece End-to-End

This is the complete picture. Every module you have worked through slots into one cohesive pipeline. Two phases, run in sequence: build the index once, then query it as many times as you like.

Phase 1 is like building a library: you collect books, cut them into chapters, label each chapter, and file everything. You do this once. Phase 2 is like consulting that library: you ask a question, the system finds the most relevant chapters, and a smart assistant reads them and gives you a direct answer. The library stays the same unless you add new books.

PHASE 1 — INDEX (run once) 📄 Load Docs PDF, TXT, HTML Module 4 ✂️ Chunk Text fixed / recursive Module 5 🔢 Embed Chunks MiniLM free Module 6 🗄️ Store in Chroma free, local Module 7 PHASE 2 — QUERY (runs every request) 💬 User Question plain text Module 8 🔢 Embed Query same MiniLM Module 8 🔍 Search Chroma HNSW top-k Module 8 🤖 Ollama LLM Llama 3.2 free Module 9 Grounded Answer cites sources, no hallucination Module 10 (API) shared vector store

The complete pipeline.py file

This single file ties together every module — load, chunk, embed, store, query, and generate. Run it and you have a working RAG system.

📋
Requirements: pip install chromadb sentence-transformers openai pymupdf fastapi uvicorn and Ollama running with ollama pull llama3.2. Total install time: about 3 minutes.
pipeline.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

Test and Improve — Making Your RAG System Actually Good

Getting a RAG pipeline running is step one. Getting it to give good answers consistently is the real work. This module shows you the most common failure modes and how to fix each one.

Building a RAG system is like hiring a researcher. A bad researcher misreads your question, finds the wrong files, or makes things up. A good one understands exactly what you asked, finds the most relevant sections, and cites their sources. This module turns your researcher from "pretty good" to "reliable enough to trust."

The 5 Most Common RAG Failures

🔎

Wrong chunks retrieved

The most common problem. Fix by improving your chunking strategy or using a better embedding model. Semantic chunking often helps here.

🪄

LLM hallucination

The model invents facts not in the chunks. Fix by lowering temperature (try 0.0), making the system prompt stricter, and filtering low-score chunks.

📏

Chunks too big or too small

400-word chunks are a starting point. If your docs are dense, go smaller (200-300 words). If they are narrative, go larger (600 words). Always test both.

🔁

Duplicate chunks in results

Add MMR (Maximal Marginal Relevance) — it penalizes redundant results. Chroma does not support MMR natively; implement it with a simple re-ranking loop.

📚

Context window overflow

Sending too many chunks exceeds the model's context limit. Cap at 3-5 chunks for 4K-context models. Use larger models (llama3.1:8b) for longer contexts.

🌐

Multi-document confusion

When querying many docs, use metadata filters. Always include the source document name in the chunk metadata and surface it in the answer.

A Simple Eval Loop

The fastest way to improve is to build a small test set: 20-30 questions with known correct answers. Then measure retrieval recall (did the right chunk appear in the results?) and answer faithfulness (is every claim in the answer traceable to a retrieved chunk?).

MetricWhat it measuresGood score
Retrieval Recall@kDid the correct chunk appear in top-k?> 0.80
Answer FaithfulnessNo invented facts> 0.90
Answer RelevanceDoes the answer address the question?> 0.85
Latency (p95)End-to-end response time< 4 s local
💡
Free eval tools: Try RAGAS (pip install ragas) — it uses an LLM to score your pipeline automatically. Works with Ollama as the judge model, so it costs $0.
Final Check
Your RAG answers are well-written but contain facts that aren't in your documents. What is this called and how do you fix it?
eval.py
No local setup? Run this free in Google Colab — paste the code into a new notebook cell.
🎓

You built a RAG system.

From loading a PDF to serving live answers through an API — using nothing but free, open-source tools. No API keys. No monthly bills. Just your machine and the knowledge you now have.

📄
Doc Loading
PyMuPDF ✓
✂️
Chunking
Fixed + Semantic ✓
🔢
Embeddings
MiniLM-L6 ✓
🗄️
Vector DB
Chroma ✓
🤖
Generation
Ollama / Llama 3.2 ✓
API
FastAPI ✓

What to build next

Add a chat UI with Streamlit or Gradio
Try Contextual Retrieval for better precision
Add multi-user support with auth
Deploy to a free Hugging Face Space
Experiment with hybrid search (BM25 + vector)