Full RAG Course

Build Your Own
AI Memory System

Learn to build RAG — the technology that lets AI answer questions about your documents. Free tools, no API keys, runs on your laptop.

🆓

100% Free Stack: This course uses Ollama (free, local LLM), sentence-transformers (free embeddings), and Chroma (free vector database). No API keys. No credit cards. Everything runs on your own machine.

Module 1

What is RAG?

Imagine you hired a brilliant researcher. Every time someone asks them a question, they run to the library, pull out only the relevant pages, read them, and give you an answer based on the actual text — not from memory. That is RAG. Retrieval-Augmented Generation.

Why it matters

📚

Your Private Knowledge

AI models are trained on public internet data. RAG lets them answer questions about your private documents — contracts, reports, manuals — that they have never seen before.

🎯

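Reduces Hallucination

Instead of guessing from memory, the AI reads the actual text before answering. With a strict prompt it can also say so when the answer is not in the documents, although no setup removes hallucination entirely (Module 12 covers the fixes).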

🔄

Always Up to Date

AI training is frozen in time. With RAG, you just add new documents — no retraining needed. Yesterday's report can be searched today.

Key Terms

Term	Plain English	One-line definition
RAG	Look-it-up before you answer	Retrieval-Augmented Generation
LLM	A text-predicting brain	Large Language Model (Llama, GPT, Claude etc.)
Embedding	GPS coordinates for meaning	A list of numbers representing text meaning
Vector DB	A library organised by meaning	Database optimised for similarity search
Chunk	A sticky note torn from your document	A small segment of text (200-600 chars)

Quick Check

Why can a regular AI not "know" what is in your company's private documents?

Module 2

How RAG Works

Phase 1 (Index), run once per document: You take all your documents, slice them into pieces, convert each piece into GPS coordinates (embeddings), and store them in a special database. Like building a well-organised filing system.

Phase 2 (Query), run on every search: A user asks a question. You convert their question into GPS coordinates, find the text pieces closest to those coordinates, and hand them to the AI to form an answer.
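The two phases can be sketched in a few lines of plain Python. This is a toy, assuming a word-count vector in place of a real embedding model and a Python list in place of a vector database; the names `toy_embed`, `build_index`, and `query_index` are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: counts lowercase words."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(pieces: list[str]) -> list[tuple[str, Counter]]:
    """Phase 1 (Index): embed every piece once and store it."""
    return [(p, toy_embed(p)) for p in pieces]

def query_index(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Phase 2 (Query): embed the question and return the k closest pieces."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = build_index([
    "payment is due within 30 days of the invoice",
    "either party may cancel with written notice",
    "the office is closed on public holidays",
])
top = query_index(index, "when is payment due", k=1)
print(top)
```

Word counts only match exact words, which is exactly the limitation real embeddings remove; the shape of the two phases is the same either way.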

Quick Check

You add 500 new pages to your document library. What do you need to redo?

Module 3

Free Setup — No API Keys

Forget OpenAI. Forget credit cards. This course uses tools that are completely free and run entirely on your own computer. Ollama is like having a private ChatGPT that never phones home. sentence-transformers runs locally too. Chroma is a database that lives in a folder on your laptop.

🆓

The Free Stack: Ollama (local LLM) + sentence-transformers (local embeddings) + Chroma (local vector DB). Total cost: $0. Runs offline once the models are downloaded. No accounts needed.

Install everything in one step
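The packages this course relies on are the ones listed in the Module 11 requirements: pip install chromadb sentence-transformers openai pymupdf fastapi uvicorn. After installing, a small script along these lines can report which of them are importable. The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of any library.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED = {
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "openai": "openai",
    "pymupdf": "fitz",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
}

def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All packages found.")
```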

No local setup? Run this free in Google Colab — paste the code into a new notebook cell.

💻

Minimum specs: 8GB RAM works for Llama 3.2 (3B model). 16GB RAM is comfortable. For embeddings, even a 5-year-old laptop handles sentence-transformers without issues.

Module 4 — Index Phase, Step 1

Load Your Documents

Before AI can read your documents, we need to extract the raw text from them — like photocopying pages from a book and turning them into a typed transcript. PDFs store text in a special format that Python cannot read directly, so we use a free library called PyMuPDF to pull the words out, page by page.

What file types can you load?

File type	Library	Free?	Best for
PDF	pymupdf (fitz)	Yes	Research papers, reports, contracts
Word .docx	python-docx	Yes	Office documents, memos
Plain .txt	Built-in Python	Yes	Logs, notes, simple text
Webpage	requests + BeautifulSoup	Yes	Articles, documentation sites
Markdown	Built-in Python	Yes	GitHub readmes, wikis
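A loader for some of the formats in the table can be sketched as below. It is a minimal version, assuming PyMuPDF is installed for PDFs (it is imported lazily so plain-text files load without it). `load_document` is a hypothetical helper name, and .docx and web pages are left out to keep the sketch short.

```python
from pathlib import Path

def load_pdf(path: str) -> str:
    """Extract text page by page with PyMuPDF (pip install pymupdf)."""
    import fitz  # imported here so .txt/.md loading works without PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def load_text(path: str) -> str:
    """Plain .txt and Markdown files need only built-in Python."""
    return Path(path).read_text(encoding="utf-8")

def load_document(path: str) -> str:
    """Pick a loader based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return load_pdf(path)
    if suffix in {".txt", ".md"}:
        return load_text(path)
    raise ValueError(f"Unsupported file type: {suffix}")
```

Scanned PDFs that contain only images return little or no text this way; they would need OCR first, which this sketch does not attempt.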

Module 5 — Index Phase, Step 2

Chunking — Slicing Your Text

You cannot hand 200 pages to an AI and say "answer this question": models have a limited reading window. So we slice documents into small pieces, like cutting a long letter into sticky notes. When someone asks a question, we hand the model only the most relevant sticky notes (five by default in this course), not the whole letter.
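To get a feel for the numbers, the sketch below counts how many fixed-size chunks a document produces. It assumes each new chunk starts `size - overlap` characters after the previous one and counts only chunks that contain new text; `count_chunks` is a made-up helper for illustration.

```python
import math

def count_chunks(text_len: int, size: int, overlap: int) -> int:
    """Number of chunks when each chunk starts (size - overlap) characters after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    if text_len <= size:
        return 1 if text_len > 0 else 0
    step = size - overlap
    # The first chunk covers `size` characters; each later chunk adds `step` new ones.
    return 1 + math.ceil((text_len - size) / step)

print(count_chunks(1000, 120, 20))
print(count_chunks(1000, 512, 64))
```

Raising the overlap shrinks the step, so the same document yields more chunks and a larger index.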

# ── Strategy 1: Fixed-Size (simplest) ──────────────────────────────
def fixed_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Splits every `size` characters, repeating the tail of each chunk as overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")  # otherwise the loop never advances
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + size])
        if start + size >= len(text):   # reached the end; skip a trailing chunk of pure overlap
            break
        start += size - overlap
    return chunks

# ── Strategy 2: Recursive (recommended default) ────────────────────
# pip install langchain-text-splitters
# (older LangChain releases expose the same class as langchain.text_splitter)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]  # tries each separator in order
)
# full_text is the string you extracted from your document in Module 4
chunks = splitter.split_text(full_text)

# ── Strategy 3: Semantic (best quality, slowest) ───────────────────
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # free, runs locally

def semantic_chunks(text: str, threshold: float = 0.75) -> list[str]:
    """Starts a new chunk wherever two adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences)

    # cosine similarity between adjacent sentences
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i-1], embeddings[i]) / (
            np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
        )
        if sim < threshold:   # topic shift detected
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    if current:
        chunks.append(". ".join(current))
    return chunks

Module 6 — Index Phase, Step 3

Embeddings — GPS for Meaning

An embedding turns words into GPS coordinates. Every piece of text becomes a list of numbers (384 of them with the all-MiniLM-L6-v2 model used in this course) that represent what it means. Texts with similar meanings get similar coordinates. "Dog" and "puppy" land close together. "Dog" and "rocket" land far apart. This lets us search by meaning, not just by matching exact words.
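Closeness between two embeddings is usually measured with cosine similarity. The sketch below computes it by hand on made-up three-number vectors; real embeddings are much longer, and these values are invented purely for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, not output from a real model
dog    = [0.9, 0.8, 0.1]
puppy  = [0.8, 0.9, 0.2]
rocket = [0.1, 0.2, 0.9]

print(round(cosine_similarity(dog, puppy), 2))
print(round(cosine_similarity(dog, rocket), 2))
```

The first pair points in nearly the same direction and scores close to 1.0, while the second pair scores much lower.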

Module 7 — Index Phase, Step 4

Vector Database — Your AI Filing Cabinet

A normal database organises files alphabetically or by date. A vector database organises them by meaning. When you search it, you say "find me everything that means something similar to this question" — and it returns the closest matches in milliseconds across millions of documents, using a clever algorithm called HNSW.

HNSW: How it finds things fast

HNSW (Hierarchical Navigable Small World) is like a map with several zoom levels; picture three. Zoomed all the way out (layer 2) you see only major cities, so you can navigate quickly to the right region. Zoom in to layer 1 for neighbourhoods. Zoom in to layer 0 for every house. Instead of checking all houses from the start, you jump: city, neighbourhood, house. That is why it is fast even with millions of vectors.
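The city, neighbourhood, house idea can be sketched with a hand-built toy. This is a heavily simplified illustration, not how Chroma implements it: the three layers are written out by hand, the points sit on a line, and the search keeps a single current node, whereas real HNSW builds its layers randomly and tracks a list of candidates. All names here are hypothetical.

```python
import math

def greedy_search(graph: dict, points: dict, query: tuple, entry: int) -> int:
    """Move to whichever neighbour is closest to the query until none is closer."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query), default=current)
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best
        else:
            return current

def layered_search(layers: list[dict], points: dict, query: tuple, entry: int) -> int:
    """Search the sparsest layer first, then reuse the result as the entry point below."""
    for graph in layers:
        entry = greedy_search(graph, points, query, entry)
    return entry

# Nine "houses" on a line at positions 0..8
points = {i: (float(i),) for i in range(9)}
layers = [
    {0: [8], 8: [0]},                                        # layer 2: cities
    {0: [4], 4: [0, 8], 8: [4]},                             # layer 1: neighbourhoods
    {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)},  # layer 0: every house
]

nearest = layered_search(layers, points, (5.2,), entry=0)
print(nearest)
```

The search visits only a handful of nodes on its way down instead of measuring the distance to all nine.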

# pip install chromadb sentence-transformers  (both free)
import chromadb
from sentence_transformers import SentenceTransformer
import uuid

# Local, persistent — stores in a folder called ./my_rag_db
client     = chromadb.PersistentClient(path="./my_rag_db")
collection = client.get_or_create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"}
)

model = SentenceTransformer("all-MiniLM-L6-v2")

def store_chunks(chunks: list[str], doc_name: str):
    """Embed and store chunks with metadata."""
    vectors = model.encode(chunks, normalize_embeddings=True).tolist()

    collection.add(
        ids        = [str(uuid.uuid4()) for _ in chunks],
        embeddings = vectors,
        documents  = chunks,
        metadatas  = [{"doc": doc_name} for _ in chunks]
    )
    print(f"Stored {len(chunks)} chunks from {doc_name}")

def search(query: str, k=5, doc_filter=None) -> list[dict]:
    """Find the top-k most relevant chunks."""
    query_vec = model.encode([query], normalize_embeddings=True).tolist()

    # Optional metadata filter: restrict the search to one source document
    where = {"doc": doc_filter} if doc_filter else None

    results = collection.query(
        query_embeddings = query_vec,
        n_results        = k,
        where            = where,
        include          = ["documents", "distances", "metadatas"]
    )

    return [{
        "text":  doc,
        "score": 1 - dist,   # convert cosine distance to similarity
        "doc":   meta["doc"]
    } for doc, dist, meta in zip(
        results["documents"][0],
        results["distances"][0],
        results["metadatas"][0]
    )]

Module 8

Querying — Turning Questions Into Vectors

This is the moment a user types a question and your system springs into action. Everything you built in Modules 4-7 now pays off: a question comes in, gets embedded with the same model you used for the chunks, and is matched against your stored vectors.

Think of it like a library catalogue. You describe what you're looking for, the librarian converts your description into a search code, and the system finds the books whose codes are most similar. It never reads every book — it just compares codes. That's vector search.

Understanding Similarity Scores

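The scores returned by a search (for example 0.93, 0.89, 0.81) are cosine similarity values. A score of 1.0 means the two vectors point in the same direction, and a score near 0.0 means the texts are unrelated (cosine similarity can go as low as -1.0, but text embeddings rarely score that low). As a rough rule of thumb, scores above 0.75 are relevant and scores below 0.55 are noise. The exact cut-offs depend on your embedding model and your documents, so check them against your own data.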

Score Range	Meaning	What to do
0.90 – 1.00	Nearly identical content	Always include in context
0.75 – 0.89	Highly relevant	Include in context
0.55 – 0.74	Somewhat related	Include only if k is small
Below 0.55	Likely noise	Filter out with threshold

💡

Pro tip: Always add a minimum score threshold (e.g. 0.6) so your LLM never receives completely irrelevant context. One bad chunk can derail the whole answer.

Quick Check

Your query embedding and a stored chunk embedding have a cosine similarity of 0.97. What does that mean?

from sentence_transformers import SentenceTransformer
import chromadb

model      = SentenceTransformer("all-MiniLM-L6-v2")
client     = chromadb.PersistentClient(path="./my_rag_db")
collection = client.get_or_create_collection("my_docs", metadata={"hnsw:space":"cosine"})

def query(question: str, k: int = 5, min_score: float = 0.6) -> list[dict]:
    vec = model.encode([question], normalize_embeddings=True).tolist()
    res = collection.query(
        query_embeddings=vec,
        n_results=k,
        include=["documents", "distances", "metadatas"]
    )
    results = []
    for doc, dist, meta in zip(
        res["documents"][0],
        res["distances"][0],
        res["metadatas"][0]
    ):
        score = 1 - dist   # Chroma returns distance; flip to similarity
        if score >= min_score:
            results.append({"text": doc, "score": round(score, 3), "source": meta.get("doc","?")})
    return results

# Try it:
hits = query("What are the payment terms?", k=3, min_score=0.65)
for h in hits:
    print(f"[{h['score']}] ({h['source']}) {h['text'][:80]}...")

Module 9

Generation — Getting Answers from a Free Local LLM

Retrieval finds the right paragraphs. Generation turns those paragraphs into a fluent, accurate answer. We use Ollama running Llama 3.2 locally — completely free, no API keys, no usage costs.

Think of the retrieved chunks as sticky notes you hand to a very smart friend (the LLM). You say: "I found these relevant passages from the document. Now, using only these notes, please answer my question." The friend doesn't guess — it sticks to the notes. That's what makes RAG trustworthy.

Installing Ollama (takes 5 minutes)

1. Download Ollama. Go to ollama.com and download the installer for Mac, Windows, or Linux. It runs as a background service.

2. Pull the model. Run ollama pull llama3.2 in your terminal. It downloads about 2GB. One-time setup.

3. Test it works. Run ollama run llama3.2. You should see a chat prompt. Type anything. Press Ctrl+D to exit.

4. Use it from Python. Install the OpenAI SDK: pip install openai. Point base_url at localhost. No real API key is needed; the SDK only asks for a non-empty placeholder string.

🆓

Completely free: Llama 3.2 on Ollama costs $0. It runs on your CPU (or GPU if you have one). For Colab, run !curl -fsSL https://ollama.com/install.sh | sh in a cell, start the server in the background with !nohup ollama serve > ollama.log 2>&1 &, then run !ollama pull llama3.2.
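Before calling the model from Python, it can help to confirm that the Ollama server is up and the model has been pulled. The sketch below asks Ollama's local HTTP API for its list of models using only the standard library; it assumes the default address http://localhost:11434, and `list_ollama_models` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Optional

def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> Optional[list]:
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError):   # connection refused, timeout, or a non-JSON reply
        return None
    return [m.get("name", "") for m in data.get("models", [])]

if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable. Is the server running?")
    elif not any(name.startswith("llama3.2") for name in models):
        print("Server is up, but llama3.2 is missing. Run: ollama pull llama3.2")
    else:
        print("Ready:", ", ".join(models))
```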

Quick Check

Why do we tell the LLM "answer using ONLY the context below" in the prompt?

# pip install openai  (the SDK, NOT the service — we point it at Ollama)
# First: run `ollama pull llama3.2` in your terminal once
from openai import OpenAI

# Point the SDK at your local Ollama instance — no API key needed
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def build_prompt(question: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return f"""You are a helpful assistant. Answer the user's question using ONLY
the context provided below. If the answer is not in the context, say
"I don't have enough information to answer that."

Context:
{context}

Question: {question}"""

def generate(question: str, chunks: list[dict]) -> str:
    prompt = build_prompt(question, chunks)
    response = client.chat.completions.create(
        model="llama3.2",
        messages=[
            {"role": "system", "content": "You are a helpful, concise assistant."},
            {"role": "user",   "content": prompt}
        ],
        temperature=0.1,   # low temp = more factual
        max_tokens=512
    )
    return response.choices[0].message.content

# Example (assumes you already have chunks from Module 8):
# chunks = query("What are the payment terms?", k=3)
# answer = generate("What are the payment terms?", chunks)
# print(answer)

Module 10

Build the API — Wrapping RAG in a FastAPI Endpoint

You have all the pieces. Now package them into a real HTTP API so any app, website, or chatbot can talk to your RAG system with a single request.

An API is like a waiter at a restaurant. Your app (the customer) tells the waiter what it wants. The waiter goes to the kitchen (your RAG pipeline), gets the result, and brings it back. The app never has to know how the kitchen works — it just sends a request and gets an answer.

The Request and Response Shape

Your API accepts a JSON body and returns a structured JSON response. This is what callers send and what they get back:

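A request is a POST to the /ask endpoint with a JSON body holding the question, the number of results to retrieve (k), and an optional document filter (doc_filter). For example:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question":"What are the payment terms?","k":5}'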

# pip install fastapi uvicorn openai chromadb sentence-transformers
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import chromadb

app    = FastAPI(title="My RAG API")
model  = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./my_rag_db")
col    = chroma.get_or_create_collection("my_docs", metadata={"hnsw:space":"cosine"})
llm    = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

class AskRequest(BaseModel):
    question:   str
    k:          int   = 5
    doc_filter: str | None = None
    min_score:  float = 0.6

@app.post("/ask")
def ask(req: AskRequest):
    # 1. Embed the question
    vec = model.encode([req.question], normalize_embeddings=True).tolist()
    # 2. Search Chroma
    where = {"doc": req.doc_filter} if req.doc_filter else None
    res   = col.query(query_embeddings=vec, n_results=req.k, where=where,
                      include=["documents","distances","metadatas"])
    chunks = [
        {"text": doc, "score": round(1 - dist, 3), "source": meta.get("doc","?")}
        for doc, dist, meta in zip(
            res["documents"][0], res["distances"][0], res["metadatas"][0]
        ) if (1 - dist) >= req.min_score
    ]
    if not chunks:
        return {"answer": "No relevant context found.", "sources": []}
    # 3. Build prompt and generate
    context  = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    messages = [
        {"role":"system", "content":"Answer using ONLY the context. Say 'I don't know' if unsure."},
        {"role":"user",   "content": f"Context:\n{context}\n\nQuestion: {req.question}"}
    ]
    answer = llm.chat.completions.create(
        model="llama3.2", messages=messages, temperature=0.1, max_tokens=512
    ).choices[0].message.content
    return {"answer": answer, "sources": chunks}

# Run with: uvicorn api:app --reload
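Once the server is running (the uvicorn command above assumes the code is saved as api.py), the endpoint can also be called from Python. The sketch below is a minimal standard-library client, assuming the default address http://localhost:8000/ask; `build_ask_request` and `ask_api` are hypothetical helper names.

```python
import json
import urllib.request

def build_ask_request(question: str, k: int = 5, doc_filter=None,
                      url: str = "http://localhost:8000/ask") -> urllib.request.Request:
    """Build the POST request with the JSON body the /ask endpoint expects."""
    payload = {"question": question, "k": k}
    if doc_filter is not None:
        payload["doc_filter"] = doc_filter
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_api(question: str, **kwargs) -> dict:
    """Send the request and return the parsed response ("answer" and "sources")."""
    request = build_ask_request(question, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)

# Example (needs the API and Ollama running):
# result = ask_api("What are the payment terms?", k=3)
# print(result["answer"])
```

The long timeout is deliberate: a local model on CPU can take a while to generate an answer.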

Module 11

Full Pipeline — Connecting Every Piece End-to-End

This is the complete picture. Every module you have worked through slots into one cohesive pipeline. Two phases, run in sequence: build the index once, then query it as many times as you like.

Phase 1 is like building a library: you collect books, cut them into chapters, label each chapter, and file everything. You do this once. Phase 2 is like consulting that library: you ask a question, the system finds the most relevant chapters, and a smart assistant reads them and gives you a direct answer. The library stays the same unless you add new books.

The complete pipeline.py file

This single file ties together every module — load, chunk, embed, store, query, and generate. Run it and you have a working RAG system.

📋

Requirements: run pip install chromadb sentence-transformers openai pymupdf fastapi uvicorn, and have Ollama running with the model already pulled (ollama pull llama3.2). Total install time: about 3 minutes.

"""
Full RAG Pipeline — free stack, no API keys required.
  - sentence-transformers  (free local embeddings)
  - chromadb               (free local vector DB)
  - ollama + llama3.2      (free local LLM)
"""
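import uuid
from pathlib import Path

import fitz                          # pip install pymupdf
import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# ── Models & DB ──────────────────────────────────────────
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma      = chromadb.PersistentClient(path="./rag_db")
collection  = chroma.get_or_create_collection(
    "docs", metadata={"hnsw:space": "cosine"}
)
llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# ── Phase 1: Index ────────────────────────────────────────
def load_pdf(path: str) -> str:
    """Extract the text of every page and join it into one string."""
    with fitz.open(path) as pdf:
        return " ".join(page.get_text() for page in pdf)

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split into chunks of `size` words, sharing `overlap` words between neighbours."""
    # Note: all-MiniLM-L6-v2 truncates input beyond 256 word pieces, so the tail of
    # a 400-word chunk is stored but not reflected in its embedding.
    words, chunks, i = text.split(), [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        if i + size >= len(words):   # reached the end; skip a trailing chunk of pure overlap
            break
        i += size - overlap
    return chunks

def index_file(path: str):
    text   = load_pdf(path)
    chunks = chunk(text)
    doc    = Path(path).name         # file name only, on any operating system
    if not chunks:                   # e.g. a scanned PDF with no extractable text
        print(f"No text found in {doc}; nothing indexed")
        return
    vecs   = embed_model.encode(chunks, normalize_embeddings=True).tolist()
    collection.add(
        ids       =[str(uuid.uuid4()) for _ in chunks],
        embeddings=vecs,
        documents =chunks,
        metadatas =[{"doc": doc} for _ in chunks]
    )
    print(f"Indexed {len(chunks)} chunks from {doc}")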

# ── Phase 2: Query ────────────────────────────────────────
def search(question: str, k=5, min_score=0.6) -> list[dict]:
    vec = embed_model.encode([question], normalize_embeddings=True).tolist()
    res = collection.query(query_embeddings=vec, n_results=k,
                           include=["documents","distances","metadatas"])
    return [
        {"text": d, "score": round(1-s, 3), "source": m.get("doc","?")}
        for d, s, m in zip(res["documents"][0],
                           res["distances"][0],
                           res["metadatas"][0])
        if (1 - s) >= min_score
    ]

def generate(question: str, chunks: list[dict]) -> str:
    if not chunks:
        return "I could not find relevant information in your documents."
    ctx = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    resp = llm.chat.completions.create(
        model="llama3.2",
        messages=[
            {"role":"system","content":"Answer using ONLY the context. Say 'I don't know' if unsure."},
            {"role":"user",  "content":f"Context:\n{ctx}\n\nQuestion: {question}"}
        ],
        temperature=0.1, max_tokens=512
    )
    return resp.choices[0].message.content

def ask(question: str) -> dict:
    chunks = search(question)
    answer = generate(question, chunks)
    return {"answer": answer, "sources": chunks}

# ── Run it ────────────────────────────────────────────────
if __name__ == "__main__":
    # Index a document (do this once)
    # index_file("my_contract.pdf")

    # Then ask questions
    result = ask("What are the payment terms?")
    print("Answer:", result["answer"])
    print("\nSources:")
    for s in result["sources"]:
        print(f"  [{s['score']}] {s['source']}: {s['text'][:60]}...")

Module 12

Test and Improve — Making Your RAG System Actually Good

Getting a RAG pipeline running is step one. Getting it to give good answers consistently is the real work. This module shows you the most common failure modes and how to fix each one.

Building a RAG system is like hiring a researcher. A bad researcher misreads your question, finds the wrong files, or makes things up. A good one understands exactly what you asked, finds the most relevant sections, and cites their sources. This module turns your researcher from "pretty good" to "reliable enough to trust."

The 6 Most Common RAG Failures

🔎

Wrong chunks retrieved

The most common problem. Fix by improving your chunking strategy or using a better embedding model. Semantic chunking often helps here.

🪄

LLM hallucination

The model invents facts not in the chunks. Fix by lowering temperature (try 0.0), making the system prompt stricter, and filtering low-score chunks.

📏

Chunks too big or too small

400-word chunks are a starting point. If your docs are dense, go smaller (200-300 words). If they are narrative, go larger (600 words). Always test both.

🔁

Duplicate chunks in results

Add MMR (Maximal Marginal Relevance) — it penalizes redundant results. Chroma does not support MMR natively; implement it with a simple re-ranking loop.

📚

Context window overflow

Sending too many chunks exceeds the model's context limit. Cap at 3-5 chunks when the context window is 4K tokens (five 400-word chunks already come to roughly 2,500-3,000 tokens). Use a larger model (llama3.1:8b) if you need longer contexts.

🌐

Multi-document confusion

When querying many docs, use metadata filters. Always include the source document name in the chunk metadata and surface it in the answer.
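For the context-overflow case, one option is to trim the retrieved chunks to a token budget before building the prompt. The sketch below is a rough version that assumes about four characters per token (a common rule of thumb for English, not a real tokenizer) and an arbitrary budget of 3,000 tokens. `fit_to_budget` is a hypothetical helper, and the chunk dictionaries follow the "text" and "score" keys used in this course.

```python
def fit_to_budget(chunks: list[dict], max_tokens: int = 3000,
                  chars_per_token: float = 4.0) -> list[dict]:
    """Keep the highest-scoring chunks whose estimated token total stays within max_tokens."""
    kept, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        estimate = int(len(c["text"]) / chars_per_token) + 1   # rough estimate only
        if used + estimate > max_tokens:
            break
        kept.append(c)
        used += estimate
    return kept

example = [
    {"text": "a" * 400, "score": 0.7},
    {"text": "b" * 400, "score": 0.9},
    {"text": "c" * 400, "score": 0.8},
]
trimmed = fit_to_budget(example, max_tokens=250)
print([c["score"] for c in trimmed])
```

Leave headroom in the budget for the question, the instructions, and the answer itself, since they share the same context window.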

A Simple Eval Loop

The fastest way to improve is to build a small test set: 20-30 questions with known correct answers. Then measure retrieval recall (did the right chunk appear in the results?) and answer faithfulness (is every claim in the answer traceable to a retrieved chunk?).

Metric	What it measures	Good score
Retrieval Recall@k	Did the correct chunk appear in top-k?	> 0.80
Answer Faithfulness	No invented facts	> 0.90
Answer Relevance	Does the answer address the question?	> 0.85
Latency (p95)	End-to-end response time	< 4 s local
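The latency row can be measured with a few lines of standard-library Python. The sketch below times each call and reports the 95th percentile using the nearest-rank method; `p95` and `time_calls` are made-up helper names, and `fn` stands for whatever function answers a question in your pipeline.

```python
import time

def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]

def time_calls(fn, questions: list[str]) -> list[float]:
    """Run fn(question) for each question and return the elapsed seconds per call."""
    timings = []
    for q in questions:
        start = time.perf_counter()
        fn(q)
        timings.append(time.perf_counter() - start)
    return timings

# Example with a stand-in function; replace the lambda with your own ask function
timings = time_calls(lambda q: len(q), ["What are the payment terms?"] * 20)
print(f"p95 latency: {p95(timings):.6f} s")
```

With only 20-30 test questions the p95 is close to the slowest call, so treat it as a rough guide rather than a precise figure.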

💡

Free eval tools: Try RAGAS (pip install ragas) — it uses an LLM to score your pipeline automatically. Works with Ollama as the judge model, so it costs $0.

Final Check

Your RAG answers are well-written but contain facts that aren't in your documents. What is this called and how do you fix it?

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# ── MMR Re-Ranking (removes duplicate chunks) ─────────────
def mmr(query: str, chunks: list[dict], top_n: int = 5, lam: float = 0.7) -> list[dict]:
    """
    Maximal Marginal Relevance.
    lam=1.0 means pure relevance, lam=0.0 means pure diversity.
    """
    if not chunks:
        return []
    # Normalised vectors, so a dot product equals cosine similarity
    q_vec  = model.encode([query], normalize_embeddings=True)[0]
    c_vecs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

    selected, remaining = [], list(range(len(chunks)))
    while len(selected) < min(top_n, len(chunks)):
        scores = []
        for i in remaining:
            rel = float(np.dot(q_vec, c_vecs[i]))  # similarity to query
            if selected:
                # redundancy: similarity to the closest chunk already picked
                red = max(float(np.dot(c_vecs[i], c_vecs[j])) for j in selected)
            else:
                red = 0.0
            scores.append((lam * rel - (1 - lam) * red, i))
        _, best = max(scores)
        selected.append(best)
        remaining.remove(best)
    return [chunks[i] for i in selected]

# ── Simple Eval Loop ──────────────────────────────────────
test_set = [
    {"question": "What are the payment terms?",
     "expected_keywords": ["net-30", "invoice", "30 days"]},
    {"question": "What is the cancellation policy?",
     "expected_keywords": ["written notice", "30 days", "terminate"]},
]

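def recall_at_k(chunks: list[dict], keywords: list[str]) -> float:
    """Keyword-based proxy for recall: the fraction of expected keywords found in the retrieved chunks."""
    if not keywords:
        return 0.0   # nothing to look for; avoids dividing by zero
    combined = " ".join(c["text"].lower() for c in chunks)
    hits = sum(1 for kw in keywords if kw.lower() in combined)
    return hits / len(keywords)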

# from pipeline import ask, search
# for test in test_set:
#     chunks = search(test["question"], k=5)
#     score  = recall_at_k(chunks, test["expected_keywords"])
#     print(f"Q: {test['question'][:50]}")
#     print(f"   Recall@5: {score:.0%}")

🎓

You built a RAG system.

From loading a PDF to serving live answers through an API — using nothing but free, open-source tools. No API keys. No monthly bills. Just your machine and the knowledge you now have.

📄

Doc Loading

PyMuPDF ✓

✂️

Chunking

Fixed + Semantic ✓

🔢

Embeddings

MiniLM-L6 ✓

🗄️

Vector DB

Chroma ✓

🤖

Generation

Ollama / Llama 3.2 ✓

⚡

API

FastAPI ✓

What to build next

Add a chat UI with Streamlit or Gradio

Try Contextual Retrieval for better precision

Add multi-user support with auth

Deploy to a free Hugging Face Space

Experiment with hybrid search (BM25 + vector)
