RAG + Chengeta AI
Full RAG pipeline with three caching layers — first query ~45s live, repeat query < 1ms from cache.
| Layer | Class | Caches | Bypasses |
|---|---|---|---|
| 1 | EmbeddingCache | text -> float32 vector | Ollama embed API |
| 2 | RetrievalCache | query -> doc list | FAISS search |
| 3 | ResponseCache | messages -> answer | OpenRouter LLM |
LLM: OpenRouter nvidia/nemotron-3-super-120b-a12b:free
Embeddings: Ollama embeddinggemma:300m (local, no API key needed)
Vector index: FAISS (in-process, dim auto-detected)
Cache backend: TieredBackend (InMemory L1 + DiskBackend L2)
How caching works
Setup
# 1. Pull Ollama embedding model
ollama pull embeddinggemma:300m
# 2. Clone and install
cd examples/rag-ollama
uv sync
# 3. Set your OpenRouter API key
cp .env.example .env
# Edit .env → OPENROUTER_API_KEY=sk-or-v1-...
# 4. Index documents
uv run python main.py ingest
# 5. Ask a question
uv run python main.py query "What is retrieval-augmented generation?"
# 6. Demo — shows cache speedup
uv run python main.py demo
Commands
| Command | Description |
|---|---|
python main.py ingest | Embed + index data/docs.txt into FAISS |
python main.py query "..." | Ask a question |
python main.py demo | 5 queries (2 repeats show cache hits) |
python main.py stats | Print hit/miss metrics |
Measured latency
| Call | Time | State |
|---|---|---|
| First query | ~45s | All misses — Ollama embed + FAISS + OpenRouter LLM |
| Repeat query | 0.001–0.003s | All hits — served from disk cache |
Code walkthrough
Cache setup (rag/cache.py)
from chengeta_ai.backends.tiered_backend import TieredBackend
from chengeta_ai.backends.memory_backend import InMemoryBackend
from chengeta_ai.backends.disk_backend import DiskBackend
from chengeta_ai.core.policies import TTLPolicy
manager = CacheManager(
backend=TieredBackend(
l1=InMemoryBackend(max_size=2_000), # hot entries in RAM
l2=DiskBackend("~/.cache/rag-openrouter/store"), # survives restarts
l1_ttl=600,
),
key_builder=CacheKeyBuilder(namespace="rag"),
ttl_policy=TTLPolicy(
default_ttl=3_600,
per_type={"embedding": 86_400 * 7, "retrieval": 3_600, "response": 86_400},
),
)
Embedding with cache (rag/embedder.py)
# EmbeddingCache.set(text, vector, model_id) ← correct arg order
cached = self._cache.get(text, self._model)
if cached is not None:
return cached # skip Ollama entirely
vec = requests.post(
"http://localhost:11434/api/embeddings",
json={"model": model, "prompt": text}, timeout=120,
).json()["embedding"]
self._cache.set(text, np.array(vec, dtype=np.float32), self._model)
Retrieval with cache (rag/indexer.py)
# RetrievalCache.get(query, retriever_id, top_k)
cached = self._cache.get(query, retriever_id="faiss-openrouter", top_k=4)
if cached is not None:
return cached # skip FAISS search
_, indices = self._index.search(query_vec, top_k)
docs = [self._documents[i] for i in indices[0]]
self._cache.set(query, docs, retriever_id="faiss-openrouter", top_k=4)
Generation with cache (rag/generator.py)
# ResponseCache.get(messages, model_id) ← messages first
cached = self._cache.get(messages, self._model)
if cached is not None:
return cached # skip OpenRouter entirely
answer = openai_client.chat.completions.create(
model="nvidia/nemotron-3-super-120b-a12b:free",
messages=messages,
).choices[0].message.content
# ResponseCache.set(messages, response, model_id)
self._cache.set(messages, answer, self._model)