Skip to main content

RAG + Chengeta AI

Full RAG pipeline with three caching layers — first query ~45s live, repeat query < 1ms from cache.

LayerClassCachesBypasses
1EmbeddingCachetext -> float32 vectorOllama embed API
2RetrievalCachequery -> doc listFAISS search
3ResponseCachemessages -> answerOpenRouter LLM

LLM: OpenRouter nvidia/nemotron-3-super-120b-a12b:free Embeddings: Ollama embeddinggemma:300m (local, no API key needed) Vector index: FAISS (in-process, dim auto-detected) Cache backend: TieredBackend (InMemory L1 + DiskBackend L2)


How caching works


Setup

# 1. Pull Ollama embedding model
ollama pull embeddinggemma:300m

# 2. Clone and install
cd examples/rag-ollama
uv sync

# 3. Set your OpenRouter API key
cp .env.example .env
# Edit .env → OPENROUTER_API_KEY=sk-or-v1-...

# 4. Index documents
uv run python main.py ingest

# 5. Ask a question
uv run python main.py query "What is retrieval-augmented generation?"

# 6. Demo — shows cache speedup
uv run python main.py demo

Commands

CommandDescription
python main.py ingestEmbed + index data/docs.txt into FAISS
python main.py query "..."Ask a question
python main.py demo5 queries (2 repeats show cache hits)
python main.py statsPrint hit/miss metrics

Measured latency

CallTimeState
First query~45sAll misses — Ollama embed + FAISS + OpenRouter LLM
Repeat query0.001–0.003sAll hits — served from disk cache

Code walkthrough

Cache setup (rag/cache.py)

from chengeta_ai.backends.tiered_backend import TieredBackend
from chengeta_ai.backends.memory_backend import InMemoryBackend
from chengeta_ai.backends.disk_backend import DiskBackend
from chengeta_ai.core.policies import TTLPolicy

manager = CacheManager(
backend=TieredBackend(
l1=InMemoryBackend(max_size=2_000), # hot entries in RAM
l2=DiskBackend("~/.cache/rag-openrouter/store"), # survives restarts
l1_ttl=600,
),
key_builder=CacheKeyBuilder(namespace="rag"),
ttl_policy=TTLPolicy(
default_ttl=3_600,
per_type={"embedding": 86_400 * 7, "retrieval": 3_600, "response": 86_400},
),
)

Embedding with cache (rag/embedder.py)

# EmbeddingCache.set(text, vector, model_id)  ← correct arg order
cached = self._cache.get(text, self._model)
if cached is not None:
return cached # skip Ollama entirely

vec = requests.post(
"http://localhost:11434/api/embeddings",
json={"model": model, "prompt": text}, timeout=120,
).json()["embedding"]

self._cache.set(text, np.array(vec, dtype=np.float32), self._model)

Retrieval with cache (rag/indexer.py)

# RetrievalCache.get(query, retriever_id, top_k)
cached = self._cache.get(query, retriever_id="faiss-openrouter", top_k=4)
if cached is not None:
return cached # skip FAISS search

_, indices = self._index.search(query_vec, top_k)
docs = [self._documents[i] for i in indices[0]]
self._cache.set(query, docs, retriever_id="faiss-openrouter", top_k=4)

Generation with cache (rag/generator.py)

# ResponseCache.get(messages, model_id)  ← messages first
cached = self._cache.get(messages, self._model)
if cached is not None:
return cached # skip OpenRouter entirely

answer = openai_client.chat.completions.create(
model="nvidia/nemotron-3-super-120b-a12b:free",
messages=messages,
).choices[0].message.content

# ResponseCache.set(messages, response, model_id)
self._cache.set(messages, answer, self._model)

Source

👉 examples/rag-ollama/