Skip to main content

LlamaIndex + Chengeta AI

Two adapters — one for LLM calls, one for QueryEngine (RAG) queries.

Install

pip install 'chengeta-ai[llamaindex]'

LLM Cache

Drop LlamaIndexLLMCacheAdapter in anywhere a LlamaIndex LLM is accepted:

import time
from llama_index.llms.openai import OpenAI
from chengeta_ai import CacheManager, InMemoryBackend, CacheKeyBuilder
from chengeta_ai.adapters.llamaindex_adapter import LlamaIndexLLMCacheAdapter

llm = OpenAI(model="gpt-4o-mini")
manager = CacheManager(
backend=InMemoryBackend(),
key_builder=CacheKeyBuilder(namespace="myapp"),
)
cached_llm = LlamaIndexLLMCacheAdapter(llm, manager)

prompt = "What is retrieval-augmented generation?"

t0 = time.perf_counter()
r1 = cached_llm.complete(prompt)
t1 = time.perf_counter()
print(r1.text)
print(f"First: {t1-t0:.3f}s")

t2 = time.perf_counter()
r2 = cached_llm.complete(prompt)
t3 = time.perf_counter()
print(f"Second: {t3-t2:.3f}s (cache hit)")

Use as global LlamaIndex LLM

from llama_index.core import Settings

Settings.llm = cached_llm # all LlamaIndex components use cached LLM

QueryEngine Cache (RAG)

Cache entire QueryEngine.query() results — no retrieval, no LLM call on repeated queries:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from chengeta_ai.adapters.llamaindex_adapter import LlamaIndexQueryCacheAdapter

documents = SimpleDirectoryReader("data/").load_data()
index = VectorStoreIndex.from_documents(documents)
engine = index.as_query_engine()

cached_engine = LlamaIndexQueryCacheAdapter(engine, manager)

query = "What are the key findings in the report?"

r1 = cached_engine.query(query) # retrieves + calls LLM
r2 = cached_engine.query(query) # instant from cache

# Async
r3 = await cached_engine.aquery(query)

Async LLM

r = await cached_llm.acomplete("Explain vector embeddings")
r = await cached_llm.achat(messages)

Run the cookbook

uv run python -m cookbook.llamaindex.agent