RAG vs. direct API calls? If you’ve run into this dilemma, you’re not alone. In the rapidly evolving world of AI development, a new model, framework, or architecture seems to drop every week, each promising smarter, faster, and more efficient solutions. It’s an exciting time, but it also creates a critical challenge: choosing the right tool for the job. Do you go with the sophisticated, multi-step pipeline, or does a simpler, more direct approach hold the answer? Should you use Retrieval-Augmented Generation (RAG) or a direct API call? Let’s find out by comparing the two solutions side by side.
From broad choices to a specific question
Here at SolDevelo, we faced this exact question on a recent project. Our goal was to build a system that automatically processes large volumes of complex documents found online. We needed to:
- Process these lengthy documents,
- Classify them according to a set of specific business rules,
- Extract key pieces of information: specific dates, financial totals, and more.
Our tech stack for the classification part was straightforward: a Python service that used the Haystack library to orchestrate the process, pulled scraped web content from a RabbitMQ message queue, and called a powerful LLM via OpenRouter to perform the classification.
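For orientation, here is a minimal sketch of what such a consumer service could look like. It is an illustration under assumptions, not our production code: the queue name, the connection details, and the classify() stub are placeholders.
# A minimal, illustrative consumer sketch (queue name, connection details, and the
# classify() stub are placeholders, not our production code)
# pip install pika
import pika

def classify(document_text: str) -> None:
    # Hand the scraped text to the classification pipeline described later in this post.
    ...

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="scraped_documents", durable=True)

def on_message(ch, method, properties, body):
    classify(body.decode("utf-8"))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="scraped_documents", on_message_callback=on_message)
channel.start_consuming()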
The core of the project, however, hinged on a single decision: how should we present the document’s data to the AI?
We decided to test two popular methods head-to-head. The first was the sophisticated, industry-favorite Retrieval-Augmented Generation (RAG) pipeline. The second was a deceptively simple Direct API call, sending the entire document as context. The results weren’t what we initially expected, and they taught us a valuable lesson about matching the solution to the specific problem at hand.
Approach #1: The sophisticated solution – building a RAG pipeline
Our first instinct was to build a RAG pipeline. It’s a powerful and popular technique, and for good reason.
What is RAG?
At a high level, RAG works like an expert researcher preparing a briefing for a CEO. Instead of handing the CEO a 100-page report (the entire document), the researcher first reads the report, finds the most relevant paragraphs related to the CEO’s questions (the “retrieval” step), and then presents only those key excerpts to generate a concise answer (the “generation” step). This prevents the CEO (the LLM) from getting overwhelmed and saves a lot of time and effort.
Our implementation with Haystack & OpenRouter
We used the Haystack framework to build our pipeline. The process looked like this:
- Chunking: We took the full text of a document and used a DocumentSplitter to break it down into smaller, more manageable pieces. We settled on chunks of 400 words with an 80-word overlap to ensure no context was lost at the boundaries.
- Embedding: Each chunk was then converted into a vector embedding using a local, open-source model (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). These embeddings are numerical representations that capture the semantic meaning of the text.
- Retrieval: When we wanted to ask a question (e.g., “What is the deadline?”), we would embed the question using the same model and use an InMemoryEmbeddingRetriever to find the top 5 text chunks whose embeddings were most similar to our query’s embedding.
- Generation: Finally, these top 5 relevant chunks, not the entire document, were passed to a powerful LLM via OpenRouter, along with a prompt instructing it to extract the specific information we needed.
Here is a runnable example that demonstrates how such a pipeline can be structured in Python.
# To run this, you'll need a few libraries:
# pip install haystack-ai sentence-transformers openrouter-haystack
import os
from pathlib import Path
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.openrouter import OpenRouterChatGenerator
# --- 0) Initial Setup ---
# Make sure to set your OpenRouter API key as an environment variable
# For this example to run, create a file named 'document.txt' in the same directory.
if not os.environ.get("OPENROUTER_API_KEY"):
    os.environ["OPENROUTER_API_KEY"] = "your_key_here"  # Or set it in your system
# --- 1) Read Document and Write to Document Store ---
doc_store = InMemoryDocumentStore()
try:
    document_path = Path(__file__).parent / "document.txt"
    raw_text = document_path.read_text(encoding="utf-8")
    doc = Document(content=raw_text)
except FileNotFoundError:
    print("Error: document.txt not found. Please create this file with sample text.")
    exit()
# --- 2) Split document, embed chunks, and write to store ---
splitter = DocumentSplitter(split_by="word", split_length=400, split_overlap=80)
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
doc_embedder.warm_up()
# Process the documents
split_docs = splitter.run([doc])["documents"]
embedded_docs = doc_embedder.run(split_docs)["documents"]
writer = DocumentWriter(document_store=doc_store)
writer.run(embedded_docs)
# --- 3) Set up the Retriever ---
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
query_embedder.warm_up()
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=5)
# --- 4) Define the Prompt Template ---
template = [
    ChatMessage.from_system(
        "You are an expert assistant who extracts key information from documents. "
        "Based *only* on the context provided, answer the user's query. "
        "If the information is not in the context, say 'Not found'."
    ),
    ChatMessage.from_user(
        """
        Context:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Query: Based on the context, what are the key effective dates and financial totals?
        """
    ),
]
prompt_builder = ChatPromptBuilder(template=template)
# --- 5) Configure the LLM and Build the Pipeline ---
llm = OpenRouterChatGenerator(model="openai/gpt-5")
pipe = Pipeline()
pipe.add_component("query_embedder", query_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
# Connect the components
pipe.connect("query_embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.messages")
# --- 6) Run the Classification ---
# This query is for the retriever to find relevant chunks
query_for_retriever = "What are the key dates, financial figures, and reference codes?"
result = pipe.run({"query_embedder": {"text": query_for_retriever}})
reply = result["llm"]["replies"][0]
print("LLM reply:", reply.content)
The verdict: Token-efficient, but flawed
The initial results were promising from a cost perspective.
The good (Token-efficiency)
• The RAG pipeline was incredibly token-efficient.
• A 70-page document (~33,000 tokens) was processed using only 1,000–4,000 tokens per call.
• This efficiency was a huge win for managing our operational costs.
• Accuracy for general topics and themes was solid.
The bad (Accuracy issues)
• The pipeline struggled to consistently find specific, literal data points like dates, reference codes, and percentages.
• Why? Embedding models are optimized for semantic similarity, not literal values.
• Example: “the report’s effective date” is semantically close to “key timeline,” but a specific date like “2024-12-15” might not be weighted as heavily by the embedding algorithm, causing the retriever to miss the chunk containing it. The short sketch after this list illustrates the effect.
• We tried tweaking chunk sizes and the number of retrieved documents, but the results remained mixed, with key information still being missed in some of our test documents.
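To make this concrete, here is a small standalone sketch, separate from our pipeline, that scores two invented chunks against a date question using the same embedding model. The chunk texts are made up for illustration, and the exact scores will vary.
# Illustrative only: how a date question scores against two invented chunks
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "What is the report's effective date?"
chunks = [
    "The key timeline and upcoming milestones are summarized below.",   # semantically close, no literal date
    "The agreement takes effect on 2024-12-15, as stated in clause 7.", # contains the literal date
]

query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]

for chunk, score in zip(chunks, scores):
    print(f"{score.item():.3f}  {chunk}")
# Depending on the model and wording, the 'timeline' chunk can score as high as, or higher
# than, the chunk that actually contains the date, which is the kind of miss we kept hitting.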
Could we have improved this?
Almost certainly. We could have invested more time in fine-tuning retrieval queries, experimenting with different chunking strategies, or exploring more specialized embedding models.
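To give a sense of that tuning surface, here is a brief sketch of the knobs available in Haystack; the values below are hypothetical alternatives, not settings we validated.
# Hypothetical tuning alternatives (illustrative values, not validated settings)
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

doc_store = InMemoryDocumentStore()

# Sentence-based chunking instead of fixed word counts
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)

# Retrieve more candidate chunks per query
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=10)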
But this path meant adding more complexity and development time. Before heading down that rabbit hole, we decided to ask a simple question: what if we just tried a more direct route?
Approach #2: The “keep it simple” method – direct API call
That question led us to our second experiment: what if we ignored the complexity of RAG and simply sent the entire document to the LLM in a single API call?
Overcoming old limitations
Just a few years ago, this would have been a non-starter. Early LLMs had tiny context windows (a few thousand tokens) that couldn’t handle large documents. But the landscape has changed dramatically. Many models available through OpenRouter now offer context windows of 200k tokens or more, and some of the newest models reach 2M tokens. Our ~33,000-token document, once too large, now fits comfortably inside.
The second concern was cost. Sending ~33k tokens is roughly 10 times more than the 1-4k tokens used by our RAG pipeline. Surely that would be expensive?
The verdict: Accurate, fast, and surprisingly cheap
The results were immediate and impressive. With the full document available as context, even a smaller, cost-effective model had no trouble accurately extracting every piece of information we needed, including the elusive dates and financial totals.
The implementation was also radically simpler. The entire multi-step process of chunking, embedding, and retrieving could be replaced with a short script. This example demonstrates the core logic:
# To run this, you'll need a few libraries:
# pip install haystack-ai openrouter-haystack
import os
from pathlib import Path
from haystack import Pipeline, Document
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.openrouter import OpenRouterChatGenerator
# --- 0) Set your OpenRouter API key ---
if not os.environ.get("OPENROUTER_API_KEY"):
    os.environ["OPENROUTER_API_KEY"] = "your_key_here"  # Or set it in your system
# --- 1) Read the full document ---
try:
    document_path = Path(__file__).parent / "document.txt"
    raw_text = document_path.read_text(encoding="utf-8")
    doc = Document(content=raw_text)
except FileNotFoundError:
    print("Error: document.txt not found. Please create this file with sample text.")
    exit()
# --- 2) Define the Prompt Template ---
template = [
    ChatMessage.from_system(
        "You are an expert assistant who extracts key information from documents. "
        "Based *only* on the context provided, answer the user's query."
    ),
    ChatMessage.from_user(
        """
        Context:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Query: Based on the context, what are the key effective dates and financial totals?
        """
    ),
]
prompt_builder = ChatPromptBuilder(template=template)
# --- 3) Configure the LLM and Build the Pipeline ---
llm = OpenRouterChatGenerator(model="openai/gpt-4o-mini")
pipe = Pipeline()
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
pipe.connect("prompt_builder.prompt", "llm.messages")
# --- 4) Run the Classification ---
# We pass the entire document directly into the prompt builder
result = pipe.run({"prompt_builder": {"documents": [doc]}})
reply = result["llm"]["replies"][0]
print("LLM reply:", reply.content)The numbers don’t lie (The cost breakdown)
But what about the cost? This is where the numbers were truly surprising. For testing purposes, we used a model that, at the time of testing, cost around $0.05 per 1 million input tokens. Let’s do the math for our 33,000-token document:
(33,000 tokens / 1,000,000 tokens) * $0.05 = $0.00165
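As a quick sanity check, the same arithmetic in a few lines of Python, using the illustrative rate quoted above:
# Cost sanity check with the illustrative rate above
input_tokens = 33_000
price_per_million_tokens = 0.05  # USD per 1M input tokens, at the time of testing
cost = input_tokens / 1_000_000 * price_per_million_tokens
print(f"${cost:.5f}")  # $0.00165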
The total cost for a highly accurate classification was less than two-tenths of a US cent. This discovery opened the door for us to experiment with various models, balancing cost against performance. While accuracy remains our top priority, we are continuously testing different models to pinpoint the best price-to-performance ratio for our production needs. For our specific use case, the direct approach wasn’t just more accurate, it was also astonishingly affordable.
Head-to-head: A clear comparison
To summarize our findings, here’s a direct comparison of the two approaches for our project:
| Feature | RAG approach | Direct API approach |
|---|---|---|
| Accuracy | Good, but struggles with specifics | Excellent |
| Tokens per Classification | Low (1k-4k) | High (~33k) |
| Total Cost & Complexity | Low token cost, but higher infrastructure cost & dev complexity | Negligible token cost, minimal infrastructure complexity |
| Best For… | Conversational Q&A, knowledge bases | One-off classification & extraction |
Explaining the differences
Tokens per classification
- RAG: champion of token efficiency.
- Retrieves only the most relevant chunks, reducing tokens sent to the LLM.
- Great for API cost savings.
Total cost & complexity
This is where the story gets nuanced.
- The RAG approach used fewer tokens, but its total cost of ownership is more than just the API bill. RAG requires running a local embedding model, which can necessitate more powerful (and more expensive) server infrastructure to handle the computational load. It also adds significant development and maintenance complexity.
- The direct API approach used more tokens, but it has a near-zero infrastructure footprint beyond the API call itself and is radically simpler to build and maintain. For our project, this made its total cost significantly lower.
Best For…
The Direct API: perfect for our use case – high-accuracy, one-off analysis of a single document.
RAG: excels where context needs to be dynamically retrieved over multiple interactions. It’s the ideal architecture for building chatbots over a large knowledge base or for multi-turn conversational AI where you need to fetch relevant information to answer follow-up questions accurately.
Conclusion: The right tool for the job
Our journey through these two methods brought us to a clear conclusion: for our specific goal of one-off document classification and data extraction, the direct API approach was the undisputed winner. It delivered superior accuracy with far less implementation complexity and at a cost so low it was practically negligible.
This doesn’t mean RAG is a flawed technology. On the contrary, it remains an incredibly powerful and essential tool for many other applications. If you’re building a chatbot to answer questions over a vast internal knowledge base or need to minimize hallucinations in a multi-turn conversation, RAG is likely the superior architecture.
The key takeaway for us was a reinforcement of a core engineering principle: always match the complexity of the solution to the complexity of the problem. With LLM context windows expanding and token prices plummeting, the most direct path is sometimes the most effective one. It’s crucial to challenge our assumptions and test simple solutions before committing to a more intricate architecture.
At SolDevelo, we pride ourselves on finding these pragmatic and effective solutions to complex problems. Ready to make AI practical for your team? Let’s chat →