For many organizations, extracting critical information from documents is a notorious pain point. The data is always there, but it’s often buried in dozens of pages of dense, technical text. Missing a key compliance rule or a deadline can mean financial risk and weeks of wasted effort. Our goal was to eliminate this manual bottleneck for our client by building an AI-powered solution.
We tested nine LLMs against two non-negotiable mandates: 100% accuracy, and a way to prove and validate that accuracy at scale. Here is the journey that led us to an optimized, cost-effective model and to the development of an AI Jury – a validation process over 180 times faster than human review.
The Pain: Why manual document analysis fails
The challenge for our client was not just volume, but the density and complexity of their documents. They faced critical risks and productivity bottlenecks because:
- Information is buried: Key data is scattered across long documents, making manual extraction time-intensive.
- High cost of error: A single factual mistake in compliance or a contract term can result in costly legal or operational issues.
- Zero scalability: Reliance on human domain experts made the validation process itself a massive bottleneck that could not scale as the organization grew.
This high-stakes environment meant any AI solution could not just be “good” – it had to be perfectly trustworthy.
The Criteria: Our mandates for an AI solution
To solve this pain point, our AI solution had to meet two non-negotiable requirements.
Criteria 1: 100% Accuracy and nuance
The AI had to perform two distinct tasks on every document with expert-level precision:
- Complex classification: This is about understanding complex rules and nuance. The model needs to read the entire document and answer questions like, “Is this project compliant with new regulations?” or “Does this agreement require a co-signature?” The answer isn’t just a simple “yes” or “no”; it’s often nuanced, like “Yes, but only as part of a consortium.”
- Extraction: This is more straightforward data retrieval. The model needs to find specific data points, such as the project deadline, the total contract value, or the list of eligible regions.
Criteria 2: Explainability and trust (The evidence field)
To build essential user trust and satisfy the requirements, the system had to go beyond a simple answer. For every output, the AI was required to provide two fields:
- value: The AI’s final answer (e.g., a specific date, a list of countries, or a True/False value).
- evidence: The AI’s explanation for its answer, which often includes a direct quote from the source text.
This evidence field became one of the most important parts of our evaluation. It not only allows a user to instantly verify the AI’s conclusion, but it also provides the critical context that a simple value field would miss, building essential trust in the system.
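To make the output format concrete, here is a minimal sketch of what such a two-field result could look like in code. The Pydantic model and the sample answer are illustrative assumptions, not the production schema; only the `value` and `evidence` field names come from the requirements above.

```python
from pydantic import BaseModel


class FieldResult(BaseModel):
    """One classification or extraction result, following the two-field contract described above."""
    value: bool | str | list[str]  # the AI's final answer (e.g., True/False, a date, a list of regions)
    evidence: str                  # the AI's explanation, ideally quoting the source text


# Hypothetical example of what a "co-signature required?" classification might return;
# the quoted evidence is invented for illustration.
co_signature = FieldResult(
    value=True,
    evidence="The agreement is eligible only when signed jointly by all consortium members.",
)
```

In our evaluation, the quality of that evidence string turned out to matter as much as the value itself.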
Our evaluation metrics
When evaluating LLMs, you can measure many things. Public benchmarks available online test for speed (tokens per second), cost (price per million tokens), complex reasoning, coding abilities, and more.
For our specific use case, however, our priorities were very clear:
- Accuracy was the #1 priority. A wrong answer on a key piece of information is worse than no answer at all. The system had to be trustworthy.
- Cost was a secondary, but important, metric. While our goal was 100% accuracy on all fields, we were interested in the cost-to-performance ratio. A model that was 10x the price would need to prove it was significantly better than a cheaper alternative that could be tuned to perfection.
- AI response time was irrelevant. Our tool was designed to process and classify a new document in the background, store the results and notify the user if needed. Whether the analysis took 10 seconds or 1 minute made no practical difference to the user experience.
With this framework, we were ready to set up our first test: a direct comparison against a human expert.
The Experiment Part 1: The human-led test
Our first step was to establish a “gold standard.” A human domain expert manually analyzed three complex documents, creating a perfect set of classifications and extractions for each.
Next, we ran the same three documents through our 9 candidate LLMs. The expert then meticulously compared each model’s output (all 9 sets of answers) against the gold standard.
This review was more complex than just checking for “correct” or “incorrect.” We found that most models were factually accurate on simple extractions. The real difference was in the quality and usefulness of the evidence and the interpretation of nuanced rules.
Because of this, the expert used a 3-point “preference” scale, where a lower score is better:
- Excellent (1): The answer is accurate and the evidence is complete and highly useful.
- Good (2): The answer is accurate, but the evidence is weak, partially missing, or could be explained better.
- Low usefulness / Incorrect (3): The answer is factually wrong, or it is technically correct but so poorly explained that it is not useful.
Here are the final sub-scores for one of the test documents, which give a clear picture of the performance spread:
| Model | Average score for all documents (lower is better) |
| --- | --- |
| GPT-5 Mini | 1.8 |
| GPT-5 | 2.0 |
| Gemini 2.5 Pro | 2.0 |
| Gemini 2.5 Flash | 2.0 |
| GPT-5 Nano | 2.1 |
| GPT-4.1 Mini | 2.2 |
| Claude Sonnet 4.5 | 2.3 |
| Claude Opus 4.1 | 2.4 |
| GLM 4.6 | 2.5 |
Key Discovery: High accuracy wasn’t limited to top-tier models
This manual process was slow and laborious. It took our expert over 3 hours just to manually classify a single document – but the insights were invaluable:
- Some models performed poorly: We immediately identified and removed several consistently low-scoring models from future consideration.
- Unexpected issues appeared: Problematic behaviors emerged (e.g., date unawareness, language switching). We determined these were often “quirks” fixable through more specific prompt instructions.
- Key discovery: High accuracy wasn’t limited to the most expensive, top-tier models. This proved our accuracy goal was achievable with cheaper alternatives, shifting our focus to finding the best cost-to-performance ratio.
- Pivoting to a target model (GPT-5 Nano): Based on its strong performance and lower cost, we selected GPT-5 Nano as the target. Through iterative prompt engineering, we added instructions to fix its quirks and elevate its performance to the required “excellent” standard.
This process was a success, but it was far too slow to be repeatable. We needed a way to run these evaluations automatically, so we could verify that accuracy stayed high whenever we added more classification rules or encountered a particularly challenging document. This led us to our next idea: if a human expert can grade the AI, could another AI do it for us?
The Experiment Part 2: Assembling the AI Jury for validation
The “Why”: A need for speed
Our human-led review was a success. It gave us invaluable insights and, most importantly, confirmed our key finding: high accuracy wasn’t exclusive to flagship models.
But this process had a fatal flaw: it was painfully slow (over 3 hours per single document). This was a massive bottleneck. We couldn’t possibly repeat this process to test our prompt refinements, validate new models, or run continuous quality checks. It just wasn’t scalable.
We needed a way to automate the evaluation itself. If a human expert can grade an AI’s output, could another AI do it for us?
The Methodology: An AI Jury of peers
We decided to build an “AI Jury” to act as a proxy for our human expert. We selected three powerful, flagship models to serve as the jurors: GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5.
The methodology was simple:
- We gave each juror the exact same “Correct Answer Sheet” (the “gold standard”) our human expert had created.
- We then gave them each of the 9 candidate models’ outputs, one by one.
- Their task was to compare the model’s answer to the ground truth and provide a score from 1-10, based on a detailed rubric.
To ensure the jury was meticulous and impartial, we gave them a very specific system prompt. This prompt instructed them to act as expert document evaluators and defined the exact scoring rubric for their 1-10 rating.
For those interested in the exact details, here’s the prompt we provided to each juror:
system_prompt = """
You are a meticulous and impartial expert analyst. Your sole task is to analyze the performance of an AI model that was tasked with classifying, extracting, and summarizing information from a complex technical document.
You will be given two pieces of information:
1. The Correct Answer Sheet: This was prepared by a human expert and is considered the "ground truth"
2. The Model's Answer: This is the output from the AI model you are evaluating. Most of the answers contain the AI's explanation of the decision.
Your instructions are as follows:
1. Compare Field by Field: For each classification and extraction field, carefully compare the value in "The Model's Answer" against the value in "The Correct Answer Sheet."
2. Evaluate Accuracy, Nuance, and Evidence: Assess the factual correctness and nuance of the value. Additionally, evaluate the evidence provided by the model. Useful evidence is a direct quote from the source text that clearly supports the value. It should help a human user quickly verify the information.
3. Evaluate the Document Summary: Read the document_summary. A high-quality summary must coherently include the document's main purpose, key entities involved, critical dates, financial figures, and any major compliance or eligibility requirements.
4. Provide an Overall Score: Give a single, overall score from 1 to 10 based on the model's total performance across all tasks, using the scoring rubric below.
5. Provide a Justification: Write a brief, clear justification for your score, highlighting the model's key successes and failures in classification, extraction, evidence quality, and summary generation.
SCORING RUBRIC
10 (Perfect): The model's answers are a perfect match with the expert's in both value and nuance. The evidence is always a relevant, helpful quote. The summary is flawless.
9 (Excellent): The model's answers are correct but might have very minor wording differences. The evidence is consistently good. The summary is complete and well-written.
7-8 (Good): The model is largely correct but has one or two minor errors in value, or the evidence is sometimes weak/irrelevant. The summary might be missing one or two non-critical details.
5-6 (Acceptable): The model is partially correct but has significant errors or omissions. The evidence is often missing or unhelpful. The summary is incomplete and misses key details.
3-4 (Poor): The model is mostly incorrect, containing major factual errors. The evidence is poor. The summary is badly written or missing most key information.
1-2 (Very Poor): The model's answer is completely wrong, irrelevant, or fails to follow instructions entirely.
"""
user_prompt = """
DATA FOR EVALUATION
1. The Correct Answer Sheet (Human Expert):
{{ correct_answer.content }}
2. The Model's Answer:
{{ model_answer.content }}
"""
The Verdict: The jury confirms the strategy
The AI Jury’s scorecard
The AI Jury processed all 9 models and returned their scores. We then averaged the scores from all three jurors to get a final, blended rating for each model.
Here are the complete results (higher is better):
| Model | Juror 1 (GPT-5) | Juror 2 (Gemini 2.5 Pro) | Juror 3 (Claude Sonnet 4.5) | Average score |
| --- | --- | --- | --- | --- |
| GPT-5 | 7 | 8 | 8 | 7.67 |
| Gemini 2.5 Pro | 5 | 8 | 8 | 7.00 |
| GPT-5 Nano | 6 | 7 | 7 | 6.67 |
| Claude Sonnet 4.5 | 6 | 7 | 7 | 6.67 |
| GPT-5 Mini | 5 | 8 | 7 | 6.67 |
| Gemini 2.5 Flash | 5 | 7 | 7 | 6.33 |
| GLM 4.6 | 4 | 7 | 6 | 5.67 |
| Claude Opus 4.1 | 5 | 6 | 6 | 5.67 |
| GPT-4.1 Mini | 4 | 6 | 6 | 5.33 |

The scores themselves were interesting, but the real breakthrough was that the AI Jury’s findings almost perfectly mirrored the conclusions from our human-led review.
The models clustered into the same groups. The same top performers (GPT-5, Gemini 2.5 Pro, GPT-5 Nano) did well, and the exact same models that our human expert flagged as problematic (like GLM 4.6 and Claude Opus 4.1) received the lowest scores from the jury.
This confirmed our approach. The jury wasn’t just spitting out random numbers; it was successfully identifying the same strengths and weaknesses our expert did. This gave us confidence that we could rely on this automated process for future testing.
The Killer Metric: Several hours vs. < 1 minute
The most game-changing discovery wasn’t just that the AI Jury worked, but how fast it worked.
- Human expert: Several hours of painstaking, manual classification and comparison.
- AI Jury: Less than 1 minute for all three jurors to evaluate all 9 models.
This is the force multiplier we were looking for. We now had a method to validate our work that was over 180 times faster than our original process (more than three hours, i.e. over 180 minutes, versus less than one minute). This speed unlocked the ability to do things that were previously impossible. We could now run a full evaluation every time we tweaked a prompt, test a new model the day it was released, or run continuous quality monitoring. All at a fraction of the human cost and time.
What we learned
This two-part experiment yielded a crucial insight: our final deliverable wasn’t just an optimized model, but an entirely new, scalable process for testing and validation. Our key takeaways have less to do with any single model and more to do with a modern strategy for applying AI.
Key Insight 1: High accuracy doesn’t always mean high cost
Our initial hypothesis was that accuracy was king, and we stuck to that. We would have chosen a more expensive, flagship model if it was the only one that could deliver 100% accuracy.
The surprising discovery was that we didn’t have to. For a task like complex document classification and data extraction, even the cheaper, smaller models like GPT-5 Nano were already performing at a very high level. The difference wasn’t a huge gap in accuracy, but minor “quirks”. Through iterative prompt engineering, we were able to tweak GPT-5 Nano to eliminate those issues and achieve the perfect, expert-level results we required.
This is a critical finding: for specific, well-defined tasks, the most powerful flagship models are not always necessary. A more cost-effective alternative, when paired with careful prompt refinement, can perform at the exact same level, mitigating or even eliminating any initial performance difference.
Key Insight 2: The AI Jury as a force multiplier
The “Several hours vs. < 1 minute” metric says it all. Automating our evaluation process is a genuine force multiplier. This isn’t just a one-time cost saving; it fundamentally changes how we can work. We can now:
- Iterate rapidly: Tweak a prompt to fix a “quirk” and get immediate feedback on its impact.
- Test continuously: Run our AI Jury as part of an automated workflow to ensure the quality of our chosen model never degrades over time.
- Validate new models instantly: When a new model is released, we can add it to our test suite and know how it stacks up against our champion (GPT-5 Nano) in minutes, not days.
An AI-driven evaluation process is a viable and powerful tool, allowing for a level of speed and iteration that is impossible with human-only evaluation.
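As a sketch of what “test continuously” can mean in practice, the check below gates an automated pipeline on the jury’s average score, reusing the hypothetical `jury_average` helper from the earlier sketch; the threshold value and the reference-document structure are assumptions for illustration only.

```python
# Hypothetical quality gate: fail the pipeline if the champion model's average
# jury score drops below an agreed threshold on a fixed set of reference documents.
THRESHOLD = 7.0  # illustrative value, not a production setting


def quality_gate(reference_pairs: list[tuple[str, str]]) -> None:
    """reference_pairs holds (gold standard, champion model answer) pairs."""
    scores = [jury_average(gold, answer) for gold, answer in reference_pairs]
    overall = sum(scores) / len(scores)
    if overall < THRESHOLD:
        raise RuntimeError(f"Jury score {overall:.2f} fell below the {THRESHOLD} threshold")
    print(f"Quality gate passed: average jury score {overall:.2f}")
```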
Key Insight 3: Prompt engineering still rules
Choosing a model is only the first step. The biggest performance gains, and the final push to perfect accuracy, came from carefully refining our system prompts.
Our initial human-led review was critical for finding the “quirky” failures, like models not being date-aware or switching languages. We solved these problems not by switching to a more expensive model, but by adding clearer instructions to our prompt. The model itself is the engine, but the prompt is the steering wheel.
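As an illustration of how small those steering corrections can be, the snippet below sketches two hypothetical prompt additions for the quirks we saw (date unawareness and language switching). The exact wording is an assumption for demonstration, not our actual production prompt.

```python
from datetime import date

# Hypothetical stand-in for the full classification prompt used in production.
BASE_PROMPT = "You are an expert analyst classifying and extracting data from technical documents.\n"

# Extra instructions targeting the observed quirks.
quirk_fixes = (
    f"Today's date is {date.today().isoformat()}. "
    "Use it when reasoning about deadlines and validity periods.\n"
    "Always answer in English, even when the source document is written in another language.\n"
)

system_prompt_with_fixes = BASE_PROMPT + quirk_fixes
```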
Conclusion: Our new AI-powered workflow
Our journey started with a simple question: “Which AI model is best?” We tested nine models, and the answer was that for our non-negotiable high-accuracy needs, several models were up to the task – including the cost-efficient GPT-5 Nano. We didn’t pivot away from accuracy; we proved that, after careful prompt refinement, we could achieve our accuracy goals with a more economical model.
Our final workflow is now a hybrid, human-in-the-loop system:
- Human oversight: A human expert still sets the “gold standard” by creating the perfect answer sheet for any new or particularly complex task.
- AI-powered evaluation: The AI Jury uses this gold standard to provide scalable, near-instantaneous evaluation of all candidate models.
- Optimized AI execution: This allows us to continuously optimize and confidently use our chosen model, GPT-5 Nano, for the live classification and extraction tasks.
To ensure our quality remains high in production, we also integrated a simple feedback system into our tool. If a user spots a result that isn’t accurate, they can flag it. This allows us to react quickly, identify new edge cases, and run our evaluation suite again, creating a continuous loop of improvement.
The future of applied AI isn’t just about finding one single, powerful model to do a job. It’s about building systems where multiple AIs work together with human oversight: one AI to perform the task, and another to validate it, creating solutions that are efficient, reliable, and, most importantly, scalable.
Need a custom AI solution?
At SolDevelo, we specialize in building and integrating custom AI solutions just like this one. If your organization needs help with AI-powered data processing or automation, check our offer.