Rahul Shetty
AI Testing is Painful: How we ditched benchmarks for real feedback
Mar 6, 2025

If you're building any AI tool, traditional testing methods are failing you. The old-school way of testing software (you know, benchmarks, accuracy metrics, leaderboards) doesn't cut it when you're dealing with generative AI. You can't just test if it gets the "right" answer. You have to test if it delivers something useful.
So how do you actually test AI tools that generate technical documentation, API guides, and manuals? You focus on what matters: user feedback, practical utility, and real-world performance. That’s the key lesson we’ve learned from testing Knowledge Keeper.
Benchmarking is Broken: Why "Vibe Checking" is the Future
For decades, AI models were tested with structured benchmarks—think chess, ImageNet, speech recognition datasets. And yeah, they were useful... at the time. But now? The stakes are so high that companies are gaming the system. The leaderboard obsession leads to overfitting, models memorizing test questions, and results that look good on paper but don't translate to the real world.
Enter vibe checking. This isn’t just some fluffy concept—it’s how real users evaluate AI. When someone interacts with an AI-generated document, they don’t care about the model’s leaderboard ranking. They care if the content feels right. Is the tone correct? Does it make sense? Does it actually help them do their job? That’s the ultimate test.
The Top 5 AI Benchmarks (And Why They Don’t Work)
MMLU (Massive Multitask Language Understanding) – Evaluates AI on multiple-choice questions across subjects like history, law, and math. The problem? Models can memorize question patterns and game the system.
GSM8K (Grade School Math 8K) – Measures AI’s ability to solve basic math problems. Great for structured logic, but doesn’t reflect real-world problem-solving.
HellaSwag – Tests AI’s ability to predict the most reasonable sentence completion. A step up from basic text matching, but still doesn’t assess user intent.
BIG-bench – A massive general knowledge test for AI. Covers tons of topics but isn’t reflective of how users interact with AI in specialized domains.
Winograd Schema Challenge – Evaluates AI’s reasoning ability with ambiguous sentence structures. It’s a solid measure of linguistic nuance but doesn’t test practical AI application.
The common flaw in all these? They measure isolated problem-solving ability, not whether AI is actually useful in a real-world setting. Beating these benchmarks doesn’t mean your AI is good—it means it's good at taking tests.
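To make the "good at taking tests" point concrete, here is roughly what most of these benchmarks reduce to: a loop that checks whether the model picks the expected option. This is a minimal sketch, not any benchmark's real harness; ask_model and the sample question are placeholders.

```python
# Minimal sketch of an MMLU-style multiple-choice eval loop.
# `ask_model` is a stand-in for a real model call, and the sample
# question is illustrative, not an actual benchmark item.

def ask_model(prompt: str) -> str:
    return "A"  # placeholder: always answers "A"; swap in a real LLM call

SAMPLES = [
    {
        "question": "Which HTTP method is idempotent by definition?",
        "choices": {"A": "POST", "B": "PUT", "C": "PATCH", "D": "CONNECT"},
        "answer": "B",
    },
    # ...more items...
]

def run_eval(samples) -> float:
    correct = 0
    for s in samples:
        options = "\n".join(f"{k}. {v}" for k, v in s["choices"].items())
        prompt = f"{s['question']}\n{options}\nAnswer with a single letter."
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == s["answer"]
    return correct / len(samples)

print(f"accuracy: {run_eval(SAMPLES):.0%}")
# A high score here means the model is good at this test format.
# It says nothing about whether its documentation output is useful.
```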
The Next Generation of AI Evaluation
Smart teams are realizing that testing AI isn't about chasing benchmarks—it’s about capturing real user interactions. Here’s where the cutting-edge stuff is happening:
Vibe-Eval: A new benchmark that actually incorporates qualitative assessments alongside tough quantitative tests. It's designed to test how well AI models understand multimodal (text + image) inputs.
Humanity’s Last Exam: A next-level benchmark that only includes questions no AI model can currently answer. The idea? Build tests that future-proof AI evaluation instead of rewarding short-term hacks.
RE-Bench: A real-world test where AI systems tackle actual engineering tasks, simulating real constraints. Turns out, AI models perform best with 30–120 minutes of work before they start spinning their wheels.
The Best Tools for AI Testing and Feedback Collection
Here are some of the top tools AI teams are using to evaluate models in the real world:
1. LangSmith (by LangChain)
What it does: Tracks AI response quality, failure rates, and user feedback in real-world applications.
Why it’s useful: Helps teams spot inconsistencies and optimize prompt engineering.
User reviews: Developers love its deep debugging capabilities but say it can be overwhelming for beginners.
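For a sense of what that looks like in practice, here is a minimal sketch using the LangSmith Python SDK: trace a generation function, then attach user feedback to the run. The feedback key and function names are our own inventions, and how you obtain the run ID depends on how tracing is wired into your app.

```python
# Sketch: trace a doc-generation call and attach user feedback in LangSmith.
# Assumes the LangSmith API key and tracing env vars are already configured;
# "doc_accepted" and generate_api_guide are illustrative names.
from langsmith import Client, traceable

client = Client()

@traceable(name="generate_api_guide")
def generate_api_guide(source: str) -> str:
    return "...generated documentation..."  # placeholder for the real LLM call

def record_user_reaction(run_id: str, accepted: bool) -> None:
    # Attach a score to the traced run so low-rated generations
    # can be filtered and inspected later.
    client.create_feedback(
        run_id=run_id,
        key="doc_accepted",
        score=1.0 if accepted else 0.0,
    )
```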
2. Weights & Biases
What it does: Tracks and visualizes model performance across different training runs.
Why it’s useful: Essential for model comparison, hyperparameter tuning, and debugging.
User reviews: AI researchers swear by it, but some find the UI a bit clunky.
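As a rough illustration, here is how logging evaluation metrics for a prompt variant might look; the project, config, and metric names are placeholders, and evaluate_variant stands in for your own evaluation loop.

```python
# Sketch: log per-document quality metrics for one prompt variant to W&B.
# Project, config, and metric names are illustrative.
import wandb

def evaluate_variant():
    # Placeholder: yield one metrics dict per evaluated document.
    yield {"edit_distance_to_final": 42, "user_accepted": 1}
    yield {"edit_distance_to_final": 17, "user_accepted": 0}

run = wandb.init(project="doc-generation-eval", config={"variant": "prompt-v2"})
for step, metrics in enumerate(evaluate_variant()):
    wandb.log(metrics, step=step)
run.finish()
```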
3. Langfuse
What it does: Provides observability for AI apps, tracking latency, costs, and failure modes.
Why it’s useful: Gives visibility into how AI systems are actually performing in production.
User reviews: Startups love it for cost tracking, but enterprise users want more customization.
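A hedged sketch of what that observability looks like, assuming the v2-style Langfuse Python SDK and LANGFUSE_* environment variables; the trace, model, and score names are placeholders.

```python
# Sketch (assumes the Langfuse v2 Python SDK): trace one generation and
# tie the user outcome back to it. Names and values are illustrative.
from langfuse import Langfuse

langfuse = Langfuse()

trace = langfuse.trace(name="manual-section-update")
trace.generation(
    name="rewrite-section",
    model="gpt-4o",  # illustrative model name
    input="<old section + change request>",
    output="<proposed new section>",
    usage={"input": 1200, "output": 350},  # token counts, if available
)
trace.score(name="user_accepted", value=1)
langfuse.flush()  # make sure events are sent before the process exits
```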
4. OpenAI Evals
What it does: Allows developers to create custom tests for AI models.
Why it’s useful: Helps test AI across different datasets and applications.
User reviews: Powerful but requires technical expertise to set up properly.
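As an example of the setup work involved, here is a sketch that writes samples for a simple exact-match eval in the JSONL format the evals repo expects; the question, file name, and registration step are assumptions on our part.

```python
# Sketch: build a samples file for a custom exact-match eval.
# The question, ideal answer, and file name are illustrative.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the exact endpoint path."},
            {"role": "user", "content": "Which endpoint creates a new API key?"},
        ],
        "ideal": "/v1/api-keys",
    },
]

with open("doc_qa_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# After registering the eval in a local registry, you would run something
# like `oaieval <model> <eval-name>` from the evals CLI.
```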
5. Feedback Labs
What it does: Captures user interactions and implicit feedback signals.
Why it’s useful: Helps improve models without requiring explicit user input.
User reviews: Businesses love its insights, but it’s expensive for small teams.
Implicit vs. Explicit Feedback: The Secret to Training AI Without Annoying Users
Users hate giving explicit feedback. Ask them to rate every AI-generated response, and they'll get survey fatigue fast. Worse, research shows that when you keep asking for feedback, users start trusting the AI less.
The smarter way? Implicit feedback—tracking how users naturally interact with AI outputs. Here's how we do it at Knowledge Keeper:
If a user accepts an AI-generated doc without changes, the model nailed it.
If they tweak it slightly, the AI was close, but there's room to improve.
If they reject it but still change the section, the AI identified the right spot but suggested the wrong fix.
This approach lets AI learn without annoying users or interrupting their workflow. More interaction = better models, without the burden of constant rating requests.
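Here is a minimal sketch of how those signals can be turned into labels. The thresholds, label names, and use of difflib as a cheap similarity measure are our own illustrative choices, not Knowledge Keeper's actual scoring logic.

```python
# Sketch: map edit outcomes to implicit feedback labels.
# Thresholds and label names are illustrative.
from difflib import SequenceMatcher

def classify_outcome(generated: str, final: str, accepted: bool) -> str:
    similarity = SequenceMatcher(None, generated, final).ratio()
    if accepted and similarity > 0.98:
        return "accepted_as_is"        # the model nailed it
    if accepted:
        return "lightly_edited"        # close, but room to improve
    if final != generated:
        return "right_spot_wrong_fix"  # right section, wrong suggestion
    return "rejected_untouched"        # miss: no value added

# Example: accepted after a small tweak -> "lightly_edited"
print(classify_outcome(
    "Use PUT for updates.",
    "Use PUT for idempotent updates.",
    accepted=True,
))
```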
The Future of AI Testing: "Interaction-Embedded Evaluation"
The next wave of AI testing is all about baking evaluation into real workflows—not running models through artificial test environments. We call this Interaction-Embedded Evaluation (IEE). Here’s what makes it powerful:
No artificial test conditions—evaluation happens in real usage scenarios.
Scales automatically—the more users interact, the more feedback AI gets.
Direct impact on product improvements—models evolve based on actual user behavior, not theoretical benchmarks.
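To make IEE concrete, here is a hypothetical sketch of the pattern: the evaluation signal is emitted from the same save path users already go through, not from a separate test harness. The event fields and file-based log are placeholders for whatever store you actually use.

```python
# Hypothetical sketch of Interaction-Embedded Evaluation: evaluation
# events are recorded inside the normal product workflow. All names,
# fields, and the JSONL log are illustrative.
import json
import time

EVAL_LOG = "iee_events.jsonl"

def record_iee_event(doc_id: str, generated: str, final: str, accepted: bool) -> None:
    event = {
        "ts": time.time(),
        "doc_id": doc_id,
        "accepted": accepted,
        "chars_changed": abs(len(final) - len(generated)),
    }
    with open(EVAL_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

def on_user_save(doc_id: str, generated: str, final: str, accepted: bool) -> None:
    # ...the product's normal save logic runs here...
    record_iee_event(doc_id, generated, final, accepted)  # one embedded line of evaluation
```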
Bottom Line: The AI Testing Playbook for Documentation Automation
The AI documentation automation space is moving fast. The winners will be the teams who move beyond traditional testing and embrace real-world feedback as the ultimate benchmark.