GitHub - NehmeAILabs/llm-sanity-checks
LLM Sanity Checks
A practical guide to not over-engineering your AI stack.
Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?
Most tasks don't. This repo helps you figure out which ones.
The Decision Tree
Quick Checks
Check 1: Can you describe the task in one sentence?
If yes → probably a small model task.
If no → you might have an architecture problem, not a model problem.
Check 2: What's your accuracy requirement?
Scaling to frontier models rarely buys you more than 5% accuracy on simple tasks. That 5% costs 50x more.
Check 3: How many output tokens do you need?
Output tokens are the bottleneck. They determine latency and cost.
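A back-of-envelope sketch of why output tokens dominate. The decode rate and price below are illustrative assumptions, not benchmarks; plug in your own numbers:

```python
# Rough latency/cost estimate from output token count. Decode rate and
# price are made-up placeholders -- substitute your provider's numbers.
def estimate(output_tokens, tokens_per_sec=50, price_per_1k_out=0.002):
    """Each output token is a sequential decode step, so latency scales linearly."""
    latency_s = output_tokens / tokens_per_sec
    cost = output_tokens / 1000 * price_per_1k_out
    return latency_s, cost

# A one-word classification vs. a 500-token verbose answer:
lat_short, _ = estimate(5)     # ~0.1 s
lat_long, _ = estimate(500)    # ~10 s -- same model, 100x the wait
```

Input tokens are processed in parallel; output tokens are not, which is why trimming the response format pays off more than trimming the prompt.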
The JSON Tax
Everyone defaults to JSON for structured output. But JSON has overhead: every response repeats keys, quotes, braces, and escaping that a delimiter format avoids. For simple extraction tasks, a delimited line carries the same data in far fewer output tokens.
When to use JSON: nested structures, optional fields, API contracts.
When to use delimiters: simple extraction, high-volume pipelines.
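The overhead is easy to see side by side. A sketch using character counts as a rough proxy for tokens, with a made-up extraction record:

```python
import json

# Same extraction result, two encodings. Real savings depend on the
# tokenizer, but the structural overhead is visible even in characters.
record = {"name": "Ada Lovelace", "email": "ada@example.com", "role": "engineer"}

as_json = json.dumps(record)           # keys + quotes + braces in every response
as_delim = "|".join(record.values())   # schema lives once, in the prompt

print(len(as_json), len(as_delim))     # delimiter version is markedly shorter
```

The delimiter format works because the schema is fixed and stated in the prompt; the moment you need nesting or optional fields, JSON earns its cost back.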
Read more: The JSON Tax →
Model Selection Cheat Sheet
| Tier | Size | Best for |
| --- | --- | --- |
| Tiny | 1B-4B params | classification, yes/no, simple extraction |
| Small | 8B-17B params | most production tasks, RAG, extraction, summarization |
| Medium | 27B-70B params | complex reasoning, long context, multi-step tasks |
| Frontier | 100B+ dense params | novel tasks, complex reasoning, when nothing else works |
Before you use these, ask: have you tried a smaller model?
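The cheat sheet above can be distilled into a default routing table. The task labels and mapping here are illustrative, not an API:

```python
# Hypothetical task -> tier mapping distilled from the cheat sheet.
# Labels are assumptions for the sketch; adapt to your own task taxonomy.
TIER_FOR_TASK = {
    "classification": "tiny",
    "yes_no": "tiny",
    "extraction": "small",
    "summarization": "small",
    "multi_step_reasoning": "medium",
    "novel_task": "frontier",
}

def pick_tier(task_type: str) -> str:
    # Default to the smallest tier; escalate only when the task demands it.
    return TIER_FOR_TASK.get(task_type, "tiny")
```

Note the default: an unknown task starts at the bottom of the ladder, not the top.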
Anti-Patterns
❌ "We use GPT-5 for everything"
That's not a flex. That's a $50K/month cloud bill waiting to happen.
❌ "We need the best model for our enterprise customers"
Your enterprise customers care about latency, reliability, and cost. Not model prestige.
❌ "Small models aren't accurate enough"
Did you test? With the right prompt? On your actual data?
❌ "We'll optimize later"
You'll optimize never. The technical debt compounds. Start right-sized.
❌ "JSON output is industry standard"
For simple extraction, it's industry waste. See: The JSON Tax.
❌ "We need RAG for our documents"
For small document sets? No, you don't.
Context windows are now 2M-10M tokens. That's thousands of pages. If your knowledge base is <100 pages, just stuff it in context. Preprocess, convert to markdown, include directly.
RAG adds complexity: chunking strategies, embedding models, vector databases, retrieval tuning, reranking. All that infrastructure for documents that fit in a single prompt.
When RAG makes sense: the corpus is far too large to fit in any context window.
When to skip RAG: the documents fit in a single prompt.
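"Just stuff it in context" can be a dozen lines. A minimal sketch, assuming your documents are already preprocessed into markdown; the directory layout and character budget are placeholders:

```python
from pathlib import Path

def build_context(doc_dir: str, max_chars: int = 400_000) -> str:
    """Concatenate preprocessed markdown docs into one prompt context."""
    parts = []
    for md in sorted(Path(doc_dir).glob("*.md")):
        parts.append(f"## {md.name}\n{md.read_text()}")
    context = "\n\n".join(parts)
    if len(context) > max_chars:
        # Corpus outgrew the context window -- this is where RAG earns its keep.
        raise ValueError("Corpus too large for a single prompt")
    return context
```

The size guard is the whole decision tree in one line: below the threshold, concatenate; above it, reach for retrieval.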
Patterns
✅ Cascade Architecture
Start with smallest model. Verify output. Escalate only on failure.
Verifier can be: format validation, a classifier, or FlashCheck for grounding checks.
See examples/cascade.py for a working extraction example.
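A minimal sketch of the cascade, not the repo's actual examples/cascade.py: the model names are placeholders, `call_model` is a stub for your LLM client, and the verifier is the cheapest option (format validation):

```python
# Cascade: try the smallest model first, verify, escalate only on failure.
MODELS = ["tiny-1b", "small-8b", "medium-70b"]  # placeholder names

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # wire up your LLM client here

def looks_valid(output: str) -> bool:
    # Cheapest verifier: format validation. Expecting "name|email|role".
    return output.count("|") == 2

def extract(prompt: str, call=call_model) -> str:
    for model in MODELS:
        out = call(model, prompt)
        if looks_valid(out):
            return out  # stop at the first tier that passes
    raise RuntimeError("All tiers failed verification")
```

Most requests never leave the first tier, so the average cost tracks the smallest model while the worst case still gets the big one.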
✅ Task-Specific Models
One model per task type, sized appropriately.
✅ Measure First, Scale Never
Before adding a bigger model: test the smaller one on your actual data, with a tuned prompt, and measure the accuracy, latency, and cost deltas. Escalate only if the numbers justify it.
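The measurement can be as small as an accuracy loop over labeled examples. `call_model` and the dataset shape are placeholders in this sketch:

```python
# Measure a model on your actual data before deciding it isn't good enough.
# `call_model` is any callable prompt -> output; examples are (prompt, expected).
def accuracy(call_model, labeled_examples):
    hits = sum(call_model(prompt) == expected
               for prompt, expected in labeled_examples)
    return hits / len(labeled_examples)
```

Run it once per model size and compare: if the small model is within your accuracy requirement, the bigger one is pure cost.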
✅ Simple Tools Over Browser Automation
For research tasks, don't reach for computer use or Puppeteer.
Three tools. No browser. No screenshots. No vision model.
Browser automation is only for: login walls, dynamic forms, actions (booking, purchasing).
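The README doesn't name its three tools here; a plausible minimal set is web search, URL fetch, and text extraction, and that assumption is what this sketch shows. The first two are stubs for whatever client you use; the extractor is a naive tag-stripper, not a production parser:

```python
import re

def search(query: str) -> list[str]:
    """Return candidate URLs. Stub -- plug in your search API client."""
    raise NotImplementedError

def fetch(url: str) -> str:
    """Return raw HTML/text for a URL. Stub -- plug in your HTTP client."""
    raise NotImplementedError

def extract_text(html: str) -> str:
    """Naive tag-stripping so the page fits in a prompt as plain text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()
```

Three plain functions the model can call as tools; no headless browser, no screenshots, no vision model in the loop.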
See patterns/agents.md for the full agent decision tree.
More Patterns
Tools
RightSize
Test your prompts against multiple model sizes. See what's actually needed.
→ Try RightSize
FlashCheck
Verify LLM outputs with tiny specialized models. Sub-10ms verification.
→ Learn about FlashCheck
Contributing
Found a pattern that works? Open a PR.
Keep it practical. Keep it measured. No vibes-based claims.
License
MIT. Use it. Share it. Don't over-engineer it.
Built by Nehme AI Labs — AI architecture consultancy.