
Show HN: LLM Sanity Checks – A Practical Guide to Avoiding Over-Engineering Your AI Stack

Hacker News · about 1 month ago · AI-generated summary

This Hacker News 'Show HN' post introduces a GitHub repository offering a practical guide and decision tree to help developers avoid over-engineering their AI stack by selecting appropriate LLM sizes for their tasks.


GitHub - NehmeAILabs/llm-sanity-checks



LLM Sanity Checks

A practical guide to not over-engineering your AI stack.

Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?

Most tasks don't. This repo helps you figure out which ones.

The Decision Tree

Quick Checks

Check 1: Can you describe the task in one sentence?

If yes → probably a small model task.

If no → you might have an architecture problem, not a model problem.

Check 2: What's your accuracy requirement?

Scaling to frontier models rarely buys you more than 5% accuracy on simple tasks. That 5% costs 50x more.

Check 3: How many output tokens do you need?

Output tokens are the bottleneck. They determine latency and cost.
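Check 3 is easy to quantify. A back-of-envelope sketch; the price and throughput numbers below are illustrative placeholders, not real vendor rates:

```python
# Back-of-envelope model of output-token cost and latency.
# All numbers here are made-up placeholders, not real pricing.
def estimate(output_tokens: int, price_per_1k: float, tokens_per_sec: float):
    """Return (cost_in_dollars, seconds) for generating `output_tokens`."""
    cost = output_tokens / 1000 * price_per_1k
    latency = output_tokens / tokens_per_sec
    return cost, latency

# A 50-token classification answer vs. a 2,000-token report from the same model:
print(estimate(50, price_per_1k=0.01, tokens_per_sec=80))    # fractions of a cent, sub-second
print(estimate(2000, price_per_1k=0.01, tokens_per_sec=80))  # 40x the tokens, 40x both numbers
```

Input tokens matter too, but they are processed in parallel during prefill; output tokens are generated one at a time, which is why they dominate latency.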

The JSON Tax

Everyone defaults to JSON for structured output. But JSON has overhead: every key name, quote, and brace is an output token the model has to generate.


When to use JSON: nested structures, optional fields, API contracts.
When to use delimiters: simple extraction, high-volume pipelines.
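As a rough illustration of the tax (the pipe-delimited format below is one common alternative, not a scheme this repo prescribes):

```python
import json

# One extracted record. With JSON, every key, quote, and brace
# is an output token the model has to generate.
record = {"name": "Ada Lovelace", "email": "ada@example.com", "company": "Analytical Engines"}
as_json = json.dumps(record)

# Pipe-delimited: the same fields in a fixed, documented order.
as_delimited = "|".join(record.values())

print(as_json)       # the full quoted-and-braced object
print(as_delimited)  # Ada Lovelace|ada@example.com|Analytical Engines
print(len(as_json), len(as_delimited))  # the delimited form is markedly shorter
```

Per the guidance above: keep JSON when you need nesting or optional fields; delimiters only work when the schema is flat and fixed.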

Read more: The JSON Tax →

Model Selection Cheat Sheet

Tiny (1B-4B params)

Best for: classification, yes/no, simple extraction

Small (8B-17B params)

Best for: most production tasks, RAG, extraction, summarization

Medium (27B-70B params)

Best for: complex reasoning, long context, multi-step tasks

Frontier (100B+ dense params)

Best for: novel tasks, complex reasoning, when nothing else works

Before you use these, ask: have you tried a smaller model?
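Read as a routing table, the cheat sheet is a few lines of code. A sketch; the task categories and the `pick_model` helper are illustrative, not part of this repo:

```python
# Route each task type to the smallest size class that usually suffices.
# Categories mirror the cheat sheet above; the mapping is illustrative.
SIZE_FOR_TASK = {
    "classification": "tiny",
    "yes_no": "tiny",
    "simple_extraction": "tiny",
    "rag": "small",
    "summarization": "small",
    "extraction": "small",
    "complex_reasoning": "medium",
    "long_context": "medium",
    "multi_step": "medium",
}

def pick_model(task_type: str) -> str:
    # Fall back to a frontier model only for tasks nothing smaller handles.
    return SIZE_FOR_TASK.get(task_type, "frontier")

print(pick_model("classification"))  # tiny
print(pick_model("novel_task"))      # frontier
```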

Anti-Patterns

❌ "We use GPT-5 for everything"

That's not a flex. That's a $50K/month cloud bill waiting to happen.

❌ "We need the best model for our enterprise customers"

Your enterprise customers care about latency, reliability, and cost. Not model prestige.

❌ "Small models aren't accurate enough"

Did you test? With the right prompt? On your actual data?

❌ "We'll optimize later"

You'll optimize never. The technical debt compounds. Start right-sized.

❌ "JSON output is industry standard"

For simple extraction, it's industry waste. See: The JSON Tax.

❌ "We need RAG for our documents"

For small document sets? No, you don't.

Context windows are now 2M-10M tokens. That's thousands of pages. If your knowledge base is <100 pages, just stuff it in context. Preprocess, convert to markdown, include directly.

RAG adds complexity: chunking strategies, embedding models, vector databases, retrieval tuning, reranking. All that infrastructure for documents that fit in a single prompt.

When RAG makes sense: corpora far too large for any context window, or documents that change too often to re-stuff into every prompt.

When to skip RAG: small, stable document sets that fit in a single prompt.
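The "just stuff it in context" path is genuinely short. A sketch, assuming the documents have already been converted to markdown files on disk (the directory layout and prompt template are illustrative):

```python
from pathlib import Path

def build_prompt(question: str, docs_dir: str = "docs/") -> str:
    """Concatenate a small, stable document set directly into the prompt."""
    sections = []
    for path in sorted(Path(docs_dir).glob("*.md")):
        sections.append(f"## {path.name}\n{path.read_text()}")
    corpus = "\n\n".join(sections)
    return f"Answer using only these documents:\n\n{corpus}\n\nQuestion: {question}"
```

No chunking strategy, no embedding model, no vector database to operate.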

Patterns

✅ Cascade Architecture

Start with smallest model. Verify output. Escalate only on failure.

Verifier can be: format validation, a classifier, or FlashCheck for grounding checks.

See examples/cascade.py for a working extraction example.
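For flavor, a minimal sketch of the cascade shape (this is not the repo's examples/cascade.py; `call_model` is a stand-in for your actual client, and the verifier here is plain format validation):

```python
import re

def call_model(size: str, prompt: str) -> str:
    """Stand-in for a real LLM client; wire up your own SDK here."""
    raise NotImplementedError

def looks_like_email(output: str) -> bool:
    # The verifier: cheap format validation, no second LLM required.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", output.strip()) is not None

def extract_email(prompt: str, sizes=("tiny", "small", "medium")):
    # Start with the smallest model; escalate only when verification fails.
    for size in sizes:
        candidate = call_model(size, prompt)
        if looks_like_email(candidate):
            return candidate
    return None  # every tier failed; flag for review instead of guessing
```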

✅ Task-Specific Models

One model per task type, sized appropriately.

✅ Measure First, Scale Never

Before adding a bigger model: measure the current model's accuracy on your actual data, quantify the gap against your requirement, and price the upgrade against that gap.
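Measuring can be one helper function. A sketch; the commented-out comparison assumes a hypothetical `call_model` client:

```python
def accuracy(predict, labeled_examples) -> float:
    """Fraction of (text, expected) pairs the predictor gets exactly right."""
    hits = sum(1 for text, expected in labeled_examples if predict(text) == expected)
    return hits / len(labeled_examples)

# Compare tiers on the same held-out set before paying for the bigger one:
# small_acc = accuracy(lambda t: call_model("small", t), examples)
# big_acc   = accuracy(lambda t: call_model("frontier", t), examples)
# Upgrade only if (big_acc - small_acc) justifies the cost multiple.
```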

✅ Simple Tools Over Browser Automation

For research tasks, don't reach for computer use or Puppeteer.

Three tools. No browser. No screenshots. No vision model.

Browser automation is only for: login walls, dynamic forms, actions (booking, purchasing).
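A fetch-page-text tool needs only the standard library. A sketch of one such tool (the repo's actual three tools are not specified here; this is an illustrative stand-in):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def fetch_text(url: str) -> str:
    # One HTTP GET. No browser, no screenshots, no vision model.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```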

See patterns/agents.md for the full agent decision tree.

More Patterns

Tools

RightSize

Test your prompts against multiple model sizes. See what's actually needed.

→ Try RightSize

FlashCheck

Verify LLM outputs with tiny specialized models. Sub-10ms verification.

→ Learn about FlashCheck

Contributing

Found a pattern that works? Open a PR.

Keep it practical. Keep it measured. No vibes-based claims.

License

MIT. Use it. Share it. Don't over-engineer it.

Built by Nehme AI Labs — AI architecture consultancy.
