GitHub - kxzk/snapbench: 📸 gotta find 'em all; spatial reasoning benchmark for LLMs
SnapBench
Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.
Architecture
Overview
The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, accepting 8 movement commands plus `identify` and `screenshot`. The Rust controller captures frames from the simulation, constructs prompts enriched with position and state data, then parses VLM responses into executable command sequences. The objective: locate and successfully identify 3 creatures, where `identify` succeeds when the drone is within 5 units of a target.
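The command set and the identify rule can be sketched roughly like this. The enum variants, function names, and parsing are my assumptions, not the repo's real API - only the 5-unit radius and the 8-movement-plus-2 command count come from the description above:

```rust
// Hypothetical sketch of the controller's command handling.
// Variant and command names are assumptions, not the real interface.
#[derive(Debug, PartialEq)]
enum Command {
    Forward, Back, Left, Right, Up, Down, TurnLeft, TurnRight, // 8 movement commands
    Identify,
    Screenshot,
}

// Parse one token from a VLM response into a command, if valid.
fn parse_command(token: &str) -> Option<Command> {
    match token.trim().to_lowercase().as_str() {
        "forward" => Some(Command::Forward),
        "back" => Some(Command::Back),
        "left" => Some(Command::Left),
        "right" => Some(Command::Right),
        "up" => Some(Command::Up),
        "down" => Some(Command::Down),
        "turn_left" => Some(Command::TurnLeft),
        "turn_right" => Some(Command::TurnRight),
        "identify" => Some(Command::Identify),
        "screenshot" => Some(Command::Screenshot),
        _ => None,
    }
}

// identify succeeds when the drone is within 5 units of a target.
fn identify_hits(drone: [f32; 3], creature: [f32; 3]) -> bool {
    let d2: f32 = drone
        .iter()
        .zip(creature.iter())
        .map(|(a, b)| (a - b).powi(2))
        .sum();
    d2.sqrt() <= 5.0
}
```

Unknown tokens parse to `None`, so a hallucinated command can simply be skipped rather than crash the run.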
Gotta catch 'em all?
I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
Only one could do it.
Is this a rigorous benchmark? No. However, it's a reasonably fair comparison - same prompt, same seeds, same iteration limits. I'm sure with enough refinement you could coax better results out of each model. But that's kind of the point: out of the box, with zero hand-holding, only one model figured out how to actually fly.
Why can't Claude look down?
The core differentiator wasn't intelligence - it was altitude control. Creatures sit on the ground. To identify them, you need to descend.
This left me puzzled. Claude Opus is arguably the most capable model in the lineup. It knows it needs to identify creatures. It tries - aggressively. But it never adjusts its approach angle.
The two-creature anomaly
Run 13 (seed 72) was the only run where any model found 2 creatures. Why? They happened to spawn near each other. Gemini Flash found one, turned around, and spotted the second.
In most other runs, Flash found one creature quickly but ran out of iterations searching for the others. The world is big. 50 iterations isn't a lot of time.
Bigger ≠ better
This was the most surprising finding. I expected the bigger, pricier models to win. Instead, the cheapest model beat models costing 10x more.
What's going on here? I have a few theories, but nothing I'd bet on - I genuinely don't know. Still, if you're building an LLM-powered agent that needs to navigate physical or virtual space, the most expensive model might not be your best choice.
Color theory, maybe
Anecdotally, creatures with higher contrast (gray sheep, pink pigs) seemed easier to spot than brown-ish creatures that blended into the terrain. A future version might normalize creature visibility. Or maybe that's the point - real-world object detection isn't normalized either.
Prior work
Before this, I tried having LLMs pilot a real DJI Tello drone.
Results: it flew straight up, hit the ceiling, and did donuts until I caught it. (I was using Haiku 4.5, which in hindsight explains a lot.)
The Tello is now broken. I've ordered a BetaFPV and might pick up another Tello since they're so easy to program. Now that I know Gemini Flash can actually navigate, the real-world experiment might be worth another shot.
Rough edges
This is half-serious research, half "let's see what happens."
Try it yourself
Prerequisites
You'll also need an OpenRouter API key.
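The controller presumably reads the key from the environment; here's a minimal sketch, assuming the variable is named `OPENROUTER_API_KEY` (the name is my assumption - check the setup docs):

```rust
use std::env;

// Hypothetical helper: load the OpenRouter API key from the environment.
// The variable name OPENROUTER_API_KEY is an assumption, not confirmed by the repo.
fn openrouter_key() -> Result<String, String> {
    env::var("OPENROUTER_API_KEY")
        .map_err(|_| "OPENROUTER_API_KEY is not set".to_string())
}
```

Failing fast with a clear error beats discovering a missing key 30 iterations into a run.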
Setup
Running the simulation manually
Running the benchmark suite
Results get saved to `data/run_.csv`.
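Appending per-run results to a CSV can be done with the standard library alone. This is a sketch of what that might look like - the column set (model, seed, creatures found, iterations) is my assumption, not the repo's actual schema:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical: append one benchmark result row to a CSV file.
// The columns (model, seed, found, iters) are assumed, not the real schema.
fn append_result(
    path: &str,
    model: &str,
    seed: u64,
    found: u32,
    iters: u32,
) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(f, "{model},{seed},{found},{iters}")
}
```

Append mode means interrupted runs still leave earlier rows intact.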
Where this could go
Attribution
Donated to Poly Pizza to support the platform.