
Show HN: Only 1 LLM Can Fly a Drone

Hacker News, about 1 month ago

A new benchmark called SnapBench, inspired by Pokémon Snap, tests the spatial reasoning capabilities of Large Language Models (LLMs) by having them pilot a drone through a 3D world to locate and identify creatures.


GitHub - kxzk/snapbench: 📸 gotta find 'em all; spatial reasoning benchmark for LLMs



SnapBench

Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.


Architecture

Overview

The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, accepting 8 movement commands plus identify and screenshot. The Rust controller captures frames from the simulation, constructs prompts enriched with position and state data, then parses VLM responses into executable command sequences. The objective: locate and successfully identify 3 creatures, where identify succeeds when the drone is within 5 units of a target.
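To make the controller's role concrete, here is a minimal sketch of the response-parsing step in Rust. The README only says the simulation accepts 8 movement commands plus identify and screenshot; the specific command names, enum, and one-command-per-line format below are assumptions, not the actual protocol.

```rust
// Hypothetical command set: 8 movement commands plus identify and screenshot.
// The names are illustrative; only the counts come from the README.
#[derive(Debug, PartialEq)]
enum Command {
    Forward,
    Back,
    Left,
    Right,
    Up,
    Down,
    TurnLeft,
    TurnRight,
    Identify,
    Screenshot,
}

// Map a single token from the VLM's reply to a command, if it's valid.
fn parse_command(token: &str) -> Option<Command> {
    match token.trim().to_ascii_lowercase().as_str() {
        "forward" => Some(Command::Forward),
        "back" => Some(Command::Back),
        "left" => Some(Command::Left),
        "right" => Some(Command::Right),
        "up" => Some(Command::Up),
        "down" => Some(Command::Down),
        "turn_left" => Some(Command::TurnLeft),
        "turn_right" => Some(Command::TurnRight),
        "identify" => Some(Command::Identify),
        "screenshot" => Some(Command::Screenshot),
        _ => None, // drop anything the VLM hallucinates
    }
}

// Turn a raw VLM response (assumed one command per line) into an
// executable sequence, silently skipping unrecognized tokens.
fn parse_response(response: &str) -> Vec<Command> {
    response.lines().filter_map(parse_command).collect()
}

fn main() {
    let reply = "down\ndown\nforward\nidentify\nteleport"; // last token is invalid
    println!("{:?}", parse_response(reply));
}
```

Filtering rather than failing on bad tokens is one reasonable design here: a single hallucinated command shouldn't abort an otherwise usable plan.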

Gotta catch 'em all?

I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.

Only one could do it.


Is this a rigorous benchmark? No. However, it's a reasonably fair comparison: same prompt, same seeds, same iteration limits. I'm sure with enough refinement you could coax better results out of each model. But that's kind of the point: out of the box, with zero hand-holding, only one model figured out how to actually fly.
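The fairness setup described above (same prompt, same seeds, same iteration cap) amounts to a simple harness loop. A sketch; the model names, seed values, and `run_episode` stub are placeholders, and only the 50-iteration cap and seed 72 appear in this write-up.

```rust
// Every model gets the same seeds and the same iteration budget.
const MAX_ITERATIONS: u32 = 50;

// Stub standing in for a full simulated flight; returns creatures found.
// A real episode would drive the simulation and query the VLM each step.
fn run_episode(model: &str, seed: u64, max_iterations: u32) -> u32 {
    let _ = (model, seed, max_iterations);
    0
}

fn main() {
    let models = ["model-a", "model-b"]; // placeholders for the 7 LLMs tested
    let seeds: [u64; 3] = [7, 42, 72]; // only seed 72 is cited in the write-up
    for model in models {
        for seed in seeds {
            let found = run_episode(model, seed, MAX_ITERATIONS);
            println!("{model},{seed},{found}"); // one CSV-style row per run
        }
    }
}
```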

Why can't Claude look down?

The core differentiator wasn't intelligence; it was altitude control. Creatures sit on the ground. To identify them, you need to descend.

This left me puzzled. Claude Opus is arguably the most capable model in the lineup. It knows it needs to identify creatures. It tries - aggressively. But it never adjusts its approach angle.
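One plausible mechanism, assuming the 5-unit identify radius is measured in full 3D (which is what would make altitude matter): a drone hovering directly above a creature can still be out of range. A minimal sketch; the 5-unit threshold is from the README, while `Vec3` and its field names are illustrative.

```rust
// Positions in the voxel world; y is altitude (an assumption).
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Straight-line 3D distance, so height counts against the radius.
fn distance(a: &Vec3, b: &Vec3) -> f32 {
    ((a.x - b.x).powi(2) + (a.y - b.y).powi(2) + (a.z - b.z).powi(2)).sqrt()
}

// The README: identify succeeds when the drone is within 5 units.
fn identify_succeeds(drone: &Vec3, creature: &Vec3) -> bool {
    distance(drone, creature) <= 5.0
}

fn main() {
    let creature = Vec3 { x: 10.0, y: 0.0, z: 10.0 };
    // Directly overhead at altitude 12: outside the radius, identify fails.
    let hovering = Vec3 { x: 10.0, y: 12.0, z: 10.0 };
    // Same spot after descending to altitude 3: inside the radius.
    let descended = Vec3 { x: 10.0, y: 3.0, z: 10.0 };
    println!(
        "{} {}",
        identify_succeeds(&hovering, &creature),
        identify_succeeds(&descended, &creature)
    );
}
```

Under this geometry, a model that never issues a descent command can line up perfectly in x/z and still fail every identify attempt.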

The two-creature anomaly

Run 13 (seed 72) was the only run where any model found 2 creatures. Why? They happened to spawn near each other. Gemini Flash found one, turned around, and spotted the second.


In most other runs, Flash found one creature quickly but ran out of iterations searching for the others. The world is big. 50 iterations isn't a lot of time.

Bigger ≠ better

This was the most surprising finding. I expected:

Instead, the cheapest model beat models costing 10x more.

What's going on here? A few theories:

I genuinely don't know. But if you're building an LLM-powered agent that needs to navigate physical or virtual space, the most expensive model might not be your best choice.

Color theory, maybe

Anecdotally, creatures with higher contrast (gray sheep, pink pigs) seemed easier to spot than brown-ish creatures that blended into the terrain. A future version might normalize creature visibility. Or maybe that's the point: real-world object detection isn't normalized either.

Prior work

Before this, I tried having LLMs pilot a real DJI Tello drone.

Results: it flew straight up, hit the ceiling, and did donuts until I caught it. (I was using Haiku 4.5, which in hindsight explains a lot.)

The Tello is now broken. I've ordered a BetaFPV and might get another Tello since they're so easy to program. Now that I know Gemini Flash can actually navigate, a real-world follow-up might be worth attempting.

Rough edges

This is half-serious research, half "let's see what happens."

Try it yourself

Prerequisites

You'll also need an OpenRouter API key.

Setup

Running the simulation manually

Running the benchmark suite

Results get saved to data/run_.csv.

Where this could go

Attribution

Donated to Poly Pizza to support the platform.
