LLM-as-a-Courtroom
By Aryaman Agrawal
Shared knowledge—decisions, plans, documentation, processes—is a living, breathing resource that evolves as companies do. Falconer is building a shared memory layer for cross-functional teams and their agents to maintain their shared knowledge. A core part of our solution means watching for code changes and proposing documentation updates automatically.
“Documentation rot” is a term our users use a lot. The code changes, documents go stale, and knowledge that was once accurate becomes a liability. Many tools now let users find information with AI, but findability alone doesn’t equal accuracy: an easily findable doc that’s out of date doesn’t solve the problem. The hardest part of the problem isn’t finding the right quanta of knowledge; it’s being able to trust it when you do find it.
Automating documentation updates
Falconer automatically updates documents based on changes to the code. But when a PR merges, how do you decide which documents to update?
This isn’t a simple pattern-matching problem. Yes, you can filter out obvious non-candidates—test files, lockfiles, CI configs. But after that, you’re depending on judgment. The PR code is complex and contextual, as are the documents. Different teams have different priorities. A change that’s critical for customer support documentation is probably irrelevant for engineering specs. And the audience reading the document matters as much as the change itself.
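That first filtering pass, at least, needs no judgment at all. A minimal sketch of what path-based exclusion could look like (the patterns here are illustrative, not our actual rules):

```python
import fnmatch

# Illustrative patterns for files that almost never affect documentation.
NON_CANDIDATE_PATTERNS = [
    "*_test.py", "*.test.ts", "tests/*",   # test files
    "*.lock", "package-lock.json",         # lockfiles
    ".github/workflows/*",                 # CI configs
]

def needs_judgment(path: str) -> bool:
    """Return True only for files we can't rule out mechanically."""
    return not any(fnmatch.fnmatch(path, pattern) for pattern in NON_CANDIDATE_PATTERNS)

changed_files = ["src/billing/invoice.py", "tests/test_invoice.py", "poetry.lock"]
candidates = [path for path in changed_files if needs_judgment(path)]
# candidates == ["src/billing/invoice.py"]
```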
If a human were reading all code changes and updating docs manually—reading each PR, understanding the diffs, searching for affected documents, deliberating on whether each one needs updating, and making the actual updates—it would take days for a relatively small number of PRs. Falconer’s agent does this in seconds.
However, building an intelligent agent is just one piece of the puzzle. We also needed to build the infrastructure to process tens of thousands of PRs daily for our enterprise customers. And more importantly, we needed to build a judgment engine with a strong opinion of its own.
Society’s best decision framework
We kept obsessing over the word “judgment,” and how it’s subjective yet backed by reason. LLM-as-a-judge is a useful technique, but it was far too rudimentary for our use case. Then it clicked: what better way to deliver sound judgment than by constructing an entire courtroom? This led us to design and build our “LLM-as-a-Courtroom” evaluation system.
Our first attempt was straightforward: categorical scoring. We asked our model to rate factors—relevance, feature addition, harm level—on numerical scales, then compared scores against configurable thresholds. We could handle the comparison logic ourselves; we just needed the model to assess.
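A rough sketch of that first approach, with hypothetical factor names and thresholds:

```python
import json

# Hypothetical factors and cutoffs; in practice these were configurable.
THRESHOLDS = {"relevance": 7, "feature_addition": 5, "harm_level": 6}

SCORING_PROMPT = """Rate the following on a scale of 1-10 and answer in JSON:
- relevance: how relevant is this PR to the document?
- feature_addition: how much new behavior does the PR introduce?
- harm_level: how harmful is it to leave the document unchanged?

PR diff:
{diff}

Document:
{document}
"""

def needs_update(llm_response: str) -> bool:
    """We handle the comparison logic ourselves; the model only assesses."""
    scores = json.loads(llm_response)
    return any(scores[factor] >= cutoff for factor, cutoff in THRESHOLDS.items())
```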
It didn’t work as effectively as we needed it to. The ratings were inconsistent, often biased, and didn’t reflect the nuance of the actual decision. We were asking the model to do something it’s fundamentally not good at: rating things. Assigning a percentage or a hierarchical degree to a characteristic requires a kind of calibrated internal scale that LLMs simply don’t have.
But here’s what LLMs are good at: describing things in detail, providing explanations, and constructing arguments.
Unlike humans, who can often judge something quickly at a glance, models need a trail of thought in verbose terms to reach sound conclusions. They walk a path before reaching the end, facing gnarly crevices along the way. When you ask a model to output a single number, you’re asking it to skip that walk entirely. When you ask it to argue a position, you’re giving it the space to reason.
From requesting ratings to constructing arguments
So we shifted the objective. Instead of “rate this document’s relevance from 1-10,” we asked: “argue whether this document needs updating, and provide your evidence.”
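The difference is easiest to see in the prompt itself. A simplified sketch of the reframed instruction (the wording is illustrative, not our production prompt):

```python
ARGUE_PROMPT = """You are arguing whether the document below needs updating
now that the following pull request has merged.

Do NOT output a score. Instead:
1. State your position: the document NEEDS UPDATING or DOES NOT.
2. Quote the exact lines of the PR diff that support your position.
3. Quote the exact passages of the document that are now inaccurate,
   incomplete, or still correct.
4. Describe the specific harm of leaving the document unchanged,
   or explain why there is none.

PR diff:
{diff}

Document:
{document}
"""
```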
As I constructed these argumentative prompts, the vernacular started to feel familiar. I was asking one agent to build a case. Another to challenge it. A third to weigh both sides and render a judgment. I realized I was mimicking a courtroom.
This wasn’t accidental. The legal system is society’s most rigorous framework for checks and balances, argumentation, deliberation, judgment—and most of all, justice. It’s been refined over centuries for exactly the class of problem we faced: binary decisions under uncertainty, where evidence is imperfect and the costs of being wrong are asymmetric.
As a philosophy student, I was taught how to examine what makes for a sound argument. There are three components: a valid logical structure, claims grounded in evidence, and engagement with counterarguments.
Legal argumentation, I’d argue, bears the closest resemblance to rigorous philosophical argument in the real world. It demands both logical structure and factual grounding. It enforces explicit evidence. It requires addressing counterarguments.
And there’s a key benefit to this approach: LLMs have consumed vast amounts of legal content in pre-training—journals, rulings, court transcripts, case analyses. When you frame a task using legal terminology, you’re activating a rich constellation of learned behaviors about how to argue, deliberate, and reason under scrutiny.
Legal argumentation elicits deep reasoning while making efficient use of language—and in turn, tokens.
The architecture: Prosecution, Defense, Jury, Judge
The courtroom paradigm is a structural framework with distinct roles, each designed to elicit specific reasoning behaviors.
Here’s how the roles interact.
The Prosecutor
The prosecutor is our main GitHub agent. When a PR merges, this agent interprets the diffs, searches for potentially affected documents, and builds the case for updates. Like a real-world prosecutor, it carries the burden of proof: it must establish what changed in the code, which documents are now inaccurate or incomplete, and what the harm is of leaving them unchanged.
Critically, the prosecutor must provide exhibits—structured evidence with three enforced components: exact text quoted from the PR, exact text quoted from the affected document, and a statement of the specific harm of leaving the document unchanged.
The prosecutor must cite exact text from both the PR and the document, and articulate a specific harm. This requirement originates from the tenets of Retrieval Augmented Generation (RAG): an LLM’s context must be enriched with ground-truth information precisely curated toward a specific action.
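We enforce this as structured output rather than free prose. A minimal sketch of what an exhibit and a case might look like (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Exhibit:
    """One piece of structured evidence in the prosecutor's case."""
    pr_citation: str    # exact text quoted from the PR diff
    doc_citation: str   # exact text quoted from the affected document
    alleged_harm: str   # the specific harm of leaving the document as-is

@dataclass
class ProsecutionCase:
    """The prosecutor's full case against one document."""
    document_id: str
    argument: str
    exhibits: list[Exhibit]
```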
But we’re also using the terminology strategically. Words like “exhibit” and “evidence” carry weight. They implicitly trigger learned behaviors from pre-training—the scrutiny, rigor, and specificity that legal contexts demand.
The prosecutor must prove that a document needs updating.
The Defense
The defense is the adversarial counterweight. After reviewing the prosecution’s evidence and argumentation, the defense agent mounts a rebuttal. Its job is to challenge the prosecutor: Is the evidence actually conclusive? Does the document already cover this case? Is the alleged harm overstated?
The defense generates a structured rebuttal for each document: a challenge to the conclusiveness of each exhibit, an argument for whether the document already covers the change, and a reassessment of the alleged harm.
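As with the prosecution, the rebuttal is a structured object rather than free text. A sketch with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Rebuttal:
    """The defense's structured response to one prosecution case."""
    document_id: str
    evidence_challenges: list[str]  # why each exhibit may not be conclusive
    coverage_argument: str          # whether the document already covers the change
    harm_reassessment: str          # whether the alleged harm is overstated
```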
The presence of opposing perspectives is considered critical for reaching sound judgment. The defense provides the logical contrast that prevents groupthink and surfaces weaknesses in the prosecution’s case.
The roots of this go deep. In philosophy, we call it Socratic elenchus: a logical dialogue that chips away at abstractions slowly, reaching toward the core of a thesis. The defense forces the courtroom to stress-test its assumptions before rendering judgment.
The Jury
The jury consists of multiple independent agents that evaluate the case after hearing both sides. They’re designed to be holistic bystanders—not advocates for either position, but impartial assessors weighing evidence and arguments.
Technical choices here are deliberate: each juror runs independently, without seeing the others’ reasoning, and at a higher temperature to encourage genuine variance in perspective.
Each juror must explain their reasoning before casting a vote—guilty, not guilty, or abstain. This deliberate-then-vote pattern forces the model to think through the evidence before committing to a position.
The default configuration runs 5 jurors, requiring 3 guilty votes (a majority) to proceed to the judge. But this is tunable—some use cases might call for unanimous juries, others might accept a single guilty vote.
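Mechanically, the jury is just N independent model calls followed by a tally. A minimal sketch, assuming a hypothetical run_juror helper that wraps one high-temperature model call and returns the juror’s written reasoning along with its vote:

```python
from collections import Counter

JURY_SIZE = 5
GUILTY_VOTES_REQUIRED = 3  # simple majority by default; tunable per use case

def run_juror(case: str, rebuttal: str) -> tuple[str, str]:
    """Hypothetical wrapper around one independent, high-temperature model call.
    The prompt requires reasoning to be written before the vote, and the vote
    must be one of "guilty", "not_guilty", or "abstain"."""
    raise NotImplementedError

def jury_approves(case: str, rebuttal: str) -> bool:
    """Run the jurors independently, then check the configurable vote threshold."""
    votes = Counter()
    for _ in range(JURY_SIZE):
        _reasoning, vote = run_juror(case, rebuttal)
        votes[vote] += 1
    return votes["guilty"] >= GUILTY_VOTES_REQUIRED
```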
The Judge
The judge serves a different role than in traditional jury trials. While the jury deliberates and reaches a preliminary verdict through majority vote, the judge acts as the final arbiter. It synthesizes all perspectives and renders an independent judgment, then determines the appropriate “sentencing.”
The judge operates differently from the other agents: it doesn’t argue a side; it reviews the full record (the prosecution’s case, the defense’s rebuttal, and the jury’s votes) and rules last.
The judge produces a structured ruling containing: a full analysis synthesizing the prosecution’s case, the defense’s rebuttal, and the jury’s votes; a verdict (guilty, not guilty, or dismissed); a one-sentence rationale; and if guilty—specific edits to make to the document.
If the verdict is guilty, these edits are consolidated (max 2 per document by default) to avoid overwhelming document owners with granular changes.
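The ruling itself is one more structured object, and consolidation is a cap applied before anything reaches a document owner. A sketch with illustrative names:

```python
from dataclasses import dataclass, field

MAX_EDITS_PER_DOCUMENT = 2  # default cap; configurable

@dataclass
class ProposedEdit:
    doc_section: str
    replacement_text: str
    justification: str

@dataclass
class Ruling:
    analysis: str    # synthesis of the prosecution, defense, and jury votes
    verdict: str     # "guilty", "not_guilty", or "dismissed"
    rationale: str   # one-sentence rationale
    edits: list[ProposedEdit] = field(default_factory=list)

def consolidate(ruling: Ruling) -> list[ProposedEdit]:
    """Cap the edits per document so owners aren't flooded with granular changes."""
    return ruling.edits[:MAX_EDITS_PER_DOCUMENT]
```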
Terminology as a tool
One design principle runs throughout: the courtroom terminology is a structural tool for exploiting what LLMs learned in pre-training, not user-facing language. Internally, we talk about prosecutors, exhibits, and verdicts. But this language is isolated and contained—it never leaks into the actual outputs that document owners see. They receive concise notifications about proposed updates.
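Concretely, that means a translation step at the boundary: courtroom objects in, plain product language out. A sketch (the message wording is illustrative):

```python
from typing import Optional

def to_notification(verdict: str, doc_title: str, num_edits: int) -> Optional[str]:
    """Translate an internal verdict into the plain message a document owner sees.
    Courtroom terms like "guilty" or "prosecutor" never appear in the output."""
    if verdict != "guilty":
        return None
    return (
        f"A recent code change likely affects '{doc_title}'. "
        f"We've proposed {num_edits} update(s) for your review."
    )
```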
What we learned
We’ve built a useful evaluation system for our use case by leaning on what LLMs are already good at: legal comprehension. But we’re under no illusion that it’s a complete solution.
Jury bias is real. Despite running independently with high temperature, jury agents sometimes converge on the same conclusion. This isn’t catastrophic—if evidence is genuinely strong or weak, agreement makes sense—but we’re investigating when convergence reflects true consensus versus shared bias. The variance we designed for isn’t always materializing.
New paradigms need new testing infrastructure. You can’t unit test a courtroom simulation the way you test a function. The outputs are probabilistic. The “right answer” is often debatable. We’re developing evaluation strategies focused on observability: when a bad verdict happens, we need to trace back through the chain. Was it a weak prosecution? An ineffective defense? A judge that ignored the jury?
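One concrete step in that direction is persisting every stage of a case so a bad verdict can be replayed end to end. A sketch of the kind of trace record we mean (fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class CaseTrace:
    """Everything needed to replay one courtroom run after the fact."""
    pr_id: str
    document_id: str
    prosecution_argument: str
    defense_rebuttal: str
    juror_reasonings: list[str]
    juror_votes: list[str]
    judge_analysis: str
    final_verdict: str
```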
Real usage surfaces real edge cases. The PRs that slip through, the documents flagged unnecessarily, the configurations that behave unexpectedly—these emerge from actual use, not internal testing. Our design partners have already surfaced issues we didn’t anticipate. That feedback loop is essential.
The legal system turned out to be a near-perfect match for this class of problem. And because LLMs have extensive exposure to legal reasoning through pre-training, we could activate that framework through terminology and structure rather than complex fine-tuning.
We’re doing more research to apply LLM-as-a-Courtroom to a wider variety of complex problems. The complexity of the simulation should only grow from here—more nuanced roles, multi-turn debate, configurable appeals processes, domain-specific courts for different industries.
The verdict
After running LLM-as-a-Courtroom in production for 3 months, we saw 65% of PRs filtered before review, 95% of flagged PRs filtered before reaching Court, and 63% of Court cases dismissed without doc updates. When we do escalate to humans, we’re right 83% of the time.
Documentation rot thrives on neglect—the gap between code that changes and knowledge that doesn’t. The courtroom narrows that gap by filtering aggressively. We calibrated the system to skew strict, prioritizing precision over recall—false positives erode trust faster than false negatives. By keeping the bar high, we ensure every surfaced update deserves attention while building a curated dataset of high-quality updates to study and replicate at larger scale.
The architecture scales, the framework adapts.
We’re just getting started.
Aryaman Agrawal
Aryaman is a Founding AI Engineer at Falconer where he focuses on building agents and conducting applied research for improving accuracy. Before Falconer he built Jurist, Ironclad's AI legal assistant.
