“stochastic parrots” to AI models winning math contests? While there is certainly doubt that LLMs are truly PhD-level thinkers as advertised, the progress in complex reasoning scenarios is undeniable.
A common trick has been to mix and match LLM generative capabilities with formal verifiers, i.e. purpose-built software that provides guaranteed solutions to certain problems, when stated precisely. The key insight is that LLMs may be good at translating messy, ambiguous requirements into precise formal specifications. Formal verifiers excel at finding solutions that satisfy those specifications. By combining them, we get a system that can understand what you want and guarantee it delivers exactly that: recently, AWS has been using this very trick to build “guardrails” for real-time chats.
How does this work in practice? Unfortunately, the explanation of these basic dynamics often happens inside larger, complex contexts, like reinforcement learning or mathematical proofs. Today, we’ll demonstrate this hybrid approach using Alloy, a lightweight language that’s trivial to read, even for beginners. Instead of the usual math-y papers and hard-to-grasp benchmarks, we’re going to solve a much more relatable challenge, inspired by a weekly crossword publication:

We have: 5 cars (1-5) parked in front of 5 girls (A-E), and 5 names (Laura, Giovanna, Bianca, Franca, Marta); we don’t know which car was parked by which girl, but the girls say something about the situation. Our task is to answer this deceptively simple question: which girl is named Marta and what’s her car?
While more beach-level than PhD-level thinking, the solution sits at a sweet spot of complexity. It can provide a primer on LLMs and formal methods that’s not polluted by other themes and doesn’t require extensive domain knowledge: we keep all the basic ingredients of real-world problems, but simplify the setup.
Prompts, screenshots, and Alloy code are available in this open source repo (all tests were done in the week of August 2025, the main reasoning loop was done with Opus 4.1 on Claude Desktop).
AIs and humans struggle by themselves
A fun fact about our puzzle is that, even though it requires only “beach-level thinking”, top models are not obviously good at it. When we uploaded the original picture and prompted Opus 4.1 for a solution, the model incorrectly assumed C is wearing pants: how can we then trust its conclusion, namely that Marta is Girl A and her car is number 5?
Things get interesting when we try to compare models. We abstract away the puzzle in a textual description, but LLMs still cannot find consensus: DeepSeek’s 4.1 answer (A and 2) is different from the one given by Opus; Opus’s own answer with textual prompting (A and 2) is different from Opus above, and ChatGPT5 has yet another opinion (A and 5).
This is what makes the puzzle a great motivating example: humans struggle at this combinatorial reasoning (homework question: how long did it take you to solve it?), but it’s unclear how much better frontier models are. How do we build confidence in any of the answers above? How can we reason with the AI instead of delegating the process entirely?
Reasoning as “eliminating possibilities”
Complex reasoning challenges can often be solved by following the advice from that famous detective: ‘When you have eliminated the impossible, then whatever remains, however improbable, must be the truth’. Instead of trying to solve the problem all at once, we can think of our puzzle as the combination of three main things:
- An initial scenario, randomly mapping girls to cars and labels.
- A set of constraints, in the form of statements by the very same girls: these statements will make certain mappings impossible.
- A final scenario, in which girls are re-mapped to names and cars.
Our initial knowledge is compatible with this reality:

But also this (and many more):

We can imagine that every time we add a girl’s statement, we eliminate some arrangements from possibly being the final one. In other words, we increase our knowledge about the situation as we progressively restrict the set of feasible solutions (this basic insight is the same one underlying epistemic logic and information theory). In fact, the very first statement, “Girl A states that Laura is not next to her, and A’s car is now in front of Bianca”, rules out our first scenario, because Laura is next to Girl A there.
Enumerating scenarios is a tedious and error-prone task, even for LLMs. The magic of Alloy is its declarative nature. Instead of writing down the reasoning code ourselves, we state what we know (premises in a standard proof, statements in this case), and what to find out (a theorem, Marta’s car), and let Alloy do the rest: exploring a huge conceptual space is done by tried and tested methods, so that we can focus on the faithful translation of the puzzle and (important!) the interpretation of the instances Alloy finds.
The division of labor should now be clear: instead of the LLM (or us) directly solving the problem, we translate English requirements into Alloy code with Claude, then use Alloy to generate solutions, and finally we, as humans, check them.
From LLM to Alloy and back: the reasoning loop
Our prompting strategy is now more refined. We no longer ask Claude for a direct solution; instead, our prompt guides it to generate Alloy code based on our initial scenario. Instead of “one-shotting” the solution, we are now in a virtuous loop, producing increasingly complex code, and verifying that we are getting closer based on the Alloy output:

The result is our starting code, which contains the main ingredients but no constraints yet. It’s easy to scroll through the definitions now that the tedious translation has been done: Girl, Car, and Name are our main “signatures” (i.e. sets of objects), and the initial position for Girls A-E is the mapping to Cars 1-5. We don’t yet know who owns what, except that nobody owns the car in front of them now:
// No girl is initially standing in front of her own car
// Girl A (position 1) doesn't own Car1, B doesn't own Car2, etc.
A.owns != Car1
B.owns != Car2
C.owns != Car3
D.owns != Car4
E.owns != Car5
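To see how these lines could sit inside a complete file, here is a minimal, self-contained sketch of a starting model. It is a reconstruction under assumptions, not the code from the repo: the signature names follow the text, the `name` and `owns` fields follow the snippets, and the fact names `OneToOne` and `NobodyFacesOwnCar` are made up for illustration.

```alloy
// Hypothetical starting model: structure only, no statements from the girls yet.
abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}

abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}

abstract sig Girl {
  name: one Name,  // each girl has exactly one name
  owns: one Car    // each girl owns exactly one car
}
one sig A, B, C, D, E extends Girl {}

fact OneToOne {
  // every name belongs to exactly one girl, every car to exactly one owner
  all n: Name | one name.n
  all c: Car | one owns.c
}

fact NobodyFacesOwnCar {
  // Girl A stands in front of Car1, B in front of Car2, etc.
  A.owns != Car1
  B.owns != Car2
  C.owns != Car3
  D.owns != Car4
  E.owns != Car5
}

// Ask the analyzer for any instance satisfying the facts above.
run {} for 5
```

Running the command should produce one of many possible assignments, since nothing yet ties names to positions.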
We pause here to highlight two great Alloy features. First, the code maps clearly to logical statements, quite like the ones to be found in mathematical proofs and informal reasoning; even if you have never seen Alloy’s syntax before, the statements should be obvious (code comments are your friend!). Second, the built-in UI is useful to visualize our progress, as it depicts an instance chosen among all the possible realities that satisfy the constraints: for example, this is a possible assignment (Giovanna is C):

Executing it again, we could get another one, and then another one: as our knowledge is limited at this stage, multiple assignments are all possible. It’s time to start eliminating some!
Let’s ask Claude to modify our initial code, and add the statement from girl A. The great thing about this loop is that we can also encode “sanity checks” based on incomplete but sound reasoning. Not just LLMs, but also human intelligence benefits from this kind of “progressive enhancement”: being able to incorporate “local” constraints both unit-tests the Alloy model and engages us directly with the puzzle.
Let’s now add the statement by Girl A as a constraint. Now add a check to confirm that the following mapping is not allowed anymore: Franca (A, 1), Laura (B, 2). If we now run the code, no counterexample is found, proving we successfully excluded the undesired configuration:
pred InvalidConfiguration {
  // Girl A is named Franca and owns Car1
  A.name = Franca
  A.owns = Car1
  // Girl B is named Laura and owns Car2
  B.name = Laura
  B.owns = Car2
}
check { not InvalidConfiguration } for 5 Int
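For illustration, Girl A’s statement itself could be encoded along the following lines. This is a sketch under stated assumptions, not the constraint Claude generated: the `faces` field, the fact names, and the `LauraNextToA` predicate are hypothetical, “next to” is read as “adjacent in the row A-E”, and “A’s car is now in front of Bianca” is read as “the girl standing in front of the car A owns is named Bianca”.

```alloy
abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}
abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}
abstract sig Girl {
  name: one Name,
  owns: one Car,
  faces: one Car  // hypothetical field: the car currently parked in front of her
}
one sig A, B, C, D, E extends Girl {}

fact Setup {
  all n: Name | one name.n
  all c: Car | one owns.c
  A.faces = Car1 and B.faces = Car2 and C.faces = Car3
  D.faces = Car4 and E.faces = Car5
  all g: Girl | g.owns != g.faces  // nobody stands in front of her own car
}

fact GirlAStatement {
  // Assumed reading: A is at the end of the row, so her only neighbour is B
  B.name != Laura
  // Assumed reading: the girl facing the car that A owns is named Bianca
  (faces.(A.owns)).name = Bianca
}

// Sanity check: the configuration with Laura next to A must be gone.
pred LauraNextToA { B.name = Laura }
check { not LauraNextToA } for 5
```

Under these readings the check should report no counterexample, because the fact directly rules out Laura standing at position B; removing `GirlAStatement` should make a counterexample appear.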
Now that we know the trick, our AI assistant can generate the script with all the statements by the girls. When we run it, this is the instance that we get:

Thanks to a few iterations and interpretable, provably correct reasoning, we can now establish that ChatGPT5 got this right: Marta is Girl A with Car 5, and the mapping provided by ChatGPT is correct (you can verify it yourself by comparing the chat result with the instance above; incidentally, this also settles another interesting question: regardless of Marta’s mapping, are the other girls uniquely determined as well?).
Reasoning out of the box
A great side-product of having independently computable representations of the concepts at hand is that we can now explore the underlying mechanics of the puzzle in the symbolic space of Alloy, instead of relying solely on opaque mappings in latent space.
For example, we can easily check that the solution is unique: in the Alloy UI, if you try to get a new instance, a warning says that no other instance is available. But we could also explore outside the existing boundaries, and remove all the Clothing information: does the solution change? (Try to answer before running it!) It turns out that the correct solution is still a valid instance (homework question: why must this be the case?), but this time the UI can indeed produce multiple valid instances: as expected, fewer constraints, (likely) more solutions.
A symbolic space that we can easily manipulate is also great for checking the work of AI, which should never be taken at face value. The first case in point is checking Opus’ solution from the beginning, obtained by parsing the picture incorrectly. We can just change girl C’s clothing (i.e. `C.wears = Trousers`) and try again: since there is no solution, the (sad) conclusion is that Opus’ original reasoning was incorrect; it was “right” but for the “wrong” reasons, so to speak.
A second example comes from what Claude added to check for uniqueness (i.e.: Marta is A and 5 in all valid configurations). In theory, that’s a nice addition, but in practice this check doesn’t do the job:
assert MartaUniqueSolution {
  all g1, g2: Girl |
    (g1.name = Marta and g2.name = Marta) implies
    (g1 = g2) // Marta is always at the same position
}
The mismatch is clear, and easy to identify thanks to Alloy’s clear syntax: “In all valid configurations” is a quantifier over all instances (in the “meta-language” so to speak), while “all g1…” quantifies over girls within an instance.
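One way to express the intended property is to lean on the `check` command itself, which searches every instance within scope for a violation and therefore plays the role of the “in all valid configurations” quantifier. The sketch below is an assumption about how this could look rather than the repo’s code: the assertion name `MartaIsAWithCar5` is hypothetical, and the girls’ statements are deliberately left out.

```alloy
abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}
abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}
abstract sig Girl { name: one Name, owns: one Car }
one sig A, B, C, D, E extends Girl {}

fact Setup {
  all n: Name | one name.n
  all c: Car | one owns.c
}

// The girls' statements would be added here as further facts (omitted).

// The assertion talks about ONE instance; the check command then looks
// for ANY instance that violates it, covering all valid configurations.
assert MartaIsAWithCar5 {
  A.name = Marta and A.owns = Car5
}
check MartaIsAWithCar5 for 5
```

As written, with only the structural facts, Alloy will find a counterexample, which is expected. With the full set of statements in place, the absence of a counterexample would mean that Marta is A with Car 5 in every instance that satisfies the puzzle.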
See you, space cowboys
Similarly to cutting-edge systems like AlphaGeometry, we solved a deductive problem (effectively, a proof) by reasoning with Claude, instead of delegating the process entirely.
The LLM does the mapping between English and a formal language: Alloy is easy to read, but sometimes tedious to write, so the code generation capabilities of Claude come in handy. Humans, on the other hand, can focus on checking whether the formal setup is correct (checking is often easier than doing in the first place!). Both Claude and humans then delegate combinatorial reasoning to a powerful, verified solver for the actual deduction.
While our beach-level proof seems unimportant, and the copy-paste from Claude gets tedious quickly, this simple example is a glimpse of the power of formal methods when combined with code generation and some (human or agentic) supervision. Real-world systems use more expressive languages, run tighter, self-improving loops and target less frivolous proofs, but many of the intuitions from today carry over to them.
Of course, solving beach-or-PhD logical puzzles is not the only use case for hybrid systems such as this one. Languages like Alloy are very popular for modelling software programs, and as such, they open the door to a future in which distributed systems could be cheaply designed and verified at scale before any implementation work even begins. As very practical examples, AWS famously invests in verifying their cloud products, and Bauplan provides an Alloy model for their own data catalog primitives.
Taking a very different path than what many would have predicted even just 50 years ago, it seems, day by day, that we are finally getting closer to Leibniz’s dream:
If controversies were to arise, there would be no more need of disputation between two philosophers than between two calculators. For it would suffice for them to take their pencils in their hands and to sit down at the abacus, and say to each other: Let us calculate.
Acknowledgments
Thanks to Federico Bianchi, Aldrin Montana, Patrick John Chia for preliminary feedback on a previous draft of this article. No LLM was used or harmed to write the English parts of this blog post.
If you care about verification, simulations and AI in system and infrastructure design, you’ll love working at Bauplan: we’re hiring!