The most common interface for interacting with LLMs is the classic chat UI found in ChatGPT, Gemini, or DeepSeek. The interface is quite simple: the user inputs a body of text and the model responds with another body of text, which may or may not follow a specific structure. Since humans can understand unstructured natural language, this interface is suitable and quite effective for the audience it was designed for.
However, the user base of LLMs is much larger than the 8 billion humans living on Earth. It extends to millions of software programs that could potentially harness the power of such large generative models. Unlike humans, software programs cannot understand unstructured data, which prevents them from exploiting the information generated by these neural networks.
To address this issue, various techniques have been developed to generate outputs from LLMs that follow a predefined schema. This article reviews three of the most popular approaches for producing structured outputs from LLMs. It is written for engineers interested in integrating LLMs into their software applications.
Structured Output Generation
Structured output generation from LLMs involves using these models to produce data that adheres to a predefined schema, rather than generating unstructured text. The schema can be defined in various formats, with JSON and regex being the most common. For example, when using the JSON format, the schema specifies the expected keys and the data types (such as int, string, float, etc.) for each value. The LLM then outputs a JSON object that includes only the defined keys and correctly formatted values.
There are many situations where structured output is needed from LLMs. Formatting unstructured bodies of text is one large application area of this technology. You can use a model to extract specific information from large bodies of text or even images (using vision-language models, or VLMs). For example, you can use a general-purpose VLM to extract the purchase date, total price, and store name from receipts.
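To make the idea of a schema concrete, here is a minimal sketch that checks a model's raw output against the receipt example above. The schema format (a plain dict mapping each expected key to a Python type) and the `matches_schema` helper are hypothetical simplifications for illustration; real projects usually express the schema as JSON Schema or a Pydantic class.

```python
import json

# Hypothetical, simplified schema: each expected key maps to its Python type.
RECEIPT_SCHEMA = {"store_name": str, "purchase_date": str, "total_price": float}


def matches_schema(raw_output: str, schema: dict) -> bool:
    """Return True if raw_output is a JSON object with exactly the schema's keys and types."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not even valid JSON
    if not isinstance(data, dict) or set(data) != set(schema):
        return False  # not an object, or missing / extra keys
    return all(isinstance(data[key], expected) for key, expected in schema.items())


good = '{"store_name": "Corner Shop", "purchase_date": "2024-05-01", "total_price": 12.5}'
bad = 'Sure! Here is the JSON: {"store_name": "Corner Shop"}'
print(matches_schema(good, RECEIPT_SCHEMA))
print(matches_schema(bad, RECEIPT_SCHEMA))
```

The first string passes; the second fails because of the chatty prefix, which is exactly the kind of output a program cannot consume.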
There are various techniques for generating structured outputs from LLMs. This article discusses three.
- Relying on API Providers
- Prompting and Reprompting Methods
- Constrained Decoding
Relying on API Providers' "Magic"
Several LLM API providers, including OpenAI and Google's Gemini, allow users to define a schema for the model's output. This schema is usually defined using a Pydantic class and provided to the API endpoint. If you are using LangChain, you can follow this tutorial to integrate structured outputs into your application.
Simplicity is the greatest strength of this particular approach. You define the required schema in a way familiar to you, pass it to the API provider, and sit back and relax while the service provider performs all the magic for you.
Using this technique, however, limits you to API providers that offer the described service. This restricts the growth and flexibility of your projects, as it shuts the door to using multiple models, particularly open-source ones. If the API providers suddenly decide to raise the price of the service, you will be forced either to accept the extra costs or to look for another provider.
Moreover, what the service provider does is not exactly Hogwarts magic. The provider follows a certain approach to generate the structured output for you. Knowing the underlying technology will make app development easier and speed up debugging and error understanding. For these reasons, grasping the underlying science is probably worth the effort.
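Under the hood, the Pydantic class is typically converted into a JSON Schema that travels with the request. The sketch below builds such a request body by hand, assuming the shape of OpenAI's Chat Completions `response_format` parameter for Structured Outputs (`"type": "json_schema"` with a `strict` flag). It sends nothing, and other providers use different field names.

```python
import json

# Hand-written JSON Schema for a receipt; a Pydantic class would be converted
# into something similar (for example via Pydantic's model_json_schema()).
receipt_schema = {
    "type": "object",
    "properties": {
        "store_name": {"type": "string"},
        "purchase_date": {"type": "string"},
        "total_price": {"type": "number"},
    },
    "required": ["store_name", "purchase_date", "total_price"],
    "additionalProperties": False,  # strict mode expects closed objects
}

# Request body for a Structured Outputs call; we only build it here.
request_body = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Extract the fields from this receipt: <receipt text>"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "receipt", "strict": True, "schema": receipt_schema},
    },
}

print(json.dumps(request_body["response_format"], indent=2))
```

Everything after this point, including how the provider enforces the schema, happens on the provider's side.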
Prompting and Reprompting-Based Techniques
If you have chatted with an LLM before, this technique is probably already on your mind. If you want a model to follow a certain structure, just tell it to do so! In the system prompt, instruct the model to follow a certain structure, provide a few examples, and ask it not to add any additional text or description.
After the model responds to the user request and the system receives the output, you should use a parser to transform the sequence of bytes into an appropriate representation within the system. If parsing succeeds, congratulate yourself and thank the power of prompt engineering. If parsing fails, your system must recover from the error.
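A system prompt following this recipe might be assembled as below. The `build_messages` helper, the wording of the instructions, and the example person are all hypothetical; the resulting list uses the common role/content chat-message format that most chat APIs accept.

```python
import json


def build_messages(user_text: str) -> list:
    """Hypothetical helper: build chat messages that ask for a JSON object only."""
    example = {"name": "Jane", "age": 41, "occupation": "teacher"}  # one few-shot example
    system_prompt = (
        "Extract the person described by the user.\n"
        'Respond with a single JSON object with the keys "name" (string), '
        '"age" (integer), and "occupation" (string).\n'
        f"Example output: {json.dumps(example)}\n"
        "Do not add any other text, explanation, or markdown."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("John is a 30-year-old software engineer")
print(messages[0]["content"])
```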
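Here is a minimal sketch of the parsing step, assuming the model was asked for a JSON object describing a person (as in the Instructor example later in this article). The `Person` dataclass and `parse_person` helper are hypothetical; the important part is that success yields a typed object and failure raises an exception that the recovery logic can catch.

```python
import json
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    occupation: str


# Expected JSON keys and their Python types.
FIELD_TYPES = {"name": str, "age": int, "occupation": str}


def parse_person(raw_output: str) -> Person:
    """Turn the model's raw text into a Person, or raise ValueError."""
    data = json.loads(raw_output)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in FIELD_TYPES.items():
        if field not in data:
            raise ValueError(f"missing field '{field}'")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field '{field}' should be {expected_type.__name__}")
    return Person(**{field: data[field] for field in FIELD_TYPES})


for raw in ['{"name": "John", "age": 30, "occupation": "software engineer"}',
            "Sure! John is 30 and works as a software engineer."]:
    try:
        print("Parsed:", parse_person(raw))
    except ValueError as err:
        print("Parsing failed, the system must recover:", err)
```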
Prompting Is Not Enough
The problem with prompting is unreliability. Prompting alone is not enough to trust a model to follow a required structure. The model might add extra explanations, ignore certain fields, or use an incorrect data type. Prompting can and should be coupled with error-recovery techniques that handle the case where the model defies the schema, which is detected by a parsing failure.
Some people might think that a parser acts like a boolean function: it takes a string as input, checks its adherence to predefined grammar rules, and returns a simple "yes" or "no" answer. In reality, parsers are more sophisticated than that and provide much richer information than "follows" or "does not follow" the structure.
Parsers can detect errors and incorrect tokens in the input text according to grammar rules (Aho et al. 2007, 192–96). This gives us valuable information about exactly where and how the input string deviates from the grammar. For example, the parser is what reports a missing-semicolon error when you compile Java code.
Figure 1 depicts the flow used in prompting-based techniques.
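Python's standard `json` parser illustrates this: when parsing fails, the `json.JSONDecodeError` it raises carries the error message (`msg`), the character offset (`pos`), and the line and column (`lineno`, `colno`). The `diagnose` helper below is a hypothetical sketch that turns those details into feedback that could be sent back to the model in a reprompt.

```python
import json


def diagnose(raw_output: str) -> str:
    """Describe where the JSON parser failed, or return '' if the input is valid JSON."""
    try:
        json.loads(raw_output)
        return ""
    except json.JSONDecodeError as err:
        # The parser reports what it expected and exactly where it gave up.
        snippet = raw_output[max(err.pos - 10, 0): err.pos + 10]
        return f"{err.msg} at line {err.lineno}, column {err.colno}, near {snippet!r}"


broken = '{"name": "John", "age": 30 "occupation": "software engineer"}'  # missing comma
print(diagnose(broken))
```

A message like this tells the model precisely what to fix, which is far more useful than a bare "invalid output".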

Prompting Tools
One of the most popular libraries for prompt-based structured output generation from LLMs is Instructor. Instructor is a Python library with over 11k stars on GitHub. It supports data definition with Pydantic, integrates with over 15 providers, and provides automatic retries on parsing failure. In addition to Python, the package is also available in TypeScript, Go, Ruby, and Rust (2).
The beauty of Instructor lies in its simplicity. All you need to do is define a Pydantic class, initialize a client using only its name and API key (if required), and pass your request. The sample code below, from the docs, shows how simple Instructor is.
import instructor
from pydantic import BaseModel
from openai import OpenAI
class Person(BaseModel):
    name: str
    age: int
    occupation: str
client = instructor.from_openai(OpenAI())
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,
    messages=[
        {
            "role": "user",
            "content": "Extract: John is a 30-year-old software engineer"
        }
    ],
)
print(person)  # Person(name='John', age=30, occupation='software engineer')
The Cost of Reprompting
As convenient as the reprompting technique might be, it comes at a hefty cost. The cost of using an LLM, whether API fees or GPU time, scales linearly with the number of input tokens and the number of generated tokens.
As mentioned earlier, prompting-based techniques might require reprompting. Each reprompt costs at least roughly as much as the original request, since it resends the prompt (often together with the failed output and the error message). Hence, the cost scales linearly with the number of reprompts.
If you are going to use this technique, you have to keep the cost problem in mind. No one wants to be surprised by a large bill from an API provider. One way to avoid surprising costs is to build emergency brakes into the system by applying a hard-coded limit on the number of allowed reprompts. This puts an upper bound on the cost of a single prompt-and-reprompt cycle.
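A back-of-the-envelope sketch of that linear scaling, assuming every reprompt is about the same size as the original call. The token counts and per-million-token prices are made-up numbers for illustration, not any provider's actual pricing.

```python
def estimate_cost(input_tokens: int, output_tokens: int, reprompts: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Approximate cost (USD) of one call plus `reprompts` retries of similar size."""
    per_call = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return per_call * (1 + reprompts)


# Hypothetical request: 2,000 input tokens and 500 output tokens,
# priced at $0.50 / $1.50 per million input / output tokens.
for reprompts in range(4):
    cost = estimate_cost(2_000, 500, reprompts, 0.50, 1.50)
    print(f"{reprompts} reprompts -> ${cost:.5f}")
```

Each extra reprompt adds another $0.00175 in this example; in practice reprompts tend to be somewhat larger than the original call.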
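The emergency brake can be as simple as a bounded loop. In this sketch, `llm` is any function from prompt to text (a hypothetical stand-in for a real API call), `MAX_REPROMPTS` is an arbitrary cap, and the parser's error message is fed back into the next prompt.

```python
import json
from typing import Callable

MAX_REPROMPTS = 2  # emergency brake: hard cap on the number of retries


def generate_json(prompt: str, llm: Callable[[str], str], max_reprompts: int = MAX_REPROMPTS):
    """Call llm, reprompting with parser feedback at most max_reprompts times.

    Returns (parsed_object_or_None, number_of_calls_made).
    """
    current_prompt = prompt
    for attempt in range(1 + max_reprompts):
        raw = llm(current_prompt)
        try:
            return json.loads(raw), attempt + 1
        except json.JSONDecodeError as err:
            # The reprompt resends the prompt plus the bad output and the error.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was:\n{raw}\n"
                f"It is not valid JSON ({err.msg} at char {err.pos}). Reply with JSON only."
            )
    return None, 1 + max_reprompts


# Hypothetical stand-in for a real model: chatty on the first call, compliant on the second.
replies = iter(['Sure! {"name": "John"}', '{"name": "John"}'])
result, calls = generate_json("Extract the name as JSON.", lambda _prompt: next(replies))
print(result, calls)
```

With `MAX_REPROMPTS = 2`, a single request can never trigger more than three model calls, so its cost is bounded by roughly three times the cost of one call.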
Constrained Decoding
Unlike prompting, constrained decoding does not need retries to generate a valid, structure-following output. Constrained decoding uses techniques from computational linguistics and knowledge of the token generation process in LLMs to generate outputs that are guaranteed to follow the required schema.
How Does It Work?
LLMs are autoregressive models. They generate one token at a time, and each generated token is fed back as input to the same model.
The last layer of an LLM is essentially a multinomial logistic regression model that calculates, for each token in the model's vocabulary, the probability of that token following the input sequence. The model computes a logit for each token; the softmax function then scales these values and transforms them into probabilities.
Constrained decoding produces structured outputs by limiting the available tokens at each generation step. The tokens are picked so that the final output obeys the required structure. To figure out how the set of possible next tokens can be determined, we need to turn to RegEx.
Regular expressions (RegEx) are used to define specific patterns of text. They are used to check whether a sequence of text matches an expected structure or schema. So, essentially, RegEx is a language that can be used to define the structures we expect from LLMs. Because of its popularity, there is a wide range of tools and libraries that transform other forms of data structure definition, like Pydantic classes and JSON schemas, into RegEx. Because of this flexibility and the wide availability of conversion tools, we can now reframe our goal and focus on using LLMs to generate outputs that follow a RegEx pattern.
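For readers who prefer code, here is the logits-to-probabilities step on a made-up four-token vocabulary; the vocabulary and logit values are purely illustrative.

```python
import math


def softmax(logits: list) -> list:
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["a", "b", "c", "d"]
logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a toy vocabulary
probs = softmax(logits)
for token, p in zip(vocabulary, probs):
    print(f"{token}: {p:.3f}")
```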
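As a small illustration, the hand-written pattern below only accepts a compact JSON object with a string `name` and an integer `age`. It is a hypothetical example, deliberately strict about whitespace; patterns generated by libraries from a Pydantic class or JSON Schema are more permissive and much longer.

```python
import re

# Hypothetical pattern for {"name": <letters and spaces>, "age": <digits>}.
PERSON_PATTERN = re.compile(r'\{"name": "[A-Za-z ]+", "age": [0-9]+\}')

candidates = [
    '{"name": "John", "age": 30}',        # matches
    '{"name": "John", "age": "30"}',      # age is a string, not an integer
    'Sure! {"name": "John", "age": 30}',  # extra text before the object
]
for text in candidates:
    print(bool(PERSON_PATTERN.fullmatch(text)), text)
```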
Deterministic Finite Automata (DFA)
One way to compile a RegEx pattern and test it against a body of text is to transform the pattern into a deterministic finite automaton (DFA). A DFA is simply a state machine used to check whether a string follows a certain structure or pattern.
A DFA consists of five components.
- A set of tokens (called the alphabet of the DFA)
- A set of states
- A set of transitions. Each transition connects two states (possibly connecting a state with itself) and is annotated with a token from the alphabet; in a DFA, no state has two outgoing transitions annotated with the same token
- A start state (marked with an incoming arrow)
- One or more final states (marked as double circles)
A string is a sequence of tokens. To test a string against the pattern defined by a DFA, you begin at the start state and loop over the string's tokens, taking the transition corresponding to the token at each step. If at any point you encounter a token for which no transition exists from the current state, parsing fails and the string defies the schema. If parsing ends at one of the final states, the string matches the pattern; otherwise, it also fails.

Figure 2: A DFA with alphabet {a, b}, states {q0, q1, q2}, and a single final state, q2. Generated using Graphviz by the author.
For example, the string abab matches the pattern in Figure 2 because starting at q0 and following the transitions marked with a, b, a, and b, in this order, lands us at q2, which is a final state. On the other hand, the string abba does not match the pattern because its path ends at q0, which is not a final state.
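The matching procedure fits in a few lines of Python. This sketch encodes a DFA for the pattern a(b|c)*d, with states q0 (start), q1, and q2 (final), as a dict of dicts; a missing entry means there is no transition for that token. The encoding and the `dfa_accepts` name are illustrative choices, not a library API.

```python
# TRANSITIONS[state][token] -> next state; a missing entry means "no transition".
TRANSITIONS = {
    "q0": {"a": "q1"},
    "q1": {"b": "q1", "c": "q1", "d": "q2"},
    "q2": {},
}
START_STATE = "q0"
FINAL_STATES = {"q2"}


def dfa_accepts(tokens) -> bool:
    """Run the DFA over the tokens; accept only if we end in a final state."""
    state = START_STATE
    for token in tokens:
        next_state = TRANSITIONS[state].get(token)
        if next_state is None:
            return False  # no transition for this token: the string defies the schema
        state = next_state
    return state in FINAL_STATES


for s in ["ad", "abccbd", "abc", "abdd", "b"]:
    print(s, dfa_accepts(s))
```

Here "abc" fails because it stops in q1, which is not final, while "abdd" fails because q2 has no transition for the second d.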
A nice property of RegEx is that it can be compiled into a DFA; after all, they are just two different ways to specify patterns. Discussing this transformation is out of scope for this article. The reader can check Aho et al. (2007, 152–66) for a discussion of two techniques for performing it.
DFA for the Set of Valid Next Tokens
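We can at least check this equivalence empirically for a small case. The sketch below compares a hand-built DFA for a(b|c)*d against Python's `re` module on every string of up to five tokens over {a, b, c, d}. Note that `re` uses a backtracking matcher rather than a DFA internally; the point is only that both descriptions accept exactly the same strings.

```python
import itertools
import re

# Hand-built DFA for a(b|c)*d (a missing entry means "no transition").
TRANSITIONS = {
    "q0": {"a": "q1"},
    "q1": {"b": "q1", "c": "q1", "d": "q2"},
    "q2": {},
}
FINAL_STATES = {"q2"}
PATTERN = re.compile(r"a(b|c)*d")


def dfa_accepts(text: str) -> bool:
    state = "q0"
    for token in text:
        state = TRANSITIONS[state].get(token)
        if state is None:
            return False
    return state in FINAL_STATES


mismatches = []
for length in range(6):
    for tokens in itertools.product("abcd", repeat=length):
        text = "".join(tokens)
        if dfa_accepts(text) != bool(PATTERN.fullmatch(text)):
            mismatches.append(text)

print(f"checked all strings up to length 5, mismatches: {mismatches}")
```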

Figure 3: A DFA for the RegEx pattern a(b|c)*d. Generated using Graphviz by the author.
Let's recap what we have covered so far. We wanted a way to determine the set of valid next tokens that keep the output within a certain schema. We defined the schema using RegEx and transformed it into a DFA. Now we will show that a DFA tells us the set of possible tokens at any point during parsing, which is exactly what we need.
After building the DFA, we can determine the set of valid next tokens at any state in O(1) time (with the per-state sets precomputed): it is the set of tokens annotating the transitions leaving the current state.
Consider the DFA in Figure 3, for example. The following table shows the set of valid next tokens for each state.
| State | Valid Next Tokens |
|---|---|
| q0 | {a} |
| q1 | {b, c, d} |
| q2 | {} |
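In code, the table is just the keys of each state's transition dict. The helper names below are hypothetical; `state_after` walks the DFA of Figure 3 over a prefix so we can ask what may come next at any point during generation.

```python
# DFA for a(b|c)*d: TRANSITIONS[state][token] -> next state.
TRANSITIONS = {
    "q0": {"a": "q1"},
    "q1": {"b": "q1", "c": "q1", "d": "q2"},
    "q2": {},
}


def valid_next_tokens(state: str) -> set:
    """Tokens annotating the transitions that leave `state`."""
    return set(TRANSITIONS[state])


def state_after(prefix: str, start: str = "q0") -> str:
    """Walk the DFA over a prefix (assumed to be valid so far) and return the current state."""
    state = start
    for token in prefix:
        state = TRANSITIONS[state][token]
    return state


for state in TRANSITIONS:
    print(state, sorted(valid_next_tokens(state)))
print("after 'abc':", sorted(valid_next_tokens(state_after("abc"))))
```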
Applying the DFA to LLMs
Getting back to our problem of structured output from LLMs, we can transform our schema into a RegEx and then into a DFA. The alphabet of this DFA is set to the LLM's vocabulary (the set of all tokens the model can generate). As the model generates tokens, we move through the DFA, beginning at the start state. At each step, we can determine the set of valid next tokens.
The trick now happens at the softmax stage. By setting the logits of all tokens outside the valid set to negative infinity (so that their probabilities become exactly zero after the softmax), we compute probabilities only over valid tokens, forcing the model to generate a sequence of tokens that follows the schema. That way, we can generate structured outputs without retries and therefore without paying for extra generations!
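The sketch below puts the pieces together on a toy scale, under loud assumptions: the five-token vocabulary and the `toy_logits` function are hypothetical stand-ins for a real LLM (which would run a forward pass), tokens are single characters, decoding is greedy, and the schema is the a(b|c)*d DFA from Figure 3. Although the toy model always gives the out-of-schema token `x` the highest raw logit, masking guarantees it is never picked.

```python
import math

VOCAB = ["a", "b", "c", "d", "x"]
TRANSITIONS = {
    "q0": {"a": "q1"},
    "q1": {"b": "q1", "c": "q1", "d": "q2"},
    "q2": {},
}
FINAL_STATES = {"q2"}


def toy_logits(generated: str) -> list:
    """Hypothetical stand-in for an LLM: always loves 'x', prefers 'd' after 3 tokens."""
    scores = {"a": 0.1, "b": 0.5, "c": 1.0, "d": 0.2, "x": 5.0}
    if len(generated) >= 3:
        scores["d"] = 3.0
    return [scores[token] for token in VOCAB]


def masked_softmax(logits: list, allowed: set) -> list:
    """Softmax where tokens outside `allowed` get logit -inf, hence probability 0."""
    masked = [logit if tok in allowed else -math.inf for tok, logit in zip(VOCAB, logits)]
    m = max(masked)  # finite, because at least one token is allowed
    exps = [math.exp(logit - m) for logit in masked]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_greedy_decode(max_tokens: int = 10):
    state, out = "q0", ""
    for _ in range(max_tokens):
        allowed = set(TRANSITIONS[state])
        if not allowed:  # no outgoing transitions: the output is complete
            break
        probs = masked_softmax(toy_logits(out), allowed)
        token = VOCAB[max(range(len(VOCAB)), key=probs.__getitem__)]  # greedy pick
        out += token
        state = TRANSITIONS[state][token]
    return out, state in FINAL_STATES


print(constrained_greedy_decode())
```

Here q2 has no outgoing transitions, so generation stops there naturally; in general, a final state may still have outgoing transitions, and a common approach is to allow the end-of-sequence token only in final states. Real implementations also have to handle multi-character tokens, which Willard and Louf (2023) address by precomputing, for each DFA state, which vocabulary tokens keep the automaton in a valid state.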
Constrained Decoding Tools
One of the most popular Python libraries for constrained decoding is Outlines (Willard and Louf 2023). It is very easy to use and integrates with many LLM providers, such as OpenAI, Anthropic, Ollama, and vLLM.
You can define the schema either with a Pydantic class, in which case the library handles the RegEx transformation, or directly with a RegEx pattern.
from pydantic import BaseModel
from typing import Literal
import outlines
import openai
class Customer(BaseModel):
    name: str
    urgency: Literal["high", "medium", "low"]
    issue: str
client = openai.OpenAI()
model = outlines.from_openai(client, "gpt-4o")
customer = model(
    "Alice needs help with login issues ASAP",
    Customer
)
# ✓ Always returns valid Customer object
# ✓ No parsing, no errors, no retries
The code snippet above, taken from the docs, shows how simple Outlines is to use. For more information on the library, you can check the docs and the dottxt blog.
Conclusion
Structured output generation from LLMs is a powerful tool that expands the possible use cases of LLMs beyond simple human chat. This article discussed three approaches: relying on API providers, prompting and reprompting techniques, and constrained decoding. For most scenarios, constrained decoding is the favoured method because of its flexibility and low cost. Moreover, popular libraries like Outlines make it easy to introduce constrained decoding into software projects.
If you want to learn more about constrained decoding, I highly recommend this course from deeplearning.ai and dottxt, the creators of the Outlines library. Through videos and code examples, the course gives you hands-on experience in getting structured outputs from LLMs using the techniques discussed in this post.
References
[1] Aho, Alfred V., Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, & Tools (2007), Pearson/Addison Wesley
[2] Willard, Brandon T., and Rémi Louf, Efficient Guided Generation for Large Language Models (2023), https://arxiv.org/abs/2307.09702