That was a great read, thank you. I've a related observation. In my experien...

miki123211 · 2025-09-23T15:45:42 1758642342

What I've found is that it is very important to make structured outputs as easy for the LLM as possible. This means making your schemas LLM-friendly instead of programmer-friendly.

E.g. if the LLM hallucinates non-existing URLs, you may add a boolean "contains_url" field to your entity's JSON schema, placing it before the URL field itself. This way, the URL extraction is split into two simpler steps, checking if the URL is there and actually extracting it. If the URL is missing, the `"contains_url": false` field in the context will strongly urge the LLM to output an empty string there.
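A minimal sketch of that ordering, assuming a JSON Schema style definition and a structured-output backend that emits properties in declaration order (not every backend guarantees this). The field names other than `contains_url` and the `clean_entity` helper are made up for illustration; the consistency check at the end is an extra safety net, not something the schema gives you by itself.

```python
import json

# Hypothetical schema for one extracted entity. The names are illustrative;
# the point is that "contains_url" is declared before "url".
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "contains_url": {"type": "boolean"},
        "url": {"type": "string"},
    },
    "required": ["name", "contains_url", "url"],
    "additionalProperties": False,
}


def clean_entity(raw: str) -> dict:
    """Parse one entity and enforce flag/URL consistency in plain code."""
    entity = json.loads(raw)
    if not entity["contains_url"]:
        # Trust the easier yes/no decision over the harder extraction.
        entity["url"] = ""
    return entity
```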

This also comes up with quantities a lot. Imagine you're trying to sort job adverts by salary ranges, which you extract via LLM. These may be expressed as monthly instead of annual (common in some countries), in different currencies, pre- or post-tax, etc.

Instead of having an `annual_pretax_salary_usd` field, which is what you actually want, but which the LLM is extremely ill-equipped to generate, have a detailed schema like `type: monthly|yearly, currency:str, low:float, high:float, tax: pre_tax|post_tax`.

That schema is much easier for an LLM to generate, and you can then convert it to a single number via straight code.
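A rough sketch of what that "straight code" could look like. The exchange rates and the flat tax rate are placeholder numbers, not real data, and grossing up post-tax pay with one flat rate is a deliberate oversimplification. The field is called `period` here rather than `type` only to avoid shadowing the Python builtin.

```python
from dataclasses import dataclass
from typing import Literal

# Placeholder numbers for illustration only; real code would look these up.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.1, "PLN": 0.25}
ASSUMED_TAX_RATE = 0.25  # crude flat rate used to gross up post-tax figures


@dataclass
class Salary:
    period: Literal["monthly", "yearly"]
    currency: str
    low: float
    high: float
    tax: Literal["pre_tax", "post_tax"]


def annual_pretax_salary_usd(s: Salary) -> tuple[float, float]:
    """Convert an extracted salary range to (low, high) in annual pre-tax USD."""
    factor = USD_PER_UNIT[s.currency]
    if s.period == "monthly":
        factor *= 12
    if s.tax == "post_tax":
        # Undo the assumed flat tax to estimate the pre-tax amount.
        factor /= 1 - ASSUMED_TAX_RATE
    return s.low * factor, s.high * factor
```

Sorting the adverts is then just a sort on the returned tuple, and a wrong rate or tax assumption is a one-line fix instead of a re-extraction.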

lubujackson · 2025-09-24T01:02:24 1758675744

Awesome insight, thanks for this!

hansvm · 2025-09-23T15:59:25 1758643165

That's definitely possible.

As you know, (most current) LLMs build text autoregressively. This allows them to generate text with _exactly_ the same distribution as the training data.

When you constrain LLM output at each token, that gives a completely different distribution from letting the LLM generate a full output and then doing something with that (trying again, returning an error, post-processing, etc).

E.g.: Suppose the LLM has a training set of (aa, ab, ab, ba), noting that "ab" appears twice. Suppose your valid grammar is the set (ab, ba). Then your output distributions are:

Baseline: {invalid: 25%, ab: 50%, ba: 25%}

Constrained: {invalid: 0%, ab: 75%, ba: 25%}

Note that _all_ the previously invalid outputs were dumped into the "ab" bucket, skewing the ratio between "ab" and "ba". That skew may or may not be desirable, but assuming the training process was any good it's likely undesirable.
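If you want to check those numbers, here's a small sketch that treats the training set as the model's exact distribution and each character as one token. The function names are made up, and the "constrained" version is plain per-token masking with renormalization, which is an assumption about how a typical constrained decoder behaves.

```python
from collections import Counter

TRAINING = ["aa", "ab", "ab", "ba"]
GRAMMAR = {"ab", "ba"}


def baseline(training, grammar):
    """Sample whole strings freely, then label anything off-grammar invalid."""
    dist = Counter()
    for s in training:
        dist[s if s in grammar else "invalid"] += 1 / len(training)
    return dict(dist)


def constrained(training, grammar):
    """Per-token masking: at each step, drop tokens that cannot lead to a
    grammar-valid string and renormalize over the survivors."""
    dist = {}

    def step(prefix, prob):
        if prefix in grammar:
            dist[prefix] = dist.get(prefix, 0.0) + prob
            return
        # Empirical next-token counts given the prefix so far.
        nxt = Counter(s[len(prefix)] for s in training
                      if s.startswith(prefix) and len(s) > len(prefix))
        # Keep only tokens that some grammar string still allows.
        allowed = {t: c for t, c in nxt.items()
                   if any(g.startswith(prefix + t) for g in grammar)}
        total = sum(allowed.values())
        for t, c in allowed.items():
            step(prefix + t, prob * c / total)

    step("", 1.0)
    return dist
```

`baseline(TRAINING, GRAMMAR)` gives invalid 0.25, ab 0.5, ba 0.25, and `constrained(TRAINING, GRAMMAR)` gives ab 0.75, ba 0.25.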

You've observed it in URLs, but I see it in JSON output as well. LLMs like to truncate long strings from time to time, but when they do they're more likely to provide invalid JSON (adding an ellipsis at the end of the fragment and doing nothing else). If that truncation starts to happen in a constrained environment, a period is a valid character in a long string, and eventually the grammar constraint will force a closing quote to appear. The result is still garbage, but instead of a detectable parse failure you have an undetectable corrupt field.
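To make the failure mode concrete, here's a tiny sketch with two hypothetical model outputs for the same truncated field. The strings are invented; the only real API used is `json.loads` from the standard library.

```python
import json

# Hypothetical outputs for a long string field that the model truncated.
unconstrained = '{"summary": "The report covers revenue, costs, and...'
constrained = '{"summary": "The report covers revenue, costs, and..."}'


def parses(text: str) -> bool:
    """Return True if text is syntactically valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

The unconstrained output fails to parse, so you can retry or raise an error. The constrained one parses fine, and nothing downstream can tell that the field was cut short.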

matheist · 2025-09-23T17:18:46 1758647926

Why do you think the constrained percentages are 0/75/25 and not e.g. 0/66/33? (i.e. same relative likelihood for valid outputs)

hansvm · 2025-09-23T20:14:09 1758658449

The constraint algorithm looks something like:

1. Choose the first token. If well-trained you have a 75% chance of choosing "a" and a 25% chance of choosing "b". Both are valid for that grammar.

2. Choose the second token. Regardless of your first token there is exactly one choice of grammar-adhering completion. You're now at a 75% chance of "ab" and a 25% chance of "ba" (mirroring the first-token chance).

For a toy example like this you obviously wouldn't use an LLM, but techniques like the one you're suggesting don't work because it's infeasible to enumerate all the valid outputs and re-weight and because greedy and semi-greedy strategies aren't anywhere near sufficient to side-step the issue. At the point in time you select the "a" token at a 75% probability it's game-over unless you re-run the LLM. You can't beam search either (doing so just changes which token you'll mis-predict, and even then only for very local grammar mistakes).

Looking at my JSON example from earlier, a beam search to avoid that re-weighting requires a depth of at least 4 (going as far as the ellipsis plus the stop token), and it won't suffice to just consider locally high-weight paths (you can probably hack something together for that one issue in particular which searches high weight paths and backtracks if they're found to be low-weight due to grammar mismatches, but that has its own bias unless you fan out to all 1e19 length-4 paths, and it won't solve the general problem regardless).

Phrased slightly differently, you don't have a compute_future_grammar_adhering_weight(token) function which is tractably computable, so you can't actually redistribute the 8.3% probability from the "a" branch to the "b" branch.
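For the toy example you can write that function by brute force, which shows what the "ideal" first-token distribution would be. This is only a sketch: the extra `training` and `grammar` parameters and the `ideal_first_token` helper are assumptions for illustration, and enumerating every valid completion is exactly the part that doesn't scale to a real model.

```python
TRAINING = ["aa", "ab", "ab", "ba"]
GRAMMAR = {"ab", "ba"}


def compute_future_grammar_adhering_weight(token, training, grammar):
    """Probability mass of grammar-valid full strings that start with token.
    Brute-force enumeration, so it is only feasible for toy cases."""
    valid = sum(1 for s in training if s in grammar and s.startswith(token))
    return valid / len(training)


def ideal_first_token(training, grammar):
    """First-token distribution that keeps the ratio between valid outputs."""
    tokens = sorted({s[0] for s in training})
    weights = {t: compute_future_grammar_adhering_weight(t, training, grammar)
               for t in tokens}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```

`ideal_first_token(TRAINING, GRAMMAR)` comes out at roughly 0.667 for "a" and 0.333 for "b", versus the 0.75 and 0.25 that per-token masking uses. The gap between 0.75 and 0.667 is the 8.3% in question.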

matheist · 2025-09-23T21:42:40 1758663760

Oh now I understand. I thought your ab and ba were single tokens (even though that doesn't make sense in context). Once you point out they're separate tokens, I follow you. Thank you!

Edit: that's a great example

Edit 2: even more fun: training data is [ab, ab, ba, bb, bb, bb]. Then constrained sampling flips the ba:ab likelihood from 1:2 to 2:1

hansvm · 2025-09-24T04:31:09 1758688269

Thanks :) My example is minimal, which is a little nice since I wind up re-deriving it in a hurry every time I need it. I do like the 1:2 to 2:1 symmetry though. Very elegant.

anentropic · 2025-09-23T16:17:07 1758644227

> let the llm generate text and then use a second llm to convert the text into the desired structured format

this sounds similar to what they discussed in the article with regard to "thinking" models, i.e. let them generate their <think>blah blah</think> preamble first before starting to constrain the output to a structured format