You could simply censor invalid tokens, but that relies on two assumptions.
1. There is always a valid next token.
2. This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.
The latter isn't too obvious, and may in fact be (very) false. Look up maze generation algorithms if you want some feeling for the effects this could have.
If you just want a quick argument, consider what happens if picking the most likely token would increase the chance of an invalid token further down the line to nearly 100%. By the time your token-picking algorithm has any effect it would be too late to fix it.
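To make that argument concrete, here's a toy sketch with hypothetical numbers (alphabet `{a, b}`, strings of length 2, `"aa"` forbidden) comparing the distribution you get from per-step masking against true rejection sampling. The probabilities are made up purely to exaggerate the effect:

```python
from itertools import product

# Toy "model" over alphabet {a, b}, strings of length 2.
# Numbers are hypothetical, chosen to make the divergence obvious.
P1 = {"a": 0.6, "b": 0.4}                       # P(first token)
P2 = {"a": {"a": 0.99, "b": 0.01},              # P(second token | first)
      "b": {"a": 0.5, "b": 0.5}}
VALID = {"ab", "ba", "bb"}                      # "aa" is forbidden

def model_prob(s):
    return P1[s[0]] * P2[s[0]][s[1]]

# Rejection sampling: the model's distribution restricted to VALID strings,
# renormalized. This is the "true" constrained distribution.
Z = sum(model_prob(s) for s in VALID)
rejection = {s: model_prob(s) / Z for s in VALID}

# Token masking: at each step, zero out tokens that lead nowhere valid and
# renormalize the survivors.
def masked_dist():
    dist = {}
    for t1, p1 in P1.items():
        ok = {t2: p2 for t2, p2 in P2[t1].items() if t1 + t2 in VALID}
        z = sum(ok.values())
        for t2, p2 in ok.items():
            dist[t1 + t2] = p1 * (p2 / z)
    return dist

masked = masked_dist()
print(f"P(ab) rejection = {rejection['ab']:.3f}")  # ~0.015
print(f"P(ab) masked    = {masked['ab']:.3f}")     # 0.600
```

The model "wants" to say `"aa"`, so under masking it commits to `a` and then gets strong-armed into the near-zero-probability continuation `b`, making `"ab"` forty times more likely than rejection sampling would produce it.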
Sorry, how could there not be a valid next token? Presumably your interface would generate a state machine with appropriate masking arrays, and iirc generally speaking all 256 byte choices are in the token list. There's no way to get stuck in a place where the JSON is invalid? Can you give an example?
If you want to be really clever about your picker, a deterministic result would blat out all the known possible strings.
For example, if you had an object with a defined set of properties, you could skip generating tokens for the property names entirely and just tokenize, e.g., `{"foo":"` (6-ish tokens) without ever passing through the LLM. As soon as an unescaped `"` arrives, you know the continuation must be `,"bar":"`, for example.
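As a rough sketch of that fast-forwarding idea for a hypothetical two-key schema (`generate_string_value` stands in for the actual LLM call; it's not a real API):

```python
# Sketch: "fast-forward" the forced parts of a fixed two-key JSON object.
# Only the string values require model calls; every structural character
# ({, ", :, ,, }) and both property names are fully determined in advance.
def generate_json(generate_string_value):
    out = ['{"foo":"']                         # forced prefix, zero model calls
    out.append(generate_string_value("foo"))   # model fills in the value only
    out.append('","bar":"')                    # after the closing '"', the
                                               # continuation is forced again
    out.append(generate_string_value("bar"))
    out.append('"}')
    return "".join(out)

print(generate_json(lambda key: f"<{key} value>"))
```

(A real implementation would also need to handle escaped `"` inside the values before deciding the string has ended.)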
> This greedy algorithm doesn't result in a qualitatively different distribution from a rejection sampling algorithm.
It absolutely will. But so will adding an extra newline in your prompt, for example. That sort of thing is part and parcel of how LLMs work.
Hmm, I think any example where it can get stuck is going to be a bit contrived, since really it's a question of how easy it is to recognize a valid prefix. Say for example you want the LLM to generate a valid chess match and it ends up in a situation with just the two kings left. If you're not careful with your definitions you could end up in a loop that never terminates.
That said if you know all valid prefixes in your language in advance then you can always realise when a token leaves no valid continuations.
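That prefix-tracking idea is usually implemented with an automaton. Here's a minimal hand-rolled sketch (a DFA for decimal numbers matching `[0-9]+\.[0-9]+`, purely illustrative): a candidate token is masked the moment the DFA dies partway through it, i.e. the moment it would leave no valid continuation.

```python
# Hand-built DFA for the regex [0-9]+\.[0-9]+, as a sketch of viable-prefix
# tracking. States: 0 = start, 1 = integer digits, 2 = just saw the dot,
# 3 = fractional digits (accepting).
DIGITS = set("0123456789")

def step(state, ch):
    if state == 0 and ch in DIGITS: return 1
    if state == 1 and ch in DIGITS: return 1
    if state == 1 and ch == ".":    return 2
    if state == 2 and ch in DIGITS: return 3
    if state == 3 and ch in DIGITS: return 3
    return None  # dead: no valid string can pass through this character

def token_allowed(state, token):
    # A multi-character token is allowed iff the DFA survives all of it.
    for ch in token:
        state = step(state, ch)
        if state is None:
            return False
    return True

vocab = ["12", "3.", ".5", "abc"]
print([t for t in vocab if token_allowed(0, t)])  # ['12', '3.']
```

Libraries like outlines do essentially this, but compile the regex/grammar to a DFA once and precompute, per DFA state, the mask over the whole token vocabulary.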
> It absolutely will. But so will adding an extra newline
A newline is less likely to dramatically drop the quality; a greedy method could easily end up driving itself into a dead end (if not grammatically, then semantically).
Say you want it to give a weather prediction consisting of a description followed by a tag, 'sunny' or 'cloudy', and your model is on its way to generating

```
{
  desc: "Strong winds followed by heavy rainfall.",
  tag: "stormy"
}
```
If it ever gets to the 's' in stormy it will be forced to pick 'sunny', even if that makes no sense in context.
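A character-level sketch of that failure mode against the tag set `{"sunny", "cloudy"}`:

```python
# Which characters does the mask permit after a given prefix of the tag?
TAGS = {"sunny", "cloudy"}

def allowed_chars(prefix):
    return {t[len(prefix)] for t in TAGS
            if t.startswith(prefix) and len(t) > len(prefix)}

print(allowed_chars(""))   # {'s', 'c'} -- both tags still reachable
print(allowed_chars("s"))  # {'u'} -- one 's' has committed it to "sunny"
```

The mask keeps the output grammatical, but the single character `s` (emitted because the model wanted "stormy") irrevocably commits it to the semantically wrong tag.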
There are quite a few open source implementations of this, e.g. https://github.com/outlines-dev/outlines