Grammar models are back, baby!

Nishant Shukla · February 3, 2026

Let’s gooooooo!

From the 1960s through the 1980s, we saw fundamental contributions to the study of context-free grammars:
1961 the CYK parsing algorithm was published (Younger 1967)
1968 the Earley parser was published (Earley 1968)
1974 the GLR parser was formulated (Lang 1974), and then in 1984 it was implemented (Tomita 1986)
(… skipping a bunch of other things …)
2023 llama.cpp popularized grammar-constrained decoding for local LLMs (Maurer 2023).
2023 JSON mode finally returns valid JSON (OpenAI 2023)
2025 state-machine grammars become mainstream (Mastra Team 2025). Context-Free Grammars (CFG) too: I built a CFG parser to constrain agentic function calls at my current employer.
2025 OpenAI introduced native support for constraining LLM outputs with context-free grammar production rules (OpenAI 2025).
2026 world models have been generating some buzz (Maloo 2026)

Structure is fashionable again. It feels like we’re one good abstraction away from grammars having their moment, and I think that abstraction is finally obvious.

The purpose of this post is to outline a framework to talk about grammar generation, inference, and scoring all at once in a clean interface, with the hope that this explanation will simplify future research efforts on this topic.

#A clean interface

Think of a grammar as a strongly-typed object.

For example, this is G:

grammar
  root  → setup tests teardown
  tests → test tests | test
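To make that concrete, here is a minimal Python sketch (my own assumption, not the post's actual implementation): represent a grammar as a mapping from each nonterminal to its list of alternative right-hand sides.

```python
# A grammar as a plain dict: nonterminal -> list of alternative
# right-hand sides. Terminals are any symbols without rules.
G = {
    "root":  [["setup", "tests", "teardown"]],
    "tests": [["test", "tests"], ["test"]],
}

def is_nonterminal(symbol, grammar):
    """A symbol is a nonterminal iff the grammar has productions for it."""
    return symbol in grammar
```

With this encoding, `test`, `setup`, and `teardown` are terminals simply because no rule expands them.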

#Top-down generation

You can sample from the grammar (assuming uniform distribution over production choices):

p = G.sample()  // returns a Parse

You may get a parse that looks a bit like this:

parse: (root setup (tests test (tests test (tests test))) teardown)

Then, you can render data from the parse:

x = G.render(p)  // returns an Observation

Rendering drops the latent structure.

observation: setup test test test teardown
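Both operations fit in a few lines of Python. This is a hedged sketch, assuming parses are nested `(symbol, children)` tuples and terminals are plain strings; the interface's real representation may differ.

```python
import random

G = {
    "root":  [["setup", "tests", "teardown"]],
    "tests": [["test", "tests"], ["test"]],
}

def sample(grammar, symbol="root", rng=random):
    """Top-down generation: expand `symbol`, choosing productions uniformly."""
    if symbol not in grammar:              # terminal: a leaf
        return symbol
    rhs = rng.choice(grammar[symbol])      # uniform over alternatives
    return (symbol, [sample(grammar, s, rng) for s in rhs])

def render(parse):
    """Drop the latent structure: collect terminal leaves, left to right."""
    if isinstance(parse, str):
        return [parse]
    _symbol, children = parse
    return [token for child in children for token in render(child)]
```

Because `tests` recurses with probability 1/2, renders of any length between `setup` and `teardown` are possible.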

Sampling is stochastic. So, you may instead get a parse that looks like this:

parse: (root setup (tests test (tests test (tests test (tests test (tests test (tests test (tests test (tests test (tests test))))))))) teardown)

Producing parses and rendering observations from those parses is the easy part.

#Bottom-up inference

The harder problem is parsing raw observations.

To demonstrate this, let’s make the grammar more interesting by introducing the concept of running tests in parallel.

grammar
  root  → setup tests teardown
  tests → test | seq | par
  seq   → tests then tests
  par   → tests and tests

Try parsing this observation:

observation: setup test then test and test teardown

We can call infer to discover which parses can explain this data:

P = G.infer(obs => obs == x)  // returns Dist<Parse> | null

There are two potential parses for the same sequence.

Candidate 1:

parse: test then (test and test), with seq at the root and par nested inside

Candidate 2:

parse: (test then test) and test, with par at the root and seq nested inside

That’s why infer(...) returns a probability distribution.
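To see where the distribution's support comes from, here is a toy Python enumerator. Hedged heavily: it is specialized to the `tests` fragment of this grammar (with setup/teardown stripped), and it is nothing like a production Earley or GLR parser.

```python
def parses(tokens):
    """Enumerate every tree the seq/par grammar assigns to `tokens`."""
    results = []
    if tokens == ["test"]:
        results.append("test")
    for i, token in enumerate(tokens):
        if token in ("then", "and"):       # try every split point
            label = "seq" if token == "then" else "par"
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    results.append((label, left, right))
    return results

# the rendered span between setup and teardown
candidates = parses(["test", "then", "test", "and", "test"])
```

`candidates` holds exactly the two trees: one with seq at the root, one with par at the root.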

You can imagine the probability distribution looks somewhat like this:

[chart: each candidate parse gets probability 1/2 under the uniform production prior]

Abstracted away is the fact that infer is performing a complicated parsing algorithm. It’s giving you access to the posterior:

\[ P(p | x) \propto P(x | p) P(p) \]

And lastly, we can compute a score using score:

s = G.score(p, x)  // log P(x | p) + log P(p)
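Given scores for each candidate parse, the posterior that infer returns can be recovered by normalizing in log space. A sketch, assuming scores are natural-log values:

```python
import math

def score(log_likelihood, log_prior):
    """The interface's score: log P(x | p) + log P(p)."""
    return log_likelihood + log_prior

def posterior(log_scores):
    """Normalize log-scores into probabilities (a log-sum-exp softmax)."""
    m = max(log_scores)                      # subtract max for stability
    weights = [math.exp(s - m) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]
```

For the two candidate parses above, under uniform production choice both priors come out to (1/3)^5 and both likelihoods are 1, so the posterior is 1/2 each.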

This interface is sufficient for some really powerful algorithms, as we’ll see below.

#Searching in observation space

G.infer(...) can parse partial data, like a prefix of a sequence, or the list of moves made so far in a game. In other words, it describes the next possible moves.

Monte Carlo Tree Search (MCTS), on the other hand, is an algorithm that finds an approximately optimal move.

Don’t get stuck on the name, though. Maybe a better name would have been “Statistical Tree Search.” The “Monte Carlo” part is a nod to the Monte Carlo district of Monaco, famous for its casinos. In the literature, “Monte Carlo” is a loaded term that also implies sampling, repeated simulations, balancing exploration vs. exploitation, and a few other relevant ideas, so unfortunately the name’s here to stay.

\[ \underbrace{ \text{Monte Carlo} \qquad \underbrace{ \text{Tree} \qquad \underbrace{\text{Search}}_{\substack{\\[1.0ex]\text{algorithm}}} }_{\substack{\\[1.3ex]\text{uses a tree data structure}}} }_{\substack{\\[1.6ex]\text{named after Monte Carlo sampling}}} \]

In the algorithm, you construct a tree from scratch, where each path from the root represents the sequence of actions you may take.

Example

[diagram: the search tree across iterations 1–3, growing by one node each iteration from a bare Root]

The algorithm iterates many times. In each iteration, we build up the tree:

  1. Pick a promising starting path
  2. Roll out fully to see where it could end
  3. Tally up the results
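Step 1, picking a promising path, is usually driven by an upper-confidence-bound rule (UCB1 in the classic formulation). This is my assumption about how selection works here, not something the post specifies:

```python
import math

def ucb1(value, visits, parent_visits, c=1.4):
    """Trade off exploitation (`value`) against exploration: rarely
    visited nodes get a large bonus, and unvisited nodes win outright."""
    if visits == 0:
        return float("inf")
    return value + c * math.sqrt(math.log(parent_visits) / visits)
```

During selection you descend the tree, at each node following the child with the highest `ucb1` score.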

So, to get started, we’ll initialize a bare-bones tree.

// start with just the root node
tree = Tree([
  Node({
    // initial trivial partial parse
    parse: p0,
    // observation
    obs: [],
    // frequency of usage (will be updated by `backprop`)
    visits: 0,
    // mean utility estimate (will be updated by `backprop`)
    value: 0.0,
  })
])

The tree has some typical methods, like add(parent, node). And some atypical methods that are only relevant for MCTS, like select(...) and backprop(...).

Let’s take everything we’ve learned, using infer, sample, render, and score to implement MCTS. The following code will be repeated in a loop many times:

// 1. pick a promising starting path
path = tree.select(node => G.score(node.parse, node.obs))
node = path.at(-1)

// 2. roll out fully to see where it could end
P = G.infer(obs => startsWith(obs, node.obs))
p = P.sample({ reject: node.children.map(c => c.obs) })
x = G.render(p)
nextMove = x.at(node.obs.length)
childObs = [...node.obs, nextMove]
child = tree.add(node, Node(
  G.infer(obs => startsWith(obs, childObs)).sample(),
  childObs
))

// 3. tally up the results
u = utility(x)
tree.backprop([...path, child], u)

I know, I threw in a bunch of undefined things in there, such as tree.select and tree.backprop, but you can imagine an API that lets you vibe-mcts 😎.
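For completeness, here is one way tree.select and tree.backprop could look. These are hypothetical Python helpers sketched by me, assuming UCB-based selection and incremental-mean value updates:

```python
import math

class Node:
    def __init__(self, parse=None, obs=()):
        self.parse, self.obs = parse, tuple(obs)
        self.children, self.visits, self.value = [], 0, 0.0

def select(root, c=1.4):
    """Descend from the root, at each level taking the best-UCB child."""
    path = [root]
    while path[-1].children:
        parent = path[-1]
        def ucb(n):
            if n.visits == 0:
                return float("inf")        # always expand unvisited nodes
            return n.value + c * math.sqrt(math.log(parent.visits) / n.visits)
        path.append(max(parent.children, key=ucb))
    return path

def backprop(path, utility):
    """Update visit counts and running-mean values along the path."""
    for node in path:
        node.visits += 1
        node.value += (utility - node.value) / node.visits
```

The running-mean update keeps `value` equal to the average utility of every rollout that passed through the node, which is what UCB expects.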

But c’mon, isn’t that so cool!? You’re basically 80% there with just this interface!

#Searching in grammar space

As mentioned earlier, given a grammar model, you can sample parses, render observations, infer which parses explain an observation, and score them.

But how do you learn a grammar in the first place?

Similar to the algorithm above, let’s start with a root node. This time, node.obs will be grammar objects:

// start with just the root node
tree = Tree([
  Node({
    // initial trivial partial parse
    parse: p0,
    // observation, trivial grammar object
    obs: G0,
    // frequency of usage (will be updated by `backprop`)
    visits: 0,
    // mean utility estimate (will be updated by `backprop`)
    value: 0.0,
  })
])

Define a set of edit actions one can perform on grammars.
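For example (my illustrative choice of edit set; the post doesn't pin one down), the actions could be non-destructive functions from grammar to grammar, using the dict-of-rules encoding:

```python
import copy

def add_rule(grammar, lhs, rhs):
    """Introduce a new alternative `lhs -> rhs`."""
    g = copy.deepcopy(grammar)
    g.setdefault(lhs, []).append(list(rhs))
    return g

def remove_rule(grammar, lhs, index):
    """Delete the index-th alternative of `lhs`."""
    g = copy.deepcopy(grammar)
    del g[lhs][index]
    if not g[lhs]:
        del g[lhs]                 # drop nonterminals with no rules left
    return g

def rename_symbol(grammar, old, new):
    """Rename a symbol everywhere it appears."""
    return {
        (new if lhs == old else lhs): [
            [new if s == old else s for s in alt] for alt in alts
        ]
        for lhs, alts in grammar.items()
    }
```

Each action returns a fresh grammar, so nodes in the search tree can share ancestors without aliasing.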

Do you see where I’m going with this?

Run the same MCTS algorithm above, but instead of searching through the space of observations constrained by grammar actions, search through the space of grammar objects constrained by grammar-edit actions.

For example, let’s say you’d like to learn how to make a peanut butter and jelly (PB&J) sandwich. Let’s establish some atomic concepts:

The following grammar generates instructions on making PB&J sandwiches:

PB&J Grammar

Sampling from that grammar will produce a sequence of words that may at first glance sound like gibberish, but I’ve been careful to ensure it compiles to valid Forth (Moore 1970) code. Hit the “Sample & Run” button to see for yourself!

Searching in grammar space means starting from a trivial grammar, and performing a sequence of edits to discover an optimal grammar. And, it’s harder than it sounds.

Can you edit this grammar below into the one above?

Build a Grammar

Click the edit operations above to transform the grammar. I gave up after 20 seconds.

The goal of the MCTS algorithm is to find the sequence of actions that will get us there, by mutating the grammar to maximize a pre-defined utility.
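What might that utility look like? One plausible choice (entirely my own assumption): reward coverage of the target observations and lightly penalize grammar size, MDL-style. Here `parses_fn` is any callable returning the parses, possibly none, that a grammar assigns to an observation:

```python
def utility(grammar, targets, parses_fn, size_penalty=0.01):
    """Fraction of target observations the grammar can parse,
    minus a small penalty proportional to total grammar size."""
    covered = sum(1 for obs in targets if parses_fn(grammar, obs))
    size = sum(len(alt) for alts in grammar.values() for alt in alts)
    return covered / len(targets) - size_penalty * size
```

The size penalty is what pushes the search toward compact grammars that generalize, rather than ones that memorize the training sequences.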

#Grammar synthesis

In the demo below, click “Start Search” to watch MCTS discover a grammar that matches a set of target sequences:

Grammar Search

The search uses MCTS over grammar-edit actions and scores candidates.

After you run the search, pay special attention to the “novel idea” section. It’ll show you an observation that doesn’t exist in the training data. You can hit run to see the idea play out.

My model learned that you can make a double jelly sandwich 🤣

The beauty of grammar synthesis is that the learned model’s divergence from the training data is a feature, not a bug!

For a deeper dive into this approach, see my dissertation (Shukla 2019).

#References

Earley, Jay. 1968. “An Efficient Context-Free Parsing Algorithm.” PhD thesis, Carnegie Mellon University.
Lang, Bernard. 1974. “Deterministic Techniques for Efficient Non-Deterministic Parsers.” Automata, Languages and Programming, Lecture notes in computer science, 14: 255–69.
Maloo, Ankit. 2026. “World Models.” https://ankitmaloo.com/world-models/.
Mastra Team. 2025. “State-Machine Grammar for Agent Behavior.” May 2025. https://mastra.ai/blog/changelog-2025-05-07.
Maurer, Iwan. 2023. “Grammar-Constrained Decoding for Local LLMs.” 2023. https://www.imaurer.com.
Moore, Charles H. 1970. “FORTH: A New Way to Program a Mini-Computer.” In Astronomical Society of the Pacific Conference Series, 5:8–10.
OpenAI. 2023. “New Models and Developer Products Announced at DevDay.” November 6, 2023. https://openai.com/index/new-models-and-developer-products-announced-at-devday/.
———. 2025. “GPT-5 New Parameters and Tools: Context-Free Grammar Support.” August 2025. https://developers.openai.com/cookbook/examples/gpt-5/gpt-5_new_params_and_tools.
Shukla, Nishant. 2019. “Utility Learning, Non-Markovian Planning, and Task-Oriented Programming Language.” PhD thesis, University of California, Los Angeles. https://escholarship.org/uc/item/2nj5n4f1#page=83.
Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, et al. 2017. “Mastering the Game of Go Without Human Knowledge.” Nature 550: 354–59. https://doi.org/10.1038/nature24270.
Tomita, Masaru. 1986. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers.
Younger, Daniel H. 1967. “Recognition and Parsing of Context-Free Languages in Time n³.” Information and Control 10 (2): 189–208.