The Nine Design Axes: Verified, and a Tenth on Enforcement Locus

Research note. Drafted 2026-05-17 from the audit work that surfaced in the mid-2026 lit pass. Working draft. Companion to the per-artifact audit table.

TL;DR

A fresh audit of the LLM-oriented programming-language design field against the nine design axes proposed in our working methodology paper confirms the nine as a coherent set for executable language artifacts. The nine are formatting compactness, AST minimalism, vocabulary redesign, morphological density, BPE-tokenizer alignment, grammar regularity, ambiguity load, human-readability priority, and ecosystem maturity. The audit also surfaces a tenth axis the existing nine cannot describe: enforcement_locus, the point in the LLM pipeline where the language's contract is checked. Five artifacts force the addition: XGrammar, Grammar-Aligned Decoding, Type-Constrained Codegen, Quasar, and Pangolin. A weaker second candidate, artifact-kind, is recommended for the methodology paper's schema layer rather than the design-axis cluster.

The nine axes, verified

The nine axes hold up well as a set for executable language artifacts. The 22 artifacts in the survey that propose their own language, drawn from the "whole executable languages", "markup and data formats", and "host-embedded DSLs" categories of the field survey, score on at least seven of the nine without strain. The ~15 position papers and empirical studies in the survey all score "n/a" on every axis, but that's a structural observation about artifact kind, not a defect in the axes.

For the axis definitions themselves, see §5 of the working methodology paper, whose source is in the Babel github repo. Briefly:

Formatting compactness. Whitespace and comments stripped; SimPy high, Python low.
AST minimalism. Small versus rich abstract syntax tree; Quasar high, Python low.
Vocabulary redesign. Keywords chosen with LLM tokenizers in mind; Pel high, SimPy low.
Morphological density. Semantic distinctions in surface morphology; Inflexión high, Python low.
BPE-tokenizer alignment. Surface tokens chosen to match the LLM's existing vocab; NanoLang high, Inflexión low.
Grammar regularity. Uniform versus irregular grammar; Pel high, Inflexión low by design.
Ambiguity load. Semantic ambiguity the LLM must resolve from context; NL-derived high, typed formal low.
Human-readability priority. Where the language sits on the axis from human to machine readability; Python high, B-IR low by design.
Ecosystem maturity. Stdlib, package manager, community, decades of code; Python high, any new artifact low.

What this audit changes about the nine: nothing about the definitions, but it gives a clearer picture of how groups of artifacts cluster on them. SimPy / Token Sugar / ShortCoder / ShortenDoc cluster at high compactness and low on everything else, since they're rewriters. Pel / Quasar / NanoLang / Tokenese / B-IR cluster at high compactness, high AST minimalism, high vocabulary redesign, high BPE alignment, high grammar regularity, and low on everything else, since they're greenfield LLM-target languages. The markup family (TOON, LLMON, PDL) clusters differently: its members are not executable, so AST minimalism is "n/a", though an analogue based on schema complexity would apply.
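The two executable clusters, and the all-"n/a" rows that position papers produce, can be written down as nine-axis profiles. The following is a minimal sketch, not the schema of the per-artifact audit table: the names Score, AXES, profile, and scored_axes are hypothetical, and the coarse high / low / n/a scale is an assumption that simplifies whatever scale the table actually uses. The markup family is left out because only its AST-minimalism entry is stated here.

```python
from enum import Enum


class Score(Enum):
    """Coarse three-way scale (an assumption; the audit table may be finer)."""
    LOW = "low"
    HIGH = "high"
    NA = "n/a"


# The nine axes, in the order they are listed in this note.
AXES = (
    "formatting_compactness",
    "ast_minimalism",
    "vocabulary_redesign",
    "morphological_density",
    "bpe_tokenizer_alignment",
    "grammar_regularity",
    "ambiguity_load",
    "human_readability_priority",
    "ecosystem_maturity",
)


def profile(high=(), na=()):
    """Build a nine-axis row: HIGH for `high`, NA for `na`, LOW elsewhere."""
    unknown = (set(high) | set(na)) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    return {
        axis: Score.HIGH if axis in high else Score.NA if axis in na else Score.LOW
        for axis in AXES
    }


# Rewriters: high compactness, low on everything else.
REWRITER = profile(high={"formatting_compactness"})

# Greenfield LLM-target languages: five highs, low on everything else.
GREENFIELD = profile(high={
    "formatting_compactness",
    "ast_minimalism",
    "vocabulary_redesign",
    "bpe_tokenizer_alignment",
    "grammar_regularity",
})

# Position papers and empirical studies: "n/a" on every axis.
POSITION_PAPER = profile(na=set(AXES))


def scored_axes(row):
    """Count the axes on which a row takes a real position (not n/a)."""
    return sum(1 for score in row.values() if score is not Score.NA)
```

Under this encoding, scored_axes distinguishes the language clusters (nine real positions each) from the position-paper rows (zero), which is the structural observation about artifact kind made earlier.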

The tenth axis: `enforcement_locus`

A clean way to test whether a tenth axis is justified is the two-or-more-artifacts test: does at least one pair of artifacts vary on a dimension the existing axes cannot describe, in a way that's decisive for evaluating their design? The audit finds five such artifacts, so a tenth axis is justified.

Definition. Where in the LLM pipeline is the language's contract checked or enforced?

author-time. Type-checker, linter, or validator runs against the human or LLM author's source before generation finishes. Examples: Python's PEP 484, Pel's grammar, NanoLang's test-blocks, Pangolin's effect-handler types.
decode-time. Constraint is encoded as a grammar or type automaton against which the decoder is constrained at every token. Examples: XGrammar, Grammar-Aligned Decoding, Type-Constrained Codegen, LMQL's where-clauses.
parse-time. Generated output is parsed and rejected or repaired after generation, before execution. Examples: DSPy signature validation, SGLang's compressed FSM checks, APPL's parse-then-call pattern.
runtime. Contract is checked only on execution. Examples: Quasar's conformal-prediction uncertainty bounds, Pangolin's selection-monad assertions.

The axis is set-valued, not single-valued. A given artifact may sit at multiple loci simultaneously: Quasar enforces at parse-time and runtime; XGrammar at decode-time and parse-time; APPL at author-time and parse-time. This is the same set-valued observation surfaced in an earlier audit pass as schema-extension E4.
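One way to make the set-valued reading concrete is a bit-flag type, where an artifact's position is the union of the loci it enforces at. This is a sketch under assumptions: EnforcementLocus, LOCI, and enforces_at are hypothetical names, and only the three assignments stated in the paragraph above are encoded.

```python
from enum import Flag, auto


class EnforcementLocus(Flag):
    """The four canonical loci; values combine with | to form a set."""
    AUTHOR_TIME = auto()
    DECODE_TIME = auto()
    PARSE_TIME = auto()
    RUNTIME = auto()


# Assignments as stated in this note; nothing beyond them is assumed.
LOCI = {
    "Quasar": EnforcementLocus.PARSE_TIME | EnforcementLocus.RUNTIME,
    "XGrammar": EnforcementLocus.DECODE_TIME | EnforcementLocus.PARSE_TIME,
    "APPL": EnforcementLocus.AUTHOR_TIME | EnforcementLocus.PARSE_TIME,
}


def enforces_at(artifact, locus):
    """True if the artifact's locus set includes the given locus."""
    return bool(LOCI[artifact] & locus)


# Set intersection: the locus Quasar and XGrammar share.
shared = LOCI["Quasar"] & LOCI["XGrammar"]
```

A single-valued enum would force a choice between, say, parse-time and runtime for Quasar; the flag representation keeps both and still supports membership and intersection queries.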

Why this isn't already in the nine. The nine axes describe properties of a language's surface form and ecosystem. None of them describes where in the LLM pipeline the language's contract gets checked. XGrammar, Grammar-Aligned Decoding (GAD), and Type-Constrained Codegen are essentially undescribed by the nine; they don't propose a language at all, only a way of constraining what the LLM emits in some other language. Quasar's lambda-calculus core is one design choice; its enforcement locus via conformal prediction is a different design choice, and the second one is what most distinguishes Quasar from other lambda-calculus cores. Pangolin's split between selection monad and effect handler is structurally an enforcement-locus claim before it is anything else.
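The mechanical half of the two-or-more-artifacts test can be sketched as a search for pairs that the existing axes cannot tell apart but the candidate dimension can. The function name separating_pairs is hypothetical, and the sketch assumes the three decoder-boundary systems score "n/a" on all nine axes, which is how this note describes their rows. Whether the difference is decisive for evaluating the designs remains a judgment the code does not capture.

```python
from itertools import combinations


def separating_pairs(existing, candidate):
    """Pairs with identical existing-axis rows but different candidate values."""
    return [
        (a, b)
        for a, b in combinations(sorted(existing), 2)
        if existing[a] == existing[b] and candidate[a] != candidate[b]
    ]


# Assumption: decoder-boundary systems are "n/a" on every one of the nine axes.
NA_ROW = ("n/a",) * 9

existing = {
    "XGrammar": NA_ROW,
    "Grammar-Aligned Decoding": NA_ROW,
    "Type-Constrained Codegen": NA_ROW,
}

# Candidate dimension: enforcement locus, as assigned in this note.
candidate = {
    "XGrammar": frozenset({"decode-time", "parse-time"}),
    "Grammar-Aligned Decoding": frozenset({"decode-time"}),
    "Type-Constrained Codegen": frozenset({"decode-time"}),
}

pairs = separating_pairs(existing, candidate)
# pairs == [("Grammar-Aligned Decoding", "XGrammar"),
#           ("Type-Constrained Codegen", "XGrammar")]
```

A non-empty result means at least one pair varies on the candidate dimension while being indistinguishable on the nine, which is the precondition the test asks for.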

What changes on the per-artifact table. Artifacts at the decoder boundary that currently appear as rows of "n/a" get a real position to occupy. LMQL, DSPy, and Pangolin separate from one another not on the nine surface axes but on the enforcement_locus values they pick: decode-time plus author-time for LMQL, parse-time for DSPy, runtime plus author-time for Pangolin. See the per-artifact audit table for the full re-scoring with the tenth column.

The eleventh candidate that doesn't get promoted: artifact-kind

A second, weaker gap is that the nine axes assume "a language", but the field contains at least seven artifact kinds: whole languages, markups, transpilers, host-embedded DSLs, decoder-boundary constraint systems, position papers, and empirical studies. The nine axes degrade gracefully into "n/a" for the non-language categories.

The temptation is to add artifact-kind as an eleventh axis. The audit recommends against this, on the grounds that artifact-kind is a categorical classifier, a way of carving the field into coherent sub-fields, rather than a continuous design choice. A new author building an LLM-oriented language makes design moves on the nine or ten axes; they do not pick "artifact kind" the way they pick "BPE alignment". They pick what they're building, and the design axes follow.

The right place for artifact-kind is the methodology paper's schema layer, not the design-axis cluster. An earlier audit pass had this as schema-extension E4; that placement still seems right.

What this means

For the methodology paper, the immediate consequence is that §5 grows from nine to ten design axes, with enforcement_locus defined as set-valued over four canonical values. The cluster on LLM-friendliness in §6 already includes ambiguity-load handling that overlaps with parse-time and decode-time enforcement; the framing of §6 carries forward without restructuring, but its three design parameters can now be more precisely stated as BPE-token alignment, token cost per semantic unit, and enforcement locus.

For the empirical cascade in §7, the addition of enforcement_locus means a candidate language has ten dimensions to characterise rather than nine; the Pareto-frontier analysis in §10, under its conditional framing for Installment 08, inherits the change.