Per-Artifact Audit Table: 51 Artifacts Across the 9+1 Axes
Research note. Drafted 2026-05-17 from the mid-2026 lit pass. Working draft. The table will be re-run once LMPL 2026 results land; notification is due in August 2026. Companion to the design-axes verification note.
TL;DR
This note holds the raw audit data behind the field survey and the axes verification: per-artifact scores on the nine original design axes plus the tenth, enforcement_locus, that the audit surfaced. The 22 artifacts that propose a language of their own get real scores. The other 29 (position papers, empirical studies, constraints at the decoder boundary, benchmarks, surveys, and the CFP) score "n/a" on the original nine, though the cluster at the decoder boundary gets real values on the tenth axis. The justifications section after the table covers the non-obvious scoring decisions.
Scoring legend
- low / mid / high: position on each design axis.
- n/a: the axis cannot apply to this artifact kind, such as a position paper, benchmark, or empirical study.
- For enforcement_locus, the value is a set drawn from {author, decode, parse, runtime}; artifacts enforced at more than one locus list each of them.
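A minimal sketch of how the legend could be encoded, assuming Python. The names Level, Locus, parse_level, and parse_loci are hypothetical and do not come from the audit tooling; representing "n/a" as None (and as an empty set for enforcement_locus) is likewise an assumption made for illustration.

```python
from enum import Enum
from typing import FrozenSet, Optional


class Level(Enum):
    """Position on one of the nine original design axes."""
    LOW = "low"
    MID = "mid"
    HIGH = "high"


class Locus(Enum):
    """One member of the enforcement_locus set."""
    AUTHOR = "author"
    DECODE = "decode"
    PARSE = "parse"
    RUNTIME = "runtime"


def parse_level(cell: str) -> Optional[Level]:
    """Return the Level for a cell, or None when the axis is n/a (assumed encoding)."""
    cell = cell.strip()
    return None if cell == "n/a" else Level(cell)


def parse_loci(cell: str) -> FrozenSet[Locus]:
    """Return the set of loci in a cell; n/a becomes the empty set (assumed encoding)."""
    cell = cell.strip()
    if cell == "n/a":
        return frozenset()
    return frozenset(Locus(part.strip()) for part in cell.split(","))
```

Using a set rather than a list for the loci means "parse, runtime" and "runtime, parse" compare equal, which matches the legend's description of the value as a set.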
Axis abbreviations
Compact = formatting compactness; AST min = AST minimalism; Vocab = vocabulary redesign; Morph = morphological density; BPE = BPE-tokenizer alignment; Gram reg = grammar regularity; Ambig = ambiguity load; HumRead = human-readability priority; Ecosys = ecosystem maturity; Enforce = enforcement_locus.
The table
| Artifact | Compact | AST min | Vocab | Morph | BPE | Gram reg | Ambig | HumRead | Ecosys | Enforce |
|---|---|---|---|---|---|---|---|---|---|---|
| SimPy | high | low | low | low | mid | mid | mid | mid | high† | parse |
| Token Sugar | high | n/a‡ | mid | low | high | n/a‡ | mid | mid | high† | parse |
| ShortCoder | high | n/a‡ | mid | low | mid | n/a‡ | mid | mid | high† | parse |
| ShortenDoc | high | n/a‡ | n/a‡ | n/a‡ | mid | n/a‡ | mid | mid | high† | parse |
| Pel | high | high | high | low | high | high | low | low | low | author |
| Quasar | mid | high | mid | low | mid | high | low | low | low | parse, runtime |
| LLMON | high | n/a¶ | high | low | mid | mid | low | mid | low | parse |
| NanoLang | high | high | high | low | high | high | low | mid | low | author |
| B-IR | high | high | high | low | high | high | low | low | low | author |
| Tokenese | high | n/a¶ | high | low | high | high | low | mid | low | n/a |
| TOON | high | n/a¶ | high | low | high | high | low | mid | low | parse |
| PDL | mid | high | mid | low | mid | high | mid | high | low | parse |
| Plang | low | mid | low | low | mid | mid | high | high | low | parse |
| LMQL | mid | mid | low | low | mid | high | mid | high | low | decode, author |
| DSPy | low | mid | low | low | mid | high | low | high | mid | parse |
| SGLang | mid | mid | low | low | mid | high | low | high | mid | decode, parse |
| APPL | low | mid | low | low | mid | high | mid | high | low | author, parse |
| Pangolin | mid | high | mid | low | mid | high | low | mid | low | runtime, author |
| Wang effect-handler | mid | high | mid | low | mid | high | low | mid | low | runtime, author |
| CNL-P | low | mid | mid | low | mid | high | low | high | low | parse |
| Inflexión | low | low | low | high | low | low | mid | mid | low | author |
| XGrammar | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | decode, parse |
| Grammar-Aligned Decoding | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | decode |
| Type-Constrained Codegen | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ | low** | n/a§ | n/a§ | decode |
The remaining 27 artifacts in the survey (position papers, empirical studies, benchmarks, surveys, the CFP, the unverified mlds_2025 placeholder, and the original Iverson-1980 reference) score "n/a" on every design axis and are therefore left out of the table. They're catalogued in the field survey under their structural category.
Footnotes
- † Ecosystem score inherited from the base language Python, not the artifact's own.
- ‡ Rewriters based on rules or shorthand operate over an existing language's surface; they do not redefine an AST, vocabulary, or grammar regularity in their own right.
- § Decoder-boundary constraint system, empirical study, position paper, benchmark, CFP, or typology reference; does not propose a language artifact of its own, so the design axes do not apply directly. Listed for completeness.
- ¶ Markup or data-notation language; has a schema, not an executable AST.
- ** Type-Constrained Codegen does not define a language but reduces ambiguity load of an existing language's surface at decode time; the score reflects its effect on the target language's ambiguity, not on a language of its own.
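The footnote markers are fused onto the cell values in the table, so any script that reads the table has to separate them first. One possible way to do that is sketched below in Python; split_marker and parse_row are hypothetical helpers, and the marker list is simply the five symbols used in the footnotes above.

```python
# Footnote markers used in the table. "**" is listed first so a two-character
# marker is never mistaken for something shorter.
MARKERS = ("**", "†", "‡", "§", "¶")


def split_marker(cell):
    """Split a table cell into (value, marker); marker is None if absent."""
    cell = cell.strip()
    for marker in MARKERS:
        if cell.endswith(marker):
            return cell[: -len(marker)].strip(), marker
    return cell, None


def parse_row(line):
    """Split one pipe-delimited table row into (artifact name, list of cells)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return cells[0], [split_marker(c) for c in cells[1:]]


# Row copied from the table above.
name, cells = parse_row(
    "| Type-Constrained Codegen | n/a§ | n/a§ | n/a§ | n/a§ | n/a§ "
    "| n/a§ | low** | n/a§ | n/a§ | decode |"
)
```

Keeping the marker alongside the value, rather than discarding it, preserves the distinction between an inherited score (†) and one the artifact earned itself.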
Per-artifact justifications for non-obvious scoring
SimPy. Formatting compactness high because the entire design move is stripping whitespace and formatting. AST minimalism low because the AST is identical to Python's. Vocabulary is unchanged, so vocab redesign is low. BPE alignment mid because Python's existing keywords were already designed against English-ASCII frequencies. Human-readability drops from Python's high to mid because the PEP 8 cues are gone. Enforcement locus parse: SimPy's invariant is that the generated stripped form parses back to valid Python.
Pel. Designed explicitly around constrained LLM generation: a small, regular S-expression grammar gives high gram-reg, uniform syntax gives high AST min, a keyword vocabulary chosen to be friendly to tokenizers gives high vocab redesign, and ambiguity is low. Human readability is deliberately mid to low (scored low in the table), since Lisp style is an acquired taste. Ecosystem near zero. Enforcement locus author: Pel's discipline is "the LLM emits Pel and the Pel parser/checker validates before execution".
Quasar. The lambda-calculus core gives high AST min and the regular grammar gives high gram-reg. There is some vocabulary redesign, but the emphasis is on a smaller AST rather than keyword choice, hence mid vocab. Ambiguity is low because of the type system plus uncertainty bounds from conformal prediction. Transpilation from Python to Quasar allows partial reuse of Python's ecosystem, but Quasar itself is fresh, so ecosystem is low. Enforcement locus parse plus runtime: the type system is at the parse locus; the uncertainty bounds from conformal prediction are at the runtime locus.
LLMON. Formatting compactness high versus JSON/Markdown by design. Vocab redesign high through custom tags. LLMON is markup, so AST minimalism is n/a; the relevant analogue would be schema complexity, which is mid. Enforcement locus parse.
NanoLang. Practitioner mirror of Pel: prefix notation, mandatory test blocks, transpiles to C. High compactness, high AST min, high vocab redesign, high BPE alignment since prefix operator tokens are single BPE tokens, high gram-reg, low ambiguity, low ecosystem. Enforcement locus author: tests must compile and pass before downstream use.
B-IR. The HN comment thread describes design moves consistent with high compactness, high vocab redesign, high BPE alignment. The artifact is intentionally not readable by humans, so low HumRead, and has no ecosystem. Enforcement locus author: validation happens at the boundary of the LLM author before downstream consumption.
Tokenese. Constructed natural language, not a PL; scored where the axes still apply by analogy: high vocab redesign, high BPE alignment, high gram-reg, low morph density since it is deliberately analytic, zero ecosystem. The explicit goal, in the style of Iverson, is to "drive ambiguity to zero". Enforcement locus n/a since there is no execution context.
TOON. Behaves like LLMON on the markup axes: high compactness versus JSON, high BPE alignment because tabular row encoding lets identifiers tokenise once, low ambiguity, mid human-readability since it is still legible. Enforcement locus parse.
PDL. The YAML base keeps the format itself from being compact (scored mid in the table), but PDL gets high AST min through a small set of block types, high gram-reg, and high human-readability; it is closer to the profile of DSPy as "framework not language". Enforcement locus parse.
Plang. Blends NL with control-flow markers, so ambiguity load is high by design, human-readability is high as the explicit design goal, formatting compactness is low since NL is verbose, vocab redesign is low since it uses ordinary English. Enforcement locus parse.
LMQL / DSPy / SGLang / APPL / Pangolin / Wang effect-handler. All artifacts in the family of "language that runs LLMs". They don't redesign keyword surface, so vocab redesign is low (mid in the table for Pangolin and the Wang paper); they inherit a host language's compactness profile; and their key dimensions of variation are enforcement locus and effect system rather than alignment with the tokenizer. They score similarly on the original nine axes despite expressing very different design ideas, one of the signals that the nine alone don't characterise this family. On the tenth axis they finally separate: LMQL is decode plus author, DSPy is parse, SGLang is decode plus parse, APPL is author plus parse, and Pangolin and the Wang paper are both runtime plus author.
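How much this family varies on each axis can be checked directly against its six table rows. The sketch below, in Python, counts the distinct values each column takes across the family; parse_rows and distinct_values are hypothetical helper names, and the rows are transcribed from the table above.

```python
AXES = ["Compact", "AST min", "Vocab", "Morph", "BPE",
        "Gram reg", "Ambig", "HumRead", "Ecosys", "Enforce"]

# Rows copied from the table for the "language that runs LLMs" family.
ROWS = """\
| LMQL | mid | mid | low | low | mid | high | mid | high | low | decode, author |
| DSPy | low | mid | low | low | mid | high | low | high | mid | parse |
| SGLang | mid | mid | low | low | mid | high | low | high | mid | decode, parse |
| APPL | low | mid | low | low | mid | high | mid | high | low | author, parse |
| Pangolin | mid | high | mid | low | mid | high | low | mid | low | runtime, author |
| Wang effect-handler | mid | high | mid | low | mid | high | low | mid | low | runtime, author |
"""


def parse_rows(text):
    """Map artifact name -> {axis: value}; Enforce becomes a frozenset of loci."""
    table = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        scores = dict(zip(AXES, cells[1:]))
        scores["Enforce"] = frozenset(
            part.strip() for part in scores["Enforce"].split(",")
        )
        table[cells[0]] = scores
    return table


def distinct_values(table):
    """Count how many different values each axis takes across the rows."""
    return {axis: len({row[axis] for row in table.values()}) for axis in AXES}


family = parse_rows(ROWS)
distinct = distinct_values(family)
for axis in AXES:
    print(f"{axis:9s} {distinct[axis]}")
```

On the rows as transcribed, no original column takes more than two distinct values across the six artifacts, and Morph, BPE, and Gram reg each take only one, while Enforce takes five distinct sets (Pangolin and the Wang paper share one).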
CNL-P. Controlled NL means low compactness since NL is verbose, high human-readability, low ambiguity from a strict grammar, and limited vocab redesign since it uses ordinary English (scored mid in the table). Enforcement locus parse.
XGrammar, Grammar-Aligned Decoding, Type-Constrained Codegen. These three appear as rows of n/a on the original nine, by construction; they don't propose languages of their own. On the tenth axis they're the central exemplars: XGrammar at decode plus parse, GAD at decode, Type-Constrained Codegen at decode with a side-effect on ambiguity for the target language. Without the tenth axis there is no way to describe them as members of the LLM-oriented PL design space at all.
Inflexión. Included as a morphologically rich data point on the design table. Bytes per op is essentially unchanged from Python, giving low compactness; the morphology is dense, not the formatting. AST minimalism low because of rich expression syntax. Vocabulary uses ordinary Spanish, so low vocab redesign. Morphological density is the language's central design move, so high. BPE alignment low because Spanish morphology fragments under BPE designed for English-ASCII. Grammar regularity low since Spanish irregularity is carried over. Ambiguity load mid. Human-readability mid: high for Spanish speakers, low for non-Spanish speakers. Ecosystem low. Enforcement locus author.
Artifacts catalogued but not scored
The 27 artifacts not in the table fall into the categories of position paper, benchmark, and empirical study detailed in the field survey. All score "n/a" on every original axis because they don't propose a language of their own. Most also have no enforcement locus to score: TokDrift describes drift and MorphBPE measures tokenizer behavior, for example, and neither enforces anything.
Full list, cross-referenced to the field-survey note: Hidden Cost by Pan, Let Me Speak Freely by Tam, LLMs Love Python by Twist, Scaling Laws Code, HumanEval-XL by Peng, MorphBPE, TokDrift, Hannecke format benchmark, Iverson 1980 as anchor reference, Cao NL-aligned, Kore Medium essay, kirancodes "Mediocrity", Du-Wang-Wang at LMPL 2025, Bayazıt-Li proof, Reasoning as a Resource at LMPL 2025, The Modular Imperative at LMPL 2025, Vibe Reasoning at LMPL 2025, CG-Bench, the translation study from C++ to Rust, RVBench/RagVerus, Ranking Formal Specifications, FirmNamer, SAST detection with LLMs, Understanding Formal Reasoning Failures, LMPA on pointer analysis, Preguss, ClearAgent, W2GPU, Vibe Coding survey, LMPL 2026 CFP, plus the unverified mlds_2025 placeholder.
The full per-artifact metadata (title, authors, year, venue, URL, and a one-sentence summary for every artifact in the survey) is in the raw lit-pass data file at research-notes/lit-pass-2026-05-17.md in the Babel GitHub repo. That file is agent-generated working data: not polished for reading, but reproducible and complete.