The Empirical Cascade, Sequenced by Measurement Cost

Research note. Drafted 2026-05-13 as §7 of the working LLM-oriented PL design installment; published as a standalone note 2026-05-17. Working draft.

TL;DR

The methodology has to support: for a given pair of languages, controlled for task and prompt, measure each stakeholder's efficiency axes and report the Tokens × Time Pareto trade-off honestly. The empirical work is sequenced cheapest measurement first: free tokenizer measurements, then cheap toolchain instrumentation, then automated LLM benchmarks, then expensive human studies, then synthesis and honest reporting. The discipline is to drop dominated candidates at each stage so expensive measurements only happen on candidates that survive. The cascade is also how the gating question gets answered: if Stages 1 through 3 show that an existing combination dominates the Pareto frontier, the right outcome is to publish the methodology as the contribution and not build a new language. The null result is reachable cheaply, by design.

What the cascade has to support

The methodological problem the cascade solves is comparability. The mid-2026 field survey catalogues 51 artifacts, each with its own efficiency claim: SimPy's 13.5% on Python, Quasar's 42% on ViperGPT, TOON's 40% on JSON benchmarks. None of these wins are directly comparable because the contexts differ. A shared cascade, run across a shared task suite, with the candidate languages held constant, produces results that are directly comparable. The comparability is the specific contribution the methodology makes to a field that has accumulated artifacts faster than it has accumulated evaluation across contexts.

The cascade also operationalises the discipline of measurement bounded by cost. The framing of stakeholders, tokens, and time reduces efficiency claims to two resources whose unit costs differ by orders of magnitude; the cascade picks up that fact and uses it.

Stage 1: free measurements

Token counts per LLM for each candidate language on a fixed task suite. SimPy, Pel (if the open-source artifact is available), Python, Inflexión, and any other candidates are run through the tokenizers of frontier model families: tiktoken's cl100k_base and o200k_base for the GPT family, and the publicly available tokenizers of Llama-3, Qwen-2.5, Mistral-7B, and others as licensing permits.

Cost: free. No model API calls are needed; tokenizer libraries are sufficient. A preliminary run of five short programs across six tokenizers shows that SimPy-style stripping yields 60% fewer tokens than Python on average, and Inflexión yields 28% fewer. The full per-stratum decomposition of those numbers is in the four-way verbosity stratification note. The numbers are suggestive, but the program suite is too small to support general claims; the full empirical study scales to a suite of at least 100 tasks in the class of HumanEval-XL (Peng et al. 2024).
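A minimal sketch of the Stage 1 measurement, under stated assumptions. The function names, the two-task suite, and the "Stripped" candidate are all hypothetical and are not taken from the preliminary run. The tokenizers are passed in as plain callables so the sketch runs offline; in a real run each callable would be something like the `encode` method of an encoding obtained from tiktoken.

```python
from statistics import mean
from typing import Callable, Dict, List


def token_counts(programs: Dict[str, List[str]],
                 tokenizers: Dict[str, Callable[[str], list]]) -> Dict[str, Dict[str, float]]:
    """Mean tokens per task for each (language, tokenizer) pair."""
    return {
        lang: {name: mean(len(tok(src)) for src in sources)
               for name, tok in tokenizers.items()}
        for lang, sources in programs.items()
    }


def relative_to_baseline(counts: Dict[str, Dict[str, float]],
                         baseline: str = "Python") -> Dict[str, float]:
    """Fractional change in token count versus the baseline, averaged over tokenizers."""
    base = counts[baseline]
    return {
        lang: mean(per_tok[name] / base[name] - 1.0 for name in base)
        for lang, per_tok in counts.items()
    }


# Stand-in tokenizers (assumption) so the sketch needs no downloads.
# A real run would pass e.g. tiktoken.get_encoding("cl100k_base").encode.
tokenizers = {
    "whitespace": str.split,
    "chars": list,
}

# Hypothetical two-task suite; "Stripped" is a made-up candidate, not SimPy output.
programs = {
    "Python": ["def add(a, b):\n    return a + b", "x = [i for i in range(3)]"],
    "Stripped": ["def add a b return a+b", "x=[i for i in range 3]"],
}

counts = token_counts(programs, tokenizers)
deltas = relative_to_baseline(counts)
for lang, delta in deltas.items():
    print(f"{lang}: {delta:+.1%} vs Python")
```

Averaging the per-tokenizer ratios, rather than pooling raw counts, keeps a tokenizer with a large vocabulary from being swamped by one that emits many more tokens for the same text.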

What Stage 1 can settle. A candidate language whose Token cost per task is dominated by an existing candidate, across every modern tokenizer, on a sufficient task suite, has nothing to gain by proceeding to later stages on the Tokens dimension. It might still survive on the Time dimension if Stages 2 through 4 surface a win there; but Stage 1 is sufficient to retire a candidate whose entire pitch was a Tokens claim that doesn't hold.

Stage 2: cheap toolchain instrumentation

Parse time, compile time, AST size. Instrument the toolchain of each language and run the same fixed task suite. These measurements answer the efficiency question for the compiler and interpreter stakeholder directly. They take an afternoon of engineering work and cost only machine time thereafter.

Cost: low one-time engineering cost; per-run cost is machine seconds. Stage 2 is where machine-Time claims get vetted. A language whose parser is quadratic in input length will surface here; a language whose compile time is excellent will surface here. The efficiency questions for the static analyser and the IDE stakeholder also start to become measurable at this stage, via the same toolchain instrumentation extended to LSP and tree-sitter pipelines.

Stage 3: automated LLM benchmarks

Pass@1 on a benchmark in the class of HumanEval-XL for each cell of language × LLM. A few hundred API calls per cell; total cost in the low hundreds of dollars for a grid of 5 languages by 5 models. The benchmark must include both generation tasks, where the LLM writes code, and comprehension tasks, where the LLM answers questions about provided code.
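The per-cell score can be sketched as follows, assuming one sample per task so that Pass@1 is the fraction of tasks whose first sample passes. The record layout, the language names, and the model names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per (language, model, task): did the first sample pass the tests?
Record = Tuple[str, str, str, bool]


def pass_at_1(results: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Fraction of tasks passed on the first sample, per (language, model) cell."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for language, model, _task_id, ok in results:
        total[(language, model)] += 1
        passed[(language, model)] += int(ok)
    return {cell: passed[cell] / total[cell] for cell in total}


# Hypothetical results for two cells of the grid.
results = [
    ("Python", "model-a", "t1", True),
    ("Python", "model-a", "t2", True),
    ("Python", "model-a", "t3", False),
    ("Python", "model-a", "t4", True),
    ("Candidate", "model-a", "t1", True),
    ("Candidate", "model-a", "t2", False),
    ("Candidate", "model-a", "t3", False),
    ("Candidate", "model-a", "t4", True),
]

grid = pass_at_1(results)
for (language, model), score in sorted(grid.items()):
    print(f"{language} x {model}: pass@1 = {score:.2f}")
```

Generation and comprehension tasks would be scored as separate grids under this sketch, since pooling them would hide a language that is easy to write but hard to read, or the reverse.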

Cost: hundreds of dollars in API calls per full grid. Stage 3 is where the cascade starts to discriminate Pareto survivors from Pareto-dominated candidates on the LLM-stakeholder rows of the stakeholders inventory. Languages dominated on Stages 1 through 3 are dropped before the expensive measurements; a language whose token cost is high and whose generation accuracy is low across all models has nothing to gain from human studies.
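The drop rule can be sketched as a plain Pareto filter, assuming every axis is expressed so that lower is better. The three axes and all scores below are hypothetical; a real run would use the Stage 1 through 3 measurements.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if a is no worse on every axis and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def survivors(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Candidates that no other candidate dominates."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(other, s)
                   for other_name, other in scores.items()
                   if other_name != name)
    )


# Axes (all lower-is-better): mean tokens per task, parse seconds, 1 - pass@1.
# Hypothetical scores for three candidates.
scores = {
    "A": (100.0, 0.010, 0.30),
    "B": (80.0, 0.012, 0.25),
    "C": (120.0, 0.015, 0.40),
}

print(survivors(scores))
```

Here C is dropped because A is at least as good on every axis, while A and B both survive because each wins on a different axis.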

Stage 4: expensive human studies

Surviving candidate languages get:

Time measurements with human reviewers, via a controlled audit task with timing.
Time measurements with human authors, via a controlled write task with keystroke logging.
Where the scientific question warrants, time measurements with human maintainers, via a held-out bug-fix suite.

Cost: about 100 times the Stage 3 cost per data point. The apparatus for the human study has to be set up only for languages where the cheap measurements already suggest a real effect. This difference of roughly two orders of magnitude between Stage 3 and Stage 4 cost is the reason the cascade is sequenced cheapest first: a methodology that runs Stage 4 for every candidate is uneconomical; a methodology that runs Stage 4 only for survivors of Stages 1 through 3 gets the same scientific signal at a fraction of the cost.
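The economics can be made concrete with a toy cost model. Every input below is a hypothetical placeholder (candidate counts, data points per candidate, the 5 cent unit cost); only the 100 times multiplier comes from the text above. Costs are kept in integer cents to avoid rounding noise.

```python
def study_cost_cents(n_stage3: int, n_stage4: int,
                     points3: int, points4: int,
                     cents_per_point3: int, multiplier: int = 100) -> int:
    """Total cost when n_stage3 candidates get Stage 3 and n_stage4 get Stage 4.

    A Stage 4 data point is assumed to cost `multiplier` times a Stage 3 one.
    """
    stage3 = n_stage3 * points3 * cents_per_point3
    stage4 = n_stage4 * points4 * cents_per_point3 * multiplier
    return stage3 + stage4


# Hypothetical inputs: 5 candidates, 2 survivors, 500 API data points and
# 40 human data points per candidate, 5 cents per Stage 3 data point.
run_everything = study_cost_cents(5, 5, 500, 40, 5)
cascade = study_cost_cents(5, 2, 500, 40, 5)

print(f"run everything: ${run_everything / 100:.2f}")
print(f"cascade:        ${cascade / 100:.2f}")
```

With these placeholder inputs the cascade costs $525 against $1,125 for running Stage 4 on every candidate; the saving comes entirely from the three candidates that never reach the human study.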

Stage 5: synthesis and honest reporting

No claim of the form "language Y is most efficient" is permitted. The reported result is a two-dimensional Tokens × Time Pareto frontier diagram with the Quality threshold marked, plus the underlying scores per language on nine or ten dimensions for comparison at the stakeholder level. Gaps where measurements were skipped are reported explicitly; the methodology section names every Stage 4 measurement that was not run and why.
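One way to keep the gaps explicit is to make a skipped measurement a first-class value in the report rather than a missing row. The sketch below is an assumption about how such a report could be structured; the `Candidate` fields, the threshold value, and all scores are hypothetical, and marking each frontier point with whether it meets the Quality threshold is one reading of "threshold marked".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    tokens: float                         # mean tokens per task (Stage 1)
    quality: float                        # e.g. pass@1 (Stage 3)
    time_seconds: Optional[float] = None  # human Time (Stage 4); None if not run
    skipped_reason: str = ""              # why Stage 4 was not run, if it was not


def build_report(candidates: List[Candidate], quality_threshold: float) -> dict:
    measured = [c for c in candidates if c.time_seconds is not None]
    frontier = []
    for c in measured:
        dominated = any(
            o.tokens <= c.tokens and o.time_seconds <= c.time_seconds
            and (o.tokens < c.tokens or o.time_seconds < c.time_seconds)
            for o in measured if o is not c
        )
        if not dominated:
            frontier.append({"name": c.name,
                             "meets_quality": c.quality >= quality_threshold})
    gaps = {c.name: c.skipped_reason for c in candidates if c.time_seconds is None}
    return {"frontier": sorted(frontier, key=lambda p: p["name"]), "gaps": gaps}


# Hypothetical candidates; W has no Stage 4 measurement.
candidates = [
    Candidate("X", tokens=100.0, quality=0.8, time_seconds=60.0),
    Candidate("Y", tokens=80.0, quality=0.6, time_seconds=75.0),
    Candidate("Z", tokens=120.0, quality=0.9, time_seconds=70.0),
    Candidate("W", tokens=90.0, quality=0.7, skipped_reason="Stage 4 not run: budget"),
]

report = build_report(candidates, quality_threshold=0.7)
print(report["frontier"])
print(report["gaps"])
```

The report returns a frontier and a list of gaps, never a single winner, which matches the rule that no "most efficient" claim is made.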

The shape of the frontier diagram is itself part of the contribution. A field that has accumulated 51 artifacts with efficiency claims local to a workload has no shared visual against which to position a new candidate. The frontier diagram, populated honestly with measurements from Stages 1 through 4 on a shared task suite, produces the missing reference frame.

Why the empirical work is itself part of the contribution

Existing papers on LLM-oriented PL each report a win specific to a context. A shared cascade, run across a shared task suite, with the candidate languages held constant, produces results that are directly comparable. The specific contribution the methodology paper makes to the field is precisely this comparability: the field has accumulated artifacts faster than it has accumulated evaluation across contexts, and the cascade addresses that gap.

The Youvan / Inflexión tension note is the cleanest example of why this matters. Two positions read as opposed; on inspection they describe different points on the same trade-off; the open question is which point wins on a given workload. The cascade is what lets that question be answered.

How the cascade answers the gating question

The cascade also answers the gating question, one of the two questions. If the measurements from Stages 1 through 3 show that Python plus SimPy, or any other existing combination, dominates the Pareto frontier on the chosen stakeholder profile, the right outcome is to publish this methodology as the contribution and not build the language proposed for Installment 08. The null result is real, and sequencing the cheap measurements first lets it be reached without the wasted effort of building a language whose value the empirical work would have negated.

A language for a future installment whose value cannot survive Stages 1 through 3 is a language whose author should have stopped earlier. The cascade is what gives the author the early stop signal. Without it, the default behaviour of the field is to build first and measure later, which is exactly how 51 artifacts get accumulated without comparison across contexts.

Status

Only Stage 1 of the cascade has been run, and only in preliminary form. The four-way verbosity stratification reports the measurements per tokenizer on five programs. Stages 2 through 4 are sequenced for future research-note publication as the data lands. Stage 5, the synthesis, is the eventual destination of the research thread; it is also the bar a candidate Installment 08 would have to clear before the language for Installment 08 is built at all.