The Four-Way Verbosity Stratification

Research note. Drafted 2026-05-13 as §8 of a working LLM-Oriented Programming-Language Design installment; published as a standalone note 2026-05-16. Working draft, subject to refinement as the empirical cascade continues.

The argument in one paragraph

The field of LLM-oriented programming-language design from 2024 to 2026 has been treating verbosity as a single number: roughly, bytes per logical operation. That assumption was harmless when the field was eight hundred Brainfuck derivatives, because bytes, characters, morphemes, and tokens correlated tightly: a Chicken program is long on all four; a J or K program is short on all four. The assumption is no longer harmless. The LLM-oriented field of 2024 to 2026 has stratified the relevant cost axes into four separable measures, each with its own units, its own consumer, and its own design implications. A new LLM-oriented language whose surface engages morphology decorrelates the four. This note presents the stratification, the empirical evidence that the strata can decorrelate, and our theory of what the stratification gives the field.

Why a single number stopped being enough

Babel's original verbosity parameter, bytes per logical operation, was calibrated against the corpus of Brainfuck derivatives. It implicitly folded together four cost axes that the LLM-oriented field has since stratified into separable measures:

Bytes per op. The cost of storing or transmitting the source. Relevant to the file-system stakeholder; to diff size in version control; to network transmission of LLM prompts, meaning bytes wire-formatted in a JSON request body, before tokenization.
Characters per op. The cost of reading the source for a human. Roughly correlated with bytes per op for languages with English keywords, but diverges when the language uses multi-byte UTF-8: Wenyan, or Inflexión with its accented characters and clitics.
Morphemes per op. The cost of cognitive processing for a linguistically aware reader, and a proxy for the semantic density of the surface form. A morphologically rich language like Spanish, Finnish, or Classical Chinese packs more semantic distinctions per morpheme than an analytic language like Python or other programming languages with English keywords.
Tokens per op. The cost the LLM pays per semantic operation, against a chosen tokenizer. Different from any of the above: a BPE vocabulary reflects the analytic-language frequencies of its training data and partitions morphologically rich text in ways that respect neither character boundaries nor morpheme boundaries.
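
As a minimal sketch of how the four measures could be computed for a single snippet, assuming Python and only the standard library: the names Strata, measure, and regex_tokenizer are hypothetical, the regex tokenizer is a crude stand-in for a real BPE tokenizer, and the morpheme segmentation and the op count are supplied by hand because neither can be derived mechanically here. The Spanish word in the example is illustrative only and is not Inflexión syntax.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Strata:
    """The four verbosity measures for one snippet (hypothetical container)."""
    bytes_per_op: float
    chars_per_op: float
    morphemes_per_op: float
    tokens_per_op: float


def regex_tokenizer(text: str) -> list[str]:
    """Stand-in tokenizer: runs of word characters, or single symbols.

    This is NOT a BPE tokenizer; it only keeps the sketch dependency-free.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def measure(source: str,
            morphemes: Sequence[str],
            op_count: int,
            tokenize: Callable[[str], Sequence] = regex_tokenizer) -> Strata:
    """Compute the four strata for one snippet.

    `morphemes` is a hand-made segmentation and `op_count` a hand-made count
    of logical operations; both are assumptions of this sketch.
    """
    if op_count <= 0:
        raise ValueError("op_count must be positive")
    return Strata(
        bytes_per_op=len(source.encode("utf-8")) / op_count,  # storage cost
        chars_per_op=len(source) / op_count,                  # code points
        morphemes_per_op=len(morphemes) / op_count,
        tokens_per_op=len(tokenize(source)) / op_count,
    )


# "devuélvemelo": one accented character (written as an escape so the
# precomposed code point is unambiguous) and an illustrative 3-way segmentation.
example = measure("devu\u00e9lvemelo", ["devu\u00e9lve", "me", "lo"], op_count=1)
print(example)
```

Even on one word the bytes and characters strata already differ, because the accented character occupies two bytes in UTF-8. To measure against a real tokenizer, pass its encode function as tokenize, for example tiktoken.get_encoding("cl100k_base").encode if tiktoken is installed.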

In the corpus of Brainfuck derivatives, these four measures correlate tightly enough that a single number is adequate. In the LLM-oriented field they need not. Sun et al.'s SimPy result is a useful illustration of the case where they still do: stripping Python's formatting moves all four measures in the same direction and by broadly similar amounts, because Python's grammar was analytic before the strip and is analytic after. The four agree there precisely because SimPy operates on an analytic-language base. A language with a morphologically rich substrate may decorrelate the four in non-obvious ways. The empirical measurement below shows that this is exactly what Inflexión does.

Empirical measurement

A small but controlled experiment measures the four strata across five short programs in three forms (Python, hand-stripped Python in the style of SimPy, and Inflexión as defined in the Inflexión design paper §5) against six tokenizers: tiktoken's cl100k_base and o200k_base; Hugging Face mirrors of Llama-3, Qwen-2.5, and Mistral-7B; and GPT-2 for historical comparison. Aggregate change on each stratum relative to the Python baseline:

Stratum	SimPy-style Python	Inflexión
bytes per op	−52.1%	−1.8%
characters per op	−52.1%	−3.4%
morphemes per op	−36.2%	+359.9%
tokens per op (cl100k_base)	−38.3%	+15.2%
tokens per op (o200k_base)	−42.4%	+17.6%
tokens per op (Llama-3)	−41.0%	+18.4%
tokens per op (Qwen-2.5)	−44.6%	+20.3%
tokens per op (Mistral)	−46.3%	+19.7%
tokens per op (GPT-2)	−38.3%	−5.4%

Two observations the table is designed to make immediately legible.

First, SimPy's four measures move together. Bytes, characters, morphemes, and tokens all drop by between 36% and 52%; the relative magnitudes vary slightly but the direction is uniform. A reader can summarise SimPy's verbosity profile with a single number, "roughly half", without losing much. The original single-parameter framing would have described SimPy adequately.

Second, Inflexión's four measures do not move together. Bytes per op is essentially unchanged from Python, at −1.8%. Characters per op is essentially unchanged, at −3.4%. Morphemes per op is roughly 4.6 times Python's, at +359.9%; the morphological density that Inflexión's grammatical-semantic mappings make load-bearing shows up directly as morphemes packed into each operation. Tokens per op is higher than Python on every modern tokenizer, at +15% to +20%, and falls slightly below Python only on GPT-2, at −5.4%, a tokenizer trained before LLM-oriented PL design was a research topic. A reader summarising Inflexión's verbosity with one number gets a different answer depending on which of the four strata they reach for.
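
The two observations can be checked mechanically from the table. The sketch below, in Python with hypothetical helper names (multiplier, moves_together, spread), copies the table's percent changes and asks of each column whether every stratum moves in the same direction and how far apart the extremes sit. It is one simple way to operationalise "move together", not a formal correlation test.

```python
# Percent change versus the Python baseline, copied from the table above.
# Order: bytes, characters, morphemes, then tokens under the six tokenizers
# (cl100k_base, o200k_base, Llama-3, Qwen-2.5, Mistral, GPT-2).
SIMPY_STYLE = [-52.1, -52.1, -36.2, -38.3, -42.4, -41.0, -44.6, -46.3, -38.3]
INFLEXION = [-1.8, -3.4, 359.9, 15.2, 17.6, 18.4, 20.3, 19.7, -5.4]


def multiplier(percent_change: float) -> float:
    """Convert a percent change into a multiple of the baseline."""
    return 1.0 + percent_change / 100.0


def moves_together(changes: list[float]) -> bool:
    """True when every stratum moves in the same direction."""
    return all(c < 0 for c in changes) or all(c > 0 for c in changes)


def spread(changes: list[float]) -> float:
    """Gap, in percentage points, between the largest and smallest change."""
    return round(max(changes) - min(changes), 1)


for name, changes in [("SimPy-style", SIMPY_STYLE), ("Inflexión", INFLEXION)]:
    print(name, moves_together(changes), spread(changes))

# +359.9% morphemes per op, expressed as a multiple of the Python baseline.
print(round(multiplier(359.9), 3))
```

On these figures the SimPy-style column is uniform in direction with a spread of about 16 percentage points, while the Inflexión column mixes signs and spans about 365 percentage points.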

The same empirical run, restated per program rather than per op, recovers an earlier finding: Inflexión averages 28% fewer tokens than Python when totals are taken over a whole program. This reconciles with the per-op finding of +15% to +20% through a single mechanism: Inflexión uses fewer ops per program than Python does, because each op packs more semantic distinctions into morphology. The token cost per op is higher; the token cost per program is lower; the morpheme density is what carries the trade. Both numbers are correct; they describe different normalizations of the same underlying surface. A designer or a reader who treats verbosity as one-dimensional cannot see this trade.
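
The reconciliation can be made concrete with a short calculation. The sketch below, in Python with a hypothetical helper implied_op_ratio, relies on the identity tokens per program = tokens per op × ops per program, so the per-program ratio is the per-op ratio times the op-count ratio. It assumes both percentages are ratios of totals over the same programs; since the −28% figure is reported as an average, the result should be read as approximate.

```python
def implied_op_ratio(tokens_per_program_change: float,
                     tokens_per_op_change: float) -> float:
    """Op-count ratio (Inflexión / Python) implied by the two token figures.

    Uses tokens_per_program = tokens_per_op * ops_per_program, so the
    per-program ratio equals the per-op ratio times the op-count ratio.
    Both inputs are percent changes versus the Python baseline.
    """
    program_ratio = 1.0 + tokens_per_program_change / 100.0
    op_ratio = 1.0 + tokens_per_op_change / 100.0
    return program_ratio / op_ratio


# -28% tokens per program, paired with the smallest and largest per-op
# increases among the modern tokenizers in the table.
with_cl100k = implied_op_ratio(-28.0, 15.2)  # cl100k_base: +15.2% per op
with_qwen = implied_op_ratio(-28.0, 20.3)    # Qwen-2.5: +20.3% per op
print(f"{with_cl100k:.3f} {with_qwen:.3f}")
```

Under these assumptions the implied op-count ratio is roughly 0.60 to 0.625, which is to say Inflexión would be using on the order of 37% to 40% fewer ops per program than Python.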

What the stratification gives the field

A single verbosity parameter implicitly assumes the four measures correlate. When they do, as in SimPy, Token Sugar, ShortCoder, and the analytic-language-base interventions that dominate the field of 2024 to 2026, the assumption is harmless and the parameter is adequate. When they do not, as in Inflexión, Tampio, Wenyan, Perligata, and presumably any future LLM-oriented design that engages a morphologically rich substrate, the single parameter forces the designer to pick which measure to optimise without realising the choice has been made implicitly. The four-way stratification raises the choice to the surface: a designer choosing "low verbosity" must now say low on which stratum, and discover, by saying it, that the strata can be traded against each other.

The trade is real and the field has been treating it as theoretical. Sun et al.'s SimPy paper argues that verbosity in the style of natural language is wasteful for LLMs. Youvan's Tokenese paper argues that natural languages are inefficient for tokenization and a constructed alternative would be better. Both arguments are correct on the tokens stratum when it is normalized per op. Inflexión's win is correct on the same stratum when it is normalized per program. The arguments do not conflict; they describe different points under different normalizations. The field has been arguing past itself for two years because the underlying measure was undecomposed.

The contribution of the four-way stratification is therefore that it makes the design trade explicit. A new LLM-oriented language can now publish four numbers and be reviewable on all of them. A reader can ask "what is the morpheme count per op this language commits to, and what does it cost per op in tokens?" rather than "is this language verbose?" The first question is answerable; the second was always a category mistake.

Open question

The empirical evidence above is enough to establish that the strata can decorrelate. It is not enough to answer the strategic question Inflexión's profile poses: does the reduction in op count that morphological density buys outweigh the cost of token fragmentation as workloads scale? At five short programs, Inflexión's win on tokens per program is real; at five thousand longer programs, could the token cost per op compound enough to invert the conclusion? We do not know. An empirical cascade of staged measurements (tokenizer-only runs first, then machine-time profiling, then automated LLM benchmarks, then human studies) is the right machinery to answer the question, but the present per-op measurements alone are insufficient.

The right disposition for now is: introduce the stratification, present what the stratification reveals, and leave the strategic question open until the empirical cascade has the data to settle it. The four-way stratification, even if every other claim in this research thread failed, would still be a useful methodological contribution, because the field is presently making decisions on a single verbosity parameter that empirically obscures a real trade. Surfacing the trade is the contribution.