Scaling laws for large language models (LLMs) quantify the relationship between model performance and resources such as parameter count, training tokens, and computational budget. Classical formulations, including those of Kaplan et al. (2020) and Hoffmann et al. (2022), treat tokens as uniform atomic units. Empirical evidence indicates that natural language exhibits non-uniform information density. This review examines the role of token-level semantic density in scaling phenomena, focusing on three interconnected propositions:
The review synthesizes findings from recent literature on vocabulary scaling, compression-aware scaling laws, adaptive compute allocation, and information-theoretic foundations.
The term semantic density appears in literature with multiple operational definitions:
| Definition Type | Formalization | Representative Work |
|---|---|---|
| Information-Theoretic | $D_{info} = H(T) / H_{max}(T)$, channel efficiency | Zouhar et al. (2023) |
| Linguistic | Semantic component vectors per content word | Nature (2025) |
| Uncertainty-Based | Distribution density in semantic space | Qiu et al. (2024) |
Information-theoretic density based on Shannon or Rényi entropy provides a quantitative link between token statistics and downstream performance. Rényi entropy correlates strongly with BLEU scores (ρ ≈ 0.82).
Natural language exhibits substantial local variation in predictability. Dynamic Large Concept Models (DLCM, 2025) explicitly note that token-uniform computation misallocates capacity, over-assigning resources to predictable spans and under-assigning to semantically critical transitions. The Uniform Information Density (UID) hypothesis and its entropy-based extensions provide a theoretical framework for modeling such non-uniformity.
Tao et al. (NeurIPS 2024) established that optimal vocabulary size depends on compute budget via a power law: $V_{opt}(C) \propto C^{0.08\text{–}0.12}$. For Llama2-70B, the predicted optimal vocabulary exceeds 216K tokens, approximately seven times the standard 32K. Increasing vocabulary from 32K to 43K under identical FLOPs improved ARC-Challenge accuracy from 29.1% to 32.0%.
Huang et al. (ICML 2025) demonstrated a log-linear relationship between input vocabulary size and training loss. Larger input vocabularies consistently improve performance across model sizes, enabling models with smaller parameter counts to match larger baselines.
These results indicate that tokenizer scaling constitutes a distinct axis in the overall scaling framework, orthogonal to parameter count and training tokens.
DLCM (2025) introduced the first scaling law that explicitly incorporates compression ratio $R$ (tokens per concept). By separating token-level capacity, concept-level reasoning capacity, and compression rate, the formulation enables principled compute allocation under fixed FLOPs. At $R=4$, approximately one-third of inference compute can be redirected to higher-capacity reasoning backbones, yielding an average improvement of +2.69% across twelve zero-shot benchmarks. Relative FLOPs savings increase with model scale.
Other semantic compression approaches demonstrate similar effects:
Several architectures allocate compute resources according to estimated token importance or information density.
| Method | Mechanism | Efficiency Gain |
|---|---|---|
| MoHD (ICML 2025) | Sparsity in hidden dimensions | 50% activation reduction, +1.7% performance |
| MODIX (CVPR 2026) | Density-driven positional encoding | Training-free modality adaptation |
| EAGER (2025) | Entropy-aware branching | Up to 65% token reduction, +37% Pass@k |
| CALM (NeurIPS 2022) | Confidence-based early exit | Up to 3× speedup |
Token Efficiency Breakthrough (2025) achieved 72.2% efficiency improvement and 30.2% token reduction via information density estimation, stating that transitioning from uniform processing to information-theoretic optimization is necessary for further scaling gains.
Nayak et al. (2025) derived the Chinchilla scaling law using an analogy to iterative decoding of LDPC codes, modeling skill acquisition as message passing on bipartite skill–text graphs. This work provides a theoretical basis for the power-law relationships observed empirically.
A deductive chain connects token statistics to neural scaling:
This sequence, formalized in From Zipf’s Law to Neural Scaling (2025), demonstrates that scaling behavior emerges from fundamental information-theoretic properties of token distributions.
Both layers exhibit power-law scaling with shared underlying principles of information compression.
| Layer | Scale-Up Variable | Intermediate Metric | End Effect |
|---|---|---|---|
| Model | Parameters N, Data D | Semantic density (information per token) | Effective FLOPs per unit information |
| Tokenizer | Training data volume | Token efficiency (compression ratio) | Effective context capacity |
Weak isomorphism—identical functional form and information-theoretic derivability—is strongly supported. Strong isomorphism—identical coefficients and mechanisms—is not supported due to qualitative differences in learning dynamics (gradient-based optimization vs. statistical pattern extraction) and temporal scales (training-time vs. pre-training tokenizer construction).
Definitional Pluralism: The lack of a unified operational definition of semantic density complicates direct experimental validation.
Causal Isolation: Manipulating semantic density independently while controlling for vocabulary size, embedding dimension, and optimization dynamics presents substantial methodological challenges.
Structural Determinism of FLOPs: FLOPs per token are determined by architectural parameters (vocabulary size, hidden dimension, depth) rather than token content. The proposed causal pathway therefore operates indirectly: denser tokens allow smaller models to achieve equivalent performance.
Non-Uniform Scaling: Contextual entrainment studies show that different capabilities scale with opposite exponents. Semantic filtering improves with scale (negative exponent), while mechanical copying also increases (positive exponent). This divergence implies that semantic density alone cannot capture the full complexity of scaling behavior.
Informational Linkage Gaps: Montanino (2026) identifies a structural limit beyond which scaling laws do not apply, particularly for problems requiring cross-domain conceptual construction rather than interpolation within the training distribution.
Empirical and theoretical evidence supports the proposition that token-level semantic density influences scaling efficiency. Compression-aware scaling laws, vocabulary size optimization, and adaptive compute allocation based on information density all demonstrate practical gains in FLOPs efficiency and context capacity. The weak isomorphism between model scaling and tokenizer scaling suggests that information compression serves as a unifying principle across layers. Future work should address the operationalization of semantic density, the design of causal validation experiments, and the extension of these findings to multilingual and non-Latin script contexts.