Axiomatic Reasoning for LLMs

Semantic Density and Scaling Laws

1. Introduction

Scaling laws for large language models (LLMs) quantify the relationship between model performance and resources such as parameter count, training tokens, and computational budget. Classical formulations, including those of Kaplan et al. (2020) and Hoffmann et al. (2022), treat tokens as uniform atomic units. Empirical evidence indicates that natural language exhibits non-uniform information density. This review examines the role of token-level semantic density in scaling phenomena, focusing on three interconnected propositions:

  1. Model Layer Proposition: Increasing model and data scale raises semantic density per token, which subsequently improves effective computational efficiency.
  2. Tokenizer Layer Proposition: Expanding tokenizer training data enhances token efficiency, effectively increasing usable context capacity.
  3. Structural Isomorphism: The two layers exhibit a common principle of efficiency gain through information compression.

The review synthesizes findings from recent literature on vocabulary scaling, compression-aware scaling laws, adaptive compute allocation, and information-theoretic foundations.

2. Semantic Density: Definitions and Empirical Basis

2.1 Operational Definitions

The term semantic density appears in literature with multiple operational definitions:

Definition Type Formalization Representative Work
Information-Theoretic $D_{info} = H(T) / H_{max}(T)$, channel efficiency Zouhar et al. (2023)
Linguistic Semantic component vectors per content word Nature (2025)
Uncertainty-Based Distribution density in semantic space Qiu et al. (2024)

Information-theoretic density based on Shannon or Rényi entropy provides a quantitative link between token statistics and downstream performance. Rényi entropy correlates strongly with BLEU scores (ρ ≈ 0.82).

2.2 Non-Uniformity of Information Density

Natural language exhibits substantial local variation in predictability. Dynamic Large Concept Models (DLCM, 2025) explicitly note that token-uniform computation misallocates capacity, over-assigning resources to predictable spans and under-assigning to semantically critical transitions. The Uniform Information Density (UID) hypothesis and its entropy-based extensions provide a theoretical framework for modeling such non-uniformity.

3. Vocabulary Size and Scaling Laws

Tao et al. (NeurIPS 2024) established that optimal vocabulary size depends on compute budget via a power law: $V_{opt}(C) \propto C^{0.08\text{–}0.12}$. For Llama2-70B, the predicted optimal vocabulary exceeds 216K tokens, approximately seven times the standard 32K. Increasing vocabulary from 32K to 43K under identical FLOPs improved ARC-Challenge accuracy from 29.1% to 32.0%.

Huang et al. (ICML 2025) demonstrated a log-linear relationship between input vocabulary size and training loss. Larger input vocabularies consistently improve performance across model sizes, enabling models with smaller parameter counts to match larger baselines.

These results indicate that tokenizer scaling constitutes a distinct axis in the overall scaling framework, orthogonal to parameter count and training tokens.

4. Compression-Aware Scaling Laws

DLCM (2025) introduced the first scaling law that explicitly incorporates compression ratio $R$ (tokens per concept). By separating token-level capacity, concept-level reasoning capacity, and compression rate, the formulation enables principled compute allocation under fixed FLOPs. At $R=4$, approximately one-third of inference compute can be redirected to higher-capacity reasoning backbones, yielding an average improvement of +2.69% across twelve zero-shot benchmarks. Relative FLOPs savings increase with model scale.

Other semantic compression approaches demonstrate similar effects:

5. Adaptive Computation Based on Information Density

Several architectures allocate compute resources according to estimated token importance or information density.

Method Mechanism Efficiency Gain
MoHD (ICML 2025) Sparsity in hidden dimensions 50% activation reduction, +1.7% performance
MODIX (CVPR 2026) Density-driven positional encoding Training-free modality adaptation
EAGER (2025) Entropy-aware branching Up to 65% token reduction, +37% Pass@k
CALM (NeurIPS 2022) Confidence-based early exit Up to 3× speedup

Token Efficiency Breakthrough (2025) achieved 72.2% efficiency improvement and 30.2% token reduction via information density estimation, stating that transitioning from uniform processing to information-theoretic optimization is necessary for further scaling gains.

6. Information-Theoretic Foundations

Nayak et al. (2025) derived the Chinchilla scaling law using an analogy to iterative decoding of LDPC codes, modeling skill acquisition as message passing on bipartite skill–text graphs. This work provides a theoretical basis for the power-law relationships observed empirically.

A deductive chain connects token statistics to neural scaling:

This sequence, formalized in From Zipf’s Law to Neural Scaling (2025), demonstrates that scaling behavior emerges from fundamental information-theoretic properties of token distributions.

7. Structural Isomorphism: Model Layer and Tokenizer Layer

Both layers exhibit power-law scaling with shared underlying principles of information compression.

Layer Scale-Up Variable Intermediate Metric End Effect
Model Parameters N, Data D Semantic density (information per token) Effective FLOPs per unit information
Tokenizer Training data volume Token efficiency (compression ratio) Effective context capacity

Weak isomorphism—identical functional form and information-theoretic derivability—is strongly supported. Strong isomorphism—identical coefficients and mechanisms—is not supported due to qualitative differences in learning dynamics (gradient-based optimization vs. statistical pattern extraction) and temporal scales (training-time vs. pre-training tokenizer construction).

8. Limitations and Open Questions

Definitional Pluralism: The lack of a unified operational definition of semantic density complicates direct experimental validation.

Causal Isolation: Manipulating semantic density independently while controlling for vocabulary size, embedding dimension, and optimization dynamics presents substantial methodological challenges.

Structural Determinism of FLOPs: FLOPs per token are determined by architectural parameters (vocabulary size, hidden dimension, depth) rather than token content. The proposed causal pathway therefore operates indirectly: denser tokens allow smaller models to achieve equivalent performance.

Non-Uniform Scaling: Contextual entrainment studies show that different capabilities scale with opposite exponents. Semantic filtering improves with scale (negative exponent), while mechanical copying also increases (positive exponent). This divergence implies that semantic density alone cannot capture the full complexity of scaling behavior.

Informational Linkage Gaps: Montanino (2026) identifies a structural limit beyond which scaling laws do not apply, particularly for problems requiring cross-domain conceptual construction rather than interpolation within the training distribution.

9. Conclusion

Empirical and theoretical evidence supports the proposition that token-level semantic density influences scaling efficiency. Compression-aware scaling laws, vocabulary size optimization, and adaptive compute allocation based on information density all demonstrate practical gains in FLOPs efficiency and context capacity. The weak isomorphism between model scaling and tokenizer scaling suggests that information compression serves as a unifying principle across layers. Future work should address the operationalization of semantic density, the design of causal validation experiments, and the extension of these findings to multilingual and non-Latin script contexts.