Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

Yuval Reif, Guy Kaplan, Roy Schwartz
The Hebrew University of Jerusalem
TL;DR: We show that LLMs encode morphological variations (e.g., "walk"→"walked") as linear transformations in embedding space. By composing words from base forms + transformation vectors, we reduce vocabulary size by up to 10% while expanding coverage to out-of-vocabulary words—all without modifying model weights or sacrificing downstream performance.

Abstract

Large language models (LLMs) have been shown to encode word-form variations, such as "walk"→"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens—filling the size-capped vocabulary with surface-form variants (e.g., "walk", "walking", "Walk") at the expense of less frequent words and multilingual coverage.

We show that many of these variations can be captured by transformation vectors: additive offsets that, when applied to a base form's embedding, yield the representation of the corresponding surface form, in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base-form and transformation vectors (e.g., "walked" = "walk" + past tense).
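To make the composition concrete, here is a minimal sketch of the vector arithmetic using a small off-the-shelf model; the model choice (gpt2) and the helper names (`tid`, `transformation_vector`) are illustrative assumptions, not the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: a small causal LM whose input embedding table we read directly.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
E = model.get_input_embeddings().weight  # [vocab_size, d_model]

def tid(word: str) -> int:
    """Id of a word assumed to be a single token (leading space for BPE vocabularies)."""
    ids = tok.encode(" " + word)
    assert len(ids) == 1, f"{word!r} is not a single token in this vocabulary"
    return ids[0]

def transformation_vector(pairs):
    """Mean offset between surface-form and base-form embeddings, e.g. a past-tense direction."""
    return torch.stack([E[tid(s)] - E[tid(b)] for b, s in pairs]).mean(dim=0)

# Estimate the past-tense direction from a few in-vocabulary pairs ...
past = transformation_vector([("walk", "walked"), ("jump", "jumped"), ("talk", "talked")])

# ... and compose an embedding for a surface form from base form + transformation vector.
walked_vec = E[tid("walk")] + past
```

The same offsets can be estimated in the output (unembedding) space by swapping `E` for the LM head's weight matrix.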

We apply our approach to multiple LLMs across five languages, removing up to 10% of vocabulary entries—thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights.

Method Overview

Figure 1: Compositional vocabulary for LLMs. Input tokens are decomposed into base words and transformations, while output predictions combine logits from both vocabularies.
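The output-side combination in Figure 1 can be sketched as adding a base-word logit to a transformation logit computed against the same hidden state; the function below uses assumed names (`h`, `U`, `t_vec`) and a deliberately simplified scoring rule, not the paper's exact one.

```python
import torch

def composed_logit(h: torch.Tensor, U: torch.Tensor, base_id: int, t_vec: torch.Tensor) -> torch.Tensor:
    """Score a composed surface form (e.g., "walked") given the final hidden state h [d],
    the unembedding matrix U [vocab_size, d], the base-form row index, and a transformation
    vector t_vec [d] estimated in the output space. Illustrative combination rule only."""
    return h @ U[base_id] + h @ t_vec
```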

Key Findings

📊 Vocabulary Redundancy

Analysis of the GPT-4 tokenizer reveals that 24.6k English whole-word tokens can be reduced to just 14.3k base forms (42% reduction) when accounting for case, inflection, and derivation.
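A rough way to reproduce this kind of count, assuming the GPT-4 tokenizer via tiktoken (cl100k_base) and spaCy lemmatization as a stand-in for the paper's morphological resources (so the exact figures will differ):

```python
import tiktoken
import spacy

# Crude redundancy count: map whole-word tokens of the GPT-4 tokenizer to lowercase lemmas.
enc = tiktoken.get_encoding("cl100k_base")
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

words = set()
for i in range(enc.n_vocab):
    try:
        s = enc.decode([i])
    except Exception:
        continue  # skip ids with no associated token
    w = s.strip()
    if s.startswith(" ") and w.isalpha():  # rough filter for English whole-word tokens
        words.add(w)

lemmas = {t.lemma_.lower() for doc in nlp.pipe(sorted(words)) for t in doc}
print(f"{len(words)} whole-word tokens -> {len(lemmas)} base forms")
```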

🔄 Compositional Representations

LLMs naturally interpret compositional embeddings (base + transformation) as their intended surface forms, even for out-of-vocabulary words never seen as single tokens during pretraining.

🌍 Multilingual Success

The approach works across morphologically diverse languages: English, Arabic, German, Russian, and Spanish, with particularly strong results for inflectional transformations.

⚡ Minimal Performance Impact

Compositional vocabularies achieve comparable performance to baseline models across diverse benchmarks, while reducing decoding speed by only 0.8%.

Vocabulary Structure Analysis

Figure 2: Many in-vocabulary English word tokens are surface variants of other tokens. The same base forms and transformations can compose over 98k currently out-of-vocabulary words.

Results

Patchscopes Interpretation Accuracy

We use Patchscopes to verify whether LLMs correctly interpret compositional embeddings. Results show high accuracy for inflections and capitalization, both for in-vocabulary and out-of-vocabulary words.
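The same check can be approximated without the official Patchscopes tooling by patching a composed embedding into a simple repetition prompt and decoding; the prompt is arbitrary, and `tok`, `model`, `tid`, and `walked_vec` are reused from the sketch above.

```python
import torch

# Hand-rolled, Patchscopes-style probe (not the official Patchscopes code), continuing the
# earlier sketch: patch the composed `walked_vec` into a repetition prompt and decode it.
prompt = "cat -> cat; dog -> dog; X ->"
ids = tok(prompt, return_tensors="pt").input_ids
x_pos = (ids[0] == tid("X")).nonzero()[0].item()   # position of the placeholder " X" token

with torch.no_grad():
    embeds = model.get_input_embeddings()(ids).clone()
    embeds[0, x_pos] = walked_vec                  # overwrite the placeholder's input embedding
    out = model.generate(inputs_embeds=embeds, max_new_tokens=3, do_sample=False)

print(tok.decode(out[0]))                          # ideally decodes to " walked"
```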

Table 1: Compositional embeddings are successfully resolved for most inflectional forms, even for out-of-vocabulary words.

Downstream Performance

Models with compositional vocabularies achieve comparable performance to baseline models across a suite of benchmarks including MMLU, ARC, HellaSwag, TriviaQA, and more—despite restructuring up to 10% of the vocabulary.

Table 3: Downstream performance of English compositional-vocabulary models compared to the unmodified baseline (Llama-3-8B). Our framework performs on par with the baseline model across diverse tasks.

Vocabulary Size vs. Morphological Compositionality

Figure 3: Linear representation of morphology in embeddings weakens as vocabulary size increases. Models with compact vocabularies encode morphology through consistent vector offsets, while large-vocabulary models represent inflections as individual lexical units.
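One simple way to probe this trend, as an illustration rather than the paper's exact metric, is the mean pairwise cosine similarity between inflection offsets (reusing `E` and `tid` from the earlier sketch); consistent offsets indicate a more linear encoding.

```python
import torch
import torch.nn.functional as F

# Illustrative probe of the trend in Figure 3: how consistent are past-tense offsets
# across word pairs in a given model's embedding table?
pairs = [("walk", "walked"), ("jump", "jumped"), ("talk", "talked"), ("play", "played")]
offsets = torch.stack([E[tid(s)] - E[tid(b)] for b, s in pairs])          # [n_pairs, d]

sims = F.cosine_similarity(offsets.unsqueeze(1), offsets.unsqueeze(0), dim=-1)
off_diag = sims[~torch.eye(len(pairs), dtype=torch.bool)]
print("mean pairwise offset similarity:", off_diag.mean().item())         # higher = more linear
```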

Citation

@article{reif2025vocabdiet,
  title={Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic},
  author={Reif, Yuval and Kaplan, Guy and Schwartz, Roy},
  journal={arXiv preprint arXiv:},
  year={2025}
}