Ali Safaya


Understanding the Impact of Token Redundancy in Language Models

31 May 2023

TL;DR

Subword tokenizers such as SentencePiece and Byte-Pair Encoding (BPE) represent the same word with different tokens depending on whether it is prefixed by a space (e.g. Hello vs ĠHello). At first glance this looks like redundant duplication, but the leading space also carries word-boundary information. This post looks at where these multiple token forms come from, shows that models learn closely related embeddings for them (through cosine-similarity heatmaps), and argues that the real concern in practice is keeping tokenization consistent between training and inference rather than the duplication itself.

Introduction

To see where this comes from, let's look at how a subword tokenizer handles words and subwords, using the GPT2Tokenizer as an example. The GPT2Tokenizer is a byte-level BPE tokenizer commonly used in pre-trained language models like GPT-2, GPT-3, RoBERTa, and BART.

BPE learns a vocabulary by progressively merging the most frequently occurring character or token pairs, gradually building a subword vocabulary that can handle Out-of-Vocabulary (OOV) and rare terms. The GPT2Tokenizer uses byte-level BPE, which operates on raw bytes. Importantly, before merging it applies a regex-based pre-tokenization that splits text into word-like chunks and keeps each leading space attached to the chunk that follows it. That step is exactly what produces the space-prefixed token forms we discuss below.
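The pre-tokenization step can be sketched with Python's standard `re` module. The pattern below is a simplified, ASCII-only stand-in for GPT-2's actual pattern (which uses Unicode character classes and also handles contractions), and `pretokenize` is a hypothetical helper name. Spaces are rendered as Ġ purely for display.

```python
import re

# Simplified, ASCII-only approximation of a GPT-2 style pre-tokenization
# pattern (an assumption for illustration, not the exact original).
# Each alternative may absorb ONE leading space into the chunk.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pretokenize(text):
    """Split text into chunks, showing an attached leading space as 'Ġ'."""
    return [chunk.replace(" ", "Ġ") for chunk in PRETOKEN.findall(text)]

print(pretokenize("Hello World"))   # ['Hello', 'ĠWorld']
print(pretokenize(" Hello World"))  # ['ĠHello', 'ĠWorld']
```

BPE merges are then learned inside each chunk, so a merge can never cross a chunk boundary, and a leading space can only ever end up at the start of a token.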

Both BPE and SentencePiece models reserve tokens to represent the space character. However, it's worth noting that the space character is also encoded in nearly 50% of the tokens. When a token is prefixed with a space, it signifies that the token begins a new word (and is often a complete word) rather than continuing the previous one. Typically, a special character is used to represent the prefixed space. For instance, GPT2Tokenizer uses Ġ, while SentencePiece tokenizers employ a metaspace character ▁.

Let's illustrate this with an example using the GPT2Tokenizer:

>>> gpttokenizer('Hello World')
['Hello', 'ĠWorld'] - [15496, 2159]

In this case, ĠWorld is marked as the start of a word, whereas Hello is not, simply because no space precedes it. This distinction lies at the root of the problem, as it leads the BPE model to learn different tokens for the same word.

Take a look at the following examples:

>>> gpttokenizer('Hello World')
['Hello', 'ĠWorld'] - [15496, 2159]
>>> gpttokenizer(' Hello World')
['ĠHello', 'ĠWorld'] - [18435, 2159]

Here, we can observe that the same sentence is tokenized differently based solely on the presence or absence of a leading space. When the vocabulary is built, a word that appears with no preceding space (for example at the start of the text) is seen as a bare token, while the same word occurring after a space is seen as a space-prefixed token. Because both patterns are common in the training corpus, both forms end up in the vocabulary.
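To make this concrete, here is a minimal toy BPE trainer. It is a sketch under simplifying assumptions: it merges characters rather than bytes, uses a made-up four-line corpus, and reuses a simplified ASCII pre-tokenization pattern. The names `pretokenize` and `train_bpe` are hypothetical.

```python
import re
from collections import Counter

# Simplified ASCII-only pre-tokenization pattern (assumption, see lead-in).
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pretokenize(text):
    return [chunk.replace(" ", "Ġ") for chunk in PRETOKEN.findall(text)]

def train_bpe(lines, num_merges):
    """Character-level toy BPE: returns the set of learned symbols."""
    words = Counter()
    for line in lines:
        for chunk in pretokenize(line):
            words[tuple(chunk)] += 1
    vocab = {sym for word in words for sym in word}
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by chunk frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every chunk is already a single symbol
        (a, b), _count = pairs.most_common(1)[0]
        vocab.add(a + b)
        # Apply the chosen merge to every chunk.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab

corpus = ["hello world", "hello there", "say hello world", "oh hello"]
vocab = train_bpe(corpus, num_merges=50)
print("hello" in vocab, "Ġhello" in vocab)  # True True
```

Because "hello" opens two lines and follows a space in two others, the trainer sees two distinct chunks and ends up learning a separate token for each.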

Upon analyzing the GPT2Tokenizer, we can identify a category of approximately 8,535 words that exhibit this behavior. Here are a few examples:

sharing, Ġsharing
mask, Ġmask
said, Ġsaid
20, Ġ20
hold, Ġhold
podcast, Ġpodcast

In total, around 8,535 words appear in both forms, which means that roughly 17% of the GPT2Tokenizer's 50,257 vocabulary entries (8,535 / 50,257 ≈ 0.17) are a second form of a word that is already in the vocabulary. At first glance these look like unnecessary duplicates, though, as we will see, the two forms are not quite interchangeable.
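A count like this can be sketched in a few lines. The helper name `paired_words` is hypothetical, and the toy vocabulary below only contains the token IDs quoted in this post, so it illustrates the method rather than reproducing the 8,535 figure.

```python
def paired_words(vocab):
    """Return words w for which both 'w' and 'Ġw' are vocabulary entries."""
    return sorted(
        tok for tok in vocab
        if not tok.startswith("Ġ") and "Ġ" + tok in vocab
    )

# Toy vocabulary built from token IDs quoted in this post.
toy_vocab = {
    "Hello": 15496, "ĠHello": 18435, "ĠWorld": 2159,
    "Ali": 37893, "ĠAli": 12104,
}

pairs = paired_words(toy_vocab)
print(pairs)                        # ['Ali', 'Hello']
print(len(pairs) / len(toy_vocab))  # 0.4
```

With the Hugging Face transformers library installed, the same function could presumably be applied to the dictionary returned by a loaded tokenizer's `get_vocab()`. The exact count will depend on what is counted as a "word" (for example, whether numbers and punctuation are included).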

So, what does this mean in practice?

Consider the following tokenization examples using both the GPT2Tokenizer and BLOOM tokenizer:

>>> gpttokenizer("Ali is a student, Ali is a father")
[37893, 318, 257, 3710, 11, 12104, 318, 257, 2988]
>>> bloomtokenizer("Ali is a student, Ali is a father")
[86256, 632, 267, 30832, 15, 25427, 632, 267, 27782]

In the GPT2Tokenizer output, the first occurrence of Ali is encoded as 37893, while the second is encoded as 12104; the BLOOM tokenizer behaves the same way. This has two practical consequences. First, the model has to learn that these different token IDs refer to the same underlying word — though, as the next section shows, it picks this up readily. Second, each form occupies its own embedding row, so the same word can take up more than one slot in the vocabulary.

It is tempting to treat this as pure waste, but the two forms are not interchangeable: the leading space records whether a word follows whitespace, and the tokenizer relies on that to reconstruct the original text exactly when decoding. So this is less "redundancy to be removed" and more a boundary distinction that the model has to learn to handle. The genuinely practical concern — which we return to at the end — is making sure the same text is always tokenized the same way.

How do the models treat this phenomenon?

Models seem to learn to handle these two forms in a similar way. To illustrate this, let's examine the cosine similarity between the embedding vectors of such token pairs.

[Figures: cosine-similarity heatmap of 5 bare vs space-prefixed token pairs; cosine-similarity heatmap of 1,000 bare vs space-prefixed token pairs]

Heatmaps of cosine similarity between GPT-2 token embeddings for 5 and 1,000 words (respectively). Rows are the bare tokens and columns their space-prefixed counterparts, aligned so that cell (i, i) compares the two forms of the same word. The words are chosen at random; the bright diagonal shows that a word's two forms are far more similar to each other (about 0.5–0.6) than to other tokens (near 0).

The heatmaps show that the two forms of a word end up with closely related embeddings, even though they remain distinct tokens. This makes sense: trained on large corpora, the model sees both forms in similar contexts and learns representations that reflect their shared meaning while still encoding the boundary difference.
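The computation behind such a heatmap is simple to sketch. The vectors below are made up for illustration and are NOT real GPT-2 embeddings; only the layout (rows are bare tokens, columns are space-prefixed tokens) follows the figure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors (assumption), standing in for embedding rows.
bare = {"hello": [1.0, 0.0, 0.2], "mask": [0.0, 1.0, 0.1]}
prefixed = {"Ġhello": [0.8, 0.1, 0.5], "Ġmask": [0.1, 0.9, -0.3]}

words = list(bare)
# matrix[i][j] compares bare word i with the space-prefixed form of word j.
matrix = [[cosine(bare[r], prefixed["Ġ" + c]) for c in words] for r in words]

for word, row in zip(words, matrix):
    print(word, [round(x, 2) for x in row])
```

With a real model, the rows would come from its input embedding matrix (in Hugging Face transformers this is typically reachable through `model.get_input_embeddings()`), indexed by the token IDs of each pair.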

In principle this similarity could be exploited to shrink the vocabulary or merge these embeddings together, but that is beyond the scope of this article.

What is the solution?

The practical fix is to tokenize text consistently. The simplest step is to add a prefix space before tokenizing, so that the first word of a string takes the same space-prefixed form it would have mid-sentence. Most tokenizers support this through the add_prefix_space=True option in the huggingface/tokenizers library, which prepends a space to the string before tokenization. Note that this does not remove the bare token forms from the vocabulary — it just makes their use consistent — and it should match how the model was trained, otherwise it can introduce a train/inference mismatch.
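The effect of a prefix space can be sketched with a toy encoder. Everything here is an illustrative assumption: the pattern is a simplified ASCII stand-in for GPT-2's pre-tokenizer, `with_prefix_space` and `encode` are hypothetical helpers, and the three-entry vocabulary only holds the token IDs quoted earlier.

```python
import re

# Simplified ASCII-only pre-tokenization pattern (assumption).
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

# Token IDs quoted earlier in this post; a real vocabulary is far larger.
TOY_VOCAB = {"Hello": 15496, "ĠHello": 18435, "ĠWorld": 2159}

def with_prefix_space(text):
    """Prepend a space unless the text already starts with one."""
    return text if text.startswith(" ") else " " + text

def encode(text):
    """Pre-tokenize, then look each chunk up in the toy vocabulary."""
    chunks = [c.replace(" ", "Ġ") for c in PRETOKEN.findall(text)]
    return [TOY_VOCAB[c] for c in chunks]

print(encode("Hello World"))                     # [15496, 2159]
print(encode(with_prefix_space("Hello World")))  # [18435, 2159]
```

In the huggingface/tokenizers library the corresponding switch is the `add_prefix_space` argument of the `ByteLevel` pre-tokenizer, which behaves like the conditional rule sketched above.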

However, this solution doesn't completely resolve the problem for multiline text, because a word that directly follows a line break still has no space in front of it. Let's consider the example of the LLaMA tokenizer:

>>> llamatokenizer("John John\nJohn")
['<s>', '▁John', '▁John', '<0x0A>', 'John'] - [1, 2259, 2259, 13, 11639]

Here, we can see that the LLaMA tokenizer applies the equivalent of the add_prefix_space option (after the leading <s> token), so the first two occurrences of John map to the same token ID: 2259. However, the John right after the line break (\n) has no preceding space and is still a different token, with ID 11639.

To handle this for multiline text, an additional normalization step can be applied. Along with add_prefix_space=True, a Replace("\n", "\n ") rule can be employed before tokenization, adding a space after each line break so that the next word takes its space-prefixed form. A matching denormalization step, Replace("\n ", "\n"), reverts this during decoding. Keep in mind that this injects characters that were not in the original text, so the denormalization step is essential, and care is needed with whitespace-sensitive inputs such as source code.
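The round trip can be sketched with plain string operations. This is a stand-in for the Replace rules named above, not the library's implementation: `normalize`, `denormalize`, and `chunks` are hypothetical helpers, the prefix space is added unconditionally so that it can be removed unambiguously, and the pattern is a simplified ASCII pre-tokenizer.

```python
import re

# Simplified ASCII-only pre-tokenization pattern (assumption).
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def normalize(text):
    """Add a prefix space, then a space after every line break."""
    return (" " + text).replace("\n", "\n ")

def denormalize(text):
    """Undo normalize(): drop the injected spaces."""
    return text.replace("\n ", "\n")[1:]

def chunks(text):
    return [c.replace(" ", "Ġ") for c in PRETOKEN.findall(text)]

text = "John John\nJohn"
print(chunks(text))             # ['John', 'ĠJohn', '\n', 'John']
print(chunks(normalize(text)))  # ['ĠJohn', 'ĠJohn', '\n', 'ĠJohn']

# The injected spaces are removed again, so the original text is recovered.
assert denormalize(normalize(text)) == text
```

After normalization every John is in its space-prefixed form, and the decode-side replacement restores the original string, including lines that already started with a space.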

Conclusion

Subword tokenizers represent a word with more than one token depending on a leading space, and in the GPT2Tokenizer roughly 17% of the vocabulary entries are a second form of a word that is already present. These forms look redundant, but the leading space encodes word-boundary information, and models readily learn closely related embeddings for the two variants, as the cosine-similarity heatmaps show. The practical takeaway is therefore less about eliminating "redundant" tokens and more about consistency: applying a prefix space (and, for multiline text, normalizing line breaks) so that the same text is always tokenized the same way, matching how the model was trained. Handled consistently, these multiple token forms are a non-issue rather than a hidden inefficiency.