Understanding the Impact of Token Redundancy in Language Models
31 May 2023
TL;DR
Subword tokenization models, such as SentencePiece and Byte-Pair Encoding (BPE),
treat words and subwords differently, which can lead to redundancy and
inefficiency in language models. This blog explores the phenomenon of token
duplication caused by the absence of prefixed spaces in certain tokens. It
examines the implications of this issue and presents solutions to mitigate the
problems associated with duplicated tokens. Additionally, the blog highlights
the models' ability to learn and treat these tokens similarly through the
analysis of cosine similarity heatmaps of embedding vectors.
Introduction
Subword tokenization models, such as SentencePiece and Byte-Pair Encoding (BPE),
handle words and subwords differently. In this section, we will explore this
distinction using the GPT2Tokenizer as an example. The GPT2Tokenizer is a BPE
tokenizer commonly used in various pre-trained language models like GPT-2,
GPT-3, RoBERTa, BART, and others.
BPE models learn vocabulary by progressively merging the most frequently
occurring character or token pairs. This iterative process gradually constructs
a subword-level vocabulary capable of handling Out-of-Vocabulary (OOV) words and
rare terms in Natural Language Processing (NLP) tasks. Unlike tokenizers that
normalize or otherwise preprocess the text first, byte-level BPE models such as
the GPT2Tokenizer operate directly on raw bytes before the vocabulary is learned.
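As a rough sketch of this iterative merging process (a toy corpus and an arbitrary vocab_size, not the training setup of any released model), the huggingface/tokenizers library can train a small byte-level BPE vocabulary in a few lines:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy byte-level BPE: start from the raw byte alphabet and learn merges from a tiny corpus.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=500,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
corpus = ["Hello World", "Hello there, World", "The World says Hello"]
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("Hello World").tokens)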
Both BPE and SentencePiece models reserve tokens to represent the space
character. However, it's worth noting that the space character is also
encoded in nearly 50% of the tokens. When a token is prefixed with a space, it
signifies that the token represents a complete word rather than a subword.
Typically, a special character is used to represent the prefixed space. For
instance, the GPT2Tokenizer uses Ġ, while SentencePiece tokenizers employ the
metaspace character ▁.
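The share of space-prefixed tokens is easy to check yourself. Here is a quick sketch with the transformers library (assuming the public gpt2 checkpoint can be downloaded):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()                        # maps token string -> token id
with_space = sum(1 for token in vocab if token.startswith("Ġ"))
print(with_space, len(vocab))                        # a large share of the ~50k entries carry the Ġ prefix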
Let's illustrate this with an example using the GPT2Tokenizer:
>>> gpttokenizer('Hello World')
['Hello', 'ĠWorld'] - [15496, 2159]
In this case, ĠWorld is a complete-word token, whereas
Hello is not considered a complete-word token since it lacks a
prefixed space. This distinction lies at the root of the problem, as it leads
the BPE model to learn different tokens for the same word.
Take a look at the following examples:
>>> gpttokenizer('Hello World')
['Hello', 'ĠWorld'] - [15496, 2159]
>>> gpttokenizer(' Hello World')
['ĠHello', 'ĠWorld'] - [18435, 2159]
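These pseudo-outputs can be reproduced with the transformers library; the real API returns tokens and IDs through separate calls, so the sketch below simply prints them side by side:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Hello World"), tokenizer.encode("Hello World"))
# ['Hello', 'ĠWorld'] [15496, 2159]
print(tokenizer.tokenize(" Hello World"), tokenizer.encode(" Hello World"))
# ['ĠHello', 'ĠWorld'] [18435, 2159]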
Here, we can observe that the same sentence is tokenized differently based solely
on the presence or absence of a leading space. When the BPE vocabulary is
constructed, a word that appears at the start of a line (and therefore has no
preceding space) is added to the vocabulary without a prefix; when the same word
appears elsewhere within a line, it is added again with a space prefix.
Upon analyzing the GPT2Tokenizer vocabulary, we can identify approximately
8,535 words that exhibit this behavior. Here are a few examples:
sharing, Ġsharing
mask, Ġmask
said, Ġsaid
20, Ġ20
hold, Ġhold
podcast, Ġpodcast
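A short sketch to reproduce this count; the exact number depends on how strictly you filter (for instance, whether subwords, numbers, and punctuation are included):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()
# tokens that exist both with and without the Ġ (space) prefix
duplicates = [t for t in vocab if not t.startswith("Ġ") and ("Ġ" + t) in vocab]
print(len(duplicates))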
In other words, roughly 20% of the GPT2Tokenizer vocabulary consists of
unnecessary, redundant tokens.
So, what kind of problems does this redundancy create?
Consider the following tokenization examples using both the GPT2Tokenizer and
BLOOM tokenizer:
>>> gpttokenizer("Ali is a student, Ali is a father")
[37893, 318, 257, 3710, 11, 12104, 318, 257, 2988]
>>> bloomtokenizer("Ali is a student, Ali is a father")
[86256, 632, 267, 30832, 15, 25427, 632, 267, 27782]
In the GPT2Tokenizer output, the first occurrence of Ali is encoded as 37893,
while the second occurrence is encoded as 12104. The BLOOM tokenizer shows the
same behavior, encoding the two occurrences as 86256 and 25427.
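You can verify that the two IDs decode back to the same word, differing only by the leading space; here is a minimal check with the transformers library (shown for GPT-2; the BLOOM tokenizer can be inspected the same way):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Ali is a student, Ali is a father")
print(tokenizer.convert_ids_to_tokens(ids))                               # 'Ali' vs 'ĠAli' get different ids
print(repr(tokenizer.decode([ids[0]])), repr(tokenizer.decode([ids[5]])))  # 'Ali' and ' Ali'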
This presents two key problems. First, the model needs to
learn that these different tokens represent the same entity, resulting in
additional computational overhead. Second, we end up storing two separate
embedding vectors for the same word, which leads to unnecessary memory
consumption.
This redundancy not only hampers the model's efficiency but also wastes
computational resources and memory. In an ideal scenario, the model should only
need to learn and store a single representation for each unique word, rather
than duplicating tokens for the same word with slight variations.
How do the models treat this phenomenon?
Models seem to learn to handle these tokens in a similar manner. To illustrate
this, let's examine the cosine similarity of embedding vectors for such
tokens.
Heatmaps of cosine similarity between embedding vectors of 5 and 1,000 tokens
(respectively) from the GPT2Tokenizer vocabulary. The tokens are picked and
sorted randomly.
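Here is a minimal sketch of how such a heatmap can be produced, using a handful of the duplicated words listed above rather than a random sample, and the small gpt2 checkpoint:

import matplotlib.pyplot as plt
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()    # shape: (vocab_size, hidden_size)

words = ["sharing", "mask", "said", "20", "hold"]     # words that also exist as a Ġ-variant
tokens = words + ["Ġ" + w for w in words]
ids = tokenizer.convert_tokens_to_ids(tokens)

vecs = torch.nn.functional.normalize(emb[ids], dim=-1)
sim = vecs @ vecs.T                                   # pairwise cosine similarity

plt.imshow(sim.numpy())
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar()
plt.show()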
The heatmaps demonstrate that the model learns to cluster these tokens together:
despite being stored as separate vocabulary entries, the duplicated tokens end up
with very similar embeddings. This can be attributed to the fact that the model
is trained on a large amount of data and is therefore able to learn the
similarity between these tokens.
This property could be exploited to reduce the vocabulary size or the memory
footprint of the model by merging such tokens together, but that is not the topic
of this article.
What is the solution?
The most effective solution to address this issue is to add a prefix space before
tokenizing each line. This approach is already implemented in most tokenizers:
newer tokenizers built with the huggingface/tokenizers library often expose an
add_prefix_space=True option, which adds a space to the beginning of the string
before tokenization.
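For example, with the fast GPT-2 tokenizer in transformers (a small illustration; the option changes how the input is pre-tokenized, not the pretrained vocabulary itself):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
print(tokenizer.tokenize("Hello World"))   # ['ĠHello', 'ĠWorld'] - both words now get the space-prefixed form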
However, this solution doesn't completely resolve the problem for multiline
tokenizers. Let's consider the example of the LLaMA tokenizer:
>>> llamatokenizer("John John\nJohn")
['<s>', '▁John', '▁John', '<0x0A>', 'John'] - [1, 2259, 2259, 13, 11639]
Here, we can see that the LLaMA tokenizer applies a prefix space, so the first
two occurrences of John map to the same token ID (2259). However, the John that
follows the line break (\n, encoded as <0x0A>) is still treated as a different
token, with ID 11639.
To address this issue in multiline tokenizers, an additional normalization step
can be applied. Along with add_prefix_space=True, a Replace("\n", "\n ") rule
can be employed before tokenization. This rule adds a space after each line
break, ensuring consistent tokenization. Similarly, a denormalization step with
the Replace("\n ", "\n") rule can be used to revert the normalization during
decoding or post-processing.
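Here is a sketch of the idea in plain Python, wrapping an existing tokenizer instead of modifying its internal normalizer; GPT-2 stands in for any tokenizer with a prefix-space convention, and normalize/denormalize are our own helper names:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def normalize(text):
    # prefix space, plus a space after every line break, so each word takes its space-prefixed form
    return " " + text.replace("\n", "\n ")

def denormalize(text):
    # revert the normalization after decoding (removeprefix requires Python 3.9+)
    return text.replace("\n ", "\n").removeprefix(" ")

ids = tokenizer.encode(normalize("John John\nJohn"))
print(tokenizer.convert_ids_to_tokens(ids))   # all three occurrences map to the same 'ĠJohn' token
print(denormalize(tokenizer.decode(ids)))     # John John\nJohn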
Conclusion
Token duplication due to the absence of prefixed spaces in subword tokenization
can pose challenges in language models. Approximately 20% of the vocabulary in
tokenizers like GPT2Tokenizer is unnecessarily repeated, leading to wasted
computational resources and memory. However, models are capable of learning to
treat these tokens similarly, as demonstrated by cosine similarity heatmaps. The
most effective solution to address this issue is to add a prefix space before
tokenization, which most tokenizers already support. For multiline
tokenizers, an additional normalization step involving the insertion of spaces
after line breaks can ensure consistent tokenization. By implementing these
solutions, we can improve the efficiency and effectiveness of language models
while reducing redundancy in token representations.