How do LLMs process text data — A deep dive into Tokenization (Part-1)

Deva Kumar Gajulamandyam
4 min read · Jun 13, 2024


In the realm of natural language processing (NLP) and deep learning, Large Language Models (LLMs) such as GPT-4 have revolutionized the way machines understand and generate human language. However, these models cannot process raw text directly. Instead, the text must be converted into a format that neural networks can handle. This process of converting text into continuous-valued vectors is known as embedding.

Embedding matters because neural networks operate on numerical data: the embedding model takes raw input and transforms it into a vector representation the model can digest. Among the various types of embeddings, word embeddings (like Word2Vec) are most commonly used for language models, while sentence or paragraph embeddings are preferred in Retrieval-Augmented Generation (RAG) applications.

Text Processing Pipeline:

This pipeline consists of 3 steps:

  1. Text to Tokens — Split the text into smaller chunks called tokens. These tokens can be words, characters, numbers, symbols, etc.
  2. Tokens to IDs — Next, each distinct token is assigned a unique integer value. This is done by the tokenizer. The output of this step is a dictionary mapping every unique token to its token_id (often referred to as the vocabulary).
  3. Token IDs to Embeddings — Finally, these token IDs are converted into vector representations, known as token embeddings, which are used by the neural network.
Raw text -> Tokens (words or special characters) -> Token IDs -> Token Embeddings
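
To make these three steps concrete, here is a rough, self-contained sketch. The sample sentence, the regex, and the tiny 8-dimensional embedding layer are illustrative choices for this article, not what a production LLM uses:

import re
import torch

raw_text = "The quick brown fox jumps over the lazy dog."

# Step 1: Text -> Tokens
tokens = [t.strip() for t in re.split(r'([,.?"\'_!()]|--|\s)', raw_text) if t.strip()]
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

# Step 2: Tokens -> IDs, using a vocabulary of the distinct tokens
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = torch.tensor([vocab[t] for t in tokens])

# Step 3: Token IDs -> Embeddings (a randomly initialized embedding layer)
embedding_layer = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
print(embedding_layer(token_ids).shape)  # torch.Size([10, 8])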

Role of the Tokenizer: A tokenizer is a crucial component in this process, with two primary methods — encode and decode.

Encode: This method takes in text, splits it into tokens, and converts these tokens into IDs based on the vocabulary.

Decode: This method performs the reverse operation. It takes token IDs, converts them back into text tokens, and concatenates these tokens to form natural text.

Text to Tokens (using a simple tokenization strategy — each word or special character is considered a token)
Vocabulary Dictionary
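
A minimal sketch of how such a vocabulary dictionary can be built (the two toy sentences below stand in for whatever training text you actually use):

import re

raw_text = "the cat sat on the mat. the dog sat on the rug."

preprocessed = re.split(r'([,.?"\'_!()]|--|\s)', raw_text)
preprocessed = [token.strip() for token in preprocessed if token.strip()]

# The vocabulary maps every distinct token to a unique integer
all_tokens = sorted(set(preprocessed))
vocab = {token: token_id for token_id, token in enumerate(all_tokens)}
print(vocab)
# {'.': 0, 'cat': 1, 'dog': 2, 'mat': 3, 'on': 4, 'rug': 5, 'sat': 6, 'the': 7}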
# Implementing a simple Text Tokenizer
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> token_id
        self.int_to_str = {i: s for s, i in vocab.items()}   # token_id -> token

    def encode(self, text):
        # Split on punctuation, double dashes and whitespace, then drop empty strings
        preprocessed = re.split(r'([,.?"\'_!()]|--|\s)', text)
        preprocessed = [token.strip() for token in preprocessed if token.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that was inserted before punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
Tokenizer encode and decode methods
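
A quick usage sketch with the toy cat-and-dog vocabulary built above:

tokenizer = SimpleTokenizerV1(vocab)

ids = tokenizer.encode("the cat sat on the rug.")
print(ids)                    # [7, 1, 6, 4, 7, 5, 0]
print(tokenizer.decode(ids))  # the cat sat on the rug.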

Everything looks good, right? But take a look at the example below 😑

Error in encoding text
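
A minimal reproduction of that failure with the same toy vocabulary (which never saw the word "Hello"):

tokenizer = SimpleTokenizerV1(vocab)
tokenizer.encode("Hello, the cat sat on the mat.")
# KeyError: 'Hello' -- the word has no token_id in the vocabulary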

The error occurs because the word "Hello" is not in the training text and, in turn, does not have a token_id in the vocabulary. Another challenge commonly encountered when training LLMs is how to separate texts that come from different documents.

To handle such challenges, we add special context tokens in the model vocabulary.

  • <|unk|> — This token replaces all out-of-vocabulary words.
  • <|endoftext|> — This token signifies the beginning or end of a new document.
  • [BOS] — (Beginning of Sequence) Indicates the start of a sequence.
  • [EOS] — (End of Sequence) Indicates the end of a sequence.
  • [PAD] — (Padding) Used to pad sequences to a uniform length.
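
The modified tokenizer below only relies on <|unk|> and <|endoftext|>. As a rough sketch, the earlier toy vocabulary can be extended with these two tokens like so:

# `preprocessed` is the token list from the vocabulary-building sketch above
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: token_id for token_id, token in enumerate(all_tokens)}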
# Implementing the modified Text Tokenizer
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?"\'_!()]|--|\s)', text)
        preprocessed = [token.strip() for token in preprocessed if token.strip()]
        # Replace any token that is not in the vocabulary with the <|unk|> token
        preprocessed = [token if token in self.str_to_int else "<|unk|>" for token in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that was inserted before punctuation characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
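
With the extended vocabulary, the sentence that previously raised a KeyError now encodes and decodes cleanly: unknown words come back as <|unk|>, and two documents can be joined with <|endoftext|>:

tokenizer = SimpleTokenizerV2(vocab)

text = "Hello, the cat sat on the mat. <|endoftext|> the dog sat on the rug."
ids = tokenizer.encode(text)
print(tokenizer.decode(ids))
# <|unk|> <|unk|> the cat sat on the mat. <|endoftext|> the dog sat on the rug.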

However, most advanced LLMs like GPT and Llama don't use many of these tokens, since they add complexity and hurt efficiency. Instead, these models opt for a tokenization technique called Byte Pair Encoding (BPE). Unlike the strategy above, BPE breaks unknown words down into subword units, so no <|unk|> token is needed. The concepts and inner workings of BPE will be covered in the next article.
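
As a quick preview, a GPT-2-style BPE tokenizer is available through OpenAI's open-source tiktoken library. Notice that even a made-up word round-trips without any <|unk|> token, because it gets split into known subword pieces:

import tiktoken  # pip install tiktoken

bpe_tokenizer = tiktoken.get_encoding("gpt2")
ids = bpe_tokenizer.encode(
    "Hello, do you like tea? <|endoftext|> someunknownPlace",
    allowed_special={"<|endoftext|>"}
)
print(ids)
print(bpe_tokenizer.decode(ids))  # reproduces the original text exactly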

The code file used to demonstrate the concepts covered in this article is linked below as a GitHub Gist. Save it and play around!

Python Notebook with code implementation of Tokenizers

If you find any mistakes or are interested in collaborating on more such engaging topics, feel free to reach out to me on LinkedIn.

References

  1. Building Large Language Models from Scratch by Sebastian Raschka
  2. LLMs-from-scratch GitHub repository
