Tokens and Text

👋

Welcome Back!

Previously, we learned how LLMs are trained on massive datasets to understand patterns in language. Now, we’ll zoom in on the building blocks of language for the model: tokens. You’ll see how LLMs break text into manageable pieces, convert them into numbers, and use these representations to understand and generate language effectively.

🔤

Tokens: Breaking Down Text

What is a token? A token is a piece of text, such as a word, part of a word, or a punctuation mark, that the model processes. For example, "cat" is one token, and a long word like "unpredictable" might be split into tokens like "un", "predict", "able".

Tokenization: Before training or answering, a component called the tokenizer breaks the input text into tokens. This lets the LLM handle any word by building it from smaller parts.

Why split text? Using tokens makes the model's job easier. It ensures even rare or new words can be represented, and it standardizes the input.
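To make the idea of building a word from smaller parts concrete, here is a minimal sketch of a greedy longest-match splitter. The vocabulary `VOCAB` and the function `tokenize_word` are made up for illustration; real tokenizers learn their vocabularies from data and often use different algorithms.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"un", "predict", "able", "cat", "s"}

def tokenize_word(word, vocab=VOCAB):
    """Split a word into the longest known pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

print(tokenize_word("cat"))            # ['cat']
print(tokenize_word("unpredictable"))  # ['un', 'predict', 'able']
```

Because of the single-character fallback, this sketch never fails on an unfamiliar word; it just produces more, smaller tokens.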

🧩

Tokenization Example

Example: The sentence "I love puppies!" might be split into tokens like ["I", "love", "puppies", "!"] depending on the model's tokenizer. Another model might split it into subwords, e.g. ["I", " love", " pupp", "ies", "!"]. Either way, joining the tokens back together reproduces the original text.

Analogy: Think of tokens like LEGO bricks for language. Just as you build big structures from small bricks, the model builds whole sentences from these token pieces.

Note: Different LLMs use different token schemes, but the idea is always to break text into manageable pieces.
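The word-level split of "I love puppies!" shown above can be sketched with a regular expression. This is only an illustration using Python's standard `re` module; the function name `split_words_and_punctuation` is hypothetical, and production tokenizers are considerably more involved.

```python
import re

def split_words_and_punctuation(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(split_words_and_punctuation("I love puppies!"))
# ['I', 'love', 'puppies', '!']
```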

❓

Why Tokenize Text?

Handling Vocabulary: By using subword tokens (like common prefixes or suffixes), LLMs can cope with new words. A brand-new word can be broken into known token pieces.

Consistency: Tokenization helps standardize text across languages and writing styles. Each token is then mapped to an integer ID, which the model uses to look up that token's embedding (a list of numbers).

Vocabulary Size: Instead of having a separate entry for every possible word, the model only needs a manageable list of common token pieces (often tens of thousands, sometimes a few hundred thousand). This makes training and inference more efficient.
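As a rough sketch of what a vocabulary looks like in practice, the example below maps each token to an integer ID and back. The five-entry vocabulary and the `<unk>` placeholder for unknown tokens are assumptions made for this illustration; a subword tokenizer would instead break an unfamiliar word into known pieces.

```python
# Hypothetical vocabulary: token string -> integer ID.
token_to_id = {"<unk>": 0, "I": 1, "love": 2, "puppies": 3, "!": 4}
id_to_token = {i: t for t, i in token_to_id.items()}

def encode(tokens):
    """Replace each token with its ID; unknown tokens get the <unk> ID."""
    return [token_to_id.get(t, token_to_id["<unk>"]) for t in tokens]

def decode(ids):
    """Turn a list of IDs back into token strings."""
    return [id_to_token[i] for i in ids]

ids = encode(["I", "love", "puppies", "!"])
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # ['I', 'love', 'puppies', '!']
print(encode(["I", "love", "kittens"]))  # [1, 2, 0]
```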

🔢

From Tokens to Numbers

Embeddings: After tokenization, each token is mapped to a vector (a list of numbers) called an embedding. This numerical form lets the model do mathematical calculations on words.
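An embedding lookup can be pictured as indexing into a table with one row per token. The sketch below assumes a made-up table with three tokens and three numbers per row; real models learn these values during training and use hundreds or thousands of numbers per token.

```python
# Hypothetical embedding table: row i holds the vector for token ID i.
embedding_table = [
    [0.0, 0.0, 0.0],   # ID 0
    [0.2, -0.1, 0.7],  # ID 1
    [0.9, 0.3, -0.4],  # ID 2
]

def embed(token_ids):
    """Look up the embedding vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

print(embed([2, 1]))  # [[0.9, 0.3, -0.4], [0.2, -0.1, 0.7]]
```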

Semantic Space: Tokens that appear in similar contexts get similar embeddings. For instance, the token "dog" might end up close to "cat" in this vector space.
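One common way to measure how close two embeddings are is cosine similarity. The sketch below uses invented three-number vectors for "dog", "cat", and "car" purely to illustrate the comparison; they are not taken from any real model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for illustration only.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(dog_cat > dog_car)  # True: "dog" sits closer to "cat" than to "car"
```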

Role of Embeddings: These vectors carry meaning about words (like semantic features). They flow through the neural network layers during both training and text generation, allowing the LLM to build up its understanding.

🪟

Context Window

Definition: Each LLM has a limit on how many tokens it can consider at once. This is called the context window.

Implication: Early models could only see a few thousand tokens (a few pages of text), while newer models can handle tens of thousands to hundreds of thousands of tokens at once.

Real-world example: If you give the model a very long document, the document might need to be shortened or split into parts, since the model only "remembers" up to its window size.
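Splitting a long document to fit the window can be sketched as cutting its token list into consecutive chunks. The function `split_into_windows` and the window size of 4 are hypothetical and chosen only to keep the example small; real applications often overlap chunks or summarize earlier parts instead.

```python
def split_into_windows(tokens, window_size):
    """Split a token list into consecutive chunks of at most window_size tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [tokens[i:i + window_size] for i in range(0, len(tokens), window_size)]

# Ten placeholder tokens standing in for a long document.
document_tokens = ["tok%d" % i for i in range(10)]
chunks = split_into_windows(document_tokens, window_size=4)
print([len(c) for c in chunks])  # [4, 4, 2]
```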

Quiz

In The Context Of LLMs, What Is A Token?

A

A single letter in the text

B

A type of neural network layer

C

An entire paragraph

D

A word or part of a word that the model processes

Fill in the Blank

Converting raw text into tokens is called ___.

Embedding

Tokenization

Training

Parsing

Reflection

💭

Why do you think LLMs break text into tokens instead of using whole sentences directly?

Consider what happens with very long words or made-up words. How does tokenization help the model handle new or complex language?

Lesson Completed!

Great work on completing this lesson. Next, we will explore the tools and analytics of SEO!