Building Transformers From Scratch: Complete Guide to Architecture, Theory, and PyTorch Code

1. Introduction

Transformers have revolutionized the field of Deep Learning since their introduction in the groundbreaking 2017 paper "Attention is All You Need." Initially designed for Neural Machine Translation (NMT), these architectures rapidly became the backbone of virtually every state-of-the-art NLP system and have even expanded into computer vision, audio processing, and beyond.

What's fascinating about Transformers is that despite their remarkable capabilities, they're built upon surprisingly simple principles that are even simpler than those of their predecessors, like LSTMs and GRUs. The true innovation lies in a mechanism called "Self-Attention," an elegant solution that allows each element in a sequence to communicate directly with every other element, creating rich contextual understanding.

In this article, we'll strip away the complexity and dive into the core mechanics of Transformers. We'll explore how they work, examine each crucial component, understand mathematical operations and computations happening inside, and then put theory into practice by building a complete Transformer from scratch using PyTorch.

By the end, you'll not only understand why Transformers have become so dominant but also possess the knowledge to implement and customize them for your own projects. Let's go!

2. The Transformer Architecture

The transformer architecture consists of simple components that anyone with a basic understanding of neural networks can grasp. However, the true innovation lies in how the authors integrated and aligned these components, along with some architectural modifications. First, the specific architecture called the Encoder-Decoder Architecture is used in the original paper for NMT, which is what we will understand and implement. In this section, we are decomposing the core parts of the transformer one by one to understand them better.

2.1 Encoder-Decoder Architecture

The Encoder-Decoder architecture has been the popular choice for Neural Machine Translation (NMT) for many years, regardless of the specific model employed. The Encoder is responsible for processing information from the source language, which is the language we want to translate from, while the Decoder utilizes the contextual information provided by the Encoder to translate into the target language.

Here is the Transformer Encoder-Decoder version that is used for NMT.

Transformer Encoder Decoder Architecture

This architecture might confuse some of you at first glance, even if you have read the original paper. The diagram above offers a detailed view of the various computations that take place in Transformer networks. However, don't worry, after reading further, revisit the diagram, and you will be amazed to see the flow of data through the transformer.

First, let's start with a simple overview of input-to-output mapping in the transformer.

Formaly, The Encoder maps a source language token sequence \(x_1, \ldots, x_n\) to a continuous representation \(z_1, \ldots, z_n\) and the Decoder uses this representation to generate the target sequence \(y_1, \ldots, y_m\) one token at a time (Autoregressive-decoding).

Before diving into the core architecture components, let's understand how inputs are represented inside a Transformer!

2.2 Tokenization

Text data are usually represented as sentences or paragraphs that convey something, Inorder for language models to work with this, we need to tokenize (convert them into words or subwords), and tokenization itself requires another article because there are a variety of strategies used, like Byte-Pair Encoding, Character-level, Word-level, Subword-level, etc. Tokenization is not only specific to Transformer models, but it is common in most of the language modelling tasks.

For now, we only need to know that tokenization is the idea of converting a sequence of data into chunks. The smallest unit with which the model can process data.

After tokenization, we assign each token chunk a unique integer ID, allowing them to have an identity similar to a primary key in a database. These token IDs serve as the primary input to the model.

Tokenization Demo

2.3 Word Embeddings (Vector Representation that encodes Word Meanings)

Now, after tokenization, the next step is to represent these tokens with a high-dimensional vector representation. There are two ways to do this, one is in the form of one-hot encoding, where we set 0 everywhere except a single 1 at the corresponding token ID index. These are called Sparse vector representations. The problem with sparse representation is that it requires a lot of memory as the vocabulary grows, and it is less efficient when the data contains more continuous relations and patterns.

There is another method called Dense Embedding, which is are vector representation that is learned in a continuous space. In simpler terms, embeddings are a format that deep neural networks process efficiently. These Dense Embedding representations can capture the semantic relationships between different concepts and require less memory.

Consider we have a vocabulary size \(V\) (Vocabulary is all the unique possible tokens the model needs to process). and the model dimension \(d_{model}\), we learn an embedding matrix also called embedding weight matrix \(W_e \in \mathbb{R}^{V \times d_{\text {model }}}\). A token \(x\) which is a vector is mapped to \(W_e^{\top} x \in \mathbb{R}^{d_{\text {model }}}\). So \(x\) is not represented using one-hot encoding here, but a dense, low-dimensional learned vector that reduces the memory and allows better representation capability.

The Embedding Matrix is a large lookup table containing learned token embeddings for every token in the vocabulary. We retrieve the corresponding token embeddings using the sequence of token IDs we have, which represents the input embedding for the further components of the transformer.

Embedding Matrix Demo

Here is a visual geometric way to understand Embedding Vectors in 3D space.

Embeddings in 3D space

Here, the direction of the vector represents the semantic relationships in multiple dimensions. For example, "King" & "Queen" are more semantically similar when it comes to "Royalty Dimension" (Because both represent royalty). "Man" & "Woman" are semantically identical when it comes to "Non-Royalty Dimension".

However, "King" & "Man" would be similar when considering the "Gender Dimension". The above image is a rough understanding of how embeddings encode the meaning of particular words.

One more thing to note is that similar relationships are parallel, eg, \( \text{King} - \text{Man} \approx \text{Queen} - \text{Woman} \). And this is why the famous analogy \( \text{King} - \text{Man} + \text{Woman} \approx \text{Queen} \) works, It just says that when we take the Embedding of the King, subtract the man component, and then add the woman component, it will be approximately equal to the Emebdding to Queen.

2.3.1 Source Embedding

Source embedding is the mapping of source tokens from the source language into their corresponding embeddings.

Formaly, In the Transformer encoder, each source token \(x_t^{src}\) is mapped to a dense vector using the source embedding matrix \(W_e^{src}\). Next, a positional encoding vector of the same dimension is added to the source embedding (we'll discuss positional encoding later). This results in a summed vector, \(h_t^{(0,src)}\), which serves as the input to the encoder stack.
\[h_t^{(0,src)} = {W_e^{src}}^\top x_t^{src} + \operatorname{PE}^{src}(t),\]

where \(\operatorname{PE}^{src}(t)\) is a function that encodes the token position \(t\) within the source sequence.

2.3.2 Target Embedding

Target Embedding is the mapping of target tokens from the target language into their corresponding embeddings. However, the way target embeddings are represented is a little different from source embeddings.

The decoder is auto-regressive, means it predicts the next token based on previous ones, and hence we'll represent the target sequence shifted right by one position from the original labels (labels used to train the transformer), and a start-of-sequence token (<sos>) is prepended; this token will help the decoder to get a starting point for generating target sequences.

Now, don't need to be confused by target and labels, they are basically the same, but targets are one shifted version of the label, which are passed as an input to the decoder.

Target Embedding (Shifting Positions)

The image here represents the difference between how source embeddings, target embeddings, and labels are represented in a transformer for NMT.

Formaly, the true previous tokens are provided as inputs to the decoder, each target token \(y_{t - 1}\) is embedded using the target embedding matrix \(W_e^{tgt}\) and added to its positional encoding \(\operatorname{PE}^{tgt}(t-1)\), The resulting summed up target vector will be,

\[g_{t-1}^{(0,tgt)} = {W_e^{tgt}}^\top y_{t - 1}^{tgt} + \operatorname{PE}^{tgt}(t - 1),\]

The decoder's job is to take these offset embeddings and predict the next token one at a time. During training, the predictions made by the decoder are compared to the labels to compute the loss; this method is known as Teacher Forcing.

2.4 Positional Encoding

In our previous discussion, we just touched on positional encoding (PE), which is, believe it or not, really essential in transformer models. This encoding addresses a fundamental challenge: Transformers naturally process sequences in parallel (Take all the token embeddings in parallel and process them in batch rather than processing one after the other), which means they lack the inherent understanding of word order that recurrent neural networks possess. Without positional encoding, a transformer would treat a sequence as a bag of tokens, losing critical syntactic and semantic information that depends on the arrangement of words.

There are several ways to do this, at least in theory. We can think about adding positional indices for each token in the sequence, like (0, 1, 2,...), and then adding these scalar to the embeddings, but naively adding positional indices simply does not work because of three main reasons, the first one is the problem of relative positioning (How far does the word "school" from "going") and the second one is the continuity and differentiablity. Gradient-based learning always prefers continuous and differentiable values rather than scalar or discrete values, and the third one is the mismatch to token embeddings because raw indices live on a different scale than embeddings, so we need to encode the position information, considering,

It must be a continuous vector representation rather than a discrete scalar representation.
It must encode relative and absolute positions.
Must scale for longer sequences.

2.4.1 Sinusoidal Positional Encoding

The authors of the original paper used a clever mechanism to tackle this problem, which does not require extra parameters, scales well, is continuous, and differentiable, called the "Sinusoidal Positional Encoding," which has a range from -1 to +1.

The Sinusoidal Positional Encoding uses sin and cosine functions of varying frequencies across dimensions to encode absolute and relative positions to create a vector representation that can be added to the token embeddings. The calculations involve,

\[\text{PE}(p, 2i) = \sin\left( \frac{p}{10000^{\frac{2i}{d_{\text{model}}}}} \right)\]

\[\text{PE}(p, 2i+1) = \cos\left( \frac{p}{10000^{\frac{2i}{d_{\text{model}}}}} \right)\]

Where,
\(p\) represents position (eg, token index),
\(i\) represents dimension index,
\(d_{model}\) represents the model embedding dimension.

Here, the sine function is applied to even dimension indices \(2i\) and the cosine function is applied to odd dimension indices \(2i + 1\) respectively.

This term \(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\) determines the frequency of the wave. Let's dissect these in parts for better understanding.

The fundamental property of sine and cosine waves is that they repeat or cycle every \(2\pi\) radians. Since we have used the value \(10000\), the wavelength range geometrically from \(2\pi\) to \(10000 \times 2\pi\) covering multiple frequency scales. This is useful in our case, and you will understand why by reading further.

The term \(\frac{2i}{d_{\text{model}}}\) is the scaling factor, during the earlier dimensions where \(i\) is small (eg, 0, 1), this term will give a very small number, \(10000\) raised to a small power is relatively small number. Now dividing \(pos\) by a small number means the value of sine and cosine functions changes rapidly because a larger value of sine or cosine will make it cycle faster within small intervals, which means it has a very high frequency and a short wavelength.

On the other hand, if \(i\) is large where it is approaching \(\frac{d_{model}}{2}\), the exponent \(\frac{2i}{d_{\text{model}}}\) approaches \(1\). \(10000\) raised to a power closer to one is a large number (close to 10000). Dividing \(pos\) by a large number means the sine and cosine cycle will be slower as \(pos\) increases. This corresponds to a low frequency and a long wavelength.

Here is a visual illustration to understand it.

Positional Encoding Frequency on each dimension

As we progress through dimensions, the frequency of waves begins to decrease. The \(0_{th}\) dimension (represented in blue) has the highest frequency, which progressively decreases.

Note: The dots represent the values for each position, which are originally constructed using Python

The values are represented in the table below, which includes the concatenation of the token embedding and positional encoding vector.

Using varying frequencies of sine and cosine functions helps to create a unique encoding for each vector in each position along the dimension, which allows the model to understand the absolute positions of each token. This is one of the reasons for using varying frequencies, but wait, there is more!

2.4.1.1 Relative and Absolute Positions

Since the initial dimensions (0, 1, 2, etc.) have higher frequencies, the values change rapidly, effectively encoding absolute positions within the local context. The larger dimension values change very slowly, resulting in the injection of information about each token's position within the global context.

Not that tricky, the simple idea is that initial dimensions encode where the token's position is among other tokens (local context), and larger dimensions encode where the token's position is across the entire sequence (global context). The combination of these two makes the Relative Position Encoding possible. Always remember that this information is being encoded on each Positional Encoding vector. Here is a visual way to understand it.

Relative & Absolute Positions

The circles represent the increasing wavelength (decreasing frequency) as the embedding dimensions progress.

Relative positions are inferred from the whole mechanism of sinusoidal positional encoding: by comparing the multi‑scale sine and cosine values between any two token embeddings, the model can determine how close or far apart those tokens are in the sequence.

Now let's try to understand how relative position information is injected into the token embeddings. Consider two tokens in a sequence: at position \(pos\) and at position \(pos + k\). The goal of the transformer, or specifically the attention, is to figure out the relationship (like the distance \(k\)) between them, no matter what \(pos\) is.

This is because if we can directly reach another token from the current one that depends only on offset \(k\), it meaningfully encodes the relative position between the two, then the encoding between them captures exactly that “distance” `k`, independent of where in the sequence you are.

The key insight is that for any fixed offset `k`, the positional encoding vector for position `pos + k` (let's call it \(PE_{pos+k}\)) can be calculated from the positional encoding vector for position `pos` (\(PE_{pos}\)) using a linear transformation (specifically, a matrix multiplication) that depends only on the offset `k`, not on the original position `pos`.

Let's Look at a Single Frequency (One Pair of Dimensions)

Remember that for each pair of dimensions \((2i, 2i+1)\), we use the same frequency \(\omega_i = 1 / 10000^{2i/d_{model}}\):
\[PE_{(pos, 2i)} = \sin(\omega_i \cdot pos)\]
\[PE_{(pos, 2i+1)} = \cos(\omega_i \cdot pos)\]

Now let's look at position `pos + k`:
\[PE_{(pos+k, 2i)} = \sin(\omega_i \cdot (pos+k))\]
\[PE_{(pos+k, 2i+1)} = \cos(\omega_i \cdot (pos+k))\]

Using the trigonometric angle addition formulas:
\[\sin(A+B) = \sin A \cos B + \cos A \sin B\]
\[\cos(A+B) = \cos A \cos B - \sin A \sin B\]

Let \(A = \omega_i \cdot pos\) and \(B = \omega_i \cdot k\). Applying these formulas:

\[PE_{(pos+k, 2i)} = \sin(\omega_i \cdot pos) \cos(\omega_i \cdot k) + \cos(\omega_i \cdot pos) \sin(\omega_i \cdot k)\]
\[= PE_{(pos, 2i)} \cdot \cos(\omega_i \cdot k) + PE_{(pos, 2i+1)} \cdot \sin(\omega_i \cdot k)\]

\[PE_{(pos+k, 2i+1)} = \cos(\omega_i \cdot pos) \cos(\omega_i \cdot k) - \sin(\omega_i \cdot pos) \sin(\omega_i \cdot k)\]
\[= PE_{(pos, 2i+1)} \cdot \cos(\omega_i \cdot k) - PE_{(pos, 2i)} \cdot \sin(\omega_i \cdot k)\]

We can write the relationship above using matrices for this pair of dimensions:

\[\begin{pmatrix}PE_{(pos+k, 2i)} \\ PE_{(pos+k, 2i+1)} \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} PE_{(pos, 2i)} \\ PE_{(pos, 2i+1)} \end{pmatrix}\]

Look closely at the 2x2 matrix:
\[M_{i,k} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_ik)\end{pmatrix}\]

The interesting fact, if you followed the math, is that this matrix depends only on the offset `k` and the dimension pair index `i` (which determines \(\omega_i\)). It does NOT depend on the absolute position `pos`. This concludes that from any position `pos`, the `pos + k` only depends on the offset `k` and not the original position of that vector itself.

If you visualize the positional encoding values on a heatmap, it looks like this,

Positional Encoding Heatmap

Here is the code for Positional Encoding in PyTorch:


import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Positional encoding module using sine and cosine functions."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize positional encoding.
        
        Args:
            config: Model configuration

        The Sin Cos postional encoding is a clever way to encode the position of the words in the sentence
        with a continuous function for better gradient flow, instead of using a one-hot encoding.

        It works by creating a matrix of shape (max_seq_length, d_model) and then applying a sine function to the even indices
        and a cosine function to the odd indices. It will create continious unique indices for each word in the sentence.
        This matrix is then added to the input embeddings.
        """
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(config.max_seq_length, config.d_model)
        position = torch.arange(0, config.max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, config.d_model, 2).float() * (-math.log(10000.0) / config.d_model))
        
        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension and register as buffer (not a parameter)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            
        Returns:
            Output tensor with positional encoding added
        """
        # Add positional encoding
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

2.4.2 Positional Embedding

There is one more way we can encode positions for transformer-based models. Rather than using fixed, hand-crafted position encodings, we allocate a unique, trainable vector for each token position. We let the model learn those vectors via gradient descent, just like it learns token embeddings. In this way, the network itself discovers how to represent ordering information during training.

The only real drawback is that trainable positional embeddings introduce extra parameters, so you pay a modest increase in memory and compute compared to fixed encodings. In practice, large‑scale models absorb that overhead easily (given the massive resources they already consume) and benefit from the model’s ability to learn richer, task‑specific position representations.

2.5 The Attention Mechanism

The key feature that makes the Transformer architecture so powerful is the Attention mechanism, particularly Self-Attention. The concept of attention existed before Transformers and has been utilized in CNNs and LSTMs to significantly enhance their ability to focus on important aspects of the data relevant to their tasks. Attention itself has shown greater performance benefits in RNN-based encoder-decoder models for NMT, but still, they are just a reinforcement, not a replacement.

2.5.1 Self-Attention

What made self-attention different from vanilla attention is that the vectors (Q, K, V) are derived from the same input `X`, which avoids the need for recurrence. We'll discuss Q, K, V in a second. To simply put, Self-Attention is the idea of every token in the sequence attending to every other token, including itself. This helps each token learn about and build contextual relationships with its peers.

How can we make tokens attend to each other and pass information? For every conversation to take place, you need at least someone to ask a question, someone to answer the question, and a topic to talk about. In Transformer attention, this is being handled by three vectors: `Query (Q)`, `Key (K)`, and `Value (V)`

Query (Q): The query can be thought of as a vector that asks a question like "What is important for me?". This seeks information from other tokens in the sequence.
Key (K): The key can be thought of as a vector that answers the question "How relevant am I for the question," or "here is what I can offer". It represents the information relevance.
Value (V): The value vector represents the actual content, or the actual semantic information contained in the embedding.

Considering the sentence "The cat sat on the mat", a classic one, concentrate on the word "sat", it is a verb, so context matters a lot, like who sat? where? etc.

The Query (Q) vector for the embedding of "sat" asks something like, "I’m performing an action (a verb). Which other tokens hold the information I need, like who’s doing it and where it’s happening?".

The Key (K) vectors for "mat" and "cat" will answer something like "I'm the location, I might be relevant," and "I'm likely a subject, I'm relevant too!" respectively.

If they match, the Value (V) vector delivers the content if the relevance is high.

Formaly, the Query, Key, and Value are produced by multiplying the input embedding with three learned projection matrices `W_{Q}`, `W_{K}`, and `W_{V}`

Let the input sequence be,

\[X = \bigl[x_1; x_2; \dots; x_n\bigr] \in \mathbb{R}^{n \times d_{\text{model}}}\]

Projection matrices

\[W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k},\quad W_K \in \mathbb{R}^{d_{\text{model}} \times d_k},\quad W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}\]

1. `Query`

\[Q = X\,W_Q \quad\in\mathbb{R}^{n\times d_k}\]

Each row `q_i` is the "question" that token `i` asks of the entire sequence.
Dimension `d_k` sets the space in which similarities are measured.

2. `Key`

\[K = X\,W_K \quad\in\mathbb{R}^{n\times d_k}\]
Each row `k_j` is the "index" for token `j`: how it can be "found" by a query.
Same dimensionality `d_k`

3. `Value`

\[V = X\,W_V \quad\in\mathbb{R}^{n\times d_v}\]
Each row `v_j` is the "content" that token `j` contributes when it's attended to.
Dimension `d_v` controls the size of the output representations.

We have discussed the Query, Key, and Value vectors; now let's understand how they work together within the attention mechanism.

We learned that each token’s Query vector encodes what it’s looking for, and each token’s Key vector encodes what it offers. So they must interact to pass information, and they do that via dot products \(\mathbf{q}_i . \mathbf{k}_j\): A high dot product between them signifies that token `j` is exactly the kind of information token `i` is seeking. This produces an attention score matrix representing the importance of each token for that query, and the value matrix delivers the original content if the match is high.

So we can mathematically represent the attention using the following formula,

\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\biggl(\frac{Q K^\top}{\sqrt{d_k}}\biggr)\,V\]

\(Q \in \mathbb{R}^{n \times d_k}\): matrix of queries (one per output position)

\(K \in \mathbb{R}^{m \times d_k}\): matrix of keys (one per input position)

\(V \in \mathbb{R}^{m \times d_v}\): matrix of values (one per input position)

\(d_k\): dimensionality of queries and keys

\(\mathrm{softmax}\): applied row-wise to convert raw scores into attention weights \(\alpha_{ij}\)

\(\tfrac{1}{\sqrt{d_k}}\): scaling factor to counteract large variance in dot-product magnitudes,

This operation is particularly known as the scaled dot product attention (SDPA).

Scaled Dot Product Attention Visualization

This is the entire computation that happens within a single attention mechanism; it's quite a simple operation with some theory. In fact, the transformer originally implements multiple attention heads, with each head focusing on different parts of the input sequence.

Below is the code for Scaled Dot Product Attention:


def scaled_dot_product_attention(query: torch.Tensor,
                                key: torch.Tensor,
                                value: torch.Tensor,
                                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Compute scaled dot product attention.
    
    Args:
        query: Query tensor (batch_size, num_heads, seq_len, head_dim)
        key: Key tensor (batch_size, num_heads, seq_len, head_dim)
        value: Value tensor (batch_size, num_heads, seq_len, head_dim)
        mask: Optional mask tensor (batch_size, 1, seq_len, seq_len) or (batch_size, 1, 1, seq_len)
        
    Returns:
        Output tensor after attention (batch_size, num_heads, seq_len, head_dim)
    """
    # Get dimensions
    head_dim = query.size(-1)
    
    # Compute scaled dot product
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(head_dim)

    # Apply mask if provided, Issue in masking:
    if mask is not None:

        # Using a smaller value to mask out the scores for better numerical stability.

        # NOTE: We are inverting the masks here, this is because of the way masks are created.
        # We are using True for padding tokens and False for non-padding tokens.
        # So we need to invert the mask to get the correct masking.
        scores = scores.masked_fill(~mask, -1e+30 if scores.dtype == torch.float32 else -1e+4)
    
    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Compute final output
    output = torch.matmul(attention_weights, value)
    
    return output

2.5.2 Multi-Head Self-Attention

Multi‑Head Self‑Attention extends the basic scaled‑dot‑product attention by running `h` heads in parallel, splitting the embeddings into \(\frac{d_{\text{model}}}{H}\) each with its own projection matrices, so that different heads learn to focus on different aspects or subspaces of the embeddings (for example, one head might track syntactic roles, other captures semantic relations, and another captures positional relations), and their outputs are then merged to combine the information from each of them.

Concretely, for head `i` we use learned projection matrices `W_i^Q, W_i^K \in \mathbb{R}^{d_{model}\times d_k}`, `W_i^V \in \mathbb{R}^{d_{model}\times d_v}` to map the input into queries `QW_i^Q`, keys `KW_i^K`, and values `VW_i^V`. We perform attention in each head independently:

\[\mathrm{head}_i \;=\; \mathrm{Attention}\bigl(QW_i^Q,\;KW_i^K,\;VW_i^V\bigr).\]

The outputs of all heads are then concatenated and linearly projected with `W^O` back to dimension `d_{model}`:

\[\mathrm{MultiHead}(Q, K, V) \;=\; \mathrm{Concat}\bigl(\mathrm{head}_1,\dots,\mathrm{head}_h\bigr)\,W^O.\]

Multi-Head Attention has the capability to learn complex patterns that a single attention mechanism cannot capture. For example, multiple attention heads can learn the behavior of the word "orange," functioning as both a noun and an adjective in a sentence like "The orange orange rolled off the table." By incorporating more attention heads, the model can learn multiple features from the same embedding, enhancing its understanding.

Here is a visual way to understand how Multi-Head Self-Attention works. For simplification, we are only considering 2 heads.

Multi-Head Self-Attention Visualization

The above image is a simple visual illustration of how Multi-Head Attention works. In fact, modern language models will have many attention heads. One more thing to note is that in multi‑head attention, the model’s embedding dimension \(d_{\text{model}}\) must be divisible by the number of heads \(H\), so that each head works in a subspace of size \(\frac{d_{\text{model}}}{H}\).

The outputs produced by each attention head are concatenated. It's important to distinguish between concatenation and addition: in the case of positional encoding, we are adding the vectors together, while in concatenation, we combine the sub-vectors to produce an output with the same dimensions.

You might be thinking: after concatenation, the dimensions match, so why bother with the output projection \(W_o\)?

Concatenation simply stitches together each head’s output, but it doesn’t blend their insights. The projection \(W_o\) helps to fuse the diverse patterns discovered by each head into a unified, richer embedding, ready for the next layer.

Here is the code for Multihead Self-Attention:


class MultiHeadAttention(nn.Module):
    """Multi-head attention module."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize multi-head attention.
        
        Args:
            config: Model configuration

        The multihead-attention is the core module where we split the tasks between multiple heads
        and then combine them back to the original dimension. The query, key, and value are linearly projected
        to the required dimensions and then split into multiple heads. The scaled dot product attention is applied
        to each head and then combined back to the original dimension.
        """
        super(MultiHeadAttention, self).__init__()
        
        self.d_model = config.d_model # 512
        self.num_heads = config.num_heads # 8
        assert self.d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        
        # The dimension of each head will be 512 / 8 = 64
        self.head_dim = self.d_model // self.num_heads
        
        # Linear projections

        # These are the projection weights for the query, key, and value
        self.q_proj = nn.Linear(self.d_model, self.d_model) # Query projection
        self.k_proj = nn.Linear(self.d_model, self.d_model)
        self.v_proj = nn.Linear(self.d_model, self.d_model)
        
        # The output projection that takes the concatenated heads and projects them back to the original dimension
        self.output_proj = nn.Linear(self.d_model, self.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self,
               query: torch.Tensor,
               key: torch.Tensor,
               value: torch.Tensor,
               mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Forward pass of multi-head attention.
        
        Args:
            query: Query tensor (batch_size, seq_len, d_model)
            key: Key tensor (batch_size, seq_len, d_model)
            value: Value tensor (batch_size, seq_len, d_model)
            mask: Optional mask tensor
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        batch_size = query.size(0)

        # Linear Projections and Splitting into multiple heads

        # Each tensor will have the shape of (batch_size, num_heads, seq_len, head_dim)
        # Eg: (32, 8, 256, 64), where 64 is the head_dim

        q = self.q_proj(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply scaled dot product attention
        attn_output = scaled_dot_product_attention(q, k, v, mask)
        
        # Combine heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Perform the output projection to maintain the original dimension
        output = self.output_proj(attn_output)
        
        return output

2.6 Layer Norm & Residual Connections

Most of us might be familiar with Residual Connections and Normalization; they are some of the basic and often important concepts in deep learning. Residual Connections are popularly used in CNN models like ResNet to stabilize and prevent the Vanishing Gradient Problem. Since transformers need to be trained on large amounts of data with multiple stacked encoders or decoders, it is essential to ensure that the gradients are flowing correctly. Layer norm and Residual blocks will help us with that.

2.6.1 Layer Normalization

Layer normalization is one of the most important components in Transformers, yet it is often not discussed much. The central idea is to reduce the internal covariate shift. As we train deep models, the preceding layer parameters change on each step, directly changing the distribution of inputs to the next layer. That means in each iteration, the model has a burden to learn not only the patterns, but the underlying distribution, which will slow the training and even cause instability problems.

In order for deep networks to overcome this problem of overlearning, layer normalization does an amazing job of standardizing the inputs with zero mean and unit variance.

Layer normalization, unlike Batch normalization, is often simple since it is applied to each sample and computed based on sample statistics rather than batch statistics. This is better when it comes to individual scaling of each feature vector, where each vector must not be diluted, which happens with batch normalization.

In the original paper, layer normalization is applied after the attention and feed-forward blocks, but this can be different when it comes to different models, however, we are dealing with the architecture as per the original paper.

After the Multi-Head Attention sub-layer: The output of the self-attention mechanism is passed through a residual connection and then Layer Normalization. output = LayerNorm(x + MultiHeadAttention(x))
After the Feed-Forward Network (FFN) sub-layer: The output of the FFN is also passed through a residual connection and then Layer Normalization. output = LayerNorm(intermediate_output + FeedForward(intermediate_output))

Let's see how layer normalization can be computed. Consider the earlier example of the sentence "the cat sat on the mat", suppose it is being passed through the attention to get the output representation matrix given below,

\[ R = \begin{pmatrix} -\mathbf{x}_1\ (\text{representation for "the"}) \\ -\mathbf{x}_2\ (\text{representation for "cat"}) \\ -\mathbf{x}_3\ (\text{representation for "sat"}) \\ -\mathbf{x}_4\ (\text{representation for "on"}) \\ -\mathbf{x}_5\ (\text{representation for "the"}) \\ -\mathbf{x}_6\ (\text{representation for "mat"}) \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1d} \\ x_{21} & x_{22} & \dots & x_{2d} \\ x_{31} & x_{32} & \dots & x_{3d} \\ x_{41} & x_{42} & \dots & x_{4d} \\ x_{51} & x_{52} & \dots & x_{5d} \\ x_{61} & x_{62} & \dots & x_{6d} \end{pmatrix} \]

Where \(\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{id}]\) is the representation vector for `i`-th token in the sentence.

Layer normalization is applied independently to each row `x_i` of this matrix `R`. The normalization happens across the dimension or `d_{model}` of the specific token representation vector.

Alright, let's take the first token representation vector \(\mathbf{x}_1 = [x_{11}, x_{12}, \dots, x_{1d}]\) for example. The first step is to calculate the mean \((\mu_1) \text{ for } \mathbf{x}_1\),

\[\mu_1 = \frac{1}{d} \sum_{j=1}^{d} x_{1j}\]

Where `\mu_1` is a scalar value representing the average activation on the representation vector `\mathbf{x}_1`. The second step is to compute the variance `\sigma_1^2` across `d` for `\mathbf{x}_1`,

\[\sigma_1^2 = \frac{1}{d} \sum_{j=1}^{d} (x_{1j} - \mu_1)^2\]

`sigma_1^2` is also a scalar value representing the variance of activation on the representation vector `\mathbf{x}_1`. The third step is to normalize each of the individual elements `x_{1j}` in the vector `\mathbf{x}_1` using `mu_1` and `sigma_1^2`.

\[\hat{x}_{1j} = \frac{x_{1j} - \mu_1}{\sqrt{\sigma_1^2 + \epsilon}}\]

Where `\epsilon` is a small constant for numerical stability. After the normalization is performed, we will get a resulting vector \(\mathbf{\hat{x}}_1 = [\hat{x}_{11}, \hat{x}_{12}, \dots, \hat{x}_{1d}]\). This vector has a mean and variance of approximately 0 and 1, respectively.

That's the whole computation required, but with less flexibility, what I meant is that when we standardize the vectors, we are compromising the representation flexibility that the model can derive from the actual distribution. Luckily, there is a clever way to solve this problem, yes, it is to make the model decide what it needs from the distribution. This is possible by introducing two learnable parameters, `\gamma` (also known as gain) and `\beta` (also known as bias).

These parameters are learned during training and allow the network to scale and shift the normalized activations if it wants to learn from the distribution. This makes sure that the normalization does not completely restrict the representation flexibility.

Here is how the normalized vector `\hat{x}_{1j}` is scaled and shifted using the gain vector `\gamma = [\gamma_1, \gamma_2, \gamma_n]` and bias vector `\beta = [\beta_1, \beta_2, ..., \beta_n]` respectively.

\[y_{1j} = \gamma_j \hat{x}_{1j} + \beta_j \quad \text{for} \quad j=1,\dots,d\]

We need to repeat this for all the representation vectors,

For \(x_2\) ("cat"): Calculate its own \(\mu_2\) and \(\sigma^2_2\), then normalize to get \(\hat{x}_2\), then scale and shift using the same \(\gamma\) and \(\beta\) to get \(y_2\).
For \(x_3\) ("sat"): Calculate its own \(\mu_3\) and \(\sigma^2_3\), then normalize to get \(\hat{x}_3\), then scale and shift using the same \(\gamma\) and \(\beta\) to get \(y_3\).
And so on for \(x_4\), \(x_5\), \(x_6\).

The final output of the Layer Normalization will be a matrix,

\[Y = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \\ \mathbf{y}_4 \\ \mathbf{y}_5 \\ \mathbf{y}_6 \end{pmatrix} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1d} \\ y_{21} & y_{22} & \cdots & y_{2d} \\ y_{31} & y_{32} & \cdots & y_{3d} \\ y_{41} & y_{42} & \cdots & y_{4d} \\ y_{51} & y_{52} & \cdots & y_{5d} \\ y_{61} & y_{62} & \cdots & y_{6d} \end{pmatrix}\]

2.6.2 Residual Connections

Residual Connections, also known as Skip Connections, are as simple as bypassing the inputs between one or more layers of the network to enhance training stability. As we stack larger and deeper networks, it becomes incredibly challenging to train them because, at some point, the network saturates and starts to deteriorate. This is due to the Vanishing Gradient Problem.

The interesting fact about Residual learning is that it even alters the learning objective of the model. When considering input `x` and the output function `H(x)`, usually the model learns to map the input directly to the output by learning a representation, but in residual learning, the model learn a residual function \(F(x)\) such that

\[H(x) = F(x) + x.\]

In other words, instead of forcing the layers to approximate the full mapping \(H(x)\) from scratch, the network only needs to learn the “difference” (the residual) between the desired output and its input. Concretely, each block computes

\[y = F(x) + x,\]

where \(F(x)\) is typically a small stack of convolutions, batch norms, and non-linearities like the transformer sub-blocks.

Layer Normalization

You might wonder if adding residual connections just increases complexity and computation. Things make sense when viewed from a training perspective, though it actually simplifies things: the skip pathways carry both past and current gradient signals straight back to earlier layers. This direct flow of information helps prevent vanishing gradients, since the earlier gradients reinforce the later ones rather than fading away.

2.7 Masking

Another important concept specifically for transformer models is the application of several masks. Mainly, the transformer uses two main masks called the padding and causal masks. Let's discuss them!

2.7.1 Padding Masks

Transformer models operate with specific sequence lengths, but the inputs we provide to the model may not always be of the same length. Most of the time, our inputs are shorter than the model's maximum sequence length. To accommodate these varying sequence lengths, we add padding tokens to preserve the original supported sequence length of the model.

Interestingly, the model effectively ignores these padding tokens, and these are done using Padding masks, which ensures that the model is not attending to padding tokens during training or inference.

Padding masks are typically applied in the encoder and within the encoder-decoder attention. The first step is to add padding tokens to the input sequence,

\[[\text{cat},\ \text{sat},\ \text{on},\ \text{mat},\ \langle\text{pad}\rangle,\ \langle\text{pad}\rangle,\ \ldots]\]

Then a corresponding padding mask is generated that looks like this,

\[\text{Padding Mask:}\ [1,\ 1,\ 1,\ 1,\ 0,\ 0,\ \ldots]\]

Where zero represents tokens not to attend.

Based on the padding mask, we will apply `-\infty` to the pad tokens during the computation of the attention scores, which occurs before applying the softmax function. This is done because the softmax function will squish `-\infty` values down to zero, ensuring that these pad tokens are not attended to during the training of the model.

Padding Mask Demo

But in practice, we don't usually apply `-infty` to the padding tokens, but a really, really smaller negative value like `-1e9` for better numerical stability during training.

Padding masks are used not only in attention computations but also during loss calculation to ensure the loss does not contribute to padding tokens.

Below is the implementation of src and tgt padding masks,

def _prepare_masks(self, seq_batch: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Prepare masks for source and target sequences.

    Args:
        seq_batch: Source or target Batch tensor

    Returns:
        Padding mask tensor for source and target
    """
    # Creating the source padding mask, must be of the shape: (batch_size, 1, 1, seq_len)
    # This produces a mask where the padding tokens are set to False and the rest are set to True
    mask = (seq_batch != self.config.pad_token_id).unsqueeze(1).unsqueeze(2)

    return mask

2.7.2 Causal Mask (Look-Ahead Mask or Future Mask)

The Causal Mask is not a choice, but a necessity in transformer models. As the name suggests, this is used for masking in causal or autoregressive settings, where the model should not be allowed to look at future tokens while predicting the current one.

Our goal in translation or text generation is to predict the next probable token in a sequence in an autoregressive manner. The key logic here is that the decoder should not have access to future tokens. If it did, it could easily "cheat" by viewing the answer before predicting it. It is like time-traveling to the future and knowing what happens there in order to take action in the present.

In order to prevent this from happening, we apply a causal mask on the upper right triangular part of the pre-softmax attention score matrix, which again gives the name Upper-Triangular Mask. So when this mask is applied and passed through the softmax, the attention between any token at position `i` and any token at a future position `j` (where `j > i`) is blocked (attention weight is effectively zero). In other words, each token’s query can only “see” keys from itself and all preceding positions, never from positions to its right. This guarantees strictly left-to-right information flow, ensuring the model can only attend to past and current tokens when predicting the next one.

Causal Mask Demo

When implementing the causal mask before passing it to the model, we typically use a True or False representation, where True indicates masking and False indicates no masking, as shown in the code below.

def generate_square_subsequent_mask(size: int, device: torch.device = None) -> torch.Tensor:
    """
    Generate square mask for future positions in decoder self-attention.
    
    Args:
        size: Sequence length
        device: Device to create mask on (CPU/GPU)
        
    Returns:
        Square mask tensor of shape [size, size] where:
        - False values indicate positions to attend to
        - True values indicate positions to mask (future positions)
    
    Example:
        For size=4, generates mask:
        [[False,  True,  True,  True],
         [False, False,  True,  True],
         [False, False, False,  True],
         [False, False, False, False]]
    """
    mask = torch.triu(torch.ones(size, size), diagonal=1)
    mask = mask.bool()
    if device:
        mask = mask.to(device)
    return mask

2.8 Position-wise Feed-Forward Networks (MLP)

Even though attention blends information together, the Multi-layer Perceptron (MLP) is actually doing the heavy lifting by learning complex non-linear patterns in the sequence. Even though this part is simple to understand, the complex relations, patterns, and facts live here, and most of the parameters of the model reside here.

If you have prior knowledge of Deep Learning, MLP is the fundamental Neural Network architecture you might be familiar with, also known as dense layers, which contain an interconnected network of neurons. In Transformers, this part plays a very pivotal role. If you want to know how MLPs work at a basic level, this article might help: Multi-Layer Perceptron Explained

Formaly, the forward pass through the MLP on each layer can be written as,

\[\begin{align*} \mathbf{a}^{(0)} &= \mathbf{x} \\ \mathbf{z}^{(1)} &= \mathbf{W}^{(1)} \mathbf{a}^{(0)} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(1)} &= f^{(1)}(\mathbf{z}^{(1)}) \\ \mathbf{z}^{(2)} &= \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)} \\ \mathbf{a}^{(2)} &= f^{(2)}(\mathbf{z}^{(2)}) \\ &\vdots \\ \mathbf{z}^{(L)} &= \mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)} \\ \mathbf{Y} &= f^{(L)}(\mathbf{z}^{(L)}) \end{align*}\]

where `\mathbf{a}^{(0)` represents the input feature vector `x`,
` \mathbf{z}^{(L)} &= \mathbf{W}^{(L)} \mathbf{a}^{(L-1)} + \mathbf{b}^{(L)}` represents the core calculation for any given layer `l`
`\mathbf{z}^{(L)}` represents the pre-activation output vector produced by the linear combination of weights and inputs plus the bias,
`\mathbf{W}^{(L)}` represents the weight matrix connecting the layer `l - 1` to layer `l` contains the learnable parameters.
`\mathbf{a}^{(L-1)}` represents the activation vector from the previous layer `l - 1` . This serves as the input to the next layer.
`\mathbf{b}^{(L)}` represents the bias vector at layer `l`.

This is the mathematical representation of a multilayer perceptron, or MLP, that you have seen as an interconnected network in images and visualizations.

Now, the next question to ask is why it is called Position-wise? We learned that attention blends information together that is related. Now we need something to decouple this knowledge and reason about what each individual token means and what it does in general (for example, the words like "Soccer" have a definitive meaning), so the MLP takes each token, enriched by attention, and processes them position-wise individually. You can think of Attention doing group work while MLP is doing in-depth individual work.

\[\text{MLP}(\mathbf{X}) = \left[ \text{MLP}(\mathbf{x}_1), \text{MLP}(\mathbf{x}_2), \dots, \text{MLP}(\mathbf{x}_n) \right]\]

Each \(\mathbf{x}_i \in \mathbb{R}^{d_{\text{model}}}\) is passed through the same two-layer feedforward network:

\[\text{MLP}(\mathbf{x}_i) = \mathbf{W}_2 \cdot \phi(\mathbf{W}_1 \cdot \mathbf{x}_i + \mathbf{b}_1) + \mathbf{b}_2\]

The weights \(\mathbf{W}_1, \mathbf{W}_2\) and biases \(\mathbf{b}_1, \mathbf{b}_2\) are shared across all positions.

So it doesn’t matter whether \(\mathbf{x}_i\) is the 1st or the 50th token, it’s processed independently and identically.

One more question might arise, like, are we passing each token one after the other to the MLP? Theoretically, yes, but in practice, we pass all the tokens in the sequence in parallel using batch matrix multiplication using the shared weight matrix, which is very efficient on GPUs.

Here is a visual way to understand it,

Multi-Layer Perceptron, Shared Weights Visualization

Here is the PyTorch code for Position-wise FFN or MLP in Transformers.

class FeedForward(nn.Module):
    """Feed forward module with two linear layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize feed forward network.
        
        Args:
            config: Model configuration


        The normal Feed-Forward or Multi-Layer Perceptron layer.
        This layer learns the meaning of words based on the context of the sentence.
        It is in this layer, the transformer will hold memory of relationship between different entities
        different ways of representing the same word in different contexts, etc.
        """
        super(FeedForward, self).__init__()
        
        self.linear1 = nn.Linear(config.d_model, config.d_ff)
        self.linear2 = nn.Linear(config.d_ff, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of feed forward network.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        x = F.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x

2.9 Encoder Internals

The internals of the encoder club's source embedding, Multi-head Self-Attention, Layer Normalization, Masking, & the Multi-Layer Perceptron.

Encoder Internals Architecture

A stack of encoder layers is usually implemented, ie, the same thing you are seeing above is stacked on top of each other, layers of abstract relations are learned using this method.

2.9.1 Encoder Forward Pass

The first step in the encoder forward pass is to take the source token embeddings added with the positional encoding as input, which can be formally represented as,

\[ h_t^{(0)} = W_e^T x_t + \text{PE}(t) \text{ for } t = 1, \dots, n\]

In this context, `x_t` represents the one-hot encoding for the `t_{th}` source token. There can be some confusion here, as we previously mentioned that we are not generating any one-hot vectors. Conceptually, the one-hot vectors are used when performing the weighted sum `W_e^T x_t`. However, libraries like PyTorch abstract this process. Instead of creating full one-hot vectors, they utilize the token indices directly.

Since `h_t^{(0)}` represents the output at the `t_{th}` position, we can represent the overall yielded output as `H^{(0)} \in \mathbb{R}^{n \times d_{\text{model}}}`,

We said that usually we stack multiple encoders and decoders. Let us denote the output of the current layer \( l \) as \( \mathbf{H}^{(l)} \), and the output of the previous layer \( l - 1 \) as \( \mathbf{H}^{(l-1)} \).

For a stack of \( N \) layers, where \( l = 1, \dots, N \), we transform \( \mathbf{H}^{(l-1)} \rightarrow \mathbf{H}^{(l)} \) using two main sublayers, the Multi-head Self-Attention and Feed Forward Network.

In Multi-head Self-Attention, we compute \(Q = H^{(l-1)}W^Q, K = H^{(l-1)}W^K, V = H^{(l-1)}W^V\) (each \(\in \mathbb{R}^{n \times d_k}\) or \(\mathbb{R}^{n \times d_v}\)), then apply attention and heads:
\[A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V, \quad \text{MultiHead}(H^{(l-1)}) =[\text{head}_1; \dots; \text{head}_h]W^O.\]

Add residual: \(X = \text{LayerNorm}(H^{(l-1)} + \text{MultiHead}(H^{(l-1)}))\).

On the Feed Forward Layer, also called MLP, we compute independently at each position \(t\), apply a two-layer MLP with ReLU:

\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2,\]

where \(W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\) (for example \(d_{\text{ff}} = 2048\) if \(d_{\text{model}} = 512\)). Then add residual and normalize:

\[H^{(l)} = \text{LayerNorm}(X + \text{FFN}(X)).\]

Now, after `N` layers, the output from the whole encoder stack `H^{(N)}` will be produced, which serves as the memory for the decoders' Self-Attention.

Encoder PyTorch Implementation:

  class Encoder(nn.Module):
    """Transformer encoder with multiple encoder layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize encoder.
        
        Args:
            config: Model configuration
        """
        super(Encoder, self).__init__()
        
        self.embedding = nn.Embedding(config.src_vocab_size, config.d_model, padding_idx=config.pad_token_id)
        self.pos_encoding = PositionalEncoding(config)
        
        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.num_encoder_layers)])
        self.norm = nn.LayerNorm(config.d_model)
        self.pad_token_id = config.pad_token_id
        
    def forward(self,
               src: torch.Tensor,
               src_padding_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of encoder.
        
        Args:
            src: Source tensor (batch_size, src_seq_len)
            src_mask: Source mask for masking padding tokens
            
        Returns:
            Encoder output tensor (batch_size, src_seq_len, d_model)
        """
        
        # Embed tokens and add positional encoding
        x = self.embedding(src) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        # Apply encoder layers
        for layer in self.layers:
            x = layer(x, src_padding_mask)
        
        x = self.norm(x)
        return x

2.10 Decoder Internals

Similar to the encoder, the decoder layer consists of the target embedding with positional encoding, Masked Multi-Head Self-Attention, Layer Normalization, an Encoder-Decoder attention mask, and finally, the Multi-Layer Perceptron.

Decoder Internals Architecture

The decoder works a lot like the encoder, but adds two special attention steps. First, it uses masked self‑attention as we discussed earlier, so it only considers the words it has already generated and never peeks ahead. Next, it applies encoder‑decoder attention to pull in the most relevant parts of the encoder’s “memory” of the original sentence. After all the attention layers, the decoder passes its final output through a simple linear layer and a softmax, which turns the result into probabilities for each word in the vocabulary. Finally, it picks the word with the highest probability as the next word in the translation.

2.10.1 Decoder Forward Pass

The first step in the decoder forward pass is to take the target token embeddings (using a "shifted right" version of the target sequence) added with positional encoding as input, which can be formally represented as,

\[ g_t^{(0)} = W_e^T y_t + \text{PE}(t) \text{ for } t = 1, \dots, m\]

In this context, \(y_t\) represents the \(t\)-th target token. Since \(g_t^{(0)}\) represents the output at the \(t\)-th position, we can represent the overall yielded output as \(\mathbf{G}^{(0)} \in \mathbb{R}^{m \times d_{\text{model}}}\).

Let us denote the output of the current decoder layer \( l \) as \( \mathbf{G}^{(l)} \), and the output of the previous layer \( l - 1 \) as \( \mathbf{G}^{(l-1)} \).

For a stack of \( N \) layers, where \( l = 1, \dots, N \), we transform \( \mathbf{G}^{(l-1)} \rightarrow \mathbf{G}^{(l)} \) using three main sublayers.

In the Masked Multi-head Self-Attention sublayer, we compute \(Q = \mathbf{G}^{(l-1)}W^Q, K = \mathbf{G}^{(l-1)}W^K, V = \mathbf{G}^{(l-1)}W^V\). To maintain the auto-regressive property and prevent peeking, the Look-Ahead mask \(M\) is added to the scaled dot-product scores before the softmax.

\[A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V, \quad \text{MultiHead}(\mathbf{G}^{(l-1)}) =[\text{head}_1; \dots; \text{head}_h]W^O.\]

Add residual: \(X = \text{LayerNorm}(\mathbf{G}^{(l-1)} + \text{MultiHead}_{\text{self}}(\mathbf{G}^{(l-1)}))\).

The second sublayer is the Encoder-Decoder Cross-Attention. Here, queries are generated from the previous sublayer's output, \(X\), while keys and values are generated from the encoder stack's final output, \(\mathbf{H}^{(N)}\). We compute \(Q = XW^Q, K = \mathbf{H}^{(N)}W^K, V = \mathbf{H}^{(N)}W^V\). This allows the decoder to incorporate information from the input sequence. Then add residual and normalize:

\[Y = \text{LayerNorm}(X + \text{MultiHead}_{\text{cross}}(X, \mathbf{H}^{(N)})).\]

On the Feed Forward Layer, we compute independently at each position of \(Y\), applying a two-layer MLP with ReLU:

\[\text{FFN}(y) = \max(0, yW_1 + b_1)W_2 + b_2,\]

where the weight matrices are identical to those in the encoder's FFN. Then add residual and normalize to get the final output for the layer:

\[\mathbf{G}^{(l)} = \text{LayerNorm}(Y + \text{FFN}(Y)).\]

Now, after `N` layers, the output from the whole decoder stack \(\mathbf{G}^{(N)}\) will be produced. This output is then passed through a final linear layer and a softmax function to generate a probability distribution over the vocabulary for the next token.

Decoder PyTorch Implementation:

   class Decoder(nn.Module):
    """Transformer decoder with multiple decoder layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize decoder.
        
        Args:
            config: Model configuration
        """
        super(Decoder, self).__init__()
        
        self.embedding = nn.Embedding(config.tgt_vocab_size, config.d_model, padding_idx=config.pad_token_id)
        self.pos_encoding = PositionalEncoding(config)
        
        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_decoder_layers)])
        self.norm = nn.LayerNorm(config.d_model)
        self.pad_token_id = config.pad_token_id
        
    def forward(self,
               tgt: torch.Tensor,
               memory: torch.Tensor,
               tgt_padding_mask: torch.Tensor,
               future_mask: Optional[torch.Tensor],
               memory_padding_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of decoder.
        
        Args:
            tgt: Target tensor (batch_size, tgt_seq_len)
            memory: Output from encoder (batch_size, src_seq_len, d_model)
            tgt_mask: Target mask for masking future positions
            memory_mask: Memory mask for masking padding tokens in encoder output
            
        Returns:
            Decoder output tensor (batch_size, tgt_seq_len, d_model)
        """

        if future_mask is not None:
            combined_mask = self._combine_mask(tgt_padding_mask, future_mask)
        else:
            combined_mask = tgt_padding_mask

        # Embed tokens and add positional encoding
        x = self.embedding(tgt) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        
        # Apply decoder layers
        for layer in self.layers:
            x = layer(x, memory, combined_mask, memory_padding_mask)
        
        x = self.norm(x)
        return x

    def _combine_mask(self, tgt_padding_mask: torch.Tensor, future_mask: torch.Tensor) -> torch.Tensor:
        """
        Combine padding and future masks for decoder self-attention.

        This will help us to only pass a single mask to the decoder self-attention layer reducing
        the overhead of passing multiple masks.
        
        Args:
            tgt_padding_mask: [batch_size, 1, 1, seq_len]
            future_mask: [seq_len, seq_len]
            
        Returns:
            Combined mask [batch_size, 1, seq_len, seq_len]
        """
        batch_size = tgt_padding_mask.size(0)
        seq_len = tgt_padding_mask.size(-1)
        
        padding_mask = tgt_padding_mask.squeeze(2) # [batch_size, 1, seq_len]
        padding_mask = padding_mask.unsqueeze(-1)  # [batch_size, 1, 1, seq_len]
        padding_mask = padding_mask.expand(-1, -1, seq_len, -1)  # [batch_size, 1, seq_len, seq_len]
        
        # 2. Prepare future mask with proper broadcasting dimensions
        future_mask = future_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
        future_mask = future_mask.expand(batch_size, 1, -1, -1)  # [batch_size, 1, seq_len, seq_len]
        
        # 3. Combine masks using logical AND
        # Both masks: False means "mask this position", True means "keep this position"
        combined_mask = padding_mask & ~future_mask
        
        return combined_mask

2.11 Auto-Regressive Decoding

Auto-regressive decoding is not exclusive to Transformer models. In fact, it is not a part of any language model architecture, but an output generation tactic that is most commonly used in language modeling tasks, where the decoder model generates one token at a time, conditioned on all previous tokens. Mathematically, it can be represented by,

\[P(y_1, \dots, y_m|x) = \prod_{t=1}^{m} P(y_t|y_1, y_2, \dots, y_{t-1}, x)\]

Let's concentrate on the right-hand side, \(P(y_t|y_1, y_2, \dots, y_{t-1}, x)\). Intuitively, what will be the probability of getting `y_t` given all the previous history, given the prompt `x`? To compute this probability, we have to compute the probability of getting all the sequences \(y_1, \dots, y_t\) given `x` divided by the probability of the previous history \(y_1, \dots, y_{t-1}\) given `x`. This gives the ratio or fraction of `y_t` among all ways we could generate the first `t-1` tokens given `x`.

The transformer decoder is autoregressive in nature, where we feed into the decoder the first `y_{t-1}` values, and it predicts `y_t` based on this conditional generation. During training, as we discussed earlier in section 2.2.2, the model uses Teacher Forcing.

2.12 The Entire Transformer Model Architecture Code

# The transformer Encoder-Decoder Architecture 

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional, Dict, Tuple

class TransformerConfig:
    """Configuration class to store hyperparameters for the transformer model."""
    
    def __init__(self,
                 src_vocab_size: int = 32000,
                 tgt_vocab_size: int = 32000,
                 d_model: int = 256,
                 num_heads: int = 8,
                 num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6,
                 d_ff: int = 1024,
                 dropout: float = 0.2,
                 max_seq_length: int = 128,
                 pad_token_id: int = 0,
                 shared_embeddings: bool = False):
        """
        Initialize the configuration.
        
        Args:
            src_vocab_size: Size of source vocabulary
            tgt_vocab_size: Size of target vocabulary
            d_model: Dimension of model embeddings
            num_heads: Number of attention heads
            num_encoder_layers: Number of encoder layers
            num_decoder_layers: Number of decoder layers
            d_ff: Dimension of feed forward network
            dropout: Dropout rate
            max_seq_length: Maximum sequence length
            pad_token_id: Padding token ID
            shared_embeddings: Whether to share embeddings between encoder and decoder
        """
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size
        self.d_model = d_model
        self.num_heads = num_heads
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.d_ff = d_ff
        self.dropout = dropout
        self.max_seq_length = max_seq_length
        self.pad_token_id = pad_token_id
        self.shared_embeddings = shared_embeddings


def scaled_dot_product_attention(query: torch.Tensor,
                                key: torch.Tensor,
                                value: torch.Tensor,
                                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Compute scaled dot product attention.
    
    Args:
        query: Query tensor (batch_size, num_heads, seq_len, head_dim)
        key: Key tensor (batch_size, num_heads, seq_len, head_dim)
        value: Value tensor (batch_size, num_heads, seq_len, head_dim)
        mask: Optional mask tensor (batch_size, 1, seq_len, seq_len) or (batch_size, 1, 1, seq_len)
        
    Returns:
        Output tensor after attention (batch_size, num_heads, seq_len, head_dim)
    """
    # Get dimensions
    head_dim = query.size(-1)
    
    # Compute scaled dot product
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(head_dim)

    # Apply mask if provided, Issue in masking:
    if mask is not None:

        # Using a smaller value to mask out the scores for better numerical stability.

        # NOTE: We are inverting the masks here, this is because of the way masks are created.
        # We are using True for padding tokens and False for non-padding tokens.
        # So we need to invert the mask to get the correct masking.
        scores = scores.masked_fill(~mask, -1e+30 if scores.dtype == torch.float32 else -1e+4)
    
    # Apply softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Compute final output
    output = torch.matmul(attention_weights, value)
    
    return output

class MultiHeadAttention(nn.Module):
    """Multi-head attention module."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize multi-head attention.
        
        Args:
            config: Model configuration

        The multihead-attention is the core module where we split the tasks between multiple heads
        and then combine them back to the original dimension. The query, key, and value are linearly projected
        to the required dimensions and then split into multiple heads. The scaled dot product attention is applied
        to each head and then combined back to the original dimension.
        """
        super(MultiHeadAttention, self).__init__()
        
        self.d_model = config.d_model # 512
        self.num_heads = config.num_heads # 8
        assert self.d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        
        # The dimension of each head will be 512 / 8 = 64
        self.head_dim = self.d_model // self.num_heads
        
        # Linear projections

        # These are the projection weights for the query, key, and value
        self.q_proj = nn.Linear(self.d_model, self.d_model) # Query projection
        self.k_proj = nn.Linear(self.d_model, self.d_model)
        self.v_proj = nn.Linear(self.d_model, self.d_model)
        
        # The output projection that takes the concatenated heads and projects them back to the original dimension
        self.output_proj = nn.Linear(self.d_model, self.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self,
               query: torch.Tensor,
               key: torch.Tensor,
               value: torch.Tensor,
               mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Forward pass of multi-head attention.
        
        Args:
            query: Query tensor (batch_size, seq_len, d_model)
            key: Key tensor (batch_size, seq_len, d_model)
            value: Value tensor (batch_size, seq_len, d_model)
            mask: Optional mask tensor
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        batch_size = query.size(0)

        # Linear Projections and Splitting into multiple heads

        # Each tensor will have the shape of (batch_size, num_heads, seq_len, head_dim)
        # Eg: (32, 8, 256, 64), where 64 is the head_dim

        q = self.q_proj(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Apply scaled dot product attention
        attn_output = scaled_dot_product_attention(q, k, v, mask)
        
        # Combine heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Perform the output projection to maintain the original dimension
        output = self.output_proj(attn_output)
        
        return output


class PositionalEncoding(nn.Module):
    """Positional encoding module using sine and cosine functions."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize positional encoding.
        
        Args:
            config: Model configuration

        The Sin Cos postional encoding is a clever way to encode the position of the words in the sentence
        with a continuous function for better gradient flow, instead of using a one-hot encoding.

        It works by creating a matrix of shape (max_seq_length, d_model) and then applying a sine function to the even indices
        and a cosine function to the odd indices. It will create continious unique indices for each word in the sentence.
        This matrix is then added to the input embeddings.
        """
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(config.max_seq_length, config.d_model)
        position = torch.arange(0, config.max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, config.d_model, 2).float() * (-math.log(10000.0) / config.d_model))
        
        # Apply sine to even indices and cosine to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension and register as buffer (not a parameter)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add positional encoding to input.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            
        Returns:
            Output tensor with positional encoding added
        """
        # Add positional encoding
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


class FeedForward(nn.Module):
    """Feed forward module with two linear layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize feed forward network.
        
        Args:
            config: Model configuration


        The normal Feed-Forward or Multi-Layer Perceptron layer.
        This layer learns the meaning of words based on the context of the sentence.
        It is in this layer, the transformer will hold memory of relationship between different entities
        different ways of representing the same word in different contexts, etc.
        """
        super(FeedForward, self).__init__()
        
        self.linear1 = nn.Linear(config.d_model, config.d_ff)
        self.linear2 = nn.Linear(config.d_ff, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of feed forward network.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        x = F.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x


class EncoderLayer(nn.Module):
    """Single encoder layer with self-attention and feed forward network."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize encoder layer.
        
        Args:
            config: Model configuration
        """
        super(EncoderLayer, self).__init__()
        
        self.self_attn = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
        self.norm1 = nn.LayerNorm(config.d_model)
        self.norm2 = nn.LayerNorm(config.d_model)
        
        self.dropout1 = nn.Dropout(config.dropout)
        self.dropout2 = nn.Dropout(config.dropout)

        self.pad_token_id = config.pad_token_id
        
    def forward(self, 
                x: torch.Tensor,
                src_padding_mask: torch.Tensor) -> torch.Tensor:
        
        """
        Forward pass of encoder layer.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            src_mask: Optional source mask for masking future positions
            src_padding_mask: Optional padding mask for masking padding tokens
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
    
        # Self attention block with residual connection and layer norm
        # For the encoder, we will pass the padding mask to the self attention layer
        # The padding mask will be of the shape: (batch_size, 1, 1, seq_len)
        attn_output = self.self_attn(x, x, x, src_padding_mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Feed forward block with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))
        
        return x



class DecoderLayer(nn.Module):
    """Single decoder layer with self-attention, cross-attention and feed forward network."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize decoder layer.
        
        Args:
            config: Model configuration
        """
        super(DecoderLayer, self).__init__()
        
        self.self_attn = MultiHeadAttention(config)
        self.cross_attn = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
        self.norm1 = nn.LayerNorm(config.d_model)
        self.norm2 = nn.LayerNorm(config.d_model)
        self.norm3 = nn.LayerNorm(config.d_model)
        
        self.dropout1 = nn.Dropout(config.dropout)
        self.dropout2 = nn.Dropout(config.dropout)
        self.dropout3 = nn.Dropout(config.dropout)
        
    def forward(self,
               x: torch.Tensor,
               memory: torch.Tensor,
               combined_mask: torch.Tensor,
               memory_padding_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of decoder layer.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            memory: Output from encoder (batch_size, src_seq_len, d_model)
            tgt_padding_mask: Target padding mask for masking padding tokens
            memory_padding_mask: Memory padding mask for masking padding tokens in encoder output
            future_mask: Mask for masking future positions
            
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """

        # Self attention block with residual connection and layer norm
        attn_output = self.self_attn(x, x, x, combined_mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Cross attention block with residual connection and layer norm
        attn_output = self.cross_attn(x, memory, memory, memory_padding_mask)
        x = self.norm2(x + self.dropout2(attn_output))
        
        # Feed forward block with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout3(ff_output))
        
        return x


class Encoder(nn.Module):
    """Transformer encoder with multiple encoder layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize encoder.
        
        Args:
            config: Model configuration
        """
        super(Encoder, self).__init__()
        
        self.embedding = nn.Embedding(config.src_vocab_size, config.d_model, padding_idx=config.pad_token_id)
        self.pos_encoding = PositionalEncoding(config)
        
        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.num_encoder_layers)])
        self.norm = nn.LayerNorm(config.d_model)
        self.pad_token_id = config.pad_token_id
        
    def forward(self,
               src: torch.Tensor,
               src_padding_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of encoder.
        
        Args:
            src: Source tensor (batch_size, src_seq_len)
            src_mask: Source mask for masking padding tokens
            
        Returns:
            Encoder output tensor (batch_size, src_seq_len, d_model)
        """
        
        # Embed tokens and add positional encoding
        x = self.embedding(src) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        # Apply encoder layers
        for layer in self.layers:
            x = layer(x, src_padding_mask)
        
        x = self.norm(x)
        return x
    

class Decoder(nn.Module):
    """Transformer decoder with multiple decoder layers."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize decoder.
        
        Args:
            config: Model configuration
        """
        super(Decoder, self).__init__()
        
        self.embedding = nn.Embedding(config.tgt_vocab_size, config.d_model, padding_idx=config.pad_token_id)
        self.pos_encoding = PositionalEncoding(config)
        
        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_decoder_layers)])
        self.norm = nn.LayerNorm(config.d_model)
        self.pad_token_id = config.pad_token_id
        
    def forward(self,
               tgt: torch.Tensor,
               memory: torch.Tensor,
               tgt_padding_mask: torch.Tensor,
               future_mask: Optional[torch.Tensor],
               memory_padding_mask: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of decoder.
        
        Args:
            tgt: Target tensor (batch_size, tgt_seq_len)
            memory: Output from encoder (batch_size, src_seq_len, d_model)
            tgt_mask: Target mask for masking future positions
            memory_mask: Memory mask for masking padding tokens in encoder output
            
        Returns:
            Decoder output tensor (batch_size, tgt_seq_len, d_model)
        """

        if future_mask is not None:
            combined_mask = self._combine_mask(tgt_padding_mask, future_mask)
        else:
            combined_mask = tgt_padding_mask

        # Embed tokens and add positional encoding
        x = self.embedding(tgt) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        
        # Apply decoder layers
        for layer in self.layers:
            x = layer(x, memory, combined_mask, memory_padding_mask)
        
        x = self.norm(x)
        return x

    def _combine_mask(self, tgt_padding_mask: torch.Tensor, future_mask: torch.Tensor) -> torch.Tensor:
        """
        Combine padding and future masks for decoder self-attention.

        This will help us to only pass a single mask to the decoder self-attention layer reducing
        the overhead of passing multiple masks.
        
        Args:
            tgt_padding_mask: [batch_size, 1, 1, seq_len]
            future_mask: [seq_len, seq_len]
            
        Returns:
            Combined mask [batch_size, 1, seq_len, seq_len]
        """
        batch_size = tgt_padding_mask.size(0)
        seq_len = tgt_padding_mask.size(-1)
        
        padding_mask = tgt_padding_mask.squeeze(2) # [batch_size, 1, seq_len]
        padding_mask = padding_mask.unsqueeze(-1)  # [batch_size, 1, 1, seq_len]
        padding_mask = padding_mask.expand(-1, -1, seq_len, -1)  # [batch_size, 1, seq_len, seq_len]
        
        # 2. Prepare future mask with proper broadcasting dimensions
        future_mask = future_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]
        future_mask = future_mask.expand(batch_size, 1, -1, -1)  # [batch_size, 1, seq_len, seq_len]
        
        # 3. Combine masks using logical AND
        # Both masks: False means "mask this position", True means "keep this position"
        combined_mask = padding_mask & ~future_mask
        
        return combined_mask



class Transformer(nn.Module):
    """Transformer model with encoder and decoder."""
    
    def __init__(self, config: TransformerConfig):
        """
        Initialize transformer model.
        
        Args:
            config: Model configuration
        """
        super(Transformer, self).__init__()
        
        self.config = config
        
        # Create embedding layers or use shared embeddings

        # Shared embeddings are useful when the source and target vocabularies are of the same size
        # This will reduce the number of parameters in the model, thus reducing the memory footprint
        if config.shared_embeddings:
            assert config.src_vocab_size == config.tgt_vocab_size, "Vocab sizes must match for shared embeddings"
            self.encoder_embedding = nn.Embedding(config.src_vocab_size, config.d_model, padding_idx=config.pad_token_id)
            self.decoder_embedding = self.encoder_embedding
        else:
            self.encoder_embedding = None
            self.decoder_embedding = None
        
        # Create encoder and decoder
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)
        
        if config.shared_embeddings:
            self.encoder.embedding = self.encoder_embedding
            self.decoder.embedding = self.decoder_embedding
        
        # Output projection
        self.output_projection = nn.Linear(config.d_model, config.tgt_vocab_size)
        
        # Initialize parameters
        self._reset_parameters()
        
    def forward(self,
               src: torch.Tensor,
               tgt: torch.Tensor,
               future_mask: Optional[torch.Tensor] = None,
               ) -> torch.Tensor:
        """
        Forward pass of transformer model.
        
        Args:
            src: Source tensor (batch_size, src_seq_len)
            tgt: Target tensor (batch_size, tgt_seq_len)
            src_mask: Source mask for masking padding tokens
            tgt_mask: Target mask for masking future positions
            memory_mask: Memory mask for masking padding tokens in encoder output
            
        Returns:
            Output logits (batch_size, tgt_seq_len, tgt_vocab_size)
        """
        # Padding tokens are masked out in the source and targer sequences
        src_padding_mask = self._prepare_masks(src)
        tgt_padding_mask = self._prepare_masks(tgt)
        # Encode source: Encoder receives the source sequence and returns the output
        memory = self.encoder(
            src = src, 
            src_padding_mask = src_padding_mask
            )
        
        # Decode target: Decoder receives the target sequence, encoder output, and the future_mask
        output = self.decoder(
            tgt = tgt, 
            memory = memory, 
            tgt_padding_mask = tgt_padding_mask, 
            future_mask = future_mask,
            memory_padding_mask = src_padding_mask
        )
        
        # Project to vocabulary
        logits = self.output_projection(output)
        
        return logits
    
    def _prepare_masks(self, seq_batch: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Prepare masks for source and target sequences.
        
        Args:
            seq_batch: Source or target Batch tensor
            
        Returns:
            Padding mask tensor for source and target
        """
        # Creating the source padding mask, must be of the shape: (batch_size, 1, 1, seq_len)
        # This produces a mask where the padding tokens are set to False and the rest are set to True
        mask = (seq_batch != self.config.pad_token_id).unsqueeze(1).unsqueeze(2)
        
        return mask
    
    def _reset_parameters(self):
        """Initialize model parameters."""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)



def generate_square_subsequent_mask(size: int, device: torch.device = None) -> torch.Tensor:
    """
    Generate square mask for future positions in decoder self-attention.
    
    Args:
        size: Sequence length
        device: Device to create mask on (CPU/GPU)
        
    Returns:
        Square mask tensor of shape [size, size] where:
        - False values indicate positions to attend to
        - True values indicate positions to mask (future positions)
    
    Example:
        For size=4, generates mask:
        [[False,  True,  True,  True],
         [False, False,  True,  True],
         [False, False, False,  True],
         [False, False, False, False]]
    """
    mask = torch.triu(torch.ones(size, size), diagonal=1)
    mask = mask.bool()
    if device:
        mask = mask.to(device)
    return mask


def count_parameters(model: nn.Module) -> int:
    """
    Count the number of trainable parameters in the model.
    
    Args:
        model: PyTorch model
        
    Returns:
        Number of trainable parameters
    """
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# ==============================================================================
# Simple Inference / Demonstration
# ==============================================================================

if __name__ == '__main__':
    # 1. Setup device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # 2. Define special tokens and build vocabulary
    PAD_TOKEN = ""
    BOS_TOKEN = ""
    EOS_TOKEN = ""
    
    # Example sentences
    src_sentence_text = "cat sat on the mat"
    # For this demo, target is same as source (simple copy task)
    # In a real translation task, tgt_sentence_text would be the translation
    tgt_sentence_text = "cat sat on the mat" 

    # Build vocabulary using simple split() tokenization
    all_sentences_for_vocab = [src_sentence_text, tgt_sentence_text]
    special_tokens_list = [PAD_TOKEN, BOS_TOKEN, EOS_TOKEN]
    
    word_to_idx = {token: i for i, token in enumerate(special_tokens_list)}
    current_idx = len(special_tokens_list)

    for sentence in all_sentences_for_vocab:
        for word in sentence.split(): # Tokenize by splitting
            if word not in word_to_idx:
                word_to_idx[word] = current_idx
                current_idx += 1
    
    idx_to_word = {i: token for token, i in word_to_idx.items()}
    vocab_size = len(word_to_idx)
    PAD_ID = word_to_idx[PAD_TOKEN]

    print("\n--- Vocabulary ---")
    print(f"Vocabulary size: {vocab_size}")
    print(f"Word to ID mapping: {word_to_idx}")
    print(f"PAD ID: {PAD_ID}")

    # 3. Configure the Transformer
    # Using small dimensions for quick demonstration
    config = TransformerConfig(
        src_vocab_size=vocab_size,
        tgt_vocab_size=vocab_size,
        d_model=32,                 # Dimension of model embeddings
        num_heads=4,                # Number of attention heads
        num_encoder_layers=2,       # Number of encoder layers
        num_decoder_layers=2,       # Number of decoder layers
        d_ff=64,                    # Dimension of feed forward network
        dropout=0.0,                # Dropout rate (set to 0 for deterministic demo)
        max_seq_length=15,          # Maximum sequence length (must be >= actual seq lengths)
        pad_token_id=PAD_ID,
        shared_embeddings=True      # Share embeddings since src/tgt vocabs are the same here
    )
    print("\n--- Model Configuration ---")
    for key, value in config.__dict__.items():
        print(f"{key}: {value}")

    # 4. Prepare data
    # Source sentence processing: tokens + EOS + padding
    src_token_list = src_sentence_text.split()
    src_ids_temp = [word_to_idx[token] for token in src_token_list]
    src_ids_temp.append(word_to_idx[EOS_TOKEN]) # Add EOS to source
    
    # Pad to max_seq_length
    padding_needed_src = config.max_seq_length - len(src_ids_temp)
    if padding_needed_src < 0:
        raise ValueError(f"Source sentence too long for max_seq_length={config.max_seq_length}")
    src_padded_ids = src_ids_temp + [PAD_ID] * padding_needed_src
    src_tensor = torch.tensor([src_padded_ids], dtype=torch.long, device=device)

    # Target sentence processing (for decoder input): BOS + tokens + padding
    tgt_token_list = tgt_sentence_text.split()
    tgt_ids_input_temp = [word_to_idx[BOS_TOKEN]] # Start with BOS
    tgt_ids_input_temp.extend([word_to_idx[token] for token in tgt_token_list])
    
    padding_needed_tgt = config.max_seq_length - len(tgt_ids_input_temp)
    if padding_needed_tgt < 0:
        raise ValueError(f"Target sentence too long for max_seq_length={config.max_seq_length}")
    tgt_padded_ids_input = tgt_ids_input_temp + [PAD_ID] * padding_needed_tgt
    tgt_tensor_input = torch.tensor([tgt_padded_ids_input], dtype=torch.long, device=device)

    # Expected output sequence (for conceptual comparison): tokens + EOS + padding
    # This is what the model *should* predict at each step.
    expected_output_ids_temp = [word_to_idx[token] for token in tgt_token_list]
    expected_output_ids_temp.append(word_to_idx[EOS_TOKEN]) # Ends with EOS
    # (No explicit padding needed here for `expected_output_words` list, just for tensor comparison if any)


    # 5. Initialize model
    model = Transformer(config).to(device)
    model.eval() # Set to evaluation mode (e.g., disables dropout)

    # 6. Prepare masks
    # Padding masks for source and target are created internally by model.forward()
    # Future mask (causal mask) for decoder self-attention:
    tgt_seq_len = tgt_tensor_input.size(1) # This will be config.max_seq_length
    future_mask = generate_square_subsequent_mask(tgt_seq_len, device=device)

    # 7. Perform a forward pass
    # This simulates a single step of "teacher-forced" generation or a forward pass during training.
    # For true auto-regressive inference, you'd generate tokens one by one in a loop.
    print("\n--- Input Tensors ---")
    print(f"Source input tensor shape: {src_tensor.shape}")
    print(f"Source input tensor: {src_tensor}")
    print(f"Target input tensor (decoder input) shape: {tgt_tensor_input.shape}")
    print(f"Target input tensor (decoder input): {tgt_tensor_input}")
    # print(f"Future mask for target (sample, True means mask): \n{future_mask}")


    with torch.no_grad(): # No need to compute gradients for this demonstration
        logits = model(src=src_tensor, tgt=tgt_tensor_input, future_mask=future_mask)

    # 8. Interpret output
    print("\n--- Output ---")
    print(f"Logits shape: {logits.shape}") # Expected: (batch_size, tgt_seq_len, tgt_vocab_size)

    # Get predicted token IDs using greedy decoding (taking the argmax)
    predicted_ids = torch.argmax(logits, dim=-1) # Shape: (batch_size, tgt_seq_len)
    print(f"Predicted token IDs tensor (first batch): {predicted_ids[0]}")

    # Convert predicted IDs back to words
    predicted_words_list = [idx_to_word[idx.item()] for idx in predicted_ids[0]]
    
    print("\n--- Tokenized Inputs and Predicted Output ---")
    source_words_with_eos = src_token_list + [EOS_TOKEN]
    print(f"Source input: '{' '.join(source_words_with_eos)}'")
    print(f"   Padded IDs: {src_tensor[0].tolist()}")
    
    target_input_words = [BOS_TOKEN] + tgt_token_list
    print(f"Target input (decoder): '{' '.join(target_input_words)}'")
    print(f"   Padded IDs: {tgt_tensor_input[0].tolist()}")
    
    expected_output_words_list = tgt_token_list + [EOS_TOKEN]
    print(f"Expected output sequence: '{' '.join(expected_output_words_list)}'")
    # (The predicted sequence will include padding up to max_seq_length)
    
    print(f"Predicted output sequence (greedy): '{' '.join(predicted_words_list)}'")
    
    print("\nNote: The model is untrained, so the predicted output will likely be random.")
    print("This demonstration shows the data flow and shapes through the Transformer model.")

    # Count parameters
    num_params = count_parameters(model)
    print(f"\nTotal trainable parameters in the model: {num_params:,}")

    print("\n--- For actual auto-regressive generation (not implemented here): ---")
    print("1. Encode `src_tensor` once to get `memory` from the encoder.")
    print("2. Initialize `tgt_input` with `BOS_TOKEN` ID.")
    print("3. Loop for `max_seq_length` steps:")
    print("   a. Pass current `tgt_input` and `memory` to the decoder.")
    print("   b. Get logits for the *last* token position.")
    print("   c. Select the next token ID (e.g., argmax or sampling).")
    print("   d. If token is `EOS_TOKEN` or max length reached, stop.")
    print("   e. Append the new token ID to `tgt_input` and repeat.")

That's the entire Transformer Encoder Decoder model architecture. For training and inference code, the trained model, datasets, and tokenizers are available at the following links.

We are not discussing the training techniques here; The model is trained using torch DDP, you can refer to the source code to understand the training setup done in PyTorch, loss functions, etc. If you have any queries, please post them in the comment box below.

Note: Even though the pretrained model performed well on the translation, this model is only for study and experimental purposes, not suitable for commercial use.

3. Conclusion

We've taken a closer look at the Transformer Encoder Decoder architecture, explored core concepts like embeddings, Self-Attention, Layer Normalization, Positional Encoding, Masking, etc, including theory, math, and code. We also explored how to build a whole transformer in PyTorch from scratch, which can be trained for Neural Machine Translation

Transformers are cutting-edge architectures that currently dominate the field of Natural Language Processing (NLP). Understanding the fundamentals of how these models are built will enable you to implement them more quickly, experiment with different prototypes, create impressive use cases for NLP, and even contribute to new research breakthroughs. So, get ready to explore further!

Thanks For Reading!

Quark Machine Learning

Search Suggest