# Understanding Self-Attention in Transformers: An In-Depth Analysis


As we delve into the realm of transformers, it's essential to grasp the concept of attention, which lies at the heart of their functionality. This article aims to unpack the nuances of the attention mechanism, particularly self-attention, and its pivotal role in enhancing neural network performance.

In the world of artificial intelligence today, transformers have become a force to be reckoned with, not in the form of fictional robots but as advanced neural networks. The breakthrough behind their effectiveness is rooted in the principle of 'attention.' So, what does 'attention' signify in transformers? Let's explore.

## What are Transformers?

Transformers are specialized neural networks designed to understand context from data, much as we ourselves are seeking the meaning of the terms 'attention' and 'context' when discussing transformers.

## How Do Transformers Learn Context from Data?

They achieve this through the attention mechanism.

## What is the Attention Mechanism?

The attention mechanism enables the model to evaluate all components of a sequence at each stage, identifying which elements require focus. This method was introduced as a more flexible alternative to the rigid fixed-length vectors used in the encoder-decoder framework, offering a 'soft' solution that emphasizes relevant sections.

## What is Self-Attention?

Initially, the attention mechanism enhanced the capabilities of Recurrent Neural Networks (RNNs) and subsequently influenced Convolutional Neural Networks (CNNs). However, with the advent of transformer architecture in 2017, the reliance on RNNs and CNNs diminished significantly, primarily due to the innovation of self-attention.

The self-attention mechanism is unique as it integrates the context of the input sequence to refine the attention process, allowing for the capture of intricate linguistic nuances.

As an illustration: When I ask my young child about transformers, he responds with "robots and cars," reflecting the limited context available to him. In contrast, I associate transformers with neural networks due to my broader experience. This highlights how differing contexts lead to varying interpretations and solutions.

The term 'self' indicates that the attention mechanism analyzes the same input sequence it is processing.

Multiple methodologies exist for implementing self-attention, with the **scaled dot-product** approach being particularly prevalent. This technique was outlined in the seminal transformer paper of 2017, titled "Attention is All You Need."

## Where and How Does Self-Attention Feature in Transformers?

I envision the transformer architecture as comprising two layers: an outer layer and an inner layer.

- The outer layer encompasses the attention-weighting mechanism and the feed-forward layer, which I will detail further in this article.
- The inner layer comprises the self-attention mechanism and contributes to the attention-weighting function.

Let’s delve deeper into the self-attention mechanism and uncover its underlying mechanics. The **Query-Key module** and the **SoftMax** function are critical components of this technique.

This discussion draws inspiration from Prof. Tom Yeh’s enlightening AI by Hand Series on Self-Attention. (All images below, unless otherwise stated, are credited to Prof. Tom Yeh from the aforementioned LinkedIn post, used with his permission.)

## Self-Attention

To set the context, let's examine the process of **Attention-Weighting** in the outer layer of the transformer.

## Attention Weight Matrix (A)

The attention weight matrix **A** is generated by inputting the features into the Query-Key (QK) module. This matrix aims to identify the most pertinent sections of the input sequence. Self-attention becomes integral in creating the attention weight matrix **A** through the QK module.

## How Does the QK-Module Work?

Let's examine the various elements of Self-Attention: **Query (Q), Key (K),** and **Value (V)**.

I find the spotlight analogy particularly useful for visualizing how the model illuminates each element of the sequence to discern the most relevant parts. To further this analogy, let’s consider a grand stage set for a Macbeth production.

- **Query (Q)**: The lead actor steps onto the stage, asking, "Should I seize the crown?" This inquiry represents the Query, driving the narrative forward.
- **Key (K)**: The spotlight then shifts to other significant characters whose actions influence Macbeth’s decisions, akin to the Key, which reveals different story facets.
- **Value (V)**: Finally, these characters provide essential insights through their actions, representing Value, which guides Macbeth toward his choices.

This collaboration results in a memorable performance, ingrained in the minds of the captivated audience.

Now that we have explored the roles of **Q**, **K**, and **V** in the context of performance, let’s turn our attention back to the mathematical fundamentals of the **QK-module**. Our roadmap is as follows:

## The Process Begins

We start with a set of four feature vectors, each of dimension 6.

Our objective:
Transform the given features into **Attention Weighted Features**.

**Create Query, Key, Value Matrices**: We achieve this by multiplying the features with the linear transformation matrices **W_Q**, **W_K**, and **W_V** to derive query vectors (q1, q2, q3, q4), key vectors (k1, k2, k3, k4), and value vectors (v1, v2, v3, v4) respectively.

- To obtain **Q**, we multiply **W_Q** with **X**.
- For **K**, we multiply **W_K** with **X**.
- Similarly, to derive **V**, we multiply **W_V** with **X**.

**Important Notes**:
1. The same set of features is utilized for both queries and keys, embodying the concept of 'self.'
2. The **query vector** represents the current word (or token) for which we wish to compute attention scores against other words in the sequence.
3. The **key vector** signifies other words (or tokens) in the input sequence, allowing us to calculate attention scores relative to the current word.
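The projection step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the feature matrix and the weight matrices **W_Q**, **W_K**, **W_V** are random placeholders, with features stored as columns to match the walkthrough's convention (four feature vectors of dimension 6, projected down to d_k = 3).

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 feature vectors of dimension 6, stored as the columns of X (6 x 4).
X = rng.standard_normal((6, 4))

# Linear transformation matrices projecting features down to d_k = 3.
# In a real model these are learned; here they are random placeholders.
W_Q = rng.standard_normal((3, 6))
W_K = rng.standard_normal((3, 6))
W_V = rng.standard_normal((3, 6))

Q = W_Q @ X  # query vectors q1..q4 as columns, shape (3, 4)
K = W_K @ X  # key vectors   k1..k4 as columns, shape (3, 4)
V = W_V @ X  # value vectors v1..v4 as columns, shape (3, 4)
```

Note that all three projections start from the same **X**, which is exactly the 'self' in self-attention.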

**Matrix Multiplication**: The next step involves multiplying the transpose of **K** with **Q** (i.e., **K**^T . **Q**). This calculates the dot product for every pair of query and key vectors, estimating the matching score between each "key-query" pair. This is closely related to the **Cosine Similarity** between the vectors and forms the basis of scaled dot-product attention.

**Cosine Similarity**: Cosine similarity measures the cosine of the angle between two vectors, calculated as their dot product divided by the product of their lengths, indicating how closely aligned they are.

- If the cosine similarity is near 1, the vectors point in nearly the same direction.
- If it is close to 0, the vectors are orthogonal, suggesting dissimilarity.
- If it is approximately -1, the vectors point in nearly opposite directions.
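The three cases above can be verified directly. This is a small sketch with hand-picked vectors (the helper `cosine_similarity` is introduced here for illustration, not part of the original walkthrough):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

k = np.array([1.0, 2.0, 0.0])
q = np.array([2.0, 4.0, 0.0])   # parallel to k
r = np.array([-2.0, 1.0, 0.0])  # orthogonal to k

print(cosine_similarity(k, q))  # 1.0 (same direction)
print(cosine_similarity(k, r))  # 0.0 (orthogonal)
print(cosine_similarity(k, -k)) # -1.0 (opposite direction)
```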

**Scaling**: We then divide each element by the square root of the dimension **d_k**. In this example, that dimension is 3. This scaling keeps the magnitude of the matching scores manageable as the dimension grows.

**Softmax**: This step involves three components:

- Raising **e** to the power of each cell's value.
- Summing these values across each column.
- Dividing each element by its column sum (normalization), resulting in a **probability distribution** of attention, forming our **Attention Weight Matrix (A)**.

The Softmax step is crucial as it assigns probabilities to the previously calculated scores, guiding the model on how much importance to assign to each word based on the current query. Higher attention weights indicate greater relevance, enhancing the model's ability to capture dependencies accurately.
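The scaling and softmax steps can be sketched as follows. The raw score matrix here is a made-up example (four tokens, one column per query), not taken from the walkthrough's figures:

```python
import numpy as np

d_k = 3  # dimension of the key/query vectors in this example

# Hypothetical raw matching scores K^T . Q for 4 tokens.
scores = np.array([[2.0, 0.5, 1.0, 0.0],
                   [1.0, 3.0, 0.5, 1.0],
                   [0.0, 1.0, 2.5, 0.5],
                   [0.5, 0.0, 1.0, 2.0]])

scaled = scores / np.sqrt(d_k)  # divide by sqrt(d_k)

# Softmax over each column: exponentiate, sum per column, normalize.
exp = np.exp(scaled)
A = exp / exp.sum(axis=0, keepdims=True)

print(A.sum(axis=0))  # each column sums to 1: a probability distribution
```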

**Matrix Multiplication**: Finally, we multiply the value vectors (**V**) with the Attention Weight Matrix (**A**). These value vectors hold the information tied to each word in the sequence.

The outcome of this multiplication yields the **attention-weighted features (Z)**, which represent a refined representation of the features, assigning greater weights to those deemed more relevant in context.
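Putting all the steps together, the whole pipeline fits in one short function. This is a minimal sketch in the column convention used above, with random placeholder inputs (a numerically stabilized softmax is used, a standard trick not spelled out in the walkthrough):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention, features as columns of X."""
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    d_k = K.shape[0]
    scores = K.T @ Q / np.sqrt(d_k)            # matching score per key-query pair
    exp = np.exp(scores - scores.max(axis=0))  # stable softmax over columns
    A = exp / exp.sum(axis=0, keepdims=True)   # attention weight matrix
    return V @ A                               # attention-weighted features Z

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                     # 4 features of dimension 6
Ws = [rng.standard_normal((3, 6)) for _ in range(3)]  # random W_Q, W_K, W_V
Z = self_attention(X, *Ws)
print(Z.shape)  # (3, 4): one attention-weighted feature per input position
```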

Armed with this information, we progress to the next phase of the transformer architecture, where the feed-forward layer further processes this data.

In summary, we reviewed the essential concepts discussed:
1. The attention mechanism emerged to enhance RNN performance, addressing the limitations of fixed-length vector representations in encoder-decoder models by allowing the model to focus on the relevant parts of the sequence.
2. Self-attention was introduced to incorporate the idea of context into the model, evaluating the same input sequence it processes.
3. While numerous variants of self-attention exist, the scaled dot-product attention remains one of the most prominent, contributing significantly to the transformer architecture's strength.
4. The scaled dot-product self-attention process combines the **Query-Key module (QK-module)** with the **Softmax function**, which assigns probabilities to the attention scores.
5. Once computed, the attention scores are multiplied with the value vectors to produce the attention-weighted features, which are then relayed to the feed-forward layer.

## Multi-Head Attention

To achieve a more diverse and comprehensive representation of the sequence, multiple instances of the self-attention mechanism operate in parallel, which are then concatenated to generate the final attention-weighted values. This approach is known as Multi-Head Attention.
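A minimal sketch of this idea, reusing the single-head computation from before: each head gets its own (W_Q, W_K, W_V) triple, the head outputs are concatenated, and a final projection **W_O** maps the result back to the model dimension. The `W_O` projection follows the original paper; the text above only mentions concatenation, so treat it as an assumption.

```python
import numpy as np

def softmax_cols(s):
    e = np.exp(s - s.max(axis=0))
    return e / e.sum(axis=0, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """Run each self-attention head in parallel, concatenate, then project.

    `heads` is a list of (W_Q, W_K, W_V) triples, one per head; W_O is the
    output projection from the original paper (an assumption here).
    """
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X
        A = softmax_cols(K.T @ Q / np.sqrt(K.shape[0]))
        outs.append(V @ A)                       # one (3, 4) output per head
    return W_O @ np.concatenate(outs, axis=0)    # stack heads, then project

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
heads = [tuple(rng.standard_normal((3, 6)) for _ in range(3)) for _ in range(2)]
W_O = rng.standard_normal((6, 2 * 3))  # project concatenated heads back to dim 6
print(multi_head_attention(X, heads, W_O).shape)  # (6, 4)
```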

## Transformer in a Nutshell

This encapsulates the workings of the inner layer of the transformer architecture. Integrating it with the outer layer, we summarize the Transformer mechanism:
1. The two core concepts in the Transformer architecture are **attention-weighting** and the **feed-forward layer (FFN)**. Together, they allow the Transformer to analyze the input sequence from dual perspectives: **attention** evaluates the sequence by **positions**, while the **FFN** assesses it based on the **dimensions** of the feature matrix.
2. The driving force behind the attention mechanism is the **scaled dot-product Attention**, comprising the **QK-module**, which produces the attention-weighted features.

## 'Attention Is Really All You Need'

Transformers have been around for only a few years, yet the AI landscape has witnessed remarkable advancements owing to them, with ongoing efforts toward further innovation. The authors of the original paper were serious when they chose that title.

It’s fascinating to observe how a foundational concept like the 'dot product,' enhanced with additional elements, can yield such profound results!

P.S. If you wish to attempt this exercise independently, here are the blank templates for your use.

## Blank Template for Hand-Exercise

Now, enjoy working through the exercise while paying homage to your **Robtimus Prime**!

## References:

- Vaswani, Ashish, et al. “Attention Is All You Need.” *Advances in Neural Information Processing Systems* 30 (2017).
- Bahdanau, Dzmitry, et al. “Neural Machine Translation by Jointly Learning to Align and Translate.” *CoRR* abs/1409.0473 (2014).