In transformers, attention is calculated using a mechanism called scaled dot-product attention. This process allows the model to determine how much focus to place on different parts of the input sequence when processing each token. Each token is first projected into three vectors derived from its input embedding: a query, a key, and a value. The model then computes relationships between every pair of tokens by comparing each token's query vector with all key vectors, producing attention scores that are normalized and used to weight the corresponding value vectors. The weighted sum of these values becomes the output for that token.
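Written compactly, where Q, K, and V are the stacked query, key, and value matrices and dₖ is the key dimension used in the scaling step described below, the whole mechanism is a single equation:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V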
The calculation occurs in four concrete steps. First, the input embeddings are projected into query (Q), key (K), and value (V) matrices using learned linear transformations. For example, if the input is a sentence like “The cat sat on the mat,” each word embedding is multiplied by three separate weight matrices to create Q, K, and V. Second, the dot product of Q with the transpose of K is computed, producing a matrix of attention scores. These scores are scaled by dividing by the square root of the key dimension (√dₖ), which prevents gradient problems caused by large values. Third, a softmax function is applied row-wise to convert the scores into probabilities that sum to 1. Finally, these probabilities are multiplied by the V matrix to produce the final output, so that each token’s output is a combination of all tokens’ values, weighted by their relevance.
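As a minimal sketch of these four steps (assuming NumPy, a toy six-token sequence, and random matrices standing in for the learned projection weights), the computation could look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    # Step 2: raw attention scores, scaled by sqrt(d_k) to keep values moderate
    scores = Q @ K.T / np.sqrt(d_k)                         # (seq_len, seq_len)
    # Step 3: row-wise softmax turns scores into weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 4: each output row is a weighted sum of all value vectors
    return weights @ V                                      # (seq_len, d_v)

# Step 1: project input embeddings into Q, K, V
# (random weights here are stand-ins for learned parameters)
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8                            # e.g. "The cat sat on the mat"
X = rng.normal(size=(seq_len, d_model))                     # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)                                         # (6, 8)
```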
A practical detail is the use of multi-head attention, which splits Q, K, and V into smaller subspaces (“heads”) to capture diverse relationships. For instance, one head might focus on syntactic relationships (e.g., subject-verb agreement), while another tracks positional dependencies. After processing, the outputs of all heads are concatenated and linearly projected to form the final result. The scaling factor (√dₖ) ensures stable gradients during training, as larger key dimensions lead to larger dot products, which can push softmax into regions of low sensitivity. This mechanism allows transformers to efficiently model long-range dependencies, outperforming earlier architectures like RNNs that process sequences sequentially.
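A rough NumPy sketch of that split, attend, concatenate, and project flow might look like the following (the head count, dimensions, and random weight matrices are illustrative assumptions, not values from a real model):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project once, then split the feature dimension into `num_heads` subspaces
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention runs independently in every head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                      # (heads, seq, d_head)
    # Concatenate head outputs and apply the final linear projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                      # (seq_len, d_model)

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (6, 16)
```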