In the Transformer architecture, the attention mechanism computes its output using the formula:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

But why do we divide by $\sqrt{d_k}$? Let's explore the mathematical reasoning behind this scaling factor.
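
To make the setup concrete, here is a minimal NumPy sketch of scaled dot-product attention (no batching, masking, or multiple heads; the function name and shapes are illustrative, not taken from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Raw logits of shape (seq_len_q, seq_len_k), scaled by 1/sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 4 positions, d_k = 64, d_v = 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((4, 64))
V = rng.standard_normal((4, 32))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 32)
```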

Variance Analysis

Consider a query vector $q$ and a key vector $k$, each of dimension $d_k$. Assume that the components of $q$ and $k$ are mutually independent and each follows a standard normal distribution, i.e., $q_i, k_i \sim \mathcal{N}(0, 1)$.

Expected Value:

$E\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i]$

Since $q_i$ and $k_i$ are independent with zero mean, $E[q_i k_i] = E[q_i]E[k_i] = 0$. Therefore:

$E\left[\sum_{i=1}^{d_k} q_i k_i\right] = 0$
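
A quick Monte Carlo check (the choice of $d_k = 64$ and the sample count are arbitrary, for illustration only) confirms that the sample mean of the dot product sits near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 200_000

# n_samples independent (q, k) pairs with standard-normal components
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
dots = np.einsum("ij,ij->i", q, k)  # dot product q . k for each pair

print(f"sample mean of q.k: {dots.mean():.4f}")  # close to 0
```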

Variance:

$\text{Var}\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} \text{Var}[q_i k_i]$

This holds because the products $q_i k_i$ for different indices $i$ are independent of one another (they involve disjoint sets of independent components), so all cross-covariance terms vanish.

Using the variance formula for the product of independent random variables:

$\text{Var}(XY) = \text{Var}(X)\text{Var}(Y) + \text{Var}(X)[E(Y)]^2 + \text{Var}(Y)[E(X)]^2$

Since $E(q_i) = E(k_i) = 0$, the last two terms vanish and we get:

$\text{Var}[q_i k_i] = \text{Var}(q_i)\text{Var}(k_i) = 1 \times 1 = 1$

Therefore:

$\text{Var}\left[\sum_{i=1}^{d_k} q_i k_i\right] = d_k$
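
Repeating the simulation over a few dimensions (values picked purely for illustration) shows the empirical variance tracking $d_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

for d_k in (16, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.einsum("ij,ij->i", q, k)
    # Empirical variance should be close to d_k
    print(f"d_k = {d_k:3d}   Var(q.k) ~ {dots.var():6.1f}")
```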

The Scaling Solution

The analysis shows that the dot product $q \cdot k$ has mean 0 and variance $d_k$: as the dimensionality increases, the variance grows linearly with it, so the pre-softmax logits take on increasingly large magnitudes. Dividing by $\sqrt{d_k}$ counteracts this exactly: since $\text{Var}(aX) = a^2\,\text{Var}(X)$, the scaled dot product $\frac{q \cdot k}{\sqrt{d_k}}$ has variance $\frac{d_k}{d_k} = 1$, independent of the dimension.
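
As a final sanity check, a rough simulation (again with arbitrary dimensions and sample count) shows that scaling the dot products by $1/\sqrt{d_k}$ brings their empirical variance back to roughly 1 for every dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

for d_k in (16, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.einsum("ij,ij->i", q, k)   # unscaled logits
    scaled = raw / np.sqrt(d_k)         # logits after 1/sqrt(d_k) scaling
    print(f"d_k = {d_k:3d}   Var(raw) ~ {raw.var():6.1f}   "
          f"Var(scaled) ~ {scaled.var():.2f}")
```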