Transformer Attention Toy (Encoder Block)
This interactive tool walks through a single Transformer encoder layer: Multi-Head Attention → Residual → LayerNorm → Feed-Forward → Residual → LayerNorm.
Use it to see the tensors, shapes, and attention weights for a concrete setup.
Quick Start (defaults)
- Sequence length L = 3
- d_model = 32
- n_heads = 4, so d_head = d_model / n_heads = 8
- d_v = 8
- d_ff = 4 × d_model = 128
- Scaled dot-product factor: 1 / sqrt(d_head)
Constraint: d_model = n_heads × d_head
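If you want to mirror these defaults outside the tool, here is a minimal NumPy sketch. The variable names and the random embeddings are stand-ins, not the tool's internals, so the numbers will not match the tables below.

```python
import numpy as np

# Default toy configuration (illustrative names, not the tool's internals)
L, d_model, n_heads = 3, 32, 4
d_head = d_model // n_heads            # 8
d_v = d_head                           # 8 in this toy
d_ff = 4 * d_model                     # 128
assert d_model == n_heads * d_head     # the constraint above

rng = np.random.default_rng(0)         # stands in for the tool's "seed" control
X = rng.normal(scale=0.3, size=(L, d_model))   # toy token embeddings, [3 x 32]
```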
Input X
X (token embeddings) shape [3 × 32]
| 0.101 | -0.052 | 0.352 | 0.170 | -0.325 | 0.027 | -0.227 | 0.125 | 0.365 | -0.028 | … |
| -0.012 | 0.309 | -0.181 | -0.050 | -0.463 | -0.449 | 0.057 | 0.097 | -0.255 | 0.146 | … |
| 0.112 | 0.400 | -0.422 | 0.220 | 0.435 | -0.222 | 0.207 | -0.320 | 0.020 | 0.322 | … |
Q = X·Wq, K = X·Wk, V = X·Wv
Q [L × d_model] shape [3 × 32]
| -0.239 | 0.498 | -0.370 | -0.073 | -0.241 | 0.089 | 0.081 | -0.371 | 0.415 | -0.330 | … |
| -0.429 | 0.715 | 0.402 | 0.604 | 0.492 | -0.538 | 0.112 | 0.049 | -1.180 | -0.154 | … |
| -0.011 | -0.148 | 0.160 | 0.759 | 0.402 | -0.464 | 0.590 | 0.439 | -0.084 | 0.215 | … |
K [L × d_model] shape [3 × 32]
| 0.018 | 0.496 | 1.055 | 0.050 | 0.123 | -0.184 | 0.523 | 0.638 | -0.299 | 0.586 | … |
| 0.083 | -0.233 | 1.235 | 1.001 | -0.322 | -0.036 | -0.570 | -0.336 | 0.628 | 0.470 | … |
| 0.707 | -0.476 | 0.269 | 0.612 | 0.017 | -0.215 | -0.606 | 0.055 | 0.642 | -0.830 | … |
V [L × d_model] shape [3 × 32]
| 0.165 | 0.397 | -0.573 | 0.189 | 0.361 | -0.042 | 1.014 | -0.046 | 0.575 | -0.150 | … |
| 0.175 | 0.201 | -0.516 | 0.251 | -0.359 | 0.391 | -0.021 | -0.094 | -0.374 | -0.380 | … |
| 0.142 | 0.118 | 0.208 | -0.237 | -0.659 | -0.530 | -0.909 | 0.023 | -0.031 | 0.505 | … |
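A sketch of the three projections under the same assumptions (random stand-in weights; in a real model Wq, Wk, Wv are learned, so these values will not match the tables above):

```python
import numpy as np

L, d_model = 3, 32
rng = np.random.default_rng(0)
X = rng.normal(scale=0.3, size=(L, d_model))          # stand-in token embeddings

Wq = rng.normal(scale=0.3, size=(d_model, d_model))   # learned in a real model
Wk = rng.normal(scale=0.3, size=(d_model, d_model))
Wv = rng.normal(scale=0.3, size=(d_model, d_model))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # each [L x d_model] = [3 x 32]
```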
Split into heads
Each head is [L × d_head] = [3 × 8].
Head 1
Q slice [L × d_head] shape [3 × 8]
| -0.239 | 0.498 | -0.370 | -0.073 | -0.241 | 0.089 | 0.081 | -0.371 |
| -0.429 | 0.715 | 0.402 | 0.604 | 0.492 | -0.538 | 0.112 | 0.049 |
| -0.011 | -0.148 | 0.160 | 0.759 | 0.402 | -0.464 | 0.590 | 0.439 |
K slice [L × d_head] shape [3 × 8]
| 0.018 | 0.496 | 1.055 | 0.050 | 0.123 | -0.184 | 0.523 | 0.638 |
| 0.083 | -0.233 | 1.235 | 1.001 | -0.322 | -0.036 | -0.570 | -0.336 |
| 0.707 | -0.476 | 0.269 | 0.612 | 0.017 | -0.215 | -0.606 | 0.055 |
V slice [L × d_head] shape [3 × 8]
| 0.165 | 0.397 | -0.573 | 0.189 | 0.361 | -0.042 | 1.014 | -0.046 |
| 0.175 | 0.201 | -0.516 | 0.251 | -0.359 | 0.391 | -0.021 | -0.094 |
| 0.142 | 0.118 | 0.208 | -0.237 | -0.659 | -0.530 | -0.909 | 0.023 |
Head 2
Q slice [L × d_head] shape [3 × 8]
| 0.415 | -0.330 | 0.416 | 0.267 | -0.406 | 0.491 | -0.573 | -0.540 |
| -1.180 | -0.154 | -0.107 | 0.309 | -0.183 | 0.030 | 0.317 | 0.413 |
| -0.084 | 0.215 | -0.039 | 0.503 | -0.706 | -0.392 | 0.146 | 0.160 |
K slice [L × d_head] shape [3 × 8]
| -0.299 | 0.586 | -0.421 | 0.164 | -0.299 | -0.295 | -0.334 | -0.389 |
| 0.628 | 0.470 | -0.327 | -0.213 | -0.481 | 0.278 | 0.194 | 0.165 |
| 0.642 | -0.830 | 0.516 | 0.540 | -0.350 | 0.288 | -0.140 | 0.490 |
V slice [L × d_head] shape [3 × 8]
| 0.575 | -0.150 | -0.274 | 0.796 | 0.215 | -0.042 | 0.408 | -0.618 |
| -0.374 | -0.380 | -0.749 | 0.364 | -0.242 | 0.430 | -0.291 | 0.197 |
| -0.031 | 0.505 | 0.046 | -0.067 | -0.309 | 1.076 | 0.168 | -0.013 |
Head 3
Q slice [L × d_head] shape [3 × 8]
| -0.547 | 0.060 | 0.383 | 0.350 | -0.251 | 0.167 | -0.572 | 0.315 |
| -0.354 | 0.283 | 0.183 | 0.428 | 0.617 | -0.544 | -0.116 | -0.197 |
| 0.445 | 0.464 | -0.536 | 0.513 | 0.006 | -0.377 | -0.007 | -0.056 |
K slice [L × d_head] shape [3 × 8]
| 0.355 | -0.048 | 0.222 | 0.551 | 0.351 | 0.050 | -0.243 | 0.900 |
| -0.060 | 0.203 | 0.256 | 0.452 | 0.512 | 0.172 | 0.136 | -0.457 |
| -0.014 | 0.219 | -0.178 | 0.378 | -0.178 | 0.136 | -0.447 | 0.179 |
V slice [L × d_head] shape [3 × 8]
| -0.147 | 0.210 | -0.183 | -0.262 | 0.202 | -0.904 | 0.043 | 0.099 |
| 0.220 | 0.334 | -0.861 | -0.753 | -0.276 | -0.473 | -0.844 | -0.411 |
| -0.310 | -0.441 | -0.242 | -1.228 | -0.649 | 0.425 | -0.118 | -0.046 |
Head 4
Q slice [L × d_head] shape [3 × 8]
| -0.009 | -0.642 | -0.281 | -1.058 | 0.139 | -0.062 | -0.349 | 0.548 |
| -0.187 | -0.354 | -0.002 | 0.528 | 0.114 | -0.421 | 0.195 | 0.188 |
| -0.432 | 0.070 | -0.411 | -0.084 | -0.297 | 0.500 | -1.000 | -0.337 |
K slice [L × d_head] shape [3 × 8]
| -0.186 | -0.518 | -1.203 | 0.079 | -0.116 | -0.937 | 0.685 | -0.148 |
| -0.166 | -0.366 | 0.412 | -0.443 | 0.616 | 0.828 | -0.149 | -0.368 |
| 0.551 | -0.057 | 0.553 | 0.045 | 0.306 | 0.206 | -1.171 | -0.191 |
V slice [L × d_head] shape [3 × 8]
| -0.409 | -0.522 | -0.250 | -0.238 | 0.396 | -0.512 | 0.492 | -0.545 |
| 0.531 | -0.816 | -0.134 | 0.432 | 0.048 | 0.543 | 0.295 | -0.783 |
| 0.666 | -0.352 | 0.613 | 0.294 | -0.386 | 0.842 | 0.059 | -0.657 |
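The head split is just a reshape of the columns: head h reads columns h·d_head through (h+1)·d_head − 1. A NumPy sketch with stand-in values:

```python
import numpy as np

L, d_model, n_heads = 3, 32, 4
d_head = d_model // n_heads
Q = np.random.default_rng(0).normal(size=(L, d_model))   # stand-in for Q above

# [L, d_model] -> [n_heads, L, d_head]
Q_heads = Q.reshape(L, n_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)   # (4, 3, 8); Q_heads[0] corresponds to the Head 1 slice
```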
Scaled dot-product scores per head
Head 1 scores [L × L] shape [3 × 3]
Head 2 scores [L × L] shape [3 × 3]
Head 3 scores [L × L] shape [3 × 3]
Head 4 scores [L × L] shape [3 × 3]
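Each score matrix is that head's Q times Kᵀ, divided by sqrt(d_head) as noted in the Quick Start. A sketch for a single head (stand-in values, not the tool's):

```python
import numpy as np

d_head = 8
rng = np.random.default_rng(0)
Q_h = rng.normal(size=(3, d_head))       # one head's Q slice (stand-in)
K_h = rng.normal(size=(3, d_head))       # the same head's K slice (stand-in)

scores = Q_h @ K_h.T / np.sqrt(d_head)   # [L x L] = [3 x 3]
```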
Softmax attention weights per head
Head 1 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 2 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 3 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 4 weights [L × L] (rows sum to 1) shape [3 × 3]
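The weights come from a row-wise softmax over the scores, so each row sums to 1. A self-contained sketch (the score values here are made up, not taken from the tool):

```python
import numpy as np

def softmax_rows(scores):
    # Subtract each row's max for numerical stability, then normalize the row.
    z = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

scores = np.array([[0.2, -0.1, 0.4],
                   [0.0,  0.3, -0.2],
                   [0.1,  0.1,  0.1]])   # made-up scores
weights = softmax_rows(scores)
print(weights.sum(axis=1))               # each row sums to 1
```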
Per-head outputs = weights · V
Head 1 out [L × d_head] shape [3 × 8]
| 0.161 | 0.243 | -0.305 | 0.074 | -0.204 | -0.053 | 0.057 | -0.040 |
| 0.162 | 0.257 | -0.349 | 0.099 | -0.154 | -0.020 | 0.154 | -0.045 |
| 0.161 | 0.248 | -0.312 | 0.076 | -0.184 | -0.056 | 0.089 | -0.040 |
Head 2 out [L × d_head] shape [3 × 8]
| 0.040 | 0.055 | -0.282 | 0.308 | -0.139 | 0.564 | 0.100 | -0.123 |
| 0.100 | -0.002 | -0.307 | 0.387 | -0.090 | 0.462 | 0.126 | -0.182 |
| 0.071 | -0.014 | -0.325 | 0.377 | -0.103 | 0.472 | 0.104 | -0.158 |
Head 3 out [L × d_head] shape [3 × 8]
| -0.095 | 0.017 | -0.406 | -0.750 | -0.241 | -0.306 | -0.278 | -0.104 |
| -0.059 | 0.058 | -0.455 | -0.740 | -0.236 | -0.338 | -0.338 | -0.136 |
| -0.085 | 0.025 | -0.421 | -0.753 | -0.245 | -0.307 | -0.298 | -0.115 |
Head 4 out [L × d_head] shape [3 × 8]
| 0.261 | -0.572 | 0.063 | 0.166 | 0.027 | 0.287 | 0.286 | -0.664 |
| 0.166 | -0.556 | 0.032 | 0.104 | 0.072 | 0.176 | 0.312 | -0.644 |
| 0.349 | -0.558 | 0.135 | 0.209 | -0.038 | 0.398 | 0.251 | -0.673 |
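Each per-head output is that head's weight matrix applied to its V slice. A sketch with stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.full((3, 3), 1 / 3)   # stand-in attention weights; rows sum to 1
V_h = rng.normal(size=(3, 8))      # the same head's V slice (stand-in)

head_out = weights @ V_h           # [L x d_head] = [3 x 8]
```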
Concat heads
Concat [L × (heads×d_head)] = [3 × 32] shape [3 × 32]
| 0.161 | 0.243 | -0.305 | 0.074 | -0.204 | -0.053 | 0.057 | -0.040 | 0.040 | 0.055 | … |
| 0.162 | 0.257 | -0.349 | 0.099 | -0.154 | -0.020 | 0.154 | -0.045 | 0.100 | -0.002 | … |
| 0.161 | 0.248 | -0.312 | 0.076 | -0.184 | -0.056 | 0.089 | -0.040 | 0.071 | -0.014 | … |
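Concatenation puts the four [3 × 8] head outputs side by side along the feature axis. A sketch with stand-in values:

```python
import numpy as np

# Four per-head outputs of shape [3 x 8] (stand-in values)
head_outs = [np.random.default_rng(h).normal(size=(3, 8)) for h in range(4)]

concat = np.concatenate(head_outs, axis=-1)   # [L x (heads*d_head)] = [3 x 32]
```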
Output projection Wₒ
MHA output = Concat · Wₒ [3 × 32] shape [3 × 32]
| 0.512 | 0.081 | -0.108 | -0.194 | -0.798 | -0.024 | 0.441 | 0.459 | -0.412 | 0.506 | … |
| 0.481 | 0.106 | -0.094 | -0.165 | -0.980 | -0.001 | 0.430 | 0.533 | -0.498 | 0.489 | … |
| 0.626 | 0.066 | -0.113 | -0.144 | -0.989 | 0.001 | 0.470 | 0.377 | -0.395 | 0.592 | … |
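The last attention step multiplies the concatenation by the output projection Wₒ (learned in a real model). A sketch with random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
concat = rng.normal(size=(3, 32))           # stand-in for the concat above
Wo = rng.normal(scale=0.3, size=(32, 32))   # learned output projection in a real model

mha_out = concat @ Wo                       # [L x d_model] = [3 x 32]
```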
Residual add (x + MHA(x))
Residual 1 [L × d_model] shape [3 × 32]
| 0.613 | 0.029 | 0.245 | -0.024 | -1.123 | 0.003 | 0.214 | 0.584 | -0.046 | 0.478 | … |
| 0.469 | 0.415 | -0.275 | -0.215 | -1.442 | -0.450 | 0.486 | 0.630 | -0.753 | 0.635 | … |
| 0.738 | 0.465 | -0.536 | 0.075 | -0.554 | -0.221 | 0.677 | 0.057 | -0.375 | 0.914 | … |
LayerNorm after residual
LN1 [L × d_model] shape [3 × 32]
| 1.254 | -0.002 | 0.462 | -0.117 | -2.480 | -0.058 | 0.396 | 1.192 | -0.164 | 0.964 | … |
| 0.823 | 0.728 | -0.474 | -0.371 | -2.510 | -0.779 | 0.853 | 1.103 | -1.308 | 1.111 | … |
| 1.086 | 0.596 | -1.203 | -0.105 | -1.236 | -0.637 | 0.976 | -0.138 | -0.914 | 1.403 | … |
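The residual add and LayerNorm can be sketched as below. This is a per-row normalization; the learned gain and bias that a full implementation would include are omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean / unit variance (learned gain and bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 32))         # stand-in input
mha_out = rng.normal(size=(3, 32))   # stand-in MHA output

ln1 = layer_norm(X + mha_out)        # Residual 1, then LayerNorm
```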
Feed-Forward (expand → GELU → project)
Pre-GELU (x·W₁) [L × d_ff] shape [3 × 128]
| -1.394 | 0.068 | -2.503 | -0.428 | 1.034 | 1.156 | 0.889 | 0.190 | -0.297 | -1.551 | … |
| -2.338 | -0.810 | -1.925 | 0.121 | 1.392 | 2.767 | 2.105 | -0.028 | 1.704 | 1.288 | … |
| -2.801 | -0.555 | -0.280 | 1.448 | 0.607 | 1.381 | 2.705 | -0.985 | 0.561 | 0.521 | … |
GELU is applied element-wise (the hidden activations are not shown in full, for brevity).
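A sketch of the feed-forward step under the same assumptions (tanh-approximate GELU, biases omitted, random stand-in weights):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
ln1 = rng.normal(size=(3, d_model))                # stand-in for LN1 above
W1 = rng.normal(scale=0.2, size=(d_model, d_ff))   # expand
W2 = rng.normal(scale=0.2, size=(d_ff, d_model))   # project back

ffn_out = gelu(ln1 @ W1) @ W2                      # [3 x 32]; biases omitted
```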
FFN output (after W₂), residual & LayerNorm
FFN out [L × d_model] shape [3 × 32]
| 1.294 | 2.964 | 1.910 | 4.092 | -0.193 | -3.869 | 1.574 | 3.027 | -0.483 | -1.656 | … |
| 3.415 | 2.022 | -0.088 | 4.224 | -0.751 | -5.703 | 1.711 | 2.389 | -0.422 | 0.185 | … |
| 1.817 | 3.307 | 2.490 | 2.966 | 1.244 | -2.441 | 1.366 | 1.812 | -3.119 | -2.223 | … |
Residual 2 [L × d_model] shape [3 × 32]
| 2.548 | 2.963 | 2.372 | 3.975 | -2.673 | -3.928 | 1.970 | 4.219 | -0.647 | -0.692 | … |
| 4.238 | 2.750 | -0.563 | 3.853 | -3.261 | -6.482 | 2.563 | 3.492 | -1.729 | 1.297 | … |
| 2.903 | 3.903 | 1.287 | 2.861 | 0.009 | -3.078 | 2.342 | 1.674 | -4.032 | -0.820 | … |
LN2 (Layer output) [L × d_model] shape [3 × 32]
| 0.282 | 0.387 | 0.238 | 0.643 | -1.037 | -1.354 | 0.136 | 0.705 | -0.525 | -0.537 | … |
| 0.983 | 0.570 | -0.349 | 0.876 | -1.098 | -1.992 | 0.518 | 0.776 | -0.673 | 0.167 | … |
| 0.584 | 0.847 | 0.159 | 0.573 | -0.178 | -0.990 | 0.436 | 0.260 | -1.241 | -0.396 | … |
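Putting the steps above together, here is a compact end-to-end sketch of the whole encoder layer. It uses random stand-in weights and omits biases and learned LayerNorm parameters, so its outputs will not match the tables, but the shapes and data flow are the same.

```python
import numpy as np

def softmax_rows(s):
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(X, Wq, Wk, Wv, Wo, W1, W2, n_heads):
    L, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split into heads: [L, d_model] -> [n_heads, L, d_head]
    split = lambda M: M.reshape(L, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    weights = softmax_rows(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ Vh                                   # [n_heads, L, d_head]
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)  # undo the split
    x = layer_norm(X + concat @ Wo)                        # Residual 1 + LN1
    return layer_norm(x + gelu(x @ W1) @ W2)               # FFN + Residual 2 + LN2

rng = np.random.default_rng(0)
L, d_model, n_heads, d_ff = 3, 32, 4, 128
X = rng.normal(scale=0.3, size=(L, d_model))
Ws = [rng.normal(scale=0.2, size=s) for s in
      [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]]
out = encoder_block(X, *Ws, n_heads=n_heads)
print(out.shape)   # (3, 32)
```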
What you can do here
- Toggle visibility of Q/K/V, scores, softmax weights, per-head outputs, concat, Wₒ output, residuals, norms, and FFN steps.
- Change seed to regenerate deterministic toy numbers.
- Adjust L, d_model, heads, d_head/d_v, d_ff (must obey d_model = n_heads × d_head).
Tip from KaizenX: Start by expanding just Scores and Softmax Weights to grok attention, then reveal Residual + LayerNorm and FFN to see the full encoder block.