
Transformer Attention Toy (Encoder Block)

This interactive tool walks through a single Transformer encoder layer: Multi-Head Attention → Residual → LayerNorm → Feed-Forward → Residual → LayerNorm.
Use it to see the tensors, shapes, and attention weights for a concrete setup.

Quick Start (defaults)

  • Sequence length L = 3
  • d_model = 32
  • n_heads = 4, so d_head = d_model / n_heads = 8
  • d_v = 8
  • d_ff = 4 × d_model = 128
  • Scaled dot-product factor: 1 / sqrt(d_head)
Constraint: d_model = heads × d_head

Input X

X (token embeddings) shape [3 × 32] (first 10 of 32 columns shown)
 0.101  -0.052   0.352   0.170  -0.325   0.027  -0.227   0.125   0.365  -0.028
-0.012   0.309  -0.181  -0.050  -0.463  -0.449   0.057   0.097  -0.255   0.146
 0.112   0.400  -0.422   0.220   0.435  -0.222   0.207  -0.320   0.020   0.322

Q = X·Wq, K = X·Wk, V = X·Wv

Q [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
-0.239   0.498  -0.370  -0.073  -0.241   0.089   0.081  -0.371   0.415  -0.330
-0.429   0.715   0.402   0.604   0.492  -0.538   0.112   0.049  -1.180  -0.154
-0.011  -0.148   0.160   0.759   0.402  -0.464   0.590   0.439  -0.084   0.215
K [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 0.018   0.496   1.055   0.050   0.123  -0.184   0.523   0.638  -0.299   0.586
 0.083  -0.233   1.235   1.001  -0.322  -0.036  -0.570  -0.336   0.628   0.470
 0.707  -0.476   0.269   0.612   0.017  -0.215  -0.606   0.055   0.642  -0.830
V [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 0.165   0.397  -0.573   0.189   0.361  -0.042   1.014  -0.046   0.575  -0.150
 0.175   0.201  -0.516   0.251  -0.359   0.391  -0.021  -0.094  -0.374  -0.380
 0.142   0.118   0.208  -0.237  -0.659  -0.530  -0.909   0.023  -0.031   0.505
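
A minimal NumPy sketch of this step (the weights below are random stand-ins for the tool's seeded Wq/Wk/Wv, so the numbers will not match the tables above):

    import numpy as np

    rng = np.random.default_rng(0)      # any seed; illustrative only
    L, d_model = 3, 32

    X  = rng.normal(0.0, 0.3, (L, d_model))        # toy token embeddings
    Wq = rng.normal(0.0, 0.3, (d_model, d_model))
    Wk = rng.normal(0.0, 0.3, (d_model, d_model))
    Wv = rng.normal(0.0, 0.3, (d_model, d_model))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each [3 x 32]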

Split into heads

Each head works on a [3 × 8] slice: the 32 columns of Q, K, and V are split into 4 contiguous groups of d_head = 8.
Head 1
Q (head 1) shape [3 × 8]
-0.239   0.498  -0.370  -0.073  -0.241   0.089   0.081  -0.371
-0.429   0.715   0.402   0.604   0.492  -0.538   0.112   0.049
-0.011  -0.148   0.160   0.759   0.402  -0.464   0.590   0.439
K (head 1) shape [3 × 8]
 0.018   0.496   1.055   0.050   0.123  -0.184   0.523   0.638
 0.083  -0.233   1.235   1.001  -0.322  -0.036  -0.570  -0.336
 0.707  -0.476   0.269   0.612   0.017  -0.215  -0.606   0.055
V (head 1) shape [3 × 8]
 0.165   0.397  -0.573   0.189   0.361  -0.042   1.014  -0.046
 0.175   0.201  -0.516   0.251  -0.359   0.391  -0.021  -0.094
 0.142   0.118   0.208  -0.237  -0.659  -0.530  -0.909   0.023
Head 2
Q (head 2) shape [3 × 8]
 0.415  -0.330   0.416   0.267  -0.406   0.491  -0.573  -0.540
-1.180  -0.154  -0.107   0.309  -0.183   0.030   0.317   0.413
-0.084   0.215  -0.039   0.503  -0.706  -0.392   0.146   0.160
K (head 2) shape [3 × 8]
-0.299   0.586  -0.421   0.164  -0.299  -0.295  -0.334  -0.389
 0.628   0.470  -0.327  -0.213  -0.481   0.278   0.194   0.165
 0.642  -0.830   0.516   0.540  -0.350   0.288  -0.140   0.490
V (head 2) shape [3 × 8]
 0.575  -0.150  -0.274   0.796   0.215  -0.042   0.408  -0.618
-0.374  -0.380  -0.749   0.364  -0.242   0.430  -0.291   0.197
-0.031   0.505   0.046  -0.067  -0.309   1.076   0.168  -0.013
Head 3
Q (head 3) shape [3 × 8]
-0.547   0.060   0.383   0.350  -0.251   0.167  -0.572   0.315
-0.354   0.283   0.183   0.428   0.617  -0.544  -0.116  -0.197
 0.445   0.464  -0.536   0.513   0.006  -0.377  -0.007  -0.056
K (head 3) shape [3 × 8]
 0.355  -0.048   0.222   0.551   0.351   0.050  -0.243   0.900
-0.060   0.203   0.256   0.452   0.512   0.172   0.136  -0.457
-0.014   0.219  -0.178   0.378  -0.178   0.136  -0.447   0.179
V (head 3) shape [3 × 8]
-0.147   0.210  -0.183  -0.262   0.202  -0.904   0.043   0.099
 0.220   0.334  -0.861  -0.753  -0.276  -0.473  -0.844  -0.411
-0.310  -0.441  -0.242  -1.228  -0.649   0.425  -0.118  -0.046
Head 4
Q (head 4) shape [3 × 8]
-0.009  -0.642  -0.281  -1.058   0.139  -0.062  -0.349   0.548
-0.187  -0.354  -0.002   0.528   0.114  -0.421   0.195   0.188
-0.432   0.070  -0.411  -0.084  -0.297   0.500  -1.000  -0.337
K (head 4) shape [3 × 8]
-0.186  -0.518  -1.203   0.079  -0.116  -0.937   0.685  -0.148
-0.166  -0.366   0.412  -0.443   0.616   0.828  -0.149  -0.368
 0.551  -0.057   0.553   0.045   0.306   0.206  -1.171  -0.191
V (head 4) shape [3 × 8]
-0.409  -0.522  -0.250  -0.238   0.396  -0.512   0.492  -0.545
 0.531  -0.816  -0.134   0.432   0.048   0.543   0.295  -0.783
 0.666  -0.352   0.613   0.294  -0.386   0.842   0.059  -0.657
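
Continuing the sketch, one common way to implement the split is a reshape plus transpose (assuming the contiguous column packing shown above, where head 1 gets columns 0-7, head 2 gets 8-15, and so on):

    n_heads, d_head = 4, 8              # d_model = n_heads * d_head

    def split_heads(M):
        # [L, d_model] -> [n_heads, L, d_head]
        return M.reshape(L, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
    print(Qh.shape)                     # (4, 3, 8); head 1 is Qh[0]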

Scaled dot-product scores per head

Head 1 scores [L × L] shape [3 × 3]
Head 2 scores [L × L] shape [3 × 3]
Head 3 scores [L × L] shape [3 × 3]
Head 4 scores [L × L] shape [3 × 3]
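
The score values are collapsed in this capture; toggle them visible in the tool. In the sketch, the scaled scores for all heads come from one batched matrix product, using the 1 / sqrt(d_head) factor from the Quick Start:

    import math

    # [n_heads, L, L]; entry (h, i, j) scores how strongly token i
    # attends to token j in head h
    scores = Qh @ Kh.transpose(0, 2, 1) / math.sqrt(d_head)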

Softmax attention weights per head

Head 1 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 2 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 3 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 4 weights [L × L] (rows sum to 1) shape [3 × 3]
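
The weight values are likewise collapsed above. Each row of weights is the softmax of the corresponding row of scores, so it is non-negative and sums to 1. A numerically stable sketch:

    def softmax(s, axis=-1):
        e = np.exp(s - s.max(axis=axis, keepdims=True))  # subtract row max for stability
        return e / e.sum(axis=axis, keepdims=True)

    weights = softmax(scores)                     # [n_heads, L, L]
    assert np.allclose(weights.sum(axis=-1), 1.0) # rows sum to 1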

Per-head outputs = weights · V

Head 1 out [L × d_head] shape [3 × 8]
 0.161   0.243  -0.305   0.074  -0.204  -0.053   0.057  -0.040
 0.162   0.257  -0.349   0.099  -0.154  -0.020   0.154  -0.045
 0.161   0.248  -0.312   0.076  -0.184  -0.056   0.089  -0.040
Head 2 out [L × d_head] shape [3 × 8]
 0.040   0.055  -0.282   0.308  -0.139   0.564   0.100  -0.123
 0.100  -0.002  -0.307   0.387  -0.090   0.462   0.126  -0.182
 0.071  -0.014  -0.325   0.377  -0.103   0.472   0.104  -0.158
Head 3 out [L × d_head] shape [3 × 8]
-0.095   0.017  -0.406  -0.750  -0.241  -0.306  -0.278  -0.104
-0.059   0.058  -0.455  -0.740  -0.236  -0.338  -0.338  -0.136
-0.085   0.025  -0.421  -0.753  -0.245  -0.307  -0.298  -0.115
Head 4 out [L × d_head] shape [3 × 8]
 0.261  -0.572   0.063   0.166   0.027   0.287   0.286  -0.664
 0.166  -0.556   0.032   0.104   0.072   0.176   0.312  -0.644
 0.349  -0.558   0.135   0.209  -0.038   0.398   0.251  -0.673
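
In the sketch this is again one batched product. Each output row is a convex combination of that head's V rows, which is why the three rows within a head above look so similar when the attention weights are near-uniform:

    head_out = weights @ Vh             # [n_heads, L, d_head] = (4, 3, 8)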

Concat heads

Concat [L × (heads × d_head)] shape [3 × 32] (first 10 of 32 columns shown)
 0.161   0.243  -0.305   0.074  -0.204  -0.053   0.057  -0.040   0.040   0.055
 0.162   0.257  -0.349   0.099  -0.154  -0.020   0.154  -0.045   0.100  -0.002
 0.161   0.248  -0.312   0.076  -0.184  -0.056   0.089  -0.040   0.071  -0.014

Output projection Wₒ

MHA output = Concat · Wₒ shape [3 × 32] (first 10 of 32 columns shown)
 0.512   0.081  -0.108  -0.194  -0.798  -0.024   0.441   0.459  -0.412   0.506
 0.481   0.106  -0.094  -0.165  -0.980  -0.001   0.430   0.533  -0.498   0.489
 0.626   0.066  -0.113  -0.144  -0.989   0.001   0.470   0.377  -0.395   0.592
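
Continuing the sketch, concatenation is just the inverse of split_heads, followed by the output projection (Wo is again a random stand-in):

    concat = head_out.transpose(1, 0, 2).reshape(L, n_heads * d_head)  # [3, 32]
    Wo  = rng.normal(0.0, 0.3, (d_model, d_model))
    mha = concat @ Wo                                                  # [3, 32]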

Residual add (x + MHA(x))

Residual 1 [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 0.613   0.029   0.245  -0.024  -1.123   0.003   0.214   0.584  -0.046   0.478
 0.469   0.415  -0.275  -0.215  -1.442  -0.450   0.486   0.630  -0.753   0.635
 0.738   0.465  -0.536   0.075  -0.554  -0.221   0.677   0.057  -0.375   0.914

LayerNorm after residual

LN1 [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 1.254  -0.002   0.462  -0.117  -2.480  -0.058   0.396   1.192  -0.164   0.964
 0.823   0.728  -0.474  -0.371  -2.510  -0.779   0.853   1.103  -1.308   1.111
 1.086   0.596  -1.203  -0.105  -1.236  -0.637   0.976  -0.138  -0.914   1.403
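
A sketch of both steps. This toy appears to normalize without the learned gain/bias (gamma/beta) that full LayerNorm implementations add, so the sketch omits them too:

    def layer_norm(x, eps=1e-5):
        mu  = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)   # per token, over the 32 features

    res1 = X + mha                  # residual add
    ln1  = layer_norm(res1)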

Feed-Forward (expand → GELU → project)

Pre-GELU (x·W₁) [L × d_ff] shape [3 × 128] (first 10 of 128 columns shown)
-1.394   0.068  -2.503  -0.428   1.034   1.156   0.889   0.190  -0.297  -1.551
-2.338  -0.810  -1.925   0.121   1.392   2.767   2.105  -0.028   1.704   1.288
-2.801  -0.555  -0.280   1.448   0.607   1.381   2.705  -0.985   0.561   0.521
GELU is applied element-wise; the full [3 × 128] hidden matrix is omitted for brevity.
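
Continuing the sketch with the expansion and the common tanh approximation of GELU (the exact form uses the Gaussian CDF; W1 is a random stand-in, and biases are omitted as in the tool's x·W₁ formula):

    def gelu(x):
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    d_ff = 4 * d_model                           # 128
    W1 = rng.normal(0.0, 0.3, (d_model, d_ff))
    hidden = gelu(ln1 @ W1)                      # [3, 128]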

FFN output (after W₂), residual & LayerNorm

FFN out [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 1.294   2.964   1.910   4.092  -0.193  -3.869   1.574   3.027  -0.483  -1.656
 3.415   2.022  -0.088   4.224  -0.751  -5.703   1.711   2.389  -0.422   0.185
 1.817   3.307   2.490   2.966   1.244  -2.441   1.366   1.812  -3.119  -2.223
Residual 2 [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 2.548   2.963   2.372   3.975  -2.673  -3.928   1.970   4.219  -0.647  -0.692
 4.238   2.750  -0.563   3.853  -3.261  -6.482   2.563   3.492  -1.729   1.297
 2.903   3.903   1.287   2.861   0.009  -3.078   2.342   1.674  -4.032  -0.820
LN2 (Layer output) [L × d_model] shape [3 × 32] (first 10 of 32 columns shown)
 0.282   0.387   0.238   0.643  -1.037  -1.354   0.136   0.705  -0.525  -0.537
 0.983   0.570  -0.349   0.876  -1.098  -1.992   0.518   0.776  -0.673   0.167
 0.584   0.847   0.159   0.573  -0.178  -0.990   0.436   0.260  -1.241  -0.396
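
The sketch finishes the block the same way (W2 is a random stand-in):

    W2   = rng.normal(0.0, 0.3, (d_ff, d_model))
    ffn  = hidden @ W2              # project back down, [3, 32]
    res2 = ln1 + ffn                # second residual add
    out  = layer_norm(res2)         # layer output, [3, 32]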

What you can do here

  • Toggle visibility of Q/K/V, scores, softmax weights, per-head outputs, concat, Wₒ output, residuals, norms, and FFN steps.
  • Change seed to regenerate deterministic toy numbers.
  • Adjust L, d_model, heads, d_head/d_v, d_ff (must obey d_model = n_heads × d_head).

Tip from KaizenX: Start by expanding just Scores and Softmax Weights to grok attention, then reveal Residual + LayerNorm and FFN to see the full encoder block.