Transformer Attention Toy (Encoder Block)
This interactive tool walks through a single Transformer encoder layer: Multi-Head Attention → Residual → LayerNorm → Feed-Forward → Residual → LayerNorm.
Use it to see the tensors, shapes, and attention weights for a concrete setup.
Quick Start (defaults)
- Sequence length L = 3
- d_model = 32
- n_heads = 4, so d_head = d_model / n_heads = 8
- d_v = 8
- d_ff = 4 × d_model = 128
- Scaled dot-product factor: 1 / sqrt(d_head)
Constraint: d_model = n_heads × d_head
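If you want to mirror these defaults outside the tool, here is a minimal NumPy sketch. The variable names and the random embeddings are stand-ins, not the tool's internals, so the numbers will not match the tables below.

```python
import numpy as np

# Default toy configuration (illustrative names, not the tool's internals)
L, d_model, n_heads = 3, 32, 4
d_head = d_model // n_heads            # 8
d_v = d_head                           # 8 in this toy
d_ff = 4 * d_model                     # 128
assert d_model == n_heads * d_head     # the constraint above

rng = np.random.default_rng(0)         # stands in for the tool's "seed" control
X = rng.normal(scale=0.3, size=(L, d_model))   # toy token embeddings, [3 x 32]
```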
Input X
X (token embeddings) shape [3 × 32]
| 0.101 | -0.052 | 0.352 | 0.170 | -0.325 | 0.027 | -0.227 | 0.125 | 0.365 | -0.028 | … |
| -0.012 | 0.309 | -0.181 | -0.050 | -0.463 | -0.449 | 0.057 | 0.097 | -0.255 | 0.146 | … |
| 0.112 | 0.400 | -0.422 | 0.220 | 0.435 | -0.222 | 0.207 | -0.320 | 0.020 | 0.322 | … |
Q = X·Wq, K = X·Wk, V = X·Wv
Q [L × d_model] shape [3 × 32]
| -0.239 | 0.498 | -0.370 | -0.073 | -0.241 | 0.089 | 0.081 | -0.371 | 0.415 | -0.330 | … |
| -0.429 | 0.715 | 0.402 | 0.604 | 0.492 | -0.538 | 0.112 | 0.049 | -1.180 | -0.154 | … |
| -0.011 | -0.148 | 0.160 | 0.759 | 0.402 | -0.464 | 0.590 | 0.439 | -0.084 | 0.215 | … |
K [L × d_model] shape [3 × 32]
| 0.018 | 0.496 | 1.055 | 0.050 | 0.123 | -0.184 | 0.523 | 0.638 | -0.299 | 0.586 | … |
| 0.083 | -0.233 | 1.235 | 1.001 | -0.322 | -0.036 | -0.570 | -0.336 | 0.628 | 0.470 | … |
| 0.707 | -0.476 | 0.269 | 0.612 | 0.017 | -0.215 | -0.606 | 0.055 | 0.642 | -0.830 | … |
V [L × d_model] shape [3 × 32]
| 0.165 | 0.397 | -0.573 | 0.189 | 0.361 | -0.042 | 1.014 | -0.046 | 0.575 | -0.150 | … |
| 0.175 | 0.201 | -0.516 | 0.251 | -0.359 | 0.391 | -0.021 | -0.094 | -0.374 | -0.380 | … |
| 0.142 | 0.118 | 0.208 | -0.237 | -0.659 | -0.530 | -0.909 | 0.023 | -0.031 | 0.505 | … |
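A sketch of the three projections under the same assumptions (random stand-in weights; in a real model Wq, Wk, Wv are learned, so these values will not match the tables above):

```python
import numpy as np

L, d_model = 3, 32
rng = np.random.default_rng(0)
X = rng.normal(scale=0.3, size=(L, d_model))          # stand-in token embeddings

Wq = rng.normal(scale=0.3, size=(d_model, d_model))   # learned in a real model
Wk = rng.normal(scale=0.3, size=(d_model, d_model))
Wv = rng.normal(scale=0.3, size=(d_model, d_model))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # each [L x d_model] = [3 x 32]
```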
Split into heads
Each head is [L × d_head] = [3 × 8].
Head 1
Q slice [L × d_head] shape [3 × 8]
| -0.239 | 0.498 | -0.370 | -0.073 | -0.241 | 0.089 | 0.081 | -0.371 |
| -0.429 | 0.715 | 0.402 | 0.604 | 0.492 | -0.538 | 0.112 | 0.049 |
| -0.011 | -0.148 | 0.160 | 0.759 | 0.402 | -0.464 | 0.590 | 0.439 |
K slice [L × d_head] shape [3 × 8]
| 0.018 | 0.496 | 1.055 | 0.050 | 0.123 | -0.184 | 0.523 | 0.638 |
| 0.083 | -0.233 | 1.235 | 1.001 | -0.322 | -0.036 | -0.570 | -0.336 |
| 0.707 | -0.476 | 0.269 | 0.612 | 0.017 | -0.215 | -0.606 | 0.055 |
V slice [L × d_head] shape [3 × 8]
| 0.165 | 0.397 | -0.573 | 0.189 | 0.361 | -0.042 | 1.014 | -0.046 |
| 0.175 | 0.201 | -0.516 | 0.251 | -0.359 | 0.391 | -0.021 | -0.094 |
| 0.142 | 0.118 | 0.208 | -0.237 | -0.659 | -0.530 | -0.909 | 0.023 |
Head 2
Q slice [L × d_head] shape [3 × 8]
| 0.415 | -0.330 | 0.416 | 0.267 | -0.406 | 0.491 | -0.573 | -0.540 |
| -1.180 | -0.154 | -0.107 | 0.309 | -0.183 | 0.030 | 0.317 | 0.413 |
| -0.084 | 0.215 | -0.039 | 0.503 | -0.706 | -0.392 | 0.146 | 0.160 |
K slice [L × d_head] shape [3 × 8]
| -0.299 | 0.586 | -0.421 | 0.164 | -0.299 | -0.295 | -0.334 | -0.389 |
| 0.628 | 0.470 | -0.327 | -0.213 | -0.481 | 0.278 | 0.194 | 0.165 |
| 0.642 | -0.830 | 0.516 | 0.540 | -0.350 | 0.288 | -0.140 | 0.490 |
V slice [L × d_head] shape [3 × 8]
| 0.575 | -0.150 | -0.274 | 0.796 | 0.215 | -0.042 | 0.408 | -0.618 |
| -0.374 | -0.380 | -0.749 | 0.364 | -0.242 | 0.430 | -0.291 | 0.197 |
| -0.031 | 0.505 | 0.046 | -0.067 | -0.309 | 1.076 | 0.168 | -0.013 |
Head 3
Q slice [L × d_head] shape [3 × 8]
| -0.547 | 0.060 | 0.383 | 0.350 | -0.251 | 0.167 | -0.572 | 0.315 |
| -0.354 | 0.283 | 0.183 | 0.428 | 0.617 | -0.544 | -0.116 | -0.197 |
| 0.445 | 0.464 | -0.536 | 0.513 | 0.006 | -0.377 | -0.007 | -0.056 |
K slice [L × d_head] shape [3 × 8]
| 0.355 | -0.048 | 0.222 | 0.551 | 0.351 | 0.050 | -0.243 | 0.900 |
| -0.060 | 0.203 | 0.256 | 0.452 | 0.512 | 0.172 | 0.136 | -0.457 |
| -0.014 | 0.219 | -0.178 | 0.378 | -0.178 | 0.136 | -0.447 | 0.179 |
V slice [L × d_head] shape [3 × 8]
| -0.147 | 0.210 | -0.183 | -0.262 | 0.202 | -0.904 | 0.043 | 0.099 |
| 0.220 | 0.334 | -0.861 | -0.753 | -0.276 | -0.473 | -0.844 | -0.411 |
| -0.310 | -0.441 | -0.242 | -1.228 | -0.649 | 0.425 | -0.118 | -0.046 |
Head 4
Q slice [L × d_head] shape [3 × 8]
| -0.009 | -0.642 | -0.281 | -1.058 | 0.139 | -0.062 | -0.349 | 0.548 |
| -0.187 | -0.354 | -0.002 | 0.528 | 0.114 | -0.421 | 0.195 | 0.188 |
| -0.432 | 0.070 | -0.411 | -0.084 | -0.297 | 0.500 | -1.000 | -0.337 |
K slice [L × d_head] shape [3 × 8]
| -0.186 | -0.518 | -1.203 | 0.079 | -0.116 | -0.937 | 0.685 | -0.148 |
| -0.166 | -0.366 | 0.412 | -0.443 | 0.616 | 0.828 | -0.149 | -0.368 |
| 0.551 | -0.057 | 0.553 | 0.045 | 0.306 | 0.206 | -1.171 | -0.191 |
V slice [L × d_head] shape [3 × 8]
| -0.409 | -0.522 | -0.250 | -0.238 | 0.396 | -0.512 | 0.492 | -0.545 |
| 0.531 | -0.816 | -0.134 | 0.432 | 0.048 | 0.543 | 0.295 | -0.783 |
| 0.666 | -0.352 | 0.613 | 0.294 | -0.386 | 0.842 | 0.059 | -0.657 |
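The head split is just a reshape of the columns: head h reads columns h·d_head through (h+1)·d_head − 1. A NumPy sketch with stand-in values:

```python
import numpy as np

L, d_model, n_heads = 3, 32, 4
d_head = d_model // n_heads
Q = np.random.default_rng(0).normal(size=(L, d_model))   # stand-in for Q above

# [L, d_model] -> [n_heads, L, d_head]
Q_heads = Q.reshape(L, n_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)   # (4, 3, 8); Q_heads[0] corresponds to the Head 1 slice
```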
Scaled dot-product scores per head
Head 1 scores [L × L] shape [3 × 3]
Head 2 scores [L × L] shape [3 × 3]
Head 3 scores [L × L] shape [3 × 3]
Head 4 scores [L × L] shape [3 × 3]
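Each score matrix is that head's Q times Kᵀ, divided by sqrt(d_head) as noted in the Quick Start. A sketch for a single head (stand-in values, not the tool's):

```python
import numpy as np

d_head = 8
rng = np.random.default_rng(0)
Q_h = rng.normal(size=(3, d_head))       # one head's Q slice (stand-in)
K_h = rng.normal(size=(3, d_head))       # the same head's K slice (stand-in)

scores = Q_h @ K_h.T / np.sqrt(d_head)   # [L x L] = [3 x 3]
```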
Softmax attention weights per head
Head 1 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 2 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 3 weights [L × L] (rows sum to 1) shape [3 × 3]
Head 4 weights [L × L] (rows sum to 1) shape [3 × 3]
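The weights come from a row-wise softmax over the scores, so each row sums to 1. A self-contained sketch (the score values here are made up, not taken from the tool):

```python
import numpy as np

def softmax_rows(scores):
    # Subtract each row's max for numerical stability, then normalize the row.
    z = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

scores = np.array([[0.2, -0.1, 0.4],
                   [0.0,  0.3, -0.2],
                   [0.1,  0.1,  0.1]])   # made-up scores
weights = softmax_rows(scores)
print(weights.sum(axis=1))               # each row sums to 1
```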
Per-head outputs = weights · V
Head 1 out [L × d_head] shape [3 × 8]
| 0.161 | 0.243 | -0.305 | 0.074 | -0.204 | -0.053 | 0.057 | -0.040 |
| 0.162 | 0.257 | -0.349 | 0.099 | -0.154 | -0.020 | 0.154 | -0.045 |
| 0.161 | 0.248 | -0.312 | 0.076 | -0.184 | -0.056 | 0.089 | -0.040 |
Head 2 out [L × d_head] shape [3 × 8]
| 0.040 | 0.055 | -0.282 | 0.308 | -0.139 | 0.564 | 0.100 | -0.123 |
| 0.100 | -0.002 | -0.307 | 0.387 | -0.090 | 0.462 | 0.126 | -0.182 |
| 0.071 | -0.014 | -0.325 | 0.377 | -0.103 | 0.472 | 0.104 | -0.158 |
Head 3 out [L × d_head] shape [3 × 8]
| -0.095 | 0.017 | -0.406 | -0.750 | -0.241 | -0.306 | -0.278 | -0.104 |
| -0.059 | 0.058 | -0.455 | -0.740 | -0.236 | -0.338 | -0.338 | -0.136 |
| -0.085 | 0.025 | -0.421 | -0.753 | -0.245 | -0.307 | -0.298 | -0.115 |
Head 4 out [L × d_head] shape [3 × 8]
| 0.261 | -0.572 | 0.063 | 0.166 | 0.027 | 0.287 | 0.286 | -0.664 |
| 0.166 | -0.556 | 0.032 | 0.104 | 0.072 | 0.176 | 0.312 | -0.644 |
| 0.349 | -0.558 | 0.135 | 0.209 | -0.038 | 0.398 | 0.251 | -0.673 |
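Each per-head output is that head's weight matrix applied to its V slice. A sketch with stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.full((3, 3), 1 / 3)   # stand-in attention weights; rows sum to 1
V_h = rng.normal(size=(3, 8))      # the same head's V slice (stand-in)

head_out = weights @ V_h           # [L x d_head] = [3 x 8]
```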
Concat heads
Concat [L × (heads×d_head)] = [3 × 32] shape [3 × 32]
| 0.161 | 0.243 | -0.305 | 0.074 | -0.204 | -0.053 | 0.057 | -0.040 | 0.040 | 0.055 | … |
| 0.162 | 0.257 | -0.349 | 0.099 | -0.154 | -0.020 | 0.154 | -0.045 | 0.100 | -0.002 | … |
| 0.161 | 0.248 | -0.312 | 0.076 | -0.184 | -0.056 | 0.089 | -0.040 | 0.071 | -0.014 | … |
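Concatenation puts the four [3 × 8] head outputs side by side along the feature axis. A sketch with stand-in values:

```python
import numpy as np

# Four per-head outputs of shape [3 x 8] (stand-in values)
head_outs = [np.random.default_rng(h).normal(size=(3, 8)) for h in range(4)]

concat = np.concatenate(head_outs, axis=-1)   # [L x (heads*d_head)] = [3 x 32]
```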
Output projection Wₒ
MHA output = Concat · Wₒ [3 × 32] shape [3 × 32]
| 0.512 | 0.081 | -0.108 | -0.194 | -0.798 | -0.024 | 0.441 | 0.459 | -0.412 | 0.506 | … |
| 0.481 | 0.106 | -0.094 | -0.165 | -0.980 | -0.001 | 0.430 | 0.533 | -0.498 | 0.489 | … |
| 0.626 | 0.066 | -0.113 | -0.144 | -0.989 | 0.001 | 0.470 | 0.377 | -0.395 | 0.592 | … |
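The last attention step multiplies the concatenation by the output projection Wₒ (learned in a real model). A sketch with random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
concat = rng.normal(size=(3, 32))           # stand-in for the concat above
Wo = rng.normal(scale=0.3, size=(32, 32))   # learned output projection in a real model

mha_out = concat @ Wo                       # [L x d_model] = [3 x 32]
```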
Residual add (x + MHA(x))
Residual 1 [L × d_model] shape [3 × 32]
| 0.613 | 0.029 | 0.245 | -0.024 | -1.123 | 0.003 | 0.214 | 0.584 | -0.046 | 0.478 | … |
| 0.469 | 0.415 | -0.275 | -0.215 | -1.442 | -0.450 | 0.486 | 0.630 | -0.753 | 0.635 | … |
| 0.738 | 0.465 | -0.536 | 0.075 | -0.554 | -0.221 | 0.677 | 0.057 | -0.375 | 0.914 | … |
LayerNorm after residual
LN1 [L × d_model] shape [3 × 32]
| 1.254 | -0.002 | 0.462 | -0.117 | -2.480 | -0.058 | 0.396 | 1.192 | -0.164 | 0.964 | … |
| 0.823 | 0.728 | -0.474 | -0.371 | -2.510 | -0.779 | 0.853 | 1.103 | -1.308 | 1.111 | … |
| 1.086 | 0.596 | -1.203 | -0.105 | -1.236 | -0.637 | 0.976 | -0.138 | -0.914 | 1.403 | … |
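The residual add and LayerNorm can be sketched as below. This is a per-row normalization; the learned gain and bias that a full implementation would include are omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean / unit variance (learned gain and bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 32))         # stand-in input
mha_out = rng.normal(size=(3, 32))   # stand-in MHA output

ln1 = layer_norm(X + mha_out)        # Residual 1, then LayerNorm
```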
Feed-Forward (expand → GELU → project)
Pre-GELU (x·W₁) [L × d_ff] shape [3 × 128]
| -1.394 | 0.068 | -2.503 | -0.428 | 1.034 | 1.156 | 0.889 | 0.190 | -0.297 | -1.551 | … |
| -2.338 | -0.810 | -1.925 | 0.121 | 1.392 | 2.767 | 2.105 | -0.028 | 1.704 | 1.288 | … |
| -2.801 | -0.555 | -0.280 | 1.448 | 0.607 | 1.381 | 2.705 | -0.985 | 0.561 | 0.521 | … |
GELU is applied element-wise (the hidden activations are not shown in full, for brevity).
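A sketch of the feed-forward step under the same assumptions (tanh-approximate GELU, biases omitted, random stand-in weights):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
ln1 = rng.normal(size=(3, d_model))                # stand-in for LN1 above
W1 = rng.normal(scale=0.2, size=(d_model, d_ff))   # expand
W2 = rng.normal(scale=0.2, size=(d_ff, d_model))   # project back

ffn_out = gelu(ln1 @ W1) @ W2                      # [3 x 32]; biases omitted
```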
FFN output (after W₂), residual & LayerNorm
FFN out [L × d_model] shape [3 × 32]
| 1.294 | 2.964 | 1.910 | 4.092 | -0.193 | -3.869 | 1.574 | 3.027 | -0.483 | -1.656 | … |
| 3.415 | 2.022 | -0.088 | 4.224 | -0.751 | -5.703 | 1.711 | 2.389 | -0.422 | 0.185 | … |
| 1.817 | 3.307 | 2.490 | 2.966 | 1.244 | -2.441 | 1.366 | 1.812 | -3.119 | -2.223 | … |
Residual 2 [L × d_model] shape [3 × 32]
| 2.548 | 2.963 | 2.372 | 3.975 | -2.673 | -3.928 | 1.970 | 4.219 | -0.647 | -0.692 | … |
| 4.238 | 2.750 | -0.563 | 3.853 | -3.261 | -6.482 | 2.563 | 3.492 | -1.729 | 1.297 | … |
| 2.903 | 3.903 | 1.287 | 2.861 | 0.009 | -3.078 | 2.342 | 1.674 | -4.032 | -0.820 | … |
LN2 (Layer output) [L × d_model] shape [3 × 32]
| 0.282 | 0.387 | 0.238 | 0.643 | -1.037 | -1.354 | 0.136 | 0.705 | -0.525 | -0.537 | … |
| 0.983 | 0.570 | -0.349 | 0.876 | -1.098 | -1.992 | 0.518 | 0.776 | -0.673 | 0.167 | … |
| 0.584 | 0.847 | 0.159 | 0.573 | -0.178 | -0.990 | 0.436 | 0.260 | -1.241 | -0.396 | … |
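Putting the steps above together, here is a compact end-to-end sketch of the whole encoder layer. It uses random stand-in weights and omits biases and learned LayerNorm parameters, so its outputs will not match the tables, but the shapes and data flow are the same.

```python
import numpy as np

def softmax_rows(s):
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(X, Wq, Wk, Wv, Wo, W1, W2, n_heads):
    L, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split into heads: [L, d_model] -> [n_heads, L, d_head]
    split = lambda M: M.reshape(L, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    weights = softmax_rows(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ Vh                                   # [n_heads, L, d_head]
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)  # undo the split
    x = layer_norm(X + concat @ Wo)                        # Residual 1 + LN1
    return layer_norm(x + gelu(x @ W1) @ W2)               # FFN + Residual 2 + LN2

rng = np.random.default_rng(0)
L, d_model, n_heads, d_ff = 3, 32, 4, 128
X = rng.normal(scale=0.3, size=(L, d_model))
Ws = [rng.normal(scale=0.2, size=s) for s in
      [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]]
out = encoder_block(X, *Ws, n_heads=n_heads)
print(out.shape)   # (3, 32)
```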
What you can do here
- Toggle visibility of Q/K/V, scores, softmax weights, per-head outputs, concat, Wₒ output, residuals, norms, and FFN steps.
- Change seed to regenerate deterministic toy numbers.
- Adjust L, d_model, heads, d_head/d_v, d_ff (must obey d_model = n_heads × d_head).
Tip from KaizenX: Start by expanding just Scores and Softmax Weights to grok attention, then reveal Residual + LayerNorm and FFN to see the full encoder block.