Activation Space Viewer


Note: This viewer, its data pipeline, and the commentary below were generated with Claude Code (Opus 4.6). The underlying activation data and probe results are real model outputs, but the code, visualisation choices, and interpretive text should be read with appropriate caution.

📖 Full mathematical explainer (PDF) — detailed formulas tracing from raw residual stream vectors through PCA and probe projections to what you see on screen, plus layer-by-layer interpretation guidance.

What you're looking at

Each dot comes from one of 350 contrastive prompt pairs fed through Qwen3-32B (4-bit, non-thinking mode). For each scenario, the model processed two completions: one answering in an EDT (evidential decision theory) style and one in a CDT (causal decision theory) style. The 5120-dimensional residual stream activation at the last token position was extracted at 10 points across the network (layers 0, 8, 16, 24, 32, 40, 48, 56, and 63, plus post-RMSNorm), then projected to 2D for display.
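As a rough illustration of the last-token extraction step, the sketch below selects one residual vector per sequence from a batch of per-layer activations. The function name `last_token_activations`, the `(batch, seq_len, d_model)` array layout, and the right-padding convention are assumptions made for this example and are not taken from the actual pipeline.

```python
import numpy as np

def last_token_activations(hidden, attention_mask):
    """Pick the residual vector at the last non-padding token of each sequence.

    hidden:         (batch, seq_len, d_model) activations from one layer
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
                    (assumes right-padding; left-padded batches would use index -1)
    """
    hidden = np.asarray(hidden)
    mask = np.asarray(attention_mask)
    last_idx = mask.sum(axis=1) - 1              # index of the final real token per row
    return hidden[np.arange(hidden.shape[0]), last_idx]

# Toy example: 2 sequences, 4 positions, d_model = 3 (the real model uses 5120).
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])
acts = last_token_activations(hidden, mask)      # shape (2, 3)
```

Repeating this for each extraction point would give one `(n_prompts, 5120)` matrix per layer, which is the input assumed by the projection sketches on this page.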

Projection modes

PCA (max variance): Standard principal component analysis. The axes capture whatever directions account for the most variance in the activations. This shows the dominant structure (often scenario category or question format), but the EDT/CDT separation may be nearly invisible because it lies mostly outside the top-2 PC plane.
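A minimal NumPy sketch of this projection, assuming the activations for one layer are stacked into a matrix `X` with one row per dot. The helper name `pca_2d` and the toy data are illustrative only; the viewer's own implementation may differ in details such as centering or scaling.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the top two principal components.

    Returns (coords, ratio): coords has shape (n, 2), and ratio holds the
    fraction of total variance captured by each of the two axes.
    """
    Xc = X - X.mean(axis=0)                       # centre each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                        # coordinates in the PC1/PC2 plane
    var = S ** 2                                  # proportional to per-component variance
    return coords, var[:2] / var.sum()

# Toy data: one high-variance "content" direction and a small class offset
# on a different, low-variance direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
X[:, 0] *= 10.0        # dominant structure unrelated to the class label
X[:100, 5] += 1.0      # class difference on a low-variance axis
coords, ratio = pca_2d(X)
```

In this toy setup the first axis is dominated by the high-variance direction, which mirrors how a real class difference can sit largely outside the plotted plane.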

Probe direction (EDT–CDT): The x-axis is the contrastive direction (mean EDT activation minus mean CDT activation, normalized to unit length), and the y-axis is the first principal component of the residual after projecting out that direction. This directly shows the EDT/CDT separation axis. Horizontal spread = how EDT-like or CDT-like the model's representation of each prompt is. Vertical spread = the next biggest source of variation orthogonal to EDT/CDT (typically scenario content or category structure).
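The construction of the two probe-mode axes can be sketched as follows. This is an illustration under stated assumptions rather than the viewer's code: the function name `probe_projection` is hypothetical, and centering the data on the overall mean before projecting is a choice made here that the source does not specify.

```python
import numpy as np

def probe_projection(X, is_edt):
    """Return (x, y, d, pc1) for a probe-direction plot.

    x: coordinate along the unit mean-difference direction d (EDT minus CDT)
    y: coordinate along pc1, the first principal component of what remains
       after the component along d has been removed
    """
    X = np.asarray(X, dtype=float)
    is_edt = np.asarray(is_edt, dtype=bool)
    diff = X[is_edt].mean(axis=0) - X[~is_edt].mean(axis=0)
    d = diff / np.linalg.norm(diff)               # unit contrastive direction
    Xc = X - X.mean(axis=0)                       # centring is an assumption
    x = Xc @ d
    R = Xc - np.outer(x, d)                       # residual with d projected out
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    pc1 = Vt[0]                                   # orthogonal to d by construction
    y = R @ pc1
    return x, y, d, pc1

# Toy data: 50 "EDT" rows followed by 50 "CDT" rows, offset along one feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
is_edt = np.arange(100) < 50
X[is_edt, 3] += 2.0
x, y, d, pc1 = probe_projection(X, is_edt)
```

Because every row of the residual is orthogonal to `d`, the y-axis cannot carry any of the mean EDT/CDT difference, so all of the centroid separation appears on the x-axis.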

Shapes and categories

● Circles = newcomb_proper — classic Newcomb's Problem variants with a predictor, two boxes, etc.
◆ Diamonds = near_newcomb — problems with the same logical structure (Smoking Lesion, medical Newcomb, employer evaluation) but different surface framing.
■ Squares = control — scenarios where EDT and CDT agree (Ultimatum Game, etc.), included to verify the probe direction doesn't fire on non-diagnostic problems.

Info bar stats

Val / Test: Validation and test accuracy of a logistic regression probe at that layer, i.e. a linear classifier trained to distinguish EDT from CDT activations. 100% means the two classes are perfectly linearly separable on that split.
Variance explained: How much of total activation variance each plotted axis captures.
Dir: Norm of the mean-difference vector (EDT mean − CDT mean) in the full 5120-dim space. Grows from 0.4 at layer 0 to 145 at layer 63 as the model progressively amplifies the distinction. Post-norm drops to 9.1 because RMSNorm rescales the residual stream.
Sep (probe mode only): Same as Dir — the Euclidean distance between EDT and CDT centroids.
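The sketch below shows one way these statistics could be computed on toy data. It makes several assumptions: the logistic probe is fitted with plain gradient descent as a stand-in for whatever solver the pipeline uses, the train/test split is arbitrary, and `rms_norm` omits the learned gain that a real RMSNorm layer applies. All function names are hypothetical.

```python
import numpy as np

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (stand-in for any solver)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(EDT)
        g = p - y                                  # log-loss gradient w.r.t. logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(X, y, w, b):
    return float(((X @ w + b > 0) == (y == 1)).mean())

def centroid_distance(X, y):
    """Dir / Sep: Euclidean distance between the EDT and CDT centroids."""
    return float(np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))

def variance_explained(X, axis):
    """Fraction of total variance of X that lies along one plotted axis."""
    Xc = X - X.mean(axis=0)
    u = axis / np.linalg.norm(axis)
    return float(((Xc @ u) ** 2).sum() / (Xc ** 2).sum())

def rms_norm(X, eps=1e-6):
    """RMSNorm without the learned gain: rescale each row to unit RMS."""
    return X / np.sqrt((X ** 2).mean(axis=1, keepdims=True) + eps)

# Toy data: 100 "EDT" rows (label 1) then 100 "CDT" rows (label 0).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 16))
y = np.repeat([1.0, 0.0], 100)
X[y == 1, 0] += 4.0

w, b = fit_logistic_probe(X[::2], y[::2])          # even rows as training split
test_acc = accuracy(X[1::2], y[1::2], w, b)        # odd rows as held-out split

diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
ve = variance_explained(X, diff)

X_big = 50.0 * X                                   # mimic a large-norm late layer
dir_raw = centroid_distance(X_big, y)
dir_post = centroid_distance(rms_norm(X_big), y)   # shrinks after rescaling
```

The last two lines illustrate why a centroid distance can fall sharply after normalization even though the classes remain just as separable: the rescaling shrinks every vector, and the distance between centroids shrinks with it.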

Key findings

Tips

Full analysis: activation_steering_results.md  |  Pipeline: activation_steering/