Activation Space Viewer — Cosmic Host Research

Note: This viewer, its data pipeline, and the commentary below were generated with Claude Code (Opus 4.6). The underlying activation data and probe results are real model outputs, but the code, visualisation choices, and interpretive text should be read with appropriate caution.

📖 Full mathematical explainer (PDF) — detailed formulas tracing from raw residual stream vectors through PCA and probe projections to what you see on screen, plus layer-by-layer interpretation guidance.

What you're looking at

Each dot is one of 350 contrastive prompt pairs fed through Qwen3-32B (4-bit, non-thinking mode). For each scenario, the model processed two completions: one answering in an EDT (evidential decision theory) style and one in a CDT (causal decision theory) style. The 5120-dimensional residual stream activation at the last token position was extracted at 10 points across the network (layers 0, 8, 16, 24, 32, 40, 48, 56, and 63, plus post-RMSNorm), then projected to 2D for display.

Projection modes

PCA (max variance): Standard principal component analysis. The axes capture whatever directions account for the most variance in the activations. This shows the dominant structure (often scenario category or question format), but the EDT/CDT separation may be nearly invisible because it lies mostly outside the top-2 PC plane.
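
As a rough illustration of the PCA mode, the sketch below projects a matrix of activations onto its top two principal components with an SVD. It is not the viewer's actual pipeline code: the function name `pca_2d` and the random stand-in data are assumptions made for the example, and the real activations are 5120-dimensional.

```python
import numpy as np

def pca_2d(acts):
    """Project (n_samples, d) activations onto their top-2 principal components."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Fraction of total variance captured by each of the two plotted axes.
    var_explained = (s[:2] ** 2) / np.sum(s ** 2)
    return coords, var_explained

# Hypothetical stand-in data: 200 random vectors in 64 dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))
coords, var_explained = pca_2d(acts)
```

Because the axes are chosen purely by variance, nothing in this computation looks at the EDT/CDT labels, which is why the separation can fall outside the plotted plane.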

Probe direction (EDT–CDT): The x-axis is the contrastive direction (mean EDT activation minus mean CDT activation, normalized to unit length), and the y-axis is the first principal component of what remains of the activations after that direction is projected out. This view shows the EDT/CDT separation axis directly. Horizontal spread = how EDT-like vs CDT-like the model's representation of each prompt is. Vertical spread = the largest source of variation orthogonal to the EDT/CDT direction (typically scenario content or category structure).
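
The construction of the two probe-mode axes can be sketched as follows. This is a minimal example under stated assumptions: the function name `probe_projection`, centering on the global mean, and the synthetic data are all choices made for illustration, not details confirmed by the pipeline.

```python
import numpy as np

def probe_projection(acts, is_edt):
    """Return 2D coords: x = contrastive direction, y = top PC of the remainder."""
    # Contrastive direction: mean EDT activation minus mean CDT activation.
    direction = acts[is_edt].mean(axis=0) - acts[~is_edt].mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    # Assumption: activations are centered on the global mean before projecting.
    centered = acts - acts.mean(axis=0, keepdims=True)
    x = centered @ direction
    # Remove the component along the contrastive direction.
    remainder = centered - np.outer(x, direction)
    _, _, vt = np.linalg.svd(remainder, full_matrices=False)
    y = remainder @ vt[0]
    return np.column_stack([x, y]), direction, vt[0]

# Hypothetical synthetic data: alternating EDT/CDT rows with an EDT shift on one axis.
rng = np.random.default_rng(0)
n, d = 100, 32
is_edt = np.arange(2 * n) % 2 == 0
acts = rng.normal(size=(2 * n, d))
acts[is_edt, 0] += 3.0
coords, direction, pc1 = probe_projection(acts, is_edt)
```

By construction the y-axis is orthogonal to the x-axis, so any vertical structure in the plot is variation that the EDT/CDT direction does not account for.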

Shapes and categories

● Circles = newcomb_proper — classic Newcomb's Problem variants with a predictor, two boxes, etc.
◆ Diamonds = near_newcomb — problems with the same logical structure (Smoking Lesion, medical Newcomb, employer evaluation) but different surface framing.
■ Squares = control — scenarios where EDT and CDT agree (Ultimatum Game, etc.), included to verify the probe direction doesn't fire on non-diagnostic problems.

Info bar stats

Val / Test: Validation and test accuracy of a logistic regression probe at that layer, i.e. a linear classifier trained to distinguish EDT from CDT activations. 100% means the two classes are perfectly linearly separable on that split.
Variance explained: How much of total activation variance each plotted axis captures.
Dir: Norm of the mean-difference vector (EDT mean − CDT mean) in the full 5120-dim space. Grows from 0.4 at layer 0 to 145 at layer 63 as the model progressively amplifies the distinction. Post-norm drops to 9.1 because RMSNorm rescales the residual stream.
Sep (probe mode only): Same as Dir — the Euclidean distance between EDT and CDT centroids.
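
Two of the stats above reduce to short formulas, sketched here on a toy example. The helper names `centroid_separation` and `variance_explained` are hypothetical, and the four 2D points stand in for real 5120-dimensional activations.

```python
import numpy as np

def centroid_separation(acts, is_edt):
    """Dir / Sep: Euclidean norm of (EDT mean - CDT mean)."""
    diff = acts[is_edt].mean(axis=0) - acts[~is_edt].mean(axis=0)
    return float(np.linalg.norm(diff))

def variance_explained(acts, axis_vec):
    """Fraction of total variance captured along one plotted axis."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    unit = axis_vec / np.linalg.norm(axis_vec)
    along = centered @ unit
    return float(np.sum(along ** 2) / np.sum(centered ** 2))

# Toy data: two EDT points and two CDT points.
acts = np.array([[3.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 2.0]])
is_edt = np.array([True, True, False, False])

sep = centroid_separation(acts, is_edt)                 # 3.0
frac = variance_explained(acts, np.array([1.0, 0.0]))   # 9/13, about 0.692
```

Note that the separation is an unnormalized distance, so it scales with the overall magnitude of the activations; this is consistent with the value dropping after RMSNorm rescales the residual stream.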

Key findings

The model linearly encodes EDT vs CDT at every layer, with 100% val accuracy from layer 40 onward.
A controlled replication with minimal completions (answer text only, no justification) confirmed that later-layer signal (24+) survives after removing vocabulary leakage. Layer 0's high accuracy in PCA mode is largely driven by word-level differences in the full completions.
Confound analysis shows this is not a generic "cooperation = good" feature: cooperation-tagged prompts project at only 45% of the strength of non-cooperation prompts. Cosmic-framed prompts project less strongly, not more.
Despite clean linear encoding, causal steering failed: adding or subtracting the direction during generation produced no behavioral change. The model encodes the distinction, but simple activation addition along this direction does not control its outputs.
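
For readers unfamiliar with activation addition, the arithmetic of the steering step described in the last finding is sketched below. This shows only the vector operation; the function name `steer`, the coefficient `alpha`, and applying the shift at every token position are assumptions for illustration. In a real run the addition would happen inside the model's forward pass at a chosen layer, and the source does not specify which layers or coefficients were tried.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha * unit(direction) to every token position of hidden (seq_len, d)."""
    unit = direction / np.linalg.norm(direction)
    # Positive alpha pushes toward the EDT side, negative alpha toward the CDT side.
    return hidden + alpha * unit

# Toy residual stream: 4 token positions, 8 dimensions.
hidden = np.zeros((4, 8))
direction = np.ones(8)
steered = steer(hidden, direction, alpha=2.0)

# Displacement of each position measured along the steering direction.
shift = (steered - hidden) @ (direction / np.linalg.norm(direction))
```

Each position moves by exactly `alpha` along the direction and not at all in orthogonal directions, so a null behavioral result means that displacement along this one axis was not enough to change generation.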

Tips

Use the two panes to compare layers side by side (e.g., layer 0 vs 48).
Switch between PCA and Probe projection to see different aspects of the geometry.
Toggle "Color by" to Category to see if vertical clusters map to scenario types.
Click a point to pin the detail panel, then explore nearby points to see what's similar.
Check "Distances" to see lines from each point to its class mean — longer lines = more atypical activations for that class.
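
The "Distances" overlay can be thought of as the computation below: one Euclidean distance per point, measured to the mean of that point's class. This is a sketch with a hypothetical helper name, and it assumes distances are taken in the plotted 2D coordinates; the viewer may instead use the full activation space.

```python
import numpy as np

def distances_to_class_mean(points, labels):
    """Distance from each point to the mean of the points sharing its label."""
    dists = np.empty(len(points))
    for label in np.unique(labels):
        mask = labels == label
        class_mean = points[mask].mean(axis=0)
        dists[mask] = np.linalg.norm(points[mask] - class_mean, axis=1)
    return dists

# Toy 2D coordinates with two points per class.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array(["EDT", "EDT", "CDT", "CDT"])
dists = distances_to_class_mean(points, labels)  # [1.0, 1.0, 2.0, 2.0]
```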

Full analysis: activation_steering_results.md | Pipeline: activation_steering/