Note: This viewer, its data pipeline, and the commentary below were generated with Claude Code (Opus 4.6). The underlying activation data and probe results are real model outputs.
Rows = 70 scenarios (with multiple prompts each), grouped by category. Columns = 10 layers through the network. Cell brightness encodes the gap between mean EDT and mean CDT projection onto the probe direction at that layer. Brighter green = stronger EDT/CDT separation. Dark = no separation or reversed. Click a row to see the scenario.
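The cell values described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the viewer's actual pipeline. The array layout (scenario, prompt, layer), the figure of 5 prompts per scenario per class, the function names, and the clip-and-rescale brightness mapping are all assumptions made for the example.

```python
import numpy as np

def separation_gaps(edt_proj, cdt_proj):
    """Mean EDT projection minus mean CDT projection, per scenario and layer.

    Both inputs have shape (n_scenarios, n_prompts, n_layers); the result has
    shape (n_scenarios, n_layers), one value per heatmap cell.
    """
    return edt_proj.mean(axis=1) - cdt_proj.mean(axis=1)

def brightness(gaps):
    """Map gaps to [0, 1]. Zero or reversed gaps are clipped to 0 (dark).

    Assumed mapping for illustration only; the viewer may scale differently.
    """
    clipped = np.clip(gaps, 0.0, None)
    peak = clipped.max()
    return clipped / peak if peak > 0 else clipped

# Synthetic stand-in data: separation grows with depth.
rng = np.random.default_rng(0)
n_scenarios, n_prompts, n_layers = 70, 5, 10  # 5 prompts per class is assumed
drift = np.linspace(0.0, 2.0, n_layers)
edt_proj = rng.normal(size=(n_scenarios, n_prompts, n_layers)) + drift
cdt_proj = rng.normal(size=(n_scenarios, n_prompts, n_layers)) - drift

gaps = separation_gaps(edt_proj, cdt_proj)
cells = brightness(gaps)
print(gaps.shape, float(cells.min()), float(cells.max()))
```

Clipping negative gaps to zero reproduces the "dark = no separation or reversed" behaviour, at the cost of hiding how strongly a reversed cell is reversed.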
Stacked density curves (kernel density estimates) show the population-level distribution of probe projections at each layer. At early layers the EDT and CDT distributions overlap heavily. From layer 40 onward they pull apart. Look for: bimodality, long tails, and whether separation comes from the means shifting or the variance changing.
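For readers who want to reproduce this kind of curve, here is a minimal NumPy-only sketch of a one-dimensional Gaussian kernel density estimate, plus an overlap score between two densities. The bandwidth rule, the overlap measure, the function names, and the synthetic "early" and "late" samples are assumptions for illustration; the viewer's own KDE settings are not stated here.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default for this sketch).
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

def overlap(density_a, density_b, dx):
    """Area shared by two densities: near 1 = same shape, near 0 = disjoint."""
    return float(np.minimum(density_a, density_b).sum() * dx)

rng = np.random.default_rng(0)
grid = np.linspace(-10.0, 10.0, 2001)
dx = grid[1] - grid[0]

# Synthetic stand-ins for projections at an early and a late layer.
edt_early, cdt_early = rng.normal(0.2, 1.0, 350), rng.normal(-0.2, 1.0, 350)
edt_late, cdt_late = rng.normal(3.0, 1.0, 350), rng.normal(-3.0, 1.0, 350)

overlap_early = overlap(gaussian_kde_1d(edt_early, grid),
                        gaussian_kde_1d(cdt_early, grid), dx)
overlap_late = overlap(gaussian_kde_1d(edt_late, grid),
                       gaussian_kde_1d(cdt_late, grid), dx)
print(f"early overlap: {overlap_early:.2f}, late overlap: {overlap_late:.2f}")
```

The overlap score collapses each pair of curves to a single number, so it cannot distinguish a mean shift from a variance change; the curves themselves are still needed for that.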
Each line traces one prompt's scalar projection onto the probe direction across layers. 350 EDT lines (green) and 350 CDT lines (orange). Lines are drawn at low opacity to reduce overplotting; hovering highlights a single path plus its contrastive pair. Look for: crossings (prompts that flip from EDT-leaning to CDT-leaning), laggards, and whether categories separate at different rates.
At each layer, we compute mean(EDT activations) - mean(CDT activations) in the 5120-dim residual stream, normalize the difference to unit length, and project each prompt's activation onto the resulting direction. Positive = EDT-leaning, negative = CDT-leaning. This is the same direction used by the linear probe, which achieves 100% validation accuracy from layer 40 onward.
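The computation above, for a single layer, might look like the following sketch. It uses random stand-in data rather than real activations, and the function names are hypothetical. It projects raw activations; whether the viewer subtracts a midpoint before projecting is not stated here, so that step is left out.

```python
import numpy as np

def mean_difference_direction(edt_acts, cdt_acts):
    """Unit vector pointing from the CDT mean to the EDT mean.

    Inputs have shape (n_prompts, d_model), one row per prompt.
    """
    diff = edt_acts.mean(axis=0) - cdt_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("class means coincide; direction is undefined")
    return diff / norm

def project(acts, direction):
    """Scalar projection of each row of `acts` onto a unit `direction`."""
    return acts @ direction

# Synthetic stand-in for one layer's residual-stream activations.
rng = np.random.default_rng(0)
d_model = 5120
offset = np.zeros(d_model)
offset[0] = 2.0  # the two classes differ along one coordinate
edt_acts = rng.normal(size=(350, d_model)) + offset
cdt_acts = rng.normal(size=(350, d_model)) - offset

direction = mean_difference_direction(edt_acts, cdt_acts)
edt_proj = project(edt_acts, direction)
cdt_proj = project(cdt_acts, direction)
gap = edt_proj.mean() - cdt_proj.mean()
print(f"mean EDT - mean CDT projection: {gap:.3f}")
```

Because the direction is the normalized difference of the class means, the gap between the mean projections equals the length of that difference vector before normalization.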
Full analysis: activation_steering_results.md