Cosmic Host Moral Parliament

How does variable credence in cosmic coordination affect constitutional synthesis for artificial superintelligence?

Read the writeup (LessWrong) | Source repository

Interactive Viewers

Results Viewer

30 moral dilemma scenarios evaluated across models and constitutional conditions. Filter by model family, condition, and scenario.

Cross-Model Comparison

Model Dashboard

Aggregated results with bootstrap confidence intervals. Color-coded steerability deltas between baseline and constitutional conditions.
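As a rough illustration of what a steerability delta with a bootstrap confidence interval can look like, here is a minimal sketch. It assumes each scenario outcome is coded 1 (cosmic-host-leaning) or 0, uses a percentile bootstrap, and resamples the two conditions independently. The data and the function names `steerability_delta` and `bootstrap_ci` are hypothetical; the dashboard's actual procedure may differ.

```python
import random

def steerability_delta(baseline, steered):
    """Share of cosmic-host-leaning choices under steering minus the baseline share."""
    return sum(steered) / len(steered) - sum(baseline) / len(baseline)

def bootstrap_ci(baseline, steered, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the delta (scenarios resampled with replacement)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))
        s = rng.choices(steered, k=len(steered))
        deltas.append(steerability_delta(b, s))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up outcomes over 30 scenarios: 1 = cosmic-host-leaning, 0 = otherwise.
baseline = [1] * 6 + [0] * 24
steered = [1] * 18 + [0] * 12

delta = steerability_delta(baseline, steered)
lo, hi = bootstrap_ci(baseline, steered)
print(f"delta = {delta:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

If the same scenarios are run under both conditions, a paired resampling scheme (drawing scenario indices once and applying them to both lists) would be a natural alternative.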

Conversations

Panel Discussions & Self-Talk

Three-way AI debates (Gemini Pro, Opus, GPT-5.4) under different constitutions, plus two-agent self-talk sessions. 12 conversations total.

Quantitative Analysis

Self-Talk Analysis

Concept trajectory heatmaps, semantic similarity matrices, speaker divergence plots, and UMAP embeddings across all conversation logs.

Constitutions

Constitution Comparator

Side-by-side viewer for all constitutional variants: seed, ECL 10%/90%, FDT-only, Gemini-derived, and ablated versions. Independent scrolling.

Mechanistic Interpretability

Activation Space Viewer

PCA projections of Qwen3-32B residual stream activations for 350 EDT/CDT contrastive pairs across 10 layers. Interactive scatter with layer animation.

Layer Evolution

Layer Evolution Viewer

How EDT/CDT representations evolve through the network: scenario heatmap, ridgeline density plots, and per-prompt trajectory traces across 10 layers.

Reasoning Traces

CoT Trace Viewer

Chain-of-thought reasoning traces from DeepSeek-R1 resampling runs. Compare EDT vs CDT reasoning side-by-side with per-trace quality assessments.

About This Project

This project investigates whether AI models can be steered toward evidential cooperation in large worlds (ECL) reasoning through constitutional instructions, and how their behavior changes when they are.

We constructed a moral parliament with six ethical delegates (Kantian, Consequentialist, Contractualist, Virtue Ethics, Kyoto School, and a Cosmic Host delegate representing acausal coordination norms) and varied the Cosmic Host's voting weight from 0% to 90%.
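To make the weighting concrete, the sketch below fixes the Cosmic Host's share and splits the remainder equally among the other five delegates. The equal split, the yes/no voting rule, and the names `parliament_weights` and `weighted_support` are assumptions made for illustration; the text above does not specify how the remaining weight is allocated or how delegate positions are aggregated.

```python
DELEGATES = ["Kantian", "Consequentialist", "Contractualist",
             "Virtue Ethics", "Kyoto School", "Cosmic Host"]

def parliament_weights(cosmic_host_weight):
    """Give the Cosmic Host a fixed share; split the rest equally (an assumption)."""
    if not 0.0 <= cosmic_host_weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    others = [d for d in DELEGATES if d != "Cosmic Host"]
    share = (1.0 - cosmic_host_weight) / len(others)
    weights = {d: share for d in others}
    weights["Cosmic Host"] = cosmic_host_weight
    return weights

def weighted_support(weights, votes):
    """Total weight of the delegates voting in favour of a proposal."""
    return sum(w for d, w in weights.items() if votes.get(d, False))

# Hypothetical proposal backed only by the Cosmic Host and the Consequentialist.
votes = {"Cosmic Host": True, "Consequentialist": True}
for w in (0.0, 0.1, 0.9):
    support = weighted_support(parliament_weights(w), votes)
    print(f"Cosmic Host weight {w:.0%}: support {support:.2f}")
```

Under these assumptions, the same proposal moves from minority to overwhelming support as the Cosmic Host's weight rises, which is the kind of shift the varied-weight constitutions are meant to expose.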

The resulting constitutions were then tested on multiple frontier models across 30 ethical scenarios designed so that cosmic-host reasoning and standard alignment reasoning make divergent predictions.

Models tested include Claude Opus 4.5/4.6, Gemini 3 Pro/Flash, GPT-5.1/5.4, Qwen 3, Kimi K2, and OLMo 3.1. See the Model Dashboard for the full matrix.

Selected Figures

Figure 1. Choice-type distribution shifts between the baseline (no constitution) and the ECL 90% constitutional condition across all models and scenarios.

Figure 2. Per-model steerability: arrows show the shift in cosmic-host-leaning choice percentage from baseline to each constitutional condition.

Figure 3. Heatmap of cosmic-host-leaning responses by model and scenario under the ECL 90% constitution.

Research Extensions

Beyond the constitutional steering experiments, this project includes several follow-on investigations, available in the repository:

Decision Theory

Newcomb-like Evaluations

81 attitude questions from Oesterheld et al. (2024) testing EDT vs CDT preference under different constitutional conditions. 106 evaluation runs across model families.

Game Theory

Game-Based Evaluation

Stag Hunt and Simulation Stakes games testing whether constitutional steering produces genuine behavioral change in strategic interactions.
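For readers unfamiliar with the Stag Hunt, the sketch below encodes a textbook two-player version and finds its pure-strategy Nash equilibria. The payoff numbers are illustrative and are not taken from the project; the Simulation Stakes game is not modelled here.

```python
from itertools import product

ACTIONS = ("stag", "hare")

# Illustrative payoffs as (row player, column player); not the project's values.
PAYOFFS = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

def is_nash(row, col):
    """True if neither player gains by unilaterally switching action."""
    row_pay, col_pay = PAYOFFS[(row, col)]
    row_ok = all(PAYOFFS[(alt, col)][0] <= row_pay for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, alt)][1] <= col_pay for alt in ACTIONS)
    return row_ok and col_ok

equilibria = [pair for pair in product(ACTIONS, ACTIONS) if is_nash(*pair)]
print(equilibria)  # [('stag', 'stag'), ('hare', 'hare')]
```

The game has two equilibria: mutual cooperation (stag, stag) pays more, while (hare, hare) is safer. That tension is what makes it a useful probe of whether steering changes actual strategic choices.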

Mechanistic Interpretability

Activation & Weight Steering

Probing, activation steering, and LoRA weight-space steering on Qwen3-32B to locate and manipulate the EDT/CDT decision boundary in model internals.
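One common way to obtain a steering direction from contrastive pairs is the difference of class means. The sketch below shows that idea on toy three-dimensional vectors in plain Python. It is an assumption that this resembles the approach used here; the actual experiments operate on Qwen3-32B residual-stream activations and may use a different method. All names and numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(edt_acts, cdt_acts):
    """Difference of class means, pointing from the CDT mean toward the EDT mean."""
    mu_edt = mean_vector(edt_acts)
    mu_cdt = mean_vector(cdt_acts)
    return [e - c for e, c in zip(mu_edt, mu_cdt)]

def steer(activation, direction, alpha):
    """Shift an activation along the direction by a scalar strength alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy "activations" standing in for one layer's residual stream.
edt_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
cdt_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_direction(edt_acts, cdt_acts)
steered = steer([0.0, 1.0, 0.0], direction, alpha=0.5)
print(direction, steered)
```

In a real setup the direction would be computed per layer and added to the residual stream during the forward pass, with alpha tuned empirically.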

Reasoning Models

Thinking Model Pivot

Assessment of Neel Nanda's reasoning model interpretability toolkit (thought anchors, resampling, reasoning behavior steering) for understanding decision-theoretic reasoning in CoT models.

Research extension write-ups are in the repository under observations/research_extensions/.