experiment plan · working document

hyperbolic geometry of reasoning: the full experiment plan

tl;dr: this is the plan for extending my GRaM workshop result into a full paper. the workshop version found that a hyperbolic probe recovers reasoning-depth structure in a reasoning model's hidden states ($\rho \approx 0.97$) at a layer where the Euclidean probe degrades sharply ($\rho \approx 0.49$). the evidence is limited: 7B models, one dataset, one probe geometry, 4-bit activations. the plan below tests how far that result generalises, across scale (7B → 14B → 32B, with a 70B check), across reasoning and standard models, across harder tasks (ProofWriter proof graphs, FOLIO, math), and across probe geometries (Poincaré, Lorentz, mixed-curvature). each branch states the runs, the rough compute, and the result that would count against the claim. the total works out to about $4,800 on rented H100s.

i am writing the plan out in full so the reasoning is legible. each branch below states what it tests, the models and datasets it runs on, the rough compute it needs, and the result that would count against the claim. the later branches depend on what the earlier ones find, so the plan will be updated as results come in.

where this starts

the workshop paper asked one question: if chain-of-thought reasoning has hierarchical structure (premises supporting conclusions, conclusions depending on several premises), can a hyperbolic probe recover that structure from a model's hidden states more accurately than an ordinary Euclidean probe? i tested it on two 7B models from the same Qwen2.5 family with different training, one reasoning-specialised (DeepSeek-R1-Distill-Qwen-7B) and one standard instruction-tuned (Qwen2.5-7B-Instruct), on PrOntoQA, which gives linear reasoning chains of depth 1 to 5. both models are 28 layers with a 3584-dim hidden state, so the comparison isolates the training regime from the architecture.

the probe maps each layer's activations into either Euclidean space or the Poincaré ball ($d=5$, curvature $c=0.5$) and learns to predict pairwise reasoning-depth distances, trained with a stress-normalised objective:

$$\mathcal{L} = \frac{\sum_{i \neq j}\big(d_{\text{pred}}(i,j) - d_{\text{true}}(i,j)\big)^2}{\sum_{i \neq j} d_{\text{true}}(i,j)^2}$$
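
as a concrete reading of that objective, here is a minimal sketch in plain Python. it assumes the common convention in which the Poincaré ball with parameter $c$ (curvature $-c$) has distance $\frac{1}{\sqrt{c}}\,\mathrm{arcosh}\big(1 + \frac{2c\lVert x-y\rVert^2}{(1-c\lVert x\rVert^2)(1-c\lVert y\rVert^2)}\big)$. the function names are illustrative, and the learned map from activations into the probe space is left out: the sketch only scores points that are already embedded.

```python
import math

def poincare_distance(x, y, c=0.5):
    """Geodesic distance in the Poincare ball (assumed convention: curvature -c, c * ||x||^2 < 1)."""
    sq = lambda v: sum(a * a for a in v)
    diff = sq([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

def euclidean_distance(x, y):
    """Ordinary Euclidean distance, used by the baseline probe."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def stress_loss(points, d_true, dist_fn):
    """Normalised stress: sum_{i != j} (d_pred - d_true)^2 / sum_{i != j} d_true^2."""
    n = len(points)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d_pred = dist_fn(points[i], points[j])
            num += (d_pred - d_true[i][j]) ** 2
            den += d_true[i][j] ** 2
    return num / den
```

the same `stress_loss` is called with either distance function, so the only thing that differs between the two probes in this sketch is the geometry used for `d_pred`.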

two measurements came out of that, and one piece of supporting evidence for why:

final-layer (L27) probe quality on PrOntoQA, 5-fold cross-validation. Spearman $\rho$ higher is better; distortion (mean absolute distance error) lower is better. the reasoning model's Euclidean probe is the one cell that degrades.
model	Euclidean $\rho$	hyperbolic $\rho$	Euclidean dist.	hyperbolic dist.
DeepSeek-R1 7B (reasoning)	0.488	0.967	0.562	0.090
Qwen2.5 7B (standard)	0.955	0.967	0.139	0.104
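
for reference, the two metrics in the table can be computed as in the sketch below. it is stdlib-only with illustrative function names. distortion follows the caption's definition (mean absolute distance error), and Spearman $\rho$ is taken as the Pearson correlation of average ranks; whether the paper's implementation handles ties this way is an assumption.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman_rho(pred, true):
    """Spearman rank correlation between predicted and true pairwise distances."""
    return pearson(average_ranks(pred), average_ranks(true))

def distortion(pred, true):
    """Mean absolute error between predicted and true pairwise distances."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)
```

both functions take flat lists of pairwise distances, one entry per pair $(i, j)$, in the same order.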

one. at the final layer of the reasoning model the Euclidean probe drops to $\rho = 0.488$ (distortion about $6\times$ the hyperbolic probe), while the hyperbolic probe stays at $\rho = 0.967$. the standard model shows no such drop. same architecture, same probing task, same layer; the difference is the training. across layers, the Euclidean probe on the reasoning model is stable from L8 to L21, starts to degrade at L23, partly recovers at L25, then drops to $0.488$ at L27. the hyperbolic probe stays above $\rho = 0.90$ throughout.

two. the hierarchical signal at that final layer is concentrated in "thinking tokens" (reasoning markers like Wait, So, Therefore, about 6.7% of the sequence). probing those gives $\rho = 0.871$, against $0.468$ for the last token and $0.390$ for uniform pooling over all tokens.
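
the three token-selection strategies compared there can be sketched as follows. the marker set holds only the three examples named above (the full list used in the paper is not reproduced here), and the matching rule (strip whitespace and trailing punctuation, lowercase) is an assumption made for the example.

```python
# only the markers named in the text; the real marker list is an assumption
THINKING_MARKERS = {"wait", "so", "therefore"}

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def is_thinking_token(token):
    """Hypothetical matching rule: ignore surrounding whitespace, punctuation, and case."""
    return token.strip().strip(",.").lower() in THINKING_MARKERS

def pool(tokens, hidden, strategy):
    """Reduce per-token hidden states (one vector per token) to one vector per sequence."""
    if strategy == "all":
        return mean_pool(hidden)
    if strategy == "last":
        return list(hidden[-1])
    if strategy == "thinking":
        picked = [h for t, h in zip(tokens, hidden) if is_thinking_token(t)]
        return mean_pool(picked) if picked else None  # None: no marker in this sequence
    raise ValueError(f"unknown strategy: {strategy}")
```

sequences without any marker return `None` under the thinking-token strategy, so the caller has to decide whether to drop them or fall back to another pooling.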

why, probably. the reasoning model's final layer compresses sharply. from L25 to L27 the activation norm drops 41%, norm variance rises 214%, the participation ratio (effective dimensionality) falls 43%, and isotropy rises about 20-fold. the standard model compresses too, but less (participation ratio down 29% against 43%), which is consistent with its Euclidean probe holding. reduced effective dimensionality and lower directional spread are conditions under which Euclidean distances lose resolution, while the exponential volume of hyperbolic space retains room.
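
for readers unfamiliar with the participation ratio, a minimal sketch follows. it uses the usual definition $\mathrm{PR} = (\sum_k \lambda_k)^2 / \sum_k \lambda_k^2$ over the eigenvalues of the activation covariance, computed here as $\mathrm{tr}(C)^2 / \mathrm{tr}(C^2)$ so no eigendecomposition is needed. that this matches the paper's exact estimator is an assumption, and the pure-Python loops are only practical for toy dimensions.

```python
def participation_ratio(rows):
    """PR = tr(C)^2 / tr(C^2) for the sample covariance C of `rows` (samples x dims)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    trace = sum(cov[k][k] for k in range(d))
    # C is symmetric, so tr(C^2) equals the sum of its squared entries
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq

def relative_change(before, after):
    """Signed fractional change, e.g. -0.43 for a 43% fall."""
    return (after - before) / before
```

the value ranges from 1 (all variance on one direction) to the number of dimensions (variance spread evenly), which is why a falling participation ratio reads as compression.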

this is a small result, and limited in four ways a reviewer will point out: 7B only, one dataset, the Poincaré ball at one curvature, and 4-bit activations. each branch below addresses one of those.

the question the whole plan is built around

is the final-layer Euclidean degradation, with the hyperbolic probe holding, a general property of how reasoning-trained models represent hierarchy? or is it specific to the one 7B pair, the one dataset, and the quantised activations i started with?

how to read the branches

each branch has the same shape: what it tests, the runs it needs, a rough compute figure, the result expected if the effect is real, and the result that would count against it. i state the failure conditions on purpose. a negative answer is useful to me as long as it comes from a real test, and the failure conditions are how i keep that honest. the branches feed each other: branch 1 produces the activations that branches 2, 4 and 5 reuse, which is why the total compute is lower than the sum of what each branch would cost if run on its own.

branch 1 · model scale (planned)

does the effect hold at larger scale?

the first and cheapest objection is that 7B is small and the effect could weaken or strengthen at scale. so i run the same pipeline up the size ladder with the backbone held fixed, which isolates scale from architecture.

reasoning side, fixed backbone: DeepSeek-R1-Distill-Qwen 7B → 14B → 32B, all distilled from the Qwen2.5 family, so the comparison up the ladder is controlled (roughly 28, 48 and 64 layers; hidden size 3584 → 5120).
matched standard side: Qwen2.5-7B / 14B / 32B-Instruct, the instruction-tuned siblings of those distilled models.
layer sweep scaled to each depth: early, middle, late, and a denser set near the final layer where the 7B effect appears.
the same metrics as the paper: Spearman $\rho$, distortion, and the layer-wise compression statistics, so the larger models are directly comparable to the 7B numbers above.
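
one way to generate a layer sweep of that shape is sketched below. the function name and the defaults (four evenly spaced coarse layers, the last six layers probed densely) are hypothetical choices for illustration, and the layer counts are the approximate ones given above.

```python
def layer_sweep(n_layers, n_coarse=4, dense_tail=6):
    """Layer indices to probe: evenly spaced coarse layers plus every layer in the tail.

    n_coarse and dense_tail are illustrative defaults, not values from the paper.
    """
    last = n_layers - 1
    coarse = {round(k * last / (n_coarse - 1)) for k in range(n_coarse)}
    tail = set(range(max(0, n_layers - dense_tail), n_layers))
    return sorted(coarse | tail)

# approximate depths of the 7B / 14B / 32B models in the ladder
sweeps = {n: layer_sweep(n) for n in (28, 48, 64)}
```

keeping the tail dense matters more than the coarse spacing, since the 7B effect sits in the last few layers.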

compute: ~120–180 H100-hours (trace generation dominates; extraction and probe-fit are cheap). the models run in 4-bit so that 32B fits on a single 80GB card.

expected if the effect is real: the final-layer Euclidean gap holds or widens at 14B and 32B while the hyperbolic probe stays level, and the compression statistics track it as they do at 7B.

what would count against it: if the Euclidean degradation is gone by 14B, the workshop result was a small-model artifact, and the "property of reasoning models" framing does not hold. this is the cheapest test in the plan, which is why it runs before the rest: ~$150 of compute settles it either way.

branch 2 · reasoning vs scale vs training method (planned)

is it the reasoning training, or model depth and scale?

branch 1 fixes the backbone and varies scale. this branch holds scale roughly fixed and varies how the model became a reasoner, which separates "reasoning-specialised" from "deep network", and from "distilled specifically".

the matched instruction siblings from branch 1 already give the within-backbone reasoning-versus-standard contrast at each scale.
add a non-distilled reasoner: QwQ-32B, trained with reinforcement learning (RL) rather than distillation, and a Qwen3 thinking model at a comparable size. if the compression-and-degradation signature appears in a distilled reasoner and is absent in an RL-trained one (or the reverse), that points to the training method as the cause.
repeat the token-selection test (all-token pool / last token / thinking-token pool) across all of these, since the thinking-token concentration is part of the claim.

compute: mostly reuses branch 1 activations; the extra cost is QwQ-32B and one Qwen3 model, ~30–45 H100-hours folded into the scale budget.

expected if the effect is real: the final-layer degradation appears across reasoning-trained models regardless of training method, and is absent in the plain instruction-tuned siblings.

what would count against it: if a standard instruction model at 32B shows the same final-layer Euclidean degradation, then the effect is driven by depth or scale, and the central claim has to be rewritten as such.

branch 3 · task structure (planned)

does it hold past simple linear chains?

PrOntoQA gives one number per example (chain depth), a 1D ordinal target. that is a real weakness, because a 1D target embeds isometrically in almost any geometry, so the strongest version of the claim needs a target with genuine tree or graph structure.

keep PrOntoQA as the clean, well-understood baseline.
add ProofWriter, where each example carries a proof and the proof is a directed acyclic graph, or DAG (premise nodes feeding conclusion nodes), over reasoning hops 0 through 5. instead of a depth scalar, the probe target becomes pairwise graph distance over the proof DAG, following the chains-to-DAGs probing approach i cite in the paper. this checks whether the hyperbolic advantage extends to branching structure, beyond a single ordering axis.
add FOLIO / P-FOLIO (expert-written first-order-logic problems in natural language) to test the effect where the templated-text confound is gone.
add a multi-step math set (GSM8K, a slice of MATH) and a commonsense multi-hop set, with approximate step/dependency annotations, to reach past pure formal logic.
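
to make the ProofWriter target concrete, here is a small sketch of pairwise graph distance over a proof DAG. it treats edges as undirected and counts hops with breadth-first search; whether the final probe target uses undirected or directed distance is an assumption here, and the node names in the usage example are made up.

```python
from collections import deque

def pairwise_graph_distance(edges):
    """Hop distance between every pair of connected nodes, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[src] = seen
    return dist

# toy proof: premises p1 and p2 give c1; c1 and p3 give c2
toy_proof = [("p1", "c1"), ("p2", "c1"), ("c1", "c2"), ("p3", "c2")]
toy_dist = pairwise_graph_distance(toy_proof)
```

in the toy proof, two premises feeding the same conclusion sit at distance 2 from each other, which a single depth scalar cannot express: that is the branching structure the 1D PrOntoQA target lacks.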

compute: ~60–100 H100-hours, almost all of it generating reasoning traces on the new datasets across the model set.

expected if the effect is real: hyperbolic probes recover the proof-DAG distances on ProofWriter where Euclidean probes degrade, which would show the advantage reaches genuine graph structure, beyond a single depth axis.

what would count against it: if hyperbolic leads only on the 1D depth target and matches Euclidean once the target is a genuine DAG distance, then the honest reading is "hyperbolic helps with ordinal depth", a narrower claim. worth knowing early.

branch 4 · probe geometry and robustness checks (planned)

is the effect geometric, or specific to one probe?

the workshop paper used the Poincaré ball at a single curvature and a single seed, with 4-bit activations. each of those is a possible source of artifact, so this branch checks all of them.

more geometries: Euclidean (baseline), Poincaré ball (current), the Lorentz / hyperboloid model (numerically steadier at high curvature), and a product / mixed-curvature probe (factors like $\mathbb{H}\times\mathbb{H}$, $\mathbb{H}\times\mathbb{E}$, $\mathbb{S}\times\mathbb{H}$) that mixes flat and curved factors per layer.
curvature sweep $c \in \{0.1, 0.3, 0.5, 0.7, 1.0\}$ and dimension sweep $d \in \{2,4,5,8,16,32\}$, extending the small ablations already in the paper (hyperbolic was dimension-efficient there: $d=2$ reached $\rho \approx 0.936$).
at least five seeds on every headline comparison, with confidence intervals. the single-seed setup is an easy point for a reviewer to raise, and easy to fix.
quantisation check: re-extract the 7B (and ideally 14B) activations at 8-bit and fp16 and re-run the probes, to test whether the effect persists away from 4-bit and rule out a quantisation artifact.
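
a useful sanity check before comparing geometries: the Poincaré ball and the Lorentz (hyperboloid) model are isometric, so for fixed embedded points the two distances must agree, and any difference between the probes comes from optimisation and numerics. the sketch below checks this across the curvature sweep. it assumes the convention of curvature $-c$, with the hyperboloid defined by $-x_0^2 + \lVert x_{1:}\rVert^2 = -1/c$; function names are illustrative.

```python
import math

def poincare_distance(p, q, c):
    """Distance in the Poincare ball of curvature -c."""
    sq = lambda v: sum(a * a for a in v)
    diff = sq([a - b for a, b in zip(p, q)])
    denom = (1.0 - c * sq(p)) * (1.0 - c * sq(q))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

def to_hyperboloid(p, c):
    """Map a Poincare-ball point to the hyperboloid -x0^2 + |x|^2 = -1/c."""
    r2 = sum(a * a for a in p)
    scale = 1.0 - c * r2
    time = (1.0 + c * r2) / (math.sqrt(c) * scale)
    return [time] + [2.0 * a / scale for a in p]

def lorentz_distance(x, y, c):
    """Distance on the hyperboloid via the Lorentzian inner product."""
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    # clamp guards against rounding pushing the argument just below 1
    return math.acosh(max(1.0, -c * inner)) / math.sqrt(c)

p, q = [0.3, -0.2, 0.1], [-0.4, 0.5, 0.2]
gaps = {c: abs(poincare_distance(p, q, c)
               - lorentz_distance(to_hyperboloid(p, c), to_hyperboloid(q, c), c))
        for c in (0.1, 0.3, 0.5, 0.7, 1.0)}
```

if the trained Lorentz probe disagrees with the Poincaré probe by more than numerical noise on the same task, the likely cause is the training procedure, not the geometry.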

compute: ~40–70 H100-hours. probe training runs on CPU and is effectively free; the GPU cost is the 8-bit/fp16 re-extraction and the seed reruns.

expected if the effect is real: the Lorentz and mixed-curvature probes reproduce the Poincaré advantage, the gap is stable across seeds with non-overlapping intervals, and it changes little between 4-bit and fp16.
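
the "non-overlapping intervals" criterion can be checked with something as simple as the sketch below. it assumes exactly five seeds and a Student-t interval (critical value 2.776 for 4 degrees of freedom at 95%); the interval method is my assumption, and the per-seed numbers are placeholders, not results.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom (5 seeds)

def mean_ci(values, t_crit=T_CRIT_95_DF4):
    """Return (lower, mean, upper) of a t-based confidence interval for the mean."""
    m = statistics.mean(values)
    half = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return m - half, m, m + half

def intervals_overlap(a, b):
    """True if two (lower, mean, upper) intervals share any point."""
    return a[0] <= b[2] and b[0] <= a[2]

# placeholder per-seed scores, for illustration only (NOT measured values)
probe_a_seeds = [0.20, 0.30, 0.25, 0.30, 0.20]
probe_b_seeds = [0.80, 0.90, 0.85, 0.90, 0.80]
separated = not intervals_overlap(mean_ci(probe_a_seeds), mean_ci(probe_b_seeds))
```

with more than five seeds the critical value has to change with the degrees of freedom, so `t_crit` is a parameter, not a constant baked into the function.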

what would count against it: if the advantage appears only for the Poincaré ball at $c=0.5$ and disappears under Lorentz, mixed-curvature, or fp16 extraction, then it is an implementation detail, and the paper has to report it as one.

branch 5 · mechanistic follow-up (planned)

why does the gap appear?

the paper links the final-layer degradation to representational compression descriptively. this branch makes that link quantitative and, if the budget allows, points at which components are responsible.

extend the layer-wise statistics (norm, norm variance, participation ratio / effective rank, isotropy) across the full model set from branches 1–2, and correlate them directly against probe quality.
check whether the hierarchical signal concentrates in reasoning-marker tokens at the same layers where compression rises, across the full model set.
if compute allows: logit-lens and activation-patching passes on the late layers, plus attention and MLP (multi-layer perceptron) attribution, to identify which blocks drive the compression. this is the most exploratory part, and the next thing i cut after the 70B check if money is short.

compute: ~30–50 H100-hours. the statistics are cheap on already-saved activations; the attribution and patching passes need fresh GPU forward passes with hooks.

a useful result here: a quantitative relationship between a compression measure (say final-layer participation ratio) and the size of the Euclidean–hyperbolic gap, holding across models. that takes the account from "hyperbolic measures higher" to "here is when and why the geometry begins to matter".

scope: this branch aims at evidence for the mechanism. it will not produce a full circuit-level explanation, and the paper will say so directly.

branch 6 · reproducibility (planned)

make the next version usable by other people.

the workshop repo is public but built for one setup. for a full paper the artifact should let someone re-run a branch without reverse-engineering my scripts.

clean up the activation-extraction code and keep every run behind a Hydra config, so a model/dataset/geometry combination is one config file.
save and document the probe-training scripts and the exact layer sets per model.
release processed task metadata (proof DAGs, depth/step annotations) where the dataset licences allow it.
ship result tables across scale, task structure, token selection, and geometry, plus a manifest of which saved activations back which numbers.

compute: little GPU; the cost here is storage and transfer of the saved activation sets, which get large at 32B and 70B across many layers.

the 70B feasibility check

separate from the controlled scale ladder, i want one look at whether the signature appears at 70B, where the backbone changes: DeepSeek-R1-Distill-Llama-70B is a Llama model, while the rest of the ladder is Qwen. because the backbone differs, this is a feasibility check; the comparison is not controlled the way the 7B–32B ladder is. it is a small number of heavily quantised passes to see whether the effect strengthens, weakens, or is absent at that size. it realistically needs two H100s, or runs very slowly on one, which is why it sits in its own cost line and is the first thing to drop if the grant comes in smaller.

compute, roughly

these are rough estimates. probe training runs on CPU and is close to free, so the GPU time is dominated by generating reasoning traces (reasoning models emit long ones) and extracting hidden states across many layers. the dollar figures include storage, data transfer, repeated re-extraction across ablations, the multi-seed reruns, and the usual waste of solo GPU work, where there is no one to catch a bad config before it wastes hours of GPU time.

estimates assume on-demand H100 80GB at roughly $2.5–3/hr. branches reuse each other's saved activations, so the totals are lower than summing the raw per-branch generation cost. these map onto the seven budget lines in the application.
branch	what costs GPU	est. H100-hrs	est. $
1 · model scale (14B, 32B)	generating traces + extracting activations across layers; reruns	120–180	~$1,550
70B feasibility	a couple of heavily-quantised 70B passes, usually on two cards	40–70	~$850
3 · task structure	trace generation + extraction on ProofWriter / FOLIO / math / commonsense	60–100	~$650
4 · probe geometry	8-bit and fp16 re-extraction for the quantisation check, plus seed reruns	40–70	~$750
5 · mechanistic	attribution and patching passes (the statistics themselves are cheap)	30–50	~$600
6 · reproducibility	storing and moving saved activations, packaging	mostly storage	~$250
contingency	preemptions, out-of-memory reruns, bad configs	—	~$150
total			$4,800
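
the arithmetic behind the table, as a quick check. the dollar figures are copied from the rows above; the dictionary keys are shortened labels, and treating the whole mechanistic line as dropped when its attribution passes are cut is a simplifying assumption.

```python
# dollar estimates copied from the budget table above
budget_lines = {
    "1 model scale": 1550,
    "70B feasibility": 850,
    "3 task structure": 650,
    "4 probe geometry": 750,
    "5 mechanistic": 600,
    "6 reproducibility": 250,
    "contingency": 150,
}

total = sum(budget_lines.values())

# cut order stated in the plan: 70B first, then the mechanistic passes
# (assumption: the whole mechanistic line is removed, not just the attribution part)
cut_order = ["70B feasibility", "5 mechanistic"]
remaining = total
after_cuts = []
for name in cut_order:
    remaining -= budget_lines[name]
    after_cuts.append((name, remaining))

# minimum version: scale ladder plus probe-geometry ablations
minimum_version = budget_lines["1 model scale"] + budget_lines["4 probe geometry"]
```

the seven lines sum to the stated total, and the minimum version lands inside the $2,000–2,500 range quoted for it.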

if the budget is smaller

the cut order is set in advance. the 70B feasibility branch goes first; it is expensive and not load-bearing. then the mechanistic attribution passes, since the layer statistics stand without them. the minimum version worth running is the 14B/32B scale ladder plus the probe-geometry ablations, roughly $2,000–2,500, which already answers the two objections that matter most: does the effect hold at scale, and is it geometric or specific to one probe.

what gets released

at the end, whichever way the results go: the extraction and probing code behind Hydra configs, the per-model layer sets, the processed task metadata where licences permit, result tables across every branch, and a manifest tying saved activations to the numbers they produce. a negative branch goes into the paper as a result.

a note on how i am reading this: the failure conditions carry as much weight as the positive predictions. a measured negative result is a real outcome and goes into the paper. the outcome i am trying to avoid is a claim i never actually tested.