hyperbolic geometry of reasoning: the full experiment plan
i am writing the plan out in full so the reasoning is legible. each branch below states what it tests, the models and datasets it runs on, the rough compute it needs, and the result that would count against the claim. the later branches depend on what the earlier ones find, so the plan will be updated as results come in.
where this starts
the workshop paper asked one question: if chain-of-thought reasoning has hierarchical structure (premises supporting conclusions, conclusions depending on several premises), can a hyperbolic probe recover that structure from a model's hidden states more accurately than an ordinary Euclidean probe? i tested it on two 7B models from the same Qwen2.5 family with different training, one reasoning-specialised (DeepSeek-R1-Distill-Qwen-7B) and one standard instruction-tuned (Qwen2.5-7B-Instruct), on PrOntoQA, which gives linear reasoning chains of depth 1 to 5. both models are 28 layers with a 3584-dim hidden state, so the comparison isolates the training regime from the architecture.
the probe maps each layer's activations into either Euclidean space or the Poincaré ball ($d=5$, curvature $c=0.5$) and learns to predict pairwise reasoning-depth distances, trained with a stress-normalised objective:
$$\mathcal{L} = \frac{\sum_{i \neq j}\big(d_{\text{pred}}(i,j) - d_{\text{true}}(i,j)\big)^2}{\sum_{i \neq j} d_{\text{true}}(i,j)^2}$$
two measurements came out of that, and one piece of supporting evidence for why:
| model | Euclidean $\rho$ | hyperbolic $\rho$ | Euclidean distortion | hyperbolic distortion |
|---|---|---|---|---|
| DeepSeek-R1 7B (reasoning) | 0.488 | 0.967 | 0.562 | 0.090 |
| Qwen2.5 7B (standard) | 0.955 | 0.967 | 0.139 | 0.104 |
one. at the final layer of the reasoning model the Euclidean probe drops to $\rho = 0.488$ (distortion about $6\times$ that of the hyperbolic probe), while the hyperbolic probe stays at $\rho = 0.967$. the standard model shows no such drop. same architecture, same probing task, same layer; the difference is the training. across layers, the Euclidean probe on the reasoning model is stable from L8 to L21, starts to degrade at L23, partly recovers at L25, then drops to $0.488$ at L27. the hyperbolic probe stays above $\rho = 0.90$ throughout.
two. the hierarchical signal at that final layer is concentrated in "thinking tokens" (reasoning markers like Wait, So, Therefore, about 6.7% of the sequence). probing those gives $\rho = 0.871$, against $0.468$ for the last token and $0.390$ for uniform pooling over all tokens.
why, probably. the reasoning model's final layer compresses sharply. from L25 to L27 the activation norm drops 41%, norm variance rises 214%, the participation ratio (effective dimensionality) falls 43%, and isotropy rises about 20-fold. the standard model compresses too, less so (participation ratio down 29% against 43%), which is consistent with its Euclidean probe holding. reduced effective dimensionality and lower directional spread are conditions under which Euclidean distances lose resolution, while the exponential volume of hyperbolic space retains room.
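for readers who want the probe's moving parts in code, here is a minimal sketch of the two distance functions and the stress-normalised objective given earlier in this section. it is not the paper's implementation. it assumes the common convention where "curvature $c$" means a Poincaré ball of curvature $-c$ (radius $1/\sqrt{c}$); the paper's exact parameterisation may differ, and the function names are illustrative.

```python
import math

def euclidean_dist(x, y):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def poincare_dist(x, y, c=0.5):
    """Geodesic distance in the Poincare ball of curvature -c.

    Assumed convention: points must satisfy c * ||x||^2 < 1.
    """
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    fx, fy = 1.0 - c * sq_x, 1.0 - c * sq_y
    if fx <= 0.0 or fy <= 0.0:
        raise ValueError("point lies outside the ball for this curvature")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * c * sq_diff / (fx * fy)) / math.sqrt(c)

def stress_loss(points, d_true, dist):
    """Sum over i != j of squared distance error, divided by the sum of
    squared true distances (the normalised stress in the equation)."""
    n = len(points)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            num += (dist(points[i], points[j]) - d_true[i][j]) ** 2
            den += d_true[i][j] ** 2
    return num / den
```

the loss is zero exactly when the embedded distances match the target distances, and the normalisation keeps it comparable across examples with different depth ranges.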
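this is a small result, and limited in four ways a reviewer will point out: 7B only, one dataset, the Poincaré ball at one curvature, and 4-bit activations. the scale, task-structure and probe-geometry branches below address those directly; the other branches ask what causes the effect and how to make the work reproducible.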
the question the whole plan is built around
is the final-layer Euclidean degradation, with the hyperbolic probe holding, a general property of how reasoning-trained models represent hierarchy? or is it specific to the one 7B pair, the one dataset, and the quantised activations i started with?
how to read the branches
each branch has the same shape: what it tests, the runs it needs, a rough compute figure, the result expected if the effect is real, and the result that would count against it. i state the failure conditions on purpose. a negative answer is useful to me as long as it comes from a real test, and the failure conditions are how i keep that honest. the branches feed each other: branch 1 produces the activations that branches 2, 4 and 5 reuse, which is why the total compute is lower than it would be if each branch ran on its own.
does the effect hold at larger scale?
the first and cheapest objection is that 7B is small and the effect could weaken or strengthen at scale. so i run the same pipeline up the size ladder with the backbone held fixed, which isolates scale from architecture.
- reasoning side, fixed backbone: DeepSeek-R1-Distill-Qwen 7B → 14B → 32B, all distilled from DeepSeek-R1 into Qwen2.5 backbones, so the comparison up the ladder is controlled (roughly 28, 48 and 64 layers; hidden size 3584 at 7B, 5120 at 14B and 32B).
- matched standard side: Qwen2.5-7B / 14B / 32B-Instruct, the instruction-tuned siblings of those distilled models.
- layer sweep scaled to each depth: early, middle, late, and a denser set near the final layer where the 7B effect appears.
- the same metrics as the paper: Spearman $\rho$, distortion, and the layer-wise compression statistics, so the larger models are directly comparable to the 7B numbers above.
expected if the effect is real: the final-layer Euclidean gap holds or widens at 14B and 32B while the hyperbolic probe stays level, and the compression statistics track it as they do at 7B.
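since every comparison in this branch rests on Spearman $\rho$, here is a minimal sketch of how it can be computed on pairwise distances. pairwise depth distances tie heavily (many pairs share the same depth gap), so the sketch uses average ranks for ties. this is the textbook definition, written out as an illustration; the helper names are mine, and a library routine would do the same job.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)

def spearman_rho(pred, true):
    """Spearman rho = Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(pred), average_ranks(true))
```

because $\rho$ depends only on rank order, a probe can score well here while still distorting absolute distances, which is why distortion is reported alongside it.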
is it the reasoning training, or model depth and scale?
branch 1 fixes the backbone and varies scale. this branch holds scale roughly fixed and varies how the model became a reasoner, which separates "reasoning-specialised" from "deep network", and from "distilled specifically".
- the matched instruction siblings from branch 1 already give the within-backbone reasoning-versus-standard contrast at each scale.
- add a non-distilled reasoner: QwQ-32B, trained with reinforcement learning (RL) rather than distillation, and a Qwen3 thinking model at a comparable size. if the compression-and-degradation signature appears in a distilled reasoner and is absent in an RL-trained one (or the reverse), that points to the training method as the cause.
- repeat the token-selection test (all-token pool / last token / thinking-token pool) across all of these, since the thinking-token concentration is part of the claim.
expected if the effect is real: the final-layer degradation appears across reasoning-trained models regardless of training method, and is absent in the plain instruction-tuned siblings.
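the three token-selection strategies in the test above can be sketched as follows. this is an illustration under assumptions: the marker set contains only the three examples named in the text (the full list used in the paper is not given here), matching is exact and case-sensitive, and `pool` is a hypothetical helper name.

```python
# Examples named in the text; the complete marker list is an assumption.
THINKING_MARKERS = {"Wait", "So", "Therefore"}

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pool(tokens, hidden, strategy):
    """Reduce per-token hidden states at one layer to a single vector.

    tokens:   list of token strings for one sequence
    hidden:   list of hidden-state vectors, one per token
    strategy: "all" (uniform pool), "last" (final token),
              or "thinking" (pool over reasoning-marker tokens only)
    """
    if strategy == "all":
        return mean_pool(hidden)
    if strategy == "last":
        return list(hidden[-1])
    if strategy == "thinking":
        picked = [h for t, h in zip(tokens, hidden)
                  if t.strip() in THINKING_MARKERS]
        # No marker in the sequence: return None so the caller decides
        # whether to skip the example or fall back to another strategy.
        return mean_pool(picked) if picked else None
    raise ValueError(f"unknown strategy: {strategy}")
```

how sequences with no marker tokens are handled changes which examples enter the thinking-token probe, so that choice should be fixed and reported per model.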
does it hold past simple linear chains?
PrOntoQA gives one number per example (chain depth), a 1D ordinal target. that is a real weakness, because a 1D target embeds isometrically in almost any geometry, so the strongest version of the claim needs a target with genuine tree or graph structure.
- keep PrOntoQA as the clean, well-understood baseline.
- add ProofWriter, where each example carries a proof and the proof is a directed acyclic graph, or DAG (premise nodes feeding conclusion nodes), over reasoning hops 0 through 5. instead of a depth scalar, the probe target becomes pairwise graph distance over the proof DAG, following the chains-to-DAGs probing approach i cite in the paper. this checks whether the hyperbolic advantage extends to branching structure, beyond a single ordering axis.
- add FOLIO / P-FOLIO (expert-written first-order-logic problems in natural language) to test the effect where the templated-text confound is gone.
- add a multi-step math set (GSM8K, a slice of MATH) and a commonsense multi-hop set, with approximate step/dependency annotations, to reach past pure formal logic.
expected if the effect is real: hyperbolic probes recover the proof-DAG distances on ProofWriter where Euclidean probes degrade, which would show the advantage reaches genuine graph structure, beyond a single depth axis.
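to make the ProofWriter target concrete, here is a minimal sketch of pairwise graph distance over a proof DAG. it assumes distance means shortest-path length with edge direction ignored, which is one reasonable reading; the approach cited in the paper may define it differently. the node names in the example are made up.

```python
from collections import deque

def pairwise_graph_distance(nodes, edges):
    """Shortest-path hop counts between all node pairs.

    edges are (premise, conclusion) pairs; direction is ignored when
    measuring distance (an assumption). Unreachable pairs are simply
    absent from the inner dictionaries.
    """
    adjacency = {n: set() for n in nodes}
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    dist = {}
    for start in nodes:
        seen = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[start] = seen
    return dist

# Toy proof: p1 and p2 support c1; c1 and p3 support c2.
nodes = ["p1", "p2", "p3", "c1", "c2"]
edges = [("p1", "c1"), ("p2", "c1"), ("c1", "c2"), ("p3", "c2")]
dist = pairwise_graph_distance(nodes, edges)
```

in the toy proof, two premises of the same conclusion sit two hops apart even though they have the same depth, which is exactly the branching information a depth scalar throws away.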
is the effect geometric, or specific to one probe?
the workshop paper used the Poincaré ball at a single curvature and a single seed, with 4-bit activations. each of those is a possible source of artifact, so this branch checks all of them.
- more geometries: Euclidean (baseline), Poincaré ball (current), the Lorentz / hyperboloid model (numerically steadier at high curvature), and a product / mixed-curvature probe (factors like $\mathbb{H}\times\mathbb{H}$, $\mathbb{H}\times\mathbb{E}$, $\mathbb{S}\times\mathbb{H}$) that mixes flat and curved factors per layer.
- curvature sweep $c \in \{0.1, 0.3, 0.5, 0.7, 1.0\}$ and dimension sweep $d \in \{2,4,5,8,16,32\}$, extending the small ablations already in the paper (hyperbolic was dimension-efficient there: $d=2$ reached $\rho \approx 0.936$).
- at least five seeds on every headline comparison, with confidence intervals. the single-seed setup is an easy point for a reviewer to raise, and easy to fix.
- quantisation check: re-extract the 7B (and ideally 14B) activations at 8-bit and fp16 and re-run the probes, to test whether the effect persists away from 4-bit and rule out a quantisation artifact.
expected if the effect is real: the Lorentz and mixed-curvature probes reproduce the Poincaré advantage, the gap is stable across seeds (the Euclidean and hyperbolic confidence intervals do not overlap), and it changes little between 4-bit and fp16.
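the seed-level check can be sketched as a per-probe confidence interval plus an overlap test. this is one way to do it, assuming a two-sided 95% Student-t interval over seeds (which treats seed-to-seed variation as roughly normal); a bootstrap would be a reasonable alternative. the input values in the example are toy numbers, not measurements.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}

def mean_ci(values):
    """95% confidence interval for the mean of per-seed scores."""
    n = len(values)
    t = T_CRIT_95[n - 1]  # five seeds -> 4 degrees of freedom
    centre = statistics.mean(values)
    half_width = t * statistics.stdev(values) / math.sqrt(n)
    return centre - half_width, centre + half_width

def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# Toy per-seed scores for two probes (placeholders, not results).
probe_a = [1.0, 2.0, 3.0, 4.0, 5.0]
probe_b = [11.0, 12.0, 13.0, 14.0, 15.0]
separated = not intervals_overlap(mean_ci(probe_a), mean_ci(probe_b))
```

with only five seeds the t critical value is large (2.776), so the intervals are wide; non-overlap at that sample size is a fairly demanding bar.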
why does the gap appear?
the paper links the final-layer degradation to representational compression descriptively. this branch makes that link quantitative and, if the budget allows, points at which components are responsible.
- extend the layer-wise statistics (norm, norm variance, participation ratio / effective rank, isotropy) across the full model set from branches 1–2, and correlate them directly against probe quality.
- check whether the hierarchical signal concentrates in reasoning-marker tokens at the same layers where compression rises, across the full model set.
- if compute allows: logit-lens and activation-patching passes on the late layers, plus attention and MLP (multi-layer perceptron) attribution, to identify which blocks drive the compression. this is the most exploratory part, and the second thing i cut if money is short, after the 70B feasibility check.
a useful result here: a quantitative relationship between a compression measure (say final-layer participation ratio) and the size of the Euclidean–hyperbolic gap, holding across models. that takes the account from "hyperbolic measures higher" to "here is when and why the geometry begins to matter".
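for reference, the participation ratio and the norm statistics can be sketched as below. the sketch assumes the standard definition $\text{PR} = (\sum_k \lambda_k)^2 / \sum_k \lambda_k^2$ over the eigenvalues of the activation covariance, computed through traces so no eigendecomposition is needed; the paper's exact estimator may differ. it is pure Python for readability, and a real run at 3584 dimensions would want an array library.

```python
import math

def participation_ratio(rows):
    """Effective dimensionality of a set of activation vectors.

    Uses PR = trace(C)^2 / trace(C @ C), which equals
    (sum of eigenvalues)^2 / (sum of squared eigenvalues)
    for the covariance matrix C.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    trace = sum(cov[a][a] for a in range(d))
    # For a symmetric matrix, trace(C @ C) is the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / trace_sq

def norm_stats(rows):
    """Mean and (population) variance of the activation norms."""
    norms = [math.sqrt(sum(v * v for v in r)) for r in rows]
    mean = sum(norms) / len(norms)
    variance = sum((x - mean) ** 2 for x in norms) / len(norms)
    return mean, variance
```

the ratio runs from 1 (all variance on one direction) up to the ambient dimension (variance spread evenly), so a percentage drop between two layers reads directly as a loss of effective dimensions.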
make the next version usable by other people
the workshop repo is public but built for one setup. for a full paper the artifact should let someone re-run a branch without reverse-engineering my scripts.
- clean up the activation-extraction code and keep every run behind a Hydra config, so a model/dataset/geometry combination is one config file.
- save and document the probe-training scripts and the exact layer sets per model.
- release processed task metadata (proof DAGs, depth/step annotations) where the dataset licences allow it.
- ship result tables across scale, task structure, token selection, and geometry, plus a manifest of which saved activations back which numbers.
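one possible shape for the manifest in the last item, as a sketch. every field name, the file path, and the table label here are hypothetical; the point is only that each saved activation file gets a content hash and a list of the reported numbers it backs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ManifestEntry:
    # All field names are illustrative, not the repo's actual schema.
    model: str
    dataset: str
    layer: int
    precision: str            # e.g. "4bit", "8bit", "fp16"
    activation_file: str      # path relative to the release root
    sha256: str               # hash of the saved activation file
    reported_in: list = field(default_factory=list)  # tables it backs

def sha256_of(payload: bytes) -> str:
    """Hex digest of the file contents, so a reader can verify a download."""
    return hashlib.sha256(payload).hexdigest()

def dump_manifest(entries) -> str:
    """Serialise entries as stable, diff-friendly JSON."""
    return json.dumps([asdict(e) for e in entries], indent=2, sort_keys=True)

# Placeholder example; the bytes stand in for a real activation file.
example = ManifestEntry(
    model="DeepSeek-R1-Distill-Qwen-7B",
    dataset="PrOntoQA",
    layer=27,
    precision="4bit",
    activation_file="activations/example_layer27.bin",  # hypothetical path
    sha256=sha256_of(b"placeholder bytes"),
    reported_in=["main results table"],                 # hypothetical label
)
manifest_json = dump_manifest([example])
```

hashing the file contents, instead of trusting file names, is what lets someone confirm that a number in a table came from the activations they downloaded.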
the 70B feasibility check
separate from the controlled scale ladder, i want one look at whether the signature appears at 70B, where the backbone changes: DeepSeek-R1-Distill-Llama-70B is a Llama model, while the rest of the ladder is Qwen. because the backbone differs, this is a feasibility check; the comparison is not controlled the way the 7B–32B ladder is. it is a small number of heavily quantised passes to see whether the effect strengthens, weakens, or is absent at that size. it realistically needs two H100s, or runs very slowly on one, which is why it sits in its own cost line and is the first thing to drop if the grant comes in smaller.
compute, roughly
these are rough estimates. probe training runs on CPU and is close to free, so the GPU time is dominated by generating reasoning traces (reasoning models emit long ones) and extracting hidden states across many layers. the dollar figures include storage, data transfer, repeated re-extraction across ablations, the multi-seed reruns, and the usual overhead of solo GPU work, where there is no one to catch a bad config before it burns hours of GPU time.
| branch | what costs GPU | est. H100-hrs | est. $ |
|---|---|---|---|
| 1 · model scale (14B, 32B) | generating traces + extracting activations across layers; reruns | 120–180 | ~$1,550 |
| 70B feasibility | a couple of heavily-quantised 70B passes, usually on two cards | 40–70 | ~$850 |
| 3 · task structure | trace generation + extraction on ProofWriter / FOLIO / math / commonsense | 60–100 | ~$650 |
| 4 · probe geometry | 8-bit and fp16 re-extraction for the quantisation check, plus seed reruns | 40–70 | ~$750 |
| 5 · mechanistic | attribution and patching passes (the statistics themselves are cheap) | 30–50 | ~$600 |
| 6 · reproducibility | storing and moving saved activations, packaging | mostly storage | ~$250 |
| contingency | preemptions, out-of-memory reruns, bad configs | — | ~$150 |
| total | | | ~$4,800 |
if the budget is smaller
the cut order is set in advance. the 70B feasibility branch goes first; it is expensive and not load-bearing. then the mechanistic attribution passes, since the layer statistics stand without them. the minimum version worth running is the 14B/32B scale ladder plus the probe-geometry ablations, roughly $2,000–2,500, which already answers the two objections that matter most: does the effect hold at scale, and is it geometric or specific to one probe.
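the arithmetic behind the total and the cut order can be checked in a few lines. the dollar figures are the estimates from the compute table; the dictionary keys are just shortened row labels.

```python
# Estimated cost per line item, in US dollars, from the compute table.
COSTS_USD = {
    "1 model scale": 1550,
    "70B feasibility": 850,
    "3 task structure": 650,
    "4 probe geometry": 750,
    "5 mechanistic": 600,
    "6 reproducibility": 250,
    "contingency": 150,
}

total = sum(COSTS_USD.values())

# Cut order stated in the plan: 70B first, then the mechanistic passes.
# This treats the whole mechanistic line as cut, which overstates the
# saving slightly, since the cheap layer statistics would still run.
after_cuts = total
for item in ["70B feasibility", "5 mechanistic"]:
    after_cuts -= COSTS_USD[item]

# Minimum version: the scale ladder plus the probe-geometry ablations.
minimum_version = COSTS_USD["1 model scale"] + COSTS_USD["4 probe geometry"]

print(total, after_cuts, minimum_version)
```

the line items sum to $4,800, dropping the two cut items leaves $3,350, and the minimum version comes to $2,300, inside the $2,000–2,500 range quoted above.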
what gets released
at the end, whichever way the results go: the extraction and probing code behind Hydra configs, the per-model layer sets, the processed task metadata where licences permit, result tables across every branch, and a manifest tying saved activations to the numbers they produce. a negative branch goes into the paper as a result.
a note on how i am reading this: the failure conditions carry as much weight as the positive predictions. a measured negative result is a real outcome and goes into the paper. the outcome i am trying to avoid is a claim i never actually tested.