Arnav Raj

What Hyperbolic Geometry Reveals About How LLMs Reason

2026-01-22T03:30:00+00:00

Most interpretability work assumes that LLM representations live in flat, Euclidean space. We compute cosine similarities, run PCA, project with t-SNE. All tools built on Euclidean assumptions.

But reasoning has hierarchical structure. Premises support conclusions. Abstract claims generalize over specific ones. If you draw a proof tree, you see something that looks like a branching hierarchy, not a point cloud in flat space.

Trees embed poorly in Euclidean space but naturally in hyperbolic space, where volume grows exponentially with radius. This mismatch motivated a paper I wrote for the ICLR 2026 Workshop (GRaM Tiny Paper Track), which was accepted for poster presentation. I wanted to test whether hyperbolic probes capture hierarchical reasoning structure in LLM hidden states better than Euclidean ones.

The results were more dramatic than I expected.

Why hyperbolic geometry?

In Euclidean space, the area of a circle grows as $\pi r^2$. In hyperbolic space, it grows exponentially with $r$. There is dramatically more room at the edges of a hyperbolic disk, which is exactly what you need to embed trees: they have exponentially more leaves than internal nodes.
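
To put numbers on the two growth rates: assuming constant curvature $-1$ (an assumption for illustration; the probes later in this post use a different curvature), the circumference and area of a hyperbolic circle of radius $r$ are

\[C(r) = 2\pi \sinh r, \qquad A(r) = 2\pi(\cosh r - 1) = 4\pi \sinh^2(r/2) \approx \pi e^{r} \ \text{for large } r\]

For small $r$ these reduce to the Euclidean values $2\pi r$ and $\pi r^2$, so the two geometries only diverge as you move outward.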

The Poincaré disk makes this concrete. Points near the center represent high-level, general concepts. Points near the boundary represent specific, leaf-level details. Distances between points capture hierarchical relationships.
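
As a rough illustration of how distance behaves in this model, here is a minimal sketch of the geodesic distance in a Poincaré ball of curvature $-c$, using the standard closed form. The function name and the sample points are my own, not taken from the paper's code.

```python
import math

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between x and y in the Poincare ball of curvature -c.

    Points must lie strictly inside the ball of radius 1/sqrt(c).
    """
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    fx = 1.0 - c * sq_norm_x
    fy = 1.0 - c * sq_norm_y
    if fx <= 0.0 or fy <= 0.0:
        raise ValueError("points must lie strictly inside the ball")
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * c * diff_sq / (fx * fy)) / math.sqrt(c)

# The same Euclidean step of 0.1 costs more hyperbolic distance near the boundary.
step_near_center = poincare_distance((0.0, 0.0), (0.1, 0.0))
step_near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(step_near_center, step_near_boundary)
```

Equal-looking steps get longer toward the edge of the disk, which is where the extra room for leaf-level points comes from.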

This is not just an analogy. Nickel & Kiela (2017) showed that 5-dimensional Poincaré embeddings can match 200-dimensional Euclidean embeddings on hierarchical data. More recently, He et al. (2025) measured the intrinsic $\delta$-hyperbolicity of LLM embeddings and found values between 0.07 and 0.20. A perfect tree has $\delta = 0$, so values this low suggest genuine tree-like structure.

So: if LLM representations during reasoning have hierarchical structure, can hyperbolic probes capture it more faithfully than Euclidean ones?

Setup

I compared two models from the same architecture family but with different training regimes:

DeepSeek-R1-Distill-Qwen-7B: reasoning-specialized, trained with chain-of-thought distillation. Generates explicit reasoning steps via <think> tokens.
Qwen2.5-7B-Instruct: standard instruction-tuned model from the same Qwen2.5 base.

Both are 28-layer transformers with 3584-dimensional hidden states, so the comparison isolates the effect of reasoning-specialized training rather than architectural differences.

The dataset is PrOntoQA (Saparov & He, 2023): 1000 logical reasoning problems with depths 1 through 5, forming clean linear chains. The templated structure minimizes linguistic confounds, letting us focus on geometric structure rather than surface-level language patterns.

For probing, I map layer activations to either Euclidean space or the Poincaré ball ($d = 5$, curvature $c = 0.5$) and train lightweight probes to predict pairwise reasoning depth distances. The probe uses spectral normalization and Maximum Distance Rescaling for numerical stability. Training uses a stress-normalized loss (standard Kruskal stress from multidimensional scaling):

\[\mathcal{L} = \frac{\sum_{i \neq j} (d_{\text{pred}}(i,j) - d_{\text{true}}(i,j))^2}{\sum_{i \neq j} d_{\text{true}}(i,j)^2}\]
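
A minimal sketch of this loss in plain Python, assuming the predicted and true distances are given as square matrices (nested lists). The function name and the toy depths are illustrative, and this is not the training code from the paper.

```python
def stress_loss(d_pred, d_true):
    """Sum of squared distance errors over pairs i != j, divided by the
    sum of squared true distances over the same pairs."""
    n = len(d_true)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                num += (d_pred[i][j] - d_true[i][j]) ** 2
                den += d_true[i][j] ** 2
    if den == 0.0:
        raise ValueError("all true distances are zero; the loss is undefined")
    return num / den

# Toy example: three samples with reasoning depths 1, 2, 3.
depths = [1, 2, 3]
d_true = [[abs(a - b) for b in depths] for a in depths]
d_doubled = [[2 * v for v in row] for row in d_true]

print(stress_loss(d_true, d_true))     # perfect prediction
print(stress_loss(d_doubled, d_true))  # every distance overestimated by 2x
```

The normalization makes the loss scale-free: predicting all zeros and doubling every distance both give a loss of 1, regardless of how large the depths are.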

I evaluate across 8 layers (L8 through L27) with 5-fold cross-validation, using Spearman $\rho$ and distortion (mean absolute distance error) as metrics.
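
For reference, here is a minimal sketch of the two metrics, assuming the pairwise distances have been flattened into two equal-length lists. Spearman $\rho$ is computed as the Pearson correlation of tie-averaged ranks, which matters here because integer depth distances produce many ties. The helper names are my own.

```python
import math

def average_ranks(values):
    """Ranks from 1..n, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman_rho(pred, true):
    """Rank correlation: only the ordering of the distances matters."""
    return pearson(average_ranks(pred), average_ranks(true))

def distortion(pred, true):
    """Mean absolute error between predicted and true distances."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

true_d = [1, 2, 3, 4]
pred_d = [1, 4, 9, 16]  # right ordering, wrong magnitudes
print(spearman_rho(pred_d, true_d), distortion(pred_d, true_d))
```

The two metrics answer different questions: a probe can rank every pair correctly and still be far off in magnitude, which is why both are reported.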

Finding 1: Euclidean probes break down in reasoning models

At the final layer (L27), the hyperbolic probe achieved Spearman $\rho = 0.967$ on both models. Robust, consistent, unremarkable in the best way.

The Euclidean probe told a different story. On Qwen (the standard model), it performed well: $\rho = 0.955$. On DeepSeek (the reasoning model), it collapsed to $\rho = 0.488$. Same architecture, same probing task, same layer. The only difference is the training regime.

The distortion numbers made the gap sharper. DeepSeek Euclidean distortion at L27 was 0.562, roughly 6x higher than the hyperbolic probe’s 0.090. Qwen’s Euclidean distortion was 0.139, comparable to its hyperbolic result (0.104).

Since the target metric (1D ordinal depth) embeds isometrically in both geometries, this advantage has to come from the representation structure itself. The model’s internal geometry genuinely favors hyperbolic decoding.

The degradation builds over the last few layers rather than appearing all at once. DeepSeek’s Euclidean probe is stable from L8 through L21 ($\rho \approx 0.97$), starts dropping at L23 ($\rho = 0.842$), partially recovers at L25 ($\rho = 0.906$), then collapses at L27 ($\rho = 0.488$). The hyperbolic probe stays above $\rho = 0.90$ across all layers. Qwen’s Euclidean probe shows no degradation at any layer.

Finding 2: thinking tokens concentrate hierarchical information

Chain-of-thought models produce explicit reasoning tokens during generation. Following Qian et al. (2025), I identified “thinking tokens” by matching reasoning markers: “Wait”, “Hmm”, “Let me”, “So”, “Therefore”, “Thus”, “Hence”, “Because”, “Since”. These constitute about 6.7% of the sequence (roughly 20.7 tokens per sample on average).

At Layer 27, probing these thinking tokens with the hyperbolic probe gave $\rho = 0.871$. Probing the last token gave $\rho = 0.468$. Uniform pooling over all tokens gave $\rho = 0.390$.

The thinking token advantage is concentrated at the final layer ($\Delta\rho = +0.481$ over uniform pooling at L27). At intermediate layers (L19, L23, L25), thinking tokens actually perform worse than uniform pooling. The benefit only emerges where representations are most compressed. This suggests that hierarchical information gets consolidated into these specific token positions at the model’s output layer.

This provides geometric validation of what Qian et al. (2025) found through mutual information analysis: reasoning dynamics are concentrated in sparse, identifiable tokens that constitute just 0.5-5% of the generated sequence.

What the compression statistics reveal

Why do Euclidean probes fail specifically at late layers in reasoning models? I computed layer-wise activation statistics for DeepSeek and found a clear pattern of representational compression at L27.

From L25 to L27:

Activation norms decrease by 41% (1333 to 782)
Norm variance increases by 214% (39.7 to 124.4)
Participation ratio (effective dimensionality) drops 43% (45.5 to 25.8)
Isotropy increases roughly 20x (0.0049 to 0.096)
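
The post does not spell out how participation ratio is computed. A minimal sketch, assuming the usual definition over the eigenvalues $\lambda_i$ of the activation covariance matrix, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$:

```python
def participation_ratio(eigenvalues):
    """Effective dimensionality: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals n when variance is spread evenly over n directions and 1 when a
    single direction carries all of it.
    """
    total = sum(eigenvalues)
    sum_sq = sum(v * v for v in eigenvalues)
    if sum_sq == 0.0:
        raise ValueError("at least one eigenvalue must be non-zero")
    return total * total / sum_sq

print(participation_ratio([1.0, 1.0, 1.0, 1.0]))  # variance spread evenly
print(participation_ratio([5.0, 0.0, 0.0, 0.0]))  # one dominant direction
```

Under this definition, a drop from 45.5 to 25.8 means the variance that used to spread over roughly 45 directions is now concentrated in roughly 26.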

Qwen also compresses at its final layer, but less severely: participation ratio drops 29% (vs. 43%), and its effective dimensionality at L27 (43.1) is 67% higher than DeepSeek’s (25.8). This milder compression explains why Qwen’s Euclidean probes still work.

The interpretation: reasoning-specialized training creates representations that compress more aggressively at the output layer. Reduced effective dimensionality and loss of directional diversity mean Euclidean distances lose resolution. Hyperbolic geometry, with its exponential volume growth, accommodates this compressed structure where flat geometry cannot.

Limitations

This is a preliminary investigation with several important caveats.

Both models share the Qwen2.5 backbone, so the cross-model comparison reflects training regime differences rather than architectural ones. I only evaluated 7B-parameter models; scaling to 70B+ might reveal different patterns. PrOntoQA provides clean 1D chains, but real-world reasoning involves branching hierarchies and is considerably messier. Models were loaded with 4-bit quantization, which may affect activation distributions. And while layer statistics provide evidence for representational compression, full mechanistic understanding would require circuit-level analysis identifying which attention heads and MLPs drive the observed behavior.

What’s next

The direction I find most promising is using hyperbolic geometry to build interpretability tools that work with the natural structure of reasoning rather than flattening it.

If reasoning has hierarchical structure, our tools for understanding it should respect that geometry. Most interpretability methods assume flat spaces by default. The results here suggest that for reasoning-specialized models, this assumption can miss real structure.

I’m also curious whether different reasoning training approaches (RLHF, process reward models, constitutional AI) leave distinct geometric fingerprints. If they do, hyperbolic probing could become a diagnostic tool for comparing training regimes. And extending from linear chains to branching DAG structures (as Zhong et al., 2026 have started exploring) would test whether the geometric advantage holds for more complex reasoning topologies.

Accepted at the ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM). OpenReview

What Makes a Good RLHF Task? Lessons from Training Data Research

2025-12-13T06:30:00+00:00

Since November 2024, I’ve been at Abundant AI designing training data for reinforcement learning from human feedback. Our datasets power some of the top AI labs in the world.

That context matters less than what I’ve learned doing the work. Designing tasks that actually expose weaknesses in state-of-the-art models, the ones that ace standard benchmarks, is a different kind of problem than I expected going in.

The saturation problem

Here’s the core tension: GPT-4, Claude, Gemini, and their peers all score above 90% on the benchmarks people typically use to evaluate them. MMLU, GSM8K, HumanEval. These are effectively solved.

That’s a data quality crisis for RL training. If a model gets everything right, there’s nothing to learn from. The reward signal is flat. You need tasks that sit right at the boundary of what the model can do, hard enough that it fails often enough to learn, structured enough that the failures carry useful signal.

The sweet spot, from what I’ve seen, is a 30-70% success rate on frontier models. Below 30% and noise overwhelms signal. Above 70% and there isn’t enough failure to train on.
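
As a sketch of how that band might be applied in practice: given pass counts from repeated attempts on a frontier model, keep only the tasks whose empirical success rate lands between the two cutoffs. The task names and counts below are made up for illustration.

```python
def in_training_band(passes, attempts, low=0.30, high=0.70):
    """True if the empirical success rate falls inside the [low, high] band."""
    if attempts <= 0:
        raise ValueError("need at least one attempt to estimate a success rate")
    rate = passes / attempts
    return low <= rate <= high

# Hypothetical results: task name -> (passes, attempts)
results = {
    "schema_design": (5, 10),
    "sorting_basics": (10, 10),
    "protocol_proof": (1, 10),
}
kept = [name for name, (p, n) in results.items() if in_training_band(p, n)]
print(kept)
```

With only a handful of attempts per task the estimated rate is noisy, so tasks near either cutoff are worth re-sampling before they are dropped.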

What actually makes a task hard

I’ve iterated on hundreds of tasks, and the patterns that consistently break strong models are more specific than “make it harder.”

Stacking constraints is the most reliable approach. Simple tasks have one or two requirements. The tasks that expose real weaknesses need the model to juggle five or more interacting constraints at once. For example: design a database schema that normalizes to 3NF, supports specific query patterns, handles temporal versioning, maintains referential integrity across soft deletes, and optimizes for a read-heavy workload with particular index constraints. Models typically nail three or four and violate the rest. That partial failure is exactly the kind of signal that produces a useful gradient.

Planting adversarial edge cases in familiar territory is another pattern that works well. Models have seen thousands of sorting algorithm implementations. They haven’t seen the numerical stability issues that emerge near the floating-point boundary. The learning happens at the outliers: concurrency bugs that only manifest under specific timing, privacy leaks in anonymization that looks safe on the surface, statistical fallacies embedded in realistic data analysis.

Combining domain expertise with reasoning depth trips up frontier models in a way that neither factor alone does. A straightforward finance question won’t. A straightforward logic puzzle won’t. But a complex derivative pricing problem under non-standard market conditions that requires multi-hop reasoning? That’s a different story. Same for debugging distributed training failures with subtle parameter interactions, or interpreting contracts across jurisdictions with conflicting clauses.

Precision requirements force the model out of its comfort zone of approximate correctness. Formal verification proofs where a single logical gap invalidates everything. Cryptographic protocol design where a small mistake is catastrophic. Numerical methods with strict error bounds. These tasks demand careful reasoning, not pattern matching.

What makes a task useful vs. just difficult

Difficulty alone isn’t enough. After enough iterations, I noticed a pattern in the tasks that consistently produced useful training signal.

There’s always a realistic context with enough domain grounding that the model can’t game the format. There are explicit constraints, plus implicit ones that follow from domain knowledge. There’s hidden complexity, meaning non-obvious interactions between stated requirements. And there’s objective verification: a clear pass/fail criterion, ideally automatable.

That last part is critical. If you can’t grade a task objectively, the reward signal is noisy. Tasks where human evaluators disagree on what “correct” means add confusion to training, not learning.

One more thing: the difficulty should come from genuine reasoning requirements, not ambiguous specs. If a model fails because the instructions were unclear, that’s a bug in the task design, not an exposed weakness in the model.

This is an AI safety problem

It might sound like narrow training infrastructure work. It isn’t.

RLHF with high-quality adversarial tasks is stress-testing model reasoning at scale. Models trained on tasks requiring careful constraint satisfaction learn that pattern-matching isn’t sufficient. They develop better calibration, because it gets harder to be confidently wrong when your training data punishes overconfidence. They improve at following complex instructions, which is directly relevant to safety-critical deployment.

The alternative, training on easy tasks, produces models that look strong on benchmarks but crumble under distribution shift. Overconfident, brittle, and bad at admitting uncertainty. Exactly the properties you don’t want in production.

Problems I haven’t solved

Scalability is the biggest open challenge. High-quality hard tasks require genuine domain expertise to create. You can’t crowdsource them to people who don’t deeply understand the domain, because the hidden complexity that makes a task valuable comes from expert intuition about where models actually fail.

Verification complexity is a close second. For code, you run tests. For formal proofs, you check validity. But for tasks involving judgment, design trade-offs, or open-ended analysis, grading is expensive and inconsistent.

Curriculum design is still more art than science. How do you sequence tasks from challenging to extremely hard to maximize learning? Too hard and the model doesn’t learn. Too easy and it plateaus. The optimal curriculum probably depends on the model’s current capability, but measuring that during RL training is its own research problem.

The bigger picture

The models we’ll use next year are being shaped by the training data we design today. As LLMs saturate existing benchmarks, the bottleneck is shifting from raw model capability to the quality of the signal we train them on.

Getting RLHF data right isn’t just about making models smarter. It’s about making them robust, calibrated, and trustworthy in the situations that matter most: edge cases, multi-constraint problems, the scenarios where approximate reasoning isn’t good enough.

That’s the job. It’s harder than I expected, and more consequential than it looks from outside.

Thanks to the Abundant AI team and the broader research community working on RLHF, Constitutional AI, and model robustness.

Beyond Accuracy: Evaluating Chain-of-Thought Reasoning in Production

2025-10-24T04:30:00+00:00

I spent months benchmarking LLMs on RTL code generation at Harvard’s Edge Computing Lab and evaluating long-context reasoning at Georgia Tech’s FSI Lab. Over hundreds of evaluation runs, one pattern kept surfacing: a model could produce a beautifully coherent reasoning chain and still get the answer completely wrong.

This isn’t a failure of chain-of-thought prompting. It’s a measurement problem. Most evaluation frameworks check whether the model got the right answer. Almost none check whether it got there for the right reasons.

The example that changed how I think about evaluation

Here’s something I saw while benchmarking RTL code generation. The task: generate a 4-bit counter with asynchronous reset.

The model’s chain-of-thought was textbook: (1) use a 4-bit register, (2) increment on each clock edge, (3) handle async reset with an if statement, (4) reset to 0000 when the reset signal goes high.

The generated code used synchronous reset.

Read the reasoning again. Every step is correct. The implementation contradicts step 3. And if you’re only checking whether the counter works, well, synchronous reset can pass many of the same testbenches. You might not catch the bug unless you specifically test the async behavior.

This is the core problem. Fluency is not correctness. A plausible reasoning chain is not evidence of sound reasoning.
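
To see how a testbench can miss the difference, here is a small behavioral model in Python (not RTL, and not the benchmark's actual harness). The event names are hypothetical. The counter is only observed right after clock edges, the way a simple testbench would sample it.

```python
def simulate(events, async_reset):
    """Behavioral model of a 4-bit counter; returns (event, count) after each event."""
    count = 0
    rst = False
    trace = []
    for ev in events:
        if ev == "clk":            # rising clock edge
            count = 0 if rst else (count + 1) % 16
        elif ev == "rst_high":     # reset asserted between clock edges
            rst = True
            if async_reset:
                count = 0          # async reset clears without waiting for the clock
        elif ev == "rst_low":
            rst = False
        else:
            raise ValueError("unknown event: " + ev)
        trace.append((ev, count))
    return trace

def sampled_on_clock(events, async_reset):
    """Counter values observed immediately after each clock edge."""
    return [count for ev, count in simulate(events, async_reset) if ev == "clk"]

# Reset held across a clock edge: both designs look identical on the clock.
held = ["clk", "clk", "rst_high", "clk", "rst_low", "clk"]
# Reset pulsed between two clock edges: only the async design clears.
pulse = ["clk", "clk", "rst_high", "rst_low", "clk"]

for name, events in (("held", held), ("pulse", pulse)):
    print(name, sampled_on_clock(events, True), sampled_on_clock(events, False))
```

A testbench that only ever holds reset across a clock edge cannot tell the two designs apart. It takes a reset pulse that falls between edges to expose the synchronous implementation.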

An evaluation hierarchy born from frustration

Through 150+ RTL generation tasks, I built up a layered evaluation approach out of necessity. Each level catches failures the levels below it miss.

Level 1 is output correctness. Does the code compile? Does it pass testbenches? Does it meet performance targets? This is table stakes and what most benchmarks stop at.

Level 2 checks whether the code matches the reasoning. Did the model actually implement what it described? This catches post-hoc rationalization, where models claim they used a lookup table for efficiency and then write a case statement instead. It happens more often than you’d expect.

Level 3 compares prompting strategies head-to-head. We tested zero-shot, few-shot, chain-of-thought, and CoT with re-prompting. The results weren’t what I expected: plain CoT hit a 53% testbench pass rate, while CoT with re-prompting reached 61%. The gap isn’t huge, but the quality of failures changed. Models with explicit reasoning chains were 2.4x more likely to successfully fix their code when we fed error messages back to them. The chain gives the model a map of what it was trying to do, which makes error feedback actionable. Without it, re-prompting is just “try again.”

Level 4 is understanding how reasoning breaks. Four failure patterns kept recurring. Specification gaps: the model fills in ambiguous requirements incorrectly, like defaulting to synchronous reset when the spec says asynchronous. Complexity collapse: satisfying three out of five constraints and quietly ignoring the others. Template overfitting: a standard counter works, but adding an enable signal breaks everything because the model is matching a pattern, not understanding the circuit. And logical inconsistency: the chain contradicts itself, like using blocking assignments for sequential logic.
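
The Level 2 check (does the code match the reasoning?) can be partly automated for narrow properties. Below is a heuristic sketch for the reset example only: it treats a design as asynchronously reset if a reset-like signal appears with an edge in an `always` sensitivity list, and compares that with whether the reasoning text claims an async reset. The regexes and function names are my own assumptions, and a real pipeline would need a proper Verilog parser.

```python
import re

# Sensitivity list of an always / always_ff block, e.g. "posedge clk or posedge rst"
SENSITIVITY = re.compile(r"always(?:_ff)?\s*@\s*\(([^)]*)\)")
EDGE_SIGNAL = re.compile(r"(?:posedge|negedge)\s+(\w+)")
RESET_NAME = re.compile(r"rst|reset", re.IGNORECASE)

def has_async_reset(verilog_src):
    """Heuristic: a reset-like signal is edge-triggered in some sensitivity list."""
    for sensitivity in SENSITIVITY.findall(verilog_src):
        for signal in EDGE_SIGNAL.findall(sensitivity):
            if RESET_NAME.search(signal):
                return True
    return False

def reasoning_claims_async(reasoning):
    """True if the chain-of-thought mentions 'async' or 'asynchronous'."""
    return re.search(r"\basync", reasoning, re.IGNORECASE) is not None

def reasoning_matches_code(reasoning, verilog_src):
    """Flag a mismatch between the claimed and the implemented reset style."""
    return reasoning_claims_async(reasoning) == has_async_reset(verilog_src)

reasoning = "Handle async reset with an if statement."
sync_code = """
always @(posedge clk) begin
    if (rst) count <= 4'b0000;
    else count <= count + 1;
end
"""
print(reasoning_matches_code(reasoning, sync_code))
```

A check like this only covers one property, but it is cheap enough to run on every sample, which is what makes reasoning-output alignment trackable as a metric.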

Long-context makes everything harder

At Georgia Tech, we worked with 170 financial credit agreements and 20,139 multi-hop QA pairs. The reasoning challenge was fundamentally different from code generation: answers required synthesizing scattered information across long documents.

A typical question: “If Company A’s credit agreement allows a 2.5x debt-to-EBITDA ratio, and their covenant states a minimum EBITDA of $50M, and Section 7.3 limits total debt to $150M, what is the maximum additional debt they can take on?”

Answering correctly means extracting facts from different sections, recognizing that multiple constraints apply, and selecting the binding one. Models frequently got all the facts right and then applied only one constraint, ignoring the rest. The reasoning chain looked thorough. The answer was wrong.
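
Worked through as arithmetic, the question looks like this. The question as quoted does not give the company's actual EBITDA or its existing debt, so the sketch below assumes EBITDA sits at the covenant minimum and uses a hypothetical existing debt of $100M. All amounts are in millions of dollars.

```python
def max_additional_debt(ebitda, leverage_cap, absolute_cap, existing_debt):
    """Headroom under the tighter of the ratio-based cap and the absolute cap."""
    ratio_cap = leverage_cap * ebitda            # total debt allowed by the leverage covenant
    binding_cap = min(ratio_cap, absolute_cap)   # the binding constraint is the smaller cap
    return max(0.0, binding_cap - existing_debt)

headroom = max_additional_debt(
    ebitda=50.0,          # assumed: EBITDA at the covenant minimum
    leverage_cap=2.5,     # 2.5x debt-to-EBITDA
    absolute_cap=150.0,   # Section 7.3 limit on total debt
    existing_debt=100.0,  # hypothetical, not given in the question
)
print(headroom)
```

Under these assumptions the leverage covenant caps total debt at 125, which is tighter than the 150 limit. A model that applies only the Section 7.3 limit overstates the headroom by 25.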

Two things moved the needle. Forcing models to cite specific document sections reduced hallucination, because it’s harder to fabricate a fact when you have to point to where you found it. And chunking strategy mattered more than I expected: models that could see all relevant constraints simultaneously performed significantly better than those synthesizing across chunks.

The measurement paradox

Here’s something uncomfortable: better evaluation makes your numbers go down.

When we improved testbench coverage for RTL generation, the success rate dropped from 61% to 53%. But the code that survived the harder tests was genuinely better, with fewer edge-case bugs and more robust timing behavior. The earlier metrics were inflated by surface-level checks that missed real problems.

If your evaluation metrics keep improving without changes to the model or prompts, be suspicious. You might be measuring the easy parts and ignoring everything else.

What actually moved the needle

After over a thousand evaluations, a few prompting patterns consistently improved reasoning quality.

Explicit constraint enumeration (“list all requirements before you start solving”) reduced specification gaps. Self-verification steps (“check whether your solution satisfies each requirement”) caught complexity collapse. Structured output formats gave the model scaffolding for its reasoning. Error-aware re-prompting with specific failure feedback was the single biggest improvement for iterative workflows.

Some intuitions didn’t hold up. Longer reasoning chains didn’t improve accuracy; verbosity isn’t rigor. Temperature tuning didn’t fix systematic errors. Few-shot examples too similar to the test case caused the model to match surface patterns rather than learn the reasoning.

The uncomfortable bottom line

Chain-of-thought prompting is genuinely useful. It enables iterative debugging, provides audit trails, and improves accuracy on multi-step problems.

But it’s not a substitute for rigorous evaluation. The same model that explains quantum mechanics in accessible prose can generate a broken circuit with an impeccable-sounding justification.

If you’re using CoT in production: verify outputs independently. Build automated checks wherever possible. Track reasoning-output alignment, not just correctness. And remember that sometimes a correct answer with no reasoning chain beats a wrong answer with a beautiful one.

Robust reasoning matters more than occasional correctness. And measuring reasoning is harder than measuring answers.

This post draws from benchmarking work at Harvard University’s Edge Computing Lab (RTL code generation) and Georgia Tech’s FSI Lab (long-context evaluation).

Machine Unlearning: Making Models Forget Without Breaking Everything Else

2025-09-16T08:30:00+00:00

Suppose you’ve trained a language model on a few hundred billion tokens, and you get a GDPR request. Someone wants their data removed. Not just from your storage, but from the model itself.

In a traditional system, you’d delete the row and move on. In a neural network, that “row” is smeared across billions of parameters in ways nobody fully understands.

Welcome to machine unlearning: the problem of making models forget specific things while keeping everything else intact.

Knowledge isn’t stored the way you’d think

The fundamental issue is that neural networks don’t store facts in discrete locations. A single piece of knowledge (say, “X is the CEO of Y”) might be encoded through direct memorization in attention patterns, indirect associations (“X announced Y’s quarterly earnings”), reasoning chains that can reconstruct the fact, and even stylistic patterns in how the model discusses the company.

Suppressing the model’s ability to say “X is CEO of Y” doesn’t touch any of these indirect pathways. Someone with a clever prompt can often recover the information through a side channel.

This makes unlearning fundamentally different from deletion. You’re not removing a file. You’re trying to selectively ablate a distributed representation without knowing exactly where it lives.

The four scenarios driving demand

Privacy compliance is the most urgent. GDPR’s right to be forgotten assumes you can actually forget. When someone’s personal data was in a model’s training set, the legal expectation is removal, but the technical reality is murky. Can you even verify that all traces are gone? What if the model can reconstruct their information from other correlated examples?

Copyright disputes are a growing headache. If your model memorized passages from copyrighted books, rights holders want those passages unlearned. But the model didn’t just memorize text. It learned stylistic patterns, plot structures, conceptual relationships. Where exactly do you draw the line on “removed”?

Factual updates seem like they’d be easier. They aren’t. If a company has a new CEO, you can’t just tell the model. Adding new information doesn’t overwrite old information. The model ends up hedging between contradictory beliefs, and its confidence signals become unreliable.

Backdoor removal is the security angle. If an attacker poisoned your training data to embed a triggered behavior, you need to find and remove the association without damaging general capabilities, then verify it’s truly gone.

What people have tried

The most intuitive idea is fine-tuning the model on “forget” examples, training it to refuse or produce blank responses for targeted queries. It works on the surface. Underneath, the knowledge usually survives. An adversarial prompt can bypass the refusal and extract the original information. You’ve taught the model to lie about what it knows, not to actually forget it.

Gradient ascent takes the opposite approach: maximize the loss on the data you want forgotten, essentially running training in reverse.

\[\theta' = \theta + \alpha \nabla_\theta \mathcal{L}(\theta; D_{\text{forget}})\]

Appealing math. Messy practice. The step size $\alpha$ is hard to get right: too small and nothing happens, too large and the model collapses. There’s no guarantee you’re only affecting the targeted knowledge. Neighboring information often gets caught in the blast radius.
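
A toy sketch of that update on a one-parameter model $y = wx$ with mean squared error, to make the blast radius visible. The data points, the starting weight, and the step size are invented for illustration. Nothing here is specific to LLMs.

```python
def mse_loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def mse_grad(w, data):
    """Derivative of mse_loss with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def gradient_ascent_step(w, forget_set, alpha):
    """theta' = theta + alpha * grad L(theta; D_forget): move uphill on the forget loss."""
    return w + alpha * mse_grad(w, forget_set)

forget_set = [(1.0, 2.1), (2.0, 3.9)]  # hypothetical examples to forget
retain_set = [(3.0, 6.0)]              # hypothetical example that should stay intact

w = 2.0  # pretend this is the trained weight
for step in range(4):
    print(step, round(w, 4), round(mse_loss(w, forget_set), 4), round(mse_loss(w, retain_set), 4))
    w = gradient_ascent_step(w, forget_set, alpha=0.1)
```

The forget loss rises as intended, but so does the loss on the retained example, and both keep growing with every step because nothing bounds the ascent. That is the collapse risk in miniature.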

Model editing approaches like ROME and MEMIT are more surgical. They locate the MLP weight matrices that store specific factual associations and edit them directly: ROME applies a rank-one update to a single layer, and MEMIT spreads the update across several layers so that many facts can be edited at once. This works reasonably well for individual facts, but doesn’t scale to “forget this entire document.” And editing one fact can corrupt related knowledge in unexpected ways.

Influence functions try to compute how much each training example contributed to the model’s current state, then adjust as if those examples were never seen. Elegant in theory. Computing exact influence requires inverting the Hessian (the matrix of second-order derivatives of the loss), which is computationally prohibitive at the scale of modern LLMs. The approximations introduce error, and the linearity assumption underlying the approach is often wrong.

The verification trap

Even if you successfully unlearn something, proving it is its own problem.

The simplest check, asking the model directly, is also the least reliable. The model might refuse to answer while still “knowing” the information internally. Indirect retrieval through rephrasing, context manipulation, or multi-step reasoning often recovers supposedly-deleted knowledge. Membership inference attacks offer statistical evidence: if the model assigns suspiciously low loss to “forgotten” examples, they’re probably still encoded.
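
A minimal sketch of the loss-threshold version of that check, assuming you already have per-example losses for the supposedly forgotten examples and for a held-out set the model never saw. The quantile, the function name, and the loss values are all hypothetical, and published membership inference attacks are considerably more careful than this.

```python
def flag_still_memorized(forgotten_losses, holdout_losses, quantile=0.1):
    """Flag forgotten examples whose loss is below a low quantile of held-out losses.

    A loss far below what the model assigns to unseen data of the same kind
    is weak statistical evidence that the example is still encoded.
    """
    if not holdout_losses:
        raise ValueError("need held-out losses to calibrate the threshold")
    reference = sorted(holdout_losses)
    index = int(quantile * (len(reference) - 1))
    threshold = reference[index]
    flags = [loss < threshold for loss in forgotten_losses]
    return flags, threshold

# Hypothetical per-example losses after an unlearning pass
holdout = [2.0, 2.1, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0, 3.1, 3.3, 3.5]
forgotten = [0.4, 2.7, 0.9]

flags, threshold = flag_still_memorized(forgotten, holdout)
print(flags, threshold)
```

The comparison only means something if the held-out examples come from the same distribution as the forgotten ones. Otherwise a low loss may just reflect easier text.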

Then there’s the meta-question that’s easy to overlook: did the unlearning break anything else? If you remove knowledge of a specific CEO and accidentally degrade the model’s general business reasoning, you haven’t solved the problem. You’ve traded one failure for another.

Where this is actually heading

The honest takeaway is that perfect surgical unlearning for large language models is somewhere between very hard and impossible with current techniques. The distributed nature of knowledge in neural networks isn’t a bug in our approach; it’s a fundamental property of how these models learn.

The strategies that work today are less elegant than the research papers suggest. For privacy compliance, periodic retraining with problematic data excluded is still the gold standard. For factual updates, retrieval augmentation sidesteps the problem by keeping updateable knowledge outside the model. For safety, defense in depth (unlearning plus output filtering plus monitoring) beats any single technique.

The more interesting long-term question might not be “how do we make models forget” but “how do we build models that support granular updates from the start.” Modular architectures with explicit knowledge layers. Audit trails for data provenance. Clean separation between parametric memory and retrieved knowledge.

We designed databases for easy deletion. We didn’t design neural networks that way. Maybe we should start.

This post draws on research from Google (influence functions), MIT (ROME/MEMIT model editing), and ongoing work across the ML safety community.