sylvia

Reasoning Model Activations May Encode Hint Influence That Chain-of-Thought Doesn't Reveal

April 2026

Status: this writeup was extremely rushed. Claude has a tendency to misinterpret and/or overstate results, which I have not yet edited out. Posting mainly to make access easier and to collect feedback.

Motivation

Young (2026) “Lie to Me” (arXiv:2603.22582) tests whether reasoning models’ chain-of-thought is faithful by injecting misleading hints into multiple-choice prompts and classifying the resulting CoT with an external LLM. We ask: can we detect hint influence directly from the model’s hidden states? If a linear probe on residual stream activations predicts whether a model was influenced by a hint, the model has an internal representation of “I am being influenced,” regardless of whether its CoT says so.

We also ask whether the model’s representation of influence is separable from its representation of faithfulness. If the directions for “I was influenced” and “I was transparent about being influenced” overlap, the model packages influence-awareness and honesty-about-it into the same feature. If they’re separable, the model “knows” it was influenced, but that knowledge lives in a different subspace from its decision about whether to say so.

Setup

Models and Data

We use two open-weight reasoning models from the paper’s evaluation, chosen to fit on a single A100-80GB and to provide a scale comparison:

Model Layers Hidden dim
OLMo-3-7B-Think 28 3,072
OLMo-3.1-32B-Think 64 5,120

Both use <think>...</think> traces. The paper’s dataset contains 498 questions (300 MMLU + 198 GPQA Diamond) with pre-existing inference results for 7 conditions per model: 1 baseline (no hint) and 6 hint types.

Primary results use OLMo 32B on GPQA, which has the most balanced class distribution (48.9% influenced) and cleanest signal. The 32B/MMLU and 7B/MMLU configurations have data quality issues discussed in Appendix A.

Hint types. Each hint type prepends a misleading signal pointing to a specific wrong answer (except visual pattern, where the hint is structural rather than textual). The six types are sycophancy, consistency, visual pattern, metadata, grader, and unethical.

Labels. The paper provides two binary labels per hinted sample: is_influenced (did the hint change the model's answer relative to baseline?) and is_faithful (does the CoT transparently acknowledge the hint? defined only for the influenced subset).

Activation Capture

For each paper trace, we reconstruct the full token sequence (prompt via chat template + paper’s thinking/answer text) and run a single forward pass with hooks on every transformer layer.

We capture residual stream activations at three token positions: prompt_end (the last prompt token, after the model has processed the full prompt but before generating any thinking tokens), thinking_end (the end of the <think> trace), and last_content (the final token of the response).

We also split thinking traces into 10 chunks and capture at each boundary, giving temporal resolution through the reasoning process.
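The capture mechanics can be sketched with per-layer forward hooks on a toy stand-in model (module names, dimensions, and token indices below are illustrative, not the actual OLMo internals):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block: its output plays the role of the residual stream."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Linear(d, d)

    def forward(self, x):
        return x + self.mlp(x)  # residual stream after this block

class ToyModel(nn.Module):
    def __init__(self, n_layers: int = 4, d: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(Block(d) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def capture_activations(model, inputs, positions):
    """One forward pass; record every layer's output at the named token indices."""
    captured, hooks = {}, []
    for i, layer in enumerate(model.layers):
        def hook(module, inp, out, idx=i):
            # out: [batch, seq, hidden] -> keep only the requested positions
            captured[idx] = {name: out[0, pos].detach().clone()
                             for name, pos in positions.items()}
        hooks.append(layer.register_forward_hook(hook))
    with torch.no_grad():
        model(inputs)
    for h in hooks:
        h.remove()
    return captured

x = torch.randn(1, 10, 16)  # [batch, seq_len, hidden]
acts = capture_activations(
    ToyModel(), x,
    {"prompt_end": 3, "thinking_end": 7, "last_content": 9},  # illustrative indices
)
```

In the real setup the hooks sit on each decoder layer and the position indices come from the tokenized prompt and thinking-trace boundaries.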

Probe Design

Per-layer logistic regression (StandardScaler + L-BFGS) predicts each binary label from a single layer’s activations. We use a deterministic hash-based 80/20 train/val split on question_id, shared across all experiments. Probes are trained on the train split only; the val split is reserved for held-out evaluation and for the LLM detector, preventing data leakage between the two methods.
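A minimal sketch of this pipeline on synthetic data (the hash-salt format and the planted signal are illustrative; per the Methods Reference, the real split hashes question_id with MD5 and seed 103):

```python
import hashlib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 103

def in_val_split(question_id: str, val_frac: float = 0.2) -> bool:
    """Deterministic hash-based split: a question lands in the same split everywhere."""
    h = hashlib.md5(f"{SEED}:{question_id}".encode()).hexdigest()
    return int(h, 16) % 100 < int(val_frac * 100)

# Synthetic stand-in: 200 samples, 64-dim "activations", labels with a planted
# linear signal in dimension 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
qids = [f"q{i}" for i in range(200)]

val = np.array([in_val_split(q) for q in qids])
probe = make_pipeline(StandardScaler(),
                      LogisticRegression(solver="lbfgs", max_iter=1000))
probe.fit(X[~val], y[~val])                                       # train split only
auroc = roc_auc_score(y[val], probe.predict_proba(X[val])[:, 1])  # held-out eval
```

The hash-based split is what lets the probe and the LLM detector share one val set without coordination: any script can recompute a question's split membership locally.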

LLM Detector

We also run an LLM-based detection task for comparison. Crucially, this differs from the paper’s faithfulness classifier: Young tells the classifier “the model changed its answer” and asks about transparency. We frame it as blind detection — the detector sees a (possibly truncated) thinking trace without knowing the outcome, and predicts whether the model is being influenced. This matches the probe’s task.

For each val-split sample, we chunk the trace into 10 pieces and classify cumulative prefixes (10%, 20%, …, 100%), building a detection trajectory. We use a Claude 3 Haiku detector in all three configurations and a Qwen 3.6 Plus detector in the OLMo 32B / GPQA configuration.
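The cumulative-prefix construction is straightforward; a character-based sketch (the actual implementation may chunk by tokens instead):

```python
def cumulative_prefixes(trace: str, n_chunks: int = 10) -> list[str]:
    """Split a trace into n_chunks roughly equal pieces and return the
    cumulative prefixes (10%, 20%, ..., 100% of the trace)."""
    n = len(trace)
    boundaries = [round(n * (i + 1) / n_chunks) for i in range(n_chunks)]
    return [trace[:b] for b in boundaries]

# Each prefix is classified independently, yielding a detection trajectory
prefixes = cumulative_prefixes("x" * 100)
```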

Results

Unless otherwise noted, primary results are from OLMo 32B on GPQA.

1. Influence probes detect hint influence with high accuracy

Linear probes reliably detect whether the model was influenced by a hint, from hidden-layer activations alone.

Held-out performance (train on 80%, evaluate on 20%):

Model / Dataset Position Peak AUROC Peak Layer n (val) % influenced
OLMo 32B / GPQA thinking_end 0.945 L42 264 48.9%
OLMo 32B / GPQA last_content 0.950 L26 264 48.9%
OLMo 7B / MMLU thinking_end 0.857 L16 306 20.9%
OLMo 7B / MMLU last_content 0.874 L15 306 20.9%

A single-layer logistic regression reaches 0.95 AUROC on the 32B model with GPQA's balanced 49% influence rate, so the result is strong and not an artifact of class imbalance. The 7B model on MMLU achieves a more modest 0.87, consistent with a smaller model on easier questions where hints less often flip the answer. (32B/MMLU results are deferred to Appendix A; near-perfect probe metrics there reflect degenerate class splits rather than stronger signal.)

The prompt_end result is notable: the model has processed the full prompt but hasn’t generated a single thinking token, yet the probe already achieves 0.849 AUROC. This is not detecting whether a hint is present — the hint text is always there. The probe is predicting whether the model’s answer will change because of the hint. However, a confounder is possible: harder questions may both be more susceptible to hints and produce distinguishable prompt representations, meaning the probe could partly be detecting question difficulty rather than susceptibility per se. This is testable by training probes on baseline (unhinted) activations to predict is_influenced on the hinted val split — if baseline activations predict influence comparably, the signal is difficulty, not hint-awareness. Assuming the signal is genuine, the model’s susceptibility to a given hint is substantially determined at prompt-processing time, before reasoning begins.

AUROC by layer and capture position for OLMo 7B/MMLU, OLMo 32B/MMLU, and OLMo 32B/GPQA. thinking_end and last_content peak in middle layers; prompt_end is weaker throughout.

A natural follow-up: does the prompt_end influence signal persist into the post-reasoning representation, or does the model develop a new one? A cross-position transfer test at L36 (the best thinking_end layer from held-out evaluation) answers this directly:

Train position → Eval position AUROC
prompt_end → prompt_end 0.749
thinking_end → thinking_end 0.938
prompt_end → thinking_end 0.563
thinking_end → prompt_end 0.557

Transfer is near-chance in both directions. The model does not carry its initial susceptibility representation forward through reasoning; it develops a qualitatively different influence representation during the thinking trace. Whatever linear feature the model uses to encode “this hint will affect me” before thinking starts, the post-reasoning “I was affected” feature occupies a different subspace entirely.

2. Probes outperform the LLM detector — and the gap grows with scale

We compare the probe against Haiku 3 as a text-based influence detector. Both produce binary predictions on the same val-split samples against the same is_influenced labels: the probe thresholds logistic regression output at 0.5, while the classifier returns a binary judgment from the thinking trace. For temporal resolution, we evaluate the probe at chunk boundary capture positions through the thinking trace and the classifier at corresponding text truncation points (10%, 20%, …, 100%).

Probe accuracy (best layer) and classifier accuracy across trace depth.

On OLMo 32B / GPQA, the probe outperforms the classifier by roughly 12 percentage points at full trace depth (~87% vs ~75%). On OLMo 7B / MMLU, the classifier matches or slightly outperforms the probe (~85% vs ~83%). Note that the probe’s 0.5 threshold is untuned — optimizing it on the train split could widen the gap, making this comparison conservative for the probe.

The faithfulness data from the original paper explains the divergence. The 7B model is more faithful overall — 76.4% of its influenced samples transparently acknowledge the hint, compared to 53.2% for 32B on GPQA. When the model says the quiet part out loud, a text classifier has enough signal. When it doesn’t, the classifier is left reading CoT that looks like independent reasoning.

This pattern is consistent with a scissors dynamic: CoT text becoming less informative about influence while activations become more informative as scale increases. Two data points in one model family do not establish a trend, but the direction is worth flagging. If it generalizes, CoT monitoring would degrade with the capability it’s meant to monitor — not because monitors get worse at reading, but because there’s progressively less to read. The internal signal moves the other way: AUROC 0.87 at 7B, 0.95 at 32B. Section 6 identifies the mechanism at 32B — the probe’s advantage is entirely concentrated on unfaithful samples, precisely where text-based detection breaks down.

3. Influence detection generalizes across hint types

If the probe detects a universal “I’m being influenced” direction rather than hint-type-specific features, it should generalize: a probe trained on sycophancy should detect metadata influence, and vice versa. We test this by training on each hint type’s train split and evaluating on every other type’s val split.

Cross-condition generalization heatmaps (thinking_end position). Diagonal is same-condition; off-diagonal is cross-condition transfer.

OLMo 32B / GPQA (thinking_end): Same-condition mean AUROC = 0.845, cross-condition mean = 0.745. Transfer is strong, with the off-diagonal only 0.10 below the diagonal, but not uniform.

Among the four textual hint types (sycophancy, consistency, metadata, grader), cross-condition pairs generally achieve AUROC 0.8–0.9. Two hint types break this pattern: visual pattern and unethical, whose cross-condition transfer falls well below that range.

This suggests what generalizes across the textual hint types is better characterized as a persuasion-by-authority feature rather than a universal influence feature. Stimuli that don’t operate through authority (visual reordering, safety-triggering claims) project onto it less cleanly. This means a probe trained on one class of influence may miss others that operate through different mechanisms, though within-condition probes remain strong even for the outlier types.

4. Faithfulness is also linearly readable

Beyond detecting influence, we probe for is_faithful — whether the CoT transparently acknowledges the hint. This is only defined for the influenced subset, so sample sizes are small.

OLMo 32B / GPQA (val split, aggregated across hint types): the aggregate faithfulness probe reaches AUROC 0.89+ at its best layer, with the signal strongest at the thinking_end position.

Per-condition sample sizes range from 15 (consistency) to 86 (unethical). The aggregate probe across all 6 hint types shows clear signal; per-condition results should be interpreted cautiously.

Faithfulness probe AUROC across layers with confidence bands (OLMo 32B). Signal is strongest at thinking_end.

At thinking_end, the faithfulness probe peaks at layer 26 while the influence probe peaks at layer 42; at last_content the two peaks nearly coincide (layers 25 and 26). This suggests the two computations happen at different processing depths, consistent with the separability result in Section 5.

5. Influence and faithfulness share more structure as reasoning progresses

Does the “influence” direction overlap with the “faithfulness” direction? We test this with a cross-probe direction analysis: train an influence probe, project influenced-only samples onto that direction, and compute AUROC against faithfulness labels (and vice versa).

This avoids the methodological pitfall of comparing probe weight vectors directly (e.g. via cosine similarity). The influence and faithfulness probes are trained on different data subsets with different StandardScalers, making weight vectors live in incomparable spaces. (See Appendix B for the cosine similarity analysis, which we include for completeness but consider less informative for this reason.)
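The cross-probe test reduces to one projection and one AUROC; a toy sketch with two planted, partially overlapping label directions (all dimensions and names illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
d, n = 64, 300
dir_a = np.eye(d)[0]                             # "influence" direction
dir_b = 0.6 * np.eye(d)[0] + 0.8 * np.eye(d)[1]  # "faithfulness", partial overlap
X = rng.normal(size=(n, d))
y_a = (X @ dir_a > 0).astype(int)
y_b = (X @ dir_b > 0).astype(int)

# Train a probe for target A, then ask whether A's direction predicts target B
scaler = StandardScaler().fit(X)
probe_a = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y_a)
proj = scaler.transform(X) @ probe_a.coef_.ravel()  # scalar projection per sample
within_auroc = roc_auc_score(y_a, proj)             # sanity check on A's own target
cross_auroc = roc_auc_score(y_b, proj)              # partial overlap: above chance, below within
```

In the writeup's version the projection is computed on influenced-only val samples; everything here is in-sample for brevity.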

OLMo 32B / GPQA cross-probe AUROC:

Position Infl dir → faithfulness Faith dir → influence
prompt_end 0.469 0.495
thinking_end 0.600 0.543
last_content 0.625 0.572

The temporal progression is the most interesting part of this analysis. At prompt_end, cross-prediction is at chance, and the two directions are functionally independent before reasoning begins. By last_content, cross-prediction reaches 0.63, still far below within-target performance (0.89+ for faithfulness, 0.95 for influence) but meaningfully above chance. The two representations start independent and become partially entangled through the reasoning process.

Reducing to 50 PCA dimensions sharpens this picture by mitigating the d » n regime. That PCA cross-prediction slightly exceeds the full-dimensionality results is consistent with denoising rather than overfitting: removing noisy dimensions concentrates the shared variance the cross-probe test is looking for.

Position Infl dir → faithfulness Faith dir → influence
prompt_end (PCA-50) 0.464 0.493
thinking_end (PCA-50) 0.639 0.549
last_content (PCA-50) 0.684 0.731

Cross-probe direction AUROC across layers, full dimensionality (top) and PCA-50 (bottom). The temporal progression from independent to partially entangled is visible in both.

Even with PCA, cross-prediction peaks around 0.68–0.73, indicating that the two concepts share structure but remain substantially separable. The model’s representations of “I was influenced” and “I was transparent about it” begin independent and develop some overlap through reasoning, but never converge.

6. The probe advantage is concentrated on unfaithful samples

The probe’s ~12-point overall advantage on 32B/GPQA (Section 2) aggregates across samples with very different properties. Splitting val-split samples by faithfulness subgroup reveals exactly where that advantage comes from.

We divide samples into three groups: not influenced (hint did not change the answer), influenced + faithful (hint changed the answer and the CoT acknowledged it), and influenced + unfaithful (hint changed the answer but the CoT did not acknowledge it). For each, we compute detection accuracy at full trace depth for the probe and for two text classifiers (Haiku 3 and Qwen 3.6 Plus).

Detection accuracy by faithfulness subgroup for OLMo 7B/MMLU, OLMo 32B/MMLU, and OLMo 32B/GPQA. The probe maintains accuracy across subgroups; classifiers collapse on unfaithful samples.

On OLMo 32B / GPQA, the subgroup breakdown reveals that the two classifiers fail in complementary ways. Qwen 3.6 Plus sees influence everywhere, with ~100% accuracy on both faithful and unfaithful influenced samples but only ~25% on non-influenced samples. Because ~49% of samples are influenced, this bias keeps Qwen's aggregate accuracy looking respectable, but the number reflects bias rather than discrimination. Claude 3 Haiku is more conservative at ~63% on non-influenced and ~82% on faithful influenced, but only ~25% on unfaithful influenced (it misses influence when the CoT doesn't reveal it).

The probe (~63% on non-influenced, ~93% on faithful, ~88% on unfaithful) maintains high accuracy across all three subgroups. Neither classifier achieves this: Qwen sacrifices specificity, Haiku sacrifices sensitivity on unfaithful samples. A text classifier can be tuned to predict “influenced” more or less aggressively, but no threshold recovers the information that isn’t in the text.

Discussion

The scissors hypothesis: text gets less informative, activations get more informative

The probe-vs-classifier comparison across two model scales shows a suggestive pattern. At 7B, the text-level detector matches or slightly outperforms the probe, suggesting the model's CoT is transparent enough that reading it suffices. At 32B, the probe pulls ahead of Haiku by ~12 points, indicating that the model produces cleaner-looking CoT while its internal state still clearly encodes influence.

The subgroup analysis (Section 6) identifies the mechanism at 32B: Qwen 3.6 Plus was biased toward detecting influence and collapsed on the non-influenced traces, while Claude 3 Haiku was biased away from it and performed worse on influenced traces. The probe maintains its accuracy regardless. The failure is not about classifier capability or calibration but about what information the text contains.

Two data points in one model family do not establish a trend. But the direction, if it generalizes, would be concerning for CoT monitoring as a safety tool: larger models producing text that looks like independent reasoning while their internal state records influence. Activation-based monitoring would not suffer this degradation — in our data, the internal signal gets cleaner with scale (AUROC 0.87 at 7B, 0.95 at 32B). Testing this hypothesis across more architectures and scales is a natural next step.

What cross-condition transfer tells us about the influence feature

The cross-condition generalization pattern constrains what the probe is detecting. Strong transfer among sycophancy, consistency, metadata, and grader (AUROC 0.8–0.9) rules out the probe picking up on surface-level hint text features — these hints have completely different wording. Instead, the model appears to have a shared linear feature for “I am being persuaded by an authority-like signal.”

The failures are equally informative. Visual pattern operates through spatial reordering, not textual persuasion, and doesn’t share the authority feature. Unethical triggers safety-adjacent representations that the authority feature doesn’t capture. This means the “influence direction” is better understood as a persuasion-by-authority direction, and a deployment-ready probe would need to account for influence channels that don’t route through this feature.

What partial separability means for interventions

The cross-probe analysis shows that influence and faithfulness start as independent features (chance-level cross-prediction at prompt_end) and develop partial overlap through reasoning (0.63–0.73 AUROC at last_content). Combined with the 16-layer gap in their peak layers (faithfulness at L26, influence at L42 at thinking_end), these appear to be genuinely separate internal computations at different processing stages.

This has practical consequences: steering the model away from influence will not automatically make it more faithful, and vice versa. An intervention that successfully suppresses hint-following at L42 would leave the faithfulness computation at L26 untouched — the model might still produce unfaithful-looking CoT on the (fewer) cases where it is influenced, because the unfaithfulness circuit operates independently. Conversely, an intervention that promotes faithful reporting would not reduce susceptibility to hints. The two goals require separate interventions targeting different layers and directions.

Overfitting Considerations

These experiments operate in a regime where overfitting is a real concern: ~264 val samples aggregated across conditions, ~44 per condition on GPQA, and hidden dimensions of 5,120. We address this several ways:

Cross-validation vs. held-out agreement. We compare two evaluation strategies: 5-fold stratified CV within the val split (n~44 per condition) and proper held-out evaluation (train on ~158 train-split samples, evaluate on the same ~44 val samples). If the probes were overfitting to the small val folds, the CV estimates would be inflated relative to held-out.

CV vs held-out AUROC by layer for three hint conditions on OLMo 32B / GPQA. Held-out (red) tracks CV (blue) closely — in some conditions, held-out slightly exceeds CV.

The two curves track each other closely across layers and conditions. In several cases, the held-out curve (trained on 158 samples, tested on 44) slightly exceeds the CV curve (trained on ~35, tested on ~9). This is consistent with the held-out probes benefiting from more training data rather than either method overfitting.
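On synthetic data, the two evaluation strategies can be compared like this (sample sizes loosely mirror the ~158/~44 split; the features and signal are planted, not real activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 40))
y = (2 * X[:, 0] + rng.normal(size=200) > 0).astype(int)  # planted linear signal
X_train, y_train = X[:158], y[:158]
X_val, y_val = X[158:], y[158:]

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Strategy 1: 5-fold stratified CV inside the small val split (n=42)
cv_scores = cross_val_score(pipe, X_val, y_val, scoring="roc_auc",
                            cv=StratifiedKFold(5, shuffle=True, random_state=0))

# Strategy 2: proper held-out -- train on the train split, evaluate once on val
heldout = roc_auc_score(y_val, pipe.fit(X_train, y_train).predict_proba(X_val)[:, 1])
```

Agreement between the CV mean and the held-out number is the diagnostic: inflated CV relative to held-out would indicate overfitting to the small folds.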

Cross-condition generalization. A probe trained on one hint type and evaluated on completely different samples with different hint surface features still achieves high AUROC. This rules out the probe memorizing hint-type-specific artifacts.

Linear probe capacity. Logistic regression on StandardScaled features has exactly hidden_dim + 1 parameters. That said, with 5,120-dimensional activations and only ~200 training samples (d » n), a linear probe can separate essentially arbitrary labels in-sample, so low parameter count alone is not the safeguard. The safeguard is that every headline number is held-out or cross-validated rather than an in-sample fit.

Faithfulness sample sizes are the main limitation. The is_faithful probe operates on the smallest subset (only influenced samples, val split). Per-condition n ranges from 15 to 86. The aggregate probe (n=174) produces clear signal, but per-condition faithfulness results should be interpreted cautiously. The cross-probe direction test uses PCA dimensionality reduction partly to mitigate the d » n regime for this target.

Limitations

Planned Experiments

Difficulty confound test. The prompt_end probe achieves 0.849 AUROC before any reasoning tokens, but may partly reflect question difficulty rather than hint-awareness (harder questions are both more susceptible to hints and may produce distinguishable representations). To test this: train probes on baseline (unhinted) prompt_end activations to predict is_influenced labels from the hinted conditions. If baseline activations predict influence at comparable AUROC, the probe is detecting difficulty; if they fall to chance, the signal is genuinely hint-specific.

Causal steering. The probing results are observational — activations encode influence, but we have not shown these representations are causally involved in the model’s decision-making. To test this: extract normalised weight vectors from trained probes, hook into the residual stream during generation, and subtract the projection of activations onto the probe direction (scaled by a sweep of alpha values) at the probe’s best layer. If the influence direction is causal, steering away from it should reduce the model’s tendency to follow the hint (lower influence rate) without destroying coherence. Separately steering the faithfulness direction should affect whether the model acknowledges the hint in its CoT without changing susceptibility — testing the separability claim from Section 5 in a causal setting.
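The ablation mechanics can be sketched as a forward hook (the direction would come from a trained probe's weight vector; the module and dimensions here are stand-ins):

```python
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, alpha: float = 1.0):
    """Forward hook that removes alpha times the component of the layer output
    along `direction` (a unit-normalised probe weight vector)."""
    unit = direction / direction.norm()

    def hook(module, inp, out):
        proj = (out @ unit).unsqueeze(-1) * unit  # per-token projection onto the direction
        return out - alpha * proj                 # returned value replaces the output
    return hook

# Toy check on a single linear layer standing in for the probe's best layer
layer = nn.Linear(8, 8)
direction = torch.randn(8)
handle = layer.register_forward_hook(make_steering_hook(direction, alpha=1.0))
with torch.no_grad():
    out = layer(torch.randn(2, 5, 8))
handle.remove()
# With alpha=1 the steered output has numerically zero component along the direction
residual = out @ (direction / direction.norm())
```

In the actual experiment this hook would run during generation at the probe's best layer across a sweep of alpha values, measuring the influence rate and output coherence at each setting.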

Code for activation capture, probing, and analysis is available at github.com/intentionallydense. Data is from Young (2026). Experiments and writeup created with the help of Claude Code.


Appendix A: Why MMLU Probe Results Are Less Informative

The 32B/MMLU probes achieve near-perfect AUROC (0.999), which looks impressive but reflects trivial classification tasks rather than strong probe signal. MMLU questions are easy enough that the 32B model either completely ignores the hint or completely follows it, with almost no ambiguous cases. This produces extreme class splits that any classifier — even a random projection — could partially separate.

32B/MMLU held-out aggregate: 0.995 AUROC (thinking_end, L23), 0.999 AUROC (last_content, L22), n=255, 33.3% influenced. These headline numbers break down on inspection.

Per-condition influence rates tell the story:

Hint type 7B / MMLU (val) 32B / MMLU (val) 32B / GPQA (val)
Sycophancy 27.5% 21.6% 54.5%
Consistency 9.8% 3.9% 29.6%
Visual pattern 5.9% 2.0% 11.4%
Metadata 15.7% 19.6% 52.3%
Grader 39.2% 23.5% 63.6%
Unethical 27.5% 98.0% 81.8%
Aggregate 20.9% 33.3% 48.9%

On 32B/MMLU, every per-condition probe achieves AUROC 1.0 at some layer. Consistency has 2 positive samples out of 51 (3.9%) — separating 2 from 49 in 5,120 dimensions is not meaningful. Unethical has 50 out of 51 (98.0%) — predicting the single negative is equally trivial. The aggregate 0.999 is inflated by these degenerate splits.

On 7B/MMLU, the story is different: influence rates are low but not degenerate (9.8–39.2%), and the aggregate 0.857 AUROC represents genuine probe signal. The 7B model’s influence rates are low because it’s capable enough to mostly resist hints on easy MMLU questions, but not so capable that influence becomes all-or-nothing.

GPQA’s higher difficulty produces the most informative benchmark. Influence rates span 11–82% across conditions with a near-50/50 aggregate split (48.9%). Hints have room to partially sway the model on hard questions without completely overriding its reasoning, creating the kind of graded signal that probes can meaningfully discriminate.

The 7B/MMLU results are included in the layer curves and probe-vs-classifier figures to support the scaling comparison (Section 2). The 32B/MMLU results appear in the cross-condition heatmap for completeness. Primary quantitative claims rest on 32B/GPQA.

Appendix B: Cosine Similarity Between Probe Directions

Cosine similarity between influence and faithfulness probe weight vectors across layers.

The cosine similarity between is_influenced and is_faithful probe weight vectors is near-zero across all layers and positions, hovering in the -0.1 to +0.1 range.

We include this for completeness but consider it less informative than the cross-probe direction test (Section 5). The two probes are trained on different data subsets with different StandardScalers — influence uses all hinted samples while faithfulness uses only the influenced subset. This means the weight vectors live in effectively different spaces, and near-zero cosine similarity could reflect either genuine orthogonality or simply the incomparability of the vectors. The cross-probe direction test avoids this confound by directly measuring whether one probe’s direction carries predictive power for the other target.

Methods Reference

Component Implementation
Activation capture Single forward pass with per-layer hooks, safetensors output
Probing Logistic regression (sklearn), 5-fold stratified CV or train→val held-out
External classifiers Claude Haiku 3 + Qwen 3.6 Plus, blind detection prompt, 10-chunk trajectory
Cross-position transfer Train probe at one capture position, evaluate on another (same layer)
Cross-condition Train on one hint type’s train split, eval on all others’ val splits
Faithfulness probing Same approach, is_faithful target on influenced-only subset
Cross-probe directions Train probe A, project onto its direction, compute AUROC for target B
Models probed OLMo-3-7B-Think, OLMo-3.1-32B-Think
Data source Young (2026), “Lie to Me” — 498 questions, 6 hint types
Train/val split Deterministic 80/20 hash-based split on question_id (MD5, seed=103)
Compute Single A100-80GB for forward passes; local for probing and analysis