A3-Bench

Benchmarking Memory-Driven Scientific Reasoning
via Anchor and Attractor Activation

1Xi'an Jiaotong University, 2National University of Singapore

Figure 1: Performance and token analysis across ten LLMs and three memory paradigms. The three color-coded groups correspond to the experimental paradigms: Vanilla, Anchors & Attractors, and Annotated Anchors & Attractors. Memory activation improves accuracy while keeping token costs controllable, supporting cognitively aligned evaluation and model development.

Abstract

Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory enables efficient reuse of knowledge and enhances the consistency and stability of reasoning. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms underlying human reasoning: activating anchors and attractors, then integrating them into multi-step inference.

To address this gap, we propose A3-Bench, a benchmark to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation.

First, we annotate 2,198 scientific reasoning problems across domains using the SAPM process (Subject, Anchor & Attractor, Problem, and Memory Developing). Second, we introduce a dual-scale memory evaluation framework built on anchors and attractors, along with the AAUI (Anchor-Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.

Leaderboard on A3-Bench

Performance comparison across three memory paradigms.
AAUI (Anchor-Attractor Utilization Index) measures how effectively the provided memory is used.
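The exact AAUI formula is not spelled out on this page. As a minimal sketch, assuming AAUI is the fraction of expert-annotated anchors and attractors that a model's reasoning trace actually invokes (matched here by simple case-insensitive substring search), it could be computed as:

```python
def aaui(trace: str, anchors: list[str], attractors: list[str]) -> float:
    """Fraction of annotated memory items (anchors + attractors)
    that appear in the model's reasoning trace."""
    items = anchors + attractors
    if not items:
        return 0.0
    text = trace.lower()
    used = sum(1 for item in items if item.lower() in text)
    return used / len(items)

# Hypothetical trace and annotations, for illustration only:
score = aaui(
    "Apply conservation of momentum, then check kinetic energy.",
    anchors=["conservation of momentum"],
    attractors=["elastic collision template"],
)
print(round(score, 2))  # 0.5 -- one of two annotated items was activated
```

In practice, matching is likely semantic rather than string-based; this sketch only illustrates the "fraction of memory items activated" interpretation of the metric.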

1. Annotated Anchors & Attractors

Models are provided with expert-annotated Anchors and Attractors. This represents the ideal memory utilization scenario.

# Model Avg. Acc (%) Math Physics Chemistry AAUI Tokens
1 Grok-4-Fast 🥇 65.10 75.94 60.12 78.75 0.97 2.64M
2 Qwen3-30B 🥈 60.60 73.18 67.08 71.67 0.95 2.73M
3 Qwen3-4B 🥉 58.92 72.18 61.52 69.17 0.92 2.68M
4 Claude-Haiku-4.5 56.37 70.93 56.21 74.58 0.77 2.26M
5 DeepSeek-V3.2 55.96 65.66 72.50 73.75 0.88 1.65M
6 GLM-4-32B 47.95 63.91 55.83 55.42 0.92 1.62M
7 GPT-OSS-120B 47.18 59.40 47.90 63.33 0.68 2.05M
8 Llama-3.1-70B 44.77 56.64 45.83 55.00 0.96 1.69M
9 GPT-5-Mini 25.34 36.59 32.57 22.92 0.74 2.33M
10 Gemini-2.5-Flash 19.75 37.84 35.07 8.33 0.69 2.34M

2. Retrieved Anchors & Attractors (HybridRAG)

Models retrieve memory automatically using HybridRAG. This evaluates the system's ability to find and use relevant knowledge.
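The internals of HybridRAG's Memory Twin-Needle Activator and Context Fabric Composer are not detailed on this page. As a minimal sketch of the general hybrid-retrieval idea only, the following blends a sparse term-overlap score with a dense-style cosine score (a bag-of-words cosine standing in for real embedding similarity); all names, weights, and the memory contents are illustrative assumptions:

```python
from collections import Counter
import math

def sparse_score(query: str, doc: str) -> float:
    # Term-overlap count: a crude stand-in for BM25-style sparse retrieval.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum((q & d).values()))

def dense_score(query: str, doc: str) -> float:
    # Bag-of-words cosine: a stand-in for embedding cosine similarity.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_retrieve(query: str, memory: list[str],
                    alpha: float = 0.5, k: int = 2) -> list[str]:
    # Blend the two scores and return the top-k memory entries.
    scored = [(alpha * sparse_score(query, m) + (1 - alpha) * dense_score(query, m), m)
              for m in memory]
    return [m for _, m in sorted(scored, key=lambda sm: sm[0], reverse=True)[:k]]

memory = ["newton's second law f = ma",
          "ideal gas law pv = nrt",
          "photosynthesis in plants"]
print(hybrid_retrieve("ideal gas law", memory, k=1))  # ['ideal gas law pv = nrt']
```

A production system would use a real sparse index and an embedding model; the blending weight `alpha` plays the same role as the fusion step in typical hybrid retrievers.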

# Model Avg. Acc (%) Math Physics Chemistry AAUI Tokens
1 Grok-4-Fast 🥇 56.69 68.92 72.50 65.00 0.66 3.17M
2 Claude-Haiku-4.5 🥈 51.18 64.66 60.42 62.50 0.46 2.58M
3 DeepSeek-V3.2 🥉 47.27 59.40 49.10 44.17 0.22 1.94M
4 GPT-OSS-120B 47.22 56.14 57.08 52.08 0.44 2.48M
5 Qwen3-30B 44.31 64.16 60.42 39.17 0.36 1.97M
6 GLM-4-32B 39.40 59.90 52.50 29.58 0.41 1.82M
7 Qwen3-4B 38.13 59.15 46.59 30.83 0.27 1.87M
8 Llama-3.1-70B 28.98 44.11 37.47 20.83 0.33 1.88M
9 Gemini-2.5-Flash 21.66 30.58 30.26 23.75 0.14 2.77M
10 GPT-5-Mini 18.74 26.32 24.35 18.33 0.09 2.74M

3. Vanilla (No Memory)

Standard zero-shot inference without external memory augmentation.

# Model Avg. Acc (%) Math Physics Chemistry Tokens
1 Qwen3-30B 🥇 48.41 55.64 48.90 60.83 1.81M
2 Grok-4-Fast 🥈 47.45 51.38 43.99 58.75 1.18M
3 DeepSeek-V3.2 🥉 45.36 46.37 38.28 60.42 0.70M
4 Claude-Haiku-4.5 41.26 43.11 50.42 64.58 0.96M
5 GPT-OSS-120B 40.76 49.12 40.18 50.42 0.44M
6 Qwen3-4B 37.76 47.62 42.89 46.67 1.91M
7 GLM-4-32B 25.20 37.59 28.56 30.00 0.44M
8 Llama-3.1-70B 23.96 34.21 26.94 28.21 0.57M
9 GPT-5-Mini 21.97 31.83 26.65 26.67 1.35M
10 Gemini-2.5-Flash 15.01 24.81 26.15 5.00 1.30M

A3-Bench Dataset

Dataset Overview

A3-Bench is a comprehensive benchmark designed to evaluate memory-driven scientific reasoning. It is constructed via the SAPM process (Subject, Anchor & Attractor, Problem, Memory Developing).

The dataset contains a total of 2,198 problems, distributed across:

  • Math (45.4%), Physics (27.3%), and Chemistry (27.3%).
  • Difficulty is balanced: roughly 40% Easy, 30% Medium, and 30% Hard.
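Taking the reported total at face value, the subject percentages above imply the following approximate per-subject problem counts (a back-of-the-envelope check, not official splits):

```python
total = 2198  # total problems in A3-Bench
shares = {"Math": 0.454, "Physics": 0.273, "Chemistry": 0.273}

counts = {subject: round(total * share) for subject, share in shares.items()}
print(counts)                # {'Math': 998, 'Physics': 600, 'Chemistry': 600}
print(sum(counts.values()))  # 2198 -- rounding happens to preserve the total
```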

We model scientific reasoning memory at two scales: Anchors (foundational knowledge) and Attractors (experience-based templates).

Dataset Statistics
Table 1: Key Statistics of A3-Bench.

The SAPM Construction Process

Figure 2: The four-step SAPM annotation process: Subject Benchmarking, Anchors & Attractors Developing, Problem Reconstructing, and Memory Mapping.

Evaluation Framework

Figure 3: The HybridRAG framework, utilizing Memory Twin-Needle Activator and Context Fabric Composer to simulate human-like memory retrieval.

Analysis & Insights

Efficiency vs. Performance


Figure 4: Inference time vs. Accuracy. The Annotated paradigm (Green) achieves the highest accuracy with moderate latency, significantly outperforming the Vanilla baseline.

Subject & Difficulty Breakdown


Figure 5: Performance heatmap. The memory mechanism provides the most significant gains in Hard physics and chemistry problems, where multi-hop reasoning is required.

Error Attribution


Figure 6: Distribution of error types. Activating Anchors and Attractors significantly reduces Knowledge Errors and Reasoning Hallucinations.

Robustness to Noise


Figure 7: Impact of interfering memory. Models demonstrate robustness against irrelevant anchors (noise), maintaining stable accuracy even as distraction levels increase.

Case Studies

Qualitative analysis of dataset samples and reasoning trajectories.

BibTeX

@misc{zhang2026a3benchbenchmarkingmemorydrivenscientific,
      title={$A^3$-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation},
      author={Jian Zhang and Yu He and Zhiyuan Wang and Zhangqi Wang and Kai He and Fangzhi Xu and Qika Lin and Jun Liu},
      year={2026},
      eprint={2601.09274},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09274},
}