A3-Bench

Benchmarking Memory-Driven Scientific Reasoning
via Anchor and Attractor Activation

1Xi'an Jiaotong University, 2National University of Singapore

Figure 1: Performance and token analysis across ten LLMs and three memory paradigms. The three color-coded groups correspond to the experimental paradigms: Vanilla, Anchors & Attractors, and Annotated Anchors & Attractors. Memory activation improves accuracy while keeping token costs controllable, supporting cognitively aligned evaluation and model development.

Abstract

Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory enables efficient reuse of knowledge and enhances the consistency and stability of reasoning. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms underlying human reasoning: activating anchors and attractors, then integrating them into multi-step inference.

To address this gap, we propose A3-Bench, a benchmark to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation.

First, we annotate 2,198 scientific reasoning problems across domains using the SAPM process (Subject, Anchor & Attractor, Problem, and Memory Developing). Second, we introduce a dual-scale memory evaluation framework built on anchors and attractors, along with the AAUI (Anchor-Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.

Leaderboard on A3-Bench

Performance comparison across three memory paradigms.
AAUI (Anchor-Attractor Utilization Index) measures how effectively the provided memory is used.
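The exact AAUI formula is not spelled out on this page. As a minimal sketch, assuming AAUI is the fraction of expert-annotated anchors and attractors that a model's reasoning trace actually invokes (matched here by simple case-insensitive substring search), it could be computed as:

```python
def aaui(trace: str, anchors: list[str], attractors: list[str]) -> float:
    """Fraction of annotated memory items (anchors + attractors)
    that appear in the model's reasoning trace."""
    items = anchors + attractors
    if not items:
        return 0.0
    text = trace.lower()
    used = sum(1 for item in items if item.lower() in text)
    return used / len(items)

# Hypothetical trace and annotations, for illustration only:
score = aaui(
    "Apply conservation of momentum, then check kinetic energy.",
    anchors=["conservation of momentum"],
    attractors=["elastic collision template"],
)
print(round(score, 2))  # 0.5 -- one of two annotated items was activated
```

In practice, matching is likely semantic rather than string-based; this sketch only illustrates the "fraction of memory items activated" interpretation of the metric.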

1. Annotated Anchors & Attractors

Models are provided with expert-annotated Anchors and Attractors. This represents the ideal memory utilization scenario.

# Model Avg. Acc (%) Math Physics Chemistry AAUI Tokens
1 Grok-4-Fast 🥇 65.10 75.94 60.12 78.75 0.97 2.64M
2 Qwen3-30B 🥈 60.60 73.18 67.08 71.67 0.95 2.73M
3 Qwen3-4B 🥉 58.92 72.18 61.52 69.17 0.92 2.68M
4 Claude-Haiku-4.5 56.37 70.93 56.21 74.58 0.77 2.26M
5 DeepSeek-V3.2 55.96 65.66 72.50 73.75 0.88 1.65M
6 GLM-4-32B 47.95 63.91 55.83 55.42 0.92 1.62M
7 GPT-OSS-120B 47.18 59.40 47.90 63.33 0.68 2.05M
8 Llama-3.1-70B 44.77 56.64 45.83 55.00 0.96 1.69M
9 GPT-5-Mini 25.34 36.59 32.57 22.92 0.74 2.33M
10 Gemini-2.5-Flash 19.75 37.84 35.07 8.33 0.69 2.34M

2. Retrieved Anchors & Attractors (HybridRAG)

Models retrieve memory automatically using HybridRAG. This evaluates the system's ability to find and use relevant knowledge.
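The internals of HybridRAG's Memory Twin-Needle Activator and Context Fabric Composer are not detailed on this page. As a minimal sketch of the general hybrid-retrieval idea only, the following blends a sparse term-overlap score with a dense-style cosine score (a bag-of-words cosine standing in for real embedding similarity); all names, weights, and the memory contents are illustrative assumptions:

```python
from collections import Counter
import math

def sparse_score(query: str, doc: str) -> float:
    # Term-overlap count: a crude stand-in for BM25-style sparse retrieval.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum((q & d).values()))

def dense_score(query: str, doc: str) -> float:
    # Bag-of-words cosine: a stand-in for embedding cosine similarity.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def hybrid_retrieve(query: str, memory: list[str],
                    alpha: float = 0.5, k: int = 2) -> list[str]:
    # Blend the two scores and return the top-k memory entries.
    scored = [(alpha * sparse_score(query, m) + (1 - alpha) * dense_score(query, m), m)
              for m in memory]
    return [m for _, m in sorted(scored, key=lambda sm: sm[0], reverse=True)[:k]]

memory = ["newton's second law f = ma",
          "ideal gas law pv = nrt",
          "photosynthesis in plants"]
print(hybrid_retrieve("ideal gas law", memory, k=1))  # ['ideal gas law pv = nrt']
```

A production system would use a real sparse index and an embedding model; the blending weight `alpha` plays the same role as the fusion step in typical hybrid retrievers.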

# Model Avg. Acc (%) Math Physics Chemistry AAUI Tokens
1 Grok-4-Fast 🥇 56.69 68.92 72.50 65.00 0.66 3.17M
2 Claude-Haiku-4.5 🥈 51.18 64.66 60.42 62.50 0.46 2.58M
3 DeepSeek-V3.2 🥉 47.27 59.40 49.10 44.17 0.22 1.94M
4 GPT-OSS-120B 47.22 56.14 57.08 52.08 0.44 2.48M
5 Qwen3-30B 44.31 64.16 60.42 39.17 0.36 1.97M
6 GLM-4-32B 39.40 59.90 52.50 29.58 0.41 1.82M
7 Qwen3-4B 38.13 59.15 46.59 30.83 0.27 1.87M
8 Llama-3.1-70B 28.98 44.11 37.47 20.83 0.33 1.88M
9 Gemini-2.5-Flash 21.66 30.58 30.26 23.75 0.14 2.77M
10 GPT-5-Mini 18.74 26.32 24.35 18.33 0.09 2.74M

3. Vanilla (No Memory)

Standard zero-shot inference without external memory augmentation.

# Model Avg. Acc (%) Math Physics Chemistry Tokens
1 Qwen3-30B 🥇 48.41 55.64 48.90 60.83 1.81M
2 Grok-4-Fast 🥈 47.45 51.38 43.99 58.75 1.18M
3 DeepSeek-V3.2 🥉 45.36 46.37 38.28 60.42 0.70M
4 Claude-Haiku-4.5 41.26 43.11 50.42 64.58 0.96M
5 GPT-OSS-120B 40.76 49.12 40.18 50.42 0.44M
6 Qwen3-4B 37.76 47.62 42.89 46.67 1.91M
7 GLM-4-32B 25.20 37.59 28.56 30.00 0.44M
8 Llama-3.1-70B 23.96 34.21 26.94 28.21 0.57M
9 GPT-5-Mini 21.97 31.83 26.65 26.67 1.35M
10 Gemini-2.5-Flash 15.01 24.81 26.15 5.00 1.30M

A3-Bench Dataset

Dataset Overview

A3-Bench is a comprehensive benchmark designed to evaluate memory-driven scientific reasoning. It is constructed via the SAPM process (Subject, Anchor & Attractor, Problem, Memory Developing).

The dataset contains a total of 2,198 problems, distributed across:

  • Math (45.4%), Physics (27.3%), and Chemistry (27.3%).
  • Difficulty is balanced: roughly 40% Easy, 30% Medium, and 30% Hard.
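Taking the reported total at face value, the subject percentages above imply the following approximate per-subject problem counts (a back-of-the-envelope check, not official splits):

```python
total = 2198  # total problems in A3-Bench
shares = {"Math": 0.454, "Physics": 0.273, "Chemistry": 0.273}

counts = {subject: round(total * share) for subject, share in shares.items()}
print(counts)                # {'Math': 998, 'Physics': 600, 'Chemistry': 600}
print(sum(counts.values()))  # 2198 -- rounding happens to preserve the total
```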

We model scientific reasoning memory at two scales: Anchors (foundational knowledge) and Attractors (experience-based templates).

Dataset Statistics
Table 1: Key Statistics of A3-Bench.

The SAPM Construction Process

Figure 2: The four-step SAPM annotation process: Subject Benchmarking, Anchors & Attractors Developing, Problem Reconstructing, and Memory Mapping.

Evaluation Framework

Figure 3: The HybridRAG framework, utilizing Memory Twin-Needle Activator and Context Fabric Composer to simulate human-like memory retrieval.

Analysis & Insights

Efficiency vs. Performance


Figure 4: Inference time vs. Accuracy. The Annotated paradigm (Green) achieves the highest accuracy with moderate latency, significantly outperforming the Vanilla baseline.

Subject & Difficulty Breakdown


Figure 5: Performance heatmap. The memory mechanism provides the most significant gains in Hard physics and chemistry problems, where multi-hop reasoning is required.

Error Attribution


Figure 6: Distribution of error types. Activating Anchors and Attractors significantly reduces Knowledge Errors and Reasoning Hallucinations.

Robustness to Noise


Figure 7: Impact of interfering memory. Models demonstrate robustness against irrelevant anchors (noise), maintaining stable accuracy even as distraction levels increase.

Case Studies

Qualitative analysis of dataset samples and reasoning trajectories.

BibTeX

@misc{zhang2026a3benchbenchmarkingmemorydrivenscientific,
      title={$A^3$-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation},
      author={Jian Zhang and Yu He and Zhiyuan Wang and Zhangqi Wang and Kai He and Fangzhi Xu and Qika Lin and Jun Liu},
      year={2026},
      eprint={2601.09274},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09274},
}