Artificial Intelligence

Tianxin Wei et al. · Multiple institutions

2026-01-18

Agentic Reasoning for Large Language Models

Three-layer framework for agentic reasoning: foundational, self-evolving, multi-agent

Mohamed Amine Ferrag et al. · Multiple institutions

2026-03-06

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Unified taxonomy of ~60 benchmarks, agent framework comparison, collaboration protocols survey

Hu Jinchao et al. · Harbin Institute of Technology Shenzhen, TikTok Inc

2026-04-01

Agentic Tool Use in Large Language Models

Unified evolutionary framework for LLM tool use: prompting, supervised, RL paradigms

Zhaoshu Yu, Bo Wang, Pengpeng Zeng et al. · Multiple institutions

2025-10-27

A Survey on Efficient Vision-Language-Action Models

First comprehensive taxonomy for VLA efficiency across model design, training, and data collection pillars

Rui Shao, Wei Li, Lingsen Zhang et al. · Multiple institutions

2025-08-18

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Taxonomy of VLM-based VLA architectures (monolithic vs hierarchical) with RL, world model, and human video integration

Ashok Kumar Kanagala · Independent Researcher, Boston, MA

2026-02-07

Agentic AI Security & Autonomous Red-Teaming

Red-teaming framework for agentic AI: permission escalation, hallucination, orchestration flaws, memory manipulation, supply chain

Pengfei Du · Not specified

2026-03-08

Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Write-manage-read taxonomy, 5 mechanism families, three-dimensional taxonomy for agent memory

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang · Multiple institutions

2025-02-17

A-MEM: Agentic Memory for LLM Agents

Agentic memory with Zettelkasten-inspired note construction, dynamic linking, memory evolution

Charafeddine Mouzouni · OPIT – Open Institute of Technology; Cohorte AI, Paris, France

2026-04-06

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Large-scale systematic taxonomy of LLM agent exploitation triggers across 12 attack dimensions, identifying goal reframing as the sole reliable trigger while ruling out nine others, with GPT-4.1 achieving complete immunity across 1,850 trials.

Zijun Wang, Haoqin Tu, Letian Zhang, Hardy Chen, Juncheng Wu, Xiangyan Liu, Zhenlong Yuan, Tianyu Pang, Michael Qizhe Shieh, Fengze Liu, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie · UC Santa Cruz, National University of Singapore, Tencent, ByteDance, UC Berkeley, UNC-Chapel Hill

2026-04-06

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

First real-world safety evaluation of a deployed personal AI agent (OpenClaw), introducing the CIK taxonomy and showing that poisoning any single dimension raises attack success rate from 24.6% to 64–74%.

Usman Naseem · Macquarie University

2026-02-01

Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Comprehensive survey mapping mechanistic interpretability techniques to LLM alignment objectives, with a future research roadmap emphasizing automated interpretability and interpretability-driven alignment scaling.

Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang · University of Minnesota Twin Cities, Yonsei University, Google DeepMind

2026-04-11

The Amazing Agent Race: Strong Tool Users, Weak Navigators

DAG-structured benchmark of 1,400 Wikipedia navigation tasks revealing that current best agents achieve only 37.2% accuracy with navigation errors dominating (27–52% of failures), exposing compositional reasoning as the primary frontier bottleneck.

Annu Rana, Gaurav Kumar

2025-12-16

Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling

Two-phase reasoning: LLMs construct explicit problem models before generating solutions. Reduces constraint violations vs CoT and ReAct across five planning domains.

Jingtao Ding et al. · Tsinghua University (FIB Lab)

2024-11-21

Understanding World or Predicting Future? A Comprehensive Survey of World Models

Two-function taxonomy separating world models that build internal representations (understanding) from those that predict future states (simulation/decision guidance); ACM CSUR extended version

Mido Assran, Yann LeCun, et al. (30 authors) · Meta FAIR

2025-06-11

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Action-free JEPA pre-trained on 1M+ hours of video; V-JEPA 2-AC post-training on <62h robot video enables zero-shot pick-and-place on Franka arms

Xinqing Li et al. · Multiple institutions

2025-10-19

A Comprehensive Survey on World Models for Embodied AI

Three-axis taxonomy (Functionality × Temporal × Spatial) for embodied AI world models

2025-01-20

A Survey of World Models for Autonomous Driving

Three-tiered taxonomy: future-world generation, behavior planning, integrated closed-loop systems

2025-11-04

A Step Toward World Models: A Survey on Robotic Manipulation

Surveys manipulation methods exhibiting world-model capabilities — bridges VLA models and explicit world models

2025-09-09

3D and 4D World Modeling: A Survey

Hierarchical taxonomy (VideoGen / OccGen / LiDARGen) for 3D/4D world models

Lucas Maes et al. · Research collaboration

2026-03-25

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

First JEPA training stably end-to-end from raw pixels using only two loss terms — removes EMA/distillation tricks earlier JEPAs required

2026-02-15

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Extends masked JEPA with object-centric representations; object-level masking induces counterfactual-like latent interventions

2026-03-04

H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model

Hierarchical world model jointly predicting logical and visual state transitions — mitigates error accumulation in TAMP

2026-03-13

Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation (StructVLA)

Structured sparse frame prediction for planning — avoids dense pixel rollouts by predicting physically meaningful keyframes

Wayve team · Wayve

2025-03-26

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Latent diffusion world model for AV — controllable multi-view video generation from structured conditioning; production tool at Wayve

2025-07-17

PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

12,600-video empirical benchmark — quantifies systematic physics failures in Sora and peer generative video models

2025-12-03

VideoScience-Bench: Benchmarking Scientific Understanding and Reasoning for Video Generation

Sora-2 ~64% / Veo-3 ~58.7% on Phenomenon Congruency — quantifies how far frontier video models are from ground-truth physical realism

Xin Cheng, Wangding Zeng, Damai Dai, et al. · DeepSeek / Peking University (collaboration)

2026-01-12

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Conditional memory as a sparsity axis orthogonal to MoE, via Engram (O(1) n-gram lookup). Sparsity Allocation problem yields a U-shaped scaling law (compute vs static memory); scaled to 27B params with gains on BBH +5.0, ARC-C +3.7, MMLU +3.4, NIAH 84.2→97.0.

Ahmadreza Jeddi, Marco Ciccone, Babak Taati · Not specified (ICLR 2026)

2026-02-11

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Looped Transformer with elastic, budget-conditioned depth: a shortcut-consistency training scheme aligns reasoning trajectories of different lengths so one model trades inference depth for compute at test time; the reasoning is latent, carried out by repeatedly applying shared weights rather than via explicit CoT tokens.

Mingdeng Du · Independent / not stated in abstract

2026-03-30

Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in LLM Inference Services

First systematic empirical analysis of LLM token pricing across 3,237+ models (2020-2026); ~600x price decline; 'Tiered Super-Moore' hypothesis (price half-life of 1.10yr for the economy tier and 1.55yr for the mid tier vs the 2yr Moore benchmark; the flagship/reasoning tier resists via a ~31.5x premium); the cost decline is attributed ~103.7% to software/architecture and ~-0.9% to hardware.
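
To make the half-life figures above concrete: a price half-life of h years corresponds to a yearly price multiplier of 0.5^(1/h). The sketch below assumes smooth exponential decay, which is an illustrative assumption rather than something the summary states, and the function names are hypothetical.

```python
def annual_price_multiplier(half_life_years: float) -> float:
    """Fraction of the price that remains after one year of exponential decay."""
    return 0.5 ** (1.0 / half_life_years)


def price_after(initial_price: float, years: float, half_life_years: float) -> float:
    """Price after `years`, assuming it halves every `half_life_years` (assumption)."""
    return initial_price * 0.5 ** (years / half_life_years)


# Half-lives quoted in the summary; the decay model itself is assumed.
for tier, half_life in [("economy", 1.10), ("mid", 1.55), ("Moore benchmark", 2.0)]:
    m = annual_price_multiplier(half_life)
    print(f"{tier}: {m:.3f}x per year ({(1 - m) * 100:.1f}% annual decline)")
```

A shorter half-life gives a smaller yearly multiplier, so under this model the economy tier falls fastest and both tiers fall faster than the 2-year benchmark.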

Yike Wang, Huaisheng Zhu, Zhengyu Hu, Yige Yuan, Zhengyu Chen, Shakti Senthil, Hannaneh Hajishirzi, Yulia Tsvetkov, Pradeep Dasigi, Teng Xiao · Multiple institutions (inferred UW/AI2 affiliations, not stated verbatim in abstract)

2026-07-14

Rethinking the Evaluation of Harness Evolution for Agents

Automatic harness evolution for LLM agents does not consistently outperform simple test-time-scaling baselines under matched feedback/inference budget, and generalizes poorly to held-out tasks (Terminal-Bench 2.1, GPT-5.4 + Claude Opus 4.6) — a methodological check on 'scaffold gains' claims in the agentic-harness literature.

Yunbei Zhang, Janet Wang, Yingqiang Ge, Weijie Xu, Jihun Hamm, Chandan K. Reddy · Not stated verbatim on abstract page

2026-05-07

Stop Comparing LLM Agents Without Disclosing the Harness

Formalizes the Binding Constraint Thesis (for long-horizon agentic tasks across comparable-capability models, the execution harness is often a stronger performance determinant than the model) and measures harness-induced variance at 7.80x model-induced variance (18.48 vs 2.37 pp^2, 6/9 ranking reversals) in a 3-model x 3-harness SWE-bench Verified experiment; proposes an ETCSOVG Harness Card disclosure standard plus a variance-decomposition protocol. Bears directly on the MenFem 'edge is the harness' thesis.
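
As a rough illustration of what comparing harness-induced and model-induced variance can look like on a 3-model x 3-harness grid, the sketch below takes the population variance of the per-model and per-harness mean scores. The scores are hypothetical, and this plain split on marginal means is an assumption that may differ from the paper's own variance-decomposition protocol.

```python
from statistics import pvariance

# Hypothetical resolve rates in percentage points (not the paper's data).
# Rows are models, columns are harnesses.
SCORES = [
    [62.0, 70.5, 58.0],
    [60.5, 71.0, 61.5],
    [64.0, 68.5, 57.5],
]


def variance_by_factor(grid):
    """Return (model_variance, harness_variance) of the marginal means, in pp^2."""
    n_models, n_harnesses = len(grid), len(grid[0])
    model_means = [sum(row) / n_harnesses for row in grid]
    harness_means = [sum(row[j] for row in grid) / n_models for j in range(n_harnesses)]
    return pvariance(model_means), pvariance(harness_means)


model_var, harness_var = variance_by_factor(SCORES)
print(f"model: {model_var:.2f} pp^2, harness: {harness_var:.2f} pp^2")
print(f"harness/model ratio: {harness_var / model_var:.2f}x")

# The ratio quoted in the summary follows from its two reported variances.
print(f"reported ratio: {18.48 / 2.37:.2f}x")
```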

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang · Not stated verbatim on abstract page

2026-05-27

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Diagnostic benchmark (106 sandboxed offline tasks, 5,194 execution trajectories) that isolates configuration-level harness effects from model capability by fixing task/budget/eval and varying only the harness across model backends; finds substantial variation in completion, quality, efficiency, and failure behavior, and names execution-alignment decoupling as the dominant failure class. Empirical complement to Stop-Comparing.

Jessica McFadyen, Ole Jorgensen, Harry Coppock, Kevin Wei, Cozmin Ududec · Not stated verbatim on abstract page (roster overlaps UK AI Security Institute / frontier-eval groups)

2026-06-16

How Inference Compute Shapes Frontier LLM Evaluation

Across up to 12 frontier models x 7 benchmarks (SWE/math/medicine/cyber) under controlled inference-scaling interventions (token budget, context compaction, repeated attempts), fixed single-budget evals increasingly UNDERSTATE newer models, which have a higher ceiling at generous budgets; recommends reporting capability as a function of inference-time compute. Bridges test-time-compute, the matched-budget eval floor (§12), and inference-economics (§4).
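
One common way to report capability along the repeated-attempts axis is pass@k computed from n sampled attempts per task. The sketch below uses the standard unbiased pass@k estimator from the code-generation evaluation literature; whether this paper uses that estimator is an assumption, and the attempt counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k attempts succeeds),
    given c successes observed in n sampled attempts."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 3 successes out of 20 sampled attempts.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(20, 3, k):.3f}")
```

Reporting the whole curve over k, rather than a single point, is one way to act on the recommendation to treat capability as a function of inference-time compute.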

arXiv preprint authors · Multi-institution

2026-04-23

LLM Reasoning Is Latent, Not the Chain of Thought

Argues LLM reasoning should be studied as latent-state trajectory formation, not faithful surface CoT — implications for interpretability, alignment, and training-objective design

Xin Cheng, Xingkai Yu, Chenze Shao, et al. (30+ authors incl. Wenfeng Liang, Damai Dai) · DeepSeek (the roster includes members of DeepSeek's core research team)

2026-07-06

DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation

First-party DeepSeek disclosure of a production speculative-decoding system combining parallel semi-autoregressive drafting with confidence-scheduled adaptive verification; claims 60-85% real-world deployment speedups. DeepSeek is already a KB-tracked and MenFem-studied entity (deepseek-v4-2026 source, entities/deepseek.md) — this is a first-party look at their inference-efficiency stack, directly bearing on the 'who captures inference margin' thesis the inference-economics topic tracks.

Tianjian Yang, Meng Li · arXiv preprint

2026-07-02

Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

Single-mechanism fix for a known train/inference mismatch in block (DLM-style) speculative drafters: truncates the cross-entropy training support to the drafter's own first predicted failure point, concentrating supervision on the accepted prefix rather than the full block. No auxiliary objective, no verifier rollouts, no inference-pipeline change. Raises average emitted length τ from 2.40→2.61 on Qwen3-8B (DFlash) and transfers to a second drafter family (Domino, 2.56→2.68) — representative of the current 'free efficiency' phase of speculative-decoding research.
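
The truncation idea described above can be sketched as a cross-entropy loss that stops at the drafter's first wrong greedy prediction within a block. This is a minimal sketch under stated assumptions: whether the failing position itself stays in the support, and how the loss is normalized, are guesses, and `auf_loss` is a hypothetical name rather than the paper's code.

```python
import math


def auf_loss(block_probs, target_ids):
    """Mean cross-entropy over the block prefix up to and including the
    first position where the drafter's argmax differs from the target.

    block_probs: per-position probability lists from the drafter.
    target_ids:  gold token id for each position in the block.
    """
    total, count = 0.0, 0
    for probs, gold in zip(block_probs, target_ids):
        total += -math.log(probs[gold])
        count += 1
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != gold:
            break  # positions after the first failure get no supervision
    return total / count


# Toy block of three positions over a two-token vocabulary (hypothetical values).
probs = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]
targets = [0, 0, 1]
print(auf_loss(probs, targets))  # only positions 0 and 1 contribute
```

Positions after the first miss would be discarded at inference time anyway, which is the train-inference mismatch the summary describes.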

Jaeyeon Kim, Jewon Lee, Bo-Kyeong Kim · Nota Inc. (Efficient Qwen Competition submission)

2026-07-05

Quantize the Target, Quantize the Drafter: Efficient Inference with Qwen3.5-4B

Combines quantization and speculative decoding in one system for constrained hardware (NVIDIA A10G): quantization-aware distillation for the target model + a two-stage-trained block-diffusion drafter, plus sliding-window attention. Achieves 6.978x average speedup over baseline while meeting quality thresholds (3rd place, Efficient Qwen Competition). Concrete evidence the two inference-efficiency techniques this lane tracks (quantization, speculative decoding) compound rather than substitute.

Zhikai Li, Zhen Dong, Xuewen Liu, Jing Zhang, Qingyi Gu · arXiv preprint

2026-05-06

OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization

Training-free, closed-form (no iterative optimization) post-training weight-only quantization: exploits a stable low-rank null space in the Hessian to build an additive weight transformation that suppresses outliers without affecting the task loss, absorbed into the weights offline with zero inference overhead. At 2-bit, OSAQ *integrated with* GPTQ achieves >40% lower perplexity than vanilla GPTQ (a complement to GPTQ, not a replacement).

Haiwen Yi, Xinyuan Song · Not stated verbatim on abstract page

2026-07-05

Measuring Harness-Induced Belief Divergence in Multi-Step LLM Agents

Shows the harness changes an agent's multi-step beliefs (progress, risk, recoverability, constraints, failure mode, uncertainty, future success, repair cost, next action) even with task/env/model fixed; introduces a belief-rollout diagnostic, a cross-harness belief-divergence metric split into arrival (interface) + growth (horizon) terms, and BIWM (no-training trajectory alignment). Terminal success often preserved while decision-driving beliefs diverge. Freshest of the July harness cluster.

Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang · Not stated verbatim on abstract page (roster overlaps LMCache / U. Chicago systems group)

2026-05-17

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Makes lossy KV-cache compression lossless by using compressed KV (GPU HBM) as a speculative drafter and verifying against full KV (CPU/remote) in parallel — drafting is HBM-bound, verification PCIe/network-bound, so the paths overlap. Up to ~2.7-4.3x throughput over full-KV with output identical to full-KV (KL<0.01 nats), ~25-40 accepted tokens/round vs 2-3 for small-model drafters; composes with EAGLE/MTP to 4.35x. Opens KV-cache coverage in the KB.
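
The draft-then-verify loop behind this kind of lossless guarantee can be sketched with greedy decoding: a cheap drafter proposes several tokens, and the exact model keeps the longest agreeing prefix plus its own correction. The toy next-token functions below are hypothetical stand-ins for the compressed-KV and full-KV paths, and verification runs sequentially here, whereas a real system would verify a whole draft in one batched pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_generate(draft_next: NextToken, verify_next: NextToken,
                         prompt: List[int], max_new_tokens: int,
                         draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals plain decoding with verify_next."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # Draft phase: the cheap path proposes draft_len tokens.
        ctx, draft = list(out), []
        for _ in range(draft_len):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # Verify phase: keep the exact token at each position and stop at
        # the first disagreement, so the draft never changes the output.
        for token in draft:
            exact = verify_next(out)
            out.append(exact)
            if exact != token or len(out) - len(prompt) >= max_new_tokens:
                break
    return out[len(prompt):]


def toy_verifier(ctx: List[int]) -> int:
    return (sum(ctx) + 1) % 7


def toy_drafter(ctx: List[int]) -> int:
    # Agrees with the verifier except at every third context length.
    return toy_verifier(ctx) if len(ctx) % 3 else 0


print(speculative_generate(toy_drafter, toy_verifier, [1, 2], 10))
```

The more often the drafter agrees with the exact model, the more tokens are accepted per round, which is why a drafter built from the same model's compressed KV can accept far more than a separate small model.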

Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu · Not stated in abstract (code: github.com/AI9Stars/SpecMQuant)

2025-05-28

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Resolves the KB's highest-priority flagged tension, the speculative-decoding × quantization conflict (compound-vs-conflict), with measurement: applying tree-style EAGLE-2 to a 4-bit weight-quantized model diminishes the quantization memory benefit because tree-draft verification costs significantly more than a single-token forward pass; a hierarchical framework using a small intermediate model to convert tree drafts into sequence drafts restores it, reaching a 2.78× speedup on 4-bit-weight Llama-3-70B (A100) and outperforming EAGLE-2 by 1.31× on the same setup. Weight-quantization finding only.

Tho Mai, Joo-Young Kim · Not stated in abstract

2026-05-08

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference (LaProx)

Recasts KV-cache eviction from head-wise attention-weight averaging into an output-aware, layer-wise matrix-multiplication approximation (LaProx), modeling the multiplicative interaction between attention maps and projected value states to yield the first GLOBALLY-comparable token importance score for model-wide (not per-head) selection; maintains model performance at just 5% KV-cache retention across 19 datasets (LongBench + Needle-In-A-Haystack) and cuts accuracy loss by up to 2× vs prior SOTA under extreme compression, with minimal overhead. Broadens the KB's KV-cache lane from compression into eviction.

Google DeepMind · Google DeepMind

2025-05-14

AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms

Gemini-powered evolutionary coding agent; 0.7% Google compute recovery; math breakthroughs

Anthropic Interpretability Team (Elhage, Olsson, Bricken, Templeton, Ameisen, Lindsey et al.) · Anthropic

2025-03-27

Anthropic Transformer Circuits Thread & Circuit Tracing

Running research thread: features, circuits, superposition, attribution graphs, circuit tracing tools

Anthropic Interpretability Team · Anthropic

2021-12-01

Transformer Circuits Thread (Broader Research Program)

Foundational research thread on mechanistic interpretability: mathematical framework, superposition, monosemanticity

Google DeepMind Genie Team · Google DeepMind

2026-02-19

Genie 3: A New Frontier for World Models

First real-time interactive generative world model — 11B-param autoregressive transformer producing 720p navigable worlds at 24fps with ~1 minute visual memory

Yann LeCun · NYU Courant + Meta FAIR

2022-06-27

A Path Towards Autonomous Machine Intelligence (Version 0.9.2)

Canonical position paper proposing JEPA, configurable predictive world models, hierarchical planning, intrinsic motivation, and SSL as the blueprint for Autonomous Machine Intelligence

Meta AI / industry coverage · Meta

2026-04-08

Meta Muse Spark — Native Multimodal Foundation Model with Contemplating Mode

Meta announces Muse Spark — natively multimodal (text/image/voice in single transformer) with Contemplating mode that orchestrates parallel sub-agents for deeper reasoning without latency penalty

Google · Google DeepMind

2026-04-15

Gemini 3 — Google's Latest Multimodal + Agentic Foundation Model

Google releases Gemini 3 — claimed best-in-world multimodal understanding and most powerful agentic model. Improved tool-use, planning, and rich multimodal output over Gemini 2.5

UK AI Security Institute (AISI) · UK AI Security Institute (AISI)

2025-12-18

Frontier AI Trends Report (AISI)

First public AISI assessment of frontier-model capability trends (30+ models, 2022–2025): cyber apprentice tasks 9%→50%, cyber time-horizon doubling ~8mo (accelerating to ~4.7mo), RepliBench <5%→>60%, 1hr SWE tasks >40%, models surpass biology-PhD baseline; universal jailbreaks in every system.

Anthropic · Anthropic

2026-07-15

Anthropic Claude Platform Release Notes — June 25 to July 15, 2026

First-party changelog: Claude Sonnet 5 launch (1M context, new tokenizer +~30% tokens, adaptive-thinking-only), agent-memory-2026-07-22 beta header replacing managed-agents-2026-04-01 on memory-store endpoints, Admin API user management for Enterprise orgs, API key expiration controls, Dreams managed-agent preview expands to Sonnet 5/Fable 5.

MIT Technology Review · MIT Technology Review

2026-01-12

Mechanistic Interpretability — 10 Breakthrough Technologies 2026

Named mech interp as 2026 breakthrough; Anthropic microscope + CoT monitoring advances

Zylos Research · Zylos

2026-02-09

AI Safety, Alignment, and Interpretability in 2026

DPO replacing RLHF analysis, alignment mirages concept, 6 documented failure modes, alignment trilemma

Adaline Labs · Adaline

2026-04-20

The AI Research Landscape in 2026: From Agentic AI to Embodiment

Synthesis of 2026 AI landscape across 5 frontiers: agentic AI mainstream, native multimodality standard, embodied/VLA scaling, world models + continual learning, autonomous agents in production

Simon Willison (independent analysis; corroborated by Bloomberg/Fortune/Axios/CNBC/Forbes/Tom's Hardware coverage of Moonshot AI's announcement) · Moonshot AI (subject)

2026-07-16

Kimi K3 — Moonshot AI's 2.8T-Parameter Open-Weight Model

Moonshot AI released Kimi K3, a 2.8T-parameter MoE (896 experts, 16 active/token), 1M context, native vision — largest open-weight model to date; API live at $3/$15 per Mtok (Sonnet-tier pricing, most expensive Chinese-lab release yet); open weights due 2026-07-27; tops Arena.ai Frontend Code leaderboard ahead of Claude Fable 5.

VentureBeat (corroborated by morphllm.com + Hugging Face model card) · DeepSeek

2026-04-24

DeepSeek V4 — Open-Weight Trillion-Parameter MoE at ~1/6th Frontier Cost

DeepSeek V4 (open-weight MIT): V4-Pro 1.6T/49B-active, V4-Flash 284B/13B-active, 1M context, ~$0.435/$0.87 per-Mtok — near-frontier capability (SWE-bench Verified ~80.6%, GPQA ~90–92%) at ~1/6th the cost of Opus 4.7 / GPT-5.5; lands inside an 8-day April-2026 frontier window.

Tech Startups (corroborated by CryptoBriefing; OpenAI's own page returned HTTP 403) · OpenAI

2026-07-09

GPT-5.6 — OpenAI's Three-Tier (Sol / Terra / Luna) Frontier Family + Token-Efficiency Claims

OpenAI launched GPT-5.6 as three tiers (Sol $5/$30, Terra $2.50/$15, Luna $1/$6 per Mtok) on 2026-07-09; claims 54% higher token efficiency on agentic coding vs rivals plus a 90% cached-read discount — a live instance of tier-differentiated frontier pricing.
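
The tier prices and the cached-read discount combine into a per-request cost as sketched below. This assumes each price pair reads input/output per million tokens and that the 90% discount applies to the input price for cached tokens; both readings are assumptions, and `request_cost` is a hypothetical helper, not an OpenAI API.

```python
# (input, output) USD per million tokens, as quoted in the summary.
TIERS = {"Sol": (5.00, 30.00), "Terra": (2.50, 15.00), "Luna": (1.00, 6.00)}
CACHED_READ_DISCOUNT = 0.90  # assumed to apply to the input price


def request_cost(tier: str, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """USD cost of one request; cached_tokens is the cached part of the input."""
    in_price, out_price = TIERS[tier]
    fresh_tokens = input_tokens - cached_tokens
    cached_price = in_price * (1 - CACHED_READ_DISCOUNT)
    total = (fresh_tokens * in_price
             + cached_tokens * cached_price
             + output_tokens * out_price)
    return total / 1_000_000


# Hypothetical agentic call: 200k input tokens (150k of them cached), 4k output.
for tier in TIERS:
    print(f"{tier}: ${request_cost(tier, 200_000, 150_000, 4_000):.4f}")
```

For cache-heavy agent loops the discounted input term often dominates, so the effective gap between tiers depends on how much of each prompt is cached.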