
Speculative Actions: Running Your Agent's Future Steps Before Deciding Them


TL;DR:


The wait no one is measuring

Open any LLM agent paper from the last two years and you’ll find careful benchmarks on task success rate, planning depth, and hallucination reduction. Latency benchmarks are conspicuously absent.

Here’s the arithmetic that gets glossed over. The standard ReAct loop, the canonical Thought → Action → Observation cycle formalized in the ReAct paper (ICLR 2023), executes tool calls sequentially. Each reasoning step produces a tool call, and the resulting observation feeds the next reasoning step, so every call blocks until the prior result arrives. This strictly sequential dependency chain is the primary latency bottleneck in multi-step agents. ToolBench, which characterized latency across 16,000 real-world APIs, found that for fast LLM backends, tool-call latency accounts for over 60% of total agent response time.

A 10-step agent with a 2-second average tool latency accumulates 20 seconds of pure idle time. Not computation. Not reasoning. Just waiting. Make your model twice as fast and those 20 seconds don’t shrink: the bottleneck is structural, not silicon.

This is the central observation of the Speculative Actions framework: the latency bottleneck is algorithmic. And algorithms already have a proven solution to this exact problem, one that has been running on your CPU for three decades.

How CPUs cracked this in 1981

When a modern CPU encounters a conditional branch, it faces the same structural problem an agent does. It cannot evaluate the branch condition without executing prior computation, and stalling the entire pipeline while it waits is catastrophic. Modern pipelines are 15–20 stages deep; a single stall wastes many cycles per branch.

The solution is speculative execution: guess the branch outcome, pre-execute the predicted path, and roll back if the guess was wrong. The technique trades occasional wasted work for lower average latency, and it underlies both CPU branch prediction and LLM speculative decoding. Branch prediction, which guesses the path a conditional branch will take before its condition is evaluated, was formalized in James E. Smith’s 1981 ISCA paper and industrialized through the 1990s as out-of-order processors became standard.

What makes the approach viable is the math. Modern branch predictors are right on roughly 95–99% of branches, which means the cost of occasional rollbacks is easily amortized across the majority of correct predictions. Critically, the correctness guarantee is strong: a misprediction causes a full rollback to consistent architectural state, not silent corruption. Output is identical to non-speculative execution.

Google researchers Leviathan, Kalman, and Matias extended this pattern to LLM token generation in 2023 with speculative decoding; Chen et al. at DeepMind arrived at the same result independently. A small draft model generates k candidate tokens, and a large target model validates all of them in a single forward pass. Acceptance rates of 70–90% yield 2–3× decoding speedups with mathematically guaranteed losslessness. Token quality is provably unchanged.
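The draft-then-verify round can be illustrated with a minimal greedy sketch. It is a simplification under stated assumptions: the published algorithms accept or reject draft tokens by rejection sampling over probabilities, whereas this version only compares greedy tokens, and the `draft` and `target` callables (along with the toy models below) are hypothetical stand-ins for real models.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: given the tokens so far, return the next token.
NextToken = Callable[[Sequence[str]], str]


def speculative_decode_step(prefix: List[str], draft: NextToken,
                            target: NextToken, k: int) -> List[str]:
    """Return the tokens committed by one draft-then-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks every proposed position. A real system scores
    #    all k positions in one batched forward pass; here it is called per position.
    committed: List[str] = []
    for token in proposed:
        expected = target(prefix + committed)
        if token != expected:
            committed.append(expected)  # first mismatch: keep the target's token, stop
            return committed
        committed.append(token)

    # 3. Every draft token was accepted, so the target contributes one extra token.
    committed.append(target(prefix + committed))
    return committed


# Toy models: the target "knows" a sentence, the draft gets the third word wrong.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]


def target_model(tokens: Sequence[str]) -> str:
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"


def draft_model(tokens: Sequence[str]) -> str:
    return "ran" if len(tokens) == 2 else target_model(tokens)


print(speculative_decode_step([], draft_model, target_model, k=4))
```

In this toy run two draft tokens are accepted and the third is corrected, so the round commits three tokens. The committed sequence always matches what the target would have produced on its own, which is the losslessness property in its simplest form.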

Columbia’s DAPLab then asked the natural next question: if we can speculate at the token level, why not at the tool-call level?

Lifting the pattern to agent actions

The Speculative Actions paper, presented as an oral at ICLR 2026, extends the pattern one abstraction level higher, from token sequences to tool-call sequences.

The mechanism has five steps:

  1. Draft model predicts ahead. A small, fast model observes the current agent context and predicts the likely next tool call (name, arguments, expected sequencing) before the primary LLM finishes its reasoning step.
  2. Safe actions are dispatched immediately. Only read-only, reversible predicted actions are executed against live systems. Write operations are held until validation.
  3. Primary LLM produces the actual next action. This step is happening anyway, so no extra wall-clock time is introduced.
  4. Validation. If the primary LLM’s action matches the draft prediction, the pre-fetched result is consumed. The API call latency is hidden behind reasoning time.
  5. Mismatch means discard. If the prediction was wrong, the speculative result is thrown away and execution continues on the normal path. No side effects were committed.
sequenceDiagram
    participant D as Draft Model
    participant T as Tool / API
    participant L as Primary LLM
    participant V as Validator

    L->>L: Begin reasoning (step N)
    D->>D: Predict likely action A
    D->>T: Execute A speculatively (if read-safe)
    Note over T: API call runs in parallel with LLM reasoning

    L->>V: Actual action = A'
    alt A' matches A
        T-->>V: Result ready
        V->>L: Consume result — latency hidden
    else A' ≠ A
        V->>L: Discard speculative result
        L->>T: Execute A' on normal path
        T-->>L: Result
    end

Across gaming, e-commerce, web-search, and operating-system benchmarks, the draft model achieved up to 55% next-action prediction accuracy. At that rate, roughly half of agent steps can have their tool-call latency hidden behind the LLM’s own reasoning time, provided the predicted action is safe to pre-execute and the call finishes before the reasoning does.
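A back-of-the-envelope model makes the effect of that hit rate concrete. This is a sketch under simplifying assumptions that are not from the paper: the draft prediction is treated as free and instantaneous, every predicted action is assumed read-safe, and the 2-second reasoning time is an illustrative value.

```python
def expected_step_latency(reason_s: float, tool_s: float, hit_rate: float) -> float:
    """Expected wall-clock seconds per step when predicted calls are pre-executed."""
    baseline = reason_s + tool_s        # miss: reason first, then wait for the tool
    overlapped = max(reason_s, tool_s)  # hit: the tool call ran during reasoning
    return hit_rate * overlapped + (1.0 - hit_rate) * baseline


STEPS = 10
REASON_S = 2.0   # assumed per-step reasoning time (illustrative)
TOOL_S = 2.0     # average tool latency from the 10-step example
HIT_RATE = 0.55  # upper-bound prediction accuracy quoted above

baseline_total = STEPS * (REASON_S + TOOL_S)
speculative_total = STEPS * expected_step_latency(REASON_S, TOOL_S, HIT_RATE)
print(f"baseline: {baseline_total:.1f}s, speculative: {speculative_total:.1f}s")
```

Under these assumptions the 10-step run drops from 40 seconds to about 29. Note the `max` term: a hit hides the tool call completely only when the call finishes within the reasoning window.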

Agent trajectories turn out to be locally predictable in a way that is initially counterintuitive. In a web-shopping task (WebArena), “search for product” is almost always followed by “click first result.” In data-retrieval pipelines, API call sequences are highly stereotyped within a task type. The draft model doesn’t need global intelligence; it only needs to be right often enough for amortization to work. This is precisely the property that makes branch prediction viable in CPUs, applied one level up the abstraction hierarchy.

Here’s the core logic in pseudocode:

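from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)  # shared pool for speculative tool calls


def speculative_agent_step(context, draft_model, primary_llm, classifier, execute):
    """Run one agent step; `execute(action)` is the caller's blocking tool runner."""
    # The draft prediction is cheap, so it is made before the primary LLM starts reasoning
    predicted_action = draft_model.predict_next(context)

    # Speculatively execute only if the action has no durable side effects
    prefetched = None
    if classifier.is_read_safe(predicted_action):
        prefetched = _pool.submit(execute, predicted_action)  # runs in a worker thread

    # Primary LLM produces the ground-truth action (this was happening anyway)
    actual_action = primary_llm.reason_and_act(context)

    if prefetched is not None and actual_action == predicted_action:
        return prefetched.result()   # latency hidden behind reasoning time
    if prefetched is not None:
        prefetched.cancel()          # best effort; a call already running is simply ignored
    return execute(actual_action)    # normal path; no side effects were committed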

The critical design choice is the action classifier, not the draft model.

The safety model that makes it deployable

Speculative execution is only safe when mispredictions can be undone. For CPUs, a rollback means restoring register state. For agent tool calls, a rollback means ignoring the pre-fetched result, which is only possible if the call produced no durable external side effect.

This mirrors transactional memory, a CPU mechanism in which reads proceed optimistically without locking and writes are committed atomically only after validating that no conflicts occurred. Hardware transactional memory, as in Intel’s TSX (Transactional Synchronization Extensions), operationalized exactly this read/write asymmetry at the hardware level, and the same logic maps cleanly to HTTP-semantics APIs.

For agent tool calls, the boundary is pragmatic:

Safe to speculate (no durable side effects):

Unsafe to speculate (durable external effects):

This boundary has a favorable shape for many real-world workloads. A research agent queries ten sources before drafting a report; the write happens at the end. A data pipeline calls five retrieval APIs before updating a record; the mutation happens at the end. Speculative actions are strongest precisely where this front-loaded read pattern holds, which is common in planning, research, and retrieval agents.

The engineering team at incident.io reportedly applied a simpler version of this pattern to voice-agent tool calling, saving several seconds per interaction. In voice interfaces, response latency directly maps to whether a conversation feels natural, so those seconds are product quality, not an infrastructure detail. (Note: the specific engineering post describing this could not be independently located for direct citation; treat it as a directional reference.)

What this means for your agent stack

The framework is implementable today without re-architecting your agent. Here’s where the practical complexity actually lives.

The action classifier is the real work. For HTTP-semantics APIs, the read/write boundary is clear: GET is safe, POST is not. For stateful tools (a “read” that increments a view counter, a “search” that logs analytics), you’ll need domain-specific rules. Build the classifier first; safety classification errors have real consequences.
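A minimal sketch of such a classifier, assuming each tool call can be mapped to the HTTP method it ultimately issues. The `ToolCall` shape, the tool names, and the `stateful_reads` exception list are hypothetical; the safe-method set follows the HTTP semantics definition of safe (read-only) methods.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Methods defined as "safe" (read-only) by HTTP semantics, RFC 9110.
SAFE_HTTP_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass(frozen=True)
class ToolCall:
    tool: str    # hypothetical tool name, e.g. "search_products"
    method: str  # HTTP method the tool ultimately issues


class ActionClassifier:
    """Deny by default: speculate only on safe methods not on the exception list."""

    def __init__(self, stateful_reads: FrozenSet[str] = frozenset()):
        # Tools that use a safe method but still leave a durable trace
        # (view counters, analytics logging) are listed explicitly.
        self.stateful_reads = stateful_reads

    def is_read_safe(self, call: ToolCall) -> bool:
        if call.tool in self.stateful_reads:
            return False
        return call.method.upper() in SAFE_HTTP_METHODS


classifier = ActionClassifier(stateful_reads=frozenset({"view_article"}))
print(classifier.is_read_safe(ToolCall("search_products", "GET")))  # True
print(classifier.is_read_safe(ToolCall("create_order", "POST")))    # False
print(classifier.is_read_safe(ToolCall("view_article", "GET")))     # False
```

Defaulting to "unsafe" means a classification gap costs only a missed speculation, never an unintended write.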

Train the draft model on your own trajectory data. The 55% accuracy in the paper comes from task-specialized models, not generic small LLMs. Logs of your agent’s historical action sequences are the training signal. The more stereotyped your task type, the stronger the draft model can become.
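As a starting point before training anything, a frequency table over logged action pairs gives a baseline draft predictor. This is not the paper’s draft model: it is a deliberately simple bigram stand-in that predicts only the next tool name (not its arguments), and the action names and log format are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional


class BigramDraftModel:
    """Predicts the next action as the most frequent follower seen in the logs."""

    def __init__(self) -> None:
        self.followers: Dict[str, Counter] = defaultdict(Counter)

    def fit(self, trajectories: Iterable[List[str]]) -> "BigramDraftModel":
        # Count how often each action is immediately followed by each other action.
        for trajectory in trajectories:
            for current, following in zip(trajectory, trajectory[1:]):
                self.followers[current][following] += 1
        return self

    def predict_next(self, last_action: str) -> Optional[str]:
        counts = self.followers.get(last_action)
        if not counts:
            return None  # unseen action: make no speculative prediction
        return counts.most_common(1)[0][0]


# Hypothetical historical trajectories, one list of tool names per agent run.
logs = [
    ["search", "click_result", "add_to_cart"],
    ["search", "click_result", "read_reviews"],
    ["search", "apply_filter", "click_result"],
]
draft = BigramDraftModel().fit(logs)
print(draft.predict_next("search"))
```

Measuring this baseline’s hit rate on held-out logs indicates how stereotyped a workload is before investing in a learned draft model.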

The pattern is highest-value when:

Expect draft model overhead on the cold path. The draft model runs on every step, not just successful speculations. For agents with very fast tool calls (under 100 ms), the inference cost of the draft model may outweigh the savings. Measure your specific workload before committing.
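A rough break-even check can be sketched as follows. It pessimistically assumes the draft model adds `draft_s` seconds to the critical path of every step; if the draft runs fully in parallel, the cost is compute rather than wall-clock time. All numbers are illustrative, not measurements.

```python
def net_saving_per_step(reason_s: float, tool_s: float,
                        hit_rate: float, draft_s: float) -> float:
    """Expected seconds saved per step after paying for the draft model."""
    hidden = min(reason_s, tool_s)  # a hit can hide at most the reasoning window
    return hit_rate * hidden - draft_s


def break_even_tool_latency(hit_rate: float, draft_s: float) -> float:
    """Tool latency at which speculation pays for itself (valid while tool_s <= reason_s)."""
    return draft_s / hit_rate


HIT_RATE = 0.55
DRAFT_S = 0.08  # assumed draft-model inference time per step

print(net_saving_per_step(2.0, 0.1, HIT_RATE, DRAFT_S) < 0)   # fast tool: net loss
print(net_saving_per_step(2.0, 2.0, HIT_RATE, DRAFT_S) > 0)   # slow tool: net win
print(f"break-even tool latency: {break_even_tool_latency(HIT_RATE, DRAFT_S) * 1000:.0f} ms")
```

With these assumed values, a 100 ms tool call loses time under speculation while a 2-second call gains about a second per step, which is why the measurement has to be done on the actual workload.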

The bigger picture

The dominant agenda in AI agent research focuses on accuracy: better prompting, better retrieval-augmented generation (RAG), better planning architectures like Reflexion. Latency is treated as a deployment concern: throw faster GPUs at it, use a smaller model, cache more aggressively.

Speculative Actions reframes latency as a systems design problem with an algorithmic solution. The 1990s out-of-order execution revolution in CPUs wasn’t about faster transistors; it was about restructuring computation to extract parallelism from code that looked strictly sequential. Agent loops that look strictly sequential at the Thought → Action → Observation level are not actually causally sequential at the tool-call level. The observation for step N often doesn’t require finishing the reasoning for step N; it only requires knowing what action step N is likely to be. The draft model exploits precisely that gap.

If the next generation of production agents runs 10, 20, or 50 sequential tool calls, and trajectory data from AgentBench and WebArena suggests it will, then algorithmic latency optimization isn’t an optimization pass. It’s the performance axis that separates agents people use from agents people abandon.

References