Enterprises scaling their AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep up with shifting workloads.
Speculators are smaller AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then verifies in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens in a single pass, dramatically improving throughput.
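The draft-and-verify loop can be pictured with a toy sketch. Both "models" below (`draft_next`, `target_next`) are invented stand-in functions, not real LLMs, and a real system verifies the drafts in one batched forward pass rather than a Python loop:

```python
# Toy sketch of speculative decoding's draft-and-verify loop.
# The target function is the ground truth; the draft imitates it imperfectly,
# standing in for a small speculator with an imperfect acceptance rate.

def target_next(token: int) -> int:
    """Greedy next-token choice of the (expensive) target model."""
    return (token * 2 + 1) % 97

def draft_next(token: int) -> int:
    """Cheap draft model: matches the target except on 'hard' tokens."""
    return target_next(token) if token % 5 != 0 else (token + 1) % 97

def speculative_step(last_token: int, k: int = 5):
    """Draft k tokens, then verify them against the target.

    Returns (accepted_tokens, n_drafts_accepted). The speedup in real
    systems comes from running the k verifications as one batched pass.
    """
    # 1. Draft phase: propose k tokens autoregressively with the cheap model.
    drafts, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)

    # 2. Verify phase: the target checks every drafted position.
    accepted, tok = [], last_token
    for d in drafts:
        t = target_next(tok)
        if t != d:                 # first mismatch: stop, and keep the
            accepted.append(t)     # target's own token as a free correction
            return accepted, len(accepted) - 1
        accepted.append(d)
        tok = d
    return accepted, len(accepted)

tokens, n_ok = speculative_step(last_token=3, k=5)
print(tokens, n_ok)   # → [7, 15, 31] 2
```

Every emitted token is one the target model would have produced itself, which is why speculative decoding improves throughput without changing output quality.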
Together AI today announced research and a new system called ATLAS (AdapTive-LeArning Speculator System) that aims to help companies overcome the challenge of static speculators. The technique provides self-learning inference optimization that can deliver up to 400% faster inference performance compared to the baseline available in existing inference technologies such as vLLM. The system addresses a critical problem: as AI workloads evolve, inference speeds degrade, even with specialized speculators in place.
The company, founded in 2023, has focused on optimizing inference on its enterprise AI platform. Earlier this year, it raised $305 million as customer adoption and demand grew.
“The companies we work with typically see their workloads shift as they scale up, and then they don’t get as much speedup from speculative execution as before,” Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. “These speculators generally don’t perform well when their workload domain starts to shift.”
The workload drift problem no one talks about
Most speculators currently in production are “static” models. They are trained once on a fixed dataset representing expected workloads and then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to increase throughput without changing output quality.
But there’s a catch. As the use of AI in the enterprise evolves, the accuracy of the static speculator declines rapidly.
“If you’re a coding agent company and most of your developers were writing in Python, then suddenly some of them switch to writing Rust or C, and you see the speed start to slow down,” Dao explained. “The speculator has a mismatch between what it was trained on and what the actual workload is.”
This workload drift is a hidden tax on AI scaling. Companies either accept degraded performance or invest in retraining custom speculators. That retraining only captures a snapshot in time and quickly becomes outdated.
How adaptive speculators work: A two-model approach
ATLAS uses a dual-speculator architecture that combines stability with adaptation:
Static speculator – A heavyweight model trained on broad data delivers consistent baseline performance. It serves as a “speed floor”.
Adaptive speculator – A lightweight model learns continuously from live traffic, specializing on the fly in emerging domains and usage patterns.
Confidence controller – An orchestration layer dynamically chooses which speculator to use and tunes the speculation “lookahead” based on confidence metrics.
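As a rough illustration of the routing logic described above, here is a hypothetical confidence controller. Every name, threshold, and formula below is invented for the sketch; ATLAS's actual implementation is proprietary and not described at this level of detail:

```python
from dataclasses import dataclass

@dataclass
class SpeculatorStats:
    """Running acceptance statistics for one speculator."""
    accepted: int = 0
    proposed: int = 0

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

class ConfidenceController:
    """Routes requests to the static or adaptive speculator and scales
    the speculation lookahead with the observed acceptance rate."""

    def __init__(self, min_lookahead=2, max_lookahead=8, switch_at=0.7):
        self.static = SpeculatorStats()
        self.adaptive = SpeculatorStats()
        self.min_lookahead = min_lookahead
        self.max_lookahead = max_lookahead
        self.switch_at = switch_at  # confidence needed to trust the adaptive model

    def choose(self):
        """Prefer the adaptive speculator once it is confident enough;
        otherwise fall back to the static 'speed floor'."""
        name = "adaptive" if self.adaptive.acceptance_rate >= self.switch_at else "static"
        stats = self.adaptive if name == "adaptive" else self.static
        # Longer lookahead when acceptance is high, shorter when it is low.
        span = self.max_lookahead - self.min_lookahead
        return name, self.min_lookahead + round(span * stats.acceptance_rate)

    def record(self, name, accepted, proposed):
        stats = self.adaptive if name == "adaptive" else self.static
        stats.accepted += accepted
        stats.proposed += proposed

ctrl = ConfidenceController()
ctrl.record("static", accepted=3, proposed=5)     # static floor handles early traffic
print(ctrl.choose())                              # → ('static', 6)
ctrl.record("adaptive", accepted=9, proposed=10)  # adaptive model learns the traffic
print(ctrl.choose())                              # → ('adaptive', 7)
```

The design choice worth noting is that the static model never disappears: it bounds worst-case latency while the adaptive model earns trust.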
“Before the adaptive speculator learns anything, we still have a static speculator to help increase the speed at the beginning,” Ben Athiwaratkun, an artificial intelligence scientist at Together AI, explained to VentureBeat. “As the adaptive speculator becomes more confident, the speed increases over time.”
The technical innovation lies in balancing acceptance rate (how often the target model agrees with the drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller leans more heavily on the lightweight speculator and extends the lookahead, which increases efficiency.
Users don't need to tune any parameters. “On the user side, you don’t have to turn any knobs,” Dao said. “On our end, we have turned those knobs to a configuration that gives users good speedup.”
Performance comparable to custom silicon
Together AI's testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference hardware such as Groq's custom chips.
“Improvements in software and algorithms can close the gap with really specialized hardware,” Dao said. “We saw speeds of 500 tokens per second on these huge models, which is even faster than some custom chips.”
The 400% inference speedup the company claims is the cumulative effect of Together's Turbo optimization suite. FP4 quantization delivers an 80% speedup over the FP8 baseline. The static Turbo speculator adds another 80-100% gain. The adaptive system layers on top of that. Each optimization compounds the benefits of the others.
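These gains compound multiplicatively rather than adding. A quick sanity check, treating the cited percentages as multiplicative factors (an assumption on our part, not a formula Together AI has published):

```python
# Rough sanity check on how the cited layer-by-layer gains compound.
# The percentages come from the article; multiplying them is our assumption.

fp4_quant = 1.80          # "80% speedup" over the FP8 baseline
static_spec = 1.90        # midpoint of the "80-100%" static speculator gain

total = fp4_quant * static_spec
print(f"{total:.2f}x before adaptation")            # → 3.42x before adaptation

# The adaptive layer would need roughly this much on top to reach 4x overall:
print(f"{4.0 / total:.2f}x from the adaptive layer")  # → 1.17x from the adaptive layer
```

Under these assumptions, the adaptive speculator only needs a modest further gain on top of the static stack to hit the headline 4x figure.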
Compared to standard inference engines such as vLLM or Nvidia's TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger of the two baselines for each workload before applying speculative optimizations.
The memory-compute trade-off explained
The performance gains come from exploiting a fundamental inefficiency of modern inference: wasted compute capacity.
Dao explained that during typical inference, a significant amount of compute capacity sits underutilized.
“During inference, which is actually the dominant workload today, you’re mostly using the memory subsystem,” he said.
Speculative decoding trades idle compute for reduced memory access. When the model generates one token at a time, it is memory bound: the GPU sits idle waiting for memory. But when the speculator proposes five tokens and the target model verifies them all at once, compute utilization climbs dramatically while memory access stays roughly constant.
“The total amount of computation needed to generate five tokens is the same, but the memory only had to be accessed once rather than five times,” Dao said.
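A back-of-envelope cost model makes the point concrete. The numbers below are illustrative assumptions (a 70B-parameter model in 16-bit, bandwidth in the rough range of current datacenter GPUs), not measurements, and the model deliberately ignores compute time and the acceptance rate:

```python
# Back-of-envelope model of why batched verification helps a memory-bound
# decoder: each forward pass must stream the weights from HBM once,
# regardless of how many tokens it verifies. Numbers are illustrative.

weight_bytes = 70e9 * 2   # assumed: 70B parameters at 2 bytes each
mem_bw = 3.35e12          # assumed: ~3.35 TB/s of HBM bandwidth

def time_per_token(tokens_per_pass: int) -> float:
    """Amortized memory-streaming time per generated token."""
    mem_time = weight_bytes / mem_bw   # fixed cost per forward pass
    return mem_time / tokens_per_pass  # shared across the verified batch

speedup = time_per_token(1) / time_per_token(5)
print(f"upper bound from 5-token verification: {speedup:.0f}x")
```

In this idealized model the bound is exactly 5x; in practice the gain is capped by the acceptance rate, since only the accepted prefix of each draft counts.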
Think of it as smart caching for artificial intelligence
For infrastructure teams familiar with traditional database optimization, adaptive speculators act like an intelligent caching layer, but with a key difference.
Traditional caching systems like Redis or memcached require exact matches: you store a specific query result and retrieve it when that exact query runs again. Adaptive speculators work differently.
“You can think of it as a smart way of caching, not storing things exactly, but detecting certain patterns that you see,” Dao explained. “Generally speaking, what we see is that you’re working with similar code, or you’re, you know, orchestrating compute in a similar way. We can then predict what the big model is going to say. We’re just getting better and better at predicting that.”
Instead of storing exact answers, the system learns the patterns in how the model generates tokens. It recognizes that when you are editing Python files in a particular codebase, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring identical inputs.
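One way to picture this "pattern caching" is an online n-gram predictor that updates its counts from live, verified traffic. This is a didactic stand-in for the idea, not ATLAS's actual mechanism, which the company has not published:

```python
from collections import Counter, defaultdict

class AdaptivePatternSpeculator:
    """Learns which token tends to follow each (prev, cur) pair from live
    traffic: the pattern-level analogue of a cache, storing tendencies
    rather than exact query results."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, tokens):
        """Update pattern statistics from verified model output."""
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def predict(self, prev, cur):
        """Most likely next token for this context, or None if unseen."""
        dist = self.counts.get((prev, cur))
        return dist.most_common(1)[0][0] if dist else None

spec = AdaptivePatternSpeculator()
# Traffic drifts toward a new pattern (say, Rust instead of Python):
spec.observe("fn main ( ) { println".split())
spec.observe("fn main ( ) { let".split())
spec.observe("fn main ( ) { let".split())
print(spec.predict("(", ")"))   # → {   (stable pattern, seen every time)
print(spec.predict(")", "{"))   # → let (the drifted majority pattern wins)
```

The inputs never repeat exactly, yet predictions improve as the statistics track whatever the live workload currently looks like, which is the behavior the analogy above describes.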
Use cases: RL training and evolving workloads
Adaptive speculation delivers particular benefits in two enterprise scenarios:
Reinforcement learning training: Static speculators quickly fall out of sync as the policy evolves during training. ATLAS continuously adapts to the shifting policy distribution.
Evolving workloads: As enterprises find new use cases for AI, the composition of their workloads shifts. “Maybe they started using AI in chatbots, but then they realized, hey, it can write code, so they started switching to coding,” Dao said. “Or they realize that these AIs can actually call tools, control computers, do accounting, things like that.”
During a vibe-coding session, the adaptive system can specialize in the specific codebase being edited, even though those files were never seen during training. This further boosts the acceptance rate and decoding speed.
What this means for enterprises and the inference ecosystem
ATLAS is now available on Together AI's dedicated endpoints at no additional cost. More than 800,000 of the company's developers have access to the optimization (up from 450,000 in February).
But the broader implications go beyond a single vendor's product. Moving from static to adaptive optimization means fundamentally rethinking how inference platforms work. As enterprises deploy AI across multiple domains, the industry will need to move away from one-off trained models toward systems that learn and improve continuously.
Together AI has historically open-sourced some of its research techniques and collaborated with projects such as vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem.
For enterprises looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization increasingly rivals specialized hardware.
