
Last month, alongside a comprehensive suite of new AI tools and innovations, Google DeepMind unveiled Gemini Diffusion. This experimental research model uses a diffusion-based approach to generate text. Traditionally, large language models (LLMs) such as GPT and Gemini itself have relied on autoregression, a step-by-step approach in which each word is generated based on the ones before it. Diffusion language models (DLMs), also known as diffusion-based large language models (dLLMs), take the method more commonly seen in image generation: starting with random noise and progressively refining it into coherent output. This approach dramatically increases generation speed and can improve coherence and consistency.
Gemini Diffusion is currently available only as an experimental demo; sign up for the waitlist here to get access.
Understanding diffusion vs. autoregression
Diffusion and autoregression are fundamentally different approaches. The autoregressive approach generates text sequentially, predicting tokens one at a time. While this method ensures strong coherence and context tracking, it can be computationally intensive and slow, especially for long-form content.
Diffusion models, by contrast, begin with random noise, which is progressively refined into coherent output. When applied to language, the technique has several advantages. Blocks of text can be processed in parallel, potentially producing entire segments or sentences at a much higher rate.
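To make the contrast concrete, here is a minimal Python sketch of the two decoding loops. The `model` object and its `next_token`, `random_noise`, and `denoise` methods are hypothetical placeholders rather than any real API; the point is the control flow: one token per forward pass versus a whole block refined in parallel over a fixed number of steps.

```python
# Toy contrast between the two decoding styles. The model methods below are
# hypothetical stand-ins, not a real library API.

def autoregressive_decode(model, prompt, n_tokens):
    """One token per forward pass; each prediction conditions on all prior tokens."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(model.next_token(tokens))  # strictly sequential
    return tokens

def diffusion_decode(model, prompt, n_tokens, n_steps):
    """A whole block is refined in parallel over a fixed number of denoising steps."""
    block = model.random_noise(n_tokens)         # start from pure noise
    for t in reversed(range(n_steps)):
        block = model.denoise(block, t, prompt)  # every position updated at once
    return block
```

Note that the diffusion loop’s cost scales with the number of denoising steps, not the number of tokens, which is what makes parallel generation of whole blocks possible.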
Gemini Diffusion can reportedly generate 1,000 to 2,000 tokens per second. By comparison, Gemini 2.5 Flash has an average output speed of 272.4 tokens per second. Additionally, mistakes in generation can be corrected during the refinement process, improving accuracy and reducing hallucinations. There may be trade-offs in terms of fine-grained, token-level accuracy; even so, the speed increase will likely be a game-changer for many applications.
How does diffusion-based text generation work?
During training, a DLM works by gradually corrupting a sentence with noise over many steps, until the original sentence is rendered entirely unrecognizable. The model is then trained to reverse this process, step by step, reconstructing the original sentence from increasingly noisy versions. Through this iterative refinement, it learns to model the entire distribution of plausible sentences in the training data.
While the specifics of Gemini Diffusion have not yet been disclosed, a typical training methodology for a diffusion model involves these key stages (a minimal sketch follows the list):
Forward diffusion: For each sample in the training dataset, noise is added progressively over many cycles (often 500 to 1,000), until the sample becomes indistinguishable from random noise.
Reverse diffusion: The model learns to reverse each stage of the corruption process, essentially learning how to “denoise” a corrupted sentence one step at a time, eventually restoring the original structure.
This process is repeated hundreds of thousands of times with diverse samples and noise levels, enabling the model to learn a robust denoising function.
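As a rough illustration of the forward pass, here is a minimal NumPy sketch in the style of continuous-embedding text diffusion (e.g. Diffusion-LM). Gemini Diffusion’s actual formulation is unpublished, so the linear noise schedule, step count, and embedding dimensions here are illustrative assumptions.

```python
import numpy as np

T = 1000                                   # number of noising steps (assumed)
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alphas_cumprod = np.cumprod(1.0 - betas)   # cumulative signal retention ("alpha-bar")

def add_noise(x0, t, rng):
    """Jump directly to noise level t: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 64))         # toy stand-in for 16 token embeddings
x_T, _ = add_noise(x0, T - 1, rng)         # by the final step, x_T is near-pure noise
print(np.corrcoef(x0.ravel(), x_T.ravel())[0, 1])  # correlation with x0 is ~0
```

During training, the model would be shown (x_t, t) pairs and trained to predict the noise that was added, which is the reverse-diffusion objective described above.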
Once trained, the model is capable of generating entirely new sentences. DLMs generally require a condition or input, such as a prompt, class label, or embedding, to guide the generation toward desired outcomes. The condition is injected into each step of the denoising process, which shapes an initial blob of noise into structured and coherent text.
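Continuing the sketch above (and reusing its `T`, `alphas_cumprod`, and `add_noise`), a conditional reverse-diffusion loop might look like the following. The key point is that the conditioning signal `cond` (a prompt embedding, say) is fed into the denoiser at every step, not just the first; the `denoiser` callable is a hypothetical stand-in for the trained network.

```python
def generate(denoiser, cond, shape, rng):
    """Reverse diffusion: start from pure noise and refine step by step,
    injecting the conditioning signal into every denoising step.
    `denoiser(x_t, t, cond)` is assumed to predict the added noise eps."""
    x = rng.standard_normal(shape)          # start from random noise
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)      # the condition shapes every step
        abar = alphas_cumprod[t]
        x0_hat = (x - np.sqrt(1.0 - abar) * eps_hat) / np.sqrt(abar)  # estimate x0
        if t > 0:
            x, _ = add_noise(x0_hat, t - 1, rng)  # re-noise to level t-1, keep refining
        else:
            x = x0_hat
    return x  # final embeddings; a real system maps these back to tokens
```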
Advantages and disadvantages of diffusion-based models
In an interview with VentureBeat, Brendan O’Donoghue, a research scientist at Google DeepMind and one of the leads on the Gemini Diffusion project, elaborated on some of the advantages of diffusion-based techniques compared to autoregression. According to O’Donoghue, the main advantages of diffusion techniques are as follows:
- Lower latencies: Diffusion models can produce a sequence of tokens in much less time than autoregressive models.
- Adaptive computation: Diffusion models will converge on a sequence of tokens at different rates depending on the difficulty of the task. This allows the model to consume fewer resources (and have lower latency) on easy tasks and more on harder ones.
- Non-causal reasoning: Due to the bidirectional attention in the denoiser, tokens can attend to future tokens within the same generation block. This allows non-causal reasoning to take place and lets the model make global edits within a block to produce more coherent text.
- Iterative refinement / self-correction: The denoising process involves sampling, which can introduce errors, just as in autoregressive models. However, unlike in autoregressive models, the tokens are passed back into the denoiser, which then has an opportunity to correct the error (a toy illustration follows this list).
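To illustrate the self-correction property, here is a toy decoding loop in the style of masked-diffusion language models such as LLaDA. This is an assumption about mechanics, not Gemini Diffusion’s published algorithm, and `model` is a hypothetical callable returning per-position token probabilities. Each step re-predicts every position in parallel and keeps only the most confident guesses, so a token committed early can still be revised later.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def masked_diffusion_decode(model, prompt_ids, length, n_steps):
    """Toy masked-diffusion decoding: all positions start masked; each step the
    model predicts every token in parallel, the most confident predictions are
    kept, and the rest are re-masked so later steps can revise them."""
    tokens = np.full(length, MASK)
    for step in range(n_steps):
        probs = model(prompt_ids, tokens)        # shape (length, vocab_size)
        preds = probs.argmax(axis=-1)            # best guess at every position
        conf = probs.max(axis=-1)                # confidence of each guess
        n_keep = length * (step + 1) // n_steps  # unmask more positions each step
        keep = np.argsort(-conf)[:n_keep]
        tokens = np.full(length, MASK)           # re-mask everything...
        tokens[keep] = preds[keep]               # ...then commit only confident tokens
    return tokens
```

Because even previously committed tokens are re-predicted on each pass, an early mistake can be overwritten in a later step, which is the error-correction behavior O’Donoghue describes.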
O’Donoghue also noted the main disadvantages: “higher cost of serving and slightly higher time-to-first-token (TTFT), since autoregressive models will produce the first token right away. For diffusion, the first token can only appear when the entire sequence of tokens is ready.”
Benchmark performance
Google says that the performance of Gemini Diffusion is comparable to that of Gemini 2.0 Flash-Lite.
Benchmark | Type | Gemini Diffusion | Gemini 2.0 Flash-Lite |
---|---|---|---|
LiveCodeBench (v6) | Code | 30.9% | 28.5% |
BigCodeBench | Code | 45.4% | 45.8% |
LBPP (v2) | Code | 56.8% | 56.0% |
SWE-Bench Verified* | Code | 22.9% | 28.5% |
HumanEval | Code | 89.6% | 90.2% |
MBPP | Code | 76.0% | 75.8% |
GPQA Diamond | Science | 40.4% | 56.5% |
AIME 2025 | Mathematics | 23.3% | 20.0% |
BIG-Bench Extra Hard | Reasoning | 15.0% | 21.0% |
Global MMLU (Lite) | Multilingual | 69.1% | 79.0% |
The two models were compared using several benchmarks, with scores based on how often the model produced the correct answer on the first attempt. Gemini Diffusion performed well in coding and mathematics tests, while Gemini 2.0 Flash-Lite had the edge on reasoning, scientific knowledge, and multilingual capabilities.
As Gemini Diffusion evolves, there is no reason to think its performance won’t catch up with more established models. According to O’Donoghue, the gap between the two techniques is “essentially closed in terms of benchmark performance, at least at the relatively small sizes we have scaled up to. In fact, there may be some performance advantage for diffusion in some domains where non-local consistency is important, for example, coding and reasoning.”
Testing Gemini Diffusion
VentureBeat was granted access to the experimental demo. Putting Gemini Diffusion through its paces, the first thing we noticed was the speed. When running the suggested prompts provided by Google, including building interactive HTML apps such as Xylophone and Planet Tac Toe, each request completed in under three seconds, with speeds ranging from 600 to 1,300 tokens per second.
To test its performance with a real-world application, we asked Gemini Diffusion to build a video chat interface. In under two seconds, Gemini Diffusion created a working interface with a video preview and an audio meter.
While this was not a complex implementation, it could be the start of an MVP that can be completed with a bit of further prompting. Note that Gemini 2.5 Flash also produced a working interface, albeit at a slightly slower pace (approximately seven seconds).
Gemini Diffusion also features “Instant Edit,” a mode where text or code can be pasted in and edited in real time with minimal prompting. Instant Edit is effective for many types of text edits, including correcting grammar, updating text to target different reader personas, or adding SEO keywords. It is also useful for tasks such as refactoring code, adding new features to applications, or converting an existing codebase to a different language.
Enterprise use cases for DLMs
It’s safe to say that any application requiring quick response times stands to benefit from DLM technology. This includes real-time and low-latency applications, such as conversational AI and chatbots, live transcription and translation, or IDE autocomplete and coding assistants.
According to O’Donoghue, for applications that leverage “inline editing, for example, taking a piece of text and making some changes in place, diffusion models are applicable in ways autoregressive models aren’t.” DLMs also have an edge with reasoning, math, and coding problems, due to “the non-causal reasoning afforded by the bidirectional attention.”
DLMs are still in their infancy; however, the technology could potentially transform how language models are built. Not only do they generate text at a much higher rate than autoregressive models, but their ability to go back and fix mistakes means they may eventually also produce results with greater accuracy.
Gemini Diffusion enters a growing ecosystem of DLMs, with two notable examples being Mercury, developed by Inception Labs, and LLaDA, an open-source model from GSAI. Together, these models reflect the broader momentum behind diffusion-based language generation and offer a scalable, parallelizable alternative to traditional autoregressive architectures.