Look under the hood of transformers, the engine driving the evolution of AI models

Today, virtually every cutting-edge AI product and model uses the transformer architecture. Large language models (LLMs) such as GPT-4o, Llama, Gemini and Claude are all transformer-based, and other AI applications such as text-to-speech, automatic speech recognition, image generation and text-to-video models have transformers as their underlying technology.

Because the hype around AI is unlikely to slow down anytime soon, it is time to give transformers their due, which is why I'd like to explain a little about how they work, why they are so important for the growth of scalable solutions and why they are the backbone of LLMs.


Transformers are more than meets the eye

In short, a transformer is a neural network architecture designed to model sequences of data, making it ideal for tasks such as language translation, sentence completion, automatic speech recognition and more. Transformers have become the dominant architecture for many of these sequence modeling tasks because the underlying attention mechanism can be easily parallelized, allowing for massive scale during both training and inference.
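As an illustration of that parallelism, here is a minimal sketch, assuming PyTorch and toy dimensions rather than any particular production model, that runs self-attention over an entire sequence in one batched operation, with no token-by-token recurrence:

```python
# A minimal sketch (assuming PyTorch) of why attention parallelizes well:
# every position in the sequence attends to every other position in a single
# batched matrix operation, with no step-by-step recurrence.
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 16, 64        # illustrative sizes only
x = torch.randn(batch, seq_len, d_model)   # a toy batch of token embeddings

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

# Self-attention: queries, keys and values all come from the same sequence.
out, weights = attn(x, x, x)

print(out.shape)      # torch.Size([2, 16, 64]) -- one pass over the whole sequence
print(weights.shape)  # torch.Size([2, 16, 16]) -- every token attends to every token
```

Every position's output is computed from the whole sequence at once, which is what makes large-scale training on GPUs practical.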

Originally introduced in the 2017 paper "Attention Is All You Need" from researchers at Google, the transformer was presented as an encoder-decoder architecture designed specifically for language translation. The following year, Google released Bidirectional Encoder Representations from Transformers (BERT), which could be considered one of the first LLMs, although it is now considered small by today's standards.

Since then, and especially accelerated by the arrival of GPT models from OpenAI, the trend has been to train bigger and bigger models with more data, more parameters and longer context windows.

To facilitate this evolution, there have been many innovations, such as: more advanced GPU hardware and better software for multi-GPU training; techniques like quantization and mixture of experts (MoE) to reduce memory consumption; new optimizers for training, such as Shampoo and AdamW; and techniques for computing attention efficiently, such as FlashAttention and KV caching. This trend will likely continue for the foreseeable future.
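To make one of these concrete, here is a rough sketch of the KV-caching idea, using plain PyTorch tensors and made-up weight matrices rather than any specific library's API: during autoregressive decoding, keys and values for past tokens are stored so that each new token only computes its own key/value row instead of reprocessing the whole prefix.

```python
# A rough sketch (not any specific library's API) of the idea behind KV caching:
# keys and values for previously generated tokens are cached, so each decoding
# step adds one new row rather than recomputing K and V for the entire prefix.
import torch

d_model = 64
w_q = torch.randn(d_model, d_model)   # toy projection matrices
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []             # grows by one row per generated token

def decode_step(new_token_embedding):
    """Attend the newest token over the cached prefix plus itself."""
    q = new_token_embedding @ w_q                 # (1, d_model)
    k_cache.append(new_token_embedding @ w_k)     # cache this token's key
    v_cache.append(new_token_embedding @ w_v)     # cache this token's value
    k = torch.cat(k_cache, dim=0)                 # (t, d_model), reused as-is
    v = torch.cat(v_cache, dim=0)
    scores = (q @ k.T) / d_model ** 0.5
    return torch.softmax(scores, dim=-1) @ v      # (1, d_model)

for _ in range(5):                                # toy decoding loop
    out = decode_step(torch.randn(1, d_model))
print(out.shape, len(k_cache))                    # torch.Size([1, 64]) 5
```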

The importance of self-attention in transformers

Depending on the application, a transformer model follows an encoder-decoder architecture. The encoder component learns a vector representation of data that can then be used for downstream tasks like classification and sentiment analysis. The decoder component takes a vector or latent representation of the text or image and uses it to generate new text, making it useful for tasks like sentence completion and summarization. For this reason, many of today's best-known models, such as the GPT family, are decoder-only.

Encoder-decoder models combine both components, making them useful for translation and other sequence-to-sequence tasks. For both encoder and decoder architectures, the core building block is the attention layer, as it is what allows a model to retain context from words that appear much earlier in the text.
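As a concrete picture of that layout, here is a minimal sketch using PyTorch's built-in nn.Transformer with toy, randomly generated inputs; it is only meant to show where the source and target sequences enter the model, not to be a working translation system:

```python
# A minimal sketch (assuming PyTorch's built-in nn.Transformer) of the
# encoder-decoder layout: the encoder ingests the source sequence, and the
# decoder consumes the target sequence while attending to the encoder output.
import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, d_model)   # e.g. embedded English tokens (toy values)
tgt = torch.randn(1, 7, d_model)    # e.g. embedded French tokens generated so far

out = model(src, tgt)
print(out.shape)  # torch.Size([1, 7, 64]) -- one hidden state per target position
```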

Attention comes in two flavors: self-attention and cross-attention. Self-attention captures relationships between words within the same sequence, while cross-attention captures relationships between words in two different sequences. Cross-attention is what connects the encoder and decoder components in a model during translation. For example, it allows the English word "strawberry" to relate to the French word "fraise". Mathematically, both self-attention and cross-attention are different forms of matrix multiplication, which can be done extremely efficiently on a GPU.
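Here is a from-scratch sketch of that computation (scaled dot-product attention, with illustrative shapes); it shows that self-attention and cross-attention use exactly the same matrix multiplications and differ only in where the queries, keys and values come from:

```python
# A from-scratch sketch of scaled dot-product attention: both self-attention
# and cross-attention reduce to the same matrix multiplications; only the
# source of the queries, keys and values differs. Shapes are illustrative.
import torch

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V -- the core computation in every attention layer."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

d_model = 64
encoder_states = torch.randn(12, d_model)  # e.g. 12 English tokens
decoder_states = torch.randn(9, d_model)   # e.g. 9 French tokens so far

# Self-attention: Q, K and V all come from the same sequence.
self_out = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: decoder queries attend over the encoder's keys and values,
# which is how "strawberry" can line up with "fraise" during translation.
cross_out = attention(decoder_states, encoder_states, encoder_states)

print(self_out.shape, cross_out.shape)  # torch.Size([9, 64]) torch.Size([9, 64])
```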

Thanks to the attention layer, transformers can better capture relationships between words separated by long stretches of text, whereas previous models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) models lose track of the context of words that appeared earlier in the text.

The future of transformer models

Currently, transformers are the dominant architecture for many use cases that require LLMs, and they benefit from the most research and development. Although this does not seem likely to change anytime soon, one different class of model that has gained interest recently is state space models (SSMs) such as Mamba. This highly efficient algorithm can handle very long sequences of data, whereas transformers are limited by a context window.

For me, the most exciting applications of transformer models are multimodal models. OpenAI's GPT-4o, for instance, is able to handle text, audio and images, and other providers are starting to follow. Multimodal applications are very diverse, ranging from video captioning to voice cloning to image segmentation (and more). They also present an opportunity to make AI more accessible to people with disabilities. For example, a blind person could be well served by the ability to interact through the voice and audio components of a multimodal application.

This is an exciting space with plenty of potential for uncovering new use cases. Do remember, though, that, at least for the foreseeable future, these are largely underpinned by the transformer architecture.
