Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, its first open-source multimodal language model capable of seamlessly integrating text and speech as both input and output.
As such, it competes directly with OpenAI’s GPT-4o (also natively multimodal) and other multimodal models such as Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.
Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to overcome the limitations of existing AI voice experiences by offering more expressive and natural-sounding speech generation while learning tasks across modalities, such as automatic speech recognition (ASR), text-to-speech (TTS) and speech classification.
Unfortunately for entrepreneurs and business leaders, the model is currently available only for non-commercial use under Meta’s FAIR Noncommercial Research License, which grants users the right to use, reproduce, modify and create derivative works of the Meta Spirit LM models, but only for non-commercial purposes. Any distribution of these models or derivatives must also comply with the non-commercial use restriction.
A brand new approach to text and speech
Traditional AI voice systems rely on automatic speech recognition to transcribe spoken input before passing it to a language model, whose output is then converted back to audio using text-to-speech techniques.
While effective, this pipeline often sacrifices the expressive qualities inherent in human speech, such as tone and emotion. Meta Spirit LM offers a more integrated solution, adding phonetic, pitch and tone tokens to overcome these limitations.
Meta released two versions of Spirit LM:
• Spirit LM Base: uses phonetic tokens to process and generate speech.
• Spirit LM Expressive: adds pitch and tone tokens so the model can capture more varied emotional states, such as excitement or sadness, and reflect them in the generated speech.
Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks such as speech-to-text and text-to-speech while preserving the natural expressiveness of speech in its output.
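To make the token-interleaving idea concrete, here is a minimal, illustrative Python sketch. The token spellings and the helper function are invented for this example and do not come from Meta’s code; the actual tokenizers (phonetic units, plus pitch and style quantizers in the Expressive variant) are described in the accompanying paper.

```python
# Illustrative sketch only: hypothetical token names showing how text tokens
# and speech tokens could be interleaved into a single sequence. Nothing here
# calls Meta's released code.

def build_interleaved_sequence(text_tokens, speech_tokens, expressive=False):
    """Concatenate a text span and a speech span into one token stream.

    Spirit LM Base represents speech with phonetic unit tokens; the
    Expressive variant additionally inserts pitch and tone/style tokens so
    that prosody survives the round trip. Token spellings are made up.
    """
    sequence = ["[TEXT]"] + list(text_tokens) + ["[SPEECH]"]
    for tok in speech_tokens:
        sequence.append(tok)                      # phonetic unit, e.g. "Hu42"
        if expressive and tok.endswith("_stressed"):
            sequence.append("Pitch_high")         # hypothetical pitch token
            sequence.append("Style_excited")      # hypothetical style token
    return sequence


if __name__ == "__main__":
    text = ["The", "weather", "is"]
    speech = ["Hu12", "Hu7_stressed", "Hu93"]     # placeholder unit tokens
    print(build_interleaved_sequence(text, speech, expressive=True))
```

Because both modalities live in one token stream, a single language model can continue a prompt in either text or speech, which is what enables the cross-modal tasks described above.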
Non-commercial open source software – available for research purposes only
Consistent with Meta’s commitment to open science, the company has made Spirit LM fully open source, providing researchers and developers with the model weights, code, and supporting documentation to build on.
Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new ways of integrating speech and text in AI systems.
The release also includes a research paper describing the model’s architecture and capabilities in detail.
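For readers who want to experiment with the released checkpoints, the snippet below sketches what inference might look like. The identifiers follow the examples in Meta’s spiritlm repository (github.com/facebookresearch/spiritlm) as recalled here, so treat them as assumptions and check the repository’s README for the current API and for access to the gated weights.

```python
# Hedged usage sketch: class and argument names follow Meta's released
# spiritlm repository as recalled, and may differ; verify against the repo.
from transformers import GenerationConfig

from spiritlm.model.spiritlm_model import (
    ContentType,
    GenerationInput,
    OutputModality,
    Spiritlm,
)

# Load the base 7B checkpoint (weights must be requested from Meta and
# placed where the repository expects them).
spirit_lm = Spiritlm("spirit-lm-base-7b")

# Text in, text out; the same call can mix speech and text inputs, and
# speech output can be requested by changing output_modality.
outputs = spirit_lm.generate(
    output_modality=OutputModality.TEXT,
    interleaved_inputs=[
        GenerationInput(
            content="The largest country in the world is",
            content_type=ContentType.TEXT,
        )
    ],
    generation_config=GenerationConfig(
        temperature=0.9, top_p=0.95, max_new_tokens=50, do_sample=True
    ),
)
print(outputs)
```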
Mark Zuckerberg, CEO of Meta, is a strong supporter of open-source AI, stating in a recent open letter that AI has the potential to “enhance human productivity, creativity and quality of life” while accelerating progress in areas such as medical research and scientific discovery.
Applications and future potential
Meta Spirit LM is designed to learn new tasks across a range of modalities, such as:
• Automatic Speech Recognition (ASR): Converting spoken language into written text.
• Text-to-speech (TTS): Generating spoken language from written text.
• Speech classification: Identifying and categorizing speech based on its content or emotional tone.
The Spirit LM Expressive model goes a step further by incorporating emotional cues into speech generation.
For example, it can detect and reflect emotional states such as anger, surprise, or joy in its output, making interactions with AI feel more human and engaging.
This has significant implications for applications such as virtual assistants, customer support bots, and other interactive AI systems where more nuanced and expressive communication is required.
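As a rough illustration of what speech classification on top of expressive output could look like, the sketch below maps hypothetical style tokens in a generated token stream to coarse emotion labels. The token names and the mapping are invented for this example and are not part of Meta’s release.

```python
# Illustrative sketch, not Meta's code: infer a coarse emotion label from
# hypothetical Spirit LM Expressive style tokens found in a token stream,
# the kind of signal a downstream assistant might use to adapt its reply.
from collections import Counter

STYLE_TO_EMOTION = {            # made-up token spellings for illustration
    "Style_excited": "joy",
    "Style_angry": "anger",
    "Style_sad": "sadness",
    "Style_neutral": "neutral",
}

def dominant_emotion(token_stream):
    """Return the most frequent emotion implied by style tokens, if any."""
    counts = Counter(
        STYLE_TO_EMOTION[tok] for tok in token_stream if tok in STYLE_TO_EMOTION
    )
    return counts.most_common(1)[0][0] if counts else "unknown"

if __name__ == "__main__":
    stream = ["Hu12", "Style_excited", "Hu7", "Pitch_high", "Style_excited"]
    print(dominant_emotion(stream))   # -> "joy"
```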
A broader effort
Meta Spirit LM is part of a broader set of research tools and models made publicly available by Meta FAIR. This includes an update to Meta’s Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used in fields such as medical imaging and meteorology, as well as research into improving the performance of large language models.
Meta’s overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both efficient and accessible.
The FAIR team has been sharing its research for over a decade, with the goal of advancing artificial intelligence in ways that benefit not only the technology community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what artificial intelligence can achieve in natural language processing.
What’s next for Spirit LM?
With the release of Meta Spirit LM, Meta is taking a significant step forward in integrating speech and text in AI systems.
By offering a more natural and expressive approach to AI-generated speech and making the model open source, Meta enables the broader research community to explore new opportunities for multimodal AI applications.
Whether in ASR, TTS or beyond, Spirit LM represents a promising advancement in the field of machine learning, with the potential to power the next generation of more human-like AI interactions.