Textless NLP

Text-based language models such as BERT, RoBERTa, and GPT-3 have made huge strides in recent years. When given written words as input, they can generate extremely realistic text on virtually any topic. In addition, they also provide useful pretrained models that can be fine-tuned for a variety of difficult natural language processing (NLP) applications, including sentiment analysis, translation, information retrieval, inferences, and summarization, using only a few labels or examples (e.g., BART and XLM-R).
There is an important limitation, however: These applications are mainly restricted to languages with very large text data sets suitable for training AI models.
Facebook AI is introducing Generative Spoken Language Model (GSLM), the first high-performance NLP model that breaks free of this dependence on text. GSLM leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text. It opens the door to a new era of textless NLP applications for potentially every language spoken on Earth—even those without significant text data sets.
Source: https://ai.facebook.com/blog/textless-nlp-generating-expressive-speech-from-raw-audio/