14th July 2022
An open-access, multilingual AI
A new language model similar in scale to GPT-3 is being made freely available and could help to democratise access to AI.
BLOOM (which stands for BigScience Large Open-science Open-access Multilingual Language Model) has been developed by 1,000 volunteer researchers from over 70 countries and 250 institutions, supported by ethicists, philosophers, and legal experts, in a collaboration called BigScience. The project, coordinated by New York-based startup Hugging Face, used funding from the French government.
The new AI took more than a year of planning and training, including a final training run of 117 days (11th March – 6th July) on Jean Zay, one of Europe's most powerful supercomputers, located south of Paris, France.
In AI language models, the term "parameters" refers to the adjustable numerical values, learned during training, that determine how input data is transformed into an output. They are loosely analogous to the strengths of the connections between neurons in a human brain, rather than to the neurons themselves. BLOOM's parameter count (176 billion) is only slightly higher than that of GPT-3 (175 billion), perhaps the best-known of recent models. However, BLOOM offers major advantages.
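To make the idea of a parameter concrete, here is a minimal Python sketch of a single tiny network layer. It is a toy illustration only and not BLOOM's actual architecture; the function names and all the numbers are hypothetical, chosen purely for the example. Each weight and bias is one parameter, and together they decide how the inputs become outputs.

```python
# Toy illustration only: one small dense layer, not BLOOM's architecture.
# All names and values here are hypothetical.

def dense_layer(inputs, weights, biases):
    """Turn inputs into outputs using the layer's parameters."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def count_parameters(weights, biases):
    """Every individual weight and every bias counts as one parameter."""
    return sum(len(row) for row in weights) + len(biases)

# Three inputs mapped to two outputs: 3 x 2 weights plus 2 biases.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.0, -0.5]]
biases = [0.25, -0.75]

outputs = dense_layer([1.0, 2.0, 3.0], weights, biases)
n_params = count_parameters(weights, biases)
print(outputs, n_params)  # prints [4.75, -0.75] 8
```

This toy layer has eight parameters. A model on the scale of BLOOM applies the same basic idea with 176 billion of them, whose values are set during training rather than written by hand.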
Firstly, it can generate text in 46 natural languages and 13 programming languages. For almost all of these languages, including Spanish, French, and Arabic, BLOOM will be the first language model with over 100 billion parameters.
The next major feature of BLOOM is the open and transparent nature of its development. The current generation of large-scale AI models – such as OpenAI's GPT-3 and Google's LaMDA – is largely hidden from public inspection. By contrast, the team behind BLOOM is making its code freely available.
The textual sources used for training the AI are extremely diverse, equivalent to the content of several million books, ranging from literature to scientific articles, radio transcriptions, podcasts, and sports news. Its languages are highly varied too, including 20 from Africa. Combining content in various languages makes it possible to train powerful and robust models, often yielding better results than monolingual models, according to the researchers. Code in 13 different programming languages accounted for 10.8% of its input, as shown in the pie chart below.
Last but not least, BLOOM is distributed under a Responsible AI Licence, explicitly prohibiting its use for malicious purposes. While the laws around language models are yet to be fully fleshed out, this licence can function like a terms of service agreement, designed to deter the use of BLOOM in high-risk applications that harm, deceive, or exploit people.
In a blog post, its creators write: "BLOOM can be asked to produce summaries or translations of text, output code from instructions, and follow prompts to perform original tasks such as writing recipes, extracting information from a news article, or composing sentences using a newly-defined invented word [...] BLOOM's performance will continue to improve as the workshop continues to experiment and advance on top of BLOOM."
In addition to its language versatility, BLOOM could help to solve the problems of bias and toxicity encountered with previous AI models. The team behind it hope that BLOOM will spur new ways of eliminating falsehoods and prejudices against races, religions, sexes, and people with disabilities.
"We're slated to add more languages, make the model smaller so it's easier to use at the same level of performance – and we'll support community efforts to expand it," says the BigScience collaboration. "BLOOM is a living family of models that will grow, not a one-and-done model."
"The creation of the BLOOM model and the success of the BigScience research collaboration demonstrate that another way of creating, studying, and sharing AI innovations is possible, bringing together industry, academia, and non-profits around an international, multidisciplinary, open-access project," said Thomas Wolf, co-founder and chief science officer of Hugging Face. "I am thrilled that Hugging Face was able to find the support it needed in France to pursue a novel approach of global scale."
"BLOOM shows the continued power of open source and open science even for expensive, large foundational models," said Richard Socher, an investor and mentor at AIX Ventures, in an interview with TechCrunch. "It also shows that in AI, no organisation has a major edge for very long. Once an organisation shows that something is doable, the same capabilities will appear six to 12 months after in other places."