The latest JAIS large language model (LLM), JAIS 70B, was released today by Inception, a G42 company specializing in the development of advanced AI models and applications, all provided as a service. A 70-billion-parameter model, JAIS 70B is built for developers of Arabic-based natural-language processing (NLP) solutions and promises to accelerate the integration of Generative AI services across industries, enhancing capabilities in areas such as customer service, content creation, and data analysis.
JAIS 70B delivers Arabic-English bilingual capabilities at an unprecedented size and scale for the open-source community. As a 70-billion-parameter model, it is better able to handle complicated and nuanced tasks and to process complex datasets. JAIS 70B was developed using continuous training, a process of fine-tuning a pre-trained model, on 370 billion tokens, of which 330 billion were Arabic tokens, the largest Arabic dataset ever used to train an open-source foundational model.
In this release, the company has also unveiled a comprehensive suite of JAIS foundation and fine-tuned models: 20 models across 8 sizes, ranging from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data, with variants specifically fine-tuned for chat applications. In response to feedback from the Arabic NLP community, this extensive release now delivers a breadth of tools, including the first Arabic-centric model small enough to run on a laptop, offering both small, compute-efficient models for targeted applications and advanced model sizes for enterprise precision.
This suite of JAIS models accommodates a wide range of use cases and aims to accelerate innovation, development, and research across multiple downstream applications for the Arabic-speaking and bilingual community.
Dr. Andrew Jackson, CEO, Inception said: “AI is now a proven value-adding force, and large language models have been at the forefront of the AI adoption spike. JAIS was created to preserve Arabic heritage, culture, and language, and to democratize access to AI. Releasing JAIS 70B and this new family of models reinforces our commitment to delivering the highest quality AI foundation model for Arabic speaking nations. The training and adaptation techniques we are delivering successfully for Arabic models are extensible to other under-served languages and we are excited to be bringing this expertise to other countries.”
Inception released JAIS-13B and JAIS-13B-chat in August 2023 and subsequently launched the state-of-the-art Arabic-centric models JAIS-30B and JAIS-30B-chat. In benchmarking, JAIS 70B and JAIS 70B-chat have proven even more performant than the previous models in both English and Arabic.
Neha Sengupta, Principal Applied Scientist, Inception said: “For models up to 30 billion parameters, we successfully trained JAIS from scratch, consistently outperforming adapted models in the community. However, for models with 70 billion parameters and above, the computational complexity and environmental impact of training from scratch were significant. We made a choice to build JAIS 70B on the Llama2 model, allowing us to leverage the extensive knowledge base of an existing English model and develop a more efficient and sustainable solution.”
JAIS 70B retains, and in specific cases exceeds, the high-quality English-language processing capabilities of Llama2, while substantially outperforming the base model on Arabic outputs. To improve Arabic text-processing efficiency, the JAIS development team trained an expanded tokenizer based on the Llama2 tokenizer, doubling the model’s base vocabulary. According to Sengupta, the model “splits Arabic words less aggressively and makes training and inferencing cheaper” than the standard Llama2 model.
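To illustrate the effect in practice, the minimal sketch below compares token counts for the same Arabic sentence under a base Llama2 tokenizer and an expanded JAIS tokenizer. The Hugging Face repo ids, the sample sentence, and the comparison script are illustrative assumptions, not the team’s actual evaluation code.

```python
# Minimal sketch (illustrative, not the JAIS team's code): compare how many
# tokens two tokenizers produce for the same Arabic sentence.
# The repo ids below are assumptions; substitute the checkpoints listed on the
# official Hugging Face page.
from transformers import AutoTokenizer

arabic_text = "الذكاء الاصطناعي يغير طريقة عملنا وتعلمنا."  # "AI is changing how we work and learn."

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")          # base Llama2 tokenizer
jais_tok = AutoTokenizer.from_pretrained("inceptionai/jais-adapted-70b-chat")  # assumed JAIS repo id

for name, tok in [("Llama2", base_tok), ("JAIS 70B", jais_tok)]:
    ids = tok.encode(arabic_text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens, vocabulary size {tok.vocab_size}")

# Fewer tokens per Arabic sentence means shorter sequences, which is where the
# training and inference savings come from.
```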
Users can download the JAIS models and access the technical paper and benchmarking data by visiting the dedicated page on Hugging Face: https://huggingface.co/inceptionai
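As a quick-start reference, the sketch below shows one way to load a JAIS chat model with the Hugging Face transformers library and generate a response. The repo id is a placeholder to be replaced with an actual model name from the page above, and the sketch assumes the Llama2-based 70B variant loads with the standard causal-LM classes.

```python
# Minimal usage sketch, assuming a JAIS chat checkpoint published under the
# inceptionai organization; the repo id below is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inceptionai/jais-adapted-70b-chat"  # placeholder; check the Hugging Face page for exact names

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a 70B model in memory
    device_map="auto",           # shard layers across available GPUs
)

prompt = "ما هي عاصمة الإمارات العربية المتحدة؟"  # "What is the capital of the United Arab Emirates?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```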