Artificial Intelligence ecosystem provider AI Singapore has released SEA-LION v2, the latest in its family of open-source language models specifically designed to understand and represent Southeast Asia’s linguistic and cultural diversity.
This enhanced model aims to provide more accurate and contextually relevant language processing capabilities tailored to the unique needs of the region, Dr. Leslie Teo, Senior Director at AI Singapore and the project lead, said in a LinkedIn post.
SEA-LION v2 is part of AI Singapore’s mission to develop AI capabilities, create social and economic impact, nurture local talent, build an AI ecosystem, and position the island nation as a global AI leader. The model builds on a state-of-the-art open-source foundation model, with continued pre-training and fine-tuning on Southeast Asian data.
Unlike the original SEA-LION, which was trained from scratch, the second version is built on Meta’s Llama 3.
The original model was trained using 8x Nvidia A100 GPUs and created by a lean team of 20 Singaporeans. It outperformed other large language models (LLMs) on Southeast Asian tasks.
By contrast, Version 2 was trained on 64x Nvidia H100 GPUs in just two days per run, excluding the numerous experiments with hyperparameters and data mixes. Dr. Teo noted that the main challenge with continued pre-training (CPT) lies in retaining existing knowledge while integrating new information.
The project is now part of the National Multimodal LLM Programme (NMLP), which sees the Singapore government setting aside SG$70 million (US$51.8 million) to develop AI talent.
AI Singapore reports that SEA-LION v2 demonstrates superior performance on tasks in regional languages while retaining Llama 3’s general capabilities. According to Teo, SEA-LION v2 includes the following key features:
- Continued pre-training and fine-tuning: Built on the Llama 3 architecture.
- Multilingual capabilities: Instruction-tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil.
- SEA training data: Trained with approximately 50 billion tokens from Southeast Asian languages.
- Open source: Licensed under the Meta Llama 3 Community License.
For now, SEA-LION v2 is available for download on HuggingFace as a base model, an instruction-tuned model, or quantised variants. While there is no online demo, the instruction-tuned model supports basic chat interactions when deployed in a suitable environment.
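Since the checkpoints are distributed through HuggingFace, loading the instruction-tuned variant would follow the standard transformers workflow. The minimal sketch below uses a placeholder repository id and assumes the tokenizer ships with a chat template; the actual model names, licence terms, and hardware requirements should be checked on AI Singapore’s HuggingFace page.

```python
# Minimal sketch: load an instruction-tuned SEA-LION v2 checkpoint and run one chat turn.
# The repository id is a placeholder assumption, not the confirmed model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aisingapore/sea-lion-v2-instruct"  # assumed placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")

# Format a single user message with the tokenizer's chat template (assumed to be present).
messages = [{"role": "user", "content": "Terjemahkan ke Bahasa Indonesia: Good morning, everyone!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and print only the newly produced tokens.
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```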
Plans are also underway to build future versions on Google’s Gemma 2 and on models from AI startup Reka.