Despite their popularity and increasingly widespread use, today’s most popular AI tools, such as ChatGPT, suffer from a cultural gap. Since 40 per cent of the models on the market today are produced by US-based companies, they are more aligned with Western culture, creating a distance for users in markets such as Southeast Asia (SEA).
AI Singapore aims to tackle this challenge through SEA-LION, its first open-source SEA Large Language Model (LLM), which is tailored specifically to regional use cases, industries, languages, and contexts.
According to the organisation’s statement, unlike many current models, SEA-LION will give users the benefit of a model that understands nuances in native languages and demonstrates greater awareness of cultural context specific to the region.
“This lowers the bar for adoption by governments, enterprises, and academia while effectively expanding the SEA languages and cultural representation in the mainstream LLMs, which are currently dominated by models predominantly trained on a corpus of English data from the western, developed world.”
In a presentation at the National University of Singapore on January 24, Dr Leslie Teo, Senior Director of AI Products at AI Singapore, explained that the project does not intend to compete with the big producers of AI tools such as OpenAI. “Instead, we want to complement the existing tools,” he stressed.
At its start in November 2023, the SEA-LION project focused on the developer side, but it then began receiving business queries. This led the project to create the public infrastructure that is necessary in the AI space.
SEA-LION works through a partnership of different institutions, each of which contributes to the data and metrics required to develop the technology. SEA-LION works with non-copyrighted (“kosher”) materials when putting together its data.
“The data used for pre-training the model was primarily sourced from the internet, specifically the CommonCrawl Dataset, which is publicly available. This data is downloaded, cleaned, and pre-processed for use in pre-training SEA-LION. The proportion of various SEA languages in the pre-training dataset was also adjusted to reflect the distribution of languages more accurately in our region,” the project stated.
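As a rough illustration of what adjusting the proportion of languages in a dataset can look like, the sketch below uses exponent-based smoothing, a common way to rebalance multilingual corpora. This is not confirmed to be SEA-LION's method: the function name, the `alpha` parameter, and the token counts are all hypothetical.

```python
def rebalance_language_mix(token_counts, alpha=0.5):
    """Return sampling weights per language.

    Hypothetical helper, not taken from the SEA-LION codebase.
    Each raw count is raised to the power `alpha` and the results are
    normalised to sum to 1. With alpha < 1, low-resource languages get
    a larger share than their raw counts would give them; alpha = 1
    keeps the raw proportions unchanged.
    """
    scaled = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Made-up token counts, for illustration only.
raw_counts = {"en": 900, "id": 100}
weights = rebalance_language_mix(raw_counts, alpha=0.5)
print(weights)
```

In this made-up example, Bahasa Indonesia accounts for 10 per cent of the raw tokens but would be sampled about 25 per cent of the time after smoothing.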
In a demo that e27 witnessed, SEA-LION was placed side-by-side with popular LLMs such as OpenAI, Llama, and SEA LLM. All the tools were given the same questions to answer in regional languages such as Bahasa Indonesia and Thai, and the differences were interesting to see.
Of all the LLMs, SEA-LION, SEA LLM and OpenAI were the ones that were able to generate answers in Bahasa Indonesia and Thai.
SEA-LION and OpenAI tended to give straightforward answers that were tailored for the chatbox. While OpenAI was slower in generating its answer, it showed a better understanding of context. These two LLMs were also the most accurate.
What is next for SEA-LION
When it comes to its practical, day-to-day use, SEA-LION aims to help enterprises in SEA incorporate AI into their workflows. For example, it can be used to enable customer service chatbots that have the capacity to capture local nuances in SEA languages, enhance fraud detection on online marketplaces in SEA, and enable more accurate translation and summarisation of information in regional languages.
In his presentation, Dr Teo also mentioned a use case where SEA-LION is used to help with legal advice.
For the development of SEA-LION, AI Singapore collaborated with companies such as Amazon Web Services and Google Research. It also partnered with communities such as SEACrowd to build a diverse data corpus in native languages.
The model is set to be piloted by enterprise users such as NCS and Tokopedia. Additionally, SEA-LION has garnered interest from regional government-linked entities such as KORIKA in Indonesia, which is pioneering the use of SEA-LION for various applications.
SEA-LION is publicly accessible on platforms such as Hugging Face and GitHub. In the near future, it will also be available on AWS JumpStart and Bedrock, as well as Google’s Model Garden. The model is free for both research and commercial use, with the aim of stimulating innovation and applications across various industries, languages, and contexts.
SEA-LION initially prioritises commonly used languages in SEA, including Bahasa Indonesia, Malay, Thai, and Vietnamese, with plans to expand its coverage to other Southeast Asian languages such as Burmese and Lao in the future.
In an interview with e27, Dr Teo highlighted that despite its commercial use cases, SEA-LION was not built as a commercial project. Instead, the project aims to build public infrastructure.
“If we are successful, then we will see commercial things happening … Hopefully, because of that, we will be able to keep investing in the data and metrics because language changes–everything has to be continuously updated.”
The post How SEA-LION aims to bridge the cultural gap existing in popular AI tools appeared first on e27.