AudioMind goes beyond speech recognition and discerns tone, gender, emotions

Soniox co-founders Ambroz Bizjak and Klemen Simonic (R)

Klemen Simonic met Ambroz Bizjak at the University of Ljubljana, Slovenia, during their undergraduate studies. After graduating, they went in different directions: Simonic joined Facebook and developed its speech systems, while Bizjak worked at Cosylab, where he developed the core software for control systems for particle accelerators, fusion reactors, and cancer therapy systems.

After spending several years in the corporate world, the duo got together to embark on a new journey to understand humans through audio AI technologies.

This led them to start Soniox.

Soniox, a startup based in the US, has developed AudioMind, a foundational AI model designed to understand audio and all the information it carries.


“Through interactions with our customers, we recognised a growing demand for capabilities beyond mere speech-to-text conversion. Clients expressed interest in features such as sentiment detection, summarisation, and audio event recognition, indicating a clear need for a more versatile audio intelligence solution,” says Simonic. “Driven by this demand, we conceived the idea of AudioMind, a general-purpose intelligence for audio that could perform a wide range of tasks, akin to text-based Large Language Model operators.”

Comprehensive audio processing

According to Simonic, AudioMind distinguishes itself from traditional speech recognition technology by offering a “comprehensive” approach to audio processing. Unlike similar apps on the market that focus on converting speech to text, AudioMind natively processes audio as its input modality, enabling it to make full use of all the information in the audio signal.

“Our solution offers a wide range of capabilities beyond simple transcription. Through prompting mechanisms, AudioMind empowers users to specify how they want the audio content to be interpreted,” he shares.

AudioMind supports a wide range of instructions for converting speech to text. For instance, to transcribe speech, one can use a simple prompt like ‘Transcribe this audio for me, please’, or ‘Transcribe this audio into a polished transcript’.

“AudioMind introduces a groundbreaking focus on speaker intelligence. Unlike conventional systems that primarily transcribe speech without distinguishing between speakers, our solution offers advanced capabilities to separate and identify speakers within a conversation accurately,” Simonic claims.

Furthermore, the app allows users to “effortlessly” generate speaker-separated and labelled transcriptions, summaries, and documents. By providing prompts, users can instruct AudioMind on how they want the document to be organised and structured, including specifying titles and sections.

Understanding tone, gender, and emotions

Human communication does not rely solely on speech or text; it also encompasses tone, intonation, and emotional cues. AudioMind can decipher these elements to provide a more comprehensive understanding of communication.

For instance, in customer service industries, recognising the tone of a customer’s voice can help gauge satisfaction levels or detect frustration. This insight enables businesses to tailor their responses appropriately, leading to improved customer experiences and satisfaction.

It can also discern emotions and aid sentiment analysis, allowing organisations to gauge public opinion, customer sentiment, or patient well-being accurately. For example, in mental health care, analysing the emotional tone of patient conversations can assist therapists in tracking progress or identifying potential issues.

The solution also supports certain types of background filtering. By filtering out background noise and irrelevant sounds, it can focus on extracting meaningful information from the audio input. This directly improves the accuracy of downstream tasks.

Limitless opportunities

The entrepreneur duo sees “limitless” opportunities for their solution, given the ubiquity of audio, voice, and speech across diverse sectors. Beyond traditional speech transcription, AudioMind holds promise in healthcare, where it can facilitate the creation of medical documentation through voice input, improving efficiency and accuracy.

In customer service, AudioMind allows for enhanced interactions between agents and customers, improving satisfaction and retention rates.

Moreover, AudioMind can “interpret users’ voices with precision” in virtual assistants and voice-enabled devices, opening up new possibilities for intuitive and personalised experiences.

“AudioMind has been meticulously trained to listen and understand audio in a manner akin to human processing. Through extensive training with diverse audio datasets, it has developed the capability to recognise and understand various types of sounds, including those originating from the environment and those produced by humans,” Simonic explains. “This distinction is crucial for comprehending the surrounding context within an audio environment.”


For example, while speech recognition systems may focus solely on transcribing spoken words, AudioMind goes further, recognising nuances such as laughter, which indicates humour, or crying, which signals distress.

The startup plans to broaden its language support beyond English, aiming to enhance its usability and break down language barriers for users worldwide. “We recognise the importance of linguistic diversity and understand that catering to multiple languages is crucial for reaching a global audience. While we are still finalising the list, some of the languages under consideration include Spanish, Korean, Mandarin Chinese, French, German, Portuguese and Italian,” he adds.

“Our goal is to ensure that AudioMind becomes accessible and beneficial to users from diverse linguistic backgrounds, facilitating seamless communication and interaction across borders and cultures,” Simonic concludes.


The post AudioMind goes beyond speech recognition and discerns tone, gender, emotions appeared first on e27.