Automatic Speech Recognition Challenge - Overview
Welcome to the Automatic Speech Recognition (ASR) Challenge! This competition focuses on developing and evaluating speech recognition systems that can accurately transcribe Vietnamese speech into text.
Automatic Speech Recognition is a technology that converts spoken language into written text. It has numerous applications including voice assistants, transcription services, accessibility tools, and more.
In this challenge, participants will develop systems that transcribe Vietnamese speech recordings into accurate text transcriptions.
Systems will be evaluated using the Word Error Rate (WER) metric, which measures the minimum number of word edits (insertions, deletions, and substitutions) required to transform the system's output into the reference transcript, divided by the number of words in the reference. Lower WER values indicate better performance.
For questions or support, please use the competition forum or contact the organizers at [email protected], CC'ing [email protected] and [email protected].
Automatic Speech Recognition Challenge - Evaluation
The primary evaluation metric for this challenge is the Word Error Rate (WER), expressed as a percentage (%). WER measures how accurately your ASR system transcribes speech into text.
Word Error Rate is calculated as:

WER = (S + D + I) / N

Where:
- S is the number of substituted words
- D is the number of deleted words
- I is the number of inserted words
- N is the number of words in the reference transcript
Lower WER values indicate better performance. A perfect system would have a WER of 0%, meaning the transcription exactly matches the reference.
We use the implementation from jiwer to calculate WER: https://github.com/jitsi/jiwer
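For reference, a minimal WER check with jiwer looks like the sketch below (the organizers' exact evaluation script may differ):

```python
import jiwer

reference = "xin chào việt nam"
hypothesis = "xin chào việt nam hôm nay"

# jiwer.wer returns the word error rate as a fraction (0.5 == 50%).
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # WER: 50.00%
```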
Submission Format
Your submission should be a plain-text file named transcripts.txt containing one transcription per audio file. Each line should contain only the transcript, in the same order as the provided test list.
For example, if the first line of the test list is "audio1.wav", then the first line in your submitted prediction file should be the prediction for that audio.
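A minimal script to produce transcripts.txt in the correct order might look like the following sketch (the test-list file name "test_list.txt" and the transcribe() helper are hypothetical placeholders, not official names):

```python
def transcribe(audio_path: str) -> str:
    # Hypothetical placeholder: replace with your ASR system's inference call.
    return ""

# Read the provided test list (assumed file name), one audio path per line.
with open("test_list.txt", encoding="utf-8") as f:
    audio_files = [line.strip() for line in f if line.strip()]

# Write one transcript per line, in the same order as the test list.
with open("transcripts.txt", "w", encoding="utf-8") as out:
    for audio_path in audio_files:
        out.write(transcribe(audio_path) + "\n")
```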
Before calculating WER, a set of text normalization steps is applied to both the reference and hypothesis transcriptions.
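As a rough illustration only, a typical jiwer normalization pipeline might look like the sketch below; the specific transforms here (lowercasing, punctuation removal, whitespace collapsing) are assumptions, not the official configuration:

```python
import jiwer

# Assumed, typical normalization steps, not the official list.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

print(normalize("Xin chào,   Việt Nam!"))  # xin chào việt nam
```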
The evaluation process works as follows:
1. Your submitted transcriptions are compared with the ground-truth transcriptions.
2. For each audio file, the WER is calculated between your transcription and the reference.
3. The final score is the average WER across all audio files (see the sketch after this list).
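Put together, the scoring loop might look like this sketch (the reference file name "references.txt" is an assumption, and the official scoring script may differ, for instance in how it normalizes text):

```python
import jiwer

def load_lines(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

references = load_lines("references.txt")   # assumed file name (held by organizers)
hypotheses = load_lines("transcripts.txt")  # your submitted predictions

# Per-file WER, then averaged across all audio files, as described above.
scores = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
final_score = sum(scores) / len(scores)
print(f"Final score (mean WER): {final_score:.2%}")
```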
For example, if the reference transcription is:
xin chào việt nam
And your system's transcription is:
xin chào việt nam hôm nay
The WER calculation would be:

WER = (S + D + I) / N = (0 + 0 + 2) / 4 = 0.5 = 50%

since the hypothesis adds two words ("hôm" and "nay") to a four-word reference, with no substitutions or deletions.
Participants will be ranked based on their WER score, with lower values being better. In case of ties, earlier submissions will be ranked higher.
Automatic Speech Recognition Challenge - Terms and Conditions
The challenge organizers reserve the right to modify these terms and conditions at any time. Participants will be notified of any changes. The decisions of the challenge organizers regarding any aspect of the competition are final.
For questions or clarifications regarding these terms, please contact the challenge organizers at [email protected], CC'ing [email protected] and [email protected].
Start: April 1, 2025, midnight
Start: Sept. 1, 2025, midnight
End: Dec. 31, 2025, midnight