This project runs several speech recognition engines against speech samples of varying speed, accent, and length, and compares their performance. The engines are tested against the following criteria:
- Accuracy
- Prices
These criteria are based on our company OKR to be first. Emphasis is placed on accuracy, following the philosophy that "bad data is worse than no data." There is no use being first at the wrong time or while executing the wrong course of action. If an engine cannot produce reliable results, it is better not to use it at all, regardless of how quickly it produces output. Latency is also taken into consideration, however, because we cannot be first if we obtain accurate results after our competitors.
Speech data was acquired from Kaggle's Speech Accent Archive:
- Fast speech: armenian6.wav
- Slow speech: hungarian8.wav
- Thick accent: bambara4.wav
- Normal speech: serbian9.wav

An excerpt of Trump speaking on COVID-19 was used as a real-life example.

An exposition of the book of Philippians was used as a lengthy example (approx. 9 minutes).
The following engines are tested:
- CMUSphinx
- Google Speech Recognition
These were among the top engines recommended by various sources on speech recognition. Some well-known engines that focus more on natural language processing (NLP) were excluded because the maximum length of speech they accept is too short to be of practical use for gathering data.
The system is set up so that new speech engines can be added with minimal, if any, changes to the code.
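One way such an extensible setup could look is a small registry of engine functions. This is only a sketch under assumptions, not the project's actual code: `ENGINES`, `register_engine`, `run_all`, and the `dummy` engine are hypothetical names introduced for illustration, and a real entry would wrap a call to the engine itself.

```python
from typing import Callable, Dict

# Hypothetical registry: maps an engine name to a function that takes a
# WAV file path and returns the transcript as a string.
ENGINES: Dict[str, Callable[[str], str]] = {}


def register_engine(name: str):
    """Decorator that adds a transcription function to ENGINES."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        ENGINES[name] = func
        return func
    return decorator


@register_engine("dummy")
def dummy_engine(wav_path: str) -> str:
    # Placeholder standing in for a real engine call (assumption).
    return "hello world"


def run_all(wav_path: str) -> Dict[str, str]:
    """Run every registered engine on one file and collect the transcripts."""
    return {name: engine(wav_path) for name, engine in ENGINES.items()}
```

With this layout, adding an engine means writing one decorated function; the loop in `run_all` does not change.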
This project assumes that conversion from video to WAV or from MP3 to WAV takes the same amount of time for all engines, though whether an engine accepts MP3 files or other audio and video formats will be noted at the end of this report. Contractions are counted as one word, not two. Punctuation and capitalization are not taken into consideration, but could be a plus and will be noted at the end of this report. If an engine fails to process a sound file because it cannot recognize it, whatever was not processed counts as an error.
This project is broken up into a series of smaller experiments testing for the accuracy and latency described above.
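The counting rules above could be applied with a small normalization step before scoring. The sketch below is an assumption about how that might look, not the project's actual code: `normalize` is a hypothetical helper that lowercases the text, strips punctuation except apostrophes (so a contraction stays one word), and treats a missing transcript as empty so that every reference word is later counted as an error.

```python
import re
from typing import List, Optional


def normalize(text: Optional[str]) -> List[str]:
    """Turn a transcript into a list of comparable words."""
    if text is None:
        # Engine produced nothing: an empty word list means every
        # reference word will be scored as an error.
        return []
    # Ignore capitalization and unify curly apostrophes.
    text = text.lower().replace("\u2019", "'")
    # Replace punctuation with spaces, keeping apostrophes so that
    # "don't" remains a single word.
    text = re.sub(r"[^\w'\s]", " ", text)
    return text.split()
```

For example, `normalize("Don't STOP, please!")` yields three words, with the contraction kept intact.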
Each trial consists of:
- Feeding the sound file into the speech recognition engine
- Recording the engine's guess and calculating its word error rate, or WER (using this method)
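For reference, WER is commonly defined as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It is a minimal illustration of that standard definition and may differ in detail from the method linked above; `word_error_rate` is a hypothetical name.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # hypothesis is empty: i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)
```

An empty hypothesis gives a WER of 1.0, which matches the rule that an unprocessed file counts as an error on everything that was not processed. Note that WER can exceed 1.0 when the engine inserts many extra words.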
The results will be saved in a spreadsheet in the root directory of this project.
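A minimal sketch of how one trial's result might be recorded, assuming the spreadsheet is a CSV file. The column names and the helpers `timed_transcribe` and `append_result` are assumptions made for illustration; the project's actual spreadsheet layout is not specified here.

```python
import csv
import os
import time
from typing import Callable, Tuple

# Assumed column layout for the results file.
FIELDS = ["engine", "file", "wer", "latency_seconds"]


def timed_transcribe(engine: Callable[[str], str], wav_path: str) -> Tuple[str, float]:
    """Run one engine on one file and measure wall-clock latency."""
    start = time.perf_counter()
    transcript = engine(wav_path)
    return transcript, time.perf_counter() - start


def append_result(csv_path: str, engine_name: str, wav_file: str,
                  wer: float, latency: float) -> None:
    """Append one trial to the CSV, writing the header for a new file."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "engine": engine_name,
            "file": wav_file,
            "wer": wer,
            "latency_seconds": latency,
        })
```

Appending one row per trial keeps earlier results intact if a later engine fails partway through a run.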