PromptingNemo is a toolkit for fine-tuning end-to-end (E2E) automatic speech recognition (ASR) systems, integrating both audio-visual and speech-only models.
This repository focuses on Speech-2-Action models, built through innovation and research on top of pretrained audio and visual encoders.
These models process streamed audio or video input and output streamed tokens. Tokens are either:
- What you speak (standard transcription)
- Inferred TAGS
For the sample models we provide, TAGS inferred from context cover multi-modal language and tonal understanding, visual QA, and voice biometrics.
We provide tools for exporting NeMo models to ONNX and Hugging Face formats.
Also check out our applications for sample usage of Speech-2-Action models.
Use the pre-built Docker image:

```
docker pull WhissleAI:nemo:latest
```
Build docker from WhissleAI's Nemo branch
add docker commands
Currently, we provide models capable of tagging different tokens while transcribing.
| Language | Token-type | Data (#hrs) | #parameters | HF-link |
|---|---|---|---|---|
| Bengali | Transcription, key entities, age, gender, intent, dialect | 100 hrs | 110M | Speech-Tagger-BN-KEY |
| Marathi | Transcription, key entities, age, gender, intent, dialect | 100 hrs | 110M | Speech-Tagger-MR-KEY |
| Punjabi | Transcription, key entities, age, gender, intent, dialect | 100 hrs | 110M | Speech-Tagger-PA-KEY |
| Hindi | Transcription, key entities, age, gender, intent, dialect | 100 hrs | 110M | Speech-Tagger-HI-KEY |
| English | Transcription, NER, Emotion | 2500 hrs | 110M | Speech-Tagger-EN-NER |
| English | IOT entities and emotion | 150 hrs | 115M | Speech-Tagger-EN-IOT |
| EURO (5 languages) | Transcription, Entities, Emotion | CommonVoice | 115M | Speech-Tagger-EURO-NER |
| English | Transcription, Role, Entities, Emotion, Intent | Speech-medical-exams | 115M | Speech-Tagger-2person-medical-exams |
This directly fetches the model from Hugging Face:

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("WhissleAI/speech-tagger_en_ner_emotion")
transcriptions = asr_model.transcribe(["file.wav"])
```
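The returned hypotheses interleave transcript words with inferred tags. As a minimal post-processing sketch, here is one way to separate the two. The tag names below (`ENTITY_CITY`, `EMOTION_HAPPY`, `END`) are hypothetical examples, not the model's actual tag vocabulary; we only assume that tags are fully capitalized tokens, as in the synthetic-data tooling in this repo:

```python
# Sketch: split a tagged hypothesis into plain transcript words and tags.
# Assumption: tags are all-caps tokens (optionally with underscores).
import re

def split_tags(tagged_text):
    words, tags = [], []
    for token in tagged_text.split():
        # Treat any all-caps token of 2+ characters as an inferred tag.
        if re.fullmatch(r"[A-Z][A-Z_]+", token):
            tags.append(token)
        else:
            words.append(token)
    return " ".join(words), tags

# Hypothetical tagged output for illustration only.
text, tags = split_tags("i am flying to ENTITY_CITY paris END EMOTION_HAPPY")
# text -> "i am flying to paris"
# tags -> ["ENTITY_CITY", "END", "EMOTION_HAPPY"]
```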
```
cd scripts/nemo
python
```
Assuming you are inside the Docker container, adjust `config.yml` to point to the correct pretrained checkpoint and data manifests, then run:

```
cd /PromptingNemo/scripts/nemo/asr/
python nemo_adapter.py
```
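The fields to adjust in `config.yml` typically look like the sketch below. The key names follow common NeMo ASR training configs and the paths are placeholders; check the actual `config.yml` in this repo for the exact schema.

```yaml
# Hypothetical excerpt of config.yml -- adjust to this repo's real schema.
init_from_pretrained_model: "WhissleAI/speech-tagger_en_ner_emotion"  # pretrained checkpoint
model:
  train_ds:
    manifest_filepath: /data/manifests/train_manifest.json   # tagged training manifest
  validation_ds:
    manifest_filepath: /data/manifests/val_manifest.json
```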
```
cd /PromptingNemo/scripts/data/audio/1person/
```

This folder has scripts to process some widely used ASR and SLU datasets:

```
python process_cv.py      # CommonVoice transcription dataset
python process_libre.py   # Multilingual LibriSpeech telephonic conversations
python process_slurp.py   # SLURP, an IoT-focused spoken language understanding benchmark
```
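Each of these scripts ultimately produces a NeMo-style manifest: a JSON-lines file with one utterance per line. A minimal sketch of writing one follows; the field names are NeMo's standard `audio_filepath`/`duration`/`text`, while the paths, durations, and transcripts are placeholders:

```python
# Write a NeMo-style ASR manifest: one JSON object per line.
import json
import os
import tempfile

utterances = [
    # (audio path, duration in seconds, tagged transcript) -- placeholder values
    ("/data/audio/utt_0001.wav", 3.2, "turn on the lights EMOTION_NEUTRAL"),
    ("/data/audio/utt_0002.wav", 1.7, "hello there EMOTION_HAPPY"),
]

manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
with open(manifest_path, "w", encoding="utf-8") as f:
    for path, dur, text in utterances:
        entry = {"audio_filepath": path, "duration": dur, "text": text}
        f.write(json.dumps(entry) + "\n")
```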
```
cd PromptingNemo
python scripts/data/audio/1person/synthetic/create_synthetic_tagged_text.py
```
Change `extension_dataset` to point to a file with one sentence per line, where each sentence carries capitalized tags along with its transcription. Also provide a dataset of audio files to clone voices from, plus tagged sentence data in the same language.
```
python scripts/data/audio/1person/synthetic/create_tts_manifest_xtts.py
```
For this script, update the paths in `self.clone_voices`, point `self.all_noise_files` to your noise files, and set the required paths in the config.
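As a rough illustration of what noise augmentation does with those noise files, here is a stdlib-only sketch of mixing a noise signal into synthetic speech at a target SNR. The actual script's method and parameters may differ; the signals below are synthetic stand-ins:

```python
# Mix noise into speech at a target signal-to-noise ratio (in dB).
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech/noise power ratio equals snr_db, then add."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Placeholder signals: 1 s of a 440 Hz tone at 16 kHz, plus a square-ish noise.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [0.1 if t % 2 else -0.1 for t in range(16000)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```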
```
cd /PromptingNemo/scripts/data/audio/2person/
```
This folder has scripts to process role-based turn-taking conversations.
When an audio recording comes with role-marked transcriptions, you can annotate them and fine-tune a model on the result.
```
python 01_create_manifest_raw.py   # takes a folder of transcripts and audio files and creates a manifest
python 02_ctm2segments.py          # takes the output of the NeMo forced aligner and organizes it at segment level
python 03_annotate_turns.py        # annotates the text using LLMs
python 04_split_and_emotion.py     # splits the audio file using timestamps and gets speaker emotion
```
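Step 04's core operation, cutting one recording into per-turn clips by timestamps, can be sketched with only the stdlib `wave` module. The real script's interface (and its emotion-model calls) are beyond this sketch; the demo file here is a synthetic silent recording in a temporary directory:

```python
# Cut a WAV file into per-turn segments given (start_s, end_s) timestamps.
import os
import struct
import tempfile
import wave

def split_wav(path, segments, out_dir):
    """Write one output file per (start_s, end_s) pair; return the paths."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(segments):
            src.setpos(int(start * rate))
            frames = src.readframes(int((end - start) * rate))
            out_path = os.path.join(out_dir, f"turn_{i:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
    return out_paths

# Demo on a synthetic 2-second mono 16 kHz recording (all zeros).
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "dialog.wav")
with wave.open(src_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<" + "h" * 32000, *([0] * 32000)))

parts = split_wav(src_path, [(0.0, 0.5), (0.5, 2.0)], tmp)
```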
Alternatively, follow the `2person.ipynb` notebook.
Generate synthetic datasets to train ASR and natural language systems.
- Exporting NeMo Models: Convert your NeMo models to ONNX and Hugging Face formats for deployment.

```
python scripts/utils/nemo2hf.py
python scripts/utils/nemo2onxx.py
```
Explore various applications built with PromptingNemo.
Run the VoiceBot application using Docker.
- Build and run the Docker container:

```
docker ...
```
- Inside the Docker container:

```
cd PromptingNemo/applications/voicebot
python app.py
```
- Description: Implement call-routing functionality within the VoiceBot for efficient handling of calls.
- Tool: Use the data generator tool to create synthetic datasets for various ASR applications.
- 🐳 Docker
- 🐍 Python 3.x
- 📦 Necessary Python libraries (listed in `requirements.txt`)
- Clone the repository:

```
git clone https://github.com/WhissleAI/PromptingNemo.git
```

- Navigate to the project directory:

```
cd PromptingNemo
```

- Install dependencies:

```
pip install -r requirements.txt
```
Karan, S., Shahab, J., Yeon-Jun, K., Andrej, L., Moreno, D. A., Srinivas, B., & Benjamin, S. (2023, December). 1-step Speech Understanding and Transcription Using CTC Loss. In Proceedings of the 20th International Conference on Natural Language Processing (ICON) (pp. 370-377).
Karan, S., Mahnoosh, M., Daniel, P., Ryan, P., Srinivas, C. B., Yeon-Jun, K., & Srinivas, B. (2023, December). Combining Pre-trained Speech and Text Encoders for Continuous Spoken Language Processing. In Proceedings of the 20th International Conference on Natural Language Processing (ICON) (pp. 832-842).
We welcome contributions to PromptingNemo! Please read our contribution guidelines to get started.
This project is licensed under the MIT License - see the LICENSE file for details.
Special thanks to all contributors and the open-source community for their invaluable support.