[QUESTION] Finetune Parakeet TDT-CTC: using an extended pretrained tokenizer and weights restoration including decoder and joint networks #14728
Replies: 2 comments
Just posting an update. After changing the vocabulary to our merged bilingual tokeniser and restoring all monolingual weights, the best way to stabilize the weights of the extended decoder embedding / joint-layer embedding and bias is simply to freeze the encoder during the warm-up phase. After warm-up, unfreezing the encoder enables rapid convergence on the bilingual training corpus.
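For anyone following along, the schedule is essentially just this; the checkpoint name is a placeholder and `WARMUP` stands for however the warm-up portion of training is run:

```python
import nemo.collections.asr as nemo_asr

# Placeholder path: the checkpoint with the merged tokenizer and restored
# monolingual weights.
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "parakeet_merged_vocab.nemo"
)

# Warm-up phase: freeze the encoder so only the decoder, joint and CTC head
# adapt to the extended vocabulary.
asr_model.encoder.freeze()

# ... run the warm-up portion of training (e.g. trainer.fit for N steps) ...

# After warm-up: unfreeze the encoder and continue training on the
# bilingual corpus.
asr_model.encoder.unfreeze()
```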
Another note: applying the same finetuning recipe (i.e. changing the vocabulary to our merged bilingual tokeniser and restoring all monolingual weights, including the decoder and joint) to the larger parakeet-tdt-0.6b-v2 does not require freezing the encoder during the warm-up phase. We start to see code-switching abilities even after 5K training steps, which is much more efficient than reinitializing the decoder and joint. However, it is still the case that training longer than 5K steps on our bilingual dataset causes the recognition of numbers and acronyms to degrade. I suspect what is going on is that, through Parakeet's extensive encoder training (i.e. initialised from a wav2vec SSL checkpoint, then pretraining, then stage-2 finetuning), it has learned weights for handling the speech patterns present in numbers and acronyms (typically fast speech spoken as one word). My bilingual training corpus only contains a small amount of digits, so finetuning iteratively unlearns those specific speech patterns. We are currently curating some Parakeet data with an emphasis on numbers to add to the data mixture to balance our bilingual corpus. I was wondering if the NeMo team also applied any data augmentation strategies to the Parakeet dataset, for example speed perturbation?
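For context, this is roughly how speed perturbation would be enabled on our side; the augmentor keys follow NeMo's audio-augmentation convention as we understand it, and the values shown are placeholders, not tuned settings:

```python
from omegaconf import OmegaConf

# Hypothetical augmentor block to attach to model.train_ds; key names follow
# NeMo's audio augmentor ("speed" perturbation), values are placeholders.
augmentor_cfg = OmegaConf.create(
    {
        "speed": {
            "prob": 0.5,                    # apply to ~50% of utterances
            "sr": 16000,                    # sample rate of the corpus
            "resample_type": "kaiser_best",
            "min_speed_rate": 0.9,
            "max_speed_rate": 1.1,
        }
    }
)

# cfg is the full training config loaded from YAML; attach the augmentor to
# the training dataset section before building the model/dataloader.
# cfg.model.train_ds.augmentor = augmentor_cfg
```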
Hi team,
Thanks for the release of the Parakeet series of models. They are particularly interesting given their vast improvement on spoken numbers, which is a problem we've been working on for a while now.
I want to pick someone's brain on an experimental approach I've been trying, but first let me give you some background.
Background information
We have successfully trained from scratch our own fast-conformer-hybrid-rnnt-ctc (Large) models using a unified bilingual tokeniser with ~5700 hours of bilingual data, which has led to a fully fledged code-switching model, including punctuation and capitalisation, with SOTA performance on our internal benchmarks for the target language and code-switching, and comparable performance on English. However, our Achilles heel has always been performance on English proper nouns, numbers, and acronyms, which is a well-known problem even for industrial-scale models.
Parakeet, on the other hand, has SOTA performance on English, including improved performance on spoken numbers.
Goal
The goal would be to finetune Parakeet on our bilingual corpus, but retain Parakeet's performance on numbers, acronyms and proper nouns.
Existing finetuning methods
I've read some existing documentation on finetuning monolingual ASR models and adapting them for a new language.
In all cases, changing the vocabulary retains only the preprocessor and encoder, and reinitializes the decoder, joint and CTC decoder, which is fine if the goal is to finetune for a new monolingual language.
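For reference, the documented vocabulary-swap path looks roughly like this; the tokenizer directory is a placeholder:

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained hybrid TDT-CTC checkpoint (or restore a local .nemo file).
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

# Swap in the new tokenizer. This keeps the preprocessor and encoder, but
# reinitializes the decoder, joint and CTC decoder for the new vocabulary.
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizers/merged_bilingual",  # placeholder path
    new_tokenizer_type="bpe",
)
```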
Furthermore, I have also read some papers where they change the vocabulary to a brand new unified bilingual tokenizer and finetune on a bilingual corpus (https://aclanthology.org/2025.calcs-1.3.pdf). The authors' results show a degradation relative to the pretrained English model.
Experimental approach:
The approach I am testing is pretty simple: extend the pretrained tokenizer with new bilingual tokens while preserving the original English token IDs, restore all pretrained weights including the decoder and joint, and then finetune on the bilingual corpus.
This is a common approach taken when extending LLMs to a new language.
Validating the new tokeniser and weight restoration:
Below is the average number of tokens per word for English, the target language and code-switching. The merged tokeniser in this case retains the same English vocabulary and token IDs, but includes 1024 new bilingual tokens. As you can see, we have closely been able to maintain tokenizer performance on English, while improving tokenization on the target language and code-switching and preserving the original token IDs of the pretrained model.
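For what it's worth, a minimal sketch of how such a merged tokenizer can be built, assuming both tokenizers are SentencePiece models (file names are placeholders):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the pretrained (English) tokenizer and the new target-language tokenizer.
base = sp_pb2.ModelProto()
with open("parakeet_en_tokenizer.model", "rb") as f:        # placeholder path
    base.ParseFromString(f.read())

extra = sp_pb2.ModelProto()
with open("target_lang_tokenizer.model", "rb") as f:        # placeholder path
    extra.ParseFromString(f.read())

# Append only pieces the English tokenizer does not already contain, so all
# original token IDs are preserved and new tokens get IDs at the end.
existing = {p.piece for p in base.pieces}
for p in extra.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        base.pieces.append(new_piece)

with open("merged_bilingual_tokenizer.model", "wb") as f:   # placeholder path
    f.write(base.SerializeToString())

# Sanity check: the merged model should tokenize English the same way as before.
sp = spm.SentencePieceProcessor(model_file="merged_bilingual_tokenizer.model")
print(sp.encode("thirty three", out_type=str))
```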
Weight restoration including the decoder and joint weights:
Using the above merged tokenizer of size 2048, we have updated Parakeet's vocabulary, which preserves the preprocessor and encoder. We have then very carefully restored the decoder, joint and CTC decoder weights, with special attention to the blank ID and TDT duration tokens. Any weights and biases associated with the new vocab were left at their default randomly initialized, close-to-zero values.
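Roughly, the restoration looks like the sketch below. The state-dict keys and the assumed row layout (shared vocab IDs first, blank at index `V_OLD` / `V_NEW`, TDT duration logits after the blank) are illustrative and may differ between NeMo versions, so verify them against your own checkpoint:

```python
import copy
import torch

# asr_model: the pretrained monolingual Parakeet checkpoint already loaded.
old_sd = copy.deepcopy(asr_model.state_dict())

# Swap in the merged tokenizer; this reinitializes decoder, joint and CTC head.
asr_model.change_vocabulary(new_tokenizer_dir="tokenizers/merged_bilingual",
                            new_tokenizer_type="bpe")
new_sd = asr_model.state_dict()

V_OLD, V_NEW = 1024, 2048          # placeholder vocab sizes
NUM_TDT_DURATIONS = 5              # assumption: number of TDT duration outputs

def restore_rows(new_p, old_p, tail_rows):
    """Copy the shared vocab rows, then remap the tail rows (blank and, for the
    joint, the TDT duration logits) from the old tail to the new tail. Rows for
    the new vocab entries keep their random initialization."""
    with torch.no_grad():
        new_p[:V_OLD] = old_p[:V_OLD]
        if tail_rows:
            new_p[V_NEW:V_NEW + tail_rows] = old_p[V_OLD:V_OLD + tail_rows]

# Decoder embedding: rows are [vocab..., blank]
restore_rows(new_sd["decoder.prediction.embed.weight"],
             old_sd["decoder.prediction.embed.weight"], tail_rows=1)

# Joint output projection: rows are [vocab..., blank, TDT durations...]
for k in ("joint.joint_net.2.weight", "joint.joint_net.2.bias"):
    restore_rows(new_sd[k], old_sd[k], tail_rows=1 + NUM_TDT_DURATIONS)

# CTC head (Conv1d): output channels are [vocab..., blank]
for k in ("ctc_decoder.decoder_layers.0.weight", "ctc_decoder.decoder_layers.0.bias"):
    restore_rows(new_sd[k], old_sd[k], tail_rows=1)

asr_model.load_state_dict(new_sd)
```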
When running inference with the new checkpoint (extended vocab and restored weights), we can validate that the model retains the same transcription accuracy as the pretrained monolingual checkpoint.
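A quick sanity check, assuming a second, untouched copy of the monolingual checkpoint loaded as `original_model` and a few held-out English files (paths are placeholders):

```python
# Compare transcripts from the original and the extended-vocab checkpoint on a
# few English utterances; they should match (or be near-identical).
audio_files = ["samples/en_001.wav", "samples/en_002.wav"]   # placeholder paths

baseline = original_model.transcribe(audio_files)
restored = asr_model.transcribe(audio_files)
for b, r in zip(baseline, restored):
    print("baseline:", b)
    print("restored:", r)
```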
Finetuning
Initial Checkpoint
Dataset
Training strategy
Below is a screenshot of the early steps of the current experiment. As you can see, the ctc_decoder has adapted quite well to the new vocabulary.
The decoder_joint, however, suffers quite an abrupt destabilization as it adapts to the new vocabulary. Validation WER does start to converge, although it looks like it might plateau higher than I expected; it is still early in training, though, so we will see.
Discussion
Going back to the goal of the experiment: it is to adapt Parakeet to a new language while preserving performance on English, with special attention to numbers, acronyms and proper nouns.
I've tested one of the early checkpoints and changed the decoding strategy to use the CTC head, and I can see that the model handles code-switching pretty well. I do notice, however, that numbers and acronyms still suffer from the same issue as our proprietary bilingual model.
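For reference, switching the hybrid checkpoint to CTC decoding is roughly this (argument names may vary between NeMo versions, and the audio path is a placeholder):

```python
# Decode with the CTC head instead of the TDT decoder/joint.
asr_model.change_decoding_strategy(decoder_type="ctc")

print(asr_model.transcribe(["samples/codeswitch_001.wav"]))  # placeholder path
```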
Numbers and Acronyms
Both numbers and acronyms suffer from the same issue:
Delayed token recognition
Below is a screenshot comparing our model and Parakeet, showing each encoder output frame and the decoding frames that come out of it under greedy decoding; the green squares are the outputted non-blank tokens, for the audio/transcript "AI and ASR":
Compared to Parakeet, our encoder sees the correct tokens much later in the frame sequence and with weaker initial probabilities.
It looks like primarily an encoder problem: even when Parakeet has low initial confidence (0.2%), its decoder successfully amplifies the signal to 56%, while our model's decoder can't compensate for the encoder's delayed/weak token detection.
We see the same issue with numbers: "Thirty three" only gets transcribed as "three".
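For anyone who wants to reproduce this kind of frame-level inspection, here is a minimal sketch. It assumes you have already extracted per-frame CTC log-probabilities as a `[T, V+1]` tensor with the blank at the last index; how you obtain them depends on the model and NeMo version:

```python
import torch

def print_frame_emissions(log_probs: torch.Tensor, id2token, blank_id: int):
    """Print the greedy (argmax) token and its probability for every frame that
    emits a non-blank token, to visualise how late tokens appear."""
    probs = log_probs.softmax(dim=-1)          # [T, V+1]
    best_prob, best_id = probs.max(dim=-1)     # per-frame argmax
    for t, (p, i) in enumerate(zip(best_prob.tolist(), best_id.tolist())):
        if i != blank_id:
            print(f"frame {t:4d}: {id2token[i]!r}  p={p:.2%}")

# Example usage, assuming `log_probs` from the CTC head and a list `vocab` of
# token strings with the blank appended at the end:
# print_frame_emissions(log_probs, vocab, blank_id=len(vocab) - 1)
```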
Next Steps:
I'm considering trying different learning rates for the encoder, decoder, joint and CTC modules; I see this functionality was added and can be set by editing the YAML config.
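If the config route doesn't pan out, the same effect can be sketched with plain PyTorch parameter groups; the learning rates below are placeholders, not tuned values, and in NeMo this would normally be wired through the model's optimizer setup rather than built by hand:

```python
import torch

# Per-module parameter groups: smaller LR for the already well-trained encoder,
# larger LRs for the parts adapting to the new vocabulary.
optimizer = torch.optim.AdamW(
    [
        {"params": asr_model.encoder.parameters(),     "lr": 1e-5},
        {"params": asr_model.decoder.parameters(),     "lr": 1e-4},
        {"params": asr_model.joint.parameters(),       "lr": 1e-4},
        {"params": asr_model.ctc_decoder.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-3,
)
```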
We are also in the process of curating the original dataset used, with a focus on numbers, acronyms and proper nouns, and including it in the finetuning data mixture to prevent forgetting.
Another thought is better initialization of the new tokens in the decoder embedding and the joint_net.2 weights and biases. Currently they are small, random, near-zero values. A simple approach would be the mean of the English token rows plus some random noise; even better would be some kind of weighted initialization based on phonetic or semantic similarity. A sketch of the simple mean-plus-noise variant is below.
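A minimal sketch of the mean-plus-noise variant; the attribute paths (`decoder.prediction["embed"]`, `joint.joint_net[2]`) and the vocab sizes are assumptions and may differ between NeMo versions:

```python
import torch

def init_new_rows_from_mean(param: torch.Tensor, v_old: int, v_new: int,
                            noise_std: float = 0.01):
    """Initialize rows v_old..v_new-1 (the new vocab entries) with the mean of
    the pretrained English rows plus small Gaussian noise."""
    with torch.no_grad():
        mean_row = param[:v_old].mean(dim=0, keepdim=True)
        noise = noise_std * torch.randn(v_new - v_old, *param.shape[1:],
                                        dtype=param.dtype, device=param.device)
        param[v_old:v_new] = mean_row + noise

V_OLD, V_NEW = 1024, 2048   # placeholder vocab sizes

# Attribute names are illustrative; verify against your checkpoint.
init_new_rows_from_mean(asr_model.decoder.prediction["embed"].weight, V_OLD, V_NEW)
init_new_rows_from_mean(asr_model.joint.joint_net[2].weight, V_OLD, V_NEW)
init_new_rows_from_mean(asr_model.joint.joint_net[2].bias, V_OLD, V_NEW)
```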
I would be interested to hear your thoughts on this approach.