[QUESTION] Finetune Parakeet TDT-CTC: using an extended pretrained tokenizer and weights restoration including decoder and joint networks #14728

lee-onidas · 2025-09-15T05:22:57Z

lee-onidas
Sep 15, 2025

Hi team,

Thanks for the release of the Parakeet series of models. They are particular interesting given their vast improvement on spoken numbers, which is a problem we've been working on for a while now.

I want to pick someone's brain on an experimental approach I've been trying, but first let me give you some background.

Background information

We have successfully trained from scratch our own fast-conformer-hybrid-rnnt-ctc (Large) models using a unified bilingual tokeniser with ~5700 hours of bilingual data, which has lead to a fully fledged code-switching model, including punctuation and capitilisation, and SOTA performance on our internal benchmarks for the target language and codeswitching, and comparable at english. However, our archilles heal has always been the performance on english proper nouns, numbers, and acronyms, due which is well known problem even for industrial scale model.

Parakeet, on the other hand, has SOTA performance on english including improved performance on spoken numbers.

Goal

The goal would be to finetune Parakeet on our bilingual corpus, but retain parakeets performance on numbers, acronyms and proper nouns.

Existing finetuning methods

I've read some existing documentation on finetuning monolingual ASR models and adapting them for a new language.

In all cases, changing the vocabulary, only includes the preprocessor and encoder, and reinitializes the decoder, joint and ctc decoder. Which is fine if the goal is to finetune for a new monolinugal language.

Furthermore I have also read some papers where they change the vocabulary to a brand new unified bilingual tokenizer, and finetune on a bilingual corpus (https://aclanthology.org/2025.calcs-1.3.pdf). The authors results show a degradation over the pretrained english model.

Experimental approach:

This approach I am testing is pretty simple:

Extend the parakeet pretrained tokenizer while preserving the original 1024 token ids. 0-1023 (English), 1024-2047 (new vocab)
Change the models vocabulary
Carefuly restore the weights of the decoder, joint, and ctc_decoder weights and biases. with special attention to blank id and tdt tokens.
Finetune on bilingual corpus.

This is a common approach taken with extending LLM models to a new language.

Validating new tokeniser and weights restoration:

Below is the average tokens/word used to tokenise a word, for English, target language and Codeswitching. The merged tokeniser in this case, retains the same english vocabulary and token IDs but includes 1024 new bilingual tokens. As you can see, we've closely been able to maintain tokenizer performance on english, but improved tokenization on the target language and codeswitching, while preserving the original token ID's of the pretrained model.

	                               parakeet_tokens_per_word	              merged_tokens_per_word
language		
Code Switch                                2.214	                                        1.889
English	                                     2.160	                                        2.281
Target language                          2.944	                                        1.764

Weight restoration including the decoder and joint weights:

Using the above merged tokenizer of size 2048, we have updated parakeet's vocabulary which preserves the pre-processor and encoder. We have then very carefully restored the weights decoder, joint and ctc decoders, with special attention to the blank ID and tdt tokens. Any weights and biases associated with the new vocab were left as the default randomly intialized close to zero values.

When running inference with new checkpoint with the extend vocab and restored weights, we can vaildate the model still retains full transcription accuracy as the pretrained monolingual checkpoint.

Finetuning

Initial Checkpoint

nvidia/parakeet-tdt_ctc-110m
Changed vocab to use new merged tokenizer
Restored weights of decoder, joint and ctc decoder

Dataset

~5700hours (including punctuation, capitilzation. only includes spoken numbers as per parakeet dataset)
60:40 english/target language
~2k hours of commonvoice, which I believe is included in the parakeet dataset (to prevent forgetting)

Training strategy

global batch size of 528
peak lr of 4.8 e-4
Total steps 250K
warm up steps 10%

Below is screenshot of the early steps of the current experiement. As you can see the ctc_decoder has adpated quite well to the new vocabulary.

The decoder_joint however, suffers quite an ubrupt destabilization as it adapts to the new vocabulary. Validation WER does start to converge, although it looks like it might plateau higher than I expected. albeit it is still early in the training, so we will see.

Discussion

Going back to the goal of the experiment - it is to adapt parakeet to a new language, while preserving performance on English with special attention to numbers, acroynms and proper nouns.

I've tested an one of the early checkpoints, and changed the decoding strategy to use the ctc. And I can see that the model handles code switching pretty well. I do notice however the numbers and acronyms still suffer from the same issue as our proprietary bilngual model.

Numbers and Acronyms

Both numbers and acronyms suffer from the same issue:

Delayed token recognition

Below is an screen shot of our model and parakeet, showing each encoder beam frame and the decoding beam frames that come out of each encoder output frame. using greedy decoding, the green squares are the outputed non-blank tokens. for the audio/transcript "AI and ASR":

Compared to parakeet, our encoder sees correct tokens much later in the beam sequence and with weaker initial probabilities, resulting in:

Weak starting probabilities for correct tokens (vs o >1% confidence)
Decoder unable to sufficiently boost weak signals to overcome blank tokens
Missing tokens entirely (only getting "I and R" instead of "AI and ASR")

It looks like primarily an encoder problem, even when parakeet has low initial confidence (0.2%), its decoder successfully amplifies signals to 56%, while our model's decoder can't compensate for the encoder's delayed/weak token detection.

We see the same issue with numbers. "Thirty three" only gets transcribe as "three".

Next Steps:

I'm considering trying different learning rates for encoder, decoder, joint, ctc modules. I see this functionality was added by editing the yaml config

model:
  optim_param_groups:
    encoder:
      lr: 1e-6
    decoder:
      lr: 4e-4
    joint:
      lr: 4e-4
    ctc:
      lr: 4e-4

We are also in the process curating the original dataset used, with a focus on numbers, acronyms, proper nouns, and including it in the finetuning dataset to prevent forgetting.

Another thought is better intialization of the new tokens, in the decoder embedding, joint_net.2.weights and biases. Currently it is small random near zero weights. Simple approach would be the mean of the english tokens plus some random noise. Or even better would be some kind of weighted intialization based on phonetic or semantic similarity.

I would be interested to hear what your thoughts on this approach are

is the merged tokenizer with fully restored weights, and random intializing of the new vocab weights a valid approach for RNNT models. Am I wasting my time trying to optimize hyperparameters here?
was there anything special done with parakeet to improve performance on numbers, outside of scaling up the datatset, and normalising digits to spoken numbers?
Why is our model seeing the correct tokens so much later in the encoder beam? possibly our encoder needs a bit of silence after a token has been spoken to be more confident, so the start tokens are really weak?

lee-onidas · 2025-09-20T00:25:57Z

lee-onidas
Sep 20, 2025
Author

Just posting an update.

So after changing the vocabulary to our merged bilingual tokeniser, and restoring all monolingual weights, the best way to stabilize the weights of the extend decoder embedding/joint_layer embedding/bias is simply to freeze the encoder during the warm up phase. After warm up, unfreezing the encoder enables rapid convergence on the bilingual training corpus.

0 replies

lee-onidas · 2025-09-26T22:28:56Z

lee-onidas
Sep 26, 2025
Author

Also another note, applying the same finetuning recipe ( i.e changing the vocabulary to our merged bilingual tokeniser, and restoring all monolingual weights including the decoder and joint) to the larger parakeet-tdt-0.6b-v2, does not require freezing of the encoder during the warm up phase. We start to see codeswitching abilities even after 5K training steps. Which is much more efficient than reinitalizing the decoder and joint.

However it is still the case that, training longer than 5K steps on our bilingual dataset, causing the recognition of numbers and acronyms to degrade.

I suspect what is going on is that through parakeets extensive encoder training (i.e initialised from a wav2vec SSL checkpoint, then pretraining, and stage 2 finetuning) it has learned the weights for handling speech patterns present in numbers and acronyms (i.e typically fast speech spoken as one word). My bilinugal training corpus only contains small amount of digits. And so finetuning iteratively unlearns those specific speech patterns.

We are currently curating some parakeet data with an empasis on numbers to add to the data mixture to balance our bilingual corpus.

I was wondering if the NeMo team also applied an data augmentation strategies to the parakeet dataset? For example speed pertubation?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[QUESTION] Finetune Parakeet TDT-CTC: using an extended pretrained tokenizer and weights restoration including decoder and joint networks #14728

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

[QUESTION] Finetune Parakeet TDT-CTC: using an extended pretrained tokenizer and weights restoration including decoder and joint networks #14728

Uh oh!

Uh oh!

lee-onidas Sep 15, 2025

Background information

Goal

Existing finetuning methods

Experimental approach:

Validating new tokeniser and weights restoration:

Weight restoration including the decoder and joint weights:

Finetuning

Initial Checkpoint

Dataset

Training strategy

Discussion

Numbers and Acronyms

Delayed token recognition

Next Steps:

Replies: 2 comments

Uh oh!

lee-onidas Sep 20, 2025 Author

Uh oh!

Uh oh!

lee-onidas Sep 26, 2025 Author

lee-onidas
Sep 15, 2025

lee-onidas
Sep 20, 2025
Author

lee-onidas
Sep 26, 2025
Author