
Conversation

jiapingW (Contributor)

Motivation

We found that for reasoning models such as the QwQ-32B and Qwen3 series, the input IDs we construct are inconsistent with the expected ones.
Using qwen3 as an example:

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."},
    ]
    res = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    # response will be:
    # <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nMy name is QwQ.<|im_end|>\n

    res = tokenizer.apply_chat_template(
        messages[:-1],
        tokenize=False,
        add_generation_prompt=True,
    ) + messages[-1]["content"] + "<|im_end|>\n"

    # response will be:
    # <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\nMy name is QwQ.<|im_end|>\n

We can see that a <think>\n\n</think>\n\n block is inserted at the start of the last assistant response. This corresponds to the case where Qwen3 runs in non-think mode, and it is what the current implementation produces, so the Eagle model ends up training on tokens that should not be involved in calculating the loss. Moreover, the currently open-source Qwen3 Eagle data, such as https://huggingface.co/datasets/Tengyunw/qwen3_8b_eagle3, was generated in thinking mode, so the training data and the inference data become misaligned.

Therefore, we introduce an additional parameter for reasoning models, is-think-mode, which determines whether think-mode prompt processing is applied. Because different models' apply_chat_template handles think mode differently, we render the conversation up to the last turn with the template and splice the final assistant response onto the prompt manually. We also modified the templates in template.py so that each model's think-mode input is handled correctly.
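
A minimal sketch of this splicing idea, assuming a Qwen3-style template whose apply_chat_template accepts an enable_thinking kwarg (the function and variable names here are illustrative, not the repository's actual implementation):

def render_for_training(tokenizer, messages, end_of_turn_token="<|im_end|>\n", is_think_mode=False):
    # Render every turn before the final assistant response, letting the chat
    # template open the assistant header for generation.
    prompt = tokenizer.apply_chat_template(
        messages[:-1],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=is_think_mode,  # Qwen3-style kwarg; other templates may differ
    )
    # Splice the final assistant response on manually, so its rendering does not
    # depend on how the template post-processes a completed assistant turn.
    return prompt + messages[-1]["content"] + end_of_turn_token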

You can use the following code at the bottom of processing.py to debug and verify the correctness of the processed assistant messages and the corresponding full conversation.

if __name__ == "__main__":
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B", trust_remote_code=True)
    messages = [[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."}
    ]]
    chat_template = TEMPLATE_REGISTRY.get("qwen3")
    max_length = 2048
    res = preprocess_conversations(tokenizer, messages, chat_template, max_length, is_think_mode=False)
    # full rendered conversation
    print("\033[31mtext:\033[0m\n", tokenizer.decode(res["input_ids"][0][0]))
    # only the tokens covered by the loss mask, i.e. the assistant responses
    print("\033[31massistant_message:\033[0m\n", tokenizer.decode(res["input_ids"][0][res["loss_mask"][0] == 1]))

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @jiapingW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an inconsistency in input ID construction for certain inference models, such as the Qwen3 series, where unwanted <think> tokens were being included, leading to misalignment between training and inference data. To resolve this, a new is-think-mode parameter has been introduced, allowing the system to correctly process chat templates and apply loss masks by adapting to the specific 'think mode' behavior of different models. This ensures that training data accurately reflects the expected input format for inference, preventing models from training on tokens that should not contribute to loss calculation.

Highlights

  • New Command-Line Argument: A new command-line argument, --is-think-mode, has been introduced to the train_eagle3_online.py script. This argument allows users to specify whether the input data requires special handling for 'think' tokens, which is crucial for models like Qwen3.
  • Enhanced Data Preprocessing: The data preprocessing logic in preprocessing.py has been updated to incorporate the is_think_mode parameter. This enables the system to correctly apply loss masks and process conversations by conditionally adjusting assistant message separators and chat template application based on whether 'think mode' is active.
  • Updated Chat Templates: The ChatTemplate class in template.py now includes an assistant_think_header field. This addition, along with new and updated model templates (e.g., for Qwen3, QwQ, Llama3, Llama4, Qwen, Deepseek), ensures that the system can accurately handle the specific 'think' token formats used by various reasoning models, aligning training and inference data (see the sketch after this list).
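
A hedged sketch of what such a template entry could look like. Only the field names mentioned in this discussion are included, and the Qwen3 values are inferred from the strings shown in the motivation rather than copied from template.py:

from dataclasses import dataclass

# Hedged sketch: field names follow this discussion; the concrete values below
# are assumptions, not the repository's actual template.py definitions.
@dataclass
class ChatTemplate:
    user_header: str
    assistant_header: str
    end_of_turn_token: str
    assistant_think_header: str  # assistant header variant used for think-mode handling

QWEN3_TEMPLATE = ChatTemplate(
    user_header="<|im_start|>user\n",
    assistant_header="<|im_start|>assistant\n",
    end_of_turn_token="<|im_end|>\n",
    # Qwen3 renders an empty think block after the assistant header in
    # non-think mode (see the motivation example); this value is assumed.
    assistant_think_header="<|im_start|>assistant\n<think>\n\n</think>\n\n",
)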

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces an is-think-mode flag to handle variations in chat template processing, particularly for models like qwen3 that have different modes for generating responses. The changes correctly plumb this new flag through the data processing pipeline and update the chat templates accordingly.

My review has identified a critical issue in the conversation preprocessing logic that could break parsing for some existing chat templates. I've also found a minor error in an example code snippet. Please address these points.

Also, there's a small typo in the pull request title ('suppoerted' -> 'supported').

Comment on lines 74 to 77
user_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
)
assistant_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
    f"{chat_template.end_of_turn_token}"
)
gemini-code-assist (bot)

critical

This change to user_message_separator introduces a regression. By setting it to just the end_of_turn_token, it breaks parsing for two cases:

  1. Templates with an empty end_of_turn_token (e.g., deepseek). The regex delimiter becomes empty, causing it to match incorrectly.
  2. Templates where end_of_turn_token is a newline (\n) and assistant responses can contain newlines. The regex will prematurely stop at the first newline in the assistant's response.

The previous implementation was more robust. Please revert this part of the change to restore the correct behavior.

Suggested change

-user_message_separator = (
-    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
-)
-assistant_message_separator = (
-    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
-    f"{chat_template.end_of_turn_token}"
-)
+user_message_separator = (
+    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
+)
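
A minimal illustration of the two failure modes described in this comment, assuming the separators end up as regex delimiters when locating assistant spans (the strings below are invented for the demo and are not the repository's parsing code):

import re

# Failure mode 1: an empty end_of_turn_token yields an empty delimiter, and
# re.split then matches between every character.
print(re.split("", "abc"))  # ['', 'a', 'b', 'c', '']

# Failure mode 2: if end_of_turn_token is "\n" and the assistant response
# contains newlines, the span is cut off at the first newline.
assistant_turn = "Line one of the answer.\nLine two of the answer.<|im_end|>"
print(re.split("\n", assistant_turn)[0])  # 'Line one of the answer.'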

@jiapingW changed the title from "suppoerted think mode" to "supported think mode" on Aug 28, 2025
@jiapingW (Author)

Not making these changes does not break model training, but it causes the training results to deviate from expectations. For example, the <think> token never appears in the QwQ-32B training data.

Comment on lines +82 to +85
assistant_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
)
Collaborator

    assistant_message_separator = (
        f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
        f"{chat_template.end_of_turn_token}"
    )

Any reason for removing f"{chat_template.end_of_turn_token}"?

jiapingW (Author)

I adopted the suggested code from Gemini Code Assist and added back the chat_template.end_of_turn_token, restoring the original assistant_message_separator.

# if the first message is not from user, skip it
source = source[1:]

assert len(source) % 2 == 0, "conversation have question without answer"
Collaborator

if len(source) % 2 != 0:
    continue  # raise ValueError("odd number of turns: question without answer")

jiapingW (Author)

Thanks, I think your modifications are more robust.

  assert len(source) % 2 == 0, "conversation have question without answer"
  convroles = ["user", "assistant"]
- for j, sentence in enumerate(source):
+ for j, sentence in enumerate(source[:-1]):
Collaborator

Why is the last element skipped?

jiapingW (Author)

I want the last round of dialogue to be formatted with the correct conversation template, rather than rendering the entire conversation directly during construction. Otherwise, the same issue described in the motivation occurs, producing input IDs that are inconsistent with expectations. Therefore, I concatenate only the last round of dialogue manually to ensure that the think format of the final round is correct.

Comment on lines 174 to 177
        add_generation_prompt=True,
        enable_thinking=is_think_mode,
    )
    conversation += source[-1]["content"] + chat_template.end_of_turn_token
Collaborator

The last role should be assistant.

jiapingW (Author)

We hope that the final round of dialogue is not constructed directly using apply_chat_template, but rather by concatenating the user and assistant messages. This is because, in think mode, the logic apply_chat_template uses to process a complete conversation is inconsistent with the output it produces when applied directly to only the user message.

@jiapingW requested a review from shuaills on September 8, 2025 02:10
@jiapingW (Author)

Can anyone help review this? I think it is important for training think-mode models such as Qwen3-8B.

@shuaills (Collaborator) commented Sep 15, 2025

Can you resolve the conflicts and fix the lints? Thanks
Also some tests failed.

@jiapingW (Author)

> Can you resolve the conflicts and fix the lints? Thanks. Also some tests failed.

Wait a few minutes; I will fix these.

@jiapingW (Author)

I have resolved the conflicts and fixed the lints.

@shuaills (Collaborator)

Can you rebase it?

@jiapingW force-pushed the features/supported_think_mode branch from 306d3c1 to bde431c on September 16, 2025 00:53
@jiapingW (Author)

> Can you rebase it?

ok.

@jiapingW force-pushed the features/supported_think_mode branch 2 times, most recently from 7cf8a0c to 09ffac3 on September 16, 2025 02:24
@jiapingW (Author) commented Sep 16, 2025

> Can you rebase it?

I have handled it.

@jiapingW closed this Sep 16, 2025
@jiapingW reopened this Sep 16, 2025
@shuaills force-pushed the features/supported_think_mode branch from 9e814dc to 53edcf6 on September 21, 2025 06:50
@shuaills (Collaborator)

Hi, I think this is ready to merge. Can you check the lint and CI, please?

@jiapingW (Author) commented Sep 21, 2025

> Hi, I think this is ready to merge. Can you check the lint and CI, please?

Thanks, I have fixed the lint issues, and the tests are queued.

@zyksir (Collaborator) commented Sep 30, 2025

@jiapingW Hi, I have similar changes in #239.
I use conversation = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False, **kwargs) so that the same code works for both Qwen and GPT, since the kwargs for apply_chat_template differ between thinking models. If we only use is_think_mode, we would need to change the code again for GPT-OSS (and all other thinking models).
For #239, we only need to change the data processing and add is_think_mode: True in the data.jsonl. I see that you are using that branch as well. Please let me know if that does not work for you.
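
A hedged sketch of that kwargs-driven approach (the exact field names and plumbing in #239 may differ; mapping is_think_mode to enable_thinking is an assumption about Qwen3-style templates):

def render_sample(tokenizer, sample: dict) -> str:
    # Per-sample flag read from data.jsonl; the field name follows the comment above.
    kwargs = {}
    if "is_think_mode" in sample:
        # Qwen3-style kwarg; other thinking models may need different kwargs here.
        kwargs["enable_thinking"] = sample["is_think_mode"]
    return tokenizer.apply_chat_template(
        sample["messages"],
        tokenize=False,
        add_generation_prompt=False,
        **kwargs,
    )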

@jiapingW (Author)

I understand that your design supports more models in think mode. I used https://github.dev/sgl-project/SpecForge/tree/feature/add_sgl_online to train qwen2.5-7b. I've generated data for the qwen3 series, but haven't used your version for training yet. However, I believe your implementation may have the following issues:

  1. Due to chat template misalignment, your training tokens may have issues, such as training on tokens that should not be involved in training, e.g. <think>\n</think>\n
  2. As shown in the example above, qwen3's chat_template differs in its logic for handling multiple complete conversations and generating responses in think and non-think modes.

Simply adding parameters would likely cause issues, so my changes above separate the processing of the final round of conversation from the previous ones.

@jiapingW (Author)

You can use the code below to test:

if __name__ == "__main__":
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen3-30B-A3B", trust_remote_code=True)
    messages = [[
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."},
        {"role": "user", "content": "What is 1+1"},
        {"role": "assistant", "content": "=2."}
    ]]
    # chat_template = TEMPLATE_REGISTRY.get('qwen')
    max_length = 2048

    conversation = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=True,
    )

    print(conversation)

You will get the response:

<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\nMy name is QwQ.<|im_end|>\n<|im_start|>user\nWhat is 1+1<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n=2.<|im_end|>\n

I think the last turn is not in thinking format, which violates the expected answer.

@jiapingW mentioned this pull request on Oct 13, 2025