
Conversation

jiapingW (Contributor)

Motivation

We found that for reasoning models such as the QwQ-32B and Qwen3 series, the input IDs we construct are inconsistent with the expected ones.
Using qwen3 as an example:

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."},
    ]
    res = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    # response will be:
    # <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nMy name is QwQ.<|im_end|>\n

    res = tokenizer.apply_chat_template(
        messages[:-1],
        tokenize=False,
        add_generation_prompt=True,
    ) + messages[-1]["content"] + "<|im_end|>\n"

    # response will be:
    # <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\nMy name is QwQ.<|im_end|>\n

We can see that a <think>\n\n</think>\n\n block is inserted at the start of the last assistant response. This corresponds to the case where Qwen3 runs in non-think mode, and it is what the current implementation produces, so the Eagle model ends up training on tokens that should not be involved in calculating the loss. Moreover, the currently open-source Qwen3 Eagle data, such as https://huggingface.co/datasets/Tengyunw/qwen3_8b_eagle3, was generated in thinking mode, so the training data and the inference data become misaligned.

Therefore, we introduce an additional parameter for reasoning models, is-think-mode, which determines whether think-mode prompt processing is applied. Because different models' apply_chat_template handles think mode differently, we render the conversation up to the last turn with the template and splice the final assistant response onto the prompt manually. We also modified the templates in template.py so that each model's think-mode input is handled correctly.
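
A minimal sketch of this splicing idea, assuming a Qwen3-style template whose apply_chat_template accepts an enable_thinking kwarg (the function and variable names here are illustrative, not the repository's actual implementation):

def render_for_training(tokenizer, messages, end_of_turn_token="<|im_end|>\n", is_think_mode=False):
    # Render every turn before the final assistant response, letting the chat
    # template open the assistant header for generation.
    prompt = tokenizer.apply_chat_template(
        messages[:-1],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=is_think_mode,  # Qwen3-style kwarg; other templates may differ
    )
    # Splice the final assistant response on manually, so its rendering does not
    # depend on how the template post-processes a completed assistant turn.
    return prompt + messages[-1]["content"] + end_of_turn_token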

You can use the following code at the bottom of processing.py to debug and verify the correctness of the processed assistant messages and the corresponding full conversation.

if __name__ == "__main__":
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B", trust_remote_code=True)
    messages = [[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."}
    ]]
    chat_template = TEMPLATE_REGISTRY.get("qwen3")
    max_length = 2048
    res = preprocess_conversations(tokenizer, messages, chat_template, max_length, is_think_mode=False)
    # full rendered conversation
    print("\033[31mtext:\033[0m\n", tokenizer.decode(res["input_ids"][0][0]))
    # only the tokens covered by the loss mask, i.e. the assistant responses
    print("\033[31massistant_message:\033[0m\n", tokenizer.decode(res["input_ids"][0][res["loss_mask"][0] == 1]))

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @jiapingW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an inconsistency in input ID construction for certain inference models, such as the Qwen3 series, where unwanted <think> tokens were being included, leading to misalignment between training and inference data. To resolve this, a new is-think-mode parameter has been introduced, allowing the system to correctly process chat templates and apply loss masks by adapting to the specific 'think mode' behavior of different models. This ensures that training data accurately reflects the expected input format for inference, preventing models from training on tokens that should not contribute to loss calculation.

Highlights

  • New Command-Line Argument: A new command-line argument, --is-think-mode, has been introduced to the train_eagle3_online.py script. This argument allows users to specify whether the input data requires special handling for 'think' tokens, which is crucial for models like Qwen3.
  • Enhanced Data Preprocessing: The data preprocessing logic in preprocessing.py has been updated to incorporate the is_think_mode parameter. This enables the system to correctly apply loss masks and process conversations by conditionally adjusting assistant message separators and chat template application based on whether 'think mode' is active.
  • Updated Chat Templates: The ChatTemplate class in template.py now includes an assistant_think_header field. This addition, along with new and updated model templates (e.g., for Qwen3, QwQ, Llama3, Llama4, Qwen, Deepseek), ensures that the system can accurately handle the specific 'think' token formats used by various reasoning models, aligning training and inference data (see the sketch after this list).
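
A hedged sketch of what such a template entry could look like. Only the field names mentioned in this discussion are included, and the Qwen3 values are inferred from the strings shown in the motivation rather than copied from template.py:

from dataclasses import dataclass

# Hedged sketch: field names follow this discussion; the concrete values below
# are assumptions, not the repository's actual template.py definitions.
@dataclass
class ChatTemplate:
    user_header: str
    assistant_header: str
    end_of_turn_token: str
    assistant_think_header: str  # assistant header variant used for think-mode handling

QWEN3_TEMPLATE = ChatTemplate(
    user_header="<|im_start|>user\n",
    assistant_header="<|im_start|>assistant\n",
    end_of_turn_token="<|im_end|>\n",
    # Qwen3 renders an empty think block after the assistant header in
    # non-think mode (see the motivation example); this value is assumed.
    assistant_think_header="<|im_start|>assistant\n<think>\n\n</think>\n\n",
)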

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces an is-think-mode flag to handle variations in chat template processing, particularly for models like qwen3 that have different modes for generating responses. The changes correctly plumb this new flag through the data processing pipeline and update the chat templates accordingly.

My review has identified a critical issue in the conversation preprocessing logic that could break parsing for some existing chat templates. I've also found a minor error in an example code snippet. Please address these points.

Also, there's a small typo in the pull request title ('suppoerted' -> 'supported').

Comment on lines 74 to 77
user_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
)
assistant_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
    f"{chat_template.end_of_turn_token}"
)
gemini-code-assist (bot)

critical

This change to user_message_separator introduces a regression. By setting it to just the end_of_turn_token, it breaks parsing for two cases:

  1. Templates with an empty end_of_turn_token (e.g., deepseek). The regex delimiter becomes empty, causing it to match incorrectly.
  2. Templates where end_of_turn_token is a newline (\n) and assistant responses can contain newlines. The regex will prematurely stop at the first newline in the assistant's response.

The previous implementation was more robust. Please revert this part of the change to restore the correct behavior.

Suggested change

-user_message_separator = (
-    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
-)
-assistant_message_separator = (
-    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
-    f"{chat_template.end_of_turn_token}"
-)
+user_message_separator = (
+    f"{chat_template.end_of_turn_token}{chat_template.user_header}"
+)
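
A minimal illustration of the two failure modes described in this comment, assuming the separators end up as regex delimiters when locating assistant spans (the strings below are invented for the demo and are not the repository's parsing code):

import re

# Failure mode 1: an empty end_of_turn_token yields an empty delimiter, and
# re.split then matches between every character.
print(re.split("", "abc"))  # ['', 'a', 'b', 'c', '']

# Failure mode 2: if end_of_turn_token is "\n" and the assistant response
# contains newlines, the span is cut off at the first newline.
assistant_turn = "Line one of the answer.\nLine two of the answer.<|im_end|>"
print(re.split("\n", assistant_turn)[0])  # 'Line one of the answer.'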

@jiapingW changed the title from "suppoerted think mode" to "supported think mode" on Aug 28, 2025
@jiapingW (Author)

Not making these changes does not break model training, but it causes the training results to deviate from expectations. For example, the <think> token never appears in the QwQ-32B training data.

Comment on lines +82 to +85
assistant_message_separator = (
    f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
)
Collaborator

    assistant_message_separator = (
        f"{chat_template.end_of_turn_token}{chat_template.assistant_header}"
        f"{chat_template.end_of_turn_token}"
    )

Any reason for removing f"{chat_template.end_of_turn_token}"?

jiapingW (Author)

I adopted the suggested code from Gemini Code Assist and added back the chat_template.end_of_turn_token, restoring the original assistant_message_separator.

# if the first message is not from user, skip it
source = source[1:]

assert len(source) % 2 == 0, "conversation have question without answer"
Collaborator

if len(source) % 2 != 0:
    continue  # raise ValueError("odd number of turns: question without answer")

jiapingW (Author)

Thanks, I think your modifications are more robust.

  assert len(source) % 2 == 0, "conversation have question without answer"
  convroles = ["user", "assistant"]
- for j, sentence in enumerate(source):
+ for j, sentence in enumerate(source[:-1]):
Collaborator

Why is the last element skipped?

jiapingW (Author)

I want the last round of dialogue to be formatted with the correct conversation template, rather than rendering the entire conversation directly during construction. Otherwise, the same issue described in the motivation occurs, producing input IDs that are inconsistent with expectations. Therefore, I concatenate only the last round of dialogue manually to ensure that the think format of the final round is correct.

Comment on lines 174 to 177
        add_generation_prompt=True,
        enable_thinking=is_think_mode,
    )
    conversation += source[-1]["content"] + chat_template.end_of_turn_token
Collaborator

The last role should be assistant.

jiapingW (Author)

We hope that the final round of dialogue is not constructed directly using apply_chat_template, but rather by concatenating the user and assistant messages. This is because, in think mode, the logic apply_chat_template uses to process a complete conversation is inconsistent with the output it produces when applied directly to only the user message.

@jiapingW requested a review from shuaills on September 8, 2025 02:10
@jiapingW (Author)

Can anyone help review this? I think it is important for training think-mode models such as Qwen3-8B.

@shuaills (Collaborator) commented Sep 15, 2025

Can you resolve the conflicts and fix the lints? Thanks
Also some tests failed.

@jiapingW (Author)

> Can you resolve the conflicts and fix the lints? Thanks. Also some tests failed.

Wait a few minutes; I will fix these.

@jiapingW (Author)

I have resolved the conflicts and fixed the lints.

@shuaills (Collaborator)

Can you rebase it?

@jiapingW force-pushed the features/supported_think_mode branch from 306d3c1 to bde431c on September 16, 2025 00:53
@jiapingW (Author)

> Can you rebase it?

ok.

@jiapingW force-pushed the features/supported_think_mode branch 2 times, most recently from 7cf8a0c to 09ffac3 on September 16, 2025 02:24
@jiapingW (Author) commented Sep 16, 2025

> Can you rebase it?

I have handled it.

@jiapingW closed this Sep 16, 2025
@jiapingW reopened this Sep 16, 2025
@shuaills force-pushed the features/supported_think_mode branch from 9e814dc to 53edcf6 on September 21, 2025 06:50
@shuaills (Collaborator)

Hi, I think this is ready to merge. Can you check the lint and CI, please?

@jiapingW (Author) commented Sep 21, 2025

> Hi, I think this is ready to merge. Can you check the lint and CI, please?

Thanks, I have fixed the lint issues, and the tests are queued.

@zyksir (Collaborator) commented Sep 30, 2025

@jiapingW Hi, I have similar changes in #239.
I use conversation = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False, **kwargs) so that the same code works for both Qwen and GPT, since the kwargs for apply_chat_template differ between thinking models. If we only use is_think_mode, we would need to change the code again for GPT-OSS (and all other thinking models).
For #239, we only need to change the data processing and add is_think_mode: True in the data.jsonl. I see that you are using that branch as well. Please let me know if that does not work for you.
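
A hedged sketch of that kwargs-driven approach (the exact field names and plumbing in #239 may differ; mapping is_think_mode to enable_thinking is an assumption about Qwen3-style templates):

def render_sample(tokenizer, sample: dict) -> str:
    # Per-sample flag read from data.jsonl; the field name follows the comment above.
    kwargs = {}
    if "is_think_mode" in sample:
        # Qwen3-style kwarg; other thinking models may need different kwargs here.
        kwargs["enable_thinking"] = sample["is_think_mode"]
    return tokenizer.apply_chat_template(
        sample["messages"],
        tokenize=False,
        add_generation_prompt=False,
        **kwargs,
    )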

@jiapingW (Author)

I understand that your design supports more models in think mode. I used https://github.dev/sgl-project/SpecForge/tree/feature/add_sgl_online to train qwen2.5-7b. I've generated data for the qwen3 series, but haven't used your version for training yet. However, I believe your implementation may have the following issues:

  1. Due to chat template misalignment, your training tokens may have issues, such as training on tokens that should not be involved in training, e.g. <think>\n</think>\n
  2. As shown in the example above, qwen3's chat_template differs in its logic for handling multiple complete conversations and generating responses in think and non-think modes.

Simply adding parameters would likely cause issues, so my changes above separate the processing of the final round of conversation from the previous ones.

@jiapingW (Author)

You can use the code below to test:

if __name__ == "__main__":
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen3-30B-A3B", trust_remote_code=True)
    messages = [[
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am a model."},
        {"role": "user", "content": "What is your name?"},
        {"role": "assistant", "content": "My name is QwQ."},
        {"role": "user", "content": "What is 1+1"},
        {"role": "assistant", "content": "=2."}
    ]]
    # chat_template = TEMPLATE_REGISTRY.get('qwen')
    max_length = 2048

    conversation = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=True,
    )

    print(conversation)

You will get the response:

<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\nI am a model.<|im_end|>\n<|im_start|>user\nWhat is your name?<|im_end|>\n<|im_start|>assistant\nMy name is QwQ.<|im_end|>\n<|im_start|>user\nWhat is 1+1<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n=2.<|im_end|>\n

I think the last turn is not in thinking format, which violates the expected answer.

@jiapingW mentioned this pull request on Oct 13, 2025