Some questions about the parameter --chunked-prefill-size #2815

yuki252111 · 2025-01-09T12:00:42Z

    def budget_state(self):
        if self.rem_total_tokens <= 0 or self.cur_rem_tokens <= 0:
            return AddReqResult.NO_TOKEN

        if self.rem_input_tokens <= 0 or (
            self.rem_chunk_tokens is not None and self.rem_chunk_tokens <= 0
        ):
            return AddReqResult.OTHER

        return AddReqResult.CONTINUE

The code above is taken from scheduler.py. As I understand it, self.rem_chunk_tokens plays the same role as self.rem_input_tokens: both limit the total number of prefill tokens. Both are modified in the following function.

    def _prefill_one_req(
        self, prefix_len: int, extend_input_len: int, max_new_tokens: int
    ):
        self.rem_total_tokens -= extend_input_len + max_new_tokens
        self.cur_rem_tokens -= extend_input_len
        self.rem_input_tokens -= extend_input_len
        if self.rem_chunk_tokens is not None:
            self.rem_chunk_tokens -= extend_input_len

        self.log_hit_tokens += prefix_len
        self.log_input_tokens += extend_input_len

Of course, self.rem_chunk_tokens is also used to decide whether a request's prompt needs to be truncated:

        if (
            self.rem_chunk_tokens is None
            or req.extend_input_len <= self.rem_chunk_tokens
        ):
            self.can_run_list.append(req)
            self._prefill_one_req(
                0,
                req.extend_input_len,
                min(req.sampling_params.max_new_tokens, CLIP_MAX_NEW_TOKENS_ESTIMATION),
            )
        else:
            # Chunked prefill
            trunc_len = self.rem_chunk_tokens
            if trunc_len == 0:
                return AddReqResult.OTHER

            req.extend_input_len = trunc_len
            req.fill_ids = req.fill_ids[:trunc_len]
            self.can_run_list.append(req)
            self.new_being_chunked_req = req
            self._prefill_one_req(0, trunc_len, 0)

What confuses me is this: I expected self.rem_chunk_tokens to be used only for splitting a prompt, either that of the last request or that of each request. However, every request that is added decrements self.rem_chunk_tokens, and the decision whether to keep adding requests to the batch is then based on whether self.rem_chunk_tokens > 0. So I don't understand what self.rem_chunk_tokens actually does.
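To make the question concrete, here is a minimal sketch of the behaviour as I read it from the snippets above. It is not the real scheduler code: ToyAdder, add_req, budget_exhausted and schedule are hypothetical names of my own, and the total-token and KV-cache budgets are left out on purpose.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyAdder:
    """Simplified stand-in for the quoted budget logic (hypothetical, not the real class)."""

    rem_input_tokens: int
    rem_chunk_tokens: Optional[int]
    # (request name, number of prompt tokens scheduled in this batch)
    can_run_list: List[Tuple[str, int]] = field(default_factory=list)
    being_chunked: Optional[str] = None

    def budget_exhausted(self) -> bool:
        # Mirrors the second check in budget_state().
        return self.rem_input_tokens <= 0 or (
            self.rem_chunk_tokens is not None and self.rem_chunk_tokens <= 0
        )

    def _consume(self, extend_input_len: int) -> None:
        # Mirrors the two decrements in _prefill_one_req().
        self.rem_input_tokens -= extend_input_len
        if self.rem_chunk_tokens is not None:
            self.rem_chunk_tokens -= extend_input_len

    def add_req(self, name: str, extend_input_len: int) -> bool:
        """Return True if more requests may still be added to this batch."""
        if self.budget_exhausted():
            return False
        if self.rem_chunk_tokens is None or extend_input_len <= self.rem_chunk_tokens:
            # The whole prompt fits into what is left of the chunk budget.
            self.can_run_list.append((name, extend_input_len))
            self._consume(extend_input_len)
        else:
            # Only part of the prompt fits: truncate it to the remaining budget.
            trunc_len = self.rem_chunk_tokens
            self.can_run_list.append((name, trunc_len))
            self.being_chunked = name
            self._consume(trunc_len)
        return not self.budget_exhausted()


def schedule(prompt_lens, chunk_size, max_input_tokens=10**9):
    """Add requests in order until the toy budget says stop."""
    adder = ToyAdder(max_input_tokens, chunk_size)
    for i, n in enumerate(prompt_lens):
        if not adder.add_req(f"req{i}", n):
            break
    return adder


if __name__ == "__main__":
    adder = schedule([300, 300, 300], chunk_size=512)
    print(adder.can_run_list, adder.being_chunked)
```

In this toy version, three 300-token prompts with a chunk budget of 512 give a batch with all of req0, the first 212 tokens of req1, and nothing from req2. If that reading is right, self.rem_chunk_tokens is a budget shared by the whole batch, and only the request that crosses the limit gets split. Is that the intended behaviour?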

Hope to receive your feedback, thank you!

Replies: 0 comments
