Skip to content

CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ryenus
Copy link
Contributor

@ryenus ryenus commented Jun 22, 2025

In earlier implementation of the command screen, the process command was treated as a sequence of single-byte characters. When there are wide chars, especially UTF-8 byte sequences in the command, it's possible to have part of the multi-byte sequence wrapped to next line.

To fix that we now leverage mbrtowc and wcwidth to know the byte count and character width for non-ASCII characters, so that we can now wrap near window edge and more importantly at character boundaries.

@BenBE Would you mind having a look at this when you get a chance?

ryenus added 3 commits June 20, 2025 13:51
size line buffer for wide chars w/ arbitrary marks

This ensures sufficient space to handle UTF-8 code sequences, especially
unicode characters with arbitrary combining marks or diacritics, that
might otherwise cause buffer overflow errors.
@ryenus ryenus changed the title Wrap cmd wide CommandScreen: handle multi-byte UTF-8 code sequences while wrapping Jun 22, 2025
@ryenus ryenus force-pushed the wrap-cmd-wide branch 2 times, most recently from ebbfd75 to fe82f5a Compare June 22, 2025 14:34
@ryenus ryenus force-pushed the wrap-cmd-wide branch 2 times, most recently from cf02101 to 4f202bb Compare June 22, 2025 15:26
@Explorer09
Copy link
Contributor

I have been trying to make a few PRs (pull requests) that could bring Unicode character width support in htop. So you are not the first one that proposed the idea.

There are many places in htop codebase that need to be upgraded for Unicode character width support. So your width calculation function would be better not limited to CommandScreen.c use.

To avoid duplicate effort, maybe you could look at my PR #1642 to see what you can do with the width calculation.

int width = wcwidth(wc);
if (bytes == (size_t)-1 || bytes == (size_t)-2 || bytes == 0 || width < 0) {
unsigned char c = (unsigned char)p[i];
if (c < 0x80) { // skip mbrtowc for ASCII characters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lacks checking or assertion on control characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please clarify what the expected behavior is? I’m asking because, as far as I know, it has always worked this way, even in the main screen. For now, I’d prefer to focus on resolving the line wrapping issue first—one step at a time. 😊

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code point range [0x00, 0x1F] and 0x7F are unprintable. You shouldn't assume the width would be 1 here.

last_spc_offset = line_offset;
last_spc_cols = line_cols;
}
bytes = width = 1;
Copy link
Contributor

@Explorer09 Explorer09 Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style issue. Don't do multiple assignments like this, especially when the types of two variables are different (size_t and int).

Write the two assignments as separate lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants