-
-
Notifications
You must be signed in to change notification settings - Fork 488
CommandScreen: handle multi-byte UTF-8 code sequences while wrapping #1726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
size line buffer for wide chars w/ arbitrary marks This ensures sufficient space to handle UTF-8 code sequences, especially unicode characters with arbitrary combining marks or diacritics, that might otherwise cause buffer overflow errors.
ebbfd75
to
fe82f5a
Compare
cf02101
to
4f202bb
Compare
I have been trying to make a few PRs (pull requests) that could bring Unicode character width support in htop. So you are not the first one that proposed the idea. There are many places in htop codebase that need to be upgraded for Unicode character width support. So your width calculation function would be better not limited to To avoid duplicate effort, maybe you could look at my PR #1642 to see what you can do with the width calculation. |
int width = wcwidth(wc); | ||
if (bytes == (size_t)-1 || bytes == (size_t)-2 || bytes == 0 || width < 0) { | ||
unsigned char c = (unsigned char)p[i]; | ||
if (c < 0x80) { // skip mbrtowc for ASCII characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lacks checking or assertion on control characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify what the expected behavior is? I’m asking because, as far as I know, it has always worked this way, even in the main screen. For now, I’d prefer to focus on resolving the line wrapping issue first—one step at a time. 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code point range [0x00, 0x1F] and 0x7F are unprintable. You shouldn't assume the width would be 1 here.
last_spc_offset = line_offset; | ||
last_spc_cols = line_cols; | ||
} | ||
bytes = width = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style issue. Don't do multiple assignments like this, especially when the types of two variables are different (size_t
and int
).
Write the two assignments as separate lines.
In earlier implementation of the command screen, the process command was treated as a sequence of single-byte characters. When there are wide chars, especially UTF-8 byte sequences in the command, it's possible to have part of the multi-byte sequence wrapped to next line.
To fix that we now leverage
mbrtowc
andwcwidth
to know the byte count and character width for non-ASCII characters, so that we can now wrap near window edge and more importantly at character boundaries.@BenBE Would you mind having a look at this when you get a chance?