8365675: Add String Unicode Case-Folding Support #27628
Open
+1,145
−163
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.
Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers now rely on methods such as:
String.equalsIgnoreCase(String)
Character.toLowerCase(int) / Character.toUpperCase(int)
String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)
1:M mapping example, U+00DF (ß)
Motivation & Direction
Add Unicode standard-compliant case-less comparison methods to the String class, enabling & improving reliable and efficient Unicode-aware/compliant case-insensitive matching.
This PR proposes to introduce the following comparison methods in
String
classThese methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.
*Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.
The New API
Usage Examples
Sharp s (U+00DF) case-folds to "ss"
Performance
The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare performance of compareToFoldCase with the existing compareToIgnoreCase().
Refs
Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt
Other Languages
Python string.casefold()
The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.
Perl’s fc()
Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD] ad "prop_invmap()" in Unicode::UCD].
ICU4J UCharacter.foldCase (Java)
Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
Method Signature (String based): public static String foldCase(String str, int options)
Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)
Key Features:
Case Folding: Converts a string to its case-folded equivalent.
Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
Context Insensitive: The mapping of a character is not affected by surrounding characters.
Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
Result Length: The resulting string can be longer or shorter than the original.
Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.
u_strFoldCase (C/C++)
A lower-level C API function for case folding a string.
Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.
Progress
Issues
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27628/head:pull/27628
$ git checkout pull/27628
Update a local copy of the PR:
$ git checkout pull/27628
$ git pull https://git.openjdk.org/jdk.git pull/27628/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 27628
View PR using the GUI difftool:
$ git pr show -t 27628
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27628.diff
Using Webrev
Link to Webrev Comment