From 10b52ff7563da1f95264db06dab9f2f7d312c633 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 18 Jul 2025 11:55:38 +0800
Subject: [PATCH 1/3] charset: update deal with invalid utf8 characters

---
 character-set-and-collation.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index 5b5b3d845826e..c1aee76c978d2 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -448,6 +448,20 @@ To disable this error reporting, use `set @@tidb_skip_utf8_check=1;` to skip the
 >
 > If the character check is skipped, TiDB might fail to detect illegal UTF-8 characters written by the application, cause decoding errors when `ANALYZE` is executed, and introduce other unknown encoding issues. If your application cannot guarantee the validity of the written string, it is not recommended to skip the character check.
 
+In certain SQL statements, comparisons might involve invalid UTF-8 characters. For example:
+
+```sql
+SELECT * FROM `t` WHERE `id` > 'a" + string([]byte{0xff}) + "a';
+```
+
+In the preceding statement, `0xff` is an invalid UTF-8 byte. When handling such characters, the behavior of TiDB depends on the collation:
+
+* Non-binary collations (such as `utf8mb4_general_ci`): TiDB truncates the string at the invalid byte. The truncated part is excluded from the comparison.
+
+* `gbk_bin` and `gb18030_bin`: TiDB replaces invalid bytes with the character `?` and continues with the comparison.
+
+* Other binary collations (such as `utf8_bin`): TiDB treats invalid bytes as ordinary bytes and compares them based on their original binary values.
+
 ## Collation support framework

From 676f10bc1d3451be4be5c5770ebf5d016cd17f7b Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 18 Jul 2025 16:24:09 +0800
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: Grace Cai
---
 character-set-and-collation.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index c1aee76c978d2..3028b83351437 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -454,11 +454,11 @@ In certain SQL statements, comparisons might involve invalid UTF-8 characters. F
 SELECT * FROM `t` WHERE `id` > 'a" + string([]byte{0xff}) + "a';
 ```
 
-In the preceding statement, `0xff` is an invalid UTF-8 byte. When handling such characters, the behavior of TiDB depends on the collation:
+In the preceding statement, `0xff` is an invalid UTF-8 byte. When handling such characters, TiDB behaves differently depending on the collation:
 
 * Non-binary collations (such as `utf8mb4_general_ci`): TiDB truncates the string at the invalid byte. The truncated part is excluded from the comparison.
 
-* `gbk_bin` and `gb18030_bin`: TiDB replaces invalid bytes with the character `?` and continues with the comparison.
+* `gbk_bin` and `gb18030_bin` collations: TiDB replaces invalid bytes with the character `?` and continues with the comparison.
 
 * Other binary collations (such as `utf8_bin`): TiDB treats invalid bytes as ordinary bytes and compares them based on their original binary values.
From 1caf0bfa1aa9a368419b2a0b5e1096e7cebdb899 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 18 Jul 2025 17:29:54 +0800
Subject: [PATCH 3/3] add new in v9.0.0

---
 character-set-and-collation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/character-set-and-collation.md b/character-set-and-collation.md
index 3028b83351437..6d791a63698ef 100644
--- a/character-set-and-collation.md
+++ b/character-set-and-collation.md
@@ -454,7 +454,7 @@ In certain SQL statements, comparisons might involve invalid UTF-8 characters. F
 SELECT * FROM `t` WHERE `id` > 'a" + string([]byte{0xff}) + "a';
 ```
 
-In the preceding statement, `0xff` is an invalid UTF-8 byte. When handling such characters, TiDB behaves differently depending on the collation:
+In the preceding statement, `0xff` is an invalid UTF-8 byte. When handling such characters, TiDB behaves differently depending on the collation New in v9.0.0:
 
 * Non-binary collations (such as `utf8mb4_general_ci`): TiDB truncates the string at the invalid byte. The truncated part is excluded from the comparison.