Use icu4x for casefold()#7780
Conversation
📝 WalkthroughWalkthroughThis PR migrates ChangesCasefold Implementation Migration to ICU
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ShaharNaveh
left a comment
There was a problem hiding this comment.
Thanks, like that a lot!
Do we have more places that uses caseless? maybe worth fixing them in a follow up PR and remove it a dependency if icu4x already providing that functionality
|
Nope, The build is intentionally failing because I want to avoid a merge conflict on my end when I rebase on #7717. So after I rebase, I'll remove |
For ASCII, casefold() is equivalent to lower casing the string. Casefolding is a Unicode operation defined by the standard. Both CPython and RustPython follow the standard without any frills. The icu4x set of crates implement casefold as per the standard, so RustPython can use the ICU crates instead of adding in a crate for one operation. This also has the extra benefit of consistency - consistent Unicode version and consistent crate usage.
97d2829 to
7884239
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/builtins/str.rs (1)
752-770: ⚡ Quick winAdd regression coverage for the new ICU-backed
casefoldpaths.This swaps the folding engine and introduces custom WTF-8 chunk handling, but the provided test module still doesn't exercise
casefold. Please add focused cases for ASCII, expansion folds likeß -> ss, Greek sigma behavior, and lone-surrogate preservation so future ICU or Unicode-data bumps don't silently drift behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/vm/src/builtins/str.rs` around lines 752 - 770, The new ICU-backed casefold implementation (function casefold matching on PyKindStr::Ascii, PyKindStr::Utf8, PyKindStr::Wtf8 and using CaseMapper/Wtf8Buf) lacks regression tests; add focused unit tests that call the public string casefold behavior to cover: an ASCII-only input (e.g., "ABC" -> "abc"), an expansion fold (e.g., "ß" -> "ss"), Greek sigma cases (final vs medial sigma behavior such as "ΣςΣ" or examples that exercise final sigma folding), and a WTF-8/lone-surrogate preservation case (construct a Wtf8-containing string with a lone surrogate and assert it is preserved after casefold). Put these tests in the existing str tests module so they exercise all match arms (Ascii, Utf8 via CaseMapper, and Wtf8 via Wtf8Buf) and assert exact output bytes/characters to catch regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 752-770: The new ICU-backed casefold implementation (function
casefold matching on PyKindStr::Ascii, PyKindStr::Utf8, PyKindStr::Wtf8 and
using CaseMapper/Wtf8Buf) lacks regression tests; add focused unit tests that
call the public string casefold behavior to cover: an ASCII-only input (e.g.,
"ABC" -> "abc"), an expansion fold (e.g., "ß" -> "ss"), Greek sigma cases (final
vs medial sigma behavior such as "ΣςΣ" or examples that exercise final sigma
folding), and a WTF-8/lone-surrogate preservation case (construct a
Wtf8-containing string with a lone surrogate and assert it is preserved after
casefold). Put these tests in the existing str tests module so they exercise all
match arms (Ascii, Utf8 via CaseMapper, and Wtf8 via Wtf8Buf) and assert exact
output bytes/characters to catch regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: e1f7a133-ab85-419f-8ab1-3d1f3554adfb
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
Cargo.tomlcrates/vm/Cargo.tomlcrates/vm/src/builtins/str.rs
💤 Files with no reviewable changes (2)
- Cargo.toml
- crates/vm/Cargo.toml
| bitflagset = "0.0.3" | ||
| bstr = "1" | ||
| bzip2 = "0.6" | ||
| caseless = "0.2.2" |
@coderabbitai Please examine |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 0 🏁 Script executed: Length of output: 0 🏁 Script executed: Length of output: 1815 🏁 Script executed: Length of output: 6572
Searching that file and the entire This confirms the original review comment: the new ICU-backed implementation has no Python-level regression coverage at all. The right place to add it would be # casefold() — ICU-backed since PR `#7780`
# ASCII path (fast-path via to_ascii_lowercase)
assert "Hello World".casefold() == "hello world"
assert "ABC123".casefold() == "abc123"
# UTF-8 expansion folds
assert "ß".casefold() == "ss" # German sharp-s expands to two chars
assert "fi".casefold() == "fi" # fi ligature
assert "µ".casefold() == "μ" # micro sign → Greek small mu (U+03BC)
# Greek sigma: final sigma (ς) and capital sigma (Σ) both fold to σ
assert "Σ".casefold() == "σ"
assert "ς".casefold() == "σ"
# Identity / already-folded
assert "hello".casefold() == "hello"
assert "".casefold() == ""
# WTF-8 lone-surrogate preservation (surrogates must survive unchanged)
lone = "\udcff" # lone low surrogate — not valid UTF-8
assert lone.casefold() == loneWould you like me to open a follow-up issue to track adding this coverage, or should it be addressed directly in this PR? |
@coderabbitai Are you sure? Looking in |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chainGood catch — let me verify this properly. 🏁 Script executed: Unrecoverable error: Failed to clone repository 🏁 Script executed: Length of output: 172 🏁 Script executed: Length of output: 492
The original review comment asking to add regression coverage was therefore wrong — the coverage is already there. Would you like me to resolve that comment? (∩^ᴗ^∩) 🐇 hops away sheepishly |
For ASCII, casefold() is equivalent to lower casing the string.
Casefolding is a Unicode operation defined by the standard. Both CPython and RustPython follow the standard without any frills. The icu4x set of crates implement casefold as per the standard, so RustPython can use the ICU crates instead of adding in a crate for one operation.
This also has the extra benefit of consistency - consistent Unicode version and consistent crate usage.
Summary by CodeRabbit