Skip to content

Use icu4x for casefold()#7780

Merged
youknowone merged 1 commit into
RustPython:mainfrom
joshuamegnauth54:use-icu-for-casefold
May 16, 2026
Merged

Use icu4x for casefold()#7780
youknowone merged 1 commit into
RustPython:mainfrom
joshuamegnauth54:use-icu-for-casefold

Conversation

@joshuamegnauth54
Copy link
Copy Markdown
Contributor

@joshuamegnauth54 joshuamegnauth54 commented May 4, 2026

For ASCII, casefold() is equivalent to lower casing the string.

Casefolding is a Unicode operation defined by the standard. Both CPython and RustPython follow the standard without any frills. The icu4x set of crates implement casefold as per the standard, so RustPython can use the ICU crates instead of adding in a crate for one operation.

This also has the extra benefit of consistency - consistent Unicode version and consistent crate usage.

Summary by CodeRabbit

  • Refactor
    • Enhanced string case-folding with improved Unicode support for handling international characters.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

This PR migrates PyStr::casefold() from the caseless crate to ICU-based case folding via icu_casemap::CaseMapper. Workspace and crate dependencies on caseless are removed, and the casefold implementation is rewritten to handle ASCII, UTF-8, and WTF-8 string kinds with appropriate folding strategies.

Changes

Casefold Implementation Migration to ICU

Layer / File(s) Summary
Remove caseless dependency from workspace and crate
Cargo.toml, crates/vm/Cargo.toml
The caseless crate is removed from both workspace-level and rustpython-vm crate dependencies.
Implement casefold using CaseMapper and update imports
crates/vm/src/builtins/str.rs
Module imports are adjusted to remove Wtf8Chunk and add CaseMapper from icu_casemap. The casefold() method is rewritten to use CaseMapper::fold_string() for UTF-8 strings, ASCII lowercase for ASCII strings, and selective folding with byte preservation for WTF-8 strings.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • RustPython/RustPython#7555: Both PRs migrate PyStr Unicode logic to ICU-backed implementations—this PR updates casefold() via icu_casemap, while the related PR updates property checks like isdecimal/isspace/isidentifier via icu_properties.

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

🐰 The caseless crate hops away,
As ICU takes the folding day,
ASCII, UTF-8, WTF-8 so bright,
Case folding shines with new light!
Unicode magic, crystal and clean! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Use icu4x for casefold()' directly and specifically describes the main change—replacing the caseless implementation with icu4x for Unicode case folding, which is the primary objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, like that a lot!
Do we have more places that uses caseless? maybe worth fixing them in a follow up PR and remove it a dependency if icu4x already providing that functionality

@joshuamegnauth54
Copy link
Copy Markdown
Contributor Author

Nope, caseless is used exactly once, so it's an easy win. 😁

The build is intentionally failing because I want to avoid a merge conflict on my end when I rebase on #7717. So after I rebase, I'll remove caseless and then force push.

For ASCII, casefold() is equivalent to lower casing the string.

Casefolding is a Unicode operation defined by the standard. Both CPython
and RustPython follow the standard without any frills. The icu4x set of
crates implement casefold as per the standard, so RustPython can use
the ICU crates instead of adding in a crate for one operation.

This also has the extra benefit of consistency - consistent Unicode
version and consistent crate usage.
@joshuamegnauth54 joshuamegnauth54 marked this pull request as ready for review May 15, 2026 18:54
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/vm/src/builtins/str.rs (1)

752-770: ⚡ Quick win

Add regression coverage for the new ICU-backed casefold paths.

This swaps the folding engine and introduces custom WTF-8 chunk handling, but the provided test module still doesn't exercise casefold. Please add focused cases for ASCII, expansion folds like ß -> ss, Greek sigma behavior, and lone-surrogate preservation so future ICU or Unicode-data bumps don't silently drift behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/vm/src/builtins/str.rs` around lines 752 - 770, The new ICU-backed
casefold implementation (function casefold matching on PyKindStr::Ascii,
PyKindStr::Utf8, PyKindStr::Wtf8 and using CaseMapper/Wtf8Buf) lacks regression
tests; add focused unit tests that call the public string casefold behavior to
cover: an ASCII-only input (e.g., "ABC" -> "abc"), an expansion fold (e.g., "ß"
-> "ss"), Greek sigma cases (final vs medial sigma behavior such as "ΣςΣ" or
examples that exercise final sigma folding), and a WTF-8/lone-surrogate
preservation case (construct a Wtf8-containing string with a lone surrogate and
assert it is preserved after casefold). Put these tests in the existing str
tests module so they exercise all match arms (Ascii, Utf8 via CaseMapper, and
Wtf8 via Wtf8Buf) and assert exact output bytes/characters to catch regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 752-770: The new ICU-backed casefold implementation (function
casefold matching on PyKindStr::Ascii, PyKindStr::Utf8, PyKindStr::Wtf8 and
using CaseMapper/Wtf8Buf) lacks regression tests; add focused unit tests that
call the public string casefold behavior to cover: an ASCII-only input (e.g.,
"ABC" -> "abc"), an expansion fold (e.g., "ß" -> "ss"), Greek sigma cases (final
vs medial sigma behavior such as "ΣςΣ" or examples that exercise final sigma
folding), and a WTF-8/lone-surrogate preservation case (construct a
Wtf8-containing string with a lone surrogate and assert it is preserved after
casefold). Put these tests in the existing str tests module so they exercise all
match arms (Ascii, Utf8 via CaseMapper, and Wtf8 via Wtf8Buf) and assert exact
output bytes/characters to catch regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: e1f7a133-ab85-419f-8ab1-3d1f3554adfb

📥 Commits

Reviewing files that changed from the base of the PR and between ddfcb25 and 7884239.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • Cargo.toml
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/str.rs
💤 Files with no reviewable changes (2)
  • Cargo.toml
  • crates/vm/Cargo.toml

Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYSM!

Comment thread Cargo.toml
bitflagset = "0.0.3"
bstr = "1"
bzip2 = "0.6"
caseless = "0.2.2"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@fanninpm
Copy link
Copy Markdown
Contributor

752-770: ⚡ Quick win

Add regression coverage for the new ICU-backed casefold paths.

This swaps the folding engine and introduces custom WTF-8 chunk handling, but the provided test module still doesn't exercise casefold. Please add focused cases for ASCII, expansion folds like ß -> ss, Greek sigma behavior, and lone-surrogate preservation so future ICU or Unicode-data bumps don't silently drift behavior.

@coderabbitai Please examine StrTests.test_casefold() in Lib/test/test_str.py.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the test file
fd -q "test_str.py" Lib/test/ 2>/dev/null || fd -q "test_str.py" 2>/dev/null

# Look for test_casefold in the file
rg -n "test_casefold" Lib/test/test_str.py 2>/dev/null || rg -rn "test_casefold" --type py 2>/dev/null

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for test_str.py anywhere in the repo
fd "test_str.py" 2>/dev/null

# Search for any test_casefold occurrences in any file
rg -rn "test_casefold" 2>/dev/null

# Also check for casefold in test files
rg -rn "casefold" --type py 2>/dev/null | head -60

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check overall repo structure
tree -L 2 -d 2>/dev/null | head -40

# Look for any test directories
fd -t d "test" 2>/dev/null | head -20

# Look at what Python test files exist
fd -e py -t f 2>/dev/null | grep -i "test" | head -30

# Check if there's a Lib directory at all
ls -la Lib/ 2>/dev/null | head -20

# Look at the PR diff for test additions
git diff HEAD~1 -- "*.py" 2>/dev/null | head -80

Length of output: 1815


🏁 Script executed:

#!/bin/bash
# Look for string-related snippets
fd -e py -t f . extra_tests/ 2>/dev/null | grep -i "str" | head -20

# Check all snippets for casefold
rg -rn "casefold" extra_tests/ 2>/dev/null

# Also check if there's a builtin_str.py
cat extra_tests/snippets/builtin_str.py 2>/dev/null | head -80

# Check what the PR changed in test files
git show --stat HEAD 2>/dev/null | head -20
git diff HEAD~1 HEAD 2>/dev/null | head -100

Length of output: 6572


@fanninpmLib/test/test_str.py does not exist in this repository. RustPython does not bundle CPython's standard test suite; Lib/ is not present in the repo at all. The project's own Python-level string tests live in extra_tests/snippets/builtin_str.py.

Searching that file and the entire extra_tests/ tree for casefold returns zero results — there is no test_casefold (or any casefold assertion) anywhere in the repo's test layer. The PR touches only Rust source files and Cargo manifests.

This confirms the original review comment: the new ICU-backed implementation has no Python-level regression coverage at all. The right place to add it would be extra_tests/snippets/builtin_str.py. Here is a suggested block that covers the four distinct code paths introduced by the PR:

# casefold() — ICU-backed since PR `#7780`
# ASCII path (fast-path via to_ascii_lowercase)
assert "Hello World".casefold() == "hello world"
assert "ABC123".casefold() == "abc123"

# UTF-8 expansion folds
assert "ß".casefold() == "ss"          # German sharp-s expands to two chars
assert "fi".casefold() == "fi"          # fi ligature
assert "µ".casefold() == "μ"           # micro sign → Greek small mu (U+03BC)

# Greek sigma: final sigma (ς) and capital sigma (Σ) both fold to σ
assert "Σ".casefold() == "σ"
assert "ς".casefold() == "σ"

# Identity / already-folded
assert "hello".casefold() == "hello"
assert "".casefold() == ""

# WTF-8 lone-surrogate preservation (surrogates must survive unchanged)
lone = "\udcff"  # lone low surrogate — not valid UTF-8
assert lone.casefold() == lone

Would you like me to open a follow-up issue to track adding this coverage, or should it be addressed directly in this PR?

@fanninpm
Copy link
Copy Markdown
Contributor

RustPython does not bundle CPython's standard test suite; Lib/ is not present in the repo at all

@coderabbitai Are you sure? Looking in .coderabbit.yml, it seems the Lib/ is given special treatment.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Good catch — let me verify this properly.


🏁 Script executed:

#!/bin/bash
# Check .coderabbit.yml for Lib/ references
cat .coderabbit.yml 2>/dev/null || echo "NOT FOUND"

echo "---"
# Check if Lib/ is a git submodule
cat .gitmodules 2>/dev/null || echo "No .gitmodules"

echo "---"
# Check if Lib/ directory exists and what's in it
ls -la Lib/ 2>/dev/null | head -20 || echo "Lib/ not present"

echo "---"
# Try to find test_str.py under Lib/
find Lib/ -name "test_str.py" 2>/dev/null | head -5

echo "---"
# Search for casefold in any .py under Lib/ if it exists
rg -rn "casefold" Lib/ 2>/dev/null | head -40 || echo "No casefold hits in Lib/"

echo "---"
# Also check git submodule status
git submodule status 2>/dev/null | head -20 || echo "No submodule status"

Unrecoverable error: Failed to clone repository


🏁 Script executed:

#!/bin/bash
# Fetch .coderabbit.yml from the repo via GitHub API
gh api repos/RustPython/RustPython/contents/.coderabbit.yml \
  --jq '.content' | base64 -d 2>/dev/null || echo "NOT FOUND"

echo "==="
# Check if Lib/test/test_str.py exists in the repo
gh api "repos/RustPython/RustPython/contents/Lib/test/test_str.py" \
  --jq '.message // "EXISTS (size=\(.size))"' 2>/dev/null || echo "API error"

echo "==="
# Check if Lib/ directory exists at all
gh api "repos/RustPython/RustPython/contents/Lib" \
  --jq '[.[] | .name] | first(.[])' 2>/dev/null || echo "Lib/ not found or API error"

echo "==="
# Check .gitmodules for submodule info
gh api "repos/RustPython/RustPython/contents/.gitmodules" \
  --jq '.content' | base64 -d 2>/dev/null || echo "No .gitmodules"

Length of output: 172


🏁 Script executed:

#!/bin/bash
# Fetch test_str.py and search for casefold
gh api "repos/RustPython/RustPython/contents/Lib/test/test_str.py" \
  --jq '.content' | base64 -d | grep -n -i "casefold" 2>/dev/null || echo "No casefold found"

Length of output: 492


@fanninpm — You are absolutely right, and I apologize for the incorrect claim. Lib/ does exist in the repository and it does bundle CPython's standard test suite. My earlier script runs simply failed to find it due to the sandbox not cloning the repository properly.

Lib/test/test_str.py is present (130 KB) and already contains a thorough test_casefold() method at line 915 covering exactly the cases I had suggested were missing:

Line Assertion
916 'hello'.casefold() == 'hello' (identity)
917 'hELlo'.casefold() == 'hello' (ASCII path)
918 'ß'.casefold() == 'ss' (expansion fold)
919 'fi'.casefold() == 'fi' (ligature)
920 '\u03a3'.casefold() == '\u03c3' (Greek Σ → σ)
921 'A\u0345\u03a3'.casefold() == 'a\u03b9\u03c3' (combining + sigma)
922 '\u00b5'.casefold() == '\u03bc' (micro sign → Greek mu)
1065 .casefold() on a supplementary-plane string

The original review comment asking to add regression coverage was therefore wrong — the coverage is already there. Would you like me to resolve that comment?

(∩^ᴗ^∩) 🐇 hops away sheepishly

@youknowone youknowone merged commit 078e23d into RustPython:main May 16, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants