Apply titlecase mapping in str.title() for uppercase digraphs#7748
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (1)
📝 WalkthroughWalkthroughPyStr::title now appends the full Unicode titlecase expansion (via to_titlecase()) for characters whose titlecase mapping yields multiple code points; tests updated to include Latin Extended-B digraph cases to validate correct title-casing behavior. Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
🥳 |
The uppercase/titlecase branch of PyStr::title() pushed characters unchanged when starting a new word, which left Latin Extended-B digraphs (U+01F1 'DZ', U+01C4 'DŽ', etc.) in their uppercase form instead of mapping them to their distinct titlecase counterparts (U+01F2 'Dz', U+01C5 'Dž'). For ASCII letters and characters where to_titlecase is identity this had no effect, hiding the bug for the common case. Mirror the lowercase branch — which already calls to_titlecase() when starting a new word — so both branches symmetrically apply the titlecase mapping. char::to_titlecase is identity for already- titlecase and ASCII-uppercase characters, so existing cases stay correct. Also unmasks test_unicodedata.UnicodeMiscTest.test_bug_4971, which asserts exactly this behavior (`'DŽ'.title() == 'Dž'` etc.) and was marked expectedFailure with reason `+ Dž`. Closes RustPython#7527 (the only example from that issue still failing on 3.14.4; the other four examples already pass on current main).
67e454e to
fb49477
Compare
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 8) dependencies: dependent tests: (no tests depend on unicode) Legend:
|
|
ci failure was an the "unexpected success" on test_unicodedata.UnicodeMiscTest.test_bug_4971 |
Background
PyStr::title()(crates/vm/src/builtins/str.rs:1040) walks each codepoint and chooses one of three actions per word boundary:to_titlecase()to_lowercase()Branch 2 is the gap. For ASCII letters and most Unicode characters,
to_titlecase(c) == c, so pushing unchanged happens to produce the right result. But Latin Extended-B contains digraph triplets (LATIN LETTERS DZ / DŽ / LJ / NJ) where the uppercase and titlecase forms are distinct codepoints:For these,
c.to_titlecase()returns a different character thancitself, and the Branch 2 fall-through silently produced the wrong output.Reproduction
12 cases probed against CPython 3.14.4 covering all 4 digraph families × 3 forms each (uppercase / titlecase / lowercase):
Fix
The lowercase branch (line 1047) already uses
c.to_titlecase()— Branch 2 is now symmetric.char::to_titlecasereturns the same character when invoked on an ASCII letter or already-titlecase codepoint, so no existing case regresses.Verification
CPython 3.14.4 byte-identical across:
'hello DZ world' → 'Hello Dz World')upper,lower,swapcase,capitalize,casefold,istitleon digraphs unchangedNew Rust unit cases added to
str_title(crates/vm/src/builtins/str.rs:2664) coveringU+01F1 → U+01F2andU+01C4 → U+01C5.cargo test -p rustpython-vm str_titleandstr_istitleboth pass.Regression sweep:
test_str,test_codecs,test_format,test_pprint— 484 tests, 0 regressions.No
expectedFailuremarkers inLib/test/test_str.pyare tied to this root cause; this PR has no test unmask.Issue scope note
Issue #7527 listed five examples. Probing on current main shows four already pass:
''.istitle() == FalseTruefor this codepoint'DZ'.title() == 'Dz''١'.isdigit() == True'ౝ'.isidentifier() == True'㐅'.isnumeric() == TrueThis PR closes the only remaining actionable case. The
.istitle()expected value in the issue body appears to predate a Unicode database update in CPython 3.14 — current CPython behavior matches RustPython.Closes #7527
Summary by CodeRabbit