Apply titlecase mapping in str.title() for uppercase digraphs #7748

Merged
youknowone merged 1 commit into RustPython:main from changjoon-park:fix-str-title-digraph
May 1, 2026

Conversation

@changjoon-park
Contributor

@changjoon-park changjoon-park commented May 1, 2026

Background

PyStr::title() (crates/vm/src/builtins/str.rs:1040) walks each codepoint and takes one of three actions for each cased character, depending on whether it starts or continues a word:

  1. Lowercase character starting a new word → call to_titlecase()
  2. Uppercase / titlecase character starting a new word → push unchanged
  3. Cased character continuing a word → call to_lowercase()

Branch 2 is the gap. For ASCII letters and most Unicode characters, to_titlecase(c) == c, so pushing unchanged happens to produce the right result. But Latin Extended-B contains digraph triplets (LATIN LETTERS DZ / DŽ / LJ / NJ) where the uppercase and titlecase forms are distinct codepoints:

| Codepoint | Form |
|---|---|
| U+01C4 'Ǆ' | uppercase DŽ |
| U+01C5 'ǅ' | titlecase Dž |
| U+01C6 'ǆ' | lowercase dž |
| U+01F1 'Ǳ' | uppercase DZ |
| U+01F2 'ǲ' | titlecase Dz |
| U+01F3 'ǳ' | lowercase dz |

For these, c.to_titlecase() returns a different character than c itself, and the Branch 2 fall-through silently produced the wrong output.
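The three-form structure can be checked with Python's standard unicodedata module. The sketch below is illustrative only: the helper name describe is an assumption, not something from the PR. It prints the general category (Lu, Lt, Ll) and the Unicode name of each member of the DZ triplet.

```python
import unicodedata


def describe(codepoint: int) -> tuple[str, str]:
    """Return (general category, Unicode name) for a codepoint."""
    ch = chr(codepoint)
    return unicodedata.category(ch), unicodedata.name(ch)


# U+01F1..U+01F3: uppercase, titlecase, and lowercase forms of the DZ digraph.
for cp in (0x01F1, 0x01F2, 0x01F3):
    category, name = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
```

The middle codepoint has category Lt (titlecase letter), which is why it cannot be reached by leaving the uppercase form unchanged.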

Reproduction

'Ǳ'.title()        # U+01F1  CPython: 'ǲ' (U+01F2)  Pre-fix: 'Ǳ' (U+01F1, unchanged)
'Ǆ'.title()        # U+01C4  CPython: 'ǅ' (U+01C5)  Pre-fix: 'Ǆ' (U+01C4, unchanged)

12 cases probed against CPython 3.14.4 covering all 4 digraph families × 3 forms each (uppercase / titlecase / lowercase):

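| Family | upper.title() | title.title() | lower.title() |
|---|---|---|---|
| DŽ (U+01C4-01C6) | ǅ (U+01C5) ✓ | ǅ (U+01C5) ✓ | ǅ (U+01C5) ✓ |
| LJ (U+01C7-01C9) | ǈ (U+01C8) ✓ | ǈ (U+01C8) ✓ | ǈ (U+01C8) ✓ |
| NJ (U+01CA-01CC) | ǋ (U+01CB) ✓ | ǋ (U+01CB) ✓ | ǋ (U+01CB) ✓ |
| DZ (U+01F1-01F3) | ǲ (U+01F2) ✓ | ǲ (U+01F2) ✓ | ǲ (U+01F2) ✓ |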
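A probe of this shape can be reproduced with a few lines of Python. This is a sketch, not the script used for the PR: the FAMILIES table and the probe function are hypothetical names, and the codepoints are taken from the ranges in the table above.

```python
# (uppercase, titlecase, lowercase) codepoints for each digraph family.
FAMILIES = {
    "DZ WITH CARON": (0x01C4, 0x01C5, 0x01C6),
    "LJ": (0x01C7, 0x01C8, 0x01C9),
    "NJ": (0x01CA, 0x01CB, 0x01CC),
    "DZ": (0x01F1, 0x01F2, 0x01F3),
}


def probe() -> list[tuple[str, str, bool]]:
    """Check that every form of every family titlecases to the titlecase codepoint."""
    results = []
    for family, (upper, title, lower) in FAMILIES.items():
        expected = chr(title)
        for form, cp in (("upper", upper), ("title", title), ("lower", lower)):
            results.append((family, form, chr(cp).title() == expected))
    return results


for family, form, ok in probe():
    print(f"{family:14} {form:5} {'ok' if ok else 'MISMATCH'}")
```

Running the same file under both interpreters and comparing the output is one way to confirm agreement for all 12 cases.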

Fix

} else if c.is_uppercase() || c.is_titlecase() {
    if previous_is_cased {
        title.extend(c.to_lowercase());
    } else {
        title.extend(c.to_titlecase());   // was: title.push_char(c)
    }
    previous_is_cased = true;
}

The lowercase branch (line 1047) already uses c.to_titlecase(), so Branch 2 is now symmetric with it. char::to_titlecase returns the same character when invoked on an ASCII uppercase letter or an already-titlecase codepoint, so no existing case regresses.
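The effect of the change can be modelled in Python. The sketch below is not RustPython's code: title_model and is_titlecase are hypothetical names, single-character str.title() stands in for to_titlecase(), and "titlecase" is assumed to mean general category Lt. The fixed flag switches Branch 2 between the old and new behaviour.

```python
import unicodedata


def is_titlecase(c: str) -> bool:
    # Assumption: a titlecase character is one with general category Lt.
    return unicodedata.category(c) == "Lt"


def title_model(s: str, fixed: bool = True) -> str:
    """Model of the three-branch title() loop described above."""
    out = []
    previous_is_cased = False
    for c in s:
        if c.islower():
            # Branch 1: a lowercase character starting a word is titlecased;
            # a lowercase character continuing a word is kept as-is.
            out.append(c if previous_is_cased else c.title())
            previous_is_cased = True
        elif c.isupper() or is_titlecase(c):
            if previous_is_cased:
                out.append(c.lower())   # Branch 3: continuing a word
            elif fixed:
                out.append(c.title())   # Branch 2 after the fix
            else:
                out.append(c)           # Branch 2 before the fix
            previous_is_cased = True
        else:
            # Uncased characters end the current word.
            out.append(c)
            previous_is_cased = False
    return "".join(out)


print(title_model("hello \u01f1 world", fixed=False))
print(title_model("hello \u01f1 world", fixed=True))
```

With fixed=False the U+01F1 character passes through unchanged; with fixed=True it becomes U+01F2, while plain ASCII input is unaffected either way.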

Verification

CPython 3.14.4 byte-identical across:

  • 12 single-codepoint digraph cases: all 4 families × 3 forms
  • 8 mixed-string cases: digraphs at start, mid-word, as standalone word, consecutive digraphs, multi-word strings ('hello Ǳ world' → 'Hello ǲ World', where Ǳ is U+01F1 and ǲ is U+01F2)
  • 14 sibling-method cases: upper, lower, swapcase, capitalize, casefold, istitle on digraphs unchanged
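The sibling-method comparison can be sketched the same way. The helper sibling_results below is a hypothetical name, and the selection of methods is only a subset of those listed; the idea is to run the file under both CPython and RustPython and diff the printed output.

```python
# The DZ WITH CARON triplet: uppercase, titlecase, lowercase.
UPPER, TITLE, LOWER = "\u01c4", "\u01c5", "\u01c6"


def sibling_results(ch: str) -> dict[str, object]:
    """Collect the results of case-related methods other than title()."""
    return {
        "upper": ch.upper(),
        "lower": ch.lower(),
        "casefold": ch.casefold(),
        "capitalize": ch.capitalize(),
        "istitle": ch.istitle(),
    }


for ch in (UPPER, TITLE, LOWER):
    # Print escaped forms so the output is unambiguous in any terminal.
    escaped = {k: ascii(v) for k, v in sibling_results(ch).items()}
    print(f"U+{ord(ch):04X} {escaped}")
```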

New Rust unit cases added to str_title (crates/vm/src/builtins/str.rs:2664) covering U+01F1 → U+01F2 and U+01C4 → U+01C5. cargo test -p rustpython-vm str_title and str_istitle both pass.

Regression sweep: test_str, test_codecs, test_format, test_pprint: 484 tests, 0 regressions.

No expectedFailure markers in Lib/test/test_str.py are tied to this root cause; this PR has no test unmask.

Issue scope note

Issue #7527 listed five examples. Probing on current main shows four already pass:

| Issue example | Status on current main |
|---|---|
| 'Ᲊ'.istitle() == False | Stale: CPython 3.14.4 itself returns True for this codepoint |
| 'Ǳ'.title() == 'ǲ' | Still failing; fixed by this PR |
| '١'.isdigit() == True | Already passes |
| 'ౝ'.isidentifier() == True | Already passes |
| '㐅'.isnumeric() == True | Already passes |

This PR closes the only remaining actionable case. The Ᲊ.istitle() expected value in the issue body appears to predate a Unicode database update in CPython 3.14 — current CPython behavior matches RustPython.

Closes #7527

Summary by CodeRabbit

  • Bug Fixes
    • Corrected title-casing for Unicode characters whose titlecase mapping expands to multiple characters (including Latin Extended‑B digraphs), so title() now produces proper multi-character expansions and consistent title-case output across scripts.

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4e7dd1f0-e831-4223-a087-86de74b1d1d9

📥 Commits

Reviewing files that changed from the base of the PR and between 67e454e and fb49477.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_unicodedata.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/vm/src/builtins/str.rs

📝 Walkthrough

Walkthrough

PyStr::title now appends the full Unicode titlecase expansion (via to_titlecase()) for characters whose titlecase mapping yields multiple code points; tests updated to include Latin Extended-B digraph cases to validate correct title-casing behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Title casing Unicode fix: crates/vm/src/builtins/str.rs | Use full to_titlecase() expansion instead of pushing a single code point when constructing title-cased output; extend str_title tests with Latin Extended-B digraph cases. |

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

🐇 I nibble at bytes and hop through code,
Turning digraphs right on the Unicode road.
With to_titlecase I stretch each part,
Now Dz and friends wear proper art.
Hooray — the string is bold and whole!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: applying titlecase mapping for uppercase digraphs in str.title(). |
| Linked Issues check | ✅ Passed | The PR fixes the digraph titlecase mismatch (U+01F1 'Ǳ' → U+01F2 'ǲ') from issue #7527, directly addressing the str.title() requirement with proper unit tests. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on the str.title() titlecase mapping fix and its corresponding unit tests, remaining within the scope of the linked issue. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@ShaharNaveh
Contributor

ShaharNaveh commented May 1, 2026

UNEXPECTED SUCCESS: test_bug_4971 (test.test_unicodedata.UnicodeMiscTest.test_bug_4971)

🥳

The uppercase/titlecase branch of PyStr::title() pushed characters
unchanged when starting a new word, which left Latin Extended-B
digraphs (U+01F1 'DZ', U+01C4 'DŽ', etc.) in their uppercase form
instead of mapping them to their distinct titlecase counterparts
(U+01F2 'Dz', U+01C5 'Dž'). For ASCII letters and characters where
to_titlecase is identity this had no effect, hiding the bug for the
common case.

Mirror the lowercase branch — which already calls to_titlecase()
when starting a new word — so both branches symmetrically apply
the titlecase mapping. char::to_titlecase is identity for already-
titlecase and ASCII-uppercase characters, so existing cases stay
correct.

Also unmasks test_unicodedata.UnicodeMiscTest.test_bug_4971, which
asserts exactly this behavior (`'DŽ'.title() == 'Dž'` etc.)
and was marked expectedFailure with reason `+ Dž`.

Closes RustPython#7527 (the only example from that issue still failing on
3.14.4; the other four examples already pass on current main).
@changjoon-park changjoon-park force-pushed the fix-str-title-digraph branch from 67e454e to fb49477 Compare May 1, 2026 09:07
@github-actions
Contributor

github-actions Bot commented May 1, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 8)
[x] test: cpython/Lib/test/test_unicode_file.py
[ ] test: cpython/Lib/test/test_unicode_file_functions.py
[ ] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[ ] test: cpython/Lib/test/test_ucn.py (TODO: 3)

dependencies:

dependent tests: (no tests depend on unicode)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@changjoon-park
Contributor Author

The CI failure was the "unexpected success" on test_unicodedata.UnicodeMiscTest.test_bug_4971,
so the commit was force-pushed.

Contributor

@ShaharNaveh ShaharNaveh left a comment

ty:)

@youknowone youknowone merged commit c2141a7 into RustPython:main May 1, 2026
21 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request May 13, 2026

Development

Successfully merging this pull request may close these issues.

Disagreement with CPython for str.istitle/title/isdigit/isidentifier/isnumeric

3 participants