Apply titlecase mapping in str.title() for uppercase digraphs by changjoon-park · Pull Request #7748 · RustPython/RustPython

changjoon-park · 2026-05-01T06:33:12Z

Background

PyStr::title() (crates/vm/src/builtins/str.rs:1040) walks each codepoint and chooses one of three actions for each cased character, depending on whether it starts a new word:

Lowercase character starting a new word → call to_titlecase()
Uppercase / titlecase character starting a new word → push unchanged
Cased character continuing a word → call to_lowercase()

Branch 2 is the gap. For ASCII letters and most Unicode characters, to_titlecase(c) == c, so pushing unchanged happens to produce the right result. But Latin Extended-B contains digraph triplets (LATIN LETTERS DZ / DŽ / LJ / NJ) where the uppercase and titlecase forms are distinct codepoints:

Codepoint	Form
U+01C4 'Ǆ'	uppercase DŽ
U+01C5 'ǅ'	titlecase Dž
U+01C6 'ǆ'	lowercase dž
U+01F1 'Ǳ'	uppercase DZ
U+01F2 'ǲ'	titlecase Dz
U+01F3 'ǳ'	lowercase dz

For these, c.to_titlecase() returns a different character than c itself, and the Branch 2 fall-through silently produced the wrong output.
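To get a feel for how narrow the affected set is, the following Python sketch scans the code space for uppercase characters whose titlecase form differs from the character itself. It uses CPython's `str.title()` on a one-character string as a stand-in for the titlecase mapping. The helper name `uppercase_with_distinct_titlecase` is made up for illustration, and the exact result depends on the Unicode database of the interpreter running it.

```python
import unicodedata


def uppercase_with_distinct_titlecase():
    """Return uppercase characters whose titlecase form is not the character itself."""
    found = []
    for cp in range(0x110000):
        c = chr(cp)
        # For a one-character string, title() applies the titlecase mapping.
        if c.isupper() and c.title() != c:
            found.append(c)
    return found


affected = uppercase_with_distinct_titlecase()
for c in affected:
    # Print code points and names only, so the output stays ASCII.
    mapped = " ".join(f"U+{ord(t):04X}" for t in c.title())
    print(f"U+{ord(c):04X} {unicodedata.name(c, '?')} -> {mapped}")
```

On a CPython build the list should include the uppercase forms of the four digraph families (U+01C4, U+01C7, U+01CA, U+01F1); any of these is an input that the unchanged push in Branch 2 would get wrong.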

Reproduction

'Ǳ'.title()        # CPython: 'ǲ'  Pre-fix: 'Ǳ'
'Ǆ'.title()        # CPython: 'ǅ'  Pre-fix: 'Ǆ'

12 cases probed against CPython 3.14.4 covering all 4 digraph families × 3 forms each (uppercase / titlecase / lowercase):

Family	upper.title()	title.title()	lower.title()
DŽ (U+01C4-01C6)	ǅ ✓	ǅ ✓	ǅ ✓
LJ (U+01C7-01C9)	ǈ ✓	ǈ ✓	ǈ ✓
NJ (U+01CA-01CC)	ǋ ✓	ǋ ✓	ǋ ✓
DZ (U+01F1-01F3)	ǲ ✓	ǲ ✓	ǲ ✓
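The 12-case grid above can be reproduced with a short loop. This is a sketch rather than the script used for the PR: the names `FAMILIES` and `probe` are hypothetical, and it relies on the fact that in each family the uppercase, titlecase and lowercase forms occupy three consecutive code points.

```python
# Uppercase code point of each Latin Extended-B digraph family; the titlecase
# and lowercase forms are the next two code points.
FAMILIES = {"DZ-caron": 0x01C4, "LJ": 0x01C7, "NJ": 0x01CA, "DZ": 0x01F1}


def probe():
    """Return {(family, form): result of .title()} for all 12 single-character cases."""
    results = {}
    for family, upper_cp in FAMILIES.items():
        for offset, form in enumerate(("upper", "title", "lower")):
            results[(family, form)] = chr(upper_cp + offset).title()
    return results


results = probe()
# Every form in a family is expected to map to that family's titlecase code point.
mismatches = [
    key for key, got in results.items() if got != chr(FAMILIES[key[0]] + 1)
]
print(f"{len(results)} cases, {len(mismatches)} mismatches")
```

Running the same loop under an interpreter that still has the Branch 2 gap would be expected to report the four `upper` cases as mismatches.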

Fix

} else if c.is_uppercase() || c.is_titlecase() {
    if previous_is_cased {
        title.extend(c.to_lowercase());
    } else {
        title.extend(c.to_titlecase());   // was: title.push_char(c)
    }
    previous_is_cased = true;
}

The lowercase branch (line 1047) already uses c.to_titlecase(), so Branch 2 is now symmetric with it. char::to_titlecase returns the same character when invoked on an ASCII uppercase letter or an already-titlecase codepoint, so no existing case regresses.
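The effect of the one-line change can be modeled outside the VM. The Python sketch below is a toy model of the three-branch loop described under Background, not a port of the Rust source: `title_model` and its `apply_fix` flag are hypothetical, and single-character `title()` / `lower()` calls stand in for `to_titlecase()` / `to_lowercase()`.

```python
def title_model(s: str, apply_fix: bool = True) -> str:
    """Toy model of the loop; apply_fix=False mimics the pre-fix Branch 2."""
    out = []
    previous_is_cased = False
    for c in s:
        if c.islower():
            # Branch 1 when starting a word, Branch 3 when continuing one.
            out.append(c.lower() if previous_is_cased else c.title())
            previous_is_cased = True
        elif c.isupper() or c.istitle():
            if previous_is_cased:
                out.append(c.lower())   # Branch 3: continuing a word
            elif apply_fix:
                out.append(c.title())   # Branch 2 after the fix
            else:
                out.append(c)           # Branch 2 before the fix: pushed unchanged
            previous_is_cased = True
        else:
            out.append(c)               # uncased character ends the current word
            previous_is_cased = False
    return "".join(out)


before = title_model("hello \u01f1 world", apply_fix=False)
after = title_model("hello \u01f1 world")
print(ascii(before), ascii(after))
```

With `apply_fix=False` the digraph stays at U+01F1, while plain ASCII input such as `"hello world"` is titled identically in both modes, which is why the gap went unnoticed in common cases.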

Verification

Output is byte-identical to CPython 3.14.4 across:

12 single-codepoint digraph cases — all 4 families × 3 forms
8 mixed-string cases — digraphs at start, mid-word, as standalone word, consecutive digraphs, multi-word strings ('hello Ǳ world' → 'Hello ǲ World')
14 sibling-method cases — upper, lower, swapcase, capitalize, casefold, istitle on digraphs unchanged
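A few checks in the same spirit are sketched below. They are illustrative picks, not the PR's actual 8 mixed-string and 14 sibling-method cases (only `'hello Ǳ world'` is quoted from the description above), and the expected values are CPython's behavior for the DZ family.

```python
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01f1", "\u01f2", "\u01f3"

# Input string -> expected result of .title()
mixed_cases = {
    "hello \u01f1 world": "Hello \u01f2 World",  # digraph as a standalone word
    "\u01f1abc": "\u01f2abc",                    # digraph at the start of a word
    "a\u01f1": "A\u01f3",                        # digraph mid-word is lowercased
    "\u01f1\u01f1": "\u01f2\u01f3",              # consecutive digraphs
}

# (actual, expected) pairs for methods that the fix should leave untouched
sibling_cases = [
    (DZ_TITLE.upper(), DZ_UPPER),
    (DZ_TITLE.lower(), DZ_LOWER),
    (DZ_UPPER.casefold(), DZ_LOWER),
    (DZ_LOWER.capitalize(), DZ_TITLE),  # titlecases the first character on CPython 3.8+
]

failures = [s for s, want in mixed_cases.items() if s.title() != want]
failures += [got for got, want in sibling_cases if got != want]
print(f"{len(failures)} failures")
```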

New Rust unit cases added to str_title (crates/vm/src/builtins/str.rs:2664) covering U+01F1 → U+01F2 and U+01C4 → U+01C5. cargo test -p rustpython-vm str_title and str_istitle both pass.

Regression sweep: test_str, test_codecs, test_format, test_pprint — 484 tests, 0 regressions.

No expectedFailure markers in Lib/test/test_str.py are tied to this root cause; this PR has no test unmask.

Issue scope note

Issue #7527 listed five examples. Probing on current main shows that four of them need no further change:

Issue example	Status on current main
`'Ᲊ'.istitle() == False`	Stale: CPython 3.14.4 itself returns `True` for this codepoint
`'Ǳ'.title() == 'ǲ'`	Still failing — fixed by this PR
`'١'.isdigit() == True`	Already passes
`'ౝ'.isidentifier() == True`	Already passes
`'㐅'.isnumeric() == True`	Already passes

This PR closes the only remaining actionable case. The expected value for 'Ᲊ'.istitle() in the issue body appears to predate a Unicode database update in CPython 3.14; current CPython behavior matches RustPython.

Closes #7527

Summary by CodeRabbit

Bug Fixes
- Corrected title-casing for word-initial uppercase characters whose titlecase form is a distinct code point (such as the Latin Extended-B digraphs), so title() now applies the Unicode titlecase mapping instead of copying the character unchanged.

coderabbitai · 2026-05-01T06:33:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4e7dd1f0-e831-4223-a087-86de74b1d1d9

📥 Commits

Reviewing files that changed from the base of the PR and between 67e454e and fb49477.

⛔ Files ignored due to path filters (1)

Lib/test/test_unicodedata.py is excluded by !Lib/**

📒 Files selected for processing (1)

crates/vm/src/builtins/str.rs

Walkthrough

PyStr::title now appends the full Unicode titlecase expansion (via to_titlecase()) for characters whose titlecase mapping yields multiple code points; tests updated to include Latin Extended-B digraph cases to validate correct title-casing behavior.

Changes

Cohort / File(s)	Summary
Title casing Unicode fix `crates/vm/src/builtins/str.rs`	Use full `to_titlecase()` expansion instead of pushing a single code point when constructing title-cased output; extend `str_title` tests with Latin Extended-B digraph cases.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

fix: Swapcase must handle multibyte expansions #7559: Same code-level fix in crates/vm/src/builtins/str.rs replacing single-char pushes with full case-mapping expansions.
fix: Handle char expansion in islower, isupper #7583: Related fixes addressing Unicode case-mapping expansions in string builtins within the same file.

Suggested reviewers

ShaharNaveh
youknowone

Poem

🐇 I nibble at bytes and hop through code,
Turning digraphs right on the Unicode road.
With to_titlecase I stretch each part,
Now ǲ and friends wear proper art.
Hooray — the string is bold and whole!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: applying titlecase mapping for uppercase digraphs in str.title().
Linked Issues check	✅ Passed	The PR fixes the digraph titlecase mismatch (Ǳ→ǲ) from issue `#7527`, directly addressing the str.title() requirement with proper unit tests.
Out of Scope Changes check	✅ Passed	All changes are focused on the str.title() titlecase mapping fix and its corresponding unit tests, remaining within the scope of the linked issue.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

ShaharNaveh · 2026-05-01T07:05:28Z

UNEXPECTED SUCCESS: test_bug_4971 (test.test_unicodedata.UnicodeMiscTest.test_bug_4971)

🥳

The uppercase/titlecase branch of PyStr::title() pushed characters unchanged when starting a new word, which left Latin Extended-B digraphs (U+01F1 'DZ', U+01C4 'DŽ', etc.) in their uppercase form instead of mapping them to their distinct titlecase counterparts (U+01F2 'Dz', U+01C5 'Dž'). For ASCII letters and characters where to_titlecase is identity this had no effect, hiding the bug for the common case.

Mirror the lowercase branch, which already calls to_titlecase() when starting a new word, so both branches symmetrically apply the titlecase mapping. char::to_titlecase is identity for already-titlecase and ASCII-uppercase characters, so existing cases stay correct.

Also unmasks test_unicodedata.UnicodeMiscTest.test_bug_4971, which asserts exactly this behavior (`'Ǆ'.title() == 'ǅ'` etc.) and was marked expectedFailure with reason `+ ǅ`.

Closes RustPython#7527 (the only example from that issue still failing on 3.14.4; the other four examples already pass on current main).

github-actions · 2026-05-01T09:07:56Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 8)
[x] test: cpython/Lib/test/test_unicode_file.py
[ ] test: cpython/Lib/test/test_unicode_file_functions.py
[ ] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[ ] test: cpython/Lib/test/test_ucn.py (TODO: 3)

dependencies:

dependent tests: (no tests depend on unicode)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

changjoon-park · 2026-05-01T09:10:57Z

The CI failure was the "unexpected success" on test_unicodedata.UnicodeMiscTest.test_bug_4971,
so the commit was force-pushed.
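For readers unfamiliar with why a newly passing test breaks CI: in Python's unittest, a test decorated with `expectedFailure` that passes is recorded as an unexpected success, and the run is then reported as unsuccessful. The sketch below demonstrates this with a stand-in test; `run_masked_test` and the `Demo` class are hypothetical and are not the actual test_unicodedata code.

```python
import unittest


def run_masked_test():
    """Run one expectedFailure-marked test that now passes and return the TestResult."""

    class Demo(unittest.TestCase):
        @unittest.expectedFailure
        def test_title_digraph(self):
            # Passes on an interpreter with a correct title(), so the marker is stale.
            self.assertEqual("\u01c4".title(), "\u01c5")

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_masked_test()
# On CPython 3.4+ this prints: 1 False
print(len(result.unexpectedSuccesses), result.wasSuccessful())
```

Removing the stale marker, as the force-pushed commit does for test_bug_4971, turns the unexpected success into an ordinary pass.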

ShaharNaveh

ty:)

changjoon-park force-pushed the fix-str-title-digraph branch from 67e454e to fb49477 Compare May 1, 2026 09:07

ShaharNaveh approved these changes May 1, 2026

youknowone approved these changes May 1, 2026

youknowone merged commit c2141a7 into RustPython:main May 1, 2026
21 checks passed

coderabbitai Bot mentioned this pull request May 2, 2026

Fix title() and capitalize() #7717

Merged

coderabbitai Bot mentioned this pull request May 13, 2026

Fix swapcase() #7788

Merged
