wasm: correct unicode support in stdlib #5255

WillLillis · 2026-01-24T21:27:25Z

This PR changes our minimal wasm stdlib such that all of the functions we export are our own implementations, instead of those from the wasi headers.

There are a few benefits to this, namely a >7x size reduction for the stdlib header (~14k -> ~2k), as well as matched behavior between parsers built for wasm (ts b --wasm) and Rust projects built targeting wasm32-unknown-unknown.

The downside is mainly the lack of full unicode support for towupper and towlower. A lot (most, it seems) of the bloat from wasi's implementations comes from unicode tables, which we don't cover in our own stdlib impl. Instead, we only handle ASCII characters, which is the approach already taken for the existing stdlib implementation. We could also just use the wasi implementations for these functions, bringing the header size back up to 7315 bytes.
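For illustration, an ASCII-only case mapper along these lines could look like the sketch below. The names `ascii_towupper` and `ascii_towlower` are hypothetical and are not the identifiers used in this PR; the point is only that no lookup table is needed when non-ASCII input passes through unchanged.

```c
#include <wctype.h>  /* wint_t */

/* Hypothetical ASCII-only replacements; the names are illustrative only. */
static wint_t ascii_towupper(wint_t c) {
  /* Only 'a'..'z' are mapped; every other code point is returned as-is. */
  if (c >= (wint_t)'a' && c <= (wint_t)'z') return c - 32u;
  return c;
}

static wint_t ascii_towlower(wint_t c) {
  /* Only 'A'..'Z' are mapped; every other code point is returned as-is. */
  if (c >= (wint_t)'A' && c <= (wint_t)'Z') return c + 32u;
  return c;
}
```

With this approach a character such as U+00E9 (é) is left untouched by both functions, which is exactly the loss of functionality described above.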

@maxbrunsfeld Do you think the tradeoff here is worth it?

~~Edit: Not sure why the sanitizer checks are failing, will look into that later tonight.~~ Cache issue

WillLillis · 2026-01-27T01:52:10Z

Looking into this further, I think we may not be sacrificing any functionality with this change. If I'm reading the documentation correctly for the various wint_t functions in wctype.c (for example, iswupper), the unicode paths are only hit if setlocale was previously called (e.g. setlocale(LC_ALL, "en_US.utf8")). We don't export setlocale for wasm builds, so it actually seems very reasonable to treat these code paths as unreachable and just handle the ASCII cases as I've done in this PR.

These re-implementations will need to be tested further before I think this is ok to merge, but I believe the previous unicode concerns I raised are actually a non-issue.

A quick GH search shows that no parsers use setlocale at all, besides tree-sitter-haskell and tree-sitter-synquid. For both of these parsers, the calls to setlocale are gated behind #ifdef TREE_SITTER_DEBUG, so they're not actually used outside of development.
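One way to check the locale question empirically is a small probe like the sketch below. It only uses standard `setlocale` and `towupper`; the struct and function names are hypothetical. The result depends on the C library in use, so no expected output is given here.

```c
#include <locale.h>
#include <stdbool.h>
#include <wctype.h>

/* True if the C library upper-cases U+00E9 (é) to U+00C9 (É)
 * under whatever locale is currently active. */
static bool maps_e_acute(void) {
  return towupper((wint_t)0xE9) == (wint_t)0xC9;
}

/* Hypothetical result record for the probe below. */
struct locale_probe {
  bool in_default_locale;  /* mapping works before any setlocale call */
  bool locale_was_set;     /* setlocale accepted the requested name */
  bool after_setlocale;    /* mapping works after the setlocale call */
};

static struct locale_probe probe_locale(const char *name) {
  struct locale_probe p;
  p.in_default_locale = maps_e_acute();
  p.locale_was_set = setlocale(LC_ALL, name) != NULL;
  p.after_setlocale = maps_e_acute();
  setlocale(LC_ALL, "C");  /* restore the default C locale */
  return p;
}
```

Calling `probe_locale("en_US.utf8")` against a given libc shows whether its wide character functions handle non-ASCII input by default or only after a locale has been set.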

maxbrunsfeld · 2026-01-27T06:15:40Z

Oh, I actually thought those functions would use UTF-8 character categories by default. I think it would be better to keep the unicode support compiled in, and look into setting the locale somehow. I think those functions are used in some scanners where UTF8 support is intended.


WillLillis · 2026-01-27T07:54:36Z

> Oh, I actually thought those functions would use UTF-8 character categories by default. I think it would be better to keep the unicode support compiled in, and look into setting the locale somehow. I think those functions are used in some scanners where UTF8 support is intended.

Gotcha. In that case, the recent additions to wctype.c should probably be moved to wctype.h as static inlines so that the wasi implementations are used. This will grow the stdlib size by about 4k, to 18,984. Using our own implementations for the functions in string.c brings the size back down to 15,559. As long as those minimal impls are sufficient, that seems like a reasonable tradeoff.
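To give a sense of what "minimal impls" means here, the sketch below shows byte-at-a-time versions of a few `string.h` functions. Which functions string.c actually covers is not stated in this thread, so the selection and the `mini_` names are assumptions made for illustration (the prefix also avoids clashing with libc when compiled natively).

```c
#include <stddef.h>  /* size_t */

/* Hypothetical minimal string.h-style helpers, one byte per iteration. */
static void *mini_memcpy(void *dst, const void *src, size_t n) {
  unsigned char *d = (unsigned char *)dst;
  const unsigned char *s = (const unsigned char *)src;
  while (n--) *d++ = *s++;
  return dst;
}

static void *mini_memset(void *dst, int value, size_t n) {
  unsigned char *d = (unsigned char *)dst;
  while (n--) *d++ = (unsigned char)value;
  return dst;
}

static size_t mini_strlen(const char *s) {
  const char *p = s;
  while (*p) p++;
  return (size_t)(p - s);
}

static int mini_strncmp(const char *a, const char *b, size_t n) {
  for (; n; n--, a++, b++) {
    unsigned char ca = (unsigned char)*a;
    unsigned char cb = (unsigned char)*b;
    if (ca != cb) return (int)ca - (int)cb;  /* first differing byte decides */
    if (ca == 0) break;                      /* both strings ended */
  }
  return 0;
}
```

Versions like these trade the speed of word-at-a-time libc implementations for a smaller compiled size, which is the tradeoff being weighed above.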

Since we don't export setlocale, the default C locale should always be in use by wasm parsers. Given this, maybe we can use a pared-down version of wasi's casemap table and its surrounding functionality to get some more size savings, since we don't need to support other locales. Just out of curiosity, I tried adding setlocale to the export list (stdlibsymbols.txt) and ran into some odd duplicate symbol errors:

Failed to compile the Tree-sitter Wasm stdlib:
wasm-ld: error: duplicate symbol: malloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: free
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: calloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: realloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
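Returning to the pared-down casemap idea mentioned before the linker output: one possible shape is a short list of contiguous ranges with a fixed offset, as sketched below. This is not the format of wasi's actual casemap table, and the struct, table, and function names are hypothetical. The few ranges shown cover only a handful of scripts.

```c
#include <stddef.h>
#include <wctype.h>

/* Hypothetical entry: lowercase code points in [first, last] map to
 * uppercase by adding `delta`. */
struct case_range {
  wint_t first;
  wint_t last;
  int delta;
};

static const struct case_range LOWER_TO_UPPER[] = {
  {0x0061, 0x007A, -32},  /* ASCII a-z */
  {0x00E0, 0x00F6, -32},  /* Latin-1 U+00E0..U+00F6 */
  {0x00F8, 0x00FE, -32},  /* Latin-1 U+00F8..U+00FE, skipping U+00F7 DIVISION SIGN */
  {0x03B1, 0x03C1, -32},  /* Greek alpha..rho */
  {0x03C3, 0x03C9, -32},  /* Greek sigma..omega, skipping final sigma U+03C2 */
  {0x0430, 0x044F, -32},  /* Cyrillic U+0430..U+044F */
  {0x0450, 0x045F, -80},  /* Cyrillic U+0450..U+045F */
};

static wint_t table_towupper(wint_t c) {
  size_t count = sizeof LOWER_TO_UPPER / sizeof LOWER_TO_UPPER[0];
  for (size_t i = 0; i < count; i++) {
    if (c >= LOWER_TO_UPPER[i].first && c <= LOWER_TO_UPPER[i].last) {
      return (wint_t)((long)c + LOWER_TO_UPPER[i].delta);
    }
  }
  return c;  /* not in any listed range: leave unchanged */
}
```

A real table would need many more ranges plus handling for irregular mappings, so the actual size savings would have to be measured rather than assumed.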

I also tested a rust wasm32-unknown-unknown build via the repro from #5158 and that use case is still supported with these changes.

WillLillis requested a review from maxbrunsfeld January 24, 2026 21:27

WillLillis marked this pull request as draft January 24, 2026 21:30

github-actions bot removed the request for review from maxbrunsfeld January 24, 2026 21:30

WillLillis force-pushed the wasm_stdlib_impls branch from 297fb3f to 0819bc0 Compare January 24, 2026 21:33

WillLillis marked this pull request as ready for review January 24, 2026 21:34

WillLillis requested a review from maxbrunsfeld January 24, 2026 21:35

WillLillis added the wasm label Jan 25, 2026

wasm(stdlib): silence incompatible pointer type warnings in alloc impl

eeebf1b

WillLillis force-pushed the wasm_stdlib_impls branch from 0819bc0 to b289c80 Compare January 27, 2026 02:22

wasm: tweak stdlib

e7dd5b0

- Use wasi-defined wide character functions for full unicode support
- Depend on custom implementations for `string.h` functions

WillLillis force-pushed the wasm_stdlib_impls branch from b289c80 to e7dd5b0 Compare January 27, 2026 07:54

fix(wasm): align wctype.h definitions with common impls

2801ff3

WillLillis changed the title ~~wasm: greatly reduce stdlib size~~ wasm: correct unicode support in stdlib Jan 27, 2026
