@WillLillis
Member

@WillLillis WillLillis commented Jan 24, 2026

This PR changes our minimal wasm stdlib such that all of the functions we export are our own implementations, instead of those from the wasi headers.

There are a few benefits to this, namely the >7x size reduction for the stdlib header (~14k -> ~2k), as well as matched behavior between parsers built for wasm (ts b --wasm) and Rust projects built targeting wasm32-unknown-unknown.

The downside is mainly the lack of full unicode support for towupper and towlower. A lot (most, it seems) of the bloat from wasi's implementations comes from unicode tables, which we don't cover in our own stdlib impl. Instead, we only handle ASCII characters, which is the approach already taken for the existing stdlib implementation. We could also just use the wasi implementations for these functions, bringing the header size back up to 7315 bytes.
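For illustration, an ASCII-only case mapping of the kind described above could look roughly like the sketch below. The `ts_ascii_` names are hypothetical and chosen only to avoid clashing with the libc symbols; this is not the code from the PR.

```c
#include <wctype.h> /* wint_t */

/* Hypothetical names: the ts_ascii_ prefix is an assumption for this
 * sketch, used so the functions do not collide with libc's towupper
 * and towlower. */

/* Map ASCII 'a'..'z' to 'A'..'Z'; every other value is returned unchanged. */
wint_t ts_ascii_towupper(wint_t c) {
  return (c >= L'a' && c <= L'z') ? c - (L'a' - L'A') : c;
}

/* Map ASCII 'A'..'Z' to 'a'..'z'; every other value is returned unchanged. */
wint_t ts_ascii_towlower(wint_t c) {
  return (c >= L'A' && c <= L'Z') ? c + (L'a' - L'A') : c;
}
```

Non-ASCII code points such as U+00E9 pass through untouched, which is exactly the loss of Unicode case mapping discussed above.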

@maxbrunsfeld Do you think the tradeoff here is worth it?

Edit: Not sure why the sanitizer checks are failing, will look into that later tonight. Update: this was a cache issue.

@WillLillis WillLillis marked this pull request as draft January 24, 2026 21:30
@github-actions github-actions bot removed the request for review from maxbrunsfeld January 24, 2026 21:30
@WillLillis WillLillis marked this pull request as ready for review January 24, 2026 21:34
@WillLillis
Member Author

WillLillis commented Jan 27, 2026

Looking into this further, I think we may not be sacrificing any functionality with this change. If I'm reading the documentation correctly for the various wint_t functions in wctype.c (for example, iswupper), the unicode paths are only hit if setlocale was previously called (i.e. setlocale(LC_ALL, "en_US.utf8")). We don't export setlocale for wasm builds, so it actually seems very reasonable to treat these code paths as unreachable and just handle the ASCII cases as I've done in this PR.
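One way to check this claim against a given libc is a small probe like the following. It is a sketch only: `is_upper_in_locale` is a hypothetical helper, and whether the result for a non-ASCII code point changes with the locale depends on the libc being tested.

```c
#include <locale.h>
#include <stddef.h>
#include <wctype.h>

/* Hypothetical helper for experimentation, not part of the stdlib.
 * Switches the global locale to locale_name, then classifies c.
 * Returns 1 if iswupper(c) is nonzero, 0 if it is zero, and -1 if the
 * requested locale could not be set. Note that this changes the
 * process-wide locale as a side effect. */
int is_upper_in_locale(const char *locale_name, wint_t c) {
  if (setlocale(LC_ALL, locale_name) == NULL) {
    return -1;
  }
  return iswupper(c) != 0;
}
```

Comparing `is_upper_in_locale("C", 0x00C9)` with `is_upper_in_locale("en_US.utf8", 0x00C9)` on the target libc would show whether the Unicode path really requires a prior `setlocale` call there.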

These re-implementations will need to be tested further before I think this is ok to merge, but I believe the previous unicode concerns I raised are actually a non-issue.

A quick GH search shows that no parsers use setlocale at all, besides tree-sitter-haskell and tree-sitter-synquid. For both of these parsers, the calls to setlocale are gated behind #ifdef TREE_SITTER_DEBUG, so they're not actually used outside of development.

@maxbrunsfeld
Contributor

Oh, I actually thought those functions would use UTF-8 character categories by default. I think it would be better to keep the unicode support compiled in, and look into setting the locale somehow. I think those functions are used in some scanners where UTF8 support is intended.

- Use wasi-defined wide character functions for full unicode support
- Depend on custom implementations for `string.h` functions
@WillLillis
Member Author

> Oh, I actually thought those functions would use UTF-8 character categories by default. I think it would be better to keep the unicode support compiled in, and look into setting the locale somehow. I think those functions are used in some scanners where UTF8 support is intended.

Gotcha. In that case, the recent additions to wctype.c should probably be moved to wctype.h as static inlines so that the wasi implementations are used. This will grow the stdlib size by about 4k, to 18,984 bytes. Using our own implementations for the functions in string.c brings the size back down to 15,559 bytes. As long as those minimal impls are sufficient, that seems like a reasonable tradeoff.
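To give a sense of what "minimal impls" for `string.h` functions can mean, here is a rough sketch of two such functions. Which functions string.c actually contains is not stated here, so the choice of `strlen` and `strncmp`, and the `ts_min_` names, are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical names: the ts_min_ prefix is an assumption for this
 * sketch, used to avoid clashing with the libc symbols. */

/* Count the bytes before the terminating NUL. */
size_t ts_min_strlen(const char *s) {
  const char *p = s;
  while (*p) {
    p++;
  }
  return (size_t)(p - s);
}

/* Compare at most n bytes as unsigned chars, stopping at a NUL.
 * Returns a negative, zero, or positive value like strncmp. */
int ts_min_strncmp(const char *a, const char *b, size_t n) {
  for (; n > 0; a++, b++, n--) {
    unsigned char ca = (unsigned char)*a;
    unsigned char cb = (unsigned char)*b;
    if (ca != cb) {
      return ca < cb ? -1 : 1;
    }
    if (ca == '\0') {
      return 0;
    }
  }
  return 0;
}
```

Byte-at-a-time loops like these trade speed for size, which matches the size-over-throughput tradeoff being weighed in this thread.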

Since we don't export setlocale, the default C locale should always be in use by wasm parsers. Given this, maybe we can use a pared down version of wasi's casemap table and its surrounding functionality to get some more size savings since we don't need to support other locales. Just out of curiosity, I tried adding setlocale to the export list (stdlibsymbols.txt) and ran into some odd duplicate symbol errors:

```
Failed to compile the Tree-sitter Wasm stdlib:
wasm-ld: error: duplicate symbol: malloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: free
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: calloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)

wasm-ld: error: duplicate symbol: realloc
>>> defined in /tmp/stdlib-3eba72.o
>>> defined in /home/lillis/.cache/tree-sitter/wasi-sdk/bin/../share/wasi-sysroot/lib/wasm32-wasi/libc.a(dlmalloc.o)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```

I also tested a Rust wasm32-unknown-unknown build via the repro from #5158, and that use case is still supported with these changes.

@WillLillis WillLillis changed the title from "wasm: greatly reduce stdlib size" to "wasm: correct unicode support in stdlib" Jan 27, 2026