feat: allow the external scanner to use the internal character ranges #4864

amaanq · 2025-09-22T00:03:36Z

Closes Use the internal regexp in external scanners #2620

Problem

Currently, grammars with complex tokens cannot make use of the character sets generated in the parser. This causes friction: users have to either copy the character set into the scanner or give up on matching the entirety of the complex token.

Solution

Now, tree-sitter will generate helper functions that can be used in a scanner to match against these complex tokens. For every token with a "complex" character set, tree-sitter will generate a function of the form matches_sym_{name}. For tokens with multiple distinct groups (e.g. [a-zA-Z][a-zA-Z0-9]* consists of two groups), tree-sitter will also generate a function for each distinct group, in case you only want to match a specific part of the token. A more detailed explanation, along with an example, can be found in the docs changes.
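For readers skimming the thread, here is a rough sketch of how per-group helpers for [a-zA-Z][a-zA-Z0-9]* might be used. The function names, the numeric group suffixes, and the hand-written bodies are assumptions for illustration, not the generator's actual output. A string cursor stands in for the lexer so the snippet compiles without tree-sitter headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins for the generated helpers. The names and the
// numeric group suffixes are assumptions made for this sketch only.

// Group 1 of [a-zA-Z][a-zA-Z0-9]*: the leading character.
static inline bool matches_sym_identifier_1(int32_t lookahead) {
  return (lookahead >= 'a' && lookahead <= 'z') ||
         (lookahead >= 'A' && lookahead <= 'Z');
}

// Group 2: any character after the first.
static inline bool matches_sym_identifier_2(int32_t lookahead) {
  return matches_sym_identifier_1(lookahead) ||
         (lookahead >= '0' && lookahead <= '9');
}

// Returns the length of the identifier at the start of `input`, or 0 if
// there is none. A real external scanner would read lexer->lookahead and
// call lexer->advance instead of indexing into a string.
static size_t scan_identifier(const char *input) {
  if (!matches_sym_identifier_1((unsigned char)input[0])) return 0;
  size_t length = 1;
  // The terminating NUL matches neither group, so the loop stops there.
  while (matches_sym_identifier_2((unsigned char)input[length])) length++;
  return length;
}
```

The point of the per-group functions is visible in `scan_identifier`: the first character and the remaining characters are checked against different sets, so a single whole-token predicate would not be enough.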

ObserverOfTime

Can the functions be added to a new header instead?

amaanq · 2025-09-23T03:09:28Z

Can the functions be added to a new header instead?

They can, but what for? They're defined in parser.c, so declaring them in parser.h seems correct.

maxbrunsfeld · 2025-09-23T03:31:12Z

What is the linkage on these symbols? If they are not static, then they must be prefixed with ‘tree_sitter_$GRAMMAR_NAME’ to avoid name collisions. It would be nice to not export these unless the grammar author opts in somehow.

WillLillis · 2025-09-23T03:20:56Z

crates/generate/src/generate.rs

    grammar_json: &str,
    semantic_version: Option<(u8, u8, u8)>,
-) -> GenerateResult<(String, String)> {
+) -> GenerateResult<(String, String, String)> {


It's borderline, but I think it would be better for readability to have a struct with named fields here rather than 3 Strings in a tuple.

WillLillis · 2025-09-23T03:24:52Z

crates/generate/src/render.rs

+        let mut symbols_with_char_sets = self
+            .symbol_to_character_sets
+            .iter()
+            .filter(|(_, sets)| !sets.is_empty())


Is it possible for these Vecs to ever be empty? It looks like they'll always have at least one entry.

WillLillis · 2025-09-23T03:29:26Z

crates/generate/src/render.rs

+                .collect::<Vec<_>>();
+
+            if !used_char_sets.is_empty() {
+                if !header_added {


Couldn't this just be if self.header_buffer.is_empty()? I think you had that in a previous push.

WillLillis · 2025-09-23T03:35:27Z

crates/generate/src/render.rs

+                        // Match function for this character set
+
+                        writeln!(
+                        self.header_buffer,


rustfmt died, please save it 😆

WillLillis · 2025-09-23T03:43:45Z

docs/src/creating-parsers/4-external-scanners.md

+If your grammar contains complex patterns, Tree-sitter automatically generates helper functions that your external scanner
+can use to check if a character matches those character sets. This avoids the need to reimplement logic for matching complex
+tokens in your scanner. These matching functions are added to the `parser.h` header file. The function names follow the pattern
+`matches_sym_<symbol_name>(int32_t lookahead)`.


Include the bool return type here, and the distinct character set group list below?

maxbrunsfeld · 2025-09-23T04:09:15Z

Thinking about this more - I feel like this API needs a little work. The character sets that tree-sitter generates are based on some fairly arbitrary heuristics for what helps compile time. This is fine for an internal optimization, but not suitable for a public API. It needs to be a bit more specified and under the grammar author’s control, IMO.

Also, the fact that a symbol can contain multiple character sets, and they just get defined with these number suffixes, that again seems like an implementation detail, but not clear enough to be a stable API.

ObserverOfTime · 2025-09-23T04:50:08Z

They can, but what for? They're defined in parser.c, so declaring them in parser.h seems correct.

parser.h is the bread-and-butter header. I'd like to keep it minimal and identical for all grammars.

Much like the array API, these functions won't be needed by most authors, so they should live in a separate header.

ObserverOfTime · 2025-09-23T09:49:30Z

What is the linkage on these symbols? If they are not static, then they must be prefixed with ‘tree_sitter_$GRAMMAR_NAME’ to avoid name collisions. It would be nice to not export these unless the grammar author opts in somehow.

I agree. We should declare them with a prefix and either as static inline or __attribute__((visibility("hidden"))) (on GNU). For ease of use, we can define a macro like:

#define matches(sym, lookahead) tree_sitter_foo_matches_sym_##sym(lookahead)
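A minimal sketch of how that suggestion could fit together, assuming a grammar named foo and a hypothetical symbol called digit. The helper body is written by hand here and is not generator output.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper for a grammar named "foo". Declaring it static inline
// gives it internal linkage, so it is not exported from the compiled parser
// and cannot collide with helpers from other grammars in the same binary.
static inline bool tree_sitter_foo_matches_sym_digit(int32_t lookahead) {
  return lookahead >= '0' && lookahead <= '9';
}

// Convenience macro from the comment above: pastes the symbol name onto the
// prefixed function name so scanner code can stay short.
#define matches(sym, lookahead) tree_sitter_foo_matches_sym_##sym(lookahead)

// Example call site in a scanner; expands to
// tree_sitter_foo_matches_sym_digit(c).
static inline bool is_digit_char(int32_t c) { return matches(digit, c); }
```

With this shape the prefix only has to be unique per grammar, and the macro hides it from scanner authors. Whether static inline or hidden visibility is the better choice is left open in the discussion.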

amaanq · 2025-09-24T20:17:16Z

Thinking about this more - I feel like this API needs a little work. The character sets that tree-sitter generates are based on some fairly arbitrary heuristics for what helps compile time. This is fine for an internal optimization, but not suitable for a public API. It needs to be a bit more specified and under the grammar author’s control, IMO.

Also, the fact that a symbol can contain multiple character sets, and they just get defined with these number suffixes, that again seems like an implementation detail, but not clear enough to be a stable API.

Should we instead separate the internal character sets and instead expose a field in the grammar.js for users to explicitly pass in terminal rules that should be available in the scanner?

amaanq force-pushed the character-sets-scanner branch 2 times, most recently from 736fde2 to f8e5f46 Compare September 22, 2025 00:26

amaanq and others added 3 commits September 21, 2025 20:57

feat: allow the external scanner to use the internal character ranges

ea40fc1

test: add test leveraging internal character ranges in scanner

7bcd07d

docs: document the character set functions

5446abf

amaanq force-pushed the character-sets-scanner branch from f8e5f46 to 5446abf Compare September 22, 2025 00:57

ObserverOfTime reviewed Sep 22, 2025


WillLillis approved these changes Sep 23, 2025

