@amaanq
Member

@amaanq amaanq commented Sep 22, 2025

Problem

Currently, grammars with complex tokens cannot make use of the character sets generated in the parser. This creates friction: users must either copy the character set into the scanner or give up on matching the entire complex token.

Solution

Now, tree-sitter will generate helper functions that can be used in a scanner to match against these complex tokens. For every "complex" token, tree-sitter will generate a function of the form matches_sym_{name}. For tokens with multiple distinct groups (e.g. [a-zA-Z][a-zA-Z0-9]* consists of two groups), tree-sitter will also generate a function for each distinct group, in case you only want to match a specific part of the token. A more detailed explanation, along with an example, can be found in the docs changes.
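
As a rough illustration of the idea, here is a minimal sketch of how scanner code might use per-group helpers for [a-zA-Z][a-zA-Z0-9]*. The helper bodies are hand-written stand-ins, and the names matches_sym_identifier_1 / matches_sym_identifier_2 (including the numeric suffixes) and scan_identifier are assumptions made for this example, not the generated output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for a generated helper (assumed name).
// Group 1 of [a-zA-Z][a-zA-Z0-9]*: the leading character.
static bool matches_sym_identifier_1(int32_t c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Hypothetical stand-in for a generated helper (assumed name).
// Group 2: any continuation character.
static bool matches_sym_identifier_2(int32_t c) {
  return matches_sym_identifier_1(c) || (c >= '0' && c <= '9');
}

// Illustrative scanner logic: returns the length of the identifier at the
// start of `text`, or 0 if `text` does not start with one.
static size_t scan_identifier(const int32_t *text, size_t len) {
  if (len == 0 || !matches_sym_identifier_1(text[0])) return 0;
  size_t i = 1;
  while (i < len && matches_sym_identifier_2(text[i])) i++;
  return i;
}
```

In a real external scanner, the loop would presumably read lexer->lookahead and call lexer->advance instead of indexing into an array.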

@amaanq force-pushed the character-sets-scanner branch 2 times, most recently from 736fde2 to f8e5f46 (September 22, 2025 00:26)
@amaanq force-pushed the character-sets-scanner branch from f8e5f46 to 5446abf (September 22, 2025 00:57)
Member

@ObserverOfTime ObserverOfTime left a comment

Can the functions be added to a new header instead?

@amaanq
Member Author

amaanq commented Sep 23, 2025

> Can the functions be added to a new header instead?

They can, but what for? They're defined in parser.c, so declaring them in parser.h seems correct.

@maxbrunsfeld
Contributor

What is the linkage on these symbols? If they are not static, then they must be prefixed with ‘tree_sitter_$GRAMMAR_NAME’ to avoid name collisions. It would be nice to not export these unless the grammar author opts in somehow.

  grammar_json: &str,
  semantic_version: Option<(u8, u8, u8)>,
- ) -> GenerateResult<(String, String)> {
+ ) -> GenerateResult<(String, String, String)> {
Member

It's borderline, but I think it would be better for readability to have a struct with named fields here rather than 3 Strings in a tuple.

let mut symbols_with_char_sets = self
.symbol_to_character_sets
.iter()
.filter(|(_, sets)| !sets.is_empty())
Member

Is it possible for these Vecs to ever be empty? It looks like they'll always have at least one entry.

.collect::<Vec<_>>();

if !used_char_sets.is_empty() {
if !header_added {
Member

Couldn't this just be if self.header_buffer.is_empty()? I think you had that in a previous push.

// Match function for this character set

writeln!(
self.header_buffer,
Member

rustfmt died, please save it 😆

If your grammar contains complex patterns, Tree-sitter automatically generates helper functions that your external scanner
can use to check if a character matches those character sets. This avoids the need to reimplement logic for matching complex
tokens in your scanner. These matching functions are added to the `parser.h` header file. The function names follow the pattern
`matches_sym_<symbol_name>(int32_t lookahead)`.
Member

Include the bool return type here, and the distinct character set group list below?

@maxbrunsfeld
Contributor

Thinking about this more - I feel like this API needs a little work. The character sets that tree-sitter generates are based on some fairly arbitrary heuristics for what helps compile time. This is fine for an internal optimization, but not suitable for a public API. It needs to be a bit more specified and under the grammar author’s control, IMO.

Also, the fact that a symbol can contain multiple character sets, and they just get defined with these number suffixes, that again seems like an implementation detail, but not clear enough to be a stable API.

@ObserverOfTime
Member

ObserverOfTime commented Sep 23, 2025

> They can, but what for? They're defined in parser.c, so declaring them in parser.h seems correct.

parser.h is the bread-and-butter header. I'd like to keep it minimal and identical for all grammars.

Much like the array API, these functions won't be needed by most authors, so they should live in a separate header.

@ObserverOfTime
Member

ObserverOfTime commented Sep 23, 2025

> What is the linkage on these symbols? If they are not static, then they must be prefixed with ‘tree_sitter_$GRAMMAR_NAME’ to avoid name collisions. It would be nice to not export these unless the grammar author opts in somehow.

I agree. We should declare them with a prefix and either as static inline or __attribute__((visibility("hidden"))) (on GNU). For ease of use, we can define a macro like:

#define matches(sym, lookahead) tree_sitter_foo_matches_sym_##sym(lookahead)
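
To make the suggestion concrete, here is a minimal sketch assuming a grammar named foo with a token called digit. The helper body is a hand-written stand-in rather than generated output, and count_leading_digits is a hypothetical caller added only for illustration. Declaring the helper static inline keeps it out of the exported symbols, and the macro hides the prefix at call sites.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical prefixed helper for a grammar named "foo" (assumed name and
// body). `static inline` gives it internal linkage, so it cannot collide
// with helpers from other grammars linked into the same binary.
static inline bool tree_sitter_foo_matches_sym_digit(int32_t lookahead) {
  return lookahead >= '0' && lookahead <= '9';
}

// Convenience macro from the comment above: pastes the symbol name onto
// the prefixed function name.
#define matches(sym, lookahead) tree_sitter_foo_matches_sym_##sym(lookahead)

// Hypothetical caller: counts how many leading characters are digits.
static inline int count_leading_digits(const int32_t *text, int len) {
  int n = 0;
  while (n < len && matches(digit, text[n])) n++;
  return n;
}
```

One trade-off of a single shared macro name is that a scanner including helpers from two grammars would need to pick different macro names.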

@amaanq
Member Author

amaanq commented Sep 24, 2025

> Thinking about this more - I feel like this API needs a little work. The character sets that tree-sitter generates are based on some fairly arbitrary heuristics for what helps compile time. This is fine for an internal optimization, but not suitable for a public API. It needs to be a bit more specified and under the grammar author’s control, IMO.
>
> Also, the fact that a symbol can contain multiple character sets, and they just get defined with these number suffixes, that again seems like an implementation detail, but not clear enough to be a stable API.

Should we instead keep the internal character sets separate and expose a field in grammar.js where users explicitly list the terminal rules that should be available in the scanner?

Development

Successfully merging this pull request may close these issues.

Use the internal regexp in external scanners

4 participants