-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat: allow the external scanner to use the internal character ranges #4864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
736fde2 to
f8e5f46
Compare
f8e5f46 to
5446abf
Compare
ObserverOfTime
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the functions be added to a new header instead?
They can, but what for? They're defined in |
|
What is the linkage on these symbols? If they are not static, then they must be prefixed with ‘tree_sitter_$GRAMMAR_NAME’ to avoid name collisions. It would be nice to not export these unless the grammar author opts in somehow. |
| grammar_json: &str, | ||
| semantic_version: Option<(u8, u8, u8)>, | ||
| ) -> GenerateResult<(String, String)> { | ||
| ) -> GenerateResult<(String, String, String)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's borderline, but I think it would be better for readability to have a struct with named fields here rather than 3 Strings in a tuple.
| let mut symbols_with_char_sets = self | ||
| .symbol_to_character_sets | ||
| .iter() | ||
| .filter(|(_, sets)| !sets.is_empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for these Vecs to ever be empty? It looks like they'll always have at least one entry.
| .collect::<Vec<_>>(); | ||
|
|
||
| if !used_char_sets.is_empty() { | ||
| if !header_added { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this just be if self.header_buffer.is_empty()? I think you had that in a previous push.
| // Match function for this character set | ||
|
|
||
| writeln!( | ||
| self.header_buffer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rustfmt died, please save it 😆
| If your grammar contains complex patterns, Tree-sitter automatically generates helper functions that your external scanner | ||
| can use to check if a character matches those character sets. This avoids the need to reimplement logic for matching complex | ||
| tokens in your scanner. These matching functions are added to the `parser.h` header file. The function names follow the pattern | ||
| `matches_sym_<symbol_name>(int32_t lookahead)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include the bool return type here, and the distinct character set group list below?
|
Thinking about this more - I feel like this API needs a little work. The character sets that tree-sitter generates are based on some fairly arbitrary heuristics for what helps compile time. This is fine for an internal optimization, but not suitable for a public API. It needs to be a bit more specified and under the grammar author’s control, IMO. Also, the fact that a symbol can contain multiple character sets, and they just get defined with these number suffixes, that again seems like an implementation detail, but not clear enough to be a stable API. |
Much like the |
I agree. We should declare them with a prefix and either as #define matches(sym, lookahead) tree_sitter_foo_matches_sym_##sym(lookahead) |
Should we instead separate the internal character sets and instead expose a field in the grammar.js for users to explicitly pass in terminal rules that should be available in the scanner? |
Problem
Currently, grammars with complex tokens cannot make use of the character sets generated in the parser. This causes friction, where now users have to either copy the character set in the scanner or give up on matching the entirety of the complex token.
Solution
Now, tree-sitter will generate helper functions that can be used in a scanner to match against these complex tokens. For every token with a "complex" token, tree-sitter will generate a function in the form of
matches_sym_{name}. For tokens with multiple distinct groups (e.g.[a-zA-Z][a-zA-Z0-9]*consists of two groups), tree-sitter will now generate a function that matches each distinct group, in case you only want to match a specific part of the token. A more detailed explanation along with an example can be found in the docs changes.