Tekken tokenizer class added to support the Mistral models#434
Conversation
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
…ils/tokenizers.py Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
kaoutar55
left a comment
There was a problem hiding this comment.
Other tokenizer classes in FMS expose vocab_size. _TekkenTokenizer should support vocab_size as well.
If other tokenizers in FMS expose vocab_size, then _TekkenTokenizer should too.
| List of string tokens | ||
| """ | ||
| warnings.warn( | ||
| "this method will be deprecated in future versions, this will be a lot more inefficient than encode, directly use encode(text: str) -> List[int] methed instead", |
There was a problem hiding this comment.
Fix typo: methed → method
| return self.ids | ||
| except Exception as e: | ||
| raise RuntimeError( | ||
| f"Misrtral tokenizer error: convert_tokens_to_ids() must be used in tandum with tokenize() Error: {type(e).__name__} occurred: {e}" |
There was a problem hiding this comment.
Fix typo:
Misrtral tokenizer → Mistral tokenizer
in tandum → in tandem
kaoutar55
left a comment
There was a problem hiding this comment.
No tests included in the PR. I think we should add some unit tests to:
- Load config
- Load tekken tokenizer
- Encode/decode round-trip
- Special token handling
| return list(map(self.tokenizer.decode, [[i] for i in ids])) | ||
|
|
||
| def convert_tokens_to_string(self, tokens: list[str]) -> str: | ||
| """Conver list of string tokens |
| """Conver list of string tokens | ||
|
|
||
| Args: | ||
| tokens (list[str]): _description_ |
There was a problem hiding this comment.
describe the inputs, don't leave the stubs
| List of token strings | ||
| """ | ||
| warnings.warn( | ||
| "this method will be deprecated in future versions, this will be a lot more inefficient, use decode( token_ids: List[int]) -> str methed instead", |
ani300
left a comment
There was a problem hiding this comment.
This is almost ready, I think we need to add encode/decode to the base class so we can use it in the inference scripts, and fixing a bunch of typos
Added a method |
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
| def decode( | ||
| self, | ||
| token_ids: List[int], | ||
| stp=0, # SpecialTokenPolicy = SpecialTokenPolicy.IGNORE, |
There was a problem hiding this comment.
this parameter does not match the decode interface, can you make that interface more generic to accept an int instead of a boolean? then you can also modify the HFTokenizer signature and transform the int to what the boolean would be for HF Tokenizers.
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
This PR added Tekken tokenizer class support for the
mistralai/Devstral-Small-2505model.This is split from #427 to take care of the comment:
#427 (review)
Closes #433
Minimal Test
Note:
It is normal to get the following warning message.