Tekken tokenizer class added to support the Mistral models by rzbhatti · Pull Request #434 · foundation-model-stack/foundation-model-stack

rzbhatti · 2025-06-27T00:15:41Z

This PR adds Tekken tokenizer class support for the mistralai/Devstral-Small-2505 model.
It is split out from #427 to address this review comment:
#427 (review)

We might want to split up the tokenizer and model implementation into 2 separate PRs

Closes #433

Minimal Test

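from fms.utils import tokenizers

# Local model directory; the warning below shows it contains tekken.json
MODEL_PATH = "/mnt/aiu-models-en-shared/models/mistralai/Devstral-Small-2505"
# Loading should succeed and is expected to emit the FutureWarning noted below
tokenizer = tokenizers.get_tokenizer(MODEL_PATH)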

Note:

It is normal to get the following warning message.

mistral_common/tokens/tokenizers/tekken.py:240: FutureWarning: Special tokens not found in /mnt/aiu-models-en-shared/models/mistralai/Devstral-Small-2505/tekken.json and default to ({'rank': 0, 'token_str': <SpecialTokens.unk: '<unk>'>, 'is_control': True}, {'rank': 1, 'token_str': <SpecialTokens.bos: '<s>'>, 'is_control': True}, {'rank': 2, 'token_str': <SpecialTokens.eos: '</s>'>, 'is_control': True}, {'rank': 3, 'token_str': <SpecialTokens.begin_inst: '[INST]'>, 'is_control': True}, {'rank': 4, 'token_str': <SpecialTokens.end_inst: '[/INST]'>, 'is_control': True}, {'rank': 5, 'token_str': <SpecialTokens.begin_tools: '[AVAILABLE_TOOLS]'>, 'is_control': True}, {'rank': 6, 'token_str': <SpecialTokens.end_tools: '[/AVAILABLE_TOOLS]'>, 'is_control': True}, {'rank': 7, 'token_str': <SpecialTokens.begin_tool_results: '[TOOL_RESULTS]'>, 'is_control': True}, {'rank': 8, 'token_str': <SpecialTokens.end_tool_results: '[/TOOL_RESULTS]'>, 'is_control': True}, {'rank': 9, 'token_str': <SpecialTokens.tool_calls: '[TOOL_CALLS]'>, 'is_control': True}, {'rank': 10, 'token_str': <SpecialTokens.img: '[IMG]'>, 'is_control': True}, {'rank': 11, 'token_str': <SpecialTokens.pad: '<pad>'>, 'is_control': True}, {'rank': 12, 'token_str': <SpecialTokens.img_break: '[IMG_BREAK]'>, 'is_control': True}, {'rank': 13, 'token_str': <SpecialTokens.img_end: '[IMG_END]'>, 'is_control': True}, {'rank': 14, 'token_str': <SpecialTokens.prefix: '[PREFIX]'>, 'is_control': True}, {'rank': 15, 'token_str': <SpecialTokens.middle: '[MIDDLE]'>, 'is_control': True}, {'rank': 16, 'token_str': <SpecialTokens.suffix: '[SUFFIX]'>, 'is_control': True}, {'rank': 17, 'token_str': <SpecialTokens.begin_system: '[SYSTEM_PROMPT]'>, 'is_control': True}, {'rank': 18, 'token_str': <SpecialTokens.end_system: '[/SYSTEM_PROMPT]'>, 'is_control': True}, {'rank': 19, 'token_str': <SpecialTokens.begin_tool_content: '[TOOL_CONTENT]'>, 'is_control': True}). This behavior will be deprecated going forward. 
Please update your tokenizer file and include all special tokens you need.
  warnings.warn(
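
If the warning is too noisy in scripts, one option is to filter it around the load call. This is a minimal sketch using only Python's standard warnings module. load_tokenizer_stub is a hypothetical stand-in that emits a similar FutureWarning so the block runs without the model files; in practice it would be replaced by tokenizers.get_tokenizer. Updating tekken.json as the warning suggests is the longer-term fix.

```python
import warnings


def load_tokenizer_stub(path: str) -> str:
    """Hypothetical stand-in for tokenizers.get_tokenizer(path)."""
    warnings.warn(
        f"Special tokens not found in {path}/tekken.json and default to (...)",
        FutureWarning,
    )
    return f"tokenizer:{path}"


def load_quietly(path: str) -> str:
    """Load while ignoring only the 'Special tokens not found' FutureWarning."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore",
            message="Special tokens not found",
            category=FutureWarning,
        )
        return load_tokenizer_stub(path)
```

The filter is scoped to the with block, so other FutureWarnings raised elsewhere in the program are still shown.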

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

kaoutar55

Other tokenizer classes in FMS expose vocab_size, so _TekkenTokenizer should support it as well.

kaoutar55 · 2025-07-02T14:58:08Z

+            List of string tokens
+        """
+        warnings.warn(
+            "this method will be deprecated in future versions, this will be a lot more inefficient than encode, directly use encode(text: str) -> List[int] methed instead",


Fix typo: methed → method

kaoutar55 · 2025-07-02T14:58:43Z

+                return self.ids
+        except Exception as e:
+            raise RuntimeError(
+                f"Misrtral tokenizer error: convert_tokens_to_ids() must be used in tandum with tokenize() Error: {type(e).__name__} occurred: {e}"


Fix typo:
Misrtral tokenizer → Mistral tokenizer
in tandum → in tandem

kaoutar55

No tests are included in the PR. I think we should add some unit tests covering:

Load config
Load tekken tokenizer
Encode/decode round-trip
Special token handling
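
A sketch of what the round-trip and special-token checks could look like. ToyTokenizer is a hypothetical byte-level stand-in, not the _TekkenTokenizer from this PR, and its bos_id/eos_id attributes are assumptions made for illustration. The loading tests are not shown because they need the model files on disk.

```python
from typing import List


class ToyTokenizer:
    """Hypothetical byte-level stand-in; ids 0 and 1 are reserved for BOS/EOS."""

    bos_id = 0
    eos_id = 1
    _offset = 2  # shift raw byte values past the reserved special ids

    def encode(self, text: str) -> List[int]:
        body = [b + self._offset for b in text.encode("utf-8")]
        return [self.bos_id] + body + [self.eos_id]

    def decode(self, ids: List[int]) -> str:
        # Special ids are dropped, similar to a "skip special tokens" policy.
        raw = bytes(i - self._offset for i in ids if i >= self._offset)
        return raw.decode("utf-8")

    def vocab_size(self) -> int:
        return 256 + self._offset


def check_round_trip(tokenizer, text: str) -> None:
    """decode(encode(text)) returns the text, and all ids fit the vocabulary."""
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids) == text
    assert all(0 <= i < tokenizer.vocab_size() for i in ids)


def check_special_tokens(tokenizer, text: str) -> None:
    """The encoded sequence is wrapped in BOS and EOS ids."""
    ids = tokenizer.encode(text)
    assert ids[0] == tokenizer.bos_id
    assert ids[-1] == tokenizer.eos_id
```

In a real test module the two check functions would be called with the tokenizer returned by tokenizers.get_tokenizer, and the special-token check would be adapted to however that tokenizer actually adds BOS/EOS.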

kaoutar55 · 2025-07-02T15:16:22Z

+        return list(map(self.tokenizer.decode, [[i] for i in ids]))
+
+    def convert_tokens_to_string(self, tokens: list[str]) -> str:
+        """Conver list of string tokens


typo: Conver → Convert

ani300 · 2025-07-02T15:50:38Z

+        """Conver list of string tokens
+
+        Args:
+            tokens (list[str]): _description_


describe the inputs, don't leave the stubs

ani300 · 2025-07-02T15:51:08Z

+            List of token strings
+        """
+        warnings.warn(
+            "this method will be deprecated in future versions, this will be a lot more inefficient, use decode( token_ids: List[int]) -> str methed instead",


typo: "decode( token_ids" → "decode(token_ids" (remove the space after the opening parenthesis)

ani300

This is almost ready. I think we need to add encode/decode to the base class so we can use it in the inference scripts, and fix a bunch of typos.
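
To illustrate the request, here is a rough sketch of encode/decode declared on an abstract base class so that calling code depends only on that interface. SketchBaseTokenizer, WhitespaceTokenizer, and generate_echo are hypothetical names for illustration; the actual FMS base class and its signatures may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchBaseTokenizer(ABC):
    """Hypothetical base class; the real FMS base class may differ."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Turn text into a list of token ids."""

    @abstractmethod
    def decode(self, token_ids: List[int]) -> str:
        """Turn a list of token ids back into text."""


class WhitespaceTokenizer(SketchBaseTokenizer):
    """Toy subclass: one id per distinct whitespace-separated word."""

    def __init__(self) -> None:
        self._word_to_id: Dict[str, int] = {}
        self._id_to_word: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._word_to_id:
                # Assign the next free id to a word seen for the first time.
                self._word_to_id[word] = len(self._id_to_word)
                self._id_to_word.append(word)
            ids.append(self._word_to_id[word])
        return ids

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self._id_to_word[i] for i in token_ids)


def generate_echo(tokenizer: SketchBaseTokenizer, prompt: str) -> str:
    """Stand-in for an inference script: only encode/decode are needed."""
    return tokenizer.decode(tokenizer.encode(prompt))
```

With the methods on the base class, a script written against SketchBaseTokenizer works unchanged with any subclass, which is the benefit the comment is after.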

rzbhatti · 2025-07-02T17:02:39Z

Other tokenizer classes in FMS expose vocab_size, so _TekkenTokenizer should support it as well.

Added a method: def vocab_size(self) -> int

ani300 · 2025-07-02T19:01:52Z

+    def decode(
+        self,
+        token_ids: List[int],
+        stp=0,  # SpecialTokenPolicy = SpecialTokenPolicy.IGNORE,


This parameter does not match the decode interface. Can you make that interface more generic so it accepts an int instead of a boolean? Then you can also modify the HFTokenizer signature and transform the int into what the boolean would be for HF tokenizers.
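
One way to read this suggestion, sketched under assumptions: decode takes an int policy, and the Hugging Face wrapper converts it to the skip_special_tokens boolean that HF tokenizers accept. SketchPolicy, policy_to_skip_flag, and SketchHFWrapper are hypothetical names; only the value 0 for "ignore" comes from the diff above, and the merged code may have settled on a different signature.

```python
from enum import IntEnum
from typing import List


class SketchPolicy(IntEnum):
    """Hypothetical policy values; only IGNORE = 0 is taken from the diff."""

    IGNORE = 0  # drop special tokens from the decoded text
    KEEP = 1  # assumed value: keep special tokens in the decoded text


def policy_to_skip_flag(policy: int) -> bool:
    """Map the int policy to the boolean an HF-style decode expects."""
    return SketchPolicy(policy) is SketchPolicy.IGNORE


class SketchHFWrapper:
    """Toy wrapper with a tiny fixed vocabulary, standing in for HFTokenizer."""

    SPECIAL = {0: "<s>", 1: "</s>"}
    VOCAB = {2: "hello", 3: "world"}

    def _hf_decode(self, token_ids: List[int], skip_special_tokens: bool) -> str:
        # Imitates the boolean-flag style of decode used by HF tokenizers.
        pieces = []
        for i in token_ids:
            if i in self.SPECIAL:
                if not skip_special_tokens:
                    pieces.append(self.SPECIAL[i])
            else:
                pieces.append(self.VOCAB[i])
        return " ".join(pieces)

    def decode(self, token_ids: List[int], policy: int = 0) -> str:
        # Generic int-based interface; the bool conversion stays internal.
        flag = policy_to_skip_flag(policy)
        return self._hf_decode(token_ids, skip_special_tokens=flag)
```

An unknown policy value raises ValueError from the enum lookup, which surfaces mismatches between backends early instead of silently picking a default.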

ani300

lgtm!

Tekken tokenizer class added to support the Mistral models

156955a

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

rzbhatti linked an issue Jun 27, 2025 that may be closed by this pull request

Add support for the Mistral Tekken Tokenizer. #433

Closed

Rashed Z. Bhatti, PhD added 3 commits June 27, 2025 00:23

ruff format tokenizers.py, mypy --exclude hf --exclude testing fms/ut…

248a2e0

…ils/tokenizers.py Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

added a mypy # type: ignore for conditional import mistral_common

5bea612

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

ruff format

fe4f197

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

kaoutar55 requested review from JRosenkranz, ani300 and kaoutar55 July 2, 2025 14:55

kaoutar55 reviewed Jul 2, 2025

kaoutar55 self-requested a review July 2, 2025 15:18

ani300 reviewed Jul 2, 2025

Comment thread fms/utils/tokenizers.py

ani300 reviewed Jul 2, 2025

added encode and decode to base class, and fixed typos

ac226f2

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

rzbhatti mentioned this pull request Jul 2, 2025

Add unit tests for the tekken tokenizer class #439

Open

ani300 reviewed Jul 2, 2025

Matched decode interface with that of the transformers

3f2b0e9

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

ani300 approved these changes Jul 3, 2025

ani300 merged commit 8fbe465 into main Jul 3, 2025
4 checks passed

ani300 merged 6 commits into main from mistral-tekken-tokenizer

3 participants
