Tekken tokenizer class added to support the Mistral models#434

Merged
ani300 merged 6 commits into main from mistral-tekken-tokenizer on Jul 3, 2025

@rzbhatti

Contributor

This PR adds a Tekken tokenizer class to support the mistralai/Devstral-Small-2505 model.
It is split from #427 to address this review comment (#427 (review)):

"We might want to split up the tokenizer and model implementation into 2 separate PRs"

Closes #433

Minimal Test

from fms.utils import tokenizers
MODEL_PATH="/mnt/aiu-models-en-shared/models/mistralai/Devstral-Small-2505"
tokenizer = tokenizers.get_tokenizer(MODEL_PATH)

Note:

It is normal to get the following warning message.

mistral_common/tokens/tokenizers/tekken.py:240: FutureWarning: Special tokens not found in /mnt/aiu-models-en-shared/models/mistralai/Devstral-Small-2505/tekken.json and default to ({'rank': 0, 'token_str': <SpecialTokens.unk: '<unk>'>, 'is_control': True}, {'rank': 1, 'token_str': <SpecialTokens.bos: '<s>'>, 'is_control': True}, {'rank': 2, 'token_str': <SpecialTokens.eos: '</s>'>, 'is_control': True}, {'rank': 3, 'token_str': <SpecialTokens.begin_inst: '[INST]'>, 'is_control': True}, {'rank': 4, 'token_str': <SpecialTokens.end_inst: '[/INST]'>, 'is_control': True}, {'rank': 5, 'token_str': <SpecialTokens.begin_tools: '[AVAILABLE_TOOLS]'>, 'is_control': True}, {'rank': 6, 'token_str': <SpecialTokens.end_tools: '[/AVAILABLE_TOOLS]'>, 'is_control': True}, {'rank': 7, 'token_str': <SpecialTokens.begin_tool_results: '[TOOL_RESULTS]'>, 'is_control': True}, {'rank': 8, 'token_str': <SpecialTokens.end_tool_results: '[/TOOL_RESULTS]'>, 'is_control': True}, {'rank': 9, 'token_str': <SpecialTokens.tool_calls: '[TOOL_CALLS]'>, 'is_control': True}, {'rank': 10, 'token_str': <SpecialTokens.img: '[IMG]'>, 'is_control': True}, {'rank': 11, 'token_str': <SpecialTokens.pad: '<pad>'>, 'is_control': True}, {'rank': 12, 'token_str': <SpecialTokens.img_break: '[IMG_BREAK]'>, 'is_control': True}, {'rank': 13, 'token_str': <SpecialTokens.img_end: '[IMG_END]'>, 'is_control': True}, {'rank': 14, 'token_str': <SpecialTokens.prefix: '[PREFIX]'>, 'is_control': True}, {'rank': 15, 'token_str': <SpecialTokens.middle: '[MIDDLE]'>, 'is_control': True}, {'rank': 16, 'token_str': <SpecialTokens.suffix: '[SUFFIX]'>, 'is_control': True}, {'rank': 17, 'token_str': <SpecialTokens.begin_system: '[SYSTEM_PROMPT]'>, 'is_control': True}, {'rank': 18, 'token_str': <SpecialTokens.end_system: '[/SYSTEM_PROMPT]'>, 'is_control': True}, {'rank': 19, 'token_str': <SpecialTokens.begin_tool_content: '[TOOL_CONTENT]'>, 'is_control': True}). This behavior will be deprecated going forward. 
Please update your tokenizer file and include all special tokens you need.
  warnings.warn(
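
If the warning is too noisy in logs, one option is to filter it with the standard warnings module before loading the tokenizer. The following is a minimal sketch and not part of this PR: the helper name is hypothetical, and it assumes the message keeps the "Special tokens not found" prefix shown above.

```python
import warnings


def ignore_tekken_special_token_warning() -> None:
    """Hypothetical helper: silence only the FutureWarning quoted above.

    The `message` argument is a regex matched against the start of the
    warning text, so unrelated FutureWarnings still surface.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Special tokens not found",
        category=FutureWarning,
    )
```

Calling the helper before tokenizers.get_tokenizer(MODEL_PATH) should be enough, since the filter is installed process-wide.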

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
@rzbhatti rzbhatti linked an issue Jun 27, 2025 that may be closed by this pull request
Rashed Z. Bhatti, PhD added 3 commits June 27, 2025 00:23
…ils/tokenizers.py

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

@kaoutar55 kaoutar55 left a comment

Collaborator

Other tokenizer classes in FMS expose vocab_size, so _TekkenTokenizer should support it as well.

Comment thread fms/utils/tokenizers.py Outdated
List of string tokens
"""
warnings.warn(
"this method will be deprecated in future versions, this will be a lot more inefficient than encode, directly use encode(text: str) -> List[int] methed instead",

Collaborator

Fix typo: methed → method
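
For reference, a deprecation notice of this kind could be sketched as below. The snippet under review calls warnings.warn with only a message; passing DeprecationWarning and stacklevel=2 is an assumption made here for illustration, and ToyTokenizer is a hypothetical stand-in rather than the class from this PR.

```python
import warnings
from typing import List


class ToyTokenizer:
    """Hypothetical stand-in tokenizer that splits on whitespace."""

    def encode(self, text: str) -> List[int]:
        # Toy scheme: each token id is simply the length of the word.
        return [len(word) for word in text.split()]

    def tokenize(self, text: str) -> List[str]:
        warnings.warn(
            "tokenize() will be deprecated in future versions; "
            "use encode(text: str) -> List[int] instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return text.split()
```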

Comment thread fms/utils/tokenizers.py Outdated
return self.ids
except Exception as e:
raise RuntimeError(
f"Misrtral tokenizer error: convert_tokens_to_ids() must be used in tandum with tokenize() Error: {type(e).__name__} occurred: {e}"

Collaborator

Fix typo:
Misrtral tokenizer → Mistral tokenizer
in tandum → in tandem

@kaoutar55 kaoutar55 left a comment

Collaborator

No tests included in the PR. I think we should add some unit tests to:

  • Load config
  • Load tekken tokenizer
  • Encode/decode round-trip
  • Special token handling
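
The last two items in the list above could take roughly the following shape. This is only a sketch: ToyByteTokenizer is a hypothetical stand-in so the tests run without model files, and a real test would load the Tekken tokenizer through tokenizers.get_tokenizer instead. The ids 0 to 2 for the special tokens mirror the first ranks listed in the warning above.

```python
from typing import List


class ToyByteTokenizer:
    """Hypothetical stand-in; a real test would load the Tekken tokenizer."""

    SPECIALS = ["<unk>", "<s>", "</s>"]  # ids 0..2

    def __init__(self) -> None:
        self.bos_token_id = 1
        self.eos_token_id = 2
        self._offset = len(self.SPECIALS)

    def vocab_size(self) -> int:
        return self._offset + 256

    def encode(self, text: str, add_bos: bool = False) -> List[int]:
        # Each UTF-8 byte becomes one id, shifted past the special tokens.
        ids = [b + self._offset for b in text.encode("utf-8")]
        return ([self.bos_token_id] if add_bos else []) + ids

    def decode(self, ids: List[int]) -> str:
        # Special tokens are dropped on decode.
        raw = bytes(i - self._offset for i in ids if i >= self._offset)
        return raw.decode("utf-8")


def test_round_trip() -> None:
    tok = ToyByteTokenizer()
    text = "def f(x): return x  # naïve"
    assert tok.decode(tok.encode(text)) == text


def test_special_tokens() -> None:
    tok = ToyByteTokenizer()
    ids = tok.encode("hi", add_bos=True)
    assert ids[0] == tok.bos_token_id
    assert tok.decode(ids) == "hi"
    assert all(0 <= i < tok.vocab_size() for i in ids)
```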

Comment thread fms/utils/tokenizers.py Outdated
return list(map(self.tokenizer.decode, [[i] for i in ids]))

def convert_tokens_to_string(self, tokens: list[str]) -> str:
"""Conver list of string tokens

Collaborator

Typo: Conver → Convert

@kaoutar55 kaoutar55 self-requested a review July 2, 2025 15:18
Comment thread fms/utils/tokenizers.py Outdated
"""Conver list of string tokens

Args:
tokens (list[str]): _description_

Collaborator

Describe the inputs; don't leave the _description_ stubs in the docstring.

Comment thread fms/utils/tokenizers.py Outdated
List of token strings
"""
warnings.warn(
"this method will be deprecated in future versions, this will be a lot more inefficient, use decode( token_ids: List[int]) -> str methed instead",

Collaborator

Typo: "decode( token_ids" should be "decode(token_ids" (remove the extra space).

Comment thread fms/utils/tokenizers.py

@ani300 ani300 left a comment

Collaborator

This is almost ready. I think we need to add encode/decode to the base class so we can use them in the inference scripts, and fix a bunch of typos.
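
One way to read this request is sketched below. The class and function names are hypothetical and are not the FMS API; the point is only that an inference script can depend on encode/decode through the base class without knowing which tokenizer it was given.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizerSketch(ABC):
    """Hypothetical base class; names are illustrative, not the FMS API."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        ...

    @abstractmethod
    def decode(self, token_ids: List[int]) -> str:
        ...


class CharTokenizer(BaseTokenizerSketch):
    """Toy subclass mapping characters to Unicode code points."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, token_ids: List[int]) -> str:
        return "".join(chr(i) for i in token_ids)


def round_trip(tokenizer: BaseTokenizerSketch, prompt: str) -> str:
    """Stands in for an inference script that only knows the base class."""
    return tokenizer.decode(tokenizer.encode(prompt))
```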

@rzbhatti

rzbhatti commented Jul 2, 2025

Contributor Author

"Other tokenizer classes in FMS expose vocab_size. _TekkenTokenizer should support vocab_size as well."

Added a method, def vocab_size(self) -> int.

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>
Comment thread fms/utils/tokenizers.py Outdated
def decode(
self,
token_ids: List[int],
stp=0, # SpecialTokenPolicy = SpecialTokenPolicy.IGNORE,

Collaborator

This parameter does not match the decode interface. Can you make that interface more generic so it accepts an int instead of a boolean? Then you can also modify the HFTokenizer signature and transform the int into what the boolean would be for HF Tokenizers.
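
The suggested int-to-boolean translation could be sketched as follows. Everything here is hypothetical except two details: 0 standing for the IGNORE policy comes from the snippet above, and skip_special_tokens is the boolean flag that HF-style decode methods take. The value 1 for KEEP is assumed for illustration.

```python
from enum import IntEnum
from typing import List


class SpecialTokenPolicySketch(IntEnum):
    """Local stand-in; only IGNORE = 0 is taken from the snippet above."""

    IGNORE = 0
    KEEP = 1  # assumed value for illustration


class HFLikeBackend:
    """Toy backend with a boolean flag, as HF-style decode methods take."""

    SPECIAL = {0: "<s>"}
    VOCAB = {1: "hello", 2: "world"}

    def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str:
        parts = []
        for i in token_ids:
            if i in self.SPECIAL:
                if not skip_special_tokens:
                    parts.append(self.SPECIAL[i])
            else:
                parts.append(self.VOCAB[i])
        return " ".join(parts)


class HFWrapperSketch:
    """Hypothetical wrapper exposing the generic int-based decode interface."""

    def __init__(self, backend: HFLikeBackend) -> None:
        self.backend = backend

    def decode(self, token_ids: List[int], special_token_policy: int = 0) -> str:
        # Translate the generic int into the boolean the backend expects.
        skip = special_token_policy == SpecialTokenPolicySketch.IGNORE
        return self.backend.decode(token_ids, skip_special_tokens=skip)
```

With this shape, callers pass the same int to every tokenizer wrapper, and only the HF wrapper needs to know about the boolean.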

Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@us.ibm.com>

@ani300 ani300 left a comment

Collaborator

lgtm!

@ani300 ani300 merged commit 8fbe465 into main Jul 3, 2025
4 checks passed

Development

Successfully merging this pull request may close these issues.

Add support for the Mistral Tekken Tokenizer.

3 participants