Added granite_v4.py to support 8b model#497
Conversation
Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@gmail.com>
kaoutar55
left a comment
There was a problem hiding this comment.
Thanks @rzbhatti for adding granite_v4 and wiring GraniteMoeHybrid* into FMS. This is the missing piece we need to start enabling Granite 4.0 8B in FMS and on AIU.
this is a good first enabling step. I have a few comments around configuration correctness (alignment with HF), future MoE support, code reuse, and tests.
| "bamba", | ||
| "gpt_bigcode", | ||
| "granite", | ||
| "granite_v4", |
There was a problem hiding this comment.
We have:
Architecture name: granite_v4
Class names: GraniteMoeHybrid*
HF architecture: GraniteMoeHybridForCausalLM
This creates confusion. Consider either:
Renaming classes to GraniteV4* to match the file/architecture name
Or explain in this PR why "MoeHybrid" is used for a dense model
There was a problem hiding this comment.
This is done deliberately to ensure consistency, as all Granite 4.0 models use a single "superset architecture" name GraniteMoeHybridForCausalLM, for simplified model management during training and loading, even for those variants (like the non-Hybrid/non-MoE models) where the name is technically a superset of the functional architecture.
https://huggingface.co/collections/ibm-granite/granite-40-language-models
The HF transformer implementation here uses the granitemoehybrid. If we do not want to associate this architecture with version 4, a better move could be to use granitemoehybrid everywhere.
|
|
||
|
|
||
| @dataclass | ||
| class GraniteMoeHybridConfig(ModelConfig): |
There was a problem hiding this comment.
If this is supposed to support MoE models (even if 8B is dense), the config lacks.
num_experts
num_experts_per_tok
router_aux_loss_coef
Expert-specific parameters
We should at least add a comment to mention that this will be addressed in a future PR or clarity that the scope of this PR is to support only the the dense 8B; MoE variants will be added later.
| attention_multiplier=0.0078125, | ||
| ) | ||
|
|
||
| _3_1_2b_config = GraniteMoeHybridConfig( |
There was a problem hiding this comment.
_3_1_2b_config is defined but not registered.
Register it: models.register_model(_architecture_name, "3.1-2b", ...)
or remove it from this PR
| "FMS model implementations currently only support LlamaForCausalLM, GPTBigCodeForCausalLM, MixtralForCausalLM, RobertaForMaskedLM, RobertaForQuestionAnswering, RobertaForSequenceClassification, GraniteForCausalLM, MistralForCausalLM, BambaForCausalLM, SiglipModel, LlavaNextForConditionalGeneration, MPNetForMaskedLM, BertForMaskedLM, and BertForSequenceClassification" | ||
| "FMS model implementations currently only support LlamaForCausalLM, GPTBigCodeForCausalLM, MixtralForCausalLM, RobertaForMaskedLM, RobertaForQuestionAnswering, RobertaForSequenceClassification, GraniteForCausalLM, GraniteMoeHybridForCausalLM, MistralForCausalLM, BambaForCausalLM, SiglipModel, LlavaNextForConditionalGeneration, MPNetForMaskedLM, BertForMaskedLM, and BertForSequenceClassification" | ||
| ) | ||
|
|
There was a problem hiding this comment.
The PR should include:
- Unit tests for the new model architecture
- HF weight loading tests
- Output verification tests
There was a problem hiding this comment.
Please add docstrings for classes and key methods
Comments explaining Granite 4.0-specific features (embedding_multiplier, residual_multiplier, attention_multiplier, logits_scaling)
| return preds | ||
|
|
||
|
|
||
| _8b_config = GraniteMoeHybridConfig( |
There was a problem hiding this comment.
Can confirm that _8b_config matches the actual HF config for ibm-granite/granite-4.0-8b (or 8b-base)
Right now, some values (notably src_vocab_size=49155, pad_id=0, and max_expected_seq_len=8192) look more like the older “tiny-preview” tokenizer / config than the final Granite 4.0 configs, which generally use a 100k vocab and long contexts. I suggest to populate _8b_config directly from HF config
Use config.vocab_size and config.pad_token_id instead of hard-coding the 49k/0 pair unless this is the behavior we want here.
| ) | ||
|
|
||
| # hf -> fms requires a transpose operation for the query and key | ||
| # weight and bias parameters for Llama models |
There was a problem hiding this comment.
Since this is now applied to Granite 4 as well, it might be clearer to make that comment architecture-agnostic or explicitly say “Llama/Granite”.
| config, "head_dim", config.hidden_size // config.num_attention_heads | ||
| ) | ||
| elif architecture == "GraniteMoeHybridForCausalLM": | ||
| inner_dim = config.intermediate_size |
There was a problem hiding this comment.
inner_dim is computed and never used. We can use it to sanity-check or derive hidden_grow_factor from intermediate_size:
hidden_grow_factor = config.intermediate_size / config.hidden_size
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
| activation_fn=str_to_activation(self.config.activation_fn), | ||
| p_dropout=self.config.p_dropout, | ||
| use_bias=False, # Granite 4 does not define MLP bias | ||
| fused=True, # Granite 4 comes with fused weights |
There was a problem hiding this comment.
I don't think we have to default to fused, this can be handled through an adapter
There was a problem hiding this comment.
Can you please provide more details on what kind of adapter you are referring here?
JRosenkranz
left a comment
There was a problem hiding this comment.
Can we add a test for correctness for this model against hf outputs?
…comment Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
| for pattern, repl in replacements: | ||
| new_name = re.sub(pattern, repl, new_name) | ||
|
|
||
| if "shared_mlp.input_linear.weight" in new_name: |
There was a problem hiding this comment.
Implemented this fused logic to let the GraniteBlock accept both fused and unfused weight.
While we could handle this with wg1_fused and allow FMS to work with fused weights, vllm-spyre actually sets fused_weights=False, which creates a problem with mixed configuration.
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
5f0401d to
3a77ba6
Compare
* Added granite_v4.py to support 8b model Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@gmail.com> * ♻️✅ Refactor for granite_v4 to inherit from granite and add unit tests Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * 🚚 Rename granite_v4 to granite_moe_hybrid Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * 🎨 Fix linting issues Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * 🎨 Fix ruff warning Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * ♻️🐛 Fix glu unit fused weight processing Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * ♻️ Move fused weights logic outside of GraniteBlock class per review comment Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * 📝 Add comment for unfusing, fused weight logic for naming function Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> * 🐛 Fix initialization issue with GraniteMoe Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> --------- Signed-off-by: Rashed Z. Bhatti, PhD <rzbhatti@gmail.com> Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com> Co-authored-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Adds
granite_v4.pyto support Granite4 family of models. This PR addresses only the granite4-8b model for now.Mamba layers are not implemented yet in this PR.
Basic test run with AFTU: