Add vision-only mode #537
Signed-off-by: Gaurav-Kumbhat <Gaurav.Kumbhat@ibm.com>
Force-pushed from d14e05d to 2d6eee5.
flaviabeo left a comment:
Tested locally with mistralai/Ministral-3-3B-Instruct-2512; everything seems to be working:
Image size: (547, 365)
Embeddings shape: torch.Size([1, 834, 3072])
(batch=1, seq_len=834, hidden_dim=3072)
Let me know if there are any other models of interest that would be nice to test with this mode.
flaviabeo left a comment:
Testing with ibm-granite/granite-vision-3.2-2b, I am getting an error. I used almost the same script as for Mistral3, so maybe the issue is specific to granite-vision?
if not isinstance(original_image_size, (list, tuple)):
    original_size = original_image_size.tolist()

original_height, original_width = original_size
Testing with granite-vision, I got an error that traces to this line:
File "/home/flaviabeodebayser/fms/foundation-model-stack/fms/models/llava_next.py", line 270, in select_best_resolution
original_height, original_width = original_size
^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'original_size' where it is not associated with a value
INFO:torch._dynamo.eval_frame:TorchDynamo attempted to trace the following frames: [
]
INFO:torch._dynamo.utils:TorchDynamo compilation metrics:
Function, Runtimes (s)
This seems to be an existing bug that only surfaces in a specific scenario. I have pushed a fix now.
Signed-off-by: gkumbhat <kumbhat.gaurav@gmail.com>
flaviabeo left a comment:
Now granite-vision also works :)
LGTM!
Changes
This PR adds the ability to load only the vision part of a given multimodal model. This allows sendnn-inference to load the model's different parts in different ways for more efficient processing.
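As a rough illustration of what loading only the vision part can mean at the weight level, the sketch below filters a multimodal checkpoint's state dict down to the vision encoder and projector entries. The prefixes `vision_tower.` and `multi_modal_projector.` follow Hugging Face LLaVA-style naming and are assumptions here, as is the hypothetical `select_vision_weights` helper; the actual loading path added in this PR may work differently.

```python
from typing import Dict, Iterable, TypeVar

T = TypeVar("T")

# Assumed LLaVA-style key prefixes; real checkpoints may use other names.
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")


def select_vision_weights(
    state_dict: Dict[str, T], prefixes: Iterable[str] = VISION_PREFIXES
) -> Dict[str, T]:
    """Hypothetical helper: keep only entries whose key starts with a vision prefix."""
    prefix_tuple = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if name.startswith(prefix_tuple)
    }


# Toy checkpoint with placeholder strings standing in for tensors.
checkpoint = {
    "vision_tower.encoder.layers.0.weight": "v0",
    "multi_modal_projector.linear_1.weight": "p0",
    "language_model.model.layers.0.self_attn.q_proj.weight": "l0",
}
# Drops the language_model entry and keeps the two vision-side entries.
vision_only = select_vision_weights(checkpoint)
```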