pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5
Module Contents
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5PreTrainedModel
Bases:
pyiqa.archs.modeling_utils.PreTrainedModel
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5VisionModel(config, *inputs, **kwargs)
Bases:
Qwen3_5PreTrainedModel
- forward(hidden_states: torch.Tensor, grid_thw: torch.Tensor, **kwargs) -> torch.Tensor
- Parameters:
hidden_states (torch.Tensor of shape (seq_len, hidden_size)) – The final hidden states of the model.
grid_thw (torch.Tensor of shape (num_images_or_videos, 3)) – The temporal, height, and width dimensions of the feature grid of each image or video in the LLM.
- Returns:
hidden_states.
- Return type:
torch.Tensor
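The relationship between grid_thw and the seq_len dimension of hidden_states can be sketched in plain Python. This is a minimal illustration, assuming (as in the Qwen2-VL family of vision encoders, and not stated on this page) that each row (t, h, w) contributes t * h * w patches to the flattened sequence. The helper names are hypothetical and are not part of this module.

```python
from itertools import accumulate


def patches_per_item(grid_thw):
    """Return the assumed patch count t * h * w for each (t, h, w) row."""
    return [t * h * w for t, h, w in grid_thw]


def cumulative_boundaries(grid_thw):
    """Return the offsets that delimit each item in the flattened sequence."""
    return [0] + list(accumulate(patches_per_item(grid_thw)))


# One image with a 4 x 6 grid and one two-frame video with a 2 x 2 grid.
grid_thw = [(1, 4, 6), (2, 2, 2)]
counts = patches_per_item(grid_thw)
bounds = cumulative_boundaries(grid_thw)
seq_len = bounds[-1]  # total rows expected in hidden_states under this assumption
```

Under this assumption, item i occupies rows bounds[i] to bounds[i + 1] of hidden_states.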
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5TextModel(config: pyiqa.archs.qrealign.qwen3_5_src.configuration_qwen3_5.Qwen3_5TextConfig)
Bases:
Qwen3_5PreTrainedModel
- forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> pyiqa.archs.modeling_outputs.BaseModelOutputWithPast
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5Model(config)
Bases:
Qwen3_5PreTrainedModel
- get_rope_index(input_ids: torch.LongTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, **kwargs) -> tuple[torch.Tensor, torch.Tensor]
Unlike the original implementation, Qwen3_5 uses timestamps rather than absolute time position ids.
- get_video_features(pixel_values_videos: torch.FloatTensor, video_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling
- Parameters:
pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input videos.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid of each video in the LLM.
- get_image_features(pixel_values: torch.FloatTensor, image_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling
- Parameters:
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input images.
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid of each image in the LLM.
- get_placeholder_mask(input_ids: torch.LongTensor, inputs_embeds: torch.FloatTensor, image_features: torch.FloatTensor | None = None, video_features: torch.FloatTensor | None = None)
Obtains multimodal placeholder mask from input_ids or inputs_embeds, and checks that the placeholder token count is equal to the length of multimodal features. If the lengths are different, an error is raised.
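The consistency check described above can be sketched with plain Python lists. This is an illustrative sketch only, not the module's implementation: the placeholder id 99 and the function name are hypothetical, and the real method works on tensors and also handles inputs_embeds.

```python
IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, not the real config value


def placeholder_mask(input_ids, num_features, token_id=IMAGE_TOKEN_ID):
    """Mark placeholder positions and check that they match the feature count."""
    mask = [tok == token_id for tok in input_ids]
    n_placeholders = sum(mask)
    if n_placeholders != num_features:
        raise ValueError(
            f"placeholder tokens ({n_placeholders}) and multimodal "
            f"features ({num_features}) do not match"
        )
    return mask


# Three placeholder tokens must be matched by exactly three feature vectors.
ids = [1, 99, 99, 99, 2]
mask = placeholder_mask(ids, num_features=3)
```

Passing num_features=2 for the same ids raises ValueError, which mirrors the error described for mismatched lengths.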
- compute_3d_position_ids(input_ids: torch.Tensor | None, inputs_embeds: torch.Tensor | None, image_grid_thw: torch.Tensor | None = None, video_grid_thw: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, past_key_values: torch.Tensor | None = None) -> torch.Tensor | None
- forward(input_ids: torch.LongTensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, pixel_values: torch.Tensor | None = None, pixel_values_videos: torch.FloatTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, cache_position: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | Qwen3_5ModelOutputWithPast
- Parameters:
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid of each image in the LLM.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid of each video in the LLM.
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5ForCausalLM(config)
Bases:
Qwen3_5PreTrainedModel, pyiqa.archs.generation.GenerationMixin
- forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> pyiqa.archs.modeling_outputs.CausalLMOutputWithPast
- Parameters:
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the language modeling loss. Indices should either be in [0, …, config.vocab_size - 1] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is computed only for tokens with labels in [0, …, config.vocab_size - 1].
Example:
```python
>>> from transformers import AutoTokenizer, Qwen3_5ForCausalLM

>>> model = Qwen3_5ForCausalLM.from_pretrained("Qwen/Qwen3_5-8B")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3_5-8B")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```
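The effect of setting a label to -100 can be sketched without any framework. The snippet below is a minimal pure-Python illustration of a cross-entropy loss that skips ignored positions. It is not the model's loss function: the function name is hypothetical, it omits the next-token shift that a causal language modeling loss usually applies, and returning 0.0 when every label is ignored is a choice made for this sketch.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def cross_entropy_ignoring(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses) if losses else 0.0


# Two positions over a two-token vocabulary; the first position is masked.
logits = [[0.0, 0.0], [2.0, 0.0]]
labels = [IGNORE_INDEX, 0]
loss = cross_entropy_ignoring(logits, labels)

# Changing the logits of the masked position leaves the loss unchanged.
loss_perturbed = cross_entropy_ignoring([[5.0, -5.0], [2.0, 0.0]], labels)
```

Only the second position contributes, so the loss depends solely on its logits and label.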
- class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5ForConditionalGeneration(config)
Bases:
Qwen3_5PreTrainedModel, pyiqa.archs.generation.GenerationMixin
- get_video_features(pixel_values_videos: torch.FloatTensor, video_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling
- Parameters:
pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input videos.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid of each video in the LLM.
- get_image_features(pixel_values: torch.FloatTensor, image_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling
- Parameters:
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input images.
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid of each image in the LLM.
- forward(input_ids: torch.LongTensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, pixel_values: torch.Tensor | None = None, pixel_values_videos: torch.FloatTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) -> tuple | Qwen3_5CausalLMOutputWithPast
- Parameters:
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the language modeling loss. Indices should either be in [0, …, config.vocab_size - 1] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is computed only for tokens with labels in [0, …, config.vocab_size - 1].
image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid of each image in the LLM.
video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid of each video in the LLM.
Example:
```python
>>> from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

>>> model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
...             },
...             {"type": "text", "text": "Describe the image."},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(
...     messages,
...     tokenize=True,
...     add_generation_prompt=True,
...     return_dict=True,
...     return_tensors="pt"
... )

>>> # Generate
>>> generated_ids = model.generate(**inputs, max_new_tokens=1024)
>>> generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
>>> output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
>>> print(output_text)
```
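The trimming step in the example above removes the prompt tokens from each generated sequence before decoding. It can be illustrated with plain lists, assuming that each generated sequence begins with its (possibly padded) prompt followed by the new tokens. The function name and token values below are hypothetical.

```python
def trim_prompts(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Each output repeats its prompt and then appends the newly generated tokens.
prompts = [[5, 6, 7], [8, 9]]
outputs = [[5, 6, 7, 1, 2], [8, 9, 3]]
new_tokens = trim_prompts(prompts, outputs)
```

Only the continuation of each sequence remains, so the decoded text does not repeat the prompt.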