pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5

Module Contents

class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5PreTrainedModel[source]

Bases: pyiqa.archs.modeling_utils.PreTrainedModel

class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5VisionModel(config, *inputs, **kwargs)[source]

Bases: Qwen3_5PreTrainedModel

rot_pos_emb(grid_thw: torch.Tensor) torch.Tensor[source]
fast_pos_embed_interpolate(grid_thw)[source]
forward(hidden_states: torch.Tensor, grid_thw: torch.Tensor, **kwargs) torch.Tensor[source]
Parameters:
  • hidden_states (torch.Tensor of shape (seq_len, hidden_size)) – The final hidden states of the model.

  • grid_thw (torch.Tensor of shape (num_images_or_videos, 3)) – The temporal, height, and width dimensions of the feature grid for each image or video in the LLM.

Returns:

hidden_states.

Return type:

torch.Tensor
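
The relationship between grid_thw and seq_len is not spelled out in this reference. The sketch below assumes the convention used by the Qwen2-VL family of vision encoders, where each row (t, h, w) contributes t * h * w patch rows to hidden_states. Plain Python tuples stand in for tensors, and the grid values and the helper name patches_per_item are hypothetical.

```python
from math import prod


def patches_per_item(grid_thw):
    """Return t * h * w for each (t, h, w) row of a grid description."""
    return [prod(row) for row in grid_thw]


# Hypothetical input: one image (t=1) and one two-frame video (t=2).
grid_thw = [(1, 4, 6), (2, 4, 4)]

counts = patches_per_item(grid_thw)
# Assumed convention: seq_len is the total number of patches over all items.
seq_len = sum(counts)
print(counts, seq_len)  # [24, 32] 56
```

Under that assumption, a hidden_states tensor passed together with this grid_thw would have 56 rows.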

class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5TextModel(config: pyiqa.archs.qrealign.qwen3_5_src.configuration_qwen3_5.Qwen3_5TextConfig)[source]

Bases: Qwen3_5PreTrainedModel

forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) pyiqa.archs.modeling_outputs.BaseModelOutputWithPast[source]
class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5Model(config)[source]

Bases: Qwen3_5PreTrainedModel

get_input_embeddings()[source]
set_input_embeddings(value)[source]
get_rope_index(input_ids: torch.LongTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, **kwargs) tuple[torch.Tensor, torch.Tensor][source]

Unlike the original implementation, Qwen3_5 uses timestamps rather than absolute time position ids.

get_video_features(pixel_values_videos: torch.FloatTensor, video_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling[source]
Parameters:
  • pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input videos.

  • video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid for each video in the LLM.

get_image_features(pixel_values: torch.FloatTensor, image_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling[source]
Parameters:
  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input images.

  • image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid for each image in the LLM.

get_placeholder_mask(input_ids: torch.LongTensor, inputs_embeds: torch.FloatTensor, image_features: torch.FloatTensor | None = None, video_features: torch.FloatTensor | None = None)[source]

Obtains the multimodal placeholder mask from input_ids or inputs_embeds and checks that the number of placeholder tokens equals the length of the multimodal features. If the two differ, an error is raised.
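
The check described above can be illustrated with a minimal sketch. This is not the library's implementation: plain lists stand in for tensors, only the input_ids path is shown, and IMAGE_TOKEN_ID, the helper name, and the choice of ValueError are all assumptions made for illustration.

```python
# Hypothetical placeholder id; the real value would come from the model config.
IMAGE_TOKEN_ID = 9


def placeholder_mask(input_ids, num_feature_rows, token_id=IMAGE_TOKEN_ID):
    """Mark placeholder positions and verify they match the feature count."""
    mask = [tok == token_id for tok in input_ids]
    num_placeholders = sum(mask)
    if num_placeholders != num_feature_rows:
        # The exception type here is an assumption of this sketch.
        raise ValueError(
            f"placeholder tokens ({num_placeholders}) do not match "
            f"multimodal feature rows ({num_feature_rows})"
        )
    return mask


# Three placeholder tokens line up with three feature rows.
mask = placeholder_mask([1, 9, 9, 9, 2], num_feature_rows=3)
print(mask)  # [False, True, True, True, False]
```

A mask like this is what lets the image or video features be written into the embedding sequence at exactly the placeholder positions.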

compute_3d_position_ids(input_ids: torch.Tensor | None, inputs_embeds: torch.Tensor | None, image_grid_thw: torch.Tensor | None = None, video_grid_thw: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, past_key_values: torch.Tensor | None = None) torch.Tensor | None[source]
forward(input_ids: torch.LongTensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, pixel_values: torch.Tensor | None = None, pixel_values_videos: torch.FloatTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, cache_position: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | Qwen3_5ModelOutputWithPast[source]
Parameters:
  • image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid for each image in the LLM.

  • video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid for each video in the LLM.

class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5ForCausalLM(config)[source]

Bases: Qwen3_5PreTrainedModel, pyiqa.archs.generation.GenerationMixin

forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) pyiqa.archs.modeling_outputs.CausalLMOutputWithPast[source]
Parameters:
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, …, config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is computed only for tokens with labels in [0, …, config.vocab_size].
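
The effect of the -100 label value can be sketched in plain Python. This is an illustration only, not the model's loss code: it omits any next-token shifting the model may apply, uses nested lists in place of tensors, and the helper name masked_cross_entropy is hypothetical. The value -100 matches the default ignore_index of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        peak = max(row)
        # Numerically stable log-sum-exp of the logits at this position.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    # Returning 0.0 when every position is ignored is a choice of this sketch.
    return total / count if count else 0.0


logits = [[2.0, 0.0], [0.0, 0.0]]  # two positions, vocabulary of size 2
labels = [0, IGNORE_INDEX]         # the second position is masked out
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Only the first position contributes here, so changing the logits at the masked position leaves the loss unchanged.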

Example:

```python
>>> from transformers import AutoTokenizer, Qwen3_5ForCausalLM

>>> model = Qwen3_5ForCausalLM.from_pretrained("Qwen/Qwen3_5-8B")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3_5-8B")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```
class pyiqa.archs.qrealign.qwen3_5_src.modeling_qwen3_5.Qwen3_5ForConditionalGeneration(config)[source]

Bases: Qwen3_5PreTrainedModel, pyiqa.archs.generation.GenerationMixin

get_input_embeddings()[source]
set_input_embeddings(value)[source]
get_video_features(pixel_values_videos: torch.FloatTensor, video_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling[source]
Parameters:
  • pixel_values_videos (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input videos.

  • video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid for each video in the LLM.

get_image_features(pixel_values: torch.FloatTensor, image_grid_thw: torch.LongTensor | None = None, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | pyiqa.archs.modeling_outputs.BaseModelOutputWithPooling[source]
Parameters:
  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) – The tensors corresponding to the input images.

  • image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid for each image in the LLM.

forward(input_ids: torch.LongTensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: pyiqa.archs.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, pixel_values: torch.Tensor | None = None, pixel_values_videos: torch.FloatTensor | None = None, image_grid_thw: torch.LongTensor | None = None, video_grid_thw: torch.LongTensor | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: pyiqa.archs.processing_utils.Unpack[pyiqa.archs.utils.TransformersKwargs]) tuple | Qwen3_5CausalLMOutputWithPast[source]
Parameters:
  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, …, config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is computed only for tokens with labels in [0, …, config.vocab_size].

  • image_grid_thw (torch.LongTensor of shape (num_images, 3), optional) – The temporal, height, and width dimensions of the feature grid for each image in the LLM.

  • video_grid_thw (torch.LongTensor of shape (num_videos, 3), optional) – The temporal, height, and width dimensions of the feature grid for each video in the LLM.

Example:

```python
>>> from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration

>>> model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
>>> messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe the image."},
        ],
    }
]
>>> inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
>>> # Generate
>>> generated_ids = model.generate(**inputs, max_new_tokens=1024)
>>> generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
>>> output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
>>> print(output_text)
```
prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, cache_position=None, position_ids=None, use_cache=True, pixel_values=None, pixel_values_videos=None, image_grid_thw=None, video_grid_thw=None, is_first_iteration=False, **kwargs)[source]