pyiqa.archs.q_align.visual_encoder¶

Module Contents¶

pyiqa.archs.q_align.visual_encoder.find_pruneable_heads_and_indices(heads, n_heads, head_size, already_pruned_heads)[source]¶: Compatibility fallback for Transformers>=5 where this helper was removed.
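The arguments suggest the same contract as the helper that older Transformers releases provided: drop the heads that were already pruned, then report which flat positions of the projection weights to keep. The sketch below illustrates that logic in plain Python, under the assumption that n_heads counts the heads still present. The function name is hypothetical, and the pyiqa fallback itself may differ in detail (for example, by returning a tensor index rather than a list).

```python
def find_pruneable_heads_and_indices_sketch(heads, n_heads, head_size, already_pruned_heads):
    """Return (heads still to prune, flat indices of the weights to keep)."""
    already_pruned_heads = set(already_pruned_heads)
    # Heads pruned in an earlier call need no further work.
    heads = set(heads) - already_pruned_heads
    # keep[h][j] stays True while column j of remaining head h is kept.
    keep = [[True] * head_size for _ in range(n_heads)]
    for head in heads:
        # Earlier prunings shifted later heads down; compensate for that.
        shifted = head - sum(1 for h in already_pruned_heads if h < head)
        keep[shifted] = [False] * head_size
    flat = [flag for row in keep for flag in row]
    index = [i for i, flag in enumerate(flat) if flag]
    return heads, index


# Prune head 1 out of 3 heads of size 2: positions 2 and 3 are dropped.
print(find_pruneable_heads_and_indices_sketch([1], 3, 2, set()))
# ({1}, [0, 1, 4, 5])
```

The returned index is what a caller would use to slice the query, key, value, and output projections.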

pyiqa.archs.q_align.visual_encoder.get_abs_pos(abs_pos, tgt_size)[source]¶

pyiqa.archs.q_align.visual_encoder.get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False)[source]¶

Parameters:

grid_size (int) – height and width of the (square) grid.

Returns:

pos_embed of shape [grid_size*grid_size, embed_dim] without a class token, or [1+grid_size*grid_size, embed_dim] with cls_token.

pyiqa.archs.q_align.visual_encoder.get_2d_sincos_pos_embed_from_grid(embed_dim, grid)[source]¶

pyiqa.archs.q_align.visual_encoder.get_1d_sincos_pos_embed_from_grid(embed_dim, pos)[source]¶

Parameters:

embed_dim – output dimension D for each position.
pos – positions to be encoded, of size (M,).

Returns:

Embedding of shape (M, D).
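The shape contracts of the sine-cosine helpers can be illustrated with a dependency-free sketch. It assumes the common construction in which half of each vector holds sines and the other half cosines over a geometric frequency ladder, and in which the 2D table concatenates two 1D encodings of half the width. The names sincos_1d and sincos_2d are hypothetical, and the ordering of the two grid axes is an assumption; the library functions may order them differently.

```python
import math


def sincos_1d(embed_dim, positions):
    """Encode M positions as an M x embed_dim table (sine half, then cosine half)."""
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be even")
    half = embed_dim // 2
    # Frequencies fall geometrically from 1 towards 1/10000.
    omega = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    table = []
    for p in positions:
        angles = [p * w for w in omega]
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table


def sincos_2d(embed_dim, grid_size, cls_token=False):
    """Build a (grid_size*grid_size [+1]) x embed_dim table for a square grid."""
    # Row-major coordinates of every grid cell.
    rows = [r for r in range(grid_size) for _ in range(grid_size)]
    cols = [c for _ in range(grid_size) for c in range(grid_size)]
    # Each axis receives half of the embedding width.
    emb_cols = sincos_1d(embed_dim // 2, cols)
    emb_rows = sincos_1d(embed_dim // 2, rows)
    table = [a + b for a, b in zip(emb_cols, emb_rows)]
    if cls_token:
        # The class token gets an all-zero row at the front.
        table = [[0.0] * embed_dim] + table
    return table


print(len(sincos_2d(8, 2)), len(sincos_2d(8, 2, cls_token=True)))  # 4 5
```

Because each axis is encoded with embed_dim // 2 channels, this sketch requires embed_dim to be divisible by 4.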

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEmbeddings(config)[source]¶

Bases: torch.nn.Module

forward(pixel_values: torch.FloatTensor) → torch.Tensor[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionAttention(config)[source]¶

Bases: torch.nn.Module

Multi-headed attention from the ‘Attention Is All You Need’ paper.

forward(hidden_states: torch.Tensor, head_mask: torch.Tensor | None = None, output_attentions: bool | None = False) → Tuple[torch.Tensor, torch.Tensor | None, Tuple[torch.Tensor] | None][source]¶: Input shape: Batch x Time x Channel
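For readers unfamiliar with the operation, the following is a minimal single-head sketch of scaled dot-product attention on nested lists. It omits the learned query, key, value, and output projections, the head split, and head_mask, so it shows the core computation only and is not the class's implementation. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are Time x Channel lists."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]]))
```

A multi-headed layer runs this computation once per head on a slice of the channel dimension and concatenates the results.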

class pyiqa.archs.q_align.visual_encoder.QuickGELU[source]¶

Bases: torch.nn.Module

forward(x: torch.Tensor)[source]¶
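The name QuickGELU conventionally refers to the sigmoid approximation of GELU, x * sigmoid(1.702 * x). Assuming this class follows that convention, its forward pass can be sketched for a scalar as follows; the function name is hypothetical.

```python
import math


def quick_gelu(x):
    """Scalar sketch of x * sigmoid(1.702 * x); assumes moderately sized inputs."""
    return x / (1.0 + math.exp(-1.702 * x))


# Near-identity for large positive inputs, near zero for large negative ones.
for value in (-10.0, 0.0, 10.0):
    print(value, quick_gelu(value))
```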

class pyiqa.archs.q_align.visual_encoder.MplugOwlMLP(config)[source]¶

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor) → torch.Tensor[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoderLayer(config)[source]¶

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool | None = False) → Tuple[torch.FloatTensor][source]¶

Parameters:

hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)
attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
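The attention_mask convention described above works because the mask is added to the attention scores before the softmax. The sketch below shows the effect on a single row of scores; the constant NEG and the function name are illustrative assumptions, not values taken from the library.

```python
import math

NEG = -1e9  # stand-in for "a very large negative value"


def masked_softmax(scores, additive_mask):
    """Softmax over scores after adding a mask of 0.0 (keep) or NEG (padding)."""
    shifted = [s + m for s, m in zip(scores, additive_mask)]
    top = max(shifted)
    exps = [math.exp(v - top) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The third position is padding, so it receives zero attention weight.
print(masked_softmax([0.5, 0.5, 0.5], [0.0, 0.0, NEG]))  # [0.5, 0.5, 0.0]
```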

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoder(config)[source]¶

Bases: torch.nn.Module

Transformer encoder consisting of config.num_hidden_layers self-attention layers. Each layer is a MplugOwlVisionEncoderLayer.

Parameters:

config (MplugOwlVisionConfig) – The corresponding vision configuration for the MplugOwlVisionEncoder.

Parameters:

inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Embedded representation of the inputs. Should be float, not int tokens.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) – Whether or not to return a transformers.utils.ModelOutput instead of a plain tuple.
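How output_hidden_states and return_dict shape the result can be sketched with a toy encoder loop. The layers here are plain callables, attention outputs and masks are left out, and a dict stands in for the ModelOutput object, so this is an illustration of the flags under those assumptions rather than the class's forward method.

```python
def run_encoder(inputs_embeds, layers, output_hidden_states=False, return_dict=True):
    """Apply layers in order, optionally recording the state before each layer."""
    hidden = inputs_embeds
    all_hidden = [] if output_hidden_states else None
    for layer in layers:
        if output_hidden_states:
            all_hidden.append(hidden)
        hidden = layer(hidden)
    if output_hidden_states:
        # The final state is recorded too, giving len(layers) + 1 entries.
        all_hidden = tuple(all_hidden + [hidden])
    if not return_dict:
        # Plain-tuple form: outputs that were not requested are dropped.
        return tuple(v for v in (hidden, all_hidden) if v is not None)
    return {"last_hidden_state": hidden, "hidden_states": all_hidden}


toy_layers = [lambda x: x + 1, lambda x: x * 2]
print(run_encoder(1, toy_layers, output_hidden_states=True))
# {'last_hidden_state': 4, 'hidden_states': (1, 2, 4)}
print(run_encoder(1, toy_layers, return_dict=False))
# (4,)
```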

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionModel(config)[source]¶

Bases: transformers.modeling_utils.PreTrainedModel

forward(pixel_values: torch.FloatTensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.BaseModelOutputWithPooling[source]¶

get_input_embeddings()[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMLP(config)[source]¶

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor) → torch.Tensor[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMultiHeadAttention(config)[source]¶

Bases: torch.nn.Module

save_attn_gradients(attn_gradients)[source]¶

get_attn_gradients()[source]¶

save_attention_map(attention_map)[source]¶

get_attention_map()[source]¶

transpose_for_scores(x)[source]¶

forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_value=None, output_attentions=False)[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorCrossOutput(config)[source]¶

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor, input_tensor: torch.Tensor) → torch.Tensor[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorAttention(config)[source]¶

Bases: torch.nn.Module

prune_heads(heads)[source]¶

forward(hidden_states: torch.Tensor, attention_mask: torch.FloatTensor | None = None, head_mask: torch.FloatTensor | None = None, encoder_hidden_states: torch.FloatTensor | None = None, encoder_attention_mask: torch.FloatTensor | None = None, past_key_value: Tuple[Tuple[torch.FloatTensor]] | None = None, output_attentions: bool | None = False) → Tuple[torch.Tensor][source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorLayer(config, layer_idx)[source]¶

Bases: torch.nn.Module

forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, output_attentions=False)[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorEncoder(config)[source]¶

Bases: torch.nn.Module

forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=False, output_hidden_states=False, return_dict=True)[source]¶

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorModel(config, language_hidden_size)[source]¶

Bases: transformers.modeling_utils.PreTrainedModel

get_head_mask(head_mask, num_hidden_layers, is_attention_chunked=False)[source]¶: Compatibility helper for Transformers>=5 where PreTrainedModel.get_head_mask was removed.
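The helper this method replaces normalised head_mask into one entry per layer. A simplified sketch of that behaviour on nested lists is shown below. It assumes the upstream semantics (a missing mask becomes a list of None, a per-head mask is repeated for every layer) and ignores the tensor broadcasting dimensions and the is_attention_chunked flag that the real method handles. The function name is hypothetical.

```python
def get_head_mask_sketch(head_mask, num_hidden_layers):
    """Return one head mask per layer (or None per layer when no mask is given)."""
    if head_mask is None:
        return [None] * num_hidden_layers
    if not isinstance(head_mask[0], (list, tuple)):
        # A single per-head mask applies to every layer.
        return [list(head_mask) for _ in range(num_hidden_layers)]
    if len(head_mask) != num_hidden_layers:
        raise ValueError("expected one mask row per layer")
    return [list(row) for row in head_mask]


print(get_head_mask_sketch(None, 3))    # [None, None, None]
print(get_head_mask_sketch([1, 0], 2))  # [[1, 0], [1, 0]]
```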

get_extended_attention_mask(attention_mask: torch.Tensor, input_shape: Tuple[int], device: torch.device) → torch.Tensor[source]¶

Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

Parameters:

attention_mask (torch.Tensor) – Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
input_shape (Tuple[int]) – The shape of the input to the model.
device (torch.device) – The device of the input to the model.

Returns:

torch.Tensor – The extended attention mask, with the same dtype as attention_mask.
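The padding half of this conversion can be sketched on nested lists: a batch x seq_len mask of ones and zeros becomes a batch x 1 x 1 x seq_len additive mask that broadcasts over heads and query positions. The fill value of -10000.0 is an assumption chosen for illustration, the function name is hypothetical, and the causal case mentioned in the description is not covered.

```python
def extend_attention_mask(attention_mask, neg=-10000.0):
    """Turn a batch x seq_len 1/0 mask into a batch x 1 x 1 x seq_len additive mask."""
    # 1 (attend) maps to 0.0 and 0 (ignore) maps to a large negative value.
    return [[[[0.0 if keep else neg for keep in row]]] for row in attention_mask]


extended = extend_attention_mask([[1, 1, 0]])
print(extended)  # [[[[0.0, 0.0, -10000.0]]]]
```

Adding the result to raw attention scores leaves attended positions unchanged and pushes ignored positions towards zero weight after the softmax.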

forward(attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶

Parameters:

encoder_hidden_states (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) – Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).