pyiqa.archs.q_align.visual_encoder

Module Contents

pyiqa.archs.q_align.visual_encoder.find_pruneable_heads_and_indices(heads, n_heads, head_size, already_pruned_heads)[source]

Compatibility fallback for Transformers>=5 where this helper was removed.
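To illustrate what a helper of this kind computes, here is a minimal NumPy sketch. It assumes the fallback mirrors the helper that older Transformers releases shipped, which this page does not confirm. The name find_pruneable_heads_sketch is hypothetical, and NumPy stands in for torch so the example runs on its own; it is not the code behind this module.

```python
import numpy as np


def find_pruneable_heads_sketch(heads, n_heads, head_size, already_pruned_heads):
    """Return (heads still to prune, flat indices of the units to keep).

    Hypothetical illustration only; assumes `n_heads` is the number of heads
    currently present in the layer.
    """
    mask = np.ones((n_heads, head_size), dtype=bool)
    # Ignore heads that were already pruned in an earlier call.
    heads = set(heads) - set(already_pruned_heads)
    for head in heads:
        # Earlier prunes shift the remaining heads to lower row indices.
        shifted = head - sum(1 for h in already_pruned_heads if h < head)
        mask[shifted] = False
    # Flat positions (head * head_size + unit) of everything that survives.
    index = np.flatnonzero(mask.reshape(-1))
    return heads, index


heads, index = find_pruneable_heads_sketch(
    [1], n_heads=3, head_size=2, already_pruned_heads=set()
)
print(heads, index.tolist())  # {1} [0, 1, 4, 5]
```

The returned index is the kind of value a prune_heads method would use to slice the query, key, value and output projection weights.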

pyiqa.archs.q_align.visual_encoder.get_abs_pos(abs_pos, tgt_size)[source]
pyiqa.archs.q_align.visual_encoder.get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False)[source]

Parameters:
  • grid_size (int) – Height and width of the (square) grid.

Returns:

pos_embed of shape [grid_size*grid_size, embed_dim] without cls_token, or [1+grid_size*grid_size, embed_dim] with cls_token.

pyiqa.archs.q_align.visual_encoder.get_2d_sincos_pos_embed_from_grid(embed_dim, grid)[source]
pyiqa.archs.q_align.visual_encoder.get_1d_sincos_pos_embed_from_grid(embed_dim, pos)[source]

Parameters:
  • embed_dim – Output dimension D for each position.

  • pos – Positions to be encoded, of size (M,).

Returns:

Embedding of shape (M, D).
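The shapes documented for the sin-cos helpers can be reproduced with a short NumPy sketch. It assumes the common fixed sin-cos scheme (frequencies 1/10000^(2i/D), sine and cosine halves concatenated, a zero row prepended for the class token); the function names sincos_1d_sketch and sincos_2d_sketch are hypothetical, and the axis ordering and dtype may differ from the functions in this module.

```python
import numpy as np


def sincos_1d_sketch(embed_dim, pos):
    """Encode M positions into an (M, embed_dim) array; embed_dim must be even."""
    assert embed_dim % 2 == 0
    # Frequencies decay geometrically from 1 towards 1/10000.
    exponents = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2)
    omega = 1.0 / 10000.0 ** exponents                      # (D/2,)
    pos = np.asarray(pos, dtype=np.float64).reshape(-1)     # (M,)
    angles = np.outer(pos, omega)                           # (M, D/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (M, D)


def sincos_2d_sketch(embed_dim, grid_size, cls_token=False):
    """Encode a grid_size x grid_size grid; embed_dim must be divisible by 4."""
    coords = np.arange(grid_size, dtype=np.float64)
    grid_w, grid_h = np.meshgrid(coords, coords)  # each (grid_size, grid_size)
    # Half of the channels encode one axis, the other half the other axis.
    emb = np.concatenate(
        [sincos_1d_sketch(embed_dim // 2, grid_h),
         sincos_1d_sketch(embed_dim // 2, grid_w)],
        axis=1,
    )  # (grid_size * grid_size, embed_dim)
    if cls_token:
        # The class token gets an all-zero positional embedding.
        emb = np.concatenate([np.zeros((1, embed_dim)), emb], axis=0)
    return emb


print(sincos_1d_sketch(8, [0, 1, 2]).shape)           # (3, 8)
print(sincos_2d_sketch(8, 4).shape)                   # (16, 8)
print(sincos_2d_sketch(8, 4, cls_token=True).shape)   # (17, 8)
```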

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEmbeddings(config)[source]

Bases: torch.nn.Module

forward(pixel_values: torch.FloatTensor) → torch.Tensor[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionAttention(config)[source]

Bases: torch.nn.Module

Multi-headed attention from the ‘Attention Is All You Need’ paper.

forward(hidden_states: torch.Tensor, head_mask: torch.Tensor | None = None, output_attentions: bool | None = False) → Tuple[torch.Tensor, torch.Tensor | None, Tuple[torch.Tensor] | None][source]

Input shape: Batch x Time x Channel

class pyiqa.archs.q_align.visual_encoder.QuickGELU[source]

Bases: torch.nn.Module

forward(x: torch.Tensor)[source]
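QuickGELU usually refers to the sigmoid approximation of GELU, x * sigmoid(1.702 * x). The sketch below assumes this class follows that definition, which this page does not state; quick_gelu_sketch is a hypothetical name, and NumPy stands in for torch.

```python
import numpy as np


def quick_gelu_sketch(x):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    x = np.asarray(x, dtype=np.float64)
    # sigmoid(z) = 1 / (1 + exp(-z)), so x * sigmoid(1.702 x) = x / (1 + exp(-1.702 x)).
    return x / (1.0 + np.exp(-1.702 * x))


values = quick_gelu_sketch([-3.0, 0.0, 3.0])
print(values)  # small negative value, exactly 0, and a value just below 3
```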
class pyiqa.archs.q_align.visual_encoder.MplugOwlMLP(config)[source]

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor) → torch.Tensor[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoderLayer(config)[source]

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool | None = False) → Tuple[torch.FloatTensor][source]
Parameters:
  • hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)

  • attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoder(config)[source]

Bases: torch.nn.Module

Transformer encoder consisting of config.num_hidden_layers self-attention layers. Each layer is a MplugOwlVisionEncoderLayer.

Parameters:

config (MplugOwlVisionConfig) – The corresponding vision configuration for the MplugOwlVisionEncoder.

forward(inputs_embeds, attention_mask: torch.Tensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.BaseModelOutput[source]
Parameters:
  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Embedded representation of the inputs. Should be float, not int tokens.

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) –

    Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    [What are attention masks?](../glossary#attention-mask)

  • output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) – Whether or not to return a transformers.utils.ModelOutput instead of a plain tuple.

class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionModel(config)[source]

Bases: transformers.modeling_utils.PreTrainedModel

forward(pixel_values: torch.FloatTensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.BaseModelOutputWithPooling[source]

Returns:

get_input_embeddings()[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMLP(config)[source]

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor) → torch.Tensor[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMultiHeadAttention(config)[source]

Bases: torch.nn.Module

save_attn_gradients(attn_gradients)[source]
get_attn_gradients()[source]
save_attention_map(attention_map)[source]
get_attention_map()[source]
transpose_for_scores(x)[source]
forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_value=None, output_attentions=False)[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorCrossOutput(config)[source]

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor, input_tensor: torch.Tensor) → torch.Tensor[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorAttention(config)[source]

Bases: torch.nn.Module

prune_heads(heads)[source]
forward(hidden_states: torch.Tensor, attention_mask: torch.FloatTensor | None = None, head_mask: torch.FloatTensor | None = None, encoder_hidden_states: torch.FloatTensor | None = None, encoder_attention_mask: torch.FloatTensor | None = None, past_key_value: Tuple[Tuple[torch.FloatTensor]] | None = None, output_attentions: bool | None = False) → Tuple[torch.Tensor][source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorLayer(config, layer_idx)[source]

Bases: torch.nn.Module

forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, output_attentions=False)[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorEncoder(config)[source]

Bases: torch.nn.Module

forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=False, output_hidden_states=False, return_dict=True)[source]
class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorModel(config, language_hidden_size)[source]

Bases: transformers.modeling_utils.PreTrainedModel

get_head_mask(head_mask, num_hidden_layers, is_attention_chunked=False)[source]

Compatibility helper for Transformers>=5 where PreTrainedModel.get_head_mask was removed.
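A helper like this typically expands an optional head mask into one entry per layer. The NumPy sketch below assumes the behaviour of the method that older Transformers releases provided (a list of None when no mask is given, otherwise a 5-D mask); get_head_mask_sketch is a hypothetical name, the is_attention_chunked branch is omitted, and the real helper operates on torch tensors.

```python
import numpy as np


def get_head_mask_sketch(head_mask, num_hidden_layers):
    """Expand a head mask so it can be indexed per layer (illustration only)."""
    if head_mask is None:
        # No masking requested: one None placeholder per layer.
        return [None] * num_hidden_layers
    head_mask = np.asarray(head_mask, dtype=np.float32)
    if head_mask.ndim == 1:
        # (num_heads,) -> same mask shared by every layer.
        num_heads = head_mask.shape[0]
        head_mask = np.broadcast_to(
            head_mask[None, None, :, None, None],
            (num_hidden_layers, 1, num_heads, 1, 1),
        )
    elif head_mask.ndim == 2:
        # (num_layers, num_heads) -> one mask per layer.
        head_mask = head_mask[:, None, :, None, None]
    return head_mask


print(get_head_mask_sketch(None, 3))            # [None, None, None]
print(get_head_mask_sketch([1, 0], 3).shape)    # (3, 1, 2, 1, 1)
```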

get_extended_attention_mask(attention_mask: torch.Tensor, input_shape: Tuple[int], device: torch.device) → torch.Tensor[source]

Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

Parameters:
  • attention_mask (torch.Tensor) – Mask with ones indicating tokens to attend to, zeros for tokens to ignore.

  • input_shape (Tuple[int]) – The shape of the input to the model.

  • device (torch.device) – The device of the input to the model.

Returns:

torch.Tensor – The extended attention mask, with the same dtype as attention_mask.
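The conversion from a 0/1 padding mask to an additive, broadcastable mask can be sketched as follows. This is an illustration under assumptions, not this method's implementation: extend_attention_mask_sketch is a hypothetical name, NumPy stands in for torch, the output is cast to float32, and -10000.0 is an assumed fill value (some implementations use the dtype's minimum instead).

```python
import numpy as np


def extend_attention_mask_sketch(attention_mask, dtype=np.float32):
    """Turn a 0/1 mask into an additive mask that broadcasts over heads."""
    mask = np.asarray(attention_mask, dtype=dtype)
    if mask.ndim == 2:
        # (batch, seq_len) -> (batch, 1, 1, seq_len)
        mask = mask[:, None, None, :]
    elif mask.ndim == 3:
        # (batch, from_len, to_len) -> (batch, 1, from_len, to_len)
        mask = mask[:, None, :, :]
    else:
        raise ValueError(f"Unsupported attention_mask shape: {mask.shape}")
    # 1 -> 0.0 (attend); 0 -> large negative, which softmax drives to ~0 weight.
    return (1.0 - mask) * -10000.0


extended = extend_attention_mask_sketch([[1, 1, 0]])
print(extended.shape)  # (1, 1, 1, 3)
```

The result is added to the raw attention scores before the softmax, which is why masked positions are encoded as very large negative values.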

forward(attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]
Parameters:
  • encoder_hidden_states (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

  • encoder_attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) –

    Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) – Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).