pyiqa.archs.q_align.visual_encoder¶
Module Contents¶
- pyiqa.archs.q_align.visual_encoder.find_pruneable_heads_and_indices(heads, n_heads, head_size, already_pruned_heads)[source]¶
Compatibility fallback for Transformers>=5 where this helper was removed.
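The helper's behaviour is not described here beyond its signature. As a rough illustration only, the sketch below follows the helper of the same name that older Transformers releases shipped: it returns the heads that still need pruning plus the flat indices of the weights to keep. It uses NumPy instead of torch, and the `_sketch` name is hypothetical; the actual fallback in this module may differ.

```python
import numpy as np


def find_pruneable_heads_and_indices_sketch(heads, n_heads, head_size, already_pruned_heads):
    """Return (heads still to prune, flat indices of the weights to keep)."""
    # One boolean row per remaining head, one column per feature inside that head.
    mask = np.ones((n_heads, head_size), dtype=bool)
    heads = set(heads) - set(already_pruned_heads)
    for head in heads:
        # Heads pruned earlier no longer exist, so shift the index left accordingly.
        shifted = head - sum(1 for h in already_pruned_heads if h < head)
        mask[shifted] = False
    index = np.arange(n_heads * head_size)[mask.reshape(-1)]
    return heads, index


# Prune head 1 out of 3 heads, each of size 2.
heads, index = find_pruneable_heads_and_indices_sketch(
    [1], n_heads=3, head_size=2, already_pruned_heads=set()
)
```

Here `index` selects the rows or columns of a projection weight that belong to the heads that are kept.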
- pyiqa.archs.q_align.visual_encoder.get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False)[source]¶
- Parameters:
grid_size (int) – Height and width of the square grid.
- Returns:
pos_embed – Array of shape [grid_size*grid_size, embed_dim] without a class token, or [1+grid_size*grid_size, embed_dim] when cls_token=True.
- pyiqa.archs.q_align.visual_encoder.get_1d_sincos_pos_embed_from_grid(embed_dim, pos)[source]¶
- Parameters:
embed_dim – Output dimension for each position.
pos – Positions to be encoded, of size (M,).
- Returns:
Array of shape (M, D), where D is embed_dim.
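The shapes documented for the two sine-cosine helpers can be illustrated with a minimal NumPy sketch. It assumes the common fixed sine-cosine scheme (sine half followed by cosine half, frequency base 10000) and that the 2D embedding concatenates two 1D embeddings of size embed_dim/2; the `_sketch` names are hypothetical and the module's own functions may order the channels differently.

```python
import numpy as np


def sincos_1d_sketch(embed_dim, pos):
    """Encode M positions into an (M, embed_dim) array: sine half, then cosine half."""
    assert embed_dim % 2 == 0, "embed_dim must be even"
    # Frequencies decay geometrically from 1 toward 1/10000 (assumed base).
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0)
    omega = 1.0 / 10000.0 ** omega
    angles = np.outer(np.asarray(pos, dtype=np.float64).reshape(-1), omega)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_2d_sketch(embed_dim, grid_size, cls_token=False):
    """Encode a square grid into (grid_size*grid_size, embed_dim), plus one row if cls_token."""
    coords = np.arange(grid_size, dtype=np.float64)
    grid_w, grid_h = np.meshgrid(coords, coords)  # each of shape (grid_size, grid_size)
    # Half of the channels encode one axis, the other half encode the other axis.
    emb_h = sincos_1d_sketch(embed_dim // 2, grid_h)
    emb_w = sincos_1d_sketch(embed_dim // 2, grid_w)
    pos_embed = np.concatenate([emb_h, emb_w], axis=1)
    if cls_token:
        # The class token gets an all-zero embedding in the first row.
        pos_embed = np.concatenate([np.zeros((1, embed_dim)), pos_embed], axis=0)
    return pos_embed


one_d = sincos_1d_sketch(4, [0, 1, 2])
two_d = sincos_2d_sketch(8, 4)
with_cls = sincos_2d_sketch(8, 4, cls_token=True)
```

With these inputs, `one_d` has shape (3, 4), `two_d` has shape (16, 8), and `with_cls` has shape (17, 8). Note that embed_dim must be divisible by 4 in the 2D case so that each axis receives an even number of channels.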
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEmbeddings(config)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionAttention(config)[source]¶
Bases:
torch.nn.Module
Multi-headed attention from the ‘Attention Is All You Need’ paper.
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoderLayer(config)[source]¶
Bases:
torch.nn.Module
- forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool | None = False) → Tuple[torch.FloatTensor][source]¶
- Parameters:
hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)
attention_mask (torch.FloatTensor) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionEncoder(config)[source]¶
Bases:
torch.nn.Module
Transformer encoder consisting of config.num_hidden_layers self-attention layers. Each layer is a MplugOwlVisionEncoderLayer.
- Parameters:
config (MplugOwlVisionConfig) – The corresponding vision configuration for the MplugOwlVisionEncoder.
- forward(inputs_embeds, attention_mask: torch.Tensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.BaseModelOutput[source]¶
- Parameters:
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Embedded representation of the inputs. Should be float, not int tokens.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. [What are attention masks?](../glossary#attention-mask)
output_attentions (bool, optional) – Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) – Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) – Whether or not to return a transformers.utils.ModelOutput instead of a plain tuple.
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisionModel(config)[source]¶
Bases:
transformers.modeling_utils.PreTrainedModel
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMLP(config)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorMultiHeadAttention(config)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorCrossOutput(config)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorAttention(config)[source]¶
Bases:
torch.nn.Module
- forward(hidden_states: torch.Tensor, attention_mask: torch.FloatTensor | None = None, head_mask: torch.FloatTensor | None = None, encoder_hidden_states: torch.FloatTensor | None = None, encoder_attention_mask: torch.FloatTensor | None = None, past_key_value: Tuple[Tuple[torch.FloatTensor]] | None = None, output_attentions: bool | None = False) → Tuple[torch.Tensor][source]¶
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorLayer(config, layer_idx)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorEncoder(config)[source]¶
Bases:
torch.nn.Module
- class pyiqa.archs.q_align.visual_encoder.MplugOwlVisualAbstractorModel(config, language_hidden_size)[source]¶
Bases:
transformers.modeling_utils.PreTrainedModel
- get_head_mask(head_mask, num_hidden_layers, is_attention_chunked=False)[source]¶
Compatibility helper for Transformers>=5 where PreTrainedModel.get_head_mask was removed.
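For orientation, the sketch below mirrors how the removed `PreTrainedModel.get_head_mask` behaved in earlier Transformers releases: a missing mask becomes one `None` per layer, and a 1D or 2D mask is expanded so it broadcasts against per-layer attention scores. It is a NumPy approximation under that assumption, it ignores `is_attention_chunked`, and the `_sketch` name is hypothetical; the method in this class may differ in detail.

```python
import numpy as np


def get_head_mask_sketch(head_mask, num_hidden_layers):
    """Expand a head mask to (num_layers, batch, num_heads, seq_len, seq_len) broadcast form."""
    if head_mask is None:
        # No masking requested: every layer receives None.
        return [None] * num_hidden_layers
    head_mask = np.asarray(head_mask, dtype=np.float32)
    if head_mask.ndim == 1:
        # (num_heads,) -> the same mask is shared by every layer.
        head_mask = head_mask[None, None, :, None, None]
        head_mask = np.broadcast_to(head_mask, (num_hidden_layers,) + head_mask.shape[1:])
    elif head_mask.ndim == 2:
        # (num_layers, num_heads) -> one mask per layer.
        head_mask = head_mask[:, None, :, None, None]
    return head_mask


no_mask = get_head_mask_sketch(None, num_hidden_layers=3)
shared = get_head_mask_sketch([1.0, 0.0], num_hidden_layers=3)
per_layer = get_head_mask_sketch([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], num_hidden_layers=3)
```

Indexing the result by layer (`shared[i]`) then yields a mask that multiplies that layer's attention probabilities head by head.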
- get_extended_attention_mask(attention_mask: torch.Tensor, input_shape: Tuple[int], device: torch.device) → torch.Tensor[source]¶
Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
- Parameters:
attention_mask (torch.Tensor) – Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
input_shape (Tuple[int]) – The shape of the input to the model.
device (torch.device) – The device of the input to the model.
- Returns:
torch.Tensor – The extended attention mask, with the same dtype as attention_mask.
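A minimal sketch of the padding part of this conversion, written with NumPy rather than torch: a 0/1 mask gains singleton axes so it broadcasts over heads and query positions, then becomes an additive mask that is 0 where attention is allowed and a large negative number elsewhere. The fill value of -10000.0 and the `_sketch` name are assumptions for illustration, the causal case is not covered, and the actual method may use a different constant.

```python
import numpy as np


def extended_attention_mask_sketch(attention_mask, fill_value=-10000.0):
    """Turn a 0/1 mask into an additive mask that broadcasts over attention scores."""
    mask = np.asarray(attention_mask, dtype=np.float32)
    if mask.ndim == 2:
        # (batch, src_len) -> (batch, 1, 1, src_len)
        mask = mask[:, None, None, :]
    elif mask.ndim == 3:
        # (batch, tgt_len, src_len) -> (batch, 1, tgt_len, src_len)
        mask = mask[:, None, :, :]
    else:
        raise ValueError(f"Unsupported attention_mask shape: {mask.shape}")
    # 1 (attend) -> 0.0, 0 (ignore) -> fill_value; added to scores before softmax.
    return (1.0 - mask) * fill_value


ext = extended_attention_mask_sketch([[1, 1, 0]])
```

Adding `ext` to raw attention scores drives the softmax weight of the masked position toward zero while leaving the other positions unchanged.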
- forward(attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)[source]¶
- Parameters:
encoder_hidden_states (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.
encoder_attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) – Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) – Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).