pyiqa.archs.q_align.visual_encoder
==================================

.. py:module:: pyiqa.archs.q_align.visual_encoder


Module Contents
---------------

.. py:function:: find_pruneable_heads_and_indices(heads, n_heads, head_size, already_pruned_heads)

   Compatibility fallback for Transformers>=5, where this helper was removed.


.. py:function:: get_abs_pos(abs_pos, tgt_size)


.. py:function:: get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False)

   :param grid_size: Height and width of the grid, as an int.
   :returns: `pos_embed` of shape `[grid_size*grid_size, embed_dim]` without `cls_token`, or `[1+grid_size*grid_size, embed_dim]` with `cls_token`.


.. py:function:: get_2d_sincos_pos_embed_from_grid(embed_dim, grid)


.. py:function:: get_1d_sincos_pos_embed_from_grid(embed_dim, pos)

   :param embed_dim: Output dimension `D` for each position.
   :param pos: A list of positions to be encoded, of size `(M,)`.
   :returns: The positional embeddings, of shape `(M, D)`.


.. py:class:: MplugOwlVisionEmbeddings(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(pixel_values: torch.FloatTensor) -> torch.Tensor


.. py:class:: MplugOwlVisionAttention(config)

   Bases: :py:obj:`torch.nn.Module`

   Multi-headed attention from the 'Attention Is All You Need' paper.


   .. py:method:: forward(hidden_states: torch.Tensor, head_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]

      Input shape: Batch x Time x Channel


.. py:class:: QuickGELU

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(x: torch.Tensor)


.. py:class:: MplugOwlMLP(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states: torch.Tensor) -> torch.Tensor


.. py:class:: MplugOwlVisionEncoderLayer(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: Optional[bool] = False) -> Tuple[torch.FloatTensor]

      :param hidden_states: Input to the layer, of shape `(batch, seq_len, embed_dim)`.
      :type hidden_states: `torch.FloatTensor`
      :param attention_mask: Attention mask of size `(batch, 1, tgt_len, src_len)`, where padding elements are indicated by very large negative values.
      :type attention_mask: `torch.FloatTensor`
      :param output_attentions: Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
      :type output_attentions: `bool`, *optional*


.. py:class:: MplugOwlVisionEncoder(config)

   Bases: :py:obj:`torch.nn.Module`

   Transformer encoder consisting of `config.num_hidden_layers` self-attention layers. Each layer is a [`MplugOwlVisionEncoderLayer`].

   :param config: The corresponding vision configuration for the `MplugOwlEncoder`.
   :type config: `MplugOwlVisionConfig`


   .. py:method:: forward(inputs_embeds, attention_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple, transformers.modeling_outputs.BaseModelOutput]

      :param inputs_embeds: Embedded representation of the inputs. Should be float, not int tokens.
      :type inputs_embeds: `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`
      :param attention_mask: Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

         - 1 for tokens that are **not masked**,
         - 0 for tokens that are **masked**.

         [What are attention masks?](../glossary#attention-mask)

      :type attention_mask: `torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*
      :param output_attentions: Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
      :type output_attentions: `bool`, *optional*
      :param output_hidden_states: Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
      :type output_hidden_states: `bool`, *optional*
      :param return_dict: Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
      :type return_dict: `bool`, *optional*


.. py:class:: MplugOwlVisionModel(config)

   Bases: :py:obj:`transformers.modeling_utils.PreTrainedModel`


   .. py:method:: forward(pixel_values: Optional[torch.FloatTensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple, transformers.modeling_outputs.BaseModelOutputWithPooling]

      Returns:


   .. py:method:: get_input_embeddings()


.. py:class:: MplugOwlVisualAbstractorMLP(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states: torch.Tensor) -> torch.Tensor


.. py:class:: MplugOwlVisualAbstractorMultiHeadAttention(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: save_attn_gradients(attn_gradients)


   .. py:method:: get_attn_gradients()


   .. py:method:: save_attention_map(attention_map)


   .. py:method:: get_attention_map()


   .. py:method:: transpose_for_scores(x)


   .. py:method:: forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_value=None, output_attentions=False)


.. py:class:: MplugOwlVisualAbstractorCrossOutput(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor


.. py:class:: MplugOwlVisualAbstractorAttention(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: prune_heads(heads)


   .. py:method:: forward(hidden_states: torch.Tensor, attention_mask: Optional[torch.FloatTensor] = None, head_mask: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[torch.FloatTensor] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, output_attentions: Optional[bool] = False) -> Tuple[torch.Tensor]


.. py:class:: MplugOwlVisualAbstractorLayer(config, layer_idx)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, output_attentions=False)


.. py:class:: MplugOwlVisualAbstractorEncoder(config)

   Bases: :py:obj:`torch.nn.Module`


   .. py:method:: forward(hidden_states, attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=False, output_hidden_states=False, return_dict=True)


.. py:class:: MplugOwlVisualAbstractorModel(config, language_hidden_size)

   Bases: :py:obj:`transformers.modeling_utils.PreTrainedModel`


   .. py:method:: get_head_mask(head_mask, num_hidden_layers, is_attention_chunked=False)

      Compatibility helper for Transformers>=5, where `PreTrainedModel.get_head_mask` was removed.


   .. py:method:: get_extended_attention_mask(attention_mask: torch.Tensor, input_shape: Tuple[int], device: torch.device) -> torch.Tensor

      Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

      :param attention_mask: Mask with ones indicating tokens to attend to and zeros for tokens to ignore.
      :type attention_mask: `torch.Tensor`
      :param input_shape: The shape of the input to the model.
      :type input_shape: `Tuple[int]`
      :param device: The device of the input to the model.
      :type device: `torch.device`
      :returns: The extended attention mask, with the same dtype as `attention_mask.dtype`.
      :rtype: `torch.Tensor`


   .. py:method:: forward(attention_mask=None, head_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

      encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, `optional`):
         Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
         the model is configured as a decoder.

      encoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, `optional`):
         Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
         the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`:

         - 1 for tokens that are **not masked**,
         - 0 for tokens that are **masked**.

      past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers`, with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
         Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
         If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
         don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
         `decoder_input_ids` of shape `(batch_size, sequence_length)`.
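
      The `[0, 1]` masks described above are usually converted into an additive mask before they reach the attention softmax. The sketch below follows the common Transformers convention for a 2-D mask. It is an illustration under that assumption, not this module's implementation. `make_extended_attention_mask` is a hypothetical helper, and `get_extended_attention_mask` here may use a different large negative constant or also build a causal mask.

      .. code-block:: python

         import torch


         def make_extended_attention_mask(attention_mask: torch.Tensor,
                                          dtype: torch.dtype = torch.float32) -> torch.Tensor:
             """Hypothetical helper (not part of this module): 2-D [0, 1] mask -> additive mask."""
             # (batch, seq_len) -> (batch, 1, 1, seq_len) so it broadcasts over heads and query positions
             extended = attention_mask[:, None, None, :].to(dtype)
             # Kept positions (1) become 0; masked positions (0) become the most negative finite value
             return (1.0 - extended) * torch.finfo(dtype).min


         # One sequence of three tokens; the last token is padding
         mask = torch.tensor([[1, 1, 0]])
         additive = make_extended_attention_mask(mask)

         # Add to raw attention scores of shape (batch, heads, query_len, key_len) before the softmax
         scores = torch.zeros(1, 2, 3, 3)
         probs = torch.softmax(scores + additive, dim=-1)
         # The padded key receives zero attention weight for every head and query position
         print(probs[0, 0, 0])

      Adding the mask, rather than multiplying the probabilities, keeps the softmax normalized over the unmasked positions only. Attended positions have their scores left unchanged.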