pyiqa.archs.q_align.modeling_llama2
===================================

.. py:module:: pyiqa.archs.q_align.modeling_llama2

Module Contents
---------------

.. py:data:: dir_path

.. py:class:: LlamaLinearScalingRotaryEmbedding(*args, scaling_factor=1.0, **kwargs)

   Bases: :py:obj:`transformers.models.llama.modeling_llama.LlamaRotaryEmbedding`

.. py:data:: logger

.. py:function:: is_flash_attn_greater_or_equal_2_10()

.. py:data:: index_first_axis

.. py:data:: unpad_input

.. py:data:: pad_input

.. py:data:: flash_attn_func

.. py:data:: flash_attn_varlen_func

.. py:function:: apply_rotary_pos_emb(q, k, cos, sin, position_ids)

   Compatibility wrapper for rotary API changes across Transformers versions.

.. py:class:: MultiwayNetwork(module_provider, num_multiway=2)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(hidden_states, multiway_indices)

.. py:class:: LlamaAttention(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx: Optional[int] = None)

   Bases: :py:obj:`torch.nn.Module`

   Multi-headed attention from the 'Attention Is All You Need' paper.

   .. py:method:: forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: bool = False, use_cache: bool = False, padding_mask: Optional[torch.LongTensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]

.. py:class:: LlamaFlashAttention2(*args, **kwargs)

   Bases: :py:obj:`LlamaAttention`

   Llama flash attention module. This module inherits from `LlamaAttention` because the weights of the module stay untouched. The only required change is in the forward pass, which must correctly call the public API of flash attention and handle padding tokens if the input contains any.

   .. py:method:: forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: Optional[torch.LongTensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[transformers.cache_utils.Cache] = None, output_attentions: bool = False, use_cache: bool = False, **kwargs) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]

.. py:class:: LlamaSdpaAttention(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx: Optional[int] = None)

   Bases: :py:obj:`LlamaAttention`

   Llama attention module using `torch.nn.functional.scaled_dot_product_attention`. This module inherits from `LlamaAttention` because the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the SDPA API.

   .. py:method:: forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[transformers.cache_utils.Cache] = None, output_attentions: bool = False, use_cache: bool = False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]

.. py:data:: LLAMA_ATTENTION_CLASSES

.. py:class:: LlamaDecoderLayer(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_value: Optional[Tuple[torch.Tensor]] = None, output_attentions: Optional[bool] = False, use_cache: Optional[bool] = False) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]

      :param hidden_states: Input to the layer, of shape `(batch, seq_len, embed_dim)`.
      :type hidden_states: `torch.FloatTensor`
      :param attention_mask: Attention mask of size `(batch, 1, tgt_len, src_len)`, where padding elements are indicated by very large negative values.
      :type attention_mask: `torch.FloatTensor`, *optional*
      :param output_attentions: Whether or not to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
      :type output_attentions: `bool`, *optional*
      :param use_cache: If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
      :type use_cache: `bool`, *optional*
      :param past_key_value: Cached past key and value projection states.
      :type past_key_value: `Tuple(torch.FloatTensor)`, *optional*

.. py:function:: model_forward(self, input_ids: torch.LongTensor = None, modality_indicators: torch.Tensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, inputs_embeds: Optional[torch.FloatTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple, transformers.modeling_outputs.BaseModelOutputWithPast]

.. py:function:: causal_model_forward(self, input_ids: torch.LongTensor = None, modality_indicators: torch.Tensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[List[torch.FloatTensor]] = None, inputs_embeds: Optional[torch.FloatTensor] = None, labels: Optional[torch.LongTensor] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]

   :param labels: Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size - 1]` or `-100` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is computed only for tokens with labels in `[0, ..., config.vocab_size - 1]`.
   :type labels: `torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*

   Returns:

   Example:

   ```python
   >>> from transformers import AutoTokenizer, LlamaForCausalLM

   >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
   >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

   >>> prompt = "Hey, are you conscious? Can you talk to me?"
   >>> inputs = tokenizer(prompt, return_tensors="pt")

   >>> # Generate
   >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
   >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
   "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
   ```

.. py:function:: replace_llama_modality_adaptive()

.. py:data:: config
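
The `labels` convention documented for `causal_model_forward` (positions labelled `-100` are excluded from the loss) can be illustrated with a minimal, dependency-free sketch. This is not the module's implementation: the helper name `masked_next_token_loss` and the toy values are hypothetical, and the next-token shift (position `t` predicts the label at `t + 1`) is an assumption based on the usual causal language modeling setup, which this page does not state explicitly.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def masked_next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over one sequence, skipping ignored labels.

    logits: list of per-position score lists, each of length vocab_size
    labels: list of token ids (or ignore_index), same length as logits
    Hypothetical helper for illustration only.
    """
    total, counted = 0.0, 0
    # Assumed shift: position t predicts the token at t + 1,
    # so logits[:-1] are paired with labels[1:].
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        peak = max(scores)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of target
        counted += 1
    return total / counted if counted else float("nan")


# Toy sequence of 4 positions over a vocabulary of 3 tokens.
logits = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
# The first two labels are masked, e.g. prompt tokens that should not be learned.
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]
loss = masked_next_token_loss(logits, labels)
```

Under these assumptions, only the scores at positions 1 and 2 influence `loss`: position 0 predicts a masked label, and the last position has no following label to predict. In PyTorch, `torch.nn.functional.cross_entropy` applies the same masking through its `ignore_index` argument, whose default is `-100`.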