pyiqa.archs.q_align.modeling_llama2¶

Module Contents¶

pyiqa.archs.q_align.modeling_llama2.dir_path[source]¶

class pyiqa.archs.q_align.modeling_llama2.LlamaLinearScalingRotaryEmbedding(*args, scaling_factor=1.0, **kwargs)[source]¶

Bases: transformers.models.llama.modeling_llama.LlamaRotaryEmbedding
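The scaling_factor argument suggests linear position scaling, where each position is divided by the factor before the rotary angles are computed. The following is a minimal sketch of that idea in plain Python, not this class's implementation; the function name rotary_angles and the defaults for base and head_dim are assumptions made for illustration.

```python
import math


def rotary_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    """Return one rotation angle per frequency pair for a single position.

    Linear scaling divides the position by `scaling_factor` before the
    angles are computed. `base` and `head_dim` are illustrative defaults.
    """
    scaled_position = position / scaling_factor
    inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    return [scaled_position * f for f in inv_freq]


# With scaling_factor=2.0, position 8 receives the angles of unscaled position 4.
plain = rotary_angles(4, head_dim=8)
scaled = rotary_angles(8, head_dim=8, scaling_factor=2.0)
```

Under this assumption, a factor greater than 1 compresses positions so that longer sequences map into the angle range seen during training.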

pyiqa.archs.q_align.modeling_llama2.logger[source]¶

pyiqa.archs.q_align.modeling_llama2.is_flash_attn_greater_or_equal_2_10()[source]¶

pyiqa.archs.q_align.modeling_llama2.index_first_axis[source]¶

pyiqa.archs.q_align.modeling_llama2.unpad_input[source]¶

pyiqa.archs.q_align.modeling_llama2.pad_input[source]¶

pyiqa.archs.q_align.modeling_llama2.flash_attn_func[source]¶

pyiqa.archs.q_align.modeling_llama2.flash_attn_varlen_func[source]¶

pyiqa.archs.q_align.modeling_llama2.apply_rotary_pos_emb(q, k, cos, sin, position_ids)[source]¶

Compatibility wrapper for rotary API changes across Transformers versions.
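For readers unfamiliar with rotary embeddings, the sketch below shows the rotation that Llama-style models commonly apply to a single query or key vector. It does not reproduce the wrapper's version handling, and the helper names (rotate_half, apply_rotary, cos_sin_for) and the base value are assumptions chosen for illustration.

```python
import math


def rotate_half(x):
    """Map the two halves (x1, x2) of a vector to (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(x, cos, sin):
    """Rotate one vector: x * cos + rotate_half(x) * sin, element-wise."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


def cos_sin_for(position, head_dim, base=10000.0):
    """Build cos/sin tables for one position, repeated across both halves."""
    angles = [position * base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    angles = angles + angles
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


q = [1.0, 2.0, 3.0, 4.0]
cos, sin = cos_sin_for(position=3, head_dim=4)
q_rot = apply_rotary(q, cos, sin)
```

Because each pair of coordinates is rotated by an angle, the vector's length is unchanged, and position 0 leaves the vector as it was.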

class pyiqa.archs.q_align.modeling_llama2.MultiwayNetwork(module_provider, num_multiway=2)[source]¶

Bases: torch.nn.Module

forward(hidden_states, multiway_indices)[source]¶
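The forward signature suggests that each token is routed to one of num_multiway sub-modules according to its entry in multiway_indices. The following plain-Python sketch illustrates that routing pattern under this assumption; multiway_forward and the two branch functions are hypothetical stand-ins, not the class's actual code.

```python
def multiway_forward(tokens, multiway_indices, branches):
    """Apply branches[k] to every token whose index equals k.

    tokens: list of feature vectors (lists of floats)
    multiway_indices: list of ints, one per token
    branches: list of callables, one per modality
    """
    output = [None] * len(tokens)
    for k, branch in enumerate(branches):
        for pos, (token, index) in enumerate(zip(tokens, multiway_indices)):
            if index == k:
                output[pos] = branch(token)
    return output


def text_branch(vector):
    """Hypothetical branch for index 0: doubles each feature."""
    return [2.0 * x for x in vector]


def image_branch(vector):
    """Hypothetical branch for index 1: adds one to each feature."""
    return [x + 1.0 for x in vector]


tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
indices = [0, 1, 0]  # assumed convention: 0 = text token, 1 = image token
routed = multiway_forward(tokens, indices, [text_branch, image_branch])
```

Here routed is [[2.0, 2.0], [3.0, 3.0], [6.0, 6.0]]: the token order is preserved while each token passes through the branch selected by its index.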

class pyiqa.archs.q_align.modeling_llama2.LlamaAttention(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx: int | None = None)[source]¶

Bases: torch.nn.Module

Multi-headed attention from the ‘Attention Is All You Need’ paper.

forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_value: Tuple[torch.Tensor] | None = None, output_attentions: bool = False, use_cache: bool = False, padding_mask: torch.LongTensor | None = None) → Tuple[torch.Tensor, torch.Tensor | None, Tuple[torch.Tensor] | None][source]¶

class pyiqa.archs.q_align.modeling_llama2.LlamaFlashAttention2(*args, **kwargs)[source]¶

Bases: LlamaAttention

Llama flash attention module. This module inherits from LlamaAttention because the module's weights stay untouched. The only required change is in the forward pass, which must call the public flash attention API correctly and handle padding tokens if the input contains any.

forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: torch.LongTensor | None = None, position_ids: torch.LongTensor | None = None, past_key_value: transformers.cache_utils.Cache | None = None, output_attentions: bool = False, use_cache: bool = False, **kwargs) → Tuple[torch.Tensor, torch.Tensor | None, Tuple[torch.Tensor] | None][source]¶
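The padding handling mentioned above is typically done by packing the non-padding tokens into one flat sequence before attention and scattering the results back afterwards. The sketch below illustrates that round trip in plain Python; the functions unpad and pad are hypothetical stand-ins and do not reproduce the signatures of the unpad_input and pad_input names listed in this module.

```python
def unpad(batch, attention_mask):
    """Pack non-padding tokens into one flat list.

    Returns the packed tokens, their flat indices in the padded layout,
    and cumulative sequence lengths (one boundary per sequence).
    """
    seq_len = len(attention_mask[0])
    packed, indices, cu_seqlens = [], [], [0]
    for b, (row, mask_row) in enumerate(zip(batch, attention_mask)):
        for t, (token, keep) in enumerate(zip(row, mask_row)):
            if keep:
                packed.append(token)
                indices.append(b * seq_len + t)
        cu_seqlens.append(len(packed))
    return packed, indices, cu_seqlens


def pad(packed, indices, batch_size, seq_len, fill=0):
    """Scatter packed tokens back into a padded (batch, seq_len) layout."""
    flat = [fill] * (batch_size * seq_len)
    for token, index in zip(packed, indices):
        flat[index] = token
    return [flat[b * seq_len:(b + 1) * seq_len] for b in range(batch_size)]


batch = [[11, 12, 13], [21, 0, 0]]
mask = [[1, 1, 1], [1, 0, 0]]  # 1 = real token, 0 = padding
packed, indices, cu_seqlens = unpad(batch, mask)
restored = pad(packed, indices, batch_size=2, seq_len=3)
```

In this example cu_seqlens is [0, 3, 4], marking where each sequence starts and ends in the packed list, and restored equals the original padded batch.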

class pyiqa.archs.q_align.modeling_llama2.LlamaSdpaAttention(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx: int | None = None)[source]¶

Bases: LlamaAttention

Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from LlamaAttention because the module's weights stay untouched. The only changes are in the forward pass, to adapt it to the SDPA API.

forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_value: transformers.cache_utils.Cache | None = None, output_attentions: bool = False, use_cache: bool = False) → Tuple[torch.Tensor, torch.Tensor | None, Tuple[torch.Tensor] | None][source]¶

pyiqa.archs.q_align.modeling_llama2.LLAMA_ATTENTION_CLASSES[source]¶

class pyiqa.archs.q_align.modeling_llama2.LlamaDecoderLayer(config: pyiqa.archs.q_align.configuration_mplug_owl2.LlamaConfig, layer_idx)[source]¶

Bases: torch.nn.Module

forward(hidden_states: torch.Tensor, modality_indicators: torch.Tensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_value: Tuple[torch.Tensor] | None = None, output_attentions: bool | None = False, use_cache: bool | None = False) → Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, torch.FloatTensor] | None][source]¶

Parameters:

hidden_states (torch.FloatTensor) – input to the layer of shape (batch, seq_len, embed_dim)
attention_mask (torch.FloatTensor, optional) – attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.
output_attentions (bool, optional) – Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
use_cache (bool, optional) – If set to True, past_key_values key-value states are returned and can be used to speed up decoding (see past_key_values).
past_key_value (Tuple(torch.FloatTensor), optional) – cached past key and value projection states
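The attention_mask description above refers to an additive mask: padding positions carry very large negative values, so they vanish after the softmax. The following is a minimal plain-Python sketch of that effect for a single query; masked_softmax and the constant NEG are assumptions for illustration, and real code often uses the minimum value of the tensor dtype instead.

```python
import math


def masked_softmax(scores, additive_mask):
    """Softmax over scores after adding the mask; masked slots get ~0 weight."""
    shifted = [s + m for s, m in zip(scores, additive_mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


NEG = -1e9  # stand-in for a "very large negative value"

# One query attending over four keys; the last two keys are padding.
scores = [0.5, 1.5, 0.2, 0.9]
mask = [0.0, 0.0, NEG, NEG]
weights = masked_softmax(scores, mask)
```

The two padded keys receive zero weight, and the remaining weights still sum to one.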

pyiqa.archs.q_align.modeling_llama2.model_forward(self, input_ids: torch.LongTensor = None, modality_indicators: torch.Tensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: List[torch.FloatTensor] | None = None, inputs_embeds: torch.FloatTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.BaseModelOutputWithPast[source]¶

pyiqa.archs.q_align.modeling_llama2.causal_model_forward(self, input_ids: torch.LongTensor = None, modality_indicators: torch.Tensor = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: List[torch.FloatTensor] | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None) → Tuple | transformers.modeling_outputs.CausalLMOutputWithPast[source]¶

Parameters:

labels (torch.LongTensor of shape (batch_size, sequence_length), optional) – Labels for computing the masked language modeling loss. Indices should either be in [0, …, config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is computed only for the tokens with labels in [0, …, config.vocab_size].

Returns:

Example:

```python
>>> from transformers import AutoTokenizer, LlamaForCausalLM

>>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```

pyiqa.archs.q_align.modeling_llama2.replace_llama_modality_adaptive()[source]¶

pyiqa.archs.q_align.modeling_llama2.config[source]¶