pyiqa.archs.clip_model

Module Contents

pyiqa.archs.clip_model.available_models() → List[str][source]

Returns the names of the available CLIP models.

pyiqa.archs.clip_model.load(name: str, device: Union[str, torch.device]='cuda' if torch.cuda.is_available() else 'cpu', jit: bool = False, download_root: str = None)[source]

Load a CLIP model.

Parameters:

  • name (str) – A model name listed by available_models(), or the path to a model checkpoint containing the state_dict

  • device (Union[str, torch.device]) – The device on which to place the loaded model

  • jit (bool) – Whether to load the optimized JIT model or the more hackable non-JIT model (default)

  • download_root (str) – Path to download the model files to; by default, “~/.cache/clip”

Returns:

  • model (torch.nn.Module) – The CLIP model

  • preprocess (Callable[[PIL.Image], torch.Tensor]) – A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
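A minimal usage sketch for load() follows. It relies only on the two functions documented above; the helper name load_clip is hypothetical and not part of pyiqa. The docstring describes a (model, preprocess) pair, and the sketch also tolerates a bare model in case an installed version returns only the module, which is an assumption rather than documented behavior.

```python
def load_clip(name=None, device="cpu"):
    """Load a CLIP backbone through pyiqa.archs.clip_model (illustrative sketch)."""
    # Imported lazily so that defining this helper does not itself require
    # pyiqa to be installed.
    from pyiqa.archs import clip_model

    # Fall back to the first advertised model rather than hard-coding a name.
    if name is None:
        name = clip_model.available_models()[0]

    # jit=False selects the non-JIT model, which is the documented default.
    loaded = clip_model.load(name, device=device, jit=False)

    # The docstring describes a (model, preprocess) pair; accept a bare
    # module too (assumption: some versions may return only the model).
    if isinstance(loaded, tuple):
        model, preprocess = loaded
    else:
        model, preprocess = loaded, None

    model.eval()  # switch the torch.nn.Module to inference mode
    return model, preprocess
```

Calling load_clip() with no arguments would fetch the weights for the first listed model on first use, storing them under download_root as described above.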

class pyiqa.archs.clip_model.Bottleneck(inplanes, planes, stride=1)[source]

Bases: torch.nn.Module

forward(x: torch.Tensor)[source]
class pyiqa.archs.clip_model.AttentionPool2d(spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None)[source]

Bases: torch.nn.Module

forward(x, return_token=False, pos_embedding=False)[source]
class pyiqa.archs.clip_model.ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64)[source]

Bases: torch.nn.Module

A ResNet class that is similar to torchvision’s but contains the following changes:

  • There are now 3 “stem” convolutions as opposed to 1, with an average pool instead of a max pool.

  • Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1.

  • The final pooling layer is a QKV attention instead of an average pool.
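The anti-aliasing idea can be illustrated with a small, stdlib-only 1-D sketch. This is not the class's implementation, which works on 2-D feature maps with torch layers. All function names below are hypothetical. It compares a plain strided convolution with the "average pool first, then stride-1 convolution" arrangement on a rapidly alternating signal.

```python
def avg_pool1d(xs, k):
    """Non-overlapping average pooling with window size and stride k."""
    return [sum(xs[i:i + k]) / k for i in range(0, len(xs) - k + 1, k)]


def conv1d(xs, kernel, stride=1):
    """Unpadded 1-D cross-correlation with the given stride."""
    n = len(kernel)
    return [
        sum(w * xs[i + j] for j, w in enumerate(kernel))
        for i in range(0, len(xs) - n + 1, stride)
    ]


def strided_conv(xs, kernel, stride):
    """Plain strided convolution: samples skipped by the stride are ignored."""
    return conv1d(xs, kernel, stride=stride)


def antialiased_conv(xs, kernel, stride):
    """Average-pool by the stride first, then convolve with stride 1."""
    pooled = avg_pool1d(xs, stride) if stride > 1 else xs
    return conv1d(pooled, kernel, stride=1)


# An alternating signal and a 1-tap kernel, downsampled by a factor of 2.
signal = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
plain = strided_conv(signal, [1.0], stride=2)         # only even positions seen
smoothed = antialiased_conv(signal, [1.0], stride=2)  # every sample contributes
```

In this toy case the plain strided version only ever sees the zeros, while the pooled version retains the signal's mean level. Avoiding that kind of information loss is the motivation for pooling before downsampling.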

forward_features(x, return_token=False, pos_embedding=False)[source]
forward(x, return_token=False, pos_embedding=False)[source]
class pyiqa.archs.clip_model.LayerNorm[source]

Bases: torch.nn.LayerNorm

Subclass torch’s LayerNorm to handle fp16.

forward(x: torch.Tensor)[source]
class pyiqa.archs.clip_model.QuickGELU[source]

Bases: torch.nn.Module

forward(x: torch.Tensor)[source]
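This page gives no description for QuickGELU. In the original OpenAI CLIP code, which this module appears to follow (its docstrings refer to clip.available_models() and “~/.cache/clip”), QuickGELU is the sigmoid-based approximation x * sigmoid(1.702 * x). Assuming the same definition applies here, the stdlib-only sketch below compares it with the exact GELU on scalars; the function names are illustrative only.

```python
import math


def quick_gelu(x: float) -> float:
    """Sigmoid-based GELU approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def exact_gelu(x: float) -> float:
    """Reference GELU, x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest absolute gap between the two on a grid over [-4, 4].
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(quick_gelu(x) - exact_gelu(x)) for x in grid)
```

The approximation trades a small deviation from the exact GELU for a cheaper elementwise operation.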
class pyiqa.archs.clip_model.ResidualAttentionBlock(d_model: int, n_head: int, attn_mask: torch.Tensor = None)[source]

Bases: torch.nn.Module

attention(x: torch.Tensor)[source]
forward(x: torch.Tensor)[source]
class pyiqa.archs.clip_model.Transformer(width: int, layers: int, heads: int, attn_mask: torch.Tensor = None)[source]

Bases: torch.nn.Module

forward(x: torch.Tensor)[source]
class pyiqa.archs.clip_model.VisionTransformer(input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int)[source]

Bases: torch.nn.Module

forward(x: torch.Tensor, return_token=False, pos_embedding=False)[source]
class pyiqa.archs.clip_model.CLIP(embed_dim: int, image_resolution: int, vision_layers: Tuple[int, int, int, int] | int, vision_width: int, vision_patch_size: int, context_length: int, vocab_size: int, transformer_width: int, transformer_heads: int, transformer_layers: int)[source]

Bases: torch.nn.Module

initialize_parameters()[source]
build_attention_mask()[source]
property dtype[source]
encode_image(image, pos_embedding)[source]
encode_text(text)[source]
forward(image, text, pos_embedding=False, text_features=None)[source]
pyiqa.archs.clip_model.convert_weights(model: torch.nn.Module)[source]

Convert applicable model parameters to fp16
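To give a sense of what an fp16 conversion does to an individual value, here is a stdlib-only sketch that rounds a Python float through IEEE 754 half precision using the struct module's "e" format. It is not how convert_weights is implemented, since that function operates on torch parameters; the helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Values exactly representable in fp16 survive the round trip unchanged.
one = to_fp16(1.0)

# 0.1 is not representable in fp16 and comes back slightly smaller.
tenth = to_fp16(0.1)
rounding_error = abs(tenth - 0.1)
```

The rounding error for 0.1 is on the order of 1e-5, which shows the precision that half-precision parameters give up in exchange for lower memory use.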

pyiqa.archs.clip_model.build_model(state_dict: dict)[source]