pyiqa.archs.clip_model
======================

.. py:module:: pyiqa.archs.clip_model


Module Contents
---------------

.. py:function:: available_models() -> List[str]

   Returns the names of available CLIP models.


.. py:function:: load(name: str, device: Union[str, torch.device] = 'cuda' if torch.cuda.is_available() else 'cpu', jit: bool = False, download_root: str = None)

   Load a CLIP model.

   :param name: A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict
   :type name: str
   :param device: The device on which to place the loaded model
   :type device: Union[str, torch.device]
   :param jit: Whether to load the optimized JIT model or the more hackable non-JIT model (default).
   :type jit: bool
   :param download_root: Path to which the model files are downloaded; by default, "~/.cache/clip"
   :type download_root: str

   :returns: * **model** (*torch.nn.Module*) -- The CLIP model
             * **preprocess** (*Callable[[PIL.Image], torch.Tensor]*) -- A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input


.. py:class:: Bottleneck(inplanes, planes, stride=1)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(x: torch.Tensor)


.. py:class:: AttentionPool2d(spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(x, return_token=False, pos_embedding=False)


.. py:class:: ModifiedResNet(layers, output_dim, heads, input_resolution=224, width=64)

   Bases: :py:obj:`torch.nn.Module`

   A ResNet class that is similar to torchvision's but contains the following changes:

   - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
   - It performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1.
   - The final pooling layer is a QKV attention instead of an average pool.

   .. py:method:: forward_features(x, return_token=False, pos_embedding=False)

   .. py:method:: forward(x, return_token=False, pos_embedding=False)


.. py:class:: LayerNorm

   Bases: :py:obj:`torch.nn.LayerNorm`

   Subclass torch's LayerNorm to handle fp16.

   .. py:method:: forward(x: torch.Tensor)


.. py:class:: QuickGELU

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(x: torch.Tensor)


.. py:class:: ResidualAttentionBlock(d_model: int, n_head: int, attn_mask: torch.Tensor = None)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: attention(x: torch.Tensor)

   .. py:method:: forward(x: torch.Tensor)


.. py:class:: Transformer(width: int, layers: int, heads: int, attn_mask: torch.Tensor = None)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(x: torch.Tensor)


.. py:class:: VisionTransformer(input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: forward(x: torch.Tensor, return_token=False, pos_embedding=False)


.. py:class:: CLIP(embed_dim: int, image_resolution: int, vision_layers: Union[Tuple[int, int, int, int], int], vision_width: int, vision_patch_size: int, context_length: int, vocab_size: int, transformer_width: int, transformer_heads: int, transformer_layers: int)

   Bases: :py:obj:`torch.nn.Module`

   .. py:method:: initialize_parameters()

   .. py:method:: build_attention_mask()

   .. py:property:: dtype

   .. py:method:: encode_image(image, pos_embedding)

   .. py:method:: encode_text(text)

   .. py:method:: forward(image, text, pos_embedding=False, text_features=None)


.. py:function:: convert_weights(model: torch.nn.Module)

   Convert applicable model parameters to fp16.


.. py:function:: build_model(state_dict: dict)
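
The anti-aliasing change listed for ``ModifiedResNet`` (an avgpool prepended to convolutions with stride > 1) can be illustrated with a small one-dimensional sketch. This is not the module's implementation, which operates on 2D feature maps with torch layers; the helper names ``avg_pool1d``, ``conv1d`` and ``antialiased_strided_conv1d`` are hypothetical and exist only for this example. It assumes non-overlapping pooling windows whose length equals the stride.

```python
from typing import List, Sequence


def avg_pool1d(signal: Sequence[float], stride: int) -> List[float]:
    """Average non-overlapping windows of length `stride` (any remainder is dropped)."""
    n_out = len(signal) // stride
    return [sum(signal[i * stride:(i + 1) * stride]) / stride for i in range(n_out)]


def conv1d(signal: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """Unpadded cross-correlation of `signal` with `kernel` at the given stride."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def antialiased_strided_conv1d(
    signal: Sequence[float], kernel: Sequence[float], stride: int
) -> List[float]:
    """Downsample with average pooling first, then convolve with stride 1."""
    pooled = avg_pool1d(signal, stride) if stride > 1 else list(signal)
    return conv1d(pooled, kernel, stride=1)


# A signal alternating at the highest representable frequency.
alternating = [1.0, -1.0] * 4
pointwise = [1.0]  # 1D analogue of a 1x1 convolution

# Plain strided convolution samples only the even positions, so the
# oscillation is aliased into a constant signal.
plain = conv1d(alternating, pointwise, stride=2)

# Pooling before the convolution averages each pair, so the oscillation
# cancels out instead of aliasing.
smoothed = antialiased_strided_conv1d(alternating, pointwise, stride=2)
```

Here ``plain`` is a list of ones while ``smoothed`` is a list of zeros: both outputs have the same length, but only the pooled variant avoids mistaking the oscillation for a constant offset.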