pyiqa.archs.clipscore_arch

CLIPScore for reference-free image-caption matching.

Reference:

@inproceedings{hessel2021clipscore,
  title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
  author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
  booktitle={EMNLP},
  year={2021}
}

Reference url: https://github.com/jmhessel/clipscore

Re-implemented by: Chaofeng Chen (https://github.com/chaofengc)

Module Contents

class pyiqa.archs.clipscore_arch.CLIPScore(backbone='ViT-B/32', w=2.5, prefix='A photo depicts')

Bases: torch.nn.Module

A PyTorch module for computing image-text similarity scores using the CLIP model.

Parameters:
  • backbone (str) – The name of the CLIP model backbone to use. Default is ‘ViT-B/32’.

  • w (float) – The weight to apply to the similarity score. Default is 2.5.

  • prefix (str) – The prefix to add to each caption when computing text features. Default is ‘A photo depicts’.
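To illustrate how `prefix` and `w` enter the score, here is a minimal sketch based on the formula in the cited paper, where the score is `w * max(cos, 0)`. The helper names `build_prompts` and `scale_similarity` are hypothetical and are not part of pyiqa. Joining the prefix and caption with a single space is an assumption, and the cosine values are made up for illustration.

```python
def build_prompts(captions, prefix="A photo depicts"):
    """Prepend the prompt prefix to each caption.

    Assumption: prefix and caption are joined by a single space.
    """
    return [f"{prefix.rstrip()} {caption}" for caption in captions]


def scale_similarity(cosine, w=2.5):
    """Rescale a cosine similarity as in the CLIPScore paper: w * max(cos, 0)."""
    return w * max(cosine, 0.0)


# Hypothetical captions and cosine similarities, for illustration only.
prompts = build_prompts(["a dog on a beach", "a city at night"])
scores = [scale_similarity(c) for c in (0.32, -0.05)]
```

Negative similarities are clipped to zero before scaling, so a mismatched caption cannot produce a negative score under this formula.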

clip_model

The CLIP model used for computing image and text features.

Type:

CLIP

prefix

The prefix to add to each caption when computing text features.

Type:

str

w

The weight to apply to the similarity score.

Type:

float

forward(img, caption_list=None)

Computes the similarity score between the input image and a list of captions.

Parameters:
  • img (torch.Tensor) – Input image tensor.

  • caption_list (list of str) – List of captions to compare with the image.

Returns:

The computed similarity scores.

Return type:

torch.Tensor
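The computation that `forward` describes can be sketched in plain Python, with toy vectors standing in for CLIP features. This is an illustration under stated assumptions, not pyiqa's actual code: `clipscore_forward` and `l2_normalize` are hypothetical names, and the real module works on `torch.Tensor` features produced by the CLIP image and text encoders. The scoring rule follows the cited paper: L2-normalize both embeddings, take their dot product, clip at zero, and multiply by `w`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clipscore_forward(img_feat, caption_feats, w=2.5):
    """Score one image embedding against a list of caption embeddings."""
    img_unit = l2_normalize(img_feat)
    scores = []
    for feat in caption_feats:
        cap_unit = l2_normalize(feat)
        # Cosine similarity is the dot product of unit-length vectors.
        cosine = sum(a * b for a, b in zip(img_unit, cap_unit))
        scores.append(w * max(cosine, 0.0))
    return scores


# Toy 2-D embeddings (made up): aligned, orthogonal, and opposite captions.
img_feat = [1.0, 0.0]
caption_feats = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
scores = clipscore_forward(img_feat, caption_feats)
```

With the default `w=2.5`, scores in this sketch fall between 0 and 2.5, and a perfectly aligned caption reaches the upper bound.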