pyiqa.archs.clipscore_arch

CLIPScore for reference-free image-caption matching.

Reference:

@inproceedings{hessel2021clipscore,
  title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
  author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
  booktitle={EMNLP},
  year={2021}
}

Reference url: https://github.com/jmhessel/clipscore

Re-implemented by: Chaofeng Chen (https://github.com/chaofengc)

Module Contents

class pyiqa.archs.clipscore_arch.CLIPScore(backbone='ViT-B/32', w=2.5, prefix='A photo depicts')

Bases: torch.nn.Module

Compute CLIPScore between an image and one or more captions.

The implementation follows the original CLIPScore formulation and returns a non-negative image-text similarity score:

\[s = w \cdot \max(\cos(f_{img}, f_{txt}), 0)\]

Here f_{img} and f_{txt} are the CLIP image and text embeddings, and w is the scaling factor.

Parameters:
  • backbone (str) – CLIP backbone name accepted by clip, for example "ViT-B/32".

  • w (float) – Multiplicative scaling factor applied to cosine similarity.

  • prefix (str) – Text prefix prepended to each caption before tokenization.
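
To make the formula concrete, here is a minimal, dependency-free sketch of the score for a single pair of embedding vectors. The function name clip_score and the use of plain Python lists are assumptions made for illustration; the class itself works on batched CLIP features, and this sketch only mirrors the arithmetic.

```python
import math


def clip_score(img_feat, txt_feat, w=2.5):
    """Return w * max(cos(img_feat, txt_feat), 0) for two equal-length vectors.

    Hypothetical helper for illustration only; it is not part of pyiqa.
    """
    dot = sum(a * b for a, b in zip(img_feat, txt_feat))
    norm_img = math.sqrt(sum(a * a for a in img_feat))
    norm_txt = math.sqrt(sum(b * b for b in txt_feat))
    cos = dot / (norm_img * norm_txt)
    # Negative similarities are clamped to zero before scaling.
    return w * max(cos, 0.0)


print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical directions: 2.5
print(clip_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # opposite directions: 0.0
```

With the default w=2.5, the score therefore lies in the range [0, 2.5].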

Example

>>> metric = CLIPScore(backbone='ViT-B/32')
>>> img = torch.rand(2, 3, 224, 224)
>>> score = metric(img, ['a dog on grass', 'a city street'])
>>> score.shape
torch.Size([2])

forward(img, caption_list=None)

Compute CLIPScore for each image-caption pair.

Parameters:
  • img (torch.Tensor) – Input tensor with shape (N, 3, H, W).

  • caption_list (list[str] | None) – List of N captions, one per image, in the same order as the batch. Although the default is None, this argument is required.

Returns:

Score tensor with shape (N,).

Return type:

torch.Tensor

Raises:

AssertionError – If caption_list is not provided.
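
The caption handling described above can be sketched as a small pre-processing step. The helper name prepare_captions, the length check, and joining the prefix to the caption with a single space are all assumptions for illustration; the documentation only states that the prefix is prepended before tokenization and that a missing caption_list raises an AssertionError.

```python
def prepare_captions(caption_list, batch_size, prefix="A photo depicts"):
    """Validate captions and prepend the prefix (hypothetical helper, not pyiqa API)."""
    # Mirrors the documented behavior: a missing caption list is an error.
    assert caption_list is not None, "caption_list must be provided"
    # Assumed extra check: one caption per image in the batch.
    if len(caption_list) != batch_size:
        raise ValueError(
            f"expected {batch_size} captions, got {len(caption_list)}"
        )
    # Assumed joining rule: prefix and caption separated by one space.
    return [f"{prefix} {caption}" for caption in caption_list]


texts = prepare_captions(["a dog on grass", "a city street"], batch_size=2)
print(texts)
```

Checking the caption count up front, before any features are computed, is one way to surface a mismatched batch with a clear message.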