pyiqa.archs.clipscore_arch
CLIPScore for no-reference image-caption matching.
- Reference:
@inproceedings{hessel2021clipscore, title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning}, author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin}, booktitle={EMNLP}, year={2021} }
Reference url: https://github.com/jmhessel/clipscore
Re-implemented by: Chaofeng Chen (https://github.com/chaofengc)
Module Contents
- class pyiqa.archs.clipscore_arch.CLIPScore(backbone='ViT-B/32', w=2.5, prefix='A photo depicts')
Bases: torch.nn.Module
Compute CLIPScore between an image and one or more captions.
The implementation follows the original CLIPScore formulation and returns a non-negative image-text similarity score:
\[s = w \cdot \max(\cos(f_{img}, f_{txt}), 0)\]
- Parameters:
backbone (str) – CLIP backbone name accepted by clip, for example "ViT-B/32".
w (float) – Multiplicative scaling factor applied to cosine similarity.
prefix (str) – Text prefix prepended to each caption before tokenization.
Example
>>> import torch
>>> from pyiqa.archs.clipscore_arch import CLIPScore
>>> metric = CLIPScore(backbone='ViT-B/32')
>>> img = torch.rand(2, 3, 224, 224)
>>> score = metric(img, ['a dog on grass', 'a city street'])
>>> score.shape
torch.Size([2])
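The scoring formula can be illustrated without loading CLIP at all. The following is a minimal, dependency-free sketch that assumes the image and text feature vectors have already been extracted and are non-zero; the function name clipscore_from_features is hypothetical and not part of pyiqa.

```python
import math


def clipscore_from_features(f_img, f_txt, w=2.5):
    """Return w * max(cos(f_img, f_txt), 0) for two feature vectors.

    Hypothetical helper for illustration only; assumes both vectors
    have the same length and a non-zero norm.
    """
    dot = sum(a * b for a, b in zip(f_img, f_txt))
    norm_img = math.sqrt(sum(a * a for a in f_img))
    norm_txt = math.sqrt(sum(b * b for b in f_txt))
    cos = dot / (norm_img * norm_txt)
    # Negative similarities are clamped to zero before scaling by w.
    return w * max(cos, 0.0)


# Vectors pointing the same way reach the maximum score, w.
aligned = clipscore_from_features([1.0, 0.0], [2.0, 0.0])
# Vectors pointing in opposite directions are clamped to zero.
opposed = clipscore_from_features([1.0, 0.0], [-1.0, 0.0])
```

With the default w=2.5, the score therefore lies in the range [0, 2.5].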
- forward(img, caption_list=None)
Compute CLIPScore for each image-caption pair.
- Parameters:
img (torch.Tensor) – Input tensor with shape (N, 3, H, W).
caption_list (list[str] | None) – List of length N containing captions paired with each image.
- Returns:
Score tensor with shape (N,).
- Return type:
torch.Tensor
- Raises:
AssertionError – If caption_list is not provided.
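The caption handling described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the library's code: the helper name prepare_captions is hypothetical, joining the prefix and caption with a single space is an assumption, and the explicit length check is an added safeguard that the documentation does not mention.

```python
def prepare_captions(caption_list, n_images, prefix="A photo depicts"):
    """Validate captions and prepend the prefix (hypothetical helper)."""
    # Documented behaviour: a missing caption list raises AssertionError.
    assert caption_list is not None, "caption_list is required"
    # Assumption: each image needs exactly one caption, so lengths must match.
    if len(caption_list) != n_images:
        raise ValueError(
            f"expected {n_images} captions, got {len(caption_list)}"
        )
    # Assumption: prefix and caption are joined with a single space.
    return [f"{prefix} {caption}" for caption in caption_list]


prompts = prepare_captions(["a dog on grass", "a city street"], n_images=2)
```

The resulting strings are what would then be tokenized and encoded by the text branch of the backbone.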