pyiqa.archs.clipscore_arch
==========================

.. py:module:: pyiqa.archs.clipscore_arch

.. autoapi-nested-parse::

   CLIPScore for no-reference image-caption matching.

   Reference:
       @inproceedings{hessel2021clipscore,
       title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
       author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
       booktitle={EMNLP},
       year={2021}
       }

   Reference url: https://github.com/jmhessel/clipscore

   Re-implemented by: Chaofeng Chen (https://github.com/chaofengc)


Module Contents
---------------

.. py:class:: CLIPScore(backbone='ViT-B/32', w=2.5, prefix='A photo depicts')

   Bases: :py:obj:`torch.nn.Module`

   Compute CLIPScore between an image and one or more captions.

   The implementation follows the original CLIPScore formulation and returns a
   non-negative image-text similarity score:

   .. math::

      s = w \cdot \max(\cos(f_{img}, f_{txt}), 0)

   :param backbone: CLIP backbone name accepted by :mod:`clip`, for example
                    ``"ViT-B/32"``.
   :type backbone: str
   :param w: Multiplicative scaling factor applied to cosine similarity.
   :type w: float
   :param prefix: Text prefix prepended to each caption before tokenization.
   :type prefix: str

   .. rubric:: Example

   >>> import torch
   >>> from pyiqa.archs.clipscore_arch import CLIPScore
   >>> metric = CLIPScore(backbone='ViT-B/32')
   >>> img = torch.rand(2, 3, 224, 224)
   >>> score = metric(img, ['a dog on grass', 'a city street'])
   >>> score.shape
   torch.Size([2])


   .. py:method:: forward(img, caption_list=None)

      Compute CLIPScore for each image-caption pair.

      :param img: Input tensor with shape ``(N, 3, H, W)``.
      :type img: torch.Tensor
      :param caption_list: List of length ``N`` containing captions paired with
                           each image.
      :type caption_list: list[str] | None

      :returns: Score tensor with shape ``(N,)``.
      :rtype: torch.Tensor

      :raises AssertionError: If ``caption_list`` is not provided.
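
The score formula given for ``CLIPScore`` can be illustrated without loading a CLIP model. The sketch below is a minimal plain-Python version of :math:`s = w \cdot \max(\cos(f_{img}, f_{txt}), 0)` applied to two already-extracted feature vectors. The helper name ``clipscore_from_features`` and the toy vectors are assumptions made for illustration; they are not part of ``pyiqa``, and real features would come from the CLIP image and text encoders.

```python
import math


def clipscore_from_features(img_feat, txt_feat, w=2.5):
    """Hypothetical helper: s = w * max(cos(f_img, f_txt), 0).

    img_feat and txt_feat are plain sequences of floats standing in
    for CLIP image and text embeddings (an assumption of this sketch).
    """
    dot = sum(a * b for a, b in zip(img_feat, txt_feat))
    img_norm = math.sqrt(sum(a * a for a in img_feat))
    txt_norm = math.sqrt(sum(b * b for b in txt_feat))
    cos = dot / (img_norm * txt_norm)
    # Negative similarities are clipped to zero before scaling by w.
    return w * max(cos, 0.0)


# Toy 2-D "embeddings" chosen so the cosine is easy to check by hand.
aligned = clipscore_from_features([1.0, 0.0], [2.0, 0.0])      # cos = 1
orthogonal = clipscore_from_features([1.0, 0.0], [0.0, 3.0])   # cos = 0
opposed = clipscore_from_features([1.0, 0.0], [-1.0, 0.0])     # cos = -1
```

With the default ``w=2.5``, the clipping keeps every score in the range from 0 to 2.5: vectors pointing the same way score 2.5, while orthogonal and opposed vectors both score 0.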