pyiqa.archs.maclip_arch

Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment

@article{liao2025beyond,
  title={Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment},
  author={Liao, Zhicheng and Wu, Dongxu and Shi, Zhenshan and Mai, Sijie and Zhu, Hanwei and Zhu, Lingyu and Jiang, Yuncheng and Chen, Baoliang},
  journal={arXiv preprint arXiv:2511.09948},
  year={2025}
}

Accepted by AAAI 2026.

Module Contents

class pyiqa.archs.maclip_arch.CustomCLIP(backbone: str, device='cpu')[source]

Bases: torch.nn.Module

Thin wrapper around CLIP image/text encoders used by MACLIP.

Parameters:
  • backbone (str) – CLIP backbone identifier.

  • device (str) – Device string used when initializing the model.

forward(image, text, pos_embedding=False, text_features=None)[source]

Encode image/text and return logits and unnormalized image features.

Parameters:
  • image (torch.Tensor) – Image tensor with shape (N, 3, H, W).

  • text (torch.Tensor) – Tokenized text tensor.

  • pos_embedding (bool) – Whether to enable positional embedding branch in the custom CLIP visual encoder.

  • text_features (torch.Tensor | None) – Optional precomputed text features.

Returns:

(logits_per_image, logits_per_text, image_features_org).

Return type:

tuple[torch.Tensor, torch.Tensor, torch.Tensor]
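The returned triple can be illustrated with a minimal sketch of CLIP-style logit computation. This is not the wrapper's actual code: the function name `clip_style_logits` and the fixed `logit_scale=100.0` are assumptions for illustration, and the features are taken as already encoded.

```python
import torch


def clip_style_logits(image_features, text_features, logit_scale=100.0):
    """Hypothetical helper: CLIP-style logits plus the raw image features."""
    # Keep the unnormalized image features; their norm carries the magnitude cue.
    image_features_org = image_features
    # L2-normalize both sides so the dot product is a cosine similarity.
    img = image_features / image_features.norm(dim=-1, keepdim=True)
    txt = text_features / text_features.norm(dim=-1, keepdim=True)
    logits_per_image = logit_scale * img @ txt.t()  # shape (N, T)
    logits_per_text = logits_per_image.t()          # shape (T, N)
    return logits_per_image, logits_per_text, image_features_org
```

Returning the unnormalized features alongside the logits lets a caller compute a magnitude-based cue without re-running the image encoder.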

class pyiqa.archs.maclip_arch.MACLIP(model_type='clipiqa', backbone='RN50', pos_embedding=False)[source]

Bases: torch.nn.Module

Magnitude-Aware CLIP for no-reference image quality assessment.

Parameters:
  • model_type (str) – Output type identifier.

  • backbone (str) – CLIP backbone name.

  • pos_embedding (bool) – Whether to enable visual positional embedding in CLIP image encoding.

Notes

The current implementation runs on CUDA and is intended for inference.

preprocess(img)[source]

Normalize image and build overlapping 224x224 patch set.

Parameters:

img (torch.Tensor) – Input tensor with shape (1, 3, H, W).

Returns:

Patch tensor with shape (P, 3, 224, 224).

Return type:

torch.Tensor
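A rough sketch of this kind of preprocessing, assuming the standard OpenAI CLIP channel statistics and a hypothetical stride of 112 pixels (the stride and normalization constants actually used are not stated here). The name `preprocess_sketch` is illustrative only.

```python
import torch

# OpenAI CLIP channel statistics; assumed here, not stated in this reference.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)


def preprocess_sketch(img, patch=224, stride=112):
    """Normalize a (1, 3, H, W) image and cut it into overlapping patches."""
    if img.shape[-2] < patch or img.shape[-1] < patch:
        raise ValueError("image must be at least patch x patch pixels")
    x = (img - CLIP_MEAN.to(img)) / CLIP_STD.to(img)
    # Slide a window along height, then width: (1, 3, nH, nW, patch, patch).
    tiles = x.unfold(2, patch, stride).unfold(3, patch, stride)
    # Move the grid dimensions to the front and flatten them into P patches.
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, patch, patch)
```

With a stride smaller than the patch size, neighbouring patches overlap; in this sketch a 448x336 input yields a 3x2 grid, so P = 6.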

box_cox(x, lam=0.5, epsilon=1e-06)[source]

Apply Box-Cox-like transform after per-sample standardization.
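One plausible reading of this transform is sketched below. The classical Box-Cox transform needs positive inputs, so this sketch shifts the standardized values to be positive before applying it; that shift, the use of `epsilon`, and the name `box_cox_sketch` are assumptions, not details confirmed by this reference.

```python
import torch


def box_cox_sketch(x, lam=0.5, epsilon=1e-6):
    """Standardize x, shift it to be positive, then apply Box-Cox."""
    x = x.float()
    # Standardize over all elements of the sample (needs at least 2 values).
    z = (x - x.mean()) / (x.std() + epsilon)
    # Assumed step: shift so the minimum is epsilon, since Box-Cox needs z > 0.
    z = z - z.min() + epsilon
    if abs(lam) < 1e-12:
        return torch.log(z)          # limiting case lam -> 0
    return (z.pow(lam) - 1.0) / lam  # classical Box-Cox form
```

The transform is monotonic, so it preserves the ordering of the inputs while compressing large values when `lam` is below 1.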

fusion(cos, norm, base_cos=1.0, base_norm=0.6, alpha=1.0)[source]

Fuse cosine and magnitude cues with adaptive softmax weighting.

Parameters:
  • cos (torch.Tensor) – Cosine-similarity based quality scores.

  • norm (torch.Tensor) – Magnitude-cue scores.

  • base_cos (float) – Base weight prior for cosine cue.

  • base_norm (float) – Base weight prior for magnitude cue.

  • alpha (float) – Adaptive weight adjustment factor.

Returns:

Fused score, cosine weight, and magnitude weight.

Return type:

tuple[torch.Tensor, torch.Tensor, torch.Tensor]
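The exact weighting rule is not given here, so the following is only a sketch of what adaptive softmax fusion might look like. It assumes the per-cue logit is the base prior plus `alpha` times the cue's absolute value; that choice and the name `fusion_sketch` are hypothetical.

```python
import torch


def fusion_sketch(cos, norm, base_cos=1.0, base_norm=0.6, alpha=1.0):
    """Fuse two cue scores with softmax weights (illustrative rule only)."""
    # Assumed adaptive term: a stronger cue gets a larger logit.
    logits = torch.stack(
        [base_cos + alpha * cos.abs(), base_norm + alpha * norm.abs()], dim=-1
    )
    weights = torch.softmax(logits, dim=-1)  # the two weights sum to 1
    w_cos, w_norm = weights[..., 0], weights[..., 1]
    fused = w_cos * cos + w_norm * norm      # convex combination of the cues
    return fused, w_cos, w_norm
```

In this sketch, setting `alpha=0` disables the adaptive term, and the weights reduce to the softmax of the two base priors (about 0.60 and 0.40 for the defaults).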

forward(x, box_lam=0.5, base_cos=1.0, base_norm=0.6, alpha=1.0)[source]

Compute MACLIP score.

Parameters:
  • x (torch.Tensor) – Input image tensor with shape (1, 3, H, W).

  • box_lam (float) – Lambda for Box-Cox transform.

  • base_cos (float) – Base weight for cosine cue.

  • base_norm (float) – Base weight for magnitude cue.

  • alpha (float) – Adaptive fusion factor.

Returns:

Scalar quality score.

Return type:

torch.Tensor
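Since `forward` expects a single image and the model is intended for inference, a caller may want a small guard around it. The helper below is a sketch with a hypothetical name (`score_image`); it works with any module that returns a single-element tensor.

```python
import torch


def score_image(model, x):
    """Run an inference-only quality model on one (1, 3, H, W) image."""
    if x.dim() != 4 or x.shape[0] != 1 or x.shape[1] != 3:
        raise ValueError(f"expected shape (1, 3, H, W), got {tuple(x.shape)}")
    model.eval()
    # No gradients are needed for scoring.
    with torch.no_grad():
        score = model(x)
    return float(score)  # valid only for single-element outputs
```

With pyiqa installed and a CUDA device available, `model` would presumably be `MACLIP(backbone='RN50')` from `pyiqa.archs.maclip_arch` and `x` a tensor on the same device; this has not been verified here.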