CLIP Architecture
python
import torch
from torch.nn.functional import cross_entropy
# Embed both modalities into a shared space, one row per sample
image_features = image_encoder(images)
text_features = text_encoder(texts)
# Pairwise image-text similarity matrix of shape [N, N]
logits = image_features @ text_features.T
# The i-th image pairs with the i-th text, so the targets lie on the diagonal
labels = torch.arange(logits.shape[0])
loss = cross_entropy(logits, labels)
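The snippet above leaves the encoders and a few details implicit. The following is a minimal, dependency-free sketch of the same objective, assuming three refinements commonly described for CLIP: L2-normalized embeddings, a scale factor on the logits, and a loss averaged over the image-to-text and text-to-image directions. The helper names (`l2_normalize`, `clip_loss`), the fixed `logit_scale` (learned in the real model), and the toy embeddings are illustrative assumptions, not part of the original.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; labels[i] is the target column of row i."""
    total = 0.0
    for row, target in zip(logits, labels):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)

def clip_loss(image_features, text_features, logit_scale=1.0):
    """Symmetric contrastive loss; logit_scale is an assumed fixed scalar here."""
    img = [l2_normalize(v) for v in image_features]
    txt = [l2_normalize(v) for v in text_features]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[logit_scale * sum(a * b for a, b in zip(i, t)) for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    labels = list(range(len(img)))  # matching pairs sit on the diagonal
    return (cross_entropy(logits, labels) + cross_entropy(logits_t, labels)) / 2

# Toy embeddings standing in for encoder outputs (made-up values)
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [0.0, 1.0]], logit_scale=10.0)
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]],
                     [[0.0, 1.0], [1.0, 0.0]], logit_scale=10.0)
print(aligned < shuffled)
```

When each image lines up with its own caption the loss is close to zero; swapping the captions pushes it up, which is the signal that drives the two encoders toward a shared embedding space.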
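Zero-shot capability · Multimodal foundation
CLIP showed that a simple contrastive objective is remarkably effective at aligning images with text.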
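To make the zero-shot claim concrete, here is a small sketch of how aligned embeddings can classify an image without task-specific training: embed one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The function names and the three-dimensional embeddings are hypothetical placeholders for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the class name whose text embedding best matches the image."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(image_embedding, class_text_embeddings[name]),
    )

# Hypothetical text embeddings, e.g. for prompts like "a photo of a cat"
class_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]  # hypothetical image encoder output

prediction = zero_shot_classify(image_embedding, class_text_embeddings)
print(prediction)
```

Adding a new class only requires embedding one more prompt, which is why no retraining is needed for a new label set.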