[ Paper Review ] ALBEF (Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and
arxiv.org

0. Abstract
Large-scale vision and language representati..

2024. 7. 18.