A simple vision-encoder text-decoder architecture for multimodal tasks ...


