gonzo-обзоры ML статей

Create: 2022-05-28 Update: 2025-07-08 10:02:59

[Google CoCa] CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
Paper: https://arxiv.org/abs/2205.01917
Blog post: https://ai.googleblog.com/2022/05/image-text-pre-training-with.html
Model: none :(
Code: none :(
Reimplementation by lucidrains: https://github.com/lucidrains/CoCa-pytorch

An important recent multimodal model from Google that is worth covering.

The whole movement around what are now called foundation models keeps growing and getting more complex. The authors look at the variants of image models with natural language supervision, a category in which they also include plain classification.

It all started long ago (back when the term foundation model did not even exist) with single-encoder models pretrained on classification with a cross-entropy loss, such as the pretrained VGG and the like. Using these single-encoder models on complex tasks that need more than one modality was neither easy nor efficient.

Then came bimodal models, often built as dual-encoder models trained with a contrastive loss, for example CLIP (https://hottg.com/gonzo_ML/665) and ALIGN (https://hottg.com/gonzo_ML/679). They were more useful for various cross-modal use cases, but still not a great fit for complex vision-language tasks such as VQA, which need fused image and language representations (in these models the two are kept separate).
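To make the dual-encoder contrastive loss concrete, here is a minimal pure-Python sketch of a symmetric image-text contrastive loss over a batch of paired embeddings. The function names and the temperature value of 0.07 are illustrative assumptions, not something taken from the post; real implementations usually learn the temperature and run on tensors rather than lists.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k]).

    temperature=0.07 is an assumed placeholder value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    sim_t = [[sim[r][c] for r in range(n)] for c in range(n)]
    image_to_text = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy_row(sim_t[k], k) for k in range(n)) / n
    # Sum of both directions; some implementations average them instead.
    return image_to_text + text_to_image


pairs = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(pairs, pairs))        # aligned pairs: small loss
print(contrastive_loss(pairs, pairs[::-1]))  # mismatched pairs: large loss
```

Each image is pulled toward its own caption and pushed away from every other caption in the batch, and the same happens in the text-to-image direction.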

In parallel there was a branch of encoder-decoder models, for example with an image fed to the encoder, text fed to the decoder, and text at the decoder output. The loss is the usual one for autoregressive decoders; here it is called the captioning loss. The decoder output could be used as representations for multimodal tasks. Such models were good (SimVLM, for example, https://arxiv.org/abs/2108.10904), but they did not give separate representations for text (the text was mixed with the image encoder representations right away).
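The captioning loss is the ordinary next-token negative log-likelihood. Below is a small pure-Python sketch under the assumption that the decoder has already produced one vector of vocabulary logits per caption position (with teacher forcing, conditioned on the image and on the previous tokens). The function names and the toy inputs are hypothetical.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def captioning_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target caption tokens.

    step_logits[t] holds the vocabulary logits predicted for position t;
    target_ids[t] is the index of the correct token at that position.
    """
    assert len(step_logits) == len(target_ids)
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= log_softmax(logits)[target]
    return nll / len(target_ids)


# Toy example: vocabulary of 4 tokens, caption of 3 tokens.
uninformed = [[0.0, 0.0, 0.0, 0.0]] * 3
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
print(captioning_loss(uninformed, [0, 1, 2]))  # equals log(vocabulary size)
print(captioning_loss(confident, [0, 1, 2]))   # much smaller
```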

This work unifies the three paradigms, and the authors get a single model with the capabilities of all three approaches. The new model family is called Contrastive Captioners (CoCa). It is essentially a modified encoder-decoder architecture trained on a contrastive loss and a generative (captioning) loss at the same time.
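Training on both losses at once comes down to a weighted sum. The symbols below are not in this post; they follow the notation of the CoCa paper as best I recall it, with the two lambdas being scalar weighting hyperparameters.

```latex
\mathcal{L}_{\mathrm{CoCa}}
  = \lambda_{\mathrm{Con}} \, \mathcal{L}_{\mathrm{Con}}
  + \lambda_{\mathrm{Cap}} \, \mathcal{L}_{\mathrm{Cap}}
```

Here \mathcal{L}_{\mathrm{Con}} is the image-text contrastive loss and \mathcal{L}_{\mathrm{Cap}} is the autoregressive captioning loss, both computed from the same forward pass.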

The image encoder in CoCa is ViT (https://hottg.com/gonzo_ML/434) by default, but it can be anything. The transformer decoder is split into two parts: a unimodal decoder (it receives only text and never looks at the image embeddings; n_uni layers) and a multimodal decoder (it can do cross-attention over the image encoder embeddings; n_multi layers).
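The decoder split can be pictured as one stack of layers where only the upper part is allowed to see the image. The sketch below just builds a description of such a stack; DecoderLayer and build_decoder_layout are hypothetical names made up for illustration, and no actual attention is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderLayer:
    index: int
    part: str              # "unimodal" or "multimodal"
    cross_attention: bool  # True if the layer attends to image encoder outputs


def build_decoder_layout(n_uni, n_multi):
    """Bottom n_uni layers see text only; top n_multi layers also see the image."""
    layers = []
    for i in range(n_uni):
        layers.append(DecoderLayer(i, "unimodal", cross_attention=False))
    for i in range(n_uni, n_uni + n_multi):
        layers.append(DecoderLayer(i, "multimodal", cross_attention=True))
    return layers


layout = build_decoder_layout(n_uni=2, n_multi=2)
for layer in layout:
    print(layer)
```

The text representation for the contrastive loss is read off at the top of the unimodal part, before any cross-attention has mixed in image information.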

The unimodal image encoder and text decoder are trained through the contrastive loss (matching an image to its description), while the captioning loss is applied at the output of the multimodal decoder (its gradients also flow back into the layers below). So the image encoder yields image embeddings, the unimodal decoder yields text embeddings, and the multimodal decoder yields multimodal image-text embeddings. For each task you take what you need. Profit!

The contrastive loss is computed between the embedding of a learnable [CLS] token appended to the text and an embedding obtained from the image embeddings by a task-specific pooler (a single attention layer with n_query learnable queries Q, while K/V come from the encoder output). This task-specific pooler acts as a kind of adapter for new tasks. For the contrastive loss n_query=1, and for the generative loss n_query=256.
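A rough pure-Python sketch of such an attentional pooler follows. It is deliberately simplified: a single head, no learned projection matrices, and random vectors standing in for the learnable queries. All names and the toy sizes (dim=8, 16 encoder tokens) are assumptions for illustration only.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(queries, encoder_tokens):
    """Each query attends over encoder_tokens, which serve as keys and values.

    Returns one pooled vector per query, so the output has n_query rows.
    """
    dim = len(encoder_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in encoder_tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * v[j] for w, v in zip(weights, encoder_tokens)) for j in range(dim)]
        )
    return pooled


rng = random.Random(0)
dim, n_tokens = 8, 16
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]


def make_queries(n_query):
    # Stand-in for learnable query parameters.
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_query)]


contrastive_emb = attentional_pool(make_queries(1), tokens)    # 1 vector
captioning_embs = attentional_pool(make_queries(256), tokens)  # 256 vectors
print(len(contrastive_emb), len(captioning_embs))
```

With n_query=1 the pooler collapses the whole image into a single global vector for the contrastive loss; with n_query=256 it hands the multimodal decoder a sequence of fine-grained vectors to cross-attend to.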

The paper implements three variants of this architecture at different sizes: CoCa-Base (86M image encoder + 297M text decoder = 383M parameters), CoCa-Large (303M+484M=787M), and plain CoCa (1B+1.1B=2.1B). In all of them n_uni = n_multi; for the largest model both equal 18, and its image encoder has 40 layers. They used Google's Lingvo framework (https://github.com/tensorflow/lingvo).
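As a quick sanity check of the sizes quoted above, the snippet below adds up encoder and decoder parameter counts and shows how much of each model sits in the text decoder. The numbers are the ones from this post, in millions; writing 1B and 1.1B as 1000 and 1100 is a rounding assumption.

```python
# (image encoder, text decoder, reported total), in millions of parameters.
sizes_m = {
    "CoCa-Base": (86, 297, 383),
    "CoCa-Large": (303, 484, 787),
    "CoCa": (1000, 1100, 2100),  # 1B + 1.1B = 2.1B, rounded
}

for name, (encoder, decoder, total) in sizes_m.items():
    assert encoder + decoder == total, name
    decoder_share = decoder / total
    print(f"{name}: {total}M total, decoder share {decoder_share:.0%}")

# Largest model: n_uni = n_multi = 18 decoder layers each, 40 encoder layers.
n_uni = n_multi = 18
print("decoder layers in the largest model:", n_uni + n_multi)
```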
