Warning: mkdir(): No space left on device in /var/www/hottg/post.php on line 59

Warning: file_put_contents(aCache/aDaily/2024-05-29/post/opendatascience/--): Failed to open stream: No such file or directory in /var/www/hottg/post.php on line 72
​​OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents @Data Science by ODS.ai 🦜
TG Telegram Group & Channel
Data Science by ODS.ai 🦜 | United States America (US)
Create: Update:

​​OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

The OBELICS dataset is a game-changer in the world of machine learning and AI! Unlike existing closed-source datasets, OBELICS is a vast, open-source, web-scale dataset specially curated for training large multimodal models. Boasting 141 million web pages from Common Crawl, 353 million high-quality images, and an impressive 115 billion text tokens, OBELICS sets a new standard in the richness and diversity of training data.

But it's not just about the numbers; it's about results. To prove its mettle, models with 9 and 80 billion parameters were trained on OBELICS, showcasing competitive performance across various multimodal benchmarks. Named IDEFICS, these models outperformed or matched their closed-source counterparts, proving that OBELICS isn't just a theoretical concept—it's a practical, high-impact alternative.

Paper link: https://huggingface.co/papers/2306.16527
Model card link: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct
Blogpost link: https://huggingface.co/blog/idefics

A detailed unofficial overview of the paper:
https://andlukyane.com/blog/paper-review-obelisc

#deeplearning #cv #nlp #largelanguagemodel #opensource

​​OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

The OBELICS dataset is a game-changer in the world of machine learning and AI! Unlike existing closed-source datasets, OBELICS is a vast, open-source, web-scale dataset specially curated for training large multimodal models. Boasting 141 million web pages from Common Crawl, 353 million high-quality images, and an impressive 115 billion text tokens, OBELICS sets a new standard in the richness and diversity of training data.

But it's not just about the numbers; it's about results. To prove its mettle, models with 9 and 80 billion parameters were trained on OBELICS, showcasing competitive performance across various multimodal benchmarks. Named IDEFICS, these models outperformed or matched their closed-source counterparts, proving that OBELICS isn't just a theoretical concept—it's a practical, high-impact alternative.

Paper link: https://huggingface.co/papers/2306.16527
Model card link: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct
Blogpost link: https://huggingface.co/blog/idefics

A detailed unofficial overview of the paper:
https://andlukyane.com/blog/paper-review-obelisc

#deeplearning #cv #nlp #largelanguagemodel #opensource


>>Click here to continue<<

Data Science by ODS.ai 🦜






Share with your best friend
VIEW MORE

United States America Popular Telegram Group (US)


Fatal error: Uncaught TypeError: shuffle(): Argument #1 ($array) must be of type array, null given in /var/www/hottg/post.php:344 Stack trace: #0 /var/www/hottg/post.php(344): shuffle() #1 /var/www/hottg/route.php(63): include_once('...') #2 {main} thrown in /var/www/hottg/post.php on line 344