Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

Zhu, Wanrong; Hessel, Jack; Awadalla, Anas; Gadre, Samir Yitzhak; Dodge, Jesse; Fang, Alex; Yu, Youngjae; Schmidt, Ludwig; Wang, William Yang; Choi, Yejin

arXiv:2304.06939 (cs)

[Submitted on 14 Apr 2023 (v1), last revised 28 Oct 2023 (this version, v3)]

Abstract: In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also supports more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images+text. To date, however, large-scale data of this form have not been publicly available.
We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. Multimodal C4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (88%) of images are topically relevant, and that linear assignment frequently selects individual sentences specifically well-aligned with each image (80%). After filtering NSFW images, ads, etc., the resulting corpus consists of 101.2M documents with 571M images interleaved in 43B English tokens.
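
The linear assignment step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes an image-by-sentence similarity matrix has already been computed (for example, cosine similarity between CLIP image and sentence features), that each image is matched to a distinct sentence, and that the document is small enough for a brute-force search. The function name `assign_images` and the toy similarity values are hypothetical. At realistic sizes, a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def assign_images(sim):
    """Match each image (row) to a distinct sentence (column).

    `sim[i][j]` is an assumed precomputed similarity between image i and
    sentence j. Returns {image_index: sentence_index} maximizing the total
    similarity. Brute force, so only suitable for tiny illustrative inputs.
    """
    n_images = len(sim)
    n_sentences = len(sim[0]) if sim else 0
    if n_images > n_sentences:
        raise ValueError("this sketch needs at least one sentence per image")

    best_score, best_perm = float("-inf"), ()
    # Try every way of giving each image its own sentence.
    for perm in permutations(range(n_sentences), n_images):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return dict(enumerate(best_perm))


# Toy similarities (hypothetical values, not real CLIP scores).
toy_sim = [
    [0.90, 0.80],  # image 0 vs. sentences 0 and 1
    [0.85, 0.10],  # image 1 vs. sentences 0 and 1
]
assignment = assign_images(toy_sim)
print(assignment)
```

In this toy case a greedy pass that gives image 0 its best sentence first would leave image 1 with a poor match (total 1.00), whereas the joint assignment pairs image 0 with sentence 1 and image 1 with sentence 0 (total 1.65). That difference is the motivation for solving the placement jointly rather than image by image.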

Comments:	NeurIPS D&B 2023. Project homepage: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2304.06939 [cs.CV]
	(or arXiv:2304.06939v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2304.06939

Submission history

From: Wanrong Zhu
[v1] Fri, 14 Apr 2023 06:17:46 UTC (2,462 KB)
[v2] Fri, 9 Jun 2023 21:49:58 UTC (2,494 KB)
[v3] Sat, 28 Oct 2023 04:19:41 UTC (2,496 KB)
