VALID (Video-Audio Large Interleaved Dataset)

VALID (Video-Audio Large Interleaved Dataset) is a multimodal dataset of approximately 720,000 Creative Commons-licensed videos crawled from YouTube and processed into audio-video-text records for machine learning research. The records combine video frames, audio clips, and multilingual text, so the dataset can be used to train models that learn relationships across these modalities, for example in multimodal representation learning.
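To make the idea of an audio-video-text record concrete, the following is a minimal sketch of how such a record might be represented and filtered before training. The class name, field names, and helper functions are hypothetical and chosen for illustration only; they are not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InterleavedRecord:
    """Hypothetical record layout (assumed field names, not the real schema)."""
    video_id: str
    language: str
    text: str = ""
    frame_paths: List[str] = field(default_factory=list)
    audio_paths: List[str] = field(default_factory=list)


def modalities_present(record: InterleavedRecord) -> List[str]:
    """List the modalities that actually carry data in this record."""
    present = []
    if record.frame_paths:
        present.append("video")
    if record.audio_paths:
        present.append("audio")
    if record.text.strip():
        present.append("text")
    return present


def keep_complete(records: List[InterleavedRecord]) -> List[InterleavedRecord]:
    """Keep only records that have video frames, audio clips, and text."""
    return [r for r in records if len(modalities_present(r)) == 3]
```

A filter of this kind is one possible preprocessing step for cross-modal training, where every example is expected to carry all three modalities.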
