MixtureVitae - A Permissive, High-Performance, Open-Access Pretraining Dataset
MixtureVitae is an open-source, permissively licensed, high-quality dataset for pretraining large language models (LLMs) across a wide range of modalities, domains, and languages. Its goal is to accelerate the development of transparent, open-access AI while reducing legal uncertainty around copyright and data provenance.