open source | Ken Tsui

pretraining

Low Latency CPU Based Educational Value Classifier With Generic Educational Value

The low-latency classifier and fineweb-edu-fasttext-classifier offer a promising way to 1) filter datasets cheaply and at scale and 2) evaluate pretraining datasets at scale before pretraining. This helps researchers and practitioners with limited compute train large or small language models more efficiently.
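As a rough illustration of the two uses described above (filtering documents and scoring a whole dataset), here is a minimal sketch. It does not use the actual classifier: `toy_score` is a hypothetical stand-in that counts words from a tiny hand-picked vocabulary, and the names `filter_by_score` and `mean_score` as well as the threshold are assumptions made for this example only.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep (document, score) pairs whose score is at least `threshold`."""
    kept = []
    for doc in docs:
        score = score_fn(doc)
        if score >= threshold:
            kept.append((doc, score))
    return kept


def mean_score(docs: Iterable[str], score_fn: Callable[[str], float]) -> float:
    """Average score over a dataset, as a cheap dataset-level quality signal."""
    scores = [score_fn(doc) for doc in docs]
    return sum(scores) / len(scores) if scores else 0.0


def toy_score(text: str) -> float:
    """Hypothetical stand-in scorer (NOT the real classifier):
    fraction of words found in a small 'educational' vocabulary."""
    vocab = {"theorem", "proof", "equation", "algorithm", "experiment"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(word.strip(".,") in vocab for word in words)
    return hits / len(words)


if __name__ == "__main__":
    docs = [
        "The proof of the theorem uses an equation.",
        "buy cheap watches now",
    ]
    print(filter_by_score(docs, toy_score, threshold=0.2))
    print(mean_score(docs, toy_score))
```

In practice `score_fn` would wrap a real classifier's prediction call. Because the scorer is passed in as a plain callable, the same filtering and averaging code works with any model that maps text to a number.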

FastText Model for Pretraining Data Curation

A collection of FastText classifiers that identify high-educational-value data, maths-domain data, code-domain data, etc.

GoFormer - Language Model That Plays Go

Can GoFormer perform reasonably well just by next move (token) prediction, without MCTS?
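To make "next move prediction without MCTS" concrete, the sketch below picks the highest-scoring legal move in a single step, with no tree search or lookahead. It is an assumption-laden illustration, not GoFormer's code: `move_scores` is a hypothetical callable standing in for a language model that scores candidate move tokens given the game history, and the move strings are made up for the example.

```python
from typing import Callable, Dict, Sequence


def greedy_next_move(
    history: Sequence[str],
    move_scores: Callable[[Sequence[str]], Dict[str, float]],
    legal_moves: Sequence[str],
) -> str:
    """Return the legal move with the highest model score (no search)."""
    if not legal_moves:
        raise ValueError("no legal moves available")
    scores = move_scores(history)
    # Moves the model did not score are treated as the worst possible choice.
    return max(legal_moves, key=lambda move: scores.get(move, float("-inf")))


def toy_move_scores(history: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: fixed scores, ignoring history."""
    return {"D4": 0.5, "Q16": 0.3, "pass": 0.1}


if __name__ == "__main__":
    # "D4" has already been played, so it is excluded from the legal moves.
    print(greedy_next_move(["D4"], toy_move_scores, ["Q16", "pass"]))
```

A search-based player would instead expand several candidate moves and evaluate the resulting positions. The greedy version relies entirely on the quality of the model's single-step prediction.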

MixtureVitae - A Permissive, High-Performance, Open-Access Pretraining Dataset

MixtureVitae is an open-source, permissive, high-quality dataset designed for pretraining large language models (LLMs) across a wide variety of modalities, domains, and languages. The goal of MixtureVitae is to accelerate the development of transparent, open-access AI while lowering legal uncertainty around copyright and data provenance.

VALID (Video-Audio Large Interleaved Dataset)

VALID (Video-Audio Large Interleaved Dataset) is a multimodal dataset comprising approximately 720,000 Creative Commons-licensed videos crawled from YouTube and processed into audio-video-text data records for machine learning research. The dataset provides a unique opportunity for training models to understand relationships between modalities such as video frames, audio clips, and multilingual textual data, making it suitable for applications like multimodal representation learning.

posttraining

THE OIG DATASET

The Open Instruction Generalist (OIG) dataset is a large open source instruction dataset that currently contains ~43M instructions.

LongTalk-CoT v0.1 - A Very Long Chain-of-Thought Dataset for Reasoning Model Post-Training

A dataset designed for post-training o1-like reasoning models. Each response is generated by prompting QwQ-32B-Preview with a specifically handcrafted system message that encourages more vocalised thinking and self-reflection.

others

∞🧙🏼‍♂️AnyClassifier - Generating Synthetic Data For Text Classification

AnyClassifier is a framework that empowers you to create high-performance classifiers without any labels or data, using minimal code. It's designed to revolutionize the machine learning development process by eliminating the need for extensive data curation and labeling.