Ken Tsui

London, United Kingdom

I am a seasoned machine learning engineer with over a decade of experience in applied research and AI product development, including document intelligence and generative media. I aspire to build reliable AI models that benefit humanity.

For the past six years, my expertise has centered on language models, computer vision (detection, text recognition), data curation, synthetic data generation, and distributed training. Prior to this, I specialized in machine learning and statistical modeling for structured data, with applications in credit scoring, stress testing, and anti-attrition modeling.

As an active open-source researcher, I regularly contribute to pretraining and post-training datasets for large language models and vision-language models, as well as to reasoning benchmarks.

My earlier career as an external auditor and qualified accountant informs my rigorous, systematic approach to model evaluation and testing, ensuring robust and reliable AI solutions.

My research interests:

pretraining and post-training data curation
reasoning benchmarks, in particular self-correction and inductive reasoning
vision-language models
world models

HuggingFace
GitHub

latest posts

Jan 13, 2025	Embodied AI == Unlimited Training Data
Jan 09, 2024	Large Language Model 2023 Review and 2024 Outlook
Oct 16, 2023	Paper Digest: Think before you speak: Training Language Models With Pause Tokens

selected publications

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen, Victor May, Harsh Raj, and 14 more authors

2025

Website

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Ken Tsui

2025

Website

NumSeqBench: Benchmarking Inductive Reasoning in Language Models via Number Sequences

Ken Tsui

2025

Website