publications

publications by category in reverse chronological order. generated by jekyll-scholar.

2025

  1. MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
    Huu Nguyen, Victor May, Harsh Raj, and 14 more authors
    2025
  2. Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
    Ken Tsui
    2025
  3. NumSeqBench: Benchmarking Inductive Reasoning in Language Models via Number Sequences
    Ken Tsui
    2025

2024

  1. Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
    Taishi Nakamura, Mayank Mishra, Simone Tedeschi, and 42 more authors
    2024