TL;DR: 2023 was The Year of the Open-Source LLM; 2024 will be The Year of SLMs and Synthetic Data
What a year 2023 was! It kicked off with everyone, executives included, talking about ChatGPT. Once people realised the imperfect yet seemingly unlimited potential of LLMs for both research and business, the LLM space drew resources from everywhere else and has been growing exponentially, unlike the rest of the world.
To explain why the LLM space feels exponential, here is one anecdote: back in February 2023 I gave an hour-long talk on LLMs focusing on language model scaling laws, supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Looking back at those slides now, they already feel (mostly) out of date.
As an open-source contributor who had the opportunity to work with some brilliant independent researchers to build datasets and train models, and as an ML engineer who applies LLMs in production and faces new production challenges, I have been closer to LLMs than ever.
Same as last year, I have attempted to summarise what caught my attention in 2023, writing this article in a day without the help of any LLM (you can tell from my writing), because I trust that my first instincts, without overthinking, surface the most memorable moments of 2023.
Despite reading 200+ papers, I have surely missed quite a lot as an average person trying to keep up with an exponential trend, so feedback is welcome.
1. Open-Source LLMs
The release of Llama 2 for research and commercial use was probably a defining moment of 2023. In my opinion, it not only reinforced the culture of open-sourcing high-quality LLMs without commercial restrictions, a culture deeply rooted in the AI research community, but also showed researchers that open-source models can be on par with closed-source ones.
Of course, Hugging Face has always been the driving force behind the scenes, making sure every new model and dataset is accessible.
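To show how low the barrier has become, here is a minimal sketch of pulling an open-weight model from the Hugging Face Hub with transformers; the model id, dtype and generation settings are only illustrative.

```python
# Minimal sketch: load an open-weight chat model and generate a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # any open-weight model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit on a single GPU
    device_map="auto",           # let accelerate place the layers
)

prompt = "In one sentence, why did open-weight LLMs matter in 2023?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```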
2. Smaller Models Thanks to Quality Data
Alpaca-7B demonstrated that with 52k high-quality alignment examples and the LLaMA-7B base model, one can train a model that behaves similarly to GPT-3 text-davinci-003.
Less Is More (LIMA) showed that you do not need that much data for alignment: 1,000 carefully curated prompts and responses are good enough.
Textbooks Are All You Need trained small models from scratch on textbook-quality data and synthetically generated textbooks. phi-2 shows state-of-the-art performance among models ≤ 13B.
Orca demonstrated that a small language model can reason better by imitating the reasoning process of a larger one.
The drop of Mistral 7B was a big deal in the open-source community because, at the time of release, it was the best model of its size. Mixtral, a sparse mixture-of-experts network, marked the first time an open-source model could match and even outperform GPT-3.5.
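Since the sparse mixture-of-experts design is what makes Mixtral special, here is a minimal sketch of top-2 expert routing in PyTorch. The dimensions, expert design and routing details are simplified for illustration and are not Mixtral's actual configuration.

```python
# Minimal sketch of a sparse mixture-of-experts layer with top-2 routing:
# only 2 of the 8 expert MLPs run for any given token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=512, hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```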
3. AI Feedback Instead of Human Feedback
ChatGPT Outperforms Crowd Workers for Text-Annotation Tasks demonstrated the potential of leveraging AI feedback without compromising on quality.
It is not surprising to replace human feedback with AI feedback once an aligned and good-enough model exists: human feedback is expensive, hard to scale, and sometimes wrong due to factors like emotion or boredom.
Zephyr-7B surpassed Llama-2-Chat-70B. In particular, it uses GPT-4 preferences to rank synthetic responses, which are then used for Direct Preference Optimisation (DPO).
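For reference, here is a minimal sketch of the DPO objective applied to such preference pairs. The sequence log-probabilities are assumed to be computed elsewhere, and the numbers in the toy call are arbitrary.

```python
# Minimal sketch of the Direct Preference Optimisation loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more the policy prefers each answer
    # than the frozen reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin: push the chosen answer above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-13.0, -15.2]), torch.tensor([-13.5, -15.0]))
print(loss.item())
```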
4. Synthetic Data
With ever better language models, synthetic data generated by LLMs starts to make sense.
Alpaca-7B leveraged a language model to generate instruction data with Self-Instruct.
WizardLM introduced instruction evolution (Evol-Instruct) to turn an initial instruction into progressively more complex ones, and found that the resulting model outperforms ChatGPT when instructions get more complex.
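To make the idea concrete, here is a rough sketch of instruction evolution: repeatedly ask a strong model to rewrite an instruction into a harder one. The prompt templates and the call_llm helper are placeholders I made up, not WizardLM's exact prompts.

```python
# Rough sketch of Evol-Instruct-style instruction evolution.
import random

EVOLVE_TEMPLATES = [
    "Rewrite the following instruction so it requires an extra reasoning step:\n{instruction}",
    "Rewrite the following instruction with one more constraint added:\n{instruction}",
    "Rewrite the following instruction so the answer must include a concrete example:\n{instruction}",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def evolve(instruction: str, rounds: int = 3) -> list[str]:
    lineage = [instruction]
    for _ in range(rounds):
        template = random.choice(EVOLVE_TEMPLATES)
        instruction = call_llm(template.format(instruction=instruction))
        lineage.append(instruction)  # each round produces a harder instruction
    return lineage
```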
5. Attempts to Lengthen the Context Window
A long context window is important not only because it caters to more real-life use cases, like understanding a long document, but also because it enables the model to tackle very complex problems that require long-horizon reasoning.
NTK-Aware Scaled RoPE applied Neural Tangent Kernel (NTK) theory to interpolate the RoPE Fourier space. Without fine-tuning, it achieves low perplexity on context windows extended beyond the training length by changing only 3 lines of code.
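For the curious, here is a minimal sketch of what that change amounts to: rescale the RoPE base by the context-extension factor so that high-frequency components are mostly preserved while low-frequency ones are interpolated. The function name and the alpha handling are illustrative.

```python
# Minimal sketch of NTK-aware RoPE scaling: only the base changes.
import torch

def rope_inv_freq(dim: int, base: float = 10000.0, alpha: float = 1.0):
    if alpha > 1.0:
        base = base * alpha ** (dim / (dim - 2))  # the whole "3-line" trick
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

print(rope_inv_freq(128, alpha=4.0)[:4])  # frequencies shrink as the base grows
```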
StreamingLLM allows a model trained with a limited context window to run inference on effectively infinite sequence lengths, leveraging the finding that keeping the hidden states (KV) of the initial tokens recovers most of the performance.
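A rough sketch of that cache policy, assuming a simple per-token list of KV entries (real implementations operate on per-layer tensors):

```python
# Rough sketch of the StreamingLLM eviction policy: keep the first few
# "attention sink" tokens plus a sliding window of recent tokens.
def evict_kv(cache, n_sink=4, window=1020):
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

print(len(evict_kv(list(range(5000)))))  # 1024 entries survive, however long we generate
```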
Lost in the Middle showed that performance degrades when the relevant information sits in the middle of a long context.
Long-context prompting increased long-range recall significantly just by adding "Here is the most relevant sentence in the context:" to the prompt. Since this was a blog post rather than a research paper, the test was not extended to other models.
6. Multimodal LLMs
Many of our daily use cases that need visual cues can now be empowered; vision essentially completes the personal assistant. GPT-4V and the open-source LLaVA are great examples.
7. Prompt Engineering
There were quite a lot of research papers on prompt engineering.
Tree of Thoughts generalises the Chain-of-Thought approach by allowing the LLM to consider different reasoning paths towards the same problem. It can be elevated further with Monte Carlo tree search, the technique that empowered AlphaGo to defeat a human champion 5 games to 0.
ReAct prompting adds actions on top of thoughts/reasoning so that an LLM is not limited to its own internal representation; it can interact with the external world for accurate and up-to-date information.
Toolformer is an approach that explicitly teaches a language model to use tools such as APIs or a calculator.
With actions added, an LLM turns into an autonomous agent.
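Here is a rough sketch of what such a ReAct-style agent loop can look like. call_llm and search are placeholders for whatever model and tool you have, and the Thought/Action/Observation text conventions are simplified for illustration.

```python
# Rough sketch of a ReAct loop: the model alternates reasoning and tool use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model here")

def search(query: str) -> str:
    raise NotImplementedError("plug in a search or other tool API here")

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")       # model reasons, then acts
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search(query)}\n"  # ground the next thought
    return transcript  # fall back to the full trace if no final answer emerged
```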
8. Quantisation/Optimisation for Training
This is a last but not least point.
From float32 to bfloat16, to 8-bit, to 4-bit; from LoRA to QLoRA. This not only reduces GPU RAM requirements but also training time. It empowers talented people without access to abundant GPUs to run proofs of concept and research, so the open-source community continues to thrive.
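As an illustration, here is a minimal QLoRA-style setup with transformers, bitsandbytes and peft: load the base model in 4-bit NF4 and attach low-rank adapters. The model id and hyperparameters are examples, not recommendations.

```python
# Minimal sketch of a QLoRA setup: 4-bit base weights, trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 7B base weights
```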
In short, rather than saying 2023 was The Year of the LLM, I would say 2023 was The Year of the Open-Source LLM. LLMs have become a commodity, accessible to anyone willing to learn. Having said that, using them in production remains challenging, which is part of why I expect 2024 to revolve around the following:
1. Small Language Models (SLM, ≤2B)
I think deep learning has entered an era where everyone is starting to realise that quality data matters, just like in traditional machine learning. Quality data reduces conflicting learning objectives, avoids local minima and encourages faster convergence.
The problems are:
Out of the trillions of tokens, how much is actually useful? If only 5% is useful, a much smaller model should be able to "compress" all of the training data and learn a meaningful representation. For example, the phi-1 model is trained on only 7B tokens, while Llama 2 is trained on 2T tokens.
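A toy sketch of that "keep only the useful slice" idea, with a placeholder quality scorer standing in for whatever classifier or LLM-based grader one would actually train:

```python
# Toy sketch: rank documents by a quality score and keep the top fraction.
def quality_score(doc: str) -> float:
    raise NotImplementedError("plug in a classifier or an LLM-based grader")

def filter_corpus(docs, keep_fraction=0.05):
    ranked = sorted(docs, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(docs) * keep_fraction))]
```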
The secret recipe for a good LLM is the underlying data. As you may have noticed, everyone releases their model weights, but not everyone releases their training data.
2. Synthetic Data
It is an inevitable route. Three variables (dataset size, model size, compute) determine a model's power. Given a fixed amount of compute, you can trade off between model size and dataset size. But what if you do not have unlimited compute? You can always scale up a model, but not the data.
If the world does not contain enough high quality data, what can one do?
Generate.
If the world does not contain intelligent enough data to learn from to achieve superintelligence, what can one do?
Generate.
Generating makes sense once you have a powerful LLM paired with verifiers.
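Something like this generate-then-verify loop, where call_llm and verify are placeholders for a generator model and a domain-specific checker (unit tests, a proof checker, a judge model):

```python
# Rough sketch of synthetic data generation gated by a verifier.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your generator model here")

def verify(problem: str, solution: str) -> bool:
    raise NotImplementedError("e.g. run unit tests or check the proof")

def synthesize(problem: str, n_samples: int = 16) -> list[str]:
    candidates = [call_llm(f"Solve step by step:\n{problem}") for _ in range(n_samples)]
    return [c for c in candidates if verify(problem, c)]  # keep only verified data
```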
3. Industry-Specific Datasets & Benchmarks
The internet does not have all the data in this universe after all.
The number of use cases on this planet follows a long-tail distribution.
Opportunity goes back to industry players who have a moat: private data they can capitalise on. There is no better time than now to create business value by digitising physical documents.
One model to rule them all? Nah. Centralised LLMs will become decentralised and tailored to specific use cases.
4. LLMs as a Search Engine
Before LLMs, a search engine could only search for what already exists; now an LLM can search for something that does not exist yet. It completely changes the search space from limited to unlimited.
Searching with an LLM becomes meaningful when latency drops drastically, because one can simulate more in the same amount of time. FunSearch searches for how to solve a problem by combining an evolutionary algorithm with a language model that generates candidate solutions and a programmatic evaluator that scores them.
Search becomes scalable with small language models, and quantitative change leads to qualitative change.
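A very rough sketch of a FunSearch-like loop, with placeholder functions for the program-writing model and the deterministic evaluator; this only illustrates the idea, not DeepMind's implementation:

```python
# Rough sketch of an evolutionary search driven by an LLM proposer
# and a programmatic evaluator.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("program-writing model")

def evaluate(program: str) -> float:
    raise NotImplementedError("run the program on the problem, return a score")

def fun_search(seed_program: str, generations: int = 100, population: int = 8):
    pool = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        _, best_program = max(pool)                      # pick the current champion
        candidate = call_llm(f"Improve this program:\n{best_program}")
        pool.append((evaluate(candidate), candidate))
        pool = sorted(pool, reverse=True)[:population]   # keep only the fittest
    return max(pool)
```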
5. Reasoning Engines in Robotics
With smaller models, latency becomes low enough for robots to act and react in real time, and it becomes possible to deploy an LLM on an edge device so a robot can run offline. Numerous studies show that using a text-only LLM as a reasoning engine can significantly improve robotics performance, not to mention what multimodal LLMs can add.
6. The Quest for AGI and Superintelligence Continues
It depends on the definition of AGI or superintelligence. Let's define it as an intelligent agent that surpasses the intelligence of the most gifted human being.
Crystallised from the past trends, there are a few possible paths that, in my opinion, will inevitably be taken:
Quantity is possibly an easier path forward. Once such a system exists, the training signal can be distilled by another language model.
In both cases, there is a limit to the verification/falsifiability of knowledge: at some point no LLM can validate the generated solution. There is an exception though: some problems are quickly checkable but not quickly solvable (in NP but not in P). They will become the new testbed for LLMs.
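A tiny illustration of "quickly checkable but not quickly solvable" using subset-sum: verifying a proposed subset is a cheap polynomial check, while finding one may require searching exponentially many subsets in the worst case.

```python
# Subset-sum: checking a certificate is cheap, finding one is not.
from itertools import combinations

def verify(numbers, subset, target):        # fast: polynomial check of the certificate
    return all(x in numbers for x in subset) and sum(subset) == target

def solve(numbers, target):                 # slow: up to 2^n candidate subsets
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 9, 8, 4, 5, 7]
certificate = solve(nums, 15)
print(certificate, verify(nums, certificate, 15))  # e.g. [8, 7] True
```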
What about knowledge that is not quickly checkable? Maybe some day an LLM can prove P = NP, or the reverse.
So, will superintelligence arrive in 2024? I don't know, but the stage is set and it is just a matter of time. For most people, it is game-changing enough when an LLM can match the intelligence of an average person and you can have multiple of them working for you.
What do you think? What's waiting for us in 2024?