Motivation:

Next token prediction imposes a constraint that the no. of operation predicting the next token is limited by the no. of token seen so far.

Hypothesis:

For some task, it demands more computation than there is, in next token prediction. By introducing <pause> token, it induces more computation, and therefor improve performance.

Reasoning Type:

Deductive

Reasoning Step:

To explain what it means, I drew the below to illustrate the effect of delaying next token prediction. The graph on the left is the standard next token prediction, the next token prediction goes over 2 transformer blocks to predict “jumps”.

The graph on the right is after injecting the <pause> token. Due to the autoregressive nature of the model where the output in the previous step fed back as an input for the next step, the prediction goes through 4 transformers blocks before the model output “jumps”.

As such, by introducing <pause> token, it induces more computation without changing model size.

Testing Approach:

Two objectives were tested:

does delay token prediction improvements improvement?
when shall <pause> token be introduced, in pretraining or finetuning or both?

Decoder model of 1B and 130M was pretrained and finetuned, (probably due to computation resources because pretraining is required). Please also not that the <pause> token does not contribute into loss being optimised.

Nine datasets were used.

GSM8k — high quality grade school maths
SQuAD V1 —question answering based on Wikipedia
CommonSenseQA — questions on common sense
LAMBADA — last word prediction on long context
WebQA — open Domain questions with multi-hop nature
NaturalQA — open Domain questions based on Wikipedia
CoQA — question answering based on context
PhysicalIQA — questions on physical common sense
HellaSwag — commonsense natural language inference

Findings:

Introducing <pause> tokens in pretraining and finetuning outperforms standard pretraining and finetuning in most QA tasks except HellaSwag. Finetuning with <pause> tokens after standard pretraining has mixed result.

Adding filler token during inference time in standard pretrained and finetuned does not help performance.

For each dataset, there exists an optimal no of <pause>.

My Take:

Despite limitations such as limited testing in bigger model sizes and benchmarking performance with larger model, I personally like this paper very much because it revisits the next token prediction objective, which is quite under-researched relatively in my opinion. The authors have opened the paradigm of delayed token prediction, and challenged the implicit assumption that the no. of operation predicting the next token is limited by the no. of token seen.

Also, the author also laid out a very important open question: model parameter count vs computation pathways.

In short, the idea is simple, but it challenged the existing assumption (next token prediction) and as such open the doors to many area to researchers.

Inspired by the paper, I am extending my thoughts further:

Required Computation Complexity is Different For Each Token

Not every token prediction requires the same amount of computation.

Sentence 1: [Description of Case] . The culprit ___

Sentence 2:[Description of Case]. The culprit is ___

For example, prediction in sentence 1 because it’s bounded by grammatical rule while prediction in sentence 2 requires very long reasoning (amount of computation) because guessing a culprit requires hypothesis, search, and elimination of hypothesis.

Perhaps this is not very surprising. My previous paper review highlights hallucination snowballing. One of the causes its authors attributed into is “Inherently sequential” - transformer cannot find an answer within one time step because of limiting reasoning ability in limited tokens. Therefore, a LM cannot answer a question requiring multiple steps of reasoning if a LM is guided to answer in one step.

This is the nature of text. There is no correlation between the no of words and the amount of thought required to produce the text, because some thoughts were documented while most thoughts were not.

This also explains why different dataset has an optimal no. of <pause> for the best performance. Some datasets are just harder than others, and requires more reasoning and computation.

Future direction can be to leverage uncertainty of a language model prediction to construct a dataset that is filled with no of <pause> that corresponds to uncertainty.

Prompt Engineering As A Rescue?

Prompt engineering like Chain of Thought, few shot learning, or simply to append “Let’s think step by step” despite different intentions share one commonalities, which is to increase the amount of computation. For example, chain of thought prompts model to generate intermediate reasoning step before producing answer, as a rescue to provide more reasoning step. Few shot learning requires seeing a few demonstration before the prediction.

Prompt engineering has a limitation that it can only be done in inference time, and sometimes there could be a mismatch between pretraining objective and inference setting. There can be an attempt to pretrain a model with a dataset with COT prompting, but it’s just so difficult to ensure the whole web text is filled with reasoning.

Soft Token vs Hard Token

Another question is if reasoning step has to be in language form? To a language model, token possesses much less information than vector. In chain of thought, one could always challenge why this reasoning step is used but not others? In fact, towards the same solution, there are many possible ways of reasoning. With language, we usually go with one way, but not all the ways. It is why language model succeeded because languaged is modeled as probabilities. And probabilities means there are different ways to express to same meaning.

I have a conjecture that <pause> token serves as the soft version of chain of thought. Indeed soft token is not something new. Soft prompt was proved to boost performance just like hard prompt. Prompt Tuning is one of the examples, which prepends a series of tokens to condition model generation for specific tasks. The token is stored in embedding table.

Allocation of FLOP budget (Width vs Depth vs Model Size)

Computation pathway is an under discussed topic in the research field. There were research on how to achieve compute-optimal training by the famous Chinchilla paper, which answers the question of how to allocate model size and no. of training token given the same compute during training.

With the model size and computation pathways being orthogonal, some open questions are:

What is the optimal balance of model parameter count vs computation pathways?
Can a 7B model perform similarly to a 175B model by increasing computation depth L to get roughly equal number of floating point operations?
How will this affect the design of transformers going forward? Can we reuses transformer block over time step, like parameter sharing across time step in RNN?

Given the same parameter count, shall we prioritise less on no. of transformers block, while re-allocate more on dimension?

These are all interesting topics that I might conduct research and write a formal paper in the future if I have time.

About Paper Digest:

Paper Digest aims to digest a paper into a short summary while maintaining the essence of the scientific process behind the research. It serves as my personal reflection after understanding an academic paper.

Paper Digest: How Language Model Hallucinations Can Snowball