It’s all about query and retrieval.


For those who have been following how Natural Language Processing has evolved since 2017/2018, the Transformer and BERT are no strangers to you. Since there are already many great, detailed introductions written by Alexander Rush, Jay Alammar and, more recently, Peter Bloem, I don't intend to cover what's under the hood; instead, I am searching for the first principle and trying to draw an analogy with something easier to understand: a database and memory.

If you understand SQL queries, you can probably grasp the essence of the Transformer architecture. Even if you don't, this should not be hard to follow; I am just extracting the author and title of all NLP papers from my "paperbase":

SELECT Author, Title FROM Paperbase WHERE Category = 'NLP'

The query to the database has to flexibly accommodate all sorts of keys (Category = 'NLP') and values (Author and Title). At the same time, the keys and values have to flexibly serve whatever query a user is making. Behind SQL, this is governed by a data schema shared across query, key and value: a valid record contains all of these items so that they can be used interchangeably.
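To make the analogy concrete, here is a minimal Python sketch of this kind of exact-match retrieval. The `paperbase` records and the `sql_like_select` helper are my own illustrative stand-ins for the SQL above, not anything from the Transformer paper.

```python
# A tiny in-memory "paperbase". The records are only for illustration;
# what matters is that each one carries a key (Category) and values (Author, Title).
paperbase = [
    {"Category": "NLP", "Author": "Vaswani et al.", "Title": "Attention Is All You Need"},
    {"Category": "CV",  "Author": "He et al.",      "Title": "Deep Residual Learning for Image Recognition"},
]

def sql_like_select(records, value_fields, key_field, key_value):
    """Exact-match retrieval: SELECT value_fields FROM records WHERE key_field = key_value."""
    return [tuple(r[f] for f in value_fields) for r in records if r[key_field] == key_value]

print(sql_like_select(paperbase, ("Author", "Title"), "Category", "NLP"))
# [('Vaswani et al.', 'Attention Is All You Need')]
```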

Both are Queries in Nature

When the query is passed to the computer, it becomes a two-step process:

1. Match the query against the key of every record (here, check whether Category equals 'NLP').
2. Return the requested values (Author and Title) of the records whose keys match.

This is surprisingly similar to the attention mechanism, the backbone of the Transformer:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V (Formula 1 in Vaswani et al., 2017)
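For readers who prefer code to formulas, here is a minimal NumPy sketch of that equation; the function and variable names are mine, and the random matrices just stand in for real queries, keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # a soft, probabilistic "WHERE" clause
    return weights @ V, weights          # weighted mixture of the values

# 3 queries against 4 records (keys/values), all of dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 8) (3, 4)
```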

There are two highlights I want to make:

1. The Q, K and V in the formula play exactly the roles of the query, key and value in the SQL example above.
2. Instead of an exact match on the key, softmax(QKᵀ / √dₖ) scores every key, so what comes back is a weighted, probabilistic mixture of all the values rather than a single record.

I have drawn an analogy between SQL and the attention mechanism.

Queries Enhance Understanding

Another good thing about the attention mechanism is that we implicitly ask the machine to write the query, to use the returned result, and to integrate different returned results to form another meaningful query and result, and so on.


At the beginning, if I don't understand what CV is, I can query "CV" and all the computer vision papers will be returned, so I learn that CV stands for computer vision. Reciprocally, if I don't understand what a paper is about, querying its Title and returning its Category tells me it's a paper about NLP. Through iterative, if not exhaustive, queries, we understand the records better by building associations between them.

Likewise, in a Transformer we query every record, so that each record's (word's) associations with all other records (words) are returned. For example, if I query "paper" probabilistically, the result may be a probabilistic estimate in which "read" is the most probable neighbor. Through an exhaustive query over all words, the machine develops an understanding of the relationships among them.
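As a toy sketch of this self-attention view (the word vectors below are random, so the actual weights are meaningless; in a trained model "read" might indeed score highest), the queries, keys and values all come from the same word vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

words = ["I", "read", "the", "paper"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), 8))  # toy word vectors (random, for illustration only)

# In self-attention, queries, keys and values all come from the same records (words).
Q = K = V = X
weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)

# The row for "paper" tells us how strongly it attends to every word in the sentence.
print(dict(zip(words, weights[words.index("paper")].round(2))))
```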

A Transformer has multiple attention heads and stacks attention on top of attention, so you can imagine a Transformer as groups of smart analysts who collaboratively run advanced semantic SQL queries, iteration after iteration, to dig insight out of a very large database. When the middle-level managers receive the insights from their direct reports, they distill the findings and present them to their own managers (tougher than dual reporting), who distill them once more before passing them to the CEO.
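Here is a rough, hypothetical sketch of that picture: several heads (the analysts) each run their own attention query, and an output projection (the manager) combines their reports. A real Transformer learns all of these weight matrices and stacks many such layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, W_q, W_k, W_v, W_o):
    """Each head runs its own attention 'query'; their outputs are concatenated and mixed."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        # every "analyst" (head) has its own learned projections
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)
    return np.concatenate(heads, axis=-1) @ W_o  # the "manager" combines the reports

rng = np.random.default_rng(0)
n_heads, d_model, seq_len = 2, 8, 4
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))                                   # toy word vectors
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_model, d_model))
print(multi_head_self_attention(X, n_heads, W_q, W_k, W_v, W_o).shape)    # (4, 8)
```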

From Transformer to BERT and GPT

The famous BERT and GPT use the Transformer architecture to capture and store the semantic relationships among the words of a very large corpus. Most other state-of-the-art language models are also based on the Transformer, because this database-like structure allows better retrieval, and therefore the storage of more complicated relationships in larger amounts of data such as language.

And this database is transferable: you can think of any pre-trained model as compressed training data. Indeed, exploring these pre-trained models has become almost the de facto exercise for most practitioners.
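As a minimal sketch, assuming the Hugging Face `transformers` library and PyTorch are installed (`pip install transformers torch`), loading such a compressed database and querying it with a sentence looks like this:

```python
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is the standard pre-trained BERT checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("If you know SQL, you probably understand Transformer.", return_tensors="pt")
outputs = model(**inputs)

# One contextual vector per token: the "records" retrieved from the compressed database.
print(outputs.last_hidden_state.shape)
```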

It’s All About Query and Retrieval

A good memory (a.k.a. a database) is essential to intelligence. Every thought you have is actually a query to your brain's database, which then returns relevant language, images and knowledge. None of this is possible without effective query and retrieval. As a result, finding an efficient data structure for better query and retrieval has always been a focus of AI research, and that's why we keep getting more and more novel deep learning architectures to play with!
