GPT-3 and Commonsense Reasoning

Ratthachat Chatpatanasiri - June, 2021

As of this writing, even though the strongest language models, such as GPT-3 or Google's newly announced LaMDA, are very powerful, one of the hottest topics in the recent literature on the limitations of such models is commonsense knowledge and reasoning.

Nevertheless, most commonsense experiments in the recent literature rarely test the true ability of GPT-3, which can read a long, complex text and write a long, complex answer.



In this article, following the methods pioneered by Elemental Cognition's research team, we use the Template of Understanding to make a four-dimensional qualitative analysis and see, in action, how well GPT-3 can perform commonsense reasoning on a complex story.

Besides this article, readers who want an extensive, up-to-date background on commonsense knowledge and reasoning can take a look at Yejin Choi's ICLR 2021 tutorial, as well as the AAAI 2021 tutorial and workshop. For the bigger picture of this topic, please see Minsky's timeless book The Emotion Machine.

Note that there is also research on commonsense reasoning from visual images. In this article, however, we limit ourselves to commonsense knowledge and reasoning from text-only inputs.

= Background on Commonsense Knowledge and Reasoning =

In this section, we discuss the meaning and power of commonsense knowledge, and briefly discuss other types of human knowledge used in general reasoning.

What is Commonsense Knowledge?
According to Yejin Choi's ICLR 2021 tutorial on the topic, there is no universally accepted definition of commonsense knowledge. Here, drawing on various works in the literature, we can intuitively define commonsense knowledge as follows:

"commonsense knowledge is a 'kind-of-obvious' knowledge about (1) the world, (2) human and (3) our society where most people know or can accurately guess so that people could omit those details when they communicate"

Note that there are three kinds of commonsense mentioned above; they can also be referred to as physical, human, and social commonsense, respectively. To understand the above intuitive definition, consider the following examples.

Example 1.
Read this news headline, taken from Yejin Choi's ICLR 2021 tutorial:

"Breaking News: Cheeseburger Stabbing"

When we see a news headline like this, how can we interpret it?

- A cheeseburger stabbed someone?
- A cheeseburger stabbed another cheeseburger?
- Someone stabbed a cheeseburger?
- Someone stabbed someone else over a cheeseburger?

By using commonsense, most people eliminate the first two choices, since we know that a cheeseburger is an object, and it cannot stab anybody. This is an example of physical commonsense. The third choice should also be eliminated, since it is nonsense to stab a cheeseburger (human-nature commonsense), let alone for such an act to earn a spot in a news headline (social commonsense).

So only the last choice is left. Not only that, we can also guess that the person who did the stabbing probably felt hungry, since we also know from social commonsense that stabbing someone is bad, so people will not do it for fun or without a compelling reason, unless perhaps they lived in the stone age. (So commonsense does change as human society evolves.)

Notice that all the above commonsense arguments are intuitive to us humans. Nevertheless, none of this is obvious to a language model, since a language model learns its knowledge from human-written texts, and there are very few texts, if any, mentioning that a cheeseburger (or sashimi, or any other physical object) cannot stab (or walk, or talk, etc.). This is why commonsense knowledge is not simple for a language model or an NLP transformer in the current pretraining paradigm.

In fact, when humans communicate informally with only a few sentences, there is a lot of hidden meaning behind those sentences. By using commonsense, humans can fill in the hidden details along multiple dimensions, such as the spatial, temporal, causal and motivational information behind the sentences. To understand this power of commonsense, consider the next example.

Example 2.
Consider another inspiring example, adapted from Elemental Cognition's article: "Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned back and went to help Alice."

By reading these two sentences, we can likely answer, or at least guess, the following questions, whose answers are not written in the text at all:

- What were Alice and Elsa doing together? (causal information)
- Where should they be? (spatial information)
- Should the time of the story be during the day or at night? (temporal information)
- What was Elsa's motivation in the first sentence? (temporal + motivational information)
- What was Elsa's motivation in the second sentence? (temporal + motivational information)
- What would have happened if Alice had not fallen down? (temporal + causal information)
- etc. (we can imagine many more details than those listed above)

Maybe your guesses were something similar to "competing", "a running race", "during the day", "to win the race", "to help Alice", and "Elsa might have won", respectively. Or at least you should find that these guesses make sense.

In fact, maybe you have imagined a picture similar to Figure 1 in your mind.



It is truly amazing that we humans can guess so much information that is nowhere to be found in the original sentences.

Since the information we can imagine far exceeds the exact information written in the original sentences, commonsense has also been called the "dark matter" of intelligence: comparable to physical dark matter, which physicists believe makes up most of our universe yet is extremely difficult to observe.

Transformers vs. Commonsense
Recent experiments showed that transformers in the current pretraining paradigm usually perform poorly on commonsense tests. These works propose finetuning a transformer model on a new commonsense dataset to obtain much better commonsense performance.
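As a rough illustration of what such finetuning looks like in practice, here is a minimal sketch using the Hugging Face transformers library to finetune GPT-2 on commonsense statements serialized as plain text. The two example sentences and all training settings are placeholder assumptions, not the exact setup of those papers:

 import torch
 from transformers import (AutoTokenizer, AutoModelForCausalLM,
                           Trainer, TrainingArguments)
 
 tokenizer = AutoTokenizer.from_pretrained("gpt2")
 tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
 model = AutoModelForCausalLM.from_pretrained("gpt2")
 
 # Placeholder data: any commonsense knowledge serialized as plain sentences.
 texts = ["A cheeseburger cannot stab anyone.",
          "People help friends who fall during a race."]
 enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
 
 class TextDataset(torch.utils.data.Dataset):
     def __len__(self):
         return enc["input_ids"].size(0)
     def __getitem__(self, i):
         # For causal-LM finetuning, the labels are the input ids themselves.
         return {"input_ids": enc["input_ids"][i],
                 "attention_mask": enc["attention_mask"][i],
                 "labels": enc["input_ids"][i]}
 
 trainer = Trainer(model=model,
                   train_dataset=TextDataset(),
                   args=TrainingArguments(output_dir="out", num_train_epochs=1))
 trainer.train()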

However, the aforementioned experiments were mostly conducted on transformers (e.g. GPT-2) whose capacities and training corpora are much smaller than GPT-3's. The full version of GPT-3 (named Davinci) has a pretraining corpus approximately 10 times larger, and approximately 100 times more parameters, than its predecessor GPT-2 and the other transformers in those experiments. Therefore, by reading much more text and storing all that knowledge in its parameters, GPT-3 has much more potential to understand our world, including commonsense, than smaller transformers.

To the best of our knowledge, only one commonsense experiment has been run on GPT-3: an experiment on the ATOMIC 2020 dataset (from the COMET-ATOMIC 2020 work). This experiment asked GPT-3 to complete short commonsense relations in formats such as:


 * PersonX accepts PersonY's apology xIntent [GEN] to feel peaceful with PersonY [SEP]
 * PersonX accepts PersonY's apology HinderedBy [GEN] PersonY has not apologized [SEP]

where xIntent and HinderedBy are two examples of the 23 encoded commonsense relations in the ATOMIC 2020 dataset, meaning "because PersonX wanted" and "can be hindered by", respectively. [GEN] and [SEP] are special tokens marking the beginning and end, respectively, of the consequence that a transformer should generate.

In general, the training data has the following format: <head event> <relation> [GEN] <tail event> [SEP]

With 5-shot prompting (five prompted examples for each generation), the paper reported a considerably good 73% human-acceptance rate on text generated by GPT-3, whereas GPT-2 achieved only a 37% acceptance rate. An encoder-decoder transformer trained directly on the dataset achieved the best performance of around 84.5%.
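For concreteness, here is a minimal sketch of how such a few-shot prompt could be assembled and sent to GPT-3 through the OpenAI completion API of that time. The prompt layout and decoding settings are our own assumptions, not necessarily those used in the paper:

 import openai  # assumes openai.api_key has been set
 
 # Few-shot examples in the encoded format: head relation [GEN] tail [SEP]
 FEW_SHOT = [
     "PersonX accepts PersonY's apology xIntent [GEN] to feel peaceful with PersonY [SEP]",
     "PersonX accepts PersonY's apology HinderedBy [GEN] PersonY has not apologized [SEP]",
     # ... the paper used 5 examples per relation
 ]
 
 def complete_relation(head, relation):
     """Ask GPT-3 to generate the tail for a new (head, relation) pair."""
     prompt = "\n".join(FEW_SHOT) + f"\n{head} {relation} [GEN]"
     resp = openai.Completion.create(
         engine="davinci",   # the full-size GPT-3 model
         prompt=prompt,
         max_tokens=24,
         temperature=0.7,
         stop=["[SEP]"],     # stop once the tail is finished
     )
     return resp.choices[0].text.strip()
 
 print(complete_relation("PersonX loses PersonX's wallet", "xIntent"))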

However, since GPT-3 learned mostly from natural human texts and is able to reason over long, high-quality text inputs, it is possible that the above experiment, which uses a short and unnatural encoded format, does not reveal the true ability of GPT-3.

Therefore, in this article, using the Template of Understanding, we would like to see how GPT-3 reasons, in action, in a more complicated situation with high-quality, detailed prompted examples.

Other Types of Knowledge
For completeness, this subsection briefly discusses other types of knowledge that we humans use in real-life, everyday reasoning.

Elementary Factoid Knowledge
Commonsense knowledge is not the only simple knowledge that most people know. It is important to also differentiate commonsense from elementary factoid knowledge, e.g.:

- Germany is a country
- French is the language of the people in France
- Thailand is a country in Asia
- Humans usually have 5 fingers on each hand (with rare exceptions)
- A dog is an animal
- etc.

Unlike commonsense knowledge, these factoids are usually found directly in human texts (e.g. Wikipedia, web pages, textbooks, etc.), so a language model typically knows a lot of factoids already.

In fact, there is no clear boundary between commonsense and simple factoid knowledge.

Complex Knowledge
There is also complex factoid knowledge which requires multi-hop reasoning, e.g. "Who is the fifth-youngest president in the history of the United States?". Even though all the information needed to answer this question can be found in the training texts, this kind of knowledge is difficult for a language model as well.

Similarly, conceptual knowledge such as theories of physics, biology, or mathematical finance can easily be found in textbooks or on websites, but a language model simply cannot calculate an option price using the Black-Scholes formula the way a human financial expert can, as illustrated below.
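To make concrete what "calculate" means here, below is the standard Black-Scholes price of a European call option written as a short Python function; the numbers in the final line are arbitrary illustration values:

 from math import log, sqrt, exp
 from statistics import NormalDist
 
 def black_scholes_call(S, K, T, r, sigma):
     # S: spot price, K: strike, T: years to expiry,
     # r: risk-free rate, sigma: annualized volatility
     d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
     d2 = d1 - sigma * sqrt(T)
     N = NormalDist().cdf  # standard normal CDF
     return S * N(d1) - K * exp(-r * T) * N(d2)
 
 # Price a 1-year call: spot 100, strike 105, 1% rate, 20% volatility
 print(black_scholes_call(100, 105, 1.0, 0.01, 0.20))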

Human Reasoning with All Knowledge
In reality, humans combine all the knowledge types mentioned above to think, reason and make decisions in a given real-world context. Scientists in the fields of behavioral economics, cognitive science and psychology use the terms mental model (System 1) and conceptual model (System 2):



- System 1, or the Mental Model: a simple and intuitive model in our heads. This is where we use both commonsense and simple factoid knowledge. Reasoning with this mental model is fast and used in daily routines, e.g. when we want to have breakfast or drive a car. The model is considerably accurate in the context of normal routines, but may not work well when the context changes, e.g. when we have just moved to a new town and want to have breakfast or drive a car.

- System 2, or the Conceptual Model: a complex model that makes our species distinct, e.g. logic, mathematics, science or economic theory.

Humans combine the use of both System 1 and System 2 naturally, as shown in Figure 2.

As mentioned before, even though mental-model reasoning using commonsense is very simple for us humans, it is a grand research challenge for language models and for the field of artificial intelligence in general. This is because such knowledge is so obvious that people rarely write it down, so there is little training data for it.

Hence, the main objective of this article is to examine the commonsense ability of one of the most advanced language models, GPT-3.

= Testing Commonsense: Template of Understanding =

As described in the previous section, by using a mental model and commonsense, humans are able to create an imagined world beyond the communicated input (a few text sentences in our case). The Template of Understanding (ToU) is a machine-understanding testing framework pioneered by Elemental Cognition's research team, in turn inspired by cognitive-science findings on how humans understand and reason about narrative stories.

Following the prior works cited above, we are mainly interested in applying the ToU to a narrative-story understanding task: given a short story paragraph, GPT-3 has to understand the story along 7+1 reasoning dimensions, consisting of 7 basis dimensions (characters' roles, characters' thoughts, possessions and their properties, spatial, causal and counterfactual reasoning) and a temporal dimension that applies to all 7 basis dimensions.
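To make the test concrete, each dimension can be turned into a probing question attached to the story. The phrasings below are our own illustrative choices, not an official ToU specification:

 # Illustrative probing questions, one per ToU reasoning dimension.
 TOU_QUESTIONS = {
     "roles":          "What role does each character play in the story?",
     "thoughts":       "What is each character thinking or feeling, and why?",
     "possessions":    "What objects do the characters have, and what are their properties?",
     "spatial":        "Where does each event take place?",
     "causal":         "Why did each event happen?",
     "counterfactual": "What would have happened if a key event had not occurred?",
     "temporal":       "In what order do events happen, and how do the answers above change over time?",
 }
 
 def build_tou_prompts(story):
     """Pair a story with one probing question per dimension."""
     return {dim: f"Story: {story}\n\nQuestion: {q}\nAnswer:"
             for dim, q in TOU_QUESTIONS.items()}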



= GPT-3 and Commonsense Reasoning =

= External References =