GPT-3 and Commonsense Reasoning

Ratthachat Chatpatanasiri - July, 2021 (Draft - Work in Progress)

As of this writing, even though the strongest language models, such as GPT-3 or Google's newly announced LaMDA, are very powerful, one of the hottest topics in the recent literature on the limitations of such models is commonsense knowledge and reasoning.

Nevertheless, most commonsense experiments in the recent literature rarely test the true ability of GPT-3, which is able to read a long, complex text and write a long, complex answer.



In this article, following the methods pioneered by Elemental Cognition's research team and using the Template of Understanding shown in Figure X, we are going to see in action how well GPT-3 can perform commonsense reasoning on a complex story.

Besides this article, readers who want an extensive, up-to-date background on commonsense knowledge and reasoning can take a look at Yejin Choi's ICLR 2021 tutorial, as well as the AAAI 2021 tutorial and workshop. For the bigger picture of this topic, please see Minsky's timeless book The Emotion Machine.

Note that, for compactness of presentation, this blog article uses many multi-tab displays and may not be mobile-friendly. Readers are advised to switch to the desktop version in the browser menu.

= Background on Commonsense Knowledge and Reasoning =

In this section, we discuss the meaning and power of commonsense knowledge, and briefly discuss other types of human knowledge used in general reasoning.

What is Commonsense Knowledge?
According to Yejin Choi's ICLR 2021 tutorial on the topic, there is no universally accepted definition of commonsense knowledge. Here, drawing from various works, we can intuitively define commonsense knowledge as follows:

Commonsense knowledge is "kind-of-obvious" knowledge about (1) the world, (2) humans and (3) our society, which most people know or can accurately guess, so that people can omit those details when they communicate.

Note that there are three kinds of commonsense mentioned above; they can also be referred to as physical, human and social commonsense, respectively. To understand the above intuitive definition, consider the following examples.

Examples
 Read this news headline, taken from Yejin Choi's ICLR 2021 tutorial:

Breaking News: Cheeseburger Stabbing

When we see a news headline like this, how can we interpret it?

 * A cheeseburger stabbed someone?
 * A cheeseburger stabbed another cheeseburger?
 * Someone stabbed a cheeseburger?
 * Someone stabbed someone else over a cheeseburger?

By using commonsense, most people eliminate the first two choices, since we know that a cheeseburger is an object and cannot stab anybody. This is an example of physical commonsense. The third choice should also be eliminated, since it is nonsensical to stab a cheeseburger (human-nature commonsense), let alone for it to make a news headline (social commonsense).

So only the last choice is left. Not only that, we can also guess that the person who stabbed the other probably felt hungry, since we know from social commonsense that stabbing someone is bad, so people won't do it for fun or without a compelling reason, unless they live in the stone age. (So commonsense does change as human society evolves.)

Notice that all the above commonsense arguments are intuitive to us humans. Nevertheless, they are not obvious to a language model at all, since a language model learns knowledge from human texts, and there are very few texts, if any, mentioning that a cheeseburger (or sashimi, or other physical objects) cannot stab (or walk, or talk, etc.). This is why commonsense knowledge is not simple for a language model or an NLP transformer under the current pretraining paradigm.

In fact, when humans informally communicate with only a few sentences, there is a lot of hidden meaning behind those sentences. By using commonsense, humans can fill in the hidden details along multiple dimensions, such as the spatial, temporal, causal and motivational information behind the sentences. To understand this power of commonsense, consider another inspiring example, adapted from Elemental Cognition's article: Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happen.

By reading these two sentences, we can likely answer or guess the following questions, which are not written in the text at all:

 * What were Alice and Elsa doing together? (causal information)
 * Where should they be? (spatial information)
 * Should the time of the story be day or night? (temporal information)
 * What was Elsa's motivation in the first sentence? (temporal + motivational information)
 * What was Elsa's motivation in the second sentence? (temporal + motivational information)
 * What would have happened if Alice had not fallen down? (temporal + causal information)
 * etc. (we can imagine many more details than those listed above)

Maybe your guesses were something similar to "competing", "a running race", "day", "to win a race", "to help Alice" and "Elsa might win", respectively. Or at least you should agree that the above guesses make sense.

In fact, maybe you have imagined a picture similar to Figure 1 in your mind.



So we can see that it is truly amazing that we humans can guess a lot of information found nowhere in the original sentences.

Since there is much more information that we can imagine beyond what is explicitly written in the original sentences, commonsense is also called the dark matter of intelligence, analogous to physical dark matter: physicists believe that our universe consists mostly of dark matter, yet it is extremely difficult to observe.

Transformers vs. Commonsense
Recent experiments showed that transformers under the current pretraining paradigm usually perform poorly on commonsense tests. These works propose finetuning a transformer model on a new commonsense dataset to obtain much better commonsense performance.

However, the aforementioned experiments were mostly run on transformers (e.g. GPT-2) whose capacity and training texts are much smaller than GPT-3's. The full version of GPT-3 (named Davinci) has a pretraining corpus and a parameter count approximately 10 times and 100 times larger, respectively, than its predecessor GPT-2 and the other transformers in those experiments. Therefore, by reading much larger corpora and storing all that knowledge in its parameters, GPT-3 has much more potential to understand our world, including commonsense, than smaller transformers.

To the best of our knowledge, only one commonsense experiment has been run on GPT-3: an experiment on the ATOMIC-2020 dataset (from the COMET-ATOMIC 2020 work). This experiment asked GPT-3 to complete short encoded commonsense relations in formats such as:


 * PersonX accepts PersonY's apology xIntent [GEN] To feel peaceful with PersonY [SEP]
 * PersonX accepts PersonY's apology HinderedBy [GEN] PersonY has not apologized [SEP]

where xIntent and HinderedBy are two examples of the 23 encoded commonsense relations in the ATOMIC-2020 dataset, meaning "because PersonX wanted" and "can be hindered by", respectively. [GEN] and [SEP] are special tokens indicating the beginning and end, respectively, of the inference that a transformer should generate.

In general, the training data has the following format: {head event} {relation} [GEN] {inference} [SEP].
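For concreteness, this encoding can be sketched as a tiny helper (a hypothetical illustration of the format described above, not code from the dataset's authors):

```python
# Encode one ATOMIC-2020-style training line:
# head event, relation, [GEN], inference, [SEP].
def encode_example(head: str, relation: str, inference: str) -> str:
    # Join the pieces in the order the format expects.
    return f"{head} {relation} [GEN] {inference} [SEP]"

line = encode_example(
    "PersonX accepts PersonY's apology",
    "xIntent",
    "To feel peaceful with PersonY",
)
print(line)
```

Running this reproduces the first example line above.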

By giving 5-shot prompted examples for each generation, the paper reported a considerably good 73% human-acceptance rate on text generated by GPT-3, whereas GPT-2 achieved only a 37% acceptance rate. An encoder-decoder transformer trained directly on the dataset achieved the best performance, around 84.5%.

However, since GPT-3 learned mostly from natural human texts and is able to reason over long, high-quality text inputs, it is possible that the above experiment, which uses a short, unnatural encoded format, does not reveal the true ability of GPT-3.

Therefore, in this article, using the Template of Understanding, we would like to see how GPT-3 reasons in action on more complicated situations, with high-quality, detailed prompted examples.

Other Types of Knowledge
For completeness, this subsection briefly discusses other types of knowledge that we humans use in real-life everyday reasoning.

<tab name="Elementary Factoid knowledge"> Commonsense knowledge is not the only simple knowledge that most people know. Another type of simple knowledge is elementary factoid knowledge, e.g.

 * Germany is a country
 * French is the language of people in France
 * Thailand is a country in Asia
 * Humans usually have 5 fingers on each hand (with rare exceptions)
 * A dog is an animal
 * etc.

Unlike the commonsense knowledge mentioned in the previous section, these factoids are usually found in human texts (e.g. Wikipedia, web pages, textbooks, etc.), so a language model usually knows a lot of factoids already.

Note that there is no clear boundary between commonsense and simple factoid knowledge, and some researchers may also consider factoid knowledge a subtype of commonsense.

<tab name="Complex knowledge"> There is also complex factoid knowledge that requires multi-hop reasoning, e.g. "Who is the fifth-youngest president in the history of the United States?". Even though all the information needed to answer this question can be found in the training texts, this kind of knowledge is difficult for a language model as well.

Similarly, conceptual knowledge such as theories of physics, biology or mathematical finance can easily be found in textbooks and websites, but a language model simply cannot calculate an option price using the Black-Scholes formula like a human financial expert.

<tab name="Human Reasoning with all Knowledge"> In reality, humans combine all the knowledge types mentioned above to think, reason and make decisions about a given real-world context. Scientists in behavioral economics, cognitive science and psychology use the terms mental model (System 1) and conceptual model (System 2):



- System-1 or Mental Model: a simple and intuitive model in our head. This is where we use both commonsense and simple factoid knowledge. Reasoning with this mental model is fast and is used in daily routines, e.g. when we want to have breakfast or drive a car. The model is reasonably accurate in the context of normal routines, but may not work well when the context changes, e.g. when we have just moved to a new town and want to have breakfast or drive a car.

- System-2 or Conceptual Model: a complex model that makes our species distinct, e.g. logic, mathematics, science or economic theory.

Humans combine the usage of System-1 and System-2 naturally, as shown in Figure 2.

As mentioned before, even though mental-model reasoning with commonsense is very simple for us humans, it is a grand research challenge for a language model, and for the field of artificial intelligence in general. This is because there is not much training data for this kind of "too obvious" knowledge.

Hence, the main objective of this article is to focus on the commonsense ability of one of the most advanced language models, GPT-3.

= Testing Commonsense : Template of Understanding =

As described in the previous section, by using a mental model and commonsense, humans are able to create an imagined world beyond the communicated input (a few text sentences, in our case). Then humans use that imagined world model to infer and reason about things not written in the text.

The Template of Understanding (ToU) is a machine-understanding testing framework pioneered by Elemental Cognition's research team, which in turn is inspired by cognitive-science research on how humans understand and reason about a narrative story.

ToU asks a language model such as GPT-3 to answer various fundamental and common reasoning questions about a given short-story paragraph, where most of the answers usually cannot be found in the input text.

Therefore, if a language model such as GPT-3 can give a good answer along each fundamental reasoning dimension, we may be able to say that the model really has high-quality commonsense and a mental model similar to a human's. On the other hand, if the model fails to reason along some particular dimensions, researchers will be able to systematically identify the knowledge dimensions the model lacks.

8+1 Reasoning Dimensions
In this article, besides the original ToU proposal, we follow other works on commonsense reasoning and emphasize 8 basic reasoning dimensions plus 1 temporal dimension; see Table 1 and also Figure X.

Note that the temporal dimension can be analyzed together with most of the other basic reasoning dimensions, so we separate it from them.

For example, consider Example 2 in the first section: the characters' roles change with time. During the running competition, Alice and Elsa were competitors, but before and after that event, by commonsense prediction, they were likely friends (even though we cannot say for sure). At the end of this section, two concrete examples covering all the reasoning dimensions are illustrated.

Also note that, due to GPT-3's token limit and to save space, we have to group several similar questions together, e.g. in dimension 1 we ask at the same time who the notable characters are and what their roles and status are. Multiple questions are challenging for the model, but they also allow the model to write about many different aspects and show what it really knows.

The temporal dimension can be added to each basic dimension, i.e. when applicable, we ask what should happen in that dimension before and after the story, to further test abductive and deductive skills. Counterfactual reasoning, which is emphasized in previous literature, is included in dimension 8.
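To keep the dimensions straight, they can be written down as a simple lookup table (the short labels below paraphrase the analysis headings used in the prompts of this article; this is our own bookkeeping, not official ToU terminology):

```python
# The 8 basic ToU reasoning dimensions used in this article, plus the
# temporal overlay that can be applied to most of them.
DIMENSIONS = {
    1: "notable characters, their roles and status",
    2: "beliefs, feelings and motivations of characters",
    3: "notable possessions of characters",
    4: "basic usages and general knowledge of notable possessions",
    5: "probable locations, positions and daytime",
    6: "story genre, location and time period",
    7: "probable events before and after the story",
    8: "interesting event and counterfactual hypotheses",
}
TEMPORAL = "before / during / after the story (asked per dimension where applicable)"
print(len(DIMENSIONS))
```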

Examples : 2-Shots ToU Prompts Given to GPT-3
GPT-3 is well known to work best in a few-shot setting, where the user provides a few high-quality examples as a prompt before letting GPT-3 write an output. Therefore, we provide the following 2-shot prompt, which is the same for each of the 10 narrative stories.
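The assembly of such a 2-shot prompt can be sketched as follows (an assumed structure for illustration; the exact whitespace and headers we actually used may differ):

```python
# Build a few-shot ToU prompt: each shot is a (story, analysis) pair, and the
# new story ends with an empty "Analysis:" header for GPT-3 to complete.
def build_prompt(shots, new_story):
    parts = []
    for story, analysis in shots:
        parts.append(f"{story}\n\nAnalysis:\n\n{analysis}")
    parts.append(f"{new_story}\n\nAnalysis:\n")
    return "\n\n".join(parts)

prompt = build_prompt(
    [("Alice and Elsa were running toward the finish line. ...",
      "1) Notable characters, their roles and status: ...")],
    "A man called his son and daughter the day before Christmas. ...",
)
print(prompt.endswith("Analysis:\n"))
```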

These two examples illustrate how the model should reason along each ToU dimension.

<tab name="Prompted Story 1 - A Race"> Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happen.

Analysis:

1) Notable characters, their roles and status: Elsa and Alice are likely students and friends, because there are a teacher and friends in the context. By their names, they both should be girls. When they competed in the race, they were competitors and runners. The teacher was probably their referee.

2) Notable beliefs, feelings and motivations of characters before, during and after the story: Before the story, both Elsa and Alice wanted to win the race and should have felt a bit excited. In the story, when something happened to Alice, Elsa believed that Alice was injured. Elsa, and everybody else, felt concern for Alice. Then Elsa wanted to help Alice instead of going on to win.

3) Notable possessions of characters before, during and after the story: Throughout the story, each runner should have a pair of running shoes and a running uniform. The teacher might have a whistle and perhaps a stopwatch.

4) Basic usages and general knowledge on some notable possessions: People can use a pair of running shoes for running or just wear them casually. A pair of running shoes typically costs $10 to $100. Famous and expensive brands include Nike, Adidas, Asics and New Balance.

5) Probable locations, positions and daytime of characters before, during and after the story: They probably stayed at a sport stadium, on a running track at their school, throughout the story. It should be daytime in the story, since classes are rarely at night. Before the race started, Elsa and Alice should have stood at the same starting point, while the teacher and friends stayed near the track. Shortly after the story, Elsa, Alice and everybody should stay close together to check Alice's condition.

6) Guess story genre, and general information about location and time period: The story was quite ordinary, so it could be non-fantasy or realistic fiction, maybe with a bit of drama. Since it looks peaceful, it might be set in a country not at war. The event might have taken place after the 1900s, when the sport became popular, and more probably after the 1950s, when WW-II had already ended.

7) Probable events before and after the story: Before the story, it might have been time for PE class for Elsa and Alice, so they should have changed into uniforms for the class. After the story, if Alice was seriously hurt, she might have been brought to a hospital; otherwise, Alice might just take a rest.

8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if: The interesting part was when Alice fell down. She might have tripped over a stone or been injured somewhere. The event would not have happened if Alice had been perfectly healthy, had slept well, and there had been no stones on the track.

<tab name="Prompted Story 2 - A Christmas Joke"> A man called his son and daughter the day before Christmas and said he and their mom were going to divorce. The son and daughter were hurry to go back home to stop their parents. The old man turned to his wife and said "they're coming for Christmas now"

Analysis:

1) Notable characters, their roles and status: A family of dad, mom, son and daughter. Their family status looks very healthy.

2) Notable beliefs, feelings and motivations of characters before, during and after the story: Before the story, dad believed that his children would not come home for Christmas, so he might have felt lonely and was motivated to trick the children into coming home. At the end, dad believed that the children would come back home and might have been happy. The children would have believed the family was healthy before the story. In the story, they felt worried about the parents' divorce, and that motivated them to go back home. After the story, the children would initially be angry on learning that they had been tricked, but would eventually be happy to be back with their parents.

3) Notable possessions of characters before, during and after the story: Dad and the children had phones, which could be either landline or mobile. All family members also belonged to each other in some sense.

4) Basic usages and general knowledge on some notable possessions: An average landline phone or mobile phone may cost around $100, but a mobile phone's price can be as high as $2000. Since the arrival of smartphones, a mobile phone can be used just like a small computer, while landline phones are becoming obsolete.

5) Probable locations, positions and daytime of characters before, during and after the story: Before and in the story, the parents and children likely stayed in different cities, or far enough apart that the children would sometimes not come home for Christmas. After the story, all of them would be at the parents' home. The story could happen during the day or at night, but not during working hours.

6) Guess story genre, and general information about location and time period: The genre should be realistic fiction and comedy. The story likely occurred in Europe or North America, where most people are Christian, so that Christmas Day is very important. The story had to occur after phones became common in households and not in wartime, which would be after the 1980s.

7) Probable events before and after the story: Before the story, dad and mom would have talked about the possibility that the children would not come home, so they thought up a fake divorce plan. After the story, the children would be home for Christmas and the family should spend a great time together.

8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if: The interesting part of the story was when dad happily revealed the truth that he had tricked his children. It would have turned out another way if the children had not cared about the divorce and had not come back home no matter what.

= GPT-3 and Commonsense Reasoning in Action =

Here, we are going to have fun with commonsense reasoning on many narrative stories of different genres, in the 2-shot setting described in the previous section.

To score a generated reasoning with ToU, for each dimension we evaluate with two metrics: Relevancy and Quality.

Relevancy determines whether the model is able to extract or identify the most interesting pieces of information with respect to the story and the question. When the model does not get a full score, a comment will be provided.

Quality determines the accuracy, sensibility and plausibility of the given reasoning.

Consider Example 2 of Elsa and Alice in the first section.

 * If the model mentions all the interesting / essential events but its reasoning makes no sense, it will get a high relevancy score but low quality. For example, in the counterfactual analysis it may correctly reason that the most interesting event is when "Alice fell down", yet hypothesize a low-quality counterfactual like "the event may not occur if Alice did not fall down", in contrast to a higher-quality counterfactual like "the event may not occur if Alice was careful enough not to trip on the stone, or if Alice was perfectly healthy".
 * On the other hand, if the model mentions uninteresting pieces of information but with good logic, it may get a low relevancy score but high quality. For example, it may reason that the most interesting event is when "a teacher and friends went to see what happen", which is not really essential to the story, yet provide an interesting hypothesis like "they will not go to Alice if Alice, with a fighting spirit, shouts out to everyone to let her finish the race by herself".

Following GLUCOSE, these two metrics are scored from 0 to 3, with the intuitive interpretation:


 * 0 = unacceptable
 * 1 = low quality
 * 2 = mostly acceptable
 * 3 = great
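As a sketch of the bookkeeping (our own hypothetical helper, not part of GLUCOSE or ToU), per-dimension scores can be recorded and averaged like this:

```python
# Record GLUCOSE-style 0-3 scores per ToU dimension and compute the averages.
SCALE = {0: "unacceptable", 1: "low quality", 2: "mostly acceptable", 3: "great"}

def mean_scores(scores):
    # scores: {dimension: (relevancy, quality)}, each value in 0..3.
    rel = sum(r for r, _ in scores.values()) / len(scores)
    qua = sum(q for _, q in scores.values()) / len(scores)
    return rel, qua

demo = {1: (3, 2), 2: (2, 2), 8: (3, 1)}  # hypothetical scores for dims 1, 2, 8
print(mean_scores(demo))
```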

Commonsense Reasoning in Action on Various Story Genres
Following the two shots given above, here we have GPT-3 perform similar commonsense reasoning on various stories of different genres, e.g. everyday life, biography, comedy, science and historical fiction, mystery and fantasy. It is very interesting to see how GPT-3 handles all these genres. For each story, we use GPT-3 to generate the commonsense analysis 3 times, with one labeled as Best for easy reference.

GPT-3 generates text one token at a time with random sampling. For each reasoning output, a color indicates the probability of each token: green indicates likely tokens, while red indicates unlikely tokens. Low-probability red tokens are very interesting, since the generated reasoning could go another way if GPT-3 randomly chose another candidate token. In the next section, we make some observations regarding low-probability tokens.
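The green/red display can be mimicked by bucketing each token's sampling probability (the thresholds here are our own assumption for illustration, not the actual cutoffs of the display):

```python
# Map a token's probability to a display color: likely tokens are green,
# unlikely (easily resampled) tokens are red.
def color_of(prob: float) -> str:
    if prob >= 0.5:
        return "green"
    if prob >= 0.1:
        return "yellow"
    return "red"  # a different random sample could easily diverge here

tokens = [("confident", 0.04), ("the", 0.92)]
print([(t, color_of(p)) for t, p in tokens])
```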

Note that GPT-3 has a limitation: the combined number of reading and writing tokens can be no more than 2048. Our 2-shot prompt plus a tested story is roughly 1300 tokens, so GPT-3 has around 750 tokens left to write for each ToU test.

This 2048-token limitation is also the reason that we limit ourselves to testing GPT-3 on only 8+1 dimensions. Nevertheless, these 8+1 reasoning dimensions are the most fundamental, to the best of our knowledge, as explained in the previous section.
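The token budget above is simple arithmetic (2048 is the stated context limit; 1300 is our rough measurement of the prompt length):

```python
# Remaining completion budget after the 2-shot ToU prompt plus a tested story.
CONTEXT_LIMIT = 2048      # combined read + write token limit
prompt_tokens = 1300      # approximate length of prompt + story
completion_budget = CONTEXT_LIMIT - prompt_tokens
print(completion_budget)  # roughly 750 tokens left for the analysis
```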

For each story, we attach a picture just to let readers sense the theme and get into the story easily. None of the pictures are actual inputs to GPT-3.

<tab name="Biography"> On the contrary to his colleagues believes, Alain Bombard thought that people could stay alive in the sea by drinking sea water and eating small fish and plants from the sea. He set out in a small boat to cross the Atlantic Ocean. He was able to stay alive for 65 days before finishing the journey.
 * -|Best=

Some notable words with low (red) probability and their alternatives:
 * (dim-1) unfit -- poisonous, deadly, toxic, harmful and dangerous
 * (dim-1) confident -- brave, optimistic, positive, strong
 * (dim-2) confuse - curious, uncertain, confused, worried, anxious
 * (dim-4) healthy -- nutrients, vitamins, nutrition, protein, mineral
 * (dim-4) mostly -- able, made, strong, quite, seaw, light

We evaluate the reasoning above to have the following scores:




 * -|Other 1=
 * -|Other 2=

<tab name="Mystery"> As a new job for a prominent wealthy family, one of Chandra's first task is to water all of the house plants. While Chandra is watering the lilies, one of the plants starts talking to warn him of a dark family secret.
 * -|Best=
 * -|Other 1=
 * -|Other 2=

Important observations on GPT-3 Commonsense Reasoning

 * It sometimes does not answer all the questions.


 * It sometimes answers sensibly but in the wrong dimension, e.g. answering about a character in the belief dimension, and vice versa. We suspect an average human with only a 2-shot example might do the same.

 * Low-probability (red) tokens often have alternatives with a similar meaning, e.g. unfit -- poisonous, deadly, toxic, harmful, dangerous; confident -- brave, optimistic, positive, strong. However, when the word belongs to a mutually exclusive category, e.g. hot vs. cold, or a genre, a different sample can lead to totally different reasoning, e.g. non-fiction vs. realistic fiction vs. drama.

Summary table

= Conclusion and Looking Forward =

Note that there is also research on commonsense reasoning from visual images. In this article, however, we limit ourselves to commonsense knowledge and reasoning from text-only inputs.

Feel free to discuss on the discussion tab on the top of this article.

= External References =