GPT-3 and Commonsense Reasoning

Ratthachat Chatpatanasiri - July, 2021 (Draft - Work in Progress)

"Fred told the waiter he wanted some chips". By reading this one sentence, it is amazing that us, human being, know many related (highly possible) information, not written in the sentence at all : we know Fred was in a restaurant. We know Fred was a customer dining here. We know that the waiter and Fred were only few feet apart. We know Fred wanted a potato chips, not some wood chips. We know that within several minutes, the chips should be ready for Fred. And many more.

How do we all know that? This is the power of commonsense.

For a long while, we have thought that this ability belongs only to us humans, and many researchers believe this commonsense is one of the keys to what is called "general intelligence".

Today, we have GPT-3, one of the biggest and most intelligent computer models humans have ever made. Much hype claims that its writing and reasoning capabilities are at near-human levels. Unfortunately, existing works rarely test the true ability of GPT-3's commonsense reasoning (detailed below).

In this article, we are going to see GPT-3's power of commonsense on 10 stories of various genres, from everyday life, comedy, biography, historical fiction and mystery through sci-fi. In total, GPT-3 will face more than 300 questions testing its commonsense intelligence. We will discuss its commonsense capability along 9 fundamental reasoning dimensions, as illustrated in Figure 1.



Although all the demonstrations here are too small to be a commonsense benchmark for GPT-3, they should give readers a good idea of GPT-3's commonsense capability.

Before seeing GPT-3's commonsense in action in Section 3, however, we shall first discuss what "commonsense knowledge and reasoning" is (Section 1) and our commonsense testing framework (Section 2).

Note that, for compactness of presentation, this blog article uses lots of multi-tab displays and may not be mobile-friendly. Readers are advised to enable the Desktop version in the browser menu.

= Background on Commonsense Knowledge and Reasoning =

In this section, we discuss the meaning and power of commonsense knowledge, and briefly discuss the other types of human knowledge used in general reasoning.

What is Commonsense Knowledge?
According to Yejin Choi's ICLR 2021 tutorial on the topic, there is no universally accepted definition of commonsense knowledge. Here, drawing from various literature, we can intuitively define commonsense knowledge as follows:

Commonsense knowledge is "kind-of-obvious" knowledge about (1) the world, (2) humans and (3) our society that most people know or can accurately guess, so that people can omit those details when they communicate.

Note that there are three kinds of commonsense mentioned above; they can also be referred to as physical, human and social commonsense, respectively. To understand the above intuitive definition, consider the following examples.

Examples
 Read this news headline, taken from Yejin Choi's ICLR 2021 tutorial:

Breaking News: Cheeseburger Stabbing

When we see a news headline like this, how can we interpret it?

 * A cheeseburger stabbed someone?
 * A cheeseburger stabbed another cheeseburger?
 * Someone stabbed a cheeseburger?
 * Someone stabbed someone else over a cheeseburger?

By using commonsense, most people eliminate the first two choices, since we know that a cheeseburger is an object and cannot stab anybody. This is an example of physical commonsense. The third choice should also be eliminated, since it makes no sense to stab a cheeseburger (human-nature commonsense), let alone for that to earn a spot in a news headline (social commonsense).

So only the last choice is left. Not only that, we can also guess that the one stabbing the other was probably hungry, since we also know from social commonsense that stabbing someone is bad, so people won't do it for fun or without a compelling reason, unless you live in the stone age. (So commonsense does change as human society evolves.)

Notice that all the above commonsense arguments are intuitive to us humans. Nevertheless, none of this is obvious to a language model at all, since a language model learns knowledge from people's texts, and there are very few texts, if any, mentioning that a cheeseburger (or sashimi, or other physical objects) cannot stab (or walk, or talk, etc.). This is why commonsense knowledge is not simple for a language model or an NLP transformer in the current pretraining paradigm.

In fact, when humans informally communicate with only a few sentences, there is a lot of hidden meaning behind those sentences. By using commonsense, humans can fill in the hidden details along multiple dimensions, such as the spatial, temporal, causal and motivational information behind the sentences. To understand this power of commonsense, consider another inspiring example, adapted from Elemental Cognition's article:

Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happened.

By reading these two sentences, and by social commonsense, we can likely answer or guess the following questions, not written in the texts at all:

 * What were Alice and Elsa doing together? (causal information)
 * Where should they be? (spatial information)
 * Should the time of the story be day or night? (spatial information)
 * What was Elsa's motivation in the first sentence? (temporal + motivational information)
 * What was Elsa's motivation in the second sentence? (temporal + motivational information)
 * What would have happened if Alice had not fallen down? (temporal + causal information)
 * etc. (we can imagine many more details than the list above)

Maybe your guesses are something similar to "competing", "a running race", "at day", "to win a race", "to help Alice", and "Elsa might win", respectively. Or at least you should think that the above guesses make sense.

In fact, maybe you have even imagined a picture similar to Figure 2 in your mind.



So we can see that it is truly amazing that we humans can guess a lot of information found nowhere in the original sentences.

Since there is much more information that we can imagine beyond what is written in the original sentences, commonsense is also said to be the dark matter of intelligence, comparable to physical dark matter: physicists believe that our universe consists mostly of dark matter, yet it is extremely difficult to observe.

Literature: Transformers and GPT-3 vs. Commonsense
Although there are many experiments on the commonsense capabilities of transformers, most commonsense experiments in the recent literature rarely test the true ability of GPT-3, which can read a long, complex text and write a long, complex answer.

Recent experiments showed that transformers in the current pretraining paradigm usually perform poorly on commonsense tests. These works propose finetuning a transformer model on a new commonsense dataset to get much better commonsense performance.

However, the aforementioned experiments were almost all conducted on transformers (e.g. GPT-2) whose capacities and training texts are much smaller than GPT-3's. The full version (named Davinci) of GPT-3 has a pre-training corpus and a neuron-parameter count approximately 10 times and 100 times bigger, respectively, than its predecessor GPT-2, as well as the other transformers in those experiments. Therefore, by reading much larger texts and storing all that knowledge in its neurons, GPT-3 has much more potential to understand our world, including commonsense, than smaller transformers.

To the best of our knowledge, only two commonsense experiments have been conducted on GPT-3. The first one is an experiment on the AtomicComet-2020 dataset. This experiment asked GPT-3 to complete short commonsense relationships with formats such as:


 * PersonX accepts PersonY's apology xIntent [GEN] To feel peaceful with PersonY [SEP]
 * PersonX accepts PersonY's apology HinderedBy [GEN] PersonY has not apologized [SEP]

where xIntent and HinderedBy are two examples of the 23 encoded commonsense relationships in the AtomicComet-2020 dataset, meaning "because PersonX wanted" and "can be hindered by", respectively. [GEN] and [SEP] are special tokens indicating the beginning and the end, respectively, of a consequence that a transformer should generate.

In general, the training data has the following format: head-event relation [GEN] consequence [SEP]

By giving 5-shot prompted examples for each generation, the paper reported a considerably good 73% human-acceptance rate on text generated by GPT-3, whereas GPT-2 got only a 37% acceptance rate. An encoder-decoder transformer model trained directly on the dataset achieved the best performance, around 84.5%.
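To make this encoded few-shot setup concrete, here is a rough sketch of how such a 5-shot prompt could be assembled. The helper functions and the extra example tuples are hypothetical illustrations, not the paper's actual code:

```python
# Sketch of the encoded few-shot format described above (hypothetical helpers,
# not the paper's actual code). Each shot is: head relation [GEN] tail [SEP]
def format_example(head, relation, tail=None):
    """Render one encoded line; leave the tail open for the model to fill."""
    prefix = f"{head} {relation} [GEN]"
    return f"{prefix} {tail} [SEP]" if tail is not None else prefix

def build_prompt(shots, query_head, query_relation):
    """Concatenate 5 completed shots, then the open-ended query line."""
    lines = [format_example(h, r, t) for h, r, t in shots]
    lines.append(format_example(query_head, query_relation))
    return "\n".join(lines)

shots = [
    ("PersonX accepts PersonY's apology", "xIntent", "To feel peaceful with PersonY"),
    ("PersonX accepts PersonY's apology", "HinderedBy", "PersonY has not apologized"),
    ("PersonX bakes bread", "xNeed", "to buy flour"),
    ("PersonX calls a doctor", "xIntent", "to get medical advice"),
    ("PersonX wins the lottery", "xReact", "overjoyed"),
]
prompt = build_prompt(shots, "PersonX forgets an umbrella", "xReact")
print(prompt.splitlines()[-1])  # "PersonX forgets an umbrella xReact [GEN]"
```

The model is then asked to continue after the final [GEN], generating the consequence up to [SEP].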

However, since GPT-3 learned mostly from natural human texts, and can reason over long, high-quality text inputs, it is possible that the above experiment, with its short and unnatural encoded format, does not show the true ability of GPT-3.

The second commonsense experiment fed a small set of 157 sentences into GPT-3. This experiment considered short prompt sentences and identified whether the texts generated by GPT-3 contradicted the given information. It can be seen as complementary to the experiment in this article.

In this article, we follow the methods pioneered by Elemental Cognition's research team. By using a well-defined Template of Understanding to test the various reasoning dimensions shown in Figure 1, we are going to see in action how well GPT-3 can perform commonsense reasoning on diverse story genres in more complicated situations, with high-quality, detailed prompted examples. The main purpose of the high-quality, detailed prompt is to utilize the full potential of GPT-3.

Readers who want an extensive, up-to-date background on commonsense knowledge and reasoning can take a look at Yejin Choi's ICLR 2021 tutorial; see also AAAI 2021's tutorial and workshop. For the bigger picture of this topic, please see Minsky's timeless book The Emotion Machine.

Other Types of Knowledge
For completeness, this subsection briefly discusses the other types of knowledge that we humans use in real-life, everyday reasoning.

<tab name="Elementary Factoid knowledge"> Commonsense knowledge is not the only simple knowledge that most people know. Another type of simple knowledge is elementary factoid knowledge e.g.

 * Germany is a country
 * French is the language of the people of France
 * Thailand is a country in Asia
 * Humans usually have 5 fingers on each hand (with rare exceptions)
 * A dog is an animal
 * etc.

Unlike the commonsense knowledge mentioned in the previous section, these factoids are usually found in human texts (e.g. Wikipedia, web pages, textbooks, etc.), so a language model usually knows a lot of factoids already.

Note that there is no clear boundary between commonsense and simple factoid knowledge, and some researchers may also consider factoid knowledge a subtype of commonsense.

<tab name="Complex knowledge"> There are also complex factoid knowledge which requires multi-hop reasoning, e.g. who is the fifth youngest president in the history of United States ?. Even though all the information to answer this question can be found on the training texts, this kind of knowledge is difficult to a language model as well .

Similarly, conceptual knowledge such as theories of physics, biology or mathematical finance can easily be found in textbooks or on websites, but a language model simply cannot calculate an option price using the Black-Scholes formula like a human financial expert can.
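To make the contrast concrete, below is a minimal sketch of the kind of calculation a human expert (or any ordinary program) performs directly but a pure text-predicting language model cannot reliably carry out: the Black-Scholes price of a European call option. The parameter values are illustrative only.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, r, sigma, T):
    """Black-Scholes price of a European call option."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Illustrative parameters: spot 100, strike 100, 5% rate, 20% vol, 1 year.
price = black_scholes_call(S=100, K=100, r=0.05, sigma=0.2, T=1.0)
print(round(price, 2))  # ≈ 10.45
```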

<tab name="Human Reasoning with all Knowledge"> In reality, human combines all knowledge types mentioned above to think, reason and make a decision on a given context about the real physical world. Scientists in the field of behavioral economics, cognitive science and psychology uses the terms of mental model (System 1) and conceptual model (system 2):



- System 1 or Mental Model: a simple and intuitive model in our head. This is where we use both commonsense and simple factoid knowledge. Reasoning with this mental model is fast and is used in daily routines, e.g. when we want to have breakfast or drive a car. The model is considerably accurate in the context of normal routines, but may not work well when the context changes, e.g. when we have just moved to a new town and want to have breakfast or drive a car.

- System 2 or Conceptual Model: a complex model that makes our species distinct, e.g. logic, mathematics, science or economic theory.

Humans combine the use of both System 1 and System 2 naturally, as shown in Figure 3.

As mentioned before, even though mental-model reasoning using commonsense is very simple for us humans, it is a grand research challenge for a language model, and for the field of artificial intelligence in general. This is because there is not much training data for this kind of too-obvious knowledge.

Hence, the commonsense ability of one of the most advanced language models, GPT-3, is the main focus of this article.

= Testing Commonsense : Template of Understanding =

As described in the previous section, by using a mental model and commonsense, humans are able to create an imagined world beyond the communication input (a few text sentences in our case). Humans then use that imagined world model to infer or reason about things not written in the texts.

Template of Understanding (ToU) is a machine-understanding testing framework pioneered by Elemental Cognition's research team, which in turn was inspired by cognitive-science findings on how humans understand and reason about a narrative story.

ToU asks a language model such as GPT-3 to answer various fundamental, common reasoning questions on a given paragraph of a short story, where usually most of the answers cannot be found in the input text.

Therefore, if a language model such as GPT-3 can give a good answer along each fundamental reasoning dimension, we may be able to say that the model really has high-quality common sense and a mental model similar to a human's. On the other hand, if the model fails to reason in some particular dimensions, researchers can systematically identify the knowledge dimensions the model lacks.

Reasoning Dimensions
In this article, besides the original ToU proposal, we also integrate ideas from other works on commonsense reasoning to emphasize 8 basis reasoning dimensions plus 1 temporal dimension; see Table 1 and also Figure 1.

Note that the temporal dimension can be analyzed together with most of the other basis reasoning dimensions. As shown in Figure 1, we ask what should happen in each dimension before and after the story, to further test abductive and deductive skills.

For example, consider Example 2 in the first section: the characters' roles change with time. During the running competition, Alice and Elsa were competitors, but before and after that event, by commonsense prediction, they were likely friends (even though we cannot say for sure). At the end of this section, two concrete examples covering all reasoning dimensions are illustrated.

It is worth mentioning that causal analysis and counterfactual reasoning, emphasized in previous literature as essential for human reasoning, are included as dimensions 7 and 8.

Also note that, due to the limit on the number of tokens GPT-3 can read and write, we have to group many similar questions together to save space; e.g. in dimension 1 we ask the model who the notable characters are and what their roles and statuses are, all at once. Multiple questions are challenging for the model, but they also allow the model to write about many different aspects and show what it really knows.

Examples: 2-Shot ToU Prompts Given to GPT-3
GPT-3 is well known to work best in a few-shot setting, where the user provides a few high-quality examples as a prompt before letting GPT-3 write an output. Therefore, we provide the following 2-shot prompt, which is the same for each narrative story we test the model on.

These two examples are illustrations of how the model should ideally reason on each ToU dimension.
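In practice, such a few-shot prompt is just one long string: the two worked examples followed by the new story, ending at "Analysis:" so the model continues from there. The sketch below illustrates the assembly; the variable names, the abridged story texts ([...]) and the separator are our assumptions:

```python
# Sketch of assembling the full 2-shot ToU prompt (variable names and the
# separator are assumptions; GPT-3 simply receives one long text string).
PROMPT_STORY_1 = (
    "Alice and Elsa were running toward the finish line. [...]\n\n"
    "Analysis:\n\n1) Notable characters, their roles and status: [...]"
)
PROMPT_STORY_2 = (
    "A man called his son and daughter the day before Christmas. [...]\n\n"
    "Analysis:\n\n1) Notable characters, their roles and status: [...]"
)

def build_tou_prompt(test_story):
    """Append the new story after the two worked examples, ending at
    'Analysis:' so that the model continues with the dimension-by-dimension
    analysis."""
    return "\n\n".join([PROMPT_STORY_1, PROMPT_STORY_2,
                        test_story + "\n\nAnalysis:\n"])

prompt = build_tou_prompt("Alain Bombard thought that people could stay alive at sea [...]")
```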

<tab name="Prompted Story 1 - A Race"> Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happen.

Analysis:

1) Notable characters, their roles and status: Elsa and Alice are likely students and friends because there are a teacher and friends in the context. By their names, they both should be girls. When they competed in the race, they were competitors and runners. The teacher was probably their referee.

2) Notable beliefs, feelings and motivations of characters before, during and after the story: Before the story, both Elsa and Alice wanted to win the race and they should have felt a bit of excitement. In the story, when something happened to Alice, Elsa believed that Alice was injured. Elsa, and everybody else, felt care toward Alice. Then Elsa wanted to help Alice instead of going on to win.

3) Notable possessions of characters before, during and after the story: Throughout the story, each runner should have a pair of running shoes and a running uniform. The teacher might have a whistle and perhaps a stopwatch.

4) Basic usages and general knowledge on some notable possessions: People can use a pair of running shoes for running or just wear them casually. A pair of running shoes typically costs $10 to $100. Famous and expensive brands include Nike, Adidas, Asics and New Balance.

5) Probable locations, positions and daytime of characters before, during and after the story: They probably stayed at a sport stadium, on a running track at their school, throughout the story. It should be daytime in the story, since classes are rarely held at night. Before the race started, Elsa and Alice should have been at the same starting point while the teacher and friends stayed near the track. Shortly after the story, Elsa, Alice and everybody should stay close together to investigate Alice's condition.

6) Guess story genre, and general information about location and time period: The story was quite ordinary, so it could be non-fantasy or realistic fiction, maybe with a bit of drama. Since it looks peaceful, it might be located in a country not at war. The event might have taken place after the 1900s, when the sport became popular, and more probably after the 1950s, when WW-II had already ended.

7) Probable events before and after the story: Before the story, it might have been time for PE class for Elsa and Alice, so they should have changed into uniforms for the class. After the story, if Alice was seriously hurt, maybe they had to bring Alice to a hospital; otherwise, Alice might just take a rest.

8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if: The interesting part was when Alice fell down. She might have tripped over a stone or been injured somewhere. The event would not have happened if Alice had been perfectly healthy and slept well, and there had been no stones on the track.

<tab name="Prompted Story 2 - A Christmas Joke"> A man called his son and daughter the day before Christmas and said he and their mom were going to divorce. The son and daughter were hurry to go back home to stop their parents. The old man turned to his wife and said "they're coming for Christmas now"

Analysis:

1) Notable characters, their roles and status: A family of dad, mom, son and daughter. Their family status looks very healthy.

2) Notable beliefs, feelings and motivations of characters before, during and after the story: Before the story, dad believed that his children would not come home for Christmas, so he might have felt lonely and was motivated to trick the children into coming home. At the end, dad believed that the children would come back home and might have been happy. The children believed the family was healthy before the story. In the story, they felt worried about the parents' divorce, and that motivated them to go back home. After the story, the children would initially be angry to learn that they were tricked, but happy eventually to be back with their parents.

3) Notable possessions of characters before, during and after the story: Dad and children had phones, which could be either landline or mobile. All family members also belonged to each other in some sense.

4) Basic usages and general knowledge on some notable possessions: An average landline or mobile phone may cost around $100, but mobile phone prices can be as high as $2000. After Steve Jobs introduced the modern smartphone, mobile phones can be used just like a small computer, while landline phones have become obsolete.

5) Probable locations, positions and daytime of characters before, during and after the story: Before and in the story, the parents and children likely stayed in different cities, or far enough apart that the children sometimes would not come back home for Christmas. After the story, all of them would be at their home. The story could happen either day or night, but not during working hours.

6) Guess story genre, and general information about location and time period: This story's genre should be realistic fiction and comedy. The story likely occurred in either Europe or North America, where most people are Christian, so Christmas Day is very important. The story had to occur after phones were common in households, and not in wartime, which would be after the 1980s.

7) Probable events before and after the story: Before the story, dad and mom would have talked about the possibility that the children would not come home, so they thought up a fake divorce plan. After the story, the children would be home for Christmas and the family should spend a great time together.

8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if: The interesting part of the story was when dad happily spoke the truth that he had tricked his children. This would have turned out another way if the children did not care about the divorce and would not come back home no matter what.

ToU Scoring


We shall evaluate each generated reasoning with 2 metrics for each dimension in the ToU: Relevancy and Quality.

Relevancy determines whether the model is able to extract, identify or apply hints and/or the most interesting pieces of information with respect to the story and the question. For example, in the comedy story below, there is a mention of bringing a wedding gift to a friend, which hints that the location should be a wedding party. If the model does not use this information, the score may be discounted.

Quality determines the accuracy, sensibility or plausibility of the given reasoning. Continuing the wedding-gift example above, even if the model states that the location should be a wedding party, if it gives non-sensible reasons, this score will be discounted.

Consider another example, Example 2 of Elsa and Alice, in the first section.

 * When the model mentions all interesting / essential events, but with no sensible reason, it will get a high relevancy score but a low quality score.

For example, in the counterfactual analysis, the model may correctly reason that the most interesting event is when "Alice fell down". However, it may hypothesize a low-quality counterfactual like "the event may not have occurred if Alice did not fall down", in contrast to a higher-quality counterfactual like "the event may not have occurred if Alice had been careful enough not to trip on the stone, or if Alice had been perfectly healthy".

 * On the other hand, if the model mentions not-so-interesting pieces of information but with good logic, it may get a low relevancy score but high quality.

For example, the model may reason that the most interesting event is when "a teacher and friends went to see what happened", which is not really essential in the story. However, it may provide an interesting hypothesis like "they would not have gone to Alice if Alice, with a fighting spirit, had shouted out to everyone to let her finish the race by herself".

Following GLUCOSE, these two metrics are given scores from 0 to 3, with the following intuitive interpretations:


 * 0 = unacceptable
 * 1 = low quality
 * 2 = mostly acceptable
 * 3 = great

Because of the multiple questions in each dimension, these scores are subjective to the author as evaluator, especially when the model gives good answers to some questions and bad answers to others at the same time. However, the reason a score is not perfect will always be given, and readers are free to judge the scores themselves.
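For transparency in the bookkeeping, the per-dimension (relevancy, quality) pairs can be tallied with a small helper like the following. This is a hypothetical sketch with illustrative numbers; the article's actual scoring is done by hand:

```python
# Hypothetical helper for tallying hand-assigned ToU scores on the 0 - 3 scale.
# The example numbers below are illustrative only, not the article's real table.
def summarize(scores):
    """scores maps dimension -> (relevancy, quality), each in 0..3.
    Returns the average relevancy and quality across all dimensions."""
    n = len(scores)
    avg_relevancy = sum(r for r, _ in scores.values()) / n
    avg_quality = sum(q for _, q in scores.values()) / n
    return avg_relevancy, avg_quality

example_scores = {
    "dim-1": (3, 3), "dim-2": (2, 3), "dim-3": (3, 3), "dim-4": (3, 2),
    "dim-5": (3, 3), "dim-6": (3, 2), "dim-7": (3, 1), "dim-8": (1, 1),
}
avg_rel, avg_qua = summarize(example_scores)  # 2.625 and 2.25
```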

Figure 4 illustrates a scoring example.

= GPT-3 and Commonsense Reasoning in Action =

Following the 2-shot prompt given above, here we are going to have fun with GPT-3, having it perform similar commonsense reasoning on various stories with different genres, e.g. everyday life, biography, comedy, science and historical fiction, mystery or fantasy. It's very interesting to see how GPT-3 handles all these genres.

As mentioned, there are actually multiple questions for each dimension: in total, 33 questions for each story. The model is not forced to answer all of them each time, but the score will be discounted when the model ignores questions that look important in the given story.

Below, we roughly sort the stories from easiest to most difficult, in our opinion. Readers will see that the ToU scores are high at first but quite low for the last stories.

In the results below, for each story, we use GPT-3 to generate commonsense reasoning 3 times, and we manually select the best run, labeled Best below. The other generations are also attached for reference. The 10 stories, together with their 3 generated texts each, are displayed in multi-tab fashion for compactness, so readers are encouraged to turn on the Desktop version in their browser.

Technically, we kept the parameters constant, set to encourage GPT-3's creativity, as follows: Temperature = 0.7, Top-P = 1.0, Frequency Penalty = 0.0, and Presence Penalty = 0.0.
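With the OpenAI Python client of that era, these settings correspond to a completion request roughly like the one below. The engine name and the call shape are assumptions based on the public 2021 API, and the call itself is commented out since it requires an API key:

```python
# Sketch of a 2021-era OpenAI Completion request using the settings above.
# "davinci" as the engine name is our assumption for the full-size GPT-3.
request_params = dict(
    engine="davinci",
    temperature=0.7,        # some creativity in sampling
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    max_tokens=750,         # roughly what remains of the 2048-token window
)
# import openai
# response = openai.Completion.create(prompt=tou_prompt, **request_params)
# analysis_text = response["choices"][0]["text"]
```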

GPT-3 generates text one token at a time, via random sampling. For each reasoning output, a color indicates the probability of each token: green indicates a likely sample, while red indicates an unlikely one.

Low-probability red tokens are very interesting, since the generated reasoning could have gone another way had GPT-3 randomly chosen other candidate tokens. To gain more insight, for each experiment's best text we also investigate the possible candidate tokens besides the randomly chosen red tokens.
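This green/red highlighting can be derived from the per-token log-probabilities returned alongside a completion. The sketch below maps a sampled token's probability to a display color; the exact thresholds separating "likely" from "unlikely" are our own assumption:

```python
from math import exp

def token_color(logprob, low=0.1, high=0.5):
    """Map a sampled token's log-probability to a display color.
    The 10% / 50% thresholds are our own illustrative assumption."""
    p = exp(logprob)          # convert log-probability to probability
    if p >= high:
        return "green"        # likely sample
    if p <= low:
        return "red"          # unlikely sample: worth inspecting alternatives
    return "yellow"           # in between

# e.g. (token, logprob) pairs as returned alongside a completion
sampled = [("Alain", -0.05), ("was", -0.2), ("1950", -2.8)]
colors = [token_color(lp) for _, lp in sampled]  # green, green, red
```

With the API's logprobs option, one can also retrieve the top alternative tokens at each position, which is presumably how the candidate lists below were gathered.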

Note that GPT-3 has a limitation: the combined number of reading and writing tokens can be no more than 2048. Our 2-shot prompt plus a tested story is roughly 1300 tokens, so GPT-3 has around 750 tokens left to write in each ToU test.

This 2048-token limitation is also the reason we limit ourselves to testing GPT-3 on only the 8 basis dimensions plus 1 temporal dimension. Nevertheless, these 8+1 reasoning dimensions are the most fundamental ones, to the best of our knowledge, as explained in the previous section.

For each story, we attach a theme picture just so readers can get a sense of the story easily. None of the pictures are input to GPT-3.

<tab name="Biography"> On the contrary to his colleagues believes, Alain Bombard thought that people could stay alive in the sea by drinking sea water and eating small fish and plants from the sea. He set out in a small boat to cross the Atlantic Ocean. He was able to stay alive for 65 days before finishing the journey.


 * -|Best=

We evaluate the reasoning above to have the following scores:



Overall, we think the reasoning given here is mostly very good. GPT-3 can also generate totally different, low-quality reasoning, as can be seen in the other generated-text tabs.

Notes on Reasoning


 * On dim-2, GPT-3 does not predict the "after" event, which should be significant. The other generated texts are great here, so the relevancy score is minus 1.
 * On dim-4, it states a price for the boat that is too cheap to be true, while the other generated texts are great, so the quality score is minus 1.
 * On dim-6, it remarkably states that Alain Bombard came from the 1950s, but with the poor reasoning that "boats were common", so the quality score is minus 1.
 * On dim-7, it states too-general and uninteresting reasoning for both the "before" and "after" events, so the quality score is minus 2.
 * On dim-8, it does not explicitly answer which part of the story is interesting, and its counterfactual story looks too general and uninteresting, so we label both scores as low quality.

Besides that, it overall generated pretty good texts. It's remarkable that GPT-3 reasons that Alain Bombard came from France in the 1950s in dim-5 and dim-6, with minimal hints beside his name. This again illustrates that GPT-3 is very knowledgeable in the factoid dimension.

Notable Words with Low (Red) Probability and Their Alternatives

Note that most of them do not seem to change the meaning of the sentence, so GPT-3 would likely generate similar reasoning even if the red tokens had turned out the other way.


 * (dim-1) unfit -- poisonous, deadly, toxic, harmful and dangerous
 * (dim-1) confident -- brave, optimimistic, positive, strong
 * (dim-2) confuse - curious, uncertain, confused, worried, anxious
 * (dim-4) healthy -- nutrients, vitamins, nutrition, protein, mineral
 * (dim-4) mostly -- able, made, strong, quite, seaw, light
 * (dim-5) 1950 -- 1960, 1900, 1970, 1980, 1800


 * -|Other 1=
 * -|Other 2=

<tab name="Mystery"> As a new job for a prominent wealthy family, one of Chandra's first task is to water all of the house plants. While Chandra is watering the lilies, one of the plants starts talking to warn him of a dark family secret.
 * -|Best=

We evaluate the reasoning above to have the following scores:



Overall, we think the reasoning given here is mostly good. GPT-3 can also generate totally different, low-quality reasoning, as can be seen in the other generated-text tabs.

Notes on Reasoning


 * On dim-1, it does not mention the special role of the lilies; GPT-3 mentions them as if they are normal house plants, so the quality score is minus 1.
 * On dim-3, the story hints that the family has a dark secret, and this is the most important point in the story, but GPT-3 ignores it, so the quality score is minus 2.
 * On dim-4, GPT-3 explains only 1 object, so the relevancy score is minus 1. The explanation is too general and uninformative, so the quality score is minus 2.
 * On dim-6, it predicts the correct genre, but the explanations about the time period, Europe and the "Ancient" stuff do not make much sense. We find below that this poor "Ancient" stuff is due to a poor sampling of the word "not".
 * On dim-8, the counterfactual story is too broad and not interesting.

Besides that, it still overall generated pretty acceptable texts, being able to answer most questions, so its relevancy score is high. On dim-7, the "after" stories look good and could make a great sequel. Its ability to reason sensibly in a mystery story is a question mark, so it got a quite low quality score.

Notable Words with Low (Red) Probability and Their Alternatives

Note that most of them do not seem to change the meaning of the sentence, so GPT-3 would likely generate similar reasoning even if the red tokens had turned out the other way.


 * (dim-2) surprised -- surprise, curious, scared, confused, strange
 * (dim-2) secret -- lily, dark, family, talking, house
 * (dim-3) package -- can, pot, tool, bottle
 * (dim-4) found -- used, either, bought, a, kept
 * (dim-4) west -- wealthy, US, summer, rich
 * (dim-5) away -- at, asleep, in, still, out
 * (dim-6) fantasy -- realistic, bit, drama, mystery, horror
 * (dim-6) Europe -- the, North, western, America,
 * (dim-6) not -- common, popular, very, quite --> This caused the poor explanation about "Ancient time"
 * (dim-6) Ancient -- east, past, west, US, old
 * (dim-7) talk -- know, fire, investigate, find, ask
 * (dim-8) hired -- watering, there, curious, interested, surprised
 * (dim-8) plant -- lily, family, plants, wealthy, house


 * -|Other 1=
 * -|Other 2=

<tab name="Sci-fi"> Alien race seeking refuge landed on earth on a small island in the south pacific. For a hundred years they've managed to keep the island cloaked and secret from our human population. But now they've exhausted the resources.
 * -|Best=

Following is the essence of this story which we would expect the model to know and reason:


 * There are two most interesting parts in the narrative.

With respect to the expectation, we evaluate the best reasoning above to have the following scores:



Notes on Reasoning


 * On dim-1,
 * On dim-2,
 * On dim-3,
 * On dim-4,
 * On dim-5,
 * On dim-6,
 * On dim-7,
 * On dim-8,


Notable Words with Low (Red) Probability and Their Alternatives

Note that most of them do not seem to change the meaning of the sentence, so GPT-3 would likely generate similar reasoning even if the red tokens had turned out the other way.


 * (dim-1) lots -- and, big, small, the, south, resource
 * (dim-2) refugee -- the, aliens, refugees, humans, alien
 * (dim-2) expect -- believe, feel, be, have, felt
 * (dim-2) running -- likely, seeking, worried, in, surprised, afraid
 * (dim-3) technologies -- resources, things, tools, vehicles, transportation, food
 * (dim-4) planet -- human, most, advanced, technology, basic
 * (dim-4) free -- very, a, the, as, worth
 * (dim-5) unknown -- in, near, somewhere, the, at
 * (dim-6) realistic -- science, sci, fantasy, non, drama, fiction
 * (dim-6) modern -- the, a, earth, 21, present, future
 * (dim-7) helped -- would, could, should, might, will
 * (dim-7) supplies -- food, their, resource, advanced, some
 * (dim-8) critical -- desperate, bad, crisis, big, trouble
 * (dim-8) planet -- better, way, more, spaceship, lot


 * -|Other 1=
 * -|Other 2=

<tab name="Shopping"> Ling went to a big-box store selling everything on the planet to buy his favorite tennis racket. But a staff named Xin said that the store would not sell the racket since it's defective. Ling complained that he has a ATP master to participate tomorrow and he needed the racket now.
 * -|Best=

Following is the essence of this story which we would expect the model to know and reason:


 * The most interesting part should be the last sentence, since it is not usual that people can participate in an ATP Masters.
 * It is highly likely that Ling is a professional tennis player capable of playing at the ATP Masters.
 * From the names of the two characters, it is possible (though not necessary) that this story happens in China; in that case, the competition has to be the Beijing ATP Masters.
 * It is clear that Ling felt somewhat hopeful about getting the new racket, given the fact that the big-box store sells everything.
 * However, in the story Ling felt frustrated / annoyed / angry that he could not get what he wanted.
 * The after-story part is also interesting: what should Ling do to get the racket he needs?

With respect to the expectation, we evaluate the best reasoning above to have the following scores:



Notes on Reasoning


 * On dim-1, the reasoning looks perfect, using all relevant information.
 * On dim-2, the reasoning here looks very plausible. Only the last part is random, but still possible.
 * On dim-3, tennis is correct and one of the most relevant items in the story. It should mention the money Ling has to pay for the new racket, though.
 * On dim-4, the model mixes up the factoid knowledge of dim-3 and dim-4 a bit, but overall looks acceptable. That tennis rackets have been used for more than 100 years is conceptually correct.
 * On dim-5, the model is able to extract Beijing, as we expect in the best case. Note that in the other generated texts, the locations are mostly random.
 * On dim-6, the general information looks great even though the model does not predict the genre.
 * On dim-7, the before and after events stated here are probable. It would be much better if the model explained in more detail (better explanation quality).
 * On dim-8, the "giving him a free racket" part is purely imaginative, preceding the counterfactual reasoning, so all reasoning here is totally wrong.

Overall, the reasoning in this story is good.

Notable Words with Low (Red) Probability and Their Alternatives

Note that most of them do not seem to change the meaning of the sentence, so GPT-3 would likely generate similar reasoning even if the red tokens had turned out the other way.


 * (dim-1) was -- is, and, ',', should, might --> Note that if 'and' had been sampled, we would get a crappy story like the one in 'Other 2'
 * (dim-1) tennis -- customer, boy, man, young, player, professional
 * (dim-2) excited -- likely, very, a, motivated, eager
 * (dim-3) 10 -- 100, 50, 30, 20, 40, 200
 * (dim-4) 100 -- a, the, hundreds, centuries, 2000
 * (dim-5) Beijing -- a, his, the, tennis, some, home --> pure luck to get Beijing with a very low probability
 * (dim-6) 2000 - the, China, modern, present, Beijing
 * (dim-7) buy -- was, went, had, practiced, bought
 * (dim-8) be -- not, give, sell, gave, refuse --> Bad counterfactual due to bad sampling luck
 * (dim-8) too -- defective, a, broken, really, damaged, faulty --> again, 'too' has a very low probability


 * -|Other 1=
 * -|Other 2=

<tab name="Travel"> It was very exciting to arrive the legendary island where "Origin of Species" was inspired from. However, as Giulia was not well-prepared, she did not even know where should he sleep tonight! At least, she had $1000 which hopefully was enough.
 * -|Best=

Following is the essence of this story which we would expect the model to know and reason:


 * There are two most interesting parts in the narrative.
 * The island itself, which it would be great for the model to infer is the "Galapagos".
 * The fact that Giulia did not know where to sleep!
 * The narrative likely implies that she had no plan and should be a solo traveler.
 * So she should have mixed feelings of excitement and worry.
 * The after-story should relate to either the marvelous things she was going to see on the island, or how she could find a place to sleep.

This story is challenging, as the model must extract the Galapagos information.



Notes on Reasoning
 * On dim-1, the model does a great job identifying possible other characters in the story. Only "the assistant" (due to bad-luck sampling) is unlikely.
 * On dim-2, the sentence structure is acceptable. Sadly, the model cannot directly extract the two most relevant pieces of information in the story: the Galapagos and the "no-hotel" worry.
 * On dim-3, acceptable objects.
 * On dim-4, acceptable attempts using the relevant information from dim-3, but the last sentence does not make much sense.
 * On dim-5, a good attempt to use both the island and hotel information in the "before", "current" and "after" stories. It nevertheless cannot extract the word Galapagos, and the written story contradicts the fact that Giulia was unprepared. The overall score is subjectively between acceptable and low quality.
 * On dim-6, the genre is correct, and this time it successfully (lucky sampling) extracts the continent, South America. We have no exact information on the time period, so 1950 is rather random.
 * On dim-7, the before and after events are not entirely impossible but low quality, since they are not related to the given narrative at all.
 * On dim-8, the reason is totally random and contradicts the story.

Notable Words with Low (Red) Probability and Their Alternatives


 * (dim-1) tourist -- girl, student, female, woman, traveler
 * (dim-1) other -- island, others, story, man, tour, " (supposed to lead to "Origin")
 * (dim-1) assistant -- friends, tour, guide, friend, travel, boyfriend
 * (dim-2) local -- island, trip, tour, place, legendary
 * (dim-2) driver -- island, tour, place, weather, staff
 * (dim-2) excited -- disappointed, a, that, happy, sad (random emotions)
 * (dim-3) suitcase -- passport, lot, phone, bag, backpack
 * (dim-3) mobile -- camera, backpack, phone, passport, wallet (sensible)
 * (dim-4) hard -- small, suitcase, smaller, travel, hand, pink
 * (dim-5) an -- the, a, some, late, Italy, Darwin (sadly no Galapagos token)
 * (dim-5) open -- airport, island, international, airplane
 * (dim-5) daytime -- day, at, in, afternoon, during, night
 * (dim-6) South -- a, the, an, Europe, island, tropical
 * (dim-7) lost -- at, in, a, preparing, staying, worried
 * (dim-8) the -- Giul, she, a, it
 * (dim-8) driver -- tour, staff, hotel, taxi, assistant
 * (dim-8) transportation -- airport, city, island, port, beach


 * -|Other 1=
 * -|Other 2=

<tab name="Shakespere"> Being William Shakespeare’s apprentice would be great if he weren’t always stealing your ideas and claiming them as his own. So, James write a brilliant satiric play exposing him. He loves it and takes it to the stage.
 * -|Best=

This story looks difficult due to the twist at the end -- the model must understand James's bad intention at the end while knowing that Shakespeare may not realize James's intention. Following is the essence of this story which we would expect the model to know and reason:


 * The most interesting part is "So, James write a brilliant satiric play exposing him", and this would not happen if their relationship were much healthier.
 * The model should use all knowledge relating to the name Shakespeare, e.g. location, time period, big picture, related objects.
 * The model should understand James's emotions in the story: sadness, anger, and even a motivation for revenge.
 * The after-story is very important: either Shakespeare would be exposed, or, on the twist plot, the genius Shakespeare has already modified some of James's play.

With respect to the expectation, we evaluate the best reasoning above to have the following scores:



Notes on Reasoning

Sadly, in all 3 generations, not only the best one, GPT-3 missed the main story point that James planned to expose Shakespeare using his satiric play. The sentence "has no motivation to write" (due to bad-luck token sampling) quite contradicts the story.
 * On dim-1, the model correctly identified both characters, with all relevant information.
 * On dim-2, acceptable, with a good detailed analysis of thinking; it only misses the main point that James would want revenge on his mentor. On the quality side,
 * On dim-3, the model reasons about the wrong question in the first sentence relating to James, and gives ok-but-small details regarding Shakespeare's book.
 * On dim-4, too little detail.
 * On dim-5, maybe acceptable reasoning, but the sentence structure is quite difficult to read.
 * On dim-6, showing great knowledge -- it would be perfect if it mentioned that this was in Great Britain or England.
 * On dim-7, we can see that the model tries to use the given information but fails to reason with it. The generated sentences are messy and seem to answer the wrong question, about emotions instead of events.
 * On dim-8, the interesting part is correct, but the reason is the shallowest possible (negating the sentence instead of giving a reason).

Overall, the narrative is quite difficult, so we are not much surprised by the low score.

Notable Words with Low (Red) Probability and Their Alternatives


 * (dim-2) annoyed -- angry, bad, sad, frustrated, upset
 * (dim-2) no -- a, motivation, the, to, some --> May lead to a totally different sentence
 * (dim-2) excited -- happy, proud, good, that, curious
 * (dim-2) bad -- good, proud, happy, great, motivated --> Also may lead to a totally different sentence
 * (dim-3) books -- ideas, pens, writings, plays, scripts
 * (dim-5) Shakespeare -- night, day, daytime, the, evening
 * (dim-6) Drama -- satire, realistic, comedy, real, tragedy
 * (dim-6) 15 -- 16, 17, Renaissance, past, modern


 * -|Other 1=
 * -|Other 2=

<tab name="CoronaVirus"> In 2020, Coronavirus surprises everybody by spreading everywhere, killing millions of people and turn off most world travels. Uğur Şahin told all staffs in his company to work extremely hard on their mRNA vaccine research before situations got worse.
 * -|Best=

This pandemic story is unknown to GPT-3, whose training data is limited to 2019. Following is the essence of this story which we would expect the model to know and reason:

 * The two most interesting parts of the story are "Coronavirus spreading everywhere and killing millions" [global information] and "work extremely hard on the mRNA vaccine" [local information], so this narrative is quite difficult: the model has to reason on these two scales together.
 * On the global scale, it would be great if the model predicted the following:
 * before the story, all people around the world lived normally
 * after this story, before the vaccine gets invented, more people will die, and there is a high possibility of an economic crisis and other catastrophic consequences.
 * On the local scale, we expect the following:
 * On the factoid part, it would be best if the model knows the name Uğur Şahin, who has been the CEO of BioNTech and responsible for the Pfizer-BioNTech vaccine in the real world.
 * Therefore, in the best case it would be able to infer the location of the company and interesting facts about vaccine or mRNA technology.
 * Since millions of people are dying everywhere, it is obvious that the major emotions of everyone include fear, desperation and sadness.
 * The after-story should relate to whether the vaccine succeeds or not.

With respect to the expectation above, we evaluate the best reasoning of GPT-3 to have the following scores:



Overall, we think the model mostly ignored the global information of millions of people dying and failed to incorporate the seriousness of the story.

Notes on Reasoning
 * On dim-1, acceptable answers. It could have mentioned normal people around the world, who are the victims of the virus, and the virus itself as the antagonist of the story.
 * On dim-2, it does not answer the question at all. This is due to bad-luck sampling; we can see some good reasons in the other generated texts.
 * On dim-3, acceptable, but sadly not specifying vaccine-related technology.
 * On dim-4, it uses very little relevant information. The quality of the answer, on the other hand, is acceptable.
 * On dim-5, similar to dim-4.
 * On dim-6, acceptable, but it does not use information about the worldwide spread nor Uğur Şahin's name. The sentence quality is also acceptable, albeit too short.
 * On dim-7, again acceptable, but ignoring the global aspect: before the story people would live normally, and after the story, if the vaccine is successful, it will save many lives.
 * On dim-8, acceptable, albeit "too-short" reasoning.

If we combine the best of all 3 generated texts on each question, the score would possibly improve to 2.0 on average.
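As an illustrative sketch of this best-of-N aggregation (the dimension names and scores below are hypothetical examples, not our actual recorded data), taking the best score per dimension across the 3 generations and averaging would look like:

```python
# Sketch of best-of-N score aggregation across generated texts.
# Scores per dimension for each generation (hypothetical illustration).
scores_per_generation = [
    {'dim-1': 2, 'dim-2': 0, 'dim-3': 2},  # generation 1
    {'dim-1': 1, 'dim-2': 2, 'dim-3': 1},  # generation 2
    {'dim-1': 2, 'dim-2': 1, 'dim-3': 2},  # generation 3
]

dims = scores_per_generation[0].keys()
# For each dimension, keep the best score obtained by any generation.
best_per_dim = {d: max(g[d] for g in scores_per_generation) for d in dims}
average = sum(best_per_dim.values()) / len(best_per_dim)
print(best_per_dim, average)
# → {'dim-1': 2, 'dim-2': 2, 'dim-3': 2} 2.0
```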

Notable Words with Low (Red) Probability and Their Alternatives


 * (dim-1) scientists -- workers, likely, working, employees, researchers
 * (dim-1) SARS -- the, Ebola, he, what, flu
 * (dim-1) angry -- motivated, worried, scared, excited, afraid
 * (dim-4) DNA -- research, experiments, scientific, science, chemical
 * (dim-5) air -- lab, major, signs, effective, cure
 * (dim-6) weathers -- virus, problems, epidem, diseases, S (likely SARS)
 * (dim-6) political -- economy, relationship, government, history, health, reputation, environment


 * -|Other 1=
 * -|Other 2=

<tab name="Comedy"> Eriko never used a crystal punch set she got as a wedding gift. When Praew got married, Eriko wrapped the set as her gift. When Praew opened the gift, she looked curiously and told Eriko it was the same punch set she gave her Years ago.
 * -|Best=

This is a very difficult narrative. Following is the essence of this story which we would expect the model to know and reason:


 * The most interesting sentence is when Praew told Eriko in the last sentence.
 * The punch set was given to Eriko by Praew years ago; Eriko then forgot that Praew gave it to her, so she gave it back to Praew.
 * I.e., Praew --> Eriko --> Praew is the possession flow of this punch set.
 * Grammatically, this is ambiguous due to the coreference "she gave her" in the last sentence.
 * It is not socially polite to bring a gift back to its original giver.
 * Eriko must immediately come up with some excuse for Praew after the given story.
 * They both must feel somewhat awkward at the last sentence of the story.
 * Since the gift was given to Praew in the story, the event might take place at either Praew's wedding party or Praew's house.
 * Also, Praew's role as a newlywed bride should be emphasized, and Eriko must be her guest or even her best friend.

With respect to the expectation above, we evaluate the best reasoning of GPT-3 to have the following scores:



Notes on Reasoning

Due to its difficulty, the model got very low scores even on the best generated text. The model seems confused by the coreference she/her in the last sentence, so its reasoning is quite random.
 * On dim-1, it ignores the role of the bride, so the relevancy score is discounted. On quality, the divorce material is too random.
 * On dim-2, we can see that the model tries to bring in some relevant information about the gift, but the overall reasoning is mostly random.
 * On dim-3, similar to dim-2: GPT-3 tries to use relevant information but fails to clearly explain the Praew --> Eriko --> Praew possession flow.
 * On dim-4, ok, acceptable descriptions of the punch set.
 * On dim-5, it mostly uses relevant information on location, but again the written sentences are entirely unstructured.
 * On dim-6, the guessed genre is partially correct, but the other stated information is quite irrelevant. The sentences are readable but low quality.
 * On dim-7, GPT-3 again tried to use small bits of relevant information about the gift set (while still ignoring much of the relevant information explained above). The explained reasoning is totally unreadable.
 * On dim-8, surprisingly, I think GPT-3 picks the correct interesting event. The counterfactual case of "if Praew had forgotten about the set" is not bad.

Notable Words with Low (Red) Probability and Their Alternatives

With such low relevancy and quality in the reasoning, we feel it is not worth analyzing the red tokens here.


 * -|Other 1=
 * -|Other 2=

Important observations on GPT-3 Commonsense Reasoning

 * It does not answer all questions.


 * It answers sensibly but in the wrong dimension, e.g. about a character in the belief dimension, and vice versa. We suspect an average human given only a 2-shot example may do the same.

 * Red probs -- many words can lead to a similar meaning:

-- unfit -- poisonous, deadly, toxic, harmful, dangerous

-- confident -- brave, optimistic, positive, strong

But when the word is of a categorical (mutually exclusive) type, e.g. hot vs. cold or genre, a red token can lead to totally different reasoning, e.g. non-fiction vs. realistic fiction vs. drama.
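The red-token inspection used throughout this article can be sketched as follows. It assumes per-token log-probabilities and top alternative tokens such as those exposed by GPT-3-style completion APIs (e.g. a `logprobs` field); the 10% threshold and the data layout here are our own illustrative choices, not a fixed standard:

```python
import math

RED_THRESHOLD = 0.10  # tokens sampled with < 10% probability count as "red"

def find_red_tokens(token_logprobs, threshold=RED_THRESHOLD):
    """Return (token, probability, alternatives) for each low-probability token.

    token_logprobs: list of dicts with keys
        'token'        -- the sampled token
        'logprob'      -- its natural-log probability
        'alternatives' -- other high-probability candidate tokens
    """
    red = []
    for entry in token_logprobs:
        prob = math.exp(entry['logprob'])
        if prob < threshold:
            red.append((entry['token'], round(prob, 3), entry['alternatives']))
    return red

# Toy example mirroring the (dim-6) case above: 'not' was sampled with low
# probability, while 'common' / 'popular' were the likelier alternatives.
sample = [
    {'token': 'very', 'logprob': math.log(0.62), 'alternatives': ['quite']},
    {'token': 'not',  'logprob': math.log(0.04),
     'alternatives': ['common', 'popular', 'very', 'quite']},
]
print(find_red_tokens(sample))
# → [('not', 0.04, ['common', 'popular', 'very', 'quite'])]
```

When a flagged token belongs to a synonym-like group, resampling it rarely changes the reasoning; when it belongs to a mutually exclusive category, resampling can flip the whole explanation, as seen in the "Ancient" case above.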

Summary table

Focusing on and exploring more of the most interesting events and counterfactuals, with more and higher-quality prompts
= Conclusion and Looking Forward =

In previous work on multiple-choice testing, transformer models score around the 90th percentile of high-school students, so it is either top-class high-school or undergraduate level. Here, however, when it is forced to give a reason for each answer, its answers are of quite low quality by higher educational standards. We suspect that its reasoning is comparable to that of Grade 3 to Grade 9 students on average, probably around Grade 6.

Note that there are also research works on commonsense reasoning from visual images. In this article, however, we limited ourselves to commonsense knowledge and reasoning from text-only inputs.

Feel free to discuss in the discussion tab at the top of this article.

= External References =