GPT3 and Commonsense Reasoning
Ratthachat Chatpatanasiri - July 2021 (Finished Draft on Few-Shot Experiments)
"Fred told the waiter he wanted some chips"[1] . By reading this one sentence, it is amazing that us, human being, know many related (highly possible) information, not written in the sentence at all : we know Fred was in a restaurant. We know Fred was a customer dining here. We know that the waiter and Fred were only few feet apart. We know Fred wanted a potato chips, not some wood chips. We know that within several minutes, the chips should be ready for Fred. And many more.
How do we all know that? This is the power of commonsense.
For a long while, we have thought that this ability belongs only to us humans, and many researchers [2][3][4][5] believe this commonsense is one of the keys to what is called "general intelligence".
Today, we have a family of large language models, in particular GPT-3 [6], one of the biggest and most intelligent computer models humans have ever made. Much hype claims that its writing and reasoning capabilities are at near-human levels. Unfortunately, existing works rarely test the true ability of GPT-3's commonsense reasoning (detailed below).
In this article, using a few-shot setting, we are going to see GPT-3's power of commonsense on stories of various genres, from everyday life, comedy, biography, historical fiction and mystery through sci-fi. As a result, we will see how GPT-3 responds (or fails to respond) to more than 300 questions testing its commonsense intelligence.
We will discuss its commonsense capability in 8 fundamental reasoning dimensions as illustrated in Figure 1.
Figure 1. Template of Understanding (ToU) [7] for commonsense analysis in 8 fundamental reasoning dimensions. When reading a story, an average adult is usually able to make good descriptive arguments along all these dimensions. A number x) indicates a reasoning dimension. Details are explained in Section 2. In Section 3, we will have fun applying the ToU to GPT-3 to see its commonsense reasoning when reading short stories.
Before seeing GPT-3's commonsense in action in Section 3, however, we shall first discuss what "commonsense knowledge and reasoning" is (Section 1) and our commonsense testing framework (Section 2).
Note
As of September 2021, this article illustrates the commonsense capability of GPT-3 (the biggest model, named Davinci) only in a few-shot setting, i.e. GPT-3 Davinci was shown only a few examples before we tested its performance. The few-shot setting has been the default inference mechanism for Davinci since complete training with many examples is not possible through OpenAI's API. However, the OpenAI team may release a Davinci training module soon. With hundreds of training examples, Davinci's performance should be much better than illustrated here, and we will update our article accordingly.
Experimental evaluation of this work requires human judgement, so it is costly and challenging. In this article, we show a small set of evaluations which, although they cannot serve as a commonsense benchmark for GPT-3, should give readers a good idea of GPT-3's commonsense capability in a few-shot setting.
For compactness of presentation, this blog article uses many multi-tab displays and may not be mobile-friendly. Readers are advised to switch to the desktop version in the browser menu.
Background on Commonsense Knowledge and Reasoning
In this section, we discuss the meaning and power of commonsense knowledge, and briefly discuss the other types of human knowledge used in general reasoning.
What is Commonsense Knowledge?
According to Yejin Choi's ICLR 2021 tutorial on the topic [2], there is no universally accepted definition of commonsense knowledge. Here, drawing on various literature, we can intuitively define commonsense knowledge as follows:
Commonsense knowledge is "kind-of-obvious" knowledge about (1) the world, (2) humans and (3) our society that most people know or can accurately guess, so that people can omit those details when they communicate.
Note that there are three kinds of commonsense mentioned above, which can also be referred to as physical, human and social commonsense, respectively. To understand this intuitive definition, consider the following examples.
Examples
Read this news headline, taken from Yejin Choi's ICLR 2021 tutorial:
Breaking News: Cheeseburger Stabbing
When we see a news headline like this, how can we interpret it?
A cheeseburger stabbed someone?
A cheeseburger stabbed another cheeseburger?
Someone stabbed a cheeseburger?
Someone stabbed someone else over a cheeseburger?
Using commonsense, most people eliminate the first two choices, since we know that a cheeseburger is an object and cannot stab anybody. This is an example of physical commonsense. The third choice should also be eliminated, since it makes no sense to stab a cheeseburger (human-nature commonsense), let alone for it to earn a spot in a news headline (social commonsense).
So only the last choice is left. Not only that, we can also guess that the person doing the stabbing probably felt hungry, since we also know from social commonsense that stabbing someone is bad, so people will not do it for fun or without a compelling reason, unless perhaps in the Stone Age era. (So commonsense does change as human society evolves.)
Notice that all the above commonsense arguments are intuitive to us humans. Nevertheless, they are not obvious to a language model at all, since a language model learns its knowledge from people's texts, and texts rarely, if ever, mention that a cheeseburger (or sashimi, or any other physical object) cannot stab (or walk, or talk, etc.).
This is why commonsense knowledge is not simple for a language model, or an NLP transformer, in the current pretraining paradigm.
In fact, when humans communicate informally with only a few sentences, there is a lot of hidden meaning behind those sentences.
Using commonsense, humans can fill in the hidden details along multiple dimensions, such as spatial, temporal, causal and motivational information behind the sentences. To appreciate this power of commonsense, consider the next example.
Consider another inspiring example, adapted from Elemental Cognition's article [8]:
Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happen.
By reading these sentences, and by social commonsense, we are likely able to answer or guess the following questions, none of which are answered explicitly in the text:
What were Alice and Elsa doing together? (causal information)
Where should they be? (spatial information)
Should the time of the story be day or night? (temporal information)
What was Elsa's motivation in the first sentence? (temporal + motivational information)
What was Elsa's motivation in the second sentence? (temporal + motivational information)
What would happen if Alice had not fallen down? (temporal + causal information)
etc. (we can imagine many more questions than those listed above)
Maybe your guesses are something like "competing", "a running race", "during the day", "to win the race", "to help Alice" and "Elsa might have won", respectively.
Or at least you should agree that the above guesses make sense.
In fact, you may have imagined a picture similar to Figure 2 in your mind.
So we can see that it is truly amazing that we humans can guess a lot of information nowhere to be found in the original sentences.
Since there is much more information that we can imagine than the exact information written in the original sentences, commonsense is also said to be the dark matter of intelligence [9]: like physical dark matter, which physicists believe makes up most of our universe, it is everywhere yet extremely difficult to observe.
Figure 2. A common mental image in a human's imagination of "Alice and Elsa were running toward the finish line" (see Example 2). Image is under the Pixabay license.
Literature: Transformers and GPT-3 vs. Commonsense
Although there are many experiments on the commonsense capabilities of transformers, most commonsense experiments in the recent literature [10][11][12][13] rarely test the true ability of GPT-3, which can read a long, complex text and write a long, complex answer.
Recent experiments [10][12][13] showed that transformers in the current pretraining paradigm usually perform poorly on commonsense tests. They propose finetuning a transformer model on a new commonsense dataset to obtain much better commonsense performance.
However, the aforementioned experiments were mostly conducted on transformers (e.g. GPT-2) whose capacities and training texts are much smaller than GPT-3's. The pre-training corpus and parameter count of the full version of GPT-3 (named Davinci) are approximately 10 times and 100 times bigger, respectively, than those of its predecessor GPT-2, as well as the other transformers in those experiments. Therefore, from reading much larger texts and storing all that knowledge in its neurons, GPT-3 has much more potential to understand our world, including commonsense, than smaller transformers.
To the best of our knowledge, only two commonsense experiments have been conducted on GPT-3. The first is an experiment on the AtomicComet-2020 [10] dataset. This experiment asked GPT-3 to complete short commonsense relations in formats such as:
PersonX accepts PersonY's apology xIntent [GEN] to feel peaceful with PersonY [SEP]
PersonX accepts PersonY's apology HinderedBy [GEN] PersonY has not apologized [SEP]
where xIntent and HinderedBy are two examples of the 23 encoded commonsense relations in the AtomicComet-2020 dataset, meaning "because PersonX wanted" and "can be hindered by", respectively. [GEN] and [SEP] are special tokens indicating the beginning and end, respectively, of the consequence that the transformer should generate.
In general, the training data has the following format:
<event> <relation> [GEN] <consequence> [SEP]
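As an illustration, a few-shot prompt in this encoded format can be assembled along the following lines. This is only our minimal sketch of the format described above; the example tuples and helper names are our own illustrative assumptions, not the paper's actual code or data.

import textwrap  # only for tidy printing; not essential

# Illustrative (event, relation, consequence) triples in the format above.
examples = [
    ("PersonX accepts PersonY's apology", "xIntent",
     "to feel peaceful with PersonY"),
    ("PersonX accepts PersonY's apology", "HinderedBy",
     "PersonY has not apologized"),
]

def format_line(event, relation, consequence=None):
    """Render one '<event> <relation> [GEN] <consequence> [SEP]' line.
    If consequence is None, the line is left open after [GEN] so the
    model is expected to continue it."""
    line = f"{event} {relation} [GEN]"
    if consequence is not None:
        line += f" {consequence} [SEP]"
    return line

# Completed shots first, then a query left open for GPT-3 to complete.
prompt = "\n".join(format_line(*ex) for ex in examples)
prompt += "\n" + format_line("PersonX forgets PersonY's birthday", "xIntent")
print(textwrap.indent(prompt, "  "))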
By giving 5 prompted examples (5-shot) for each generation, the paper reported a considerably good 73% human-acceptance rate on the <consequence> texts generated by GPT-3, whereas GPT-2 only got a 37% acceptance rate. An encoder-decoder transformer model trained directly on the dataset got the best performance, around 84.5%.
However, since GPT-3 learned mostly from natural human texts and is able to reason over long, high-quality text inputs, it is possible that the above experiment, which uses a short and unnatural encoded format, does not show the true ability of GPT-3.
The second commonsense experiment fed a small set of 157 sentences into GPT-3 [14]. This experiment considered short prompt sentences and identified whether the texts generated by GPT-3 contradicted the given information. It can be seen as complementary to the experiment in this article.
In this article, we follow the methods pioneered by Elemental Cognition's research team [7][12]. Using the well-defined Template of Understanding [7] to test the reasoning dimensions shown in Figure 1, we are going to see in action how well GPT-3 performs commonsense reasoning across diverse story genres in more complicated situations, with high-quality, detailed prompt examples. The main purpose of the high-quality, detailed prompt is to utilize the full potential of GPT-3.
Readers who want an extensive, up-to-date background on commonsense knowledge and reasoning can take a look at Yejin Choi's ICLR 2021 tutorial [2]; see also the AAAI 2021 tutorial [3] and workshop [4]. For the bigger picture of this topic, please see Minsky's timeless book The Emotion Machine [1].
Other Types of Knowledge
For completeness, this subsection briefly discusses the other types of knowledge that we humans use in real-life, everyday reasoning [15][16][17][18].
Commonsense knowledge is not the only simple knowledge that most people know. Another type of simple knowledge is elementary factoid knowledge, e.g.:
Germany is a country
French is the language of the people of France
Thailand is a country in Asia
Humans usually have five fingers on each hand (with rare exceptions)
A dog is an animal
etc.
Unlike the commonsense knowledge mentioned in the previous section, these factoids are usually found in human texts (e.g. Wikipedia, web pages, textbooks), so a language model usually knows a lot of factoids already [19].
Note that there is no clear boundary between commonsense and simple factoid knowledge, and some researchers may also consider factoid knowledge a subtype of commonsense.
There is also complex factoid knowledge, which requires multi-hop reasoning, e.g. who is the fifth-youngest president in the history of the United States? Even though all the information needed to answer this question can be found in the training texts, this kind of knowledge is difficult for a language model as well [20].
Similarly, conceptual knowledge such as theories of physics, biology or mathematical finance can easily be found in textbooks or on websites, but a language model simply cannot calculate an option price using the Black-Scholes formula the way a human financial expert can.
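For concreteness, the calculation referred to is the standard textbook Black-Scholes price of a European call option (stated here only as an illustration of conceptual knowledge; it is not used elsewhere in this article):

\[ C = S_0\,N(d_1) - K e^{-rT} N(d_2), \qquad d_1 = \frac{\ln(S_0/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}, \]

where \(S_0\) is the spot price, \(K\) the strike, \(r\) the risk-free rate, \(\sigma\) the volatility, \(T\) the time to expiry, and \(N\) the standard normal CDF. Correctly binding these symbols to quantities and evaluating them is exactly the kind of multi-step symbolic computation that next-token prediction does not guarantee.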
In reality, humans combine all the knowledge types mentioned above to think, reason and make decisions in a given context about the real physical world. Scientists in the fields of behavioral economics [17], cognitive science [15] and psychology [16] use the terms mental model (System 1) and conceptual model (System 2):
Figure 3. Humans reason with a combination of a mental model (simple and intuitive) and a conceptual model (complex and accurate) of the physical world. Image taken from Hestenes' article [16].
- System 1, or Mental Model: a simple and intuitive model in our head. This is where we use both commonsense and simple factoid knowledge. Reasoning with this mental model is fast and is used in daily routines, e.g. when we want to have breakfast or drive a car. The model is reasonably accurate in the context of normal routines, but it may not work well when the context changes, e.g. when we have just moved to a new town and want to have breakfast or drive a car.
- System 2, or Conceptual Model: a complex model that makes our species distinct, e.g. logic, mathematics, science or economic theory.
Humans combine the use of System 1 and System 2 naturally, as shown in Figure 3.
As mentioned before, even though mental-model reasoning using commonsense is very simple for us humans, it is a grand research challenge for a language model, and for the field of artificial intelligence in general, because there is not much training data for this kind of too-simple knowledge. Hence, we focus on the commonsense ability of one of the most advanced language models, GPT-3, as the main objective of this article.
Testing Commonsense: Template of Understanding
As described in the previous section, by using a mental model and commonsense, humans are able to create an imaginative world beyond the communication input (a few text sentences, in our case). They then use that imaginative world model to infer or reason about things not written in the text.
The Template of Understanding (ToU) is a machine-understanding testing framework pioneered by Elemental Cognition's research team [7], which was in turn inspired by work in cognitive science on how humans understand and reason about a narrative story [18].
The ToU asks a language model such as GPT-3 to answer various fundamental, common reasoning questions about a given short-story paragraph, where most of the answers usually cannot be found in the input text. Therefore, if a language model such as GPT-3 gives a good answer in each fundamental reasoning dimension, we may be able to say that the model really has high-quality commonsense and a mental model similar to a human's. On the other hand, if the model fails to reason in particular dimensions, researchers can systematically identify the knowledge dimensions the model lacks.
Reasoning Dimensions
In this article, besides the original ToU proposal, we also integrate ideas from other works on commonsense reasoning [12][10][1] to emphasize 8 basis reasoning dimensions plus 1 temporal dimension; see Table 1 and also Figure 1.
Table 1. Eight Basis Reasoning Dimensions and One Temporal Dimension

Basis Dimension | Description | Combined with 9 - Temporal Dimension?
1 - Character Identification | Notable characters, their roles and status | Yes
2 - Analysis of Thinking | Notable beliefs, feelings and motivations of those characters | Yes
3 - Object Identification | Notable objects possessed by those characters | Yes
4 - Factoid Knowledge | Basic knowledge relevant to those objects, e.g. usages, properties, price | No
5 - Location Identification | Specific locations (i.e. places) and positions of those characters | Yes
6 - Big-Picture Identification | General information about the location (i.e. country), time period and story genre (e.g. fantasy or realistic) | No
7 - Causal Analysis | Probable events before and after the story | Yes
8 - Counterfactual Analysis | Alternative story at and after the most interesting event | Yes
A caution on the phrases "causal analysis" and "counterfactual analysis": in this article, we follow the original ToU work [7] in using these two phrases informally, i.e. by "causal events" we mean a chain of events in which the second event logically follows from (is enabled by) the first, in the non-rigorous sense that most humans understand as commonsense (see the examples in the next section). For a scientific treatment of these two phrases, which requires rigorous randomized experimental design, readers may consult the work of Hernán and Robins [21].
Note also that the temporal dimension is a special reasoning dimension because it can be analyzed together with most of the other basis dimensions. As shown in Figure 1, we ask what should happen in each dimension before and after the story, to further test abductive and deductive skills.
For example, consider Example 2 in the first section: the characters' roles change with time. During the running competition, Alice and Elsa were competitors, but before and after that event, by commonsense prediction, they were likely friends (even though we cannot say for sure). At the end of this section, two concrete examples covering all reasoning dimensions are illustrated.
Worth mentioning, causal analysis and counterfactual reasoning, emphasized in previous literature [10][7] as essential for human reasoning, are included as dimensions 7 and 8.
Also note that, due to the limit on the number of tokens GPT-3 can read and write, we have to group many similar questions together to save space; e.g. in dimension 1, we ask the model who the notable characters are and what their roles and status are, all at the same time. Multiple questions are challenging for the model, but they also allow the model to write about many different aspects and show what it really knows.
Examples: 2-Shot ToU Prompt Given to GPT-3
GPT-3 is well known to work best in a few-shot setting, where the user provides a few high-quality examples as a prompt before letting GPT-3 write an output. Therefore, we provide the following 2-shot prompt, which is the same for every narrative story on which we test the model. Due to the 2048-token limitation mentioned above, only a 2-shot prompt is possible in our experiments.
Note that there is recent literature on AutoPrompt [22], where we can train another model to provide the best prompt for each task. Nevertheless, those techniques are mostly designed for classification problems and are not straightforward to apply in the text-generation context we are interested in here.
The two examples below illustrate how the model should ideally reason in each ToU dimension. The Colab code to generate the prompt for GPT-3 used in this article can be found here:
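In place of that notebook, here is a minimal sketch (our own reconstruction; the helper name build_tou_prompt is hypothetical) of how the 2-shot prompt is concatenated: each shot is a story followed by "Analysis:" and the eight numbered answers, and the tested story is appended with an open "Analysis:" header for GPT-3 to complete.

def build_tou_prompt(shots, test_story):
    """shots: list of (story, analysis) pairs, where analysis is the full
    text of the eight numbered ToU answers shown below.
    The tested story is appended with an open 'Analysis:' header so that
    GPT-3 continues by writing its own numbered answers."""
    parts = [f"{story}\nAnalysis:\n{analysis}" for story, analysis in shots]
    parts.append(f"{test_story}\nAnalysis:")
    return "\n\n".join(parts)

The two example shots themselves are as follows.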
Alice and Elsa were running toward the finish line. However, as Alice somehow fell down to the ground, Elsa turned her back and went to help Alice. A teacher and friends also went to see what happen.
Analysis:
1) Notable characters, their roles and status:
Elsa and Alice are likely students and friends because there are teacher and friends in the context. By their names, they both should be girls. When they competed a race, they were competitors and runners. The teacher was probably their referee.
2) Notable believes, feelings and motivations of characters before, during and after the story:
Before the story, both Elsa and Alice wanted to win the race and they should feel a bit excitement. In the story, when something happen to Alice, Elsa belived that Alice was injured. Elsa, and everybody else, felt care toward Alice. Then Elsa wanted to help Alice instead of going to win.
3) Notable possessions of characters before, during and after the story:
Throughout the story, each runner should have a pair of running shoes, a running uniform. The teacher might have a whistle and perhaps a clock watch.
4) Basic usages and general knowledge on some notable possessions:
People can use a pair of running shoes for running or just wear them casually.
A pair of running shoes typically costs $10 to $100. Famous and expensive brands include Nike, Adidas, Asics and New Balance.
5) Probable locations, positions and daytime of characters before, during and after the story:
They probably stayed at a sport stadium, on a running race at their school throughout the story. It should be at day in the story since the class is rarely at night. Before the race started, Elsa and Alice should stay at the same starting point while the teacher and friends stay near the race. Shortly after the story, Elsa, Alice and everybody should stay closely to investigate Alice's condition.
6) Guess story genre, and general information about location and time period:
The story was quite ordinary, so it could be non-fantasy or realistic fiction, maybe a bit drama. Since it looks peaceful, it might locate in no-war country. The event might took place after 1900s where the sport has been popular, and more probable after 1950s where WW-II already ended.
7) Probable events before and after the story:
Before the story, it may be the time for PE class for Elsa and Alice, so they should change uniforms for the class. After the strory, if Alice was seriously hurt, maybe we had to bring Alice to a hospital, or otherwise, Alice might just take a rest.
8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if:
The interesting part was when Alice got fell down. She might trip over stone or injured somewhere. The event would not happen if Alice was perfectly healthy, slept well and there were no stone on the race.
A man called his son and daughter the day before Christmas and said he and their mom were going to divorce. The son and daughter were hurry to go back home to stop their parents. The old man turned to his wife and said "they're coming for Christmas now"
Analysis:
1) Notable characters ,their roles and status:
A family of dad, mom, son and daughter. Their family status look very healthy.
2) Notable believes, feelings and motivations of characters before, during and after the story:
Before the story, dad believed that their children would not come home in Christmas, so he might felt lonely and was motivated to trick their children to come home. At the end, dad believed that the children would come back home and might be happy. The children would believed the family was healthy before the story. In the story, they felt worried of the parents divorce, and that motivated them to back home. After the story, the children would initially got angry knowing that they were tricked, but were happy eventually to be back with the parents.
3) Notable possessions of characters before, during and after the story:
Dad and children had phones, which could be either landline or mobile. All family members also belonged to each other in some sense.
4) Basic usages and general knowledge on some notable possessions:
Average landline phone and mobile phone may cost around $100, but mobile phone price can be as high $2000.
After the invention of smartphones by Steve Jobs, mobile phone can be used just like a small computer while landline phones would become obsolete.
5) Probable locations, positions and daytime of characters before, during and after the story:
Before and in the story, the parents and children likely stayed in different cities or far enough that sometimes the children will not back home in Christmas.
After the story, all of them would be at their home. The story could happen either day or night, but not on working hours.
6) Guess story genre, and general information about location and time period:
This story genre should be a realistic fiction and comedy. The story was likely occured in either Europe or North America where most people are Chistian, so that Chirstmas day are very important. The story had to occur after phones were common to households and not in war-time which would be after 1980s.
7) Probable events before and after the story:
Before the story, dad and mom would talk about possibilities that the children would not come home. So they thought about a fake divorce plan. After the story, children would be home in Chirstmas and the family should spend great time together.
8) Analyze the interesting event in the story, if any, and hypothesize that the interesting event would not occur if:
The interesting part of the story was when dad happily spoke the truth that he tricked his children. This would turn out another way if the children would not care about the divorce and not back home no matter what.
ToU Scoring
Figure 4. Illustration of ToU scoring for each dimension and metric.
We evaluate each generated reasoning with 2 metrics for each ToU dimension: Relevancy and Quality.
Relevancy determines whether the model is able to extract, identify or apply the hints and/or the most interesting pieces of information with respect to the story and the question. For example, in the comedy story below, there is a mention of bringing a wedding gift to a friend, and this hints that the location should be a wedding party. If the model does not use this information, the score may be discounted.
Quality determines the accuracy, sensibility or plausibility of the given reasoning. Continuing the wedding-gift example above, even if the model states that the location should be a wedding party, but for non-sensible reasons, this score will be discounted.
Consider another example, Example 2 of Elsa and Alice, in the first section. When the model mentions all the interesting/essential events but with no sensible reason, it will get a high relevancy score but a low quality score. For example, in the counterfactual analysis it may correctly reason that the most interesting event is when "Alice fell down". However, it may hypothesize a low-quality counterfactual like "the event may not occur if Alice did not fall down", in contrast to a higher-quality counterfactual like "the event may not occur if Alice was careful enough not to trip on the stone, or if Alice was perfectly healthy".
On the other hand, if the model mentions not-so-interesting pieces of information but with good logic, it may get a low relevancy score but a high quality score. For example, it may reason that the most interesting event is when "a teacher and friends went to see what happened", which is not really essential to the story. However, it may provide an interesting hypothesis like "they would not go to Alice if Alice, with a fighting spirit, shouted out to everyone to let her finish the race by herself".
Following GLUCOSE [12], these two metrics are given scores from 0 to 3, with the intuitive interpretations:
0 = unacceptable
1 = low quality
2 = mostly acceptable
3 = great
Because there are multiple questions in each dimension, the scores are subjective to the author as evaluator, especially when the model gives good answers to some questions and bad answers to others at the same time. However, the reason a score is not perfect will always be given, and readers are free to judge the scores themselves.
Figure 4 illustrates a scoring example.
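To make the bookkeeping concrete, here is a minimal sketch (our own convention, not prescribed by GLUCOSE or the ToU papers) of how the per-dimension relevancy/quality scores can be recorded and averaged:

from statistics import mean

# Short names for the 8 basis dimensions of Table 1.
DIMENSIONS = ["characters", "thinking", "objects", "factoid",
              "location", "big_picture", "causal", "counterfactual"]

def summarize(scores):
    """scores: dict mapping dimension name -> (relevancy, quality),
    each an integer in 0..3 as defined above."""
    relevancy = mean(r for r, _ in scores.values())
    quality = mean(q for _, q in scores.values())
    return relevancy, quality, mean([relevancy, quality])

# Example: a story scored 2/2 everywhere except a weak counterfactual.
story_scores = {d: (2, 2) for d in DIMENSIONS}
story_scores["counterfactual"] = (1, 1)
print(summarize(story_scores))  # (1.875, 1.875, 1.875)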
GPT-3 and Commonsense Reasoning in Action
Following the 2-shot prompt given above, here we are going to have fun with GPT-3, having it make similar commonsense reasoning on various stories of different genres, e.g. everyday life, biography, comedy, science and historical fiction, mystery and fantasy. It is very interesting to see how GPT-3 handles all these genres.
As mentioned, there are actually multiple questions for each dimension. In total, there are at least 33 questions for each story, depending on the number of characters and objects. The model is not forced to answer all of them each time, but the score will be discounted when the model ignores questions that look important in the given story.
Below, we roughly sort the stories from easiest to most difficult, in our opinion. Readers will see that the ToU scores are high for the first 4 stories but quite low for the last 4. (Also see the appendix, where we are able to improve the scores of some reasoning dimensions at the cost of sacrificing other dimensions' scores.)
Technically, we consistently set the parameters to encourage GPT-3's creativity as follows: Temperature = 0.7, Top-P = 1.0, Frequency Penalty = 0.0, and Presence Penalty = 0.0.
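For reference, a minimal sketch of one generation call through the 2021-era OpenAI Python client, with the parameter values stated above (the prompt variable is assembled as sketched in Section 2; the max_tokens value is our assumption based on the token budget discussed below):

import openai

openai.api_key = "YOUR_API_KEY"  # supplied with the OpenAI access credits

response = openai.Completion.create(
    engine="davinci",            # the full GPT-3 model used in this article
    prompt=prompt,               # 2-shot ToU prompt plus the tested story
    max_tokens=700,              # stay inside the 2048-token window
    temperature=0.7,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    n=3,                         # three generations; we pick the best by hand
)
candidates = [choice["text"] for choice in response["choices"]]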
To see GPT-3's capability under this random text-generation scheme, for each story we use GPT-3 to generate commonsense reasoning 3 times, and we manually select the best reasoning, labeled Best below.
The other two generations are also attached for reference. All stories, together with the 3 generated texts, are displayed in multiple tabs for compactness of presentation, so that readers can easily navigate the results. We encourage readers to turn on the desktop version in the browser if the multi-tab function does not work properly.
GPT-3 generates text one token at a time by random sampling. For each reasoning output, a color indicates the probability of each sampled token: green indicates a high-probability token, while red indicates a low-probability one.
The low-probability red tokens are very interesting, since the generated reasoning could go another way if GPT-3 randomly chose a different candidate token. To gain more insight, in each experiment's best text we also investigate the possible candidate tokens besides the randomly chosen red tokens.
Note that GPT-3 has a limitation: the combined number of reading and writing tokens can be no more than 2048. Our 2-shot prompt plus a tested story is roughly 1300 tokens, so GPT-3 has around 750 tokens left to write in each ToU test.
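This budget can be sanity-checked with the GPT-2 BPE tokenizer, which, per the GPT-3 paper, GPT-3 reuses (a rough check, assuming the Hugging Face transformers library is available):

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
n_prompt = len(tokenizer.encode(prompt))  # prompt built as sketched earlier
print(f"{n_prompt} prompt tokens, {2048 - n_prompt} left for generation")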
This 2048-token limitation is the reason we limit ourselves to testing GPT-3 on only 8 basis dimensions plus 1 temporal dimension. Nevertheless, these 8+1 reasoning dimensions are the most fundamental ones to the best of our knowledge, as explained in the previous section.
The 2048-token limitation also limits the quality of the 2-shot prompt text, where we could not give our best reasoning examples. In the appendix below, we show that if we could provide more tokens as a prompt, GPT-3's reasoning quality could improve significantly.
Further, readers may criticize that the manual best-text-selection process is not quite practical. In fact, we could teach GPT-3 to choose the best generation automatically by giving it all candidates and our best selection with an explanation as a prompt. Again, we could not implement this idea due to the 2048-token limitation.
For each story, we attach a theme picture just to help readers get a sense of the story. None of the pictures are input to GPT-3.
On the contrary to his colleagues believes, Alain Bombard thought that people could stay alive in the sea by drinking sea water and eating small fish and plants from the sea. He set out in a small boat to cross the Atlantic Ocean. He was able to stay alive for 65 days before finishing the journey.
In this Biography genre, these are the essences of the story which we would expect the model to know and reason about:
There is a hint in the name Alain Bombard that this person is real, and the model should retrieve the related events.
The after-story is interesting: what will he do with his success?
With respect to the best reasoning the model generated, we evaluate it to have the following scores:
Reasoning Scores
Notes on Reasoning
On dim-1, the character descriptions and roles are mostly perfect.
On dim-2, GPT-3 does provide high-quality text regarding the before-story and present thinking. However, it does not predict the "after" events, which are important.
On dim-3, the hypothesized possessions are great.
On dim-4, it states a price for the boat that is too cheap to be true, while the other generated texts are great.
On dim-5 and dim-6, it remarkably states correctly that in the story Alain Bombard came from France in the 1950s. One flaw on dim-6 is the poor reasoning that "boats were common", so the quality score is minus 1.
On dim-7, it states too-general and uninteresting reasoning for both the "before" and "after" events.
On dim-8, the model does not explicitly answer which part of the story is interesting, and its counterfactual story looks like it entirely missed the point, so we label both scores as low quality.
Overall, it still generated pretty good text. This again illustrates that GPT-3 is very knowledgeable in the factoid dimension.
GPT-3 can also generate totally different, low-quality reasoning, as can be seen in the other generated-text tabs.
Notable Words with Low (Red) Probability and Their Alternatives
Note that most of them do not seem to change the meaning of the sentence, so GPT-3 would likely generate similar reasoning even if the red tokens had turned out the other way.
(dim-1) unfit -- poisonous , deadly, toxic, harmful and dangerous
Alien race seeking refuge landed on earth on a small island in the south pacific. For a hundred years they've managed to keep the island cloaked and secret from our human population. But now they've exhausted the resources.
Following is the essence of this story which we would expect the model to know and reason about:
The most interesting part is in the last sentence, where we expect the aliens to do something to gain more resources.
Before the story, these aliens somehow had to abandon their planet.
After the story, what should the aliens do to get more resources?
We can safely assume that the aliens possessed high technology, e.g. a spaceship and a cloaking machine.
From the narrative, the aliens, at least initially, had been peaceful and had no bad intentions, so they did not want to contact humans.
There is a small hint that the alien island is in the South Pacific, so it should be an uninhabited island similar to Henderson Island.
With respect to these expectations, we evaluate the best reasoning above to have the following scores:
Reasoning Scores
Notes on Reasoning
On dim-1, it correctly identifies the two main characters, with the correct role of the aliens as refugees. The mention of resources here is a bit random, though.
On dim-2, it seems to hypothesize a great before-story and after-story. Note that it hypothesizes a story of peaceful contact with humans.
On dim-3, it mentions several possible possessions.
On dim-4, it roughly states all possible knowledge of the objects mentioned in dim-3 and dim-1, even though the written sentences on spaceship/vehicle prices are quite random.
On dim-5, the reasoning is detailed and all makes sense.
On dim-6, the sentence structure is difficult to read, but the overall information is relevant. In fact, the other two generated texts explicitly mention that this story should be science fiction.
On dim-7, again a very plausible theory for both the before and after events. Moreover, this is coherent with the dim-2 reasoning.
On dim-8, the model correctly identifies the most important part. It could explain more deeply, though, how resources could become more abundant for the aliens.
Among all the stories we test, the model reasons best here. Do not expect this great reasoning in all stories, as readers will see in the last 4 stories.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-8) planet -- better, way, more, spaceship, lot
Aliens
Ling went to a big-box store selling everything on the planet to buy his favorite tennis racket. But a staff named Xin said that the store would not sell the racket since it's defective. Ling complained that he has a ATP master to participate tomorrow and he needed the racket now.
Following is the essence of this story which we would expect the model to know and reason about:
The most interesting part should be in the last sentence, since it is not usual for someone to be able to participate in an ATP Masters.
It is highly likely that Ling is a professional tennis player capable of playing at the ATP Masters level.
From the names of the two characters, it is possible (though not necessary) that this story happens in China, in which case the competition has to be the Beijing ATP Masters.
It is clear that Ling felt somewhat hopeful about getting a new racket, given the fact that the big-box store sells everything.
However, in the story Ling felt frustrated / annoyed / angry that he could not get what he wanted.
The after-story part is also interesting: what should Ling do to get the racket he needs?
With respect to these expectations, we evaluate the best reasoning above to have the following scores:
Reasoning Scores
Notes on Reasoning
On dim-1, the reasoning looks perfect, using all relevant information.
On dim-2, the reasoning looks very likely. Only the last part is random, but still possible.
On dim-3, tennis is correct and one of the most relevant objects in the story. It should mention the money Ling has to pay for the new racket, though.
On dim-4, the model gets a bit mixed up answering factoid knowledge for both dim-3 and dim-4, but overall it looks acceptable. That tennis rackets have been used for more than 100 years is conceptually correct.
On dim-5, the model is able to extract Beijing, as we expected in the best case. Note that in the other generated texts, the locations are mostly random.
On dim-6, the general information looks great, even though the model does not predict the genre.
On dim-7, the before and after events stated here are probable. It would be much better if the model explained in more detail (better explanation quality).
On dim-8, the "giving him a free racket" part is purely imaginative, preceding the counterfactual reasoning. So all the reasoning here is totally wrong.
Overall, the reasoning on this story is good.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-1) was -- is, and, ',', should, might --> Note that if 'and' had been sampled, we would get a crappy story like the one in 'Other2'
(dim-1) tennis -- customer, boy, man, young, player, professional
(dim-2) excited -- likely, very, a, motivated, eager
(dim-3) 10 -- 100, 50, 30, 20, 40, 200
(dim-4) 100 -- a, the, hundreds, centuries, 2000
(dim-5) Beijing -- a, his, the, tennis, some, home --> pure luck to have Beijing with a very low probability
(dim 8) be -- not, give, sell, gave, refuse --> Bad counterfactual due to bad sampling luck
(dim-8) too -- defective, a, broken, really, damaged, faulty --> again, 'too' has very low probability
Shopping at Big-box Store
As a new job for a prominent wealthy family, one of Chandra's first task is to water all of the house plants. While Chandra is watering the lilies, one of the plants starts talking to warn him of a dark family secret.
Following is the essence of this story which we would expect the model to know and reason about:
The most interesting part of the story is, of course, when one of the lilies starts talking.
It is truly exciting to know what the "dark secret" is, how the lilies could be able to talk, etc.
There are not many other hints in the main story, so the ToU questions may not be too difficult.
With respect to these expectations, we evaluate the best reasoning above to have the following scores:
Reasoning Scores
Overall, we think the reasoning given here is mostly good. GPT-3 can also generate totally different, low-quality reasoning, as can be seen in the other generated-text tabs.
Notes on Reasoning
On dim-1, it does not mention the special role of the lilies, even though it knows that the lilies are relevant; GPT-3 mentions them as if they were normal house plants, so the quality score is minus 1.
On dim-2, the generated texts look quite reasonable in all temporal aspects.
On dim-3, GPT-3 again ignores the facts that the family possessed a dark secret, wealth, a house, etc., so the relevancy score is low quality here.
On dim-4, GPT-3 mentions only 1 object out of many possibilities, so the relevancy score is low quality.
On dim-5, the location reasoning looks acceptable.
On dim-6, it uses the dark-secret information to predict the correct genre. It also tries to use other information, like the plants and the house, to predict general information, but with bad-luck sampling, the explanations about the time period, Europe and "Ancient" stuff do not make much sense. From the red-token analysis, we found that this poor "Ancient" stuff is due to a poor sampling of the word "not".
On dim-7, these look like a likely before-event and a good after-event.
On dim-8, it detects the correct interesting part, but the counterfactual story is too broad and not interesting.
Overall, this best version is quite acceptable text; it is able to answer most questions. Its ability to reason sensibly in a mystery story is a question mark, though.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-8) plant -- lily, family, plants, wealthy, house
Mysterious House
It was very exciting to arrive the legendary island where "Origin of Species" was inspired from. However, as Giulia was not well-prepared, she did not even know where should he sleep tonight! At least, she had $1000 which hopefully was enough.
Following is the essence of this story which we would expect the model to know and reason about:
There are two most interesting parts in the narrative:
The island itself, which it would be great for the model to infer to be the Galapagos.
The fact that Giulia did not know where to sleep!
The narrative likely implies that she had no idea where to stay, and that she was a solo traveler.
So she should have mixed feelings of excitement and worry.
The after-story should relate either to what marvelous things she was going to see on the island, or to how she could find a place to sleep.
This story is challenging for the model, which must extract the Galapagos information.
Reasoning Scores
Notes on Reasoning
On dim-1, the model does a great job identifying possible other characters in the story. Only the assistant (due to bad-luck sampling) is unlikely.
On dim-2, the sentence structure is acceptable. Sadly, the model cannot directly extract the 2 most relevant pieces of information in the story: the Galapagos and the "no-hotel" worry.
On dim-3, acceptable objects.
On dim-4, acceptable attempts using the relevant information from dim-3, but the last sentence does not make much sense.
On dim-5, a good attempt to use both the island and the hotel information in the "before", "current" and "after" story. It nevertheless cannot extract the word Galapagos, and the written text contradicts the fact that Giulia was unprepared. The overall score is subjectively between acceptable and low quality.
On dim-6, the genre is correct, and this time it successfully (by lucky sampling) extracts the continent name, South America. We have no exact information on the time period, so 1950 is rather random.
On dim-7, the before and after events are not entirely impossible, but they are low quality since they are not related to the given narrative at all.
On dim-8, the reasoning is totally random and contradicts the story.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-8) transportation -- airport, city, island, port, beach
Galapagos island
Being William Shakespeare’s apprentice would be great if he weren’t always stealing your ideas and claiming them as his own. So, James write a brilliant satiric play exposing him. He loves it and takes it to the stage.
This story looks difficult due to the twist at the end: the model must understand James's bad intention at the end, while knowing that Shakespeare may not realize James's intention.
Following is the essence of this story which we would expect the model to know and reason about:
The most interesting part is "So, James write a brilliant satiric play exposing him", and this would not happen if their relationship were much healthier.
The model should use all the knowledge relating to the name Shakespeare, e.g. location, time period, big picture, related objects.
The model should understand James's emotions in the story: sadness, anger, and even a motivation for revenge.
The after-story is very important: either Shakespeare would be exposed, or, in a twist plot, Shakespeare with his genius would have already modified some of James's play.
With respect to these expectations, we evaluate the best reasoning above to have the following scores:
Reasoning Scores
Notes on Reasoning
Sadly, in all 3 generations, not only the best one, GPT-3 missed the main story point that James planned to expose Shakespeare using his satiric play.
On dim-1, the model correctly identified both characters, with all relevant information.
On dim-2, acceptable, with a good detailed analysis of thinking; it only misses the main point that James would want revenge on his mentor. On the quality side, the sentence "has no motivation to write" (due to bad-luck token sampling) quite contradicts the story.
On dim-3, the model reasons about the wrong question in the first sentence relating to James, and gives an ok-but-small detail regarding Shakespeare's book.
On dim-4, too little detail.
On dim-5, the reasoning may be acceptable, but the sentence structure is quite difficult to read.
On dim-6, it shows great knowledge; it would be perfect if it mentioned that this was in Great Britain or England.
On dim-7, we can see that the model tries to use the given information but fails to reason with it. The generated sentences are messy and seem to answer the wrong question, about emotions instead of events.
On dim-8, the interesting part is correct, but the reason is the shallowest possible (negating the sentence instead of giving a reason).
Overall, the narrative is quite difficult, so we are not much surprised by the low score.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-2) annoyed -- angry, bad, sad, frustrated, upset
(dim-2) no -- a, motivation, the, to, some --> may yield a totally different sentence
(dim-6) Drama -- satire, realistic, comedy, real, tragedy
(dim-6) 15 -- 16, 17, Renaissance, past, modern
William Shakespeare
In 2020, Coronavirus surprises everybody by spreading everywhere, killing millions of people and turn off most world travels. Uğur Şahin told all staffs in his company to work extremely hard on their mRNA vaccine research before situations got worse.
This pandemic story is unknown to GPT-3, whose training data is limited to 2019.
Following is the essence of this story which we would expect the model to know and reason about:
The two most interesting parts of the story are "Coronavirus spreading everywhere and killing millions" [global information] and "work extremely hard on the mRNA vaccine" [local information], so this narrative is quite difficult in that the model has to reason on these two scales together.
On the global scale, it would be great if the model predicted the following:
Before the story, all people around the world lived normally.
After this story, until the vaccine is invented, more people will die, and there is a high possibility of economic crisis and other catastrophic consequences.
On the local scale, we expect the following:
On the factoid part, it would be best if the model knew the name Uğur Şahin, who has been the CEO of BioNTech and responsible for the Pfizer-BioNTech vaccine in the real world.
In that case, at best, it would be able to infer the location of the company and interesting facts about vaccines or mRNA technology.
Since millions of people are dying everywhere, it is obvious that the major emotions of all people included fear, desperation and sadness.
The after-story should relate to whether the vaccine succeeds or not.
With respect to the expectations above, we evaluate the best reasoning of GPT-3 to have the following scores:
Reasoning Scores
Overall, we think the model mostly ignores the global information of millions of people dying and fails to incorporate the seriousness of the story.
Notes on Reasoning
On dim-1, acceptable answers. It could have mentioned the normal people around the world, who are victims of the virus, and the virus itself as the antagonist of the story.
On dim-2, it does not answer the question at all. This is due to bad-luck sampling; we can see some good reasoning in the other generated texts.
On dim-3, acceptable, but sadly not specifying vaccine-related technology.
On dim-4, it uses very little relevant information. The quality of the answer, on the other hand, is acceptable.
On dim-5, similar to dim-4.
On dim-6, acceptable, but it does not use the information about the worldwide spread or Uğur Şahin's name. The sentence quality is also acceptable, albeit too short.
On dim-7, again acceptable, but it ignores the global aspects: before the story, people lived normally, and after the story, if the vaccine is successful, it will save many lives.
On dim-8, acceptable, albeit "too-short" reasoning.
If we combined the best of all 3 generated texts on each question, the score would possibly improve to 2.0 on average.
Notable Words with Low (Red) Probability and Their Alternatives
(dim-4) DNA -- research, experiments, scientific, science, chemical
(dim-5) air -- lab, major, signs, effective, cure
(dim-6) weathers -- virus, problems, epidem, diseases, S (likely SARS)
(dim-6) political -- economy, relationship, government, history, health, reputation, environment
Coronavirus Pandemic - unknown to GPT-3, whose latest training data is from 2019
Eriko never used a crystal punch set she got as a wedding gift. When Praew got married, Eriko wrapped the set as her gift. When Praew opened the gift, she looked curiously and told Eriko it was the same punch set she gave her Years ago.
This is a very difficult narrative. Following is the essence of this story which we would expect the model to know and reason about:
The most interesting sentence is when Praew speaks to Eriko in the last sentence.
The punch set was given to Eriko by Praew years ago; then Eriko forgot that Praew gave it to her, so she gave it back to Praew.
I.e. Praew --> Eriko --> Praew is the possession flow of this punch set.
Grammatically, this is ambiguous due to the coreference "she gave her" in the last sentence.
It is not socially polite to give a gift back to its original giver.
Eriko must immediately make some excuses to Praew after the given story.
They both must have somewhat awkward feelings at the last sentence of the story.
Since the gift was given to Praew in the story, the event might take place at either Praew's wedding party or Praew's house.
Also, Praew's role as a newlywed bride should be emphasized, and Eriko must be her guest or even her best friend.
With respect to the expectations above, we evaluate the best reasoning of GPT-3 to have the following scores:
Reasoning Scores
Notes on Reasoning
Due to its difficulty, the model got very low scores on even the best generated text.
The model seems confused by the coreference she/her in the last sentence, so its reasoning is quite random.
On dim-1, it ignores the role of the bride, so the relevancy score gets discounted. On quality, the divorce stuff is too random.
On dim-2, we can see that the model tries to bring in some relevant information about the gift, but the overall reasoning is mostly random.
On dim-3, similar to dim-2, GPT-3 tries to use relevant information but fails to clearly explain the Praew --> Eriko --> Praew possession flow.
On dim-4, ok, acceptable descriptions of the punch set.
On dim-5, it mostly uses relevant information about the location, but again the quality of the written sentences is entirely unstructured.
On dim-6, the genre guess is partially correct, but the other information stated is quite irrelevant. The sentences are readable, but low quality.
On dim-7, GPT-3 again tried to use a small piece of relevant information about the gift set (still ignoring much of the relevant information explained above). The quality of the explained reasoning is totally unreadable.
On dim-8, surprisingly, I think GPT-3 picked the correct interesting event. The counterfactual clause "if Praew had forgotten about the set" is not bad.
Notable Words with Low (Red) Probability and Their Alternatives
With such low relevancy and quality in the reasoning, we feel it is not worth analyzing the red tokens here.
A Crystal Punch Set
Summary of GPT-3 Commonsense Reasoning
Human commonsense knowledge and reasoning are much more complex than we normally think. This fact becomes clearer when we use the Template of Understanding (ToU) to analyze commonsense reasoning systematically along fundamental dimensions.
Since the experimental evaluation in this work requires human judgement, it is costly and challenging. Above, we showed a small set of evaluations which, although it cannot serve as a commonsense benchmark for GPT-3, we hope can be used as rough guidance on the directions in which machine reasoning can be improved: when the model consistently fails, we can systematically see the reasons for, as well as the dimensions of, the failure, as explained in the previous section.
Figure 5 shows the average capability, according to our experiments, in each commonsense reasoning dimension of GPT-3, one of the best models we have.
Figure 5. GPT-3 commonsense score summary chart from reading narrative stories of various genres.
Among the reasoning dimensions, causal inference and counterfactual analysis are believed by earlier works [1][5] to be keys to artificial general intelligence. Note again that in this article we use the informal meanings of the words "causal" and "counterfactual", as explained in Section 2.
Counterfactual analysis, the ability to generate a 'what-if' story with logical consistency and plausibility, is one of the most important and difficult dimensions. Our analysis in the previous section confirms the difficulty of this kind of reasoning: GPT-3 got its lowest score here, 1.31 points, which corresponds to mostly low-quality reasoning.
Counterfactual reasoning is difficult because the model has to identify the most interesting part of the story and hypothesize a chain of alternative events that would make the story go in an entirely different direction. Moreover, the story in the new direction must come from rigid reasoning, i.e. a sequence of logical and relevant steps. Such a logical-and-relevant reasoning chain is difficult to attain with the present 'random token sampling', the gold standard of the current text-generation paradigm.
The random-token-sampling paradigm is not effective for a chain of logical reasoning, since in many cases words of opposite meaning both have high probabilities. For example, give the following sentence to GPT-3 (Davinci):
When we ignite little fire in the very-strong air-conditioned room, the room temperature will be either hot or cold. I think the room is
Then words such as hot and cold/cool have roughly comparable probabilities, and the meaning of the generated sentence would be entirely different depending on which is sampled.
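One can observe this near-tie directly through the API's logprobs option, which returns the top candidate tokens at each position. A minimal sketch (temperature 0 here only to make the single returned token deterministic; the exact probabilities will vary):

import math
import openai

response = openai.Completion.create(
    engine="davinci",
    prompt=("When we ignite little fire in the very-strong air-conditioned "
            "room, the room temperature will be either hot or cold. "
            "I think the room is"),
    max_tokens=1,
    temperature=0.0,
    logprobs=5,   # also return the top-5 candidate tokens with log-probs
)
top = response["choices"][0]["logprobs"]["top_logprobs"][0]
for token, logprob in sorted(top.items(), key=lambda kv: -kv[1]):
    print(repr(token), round(math.exp(logprob), 3))
# Expect tokens like ' hot' and ' cold' with comparable probabilities.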
There were frequently flawed reasonings in which GPT-3 gave a sentence contradicting its own story, as described in each story's detailed analysis above. These contradicting sentences may come from the fact that GPT-3 learned sentence correlation, rather than sensible sentence deduction, in the current pretraining paradigm. A sensibly deduced sentence is a sentence that is caused or enabled by the events explained in the previous sentences.
Causal inference (from the given context) also requires a rigid chain of reasoning. This causal reasoning may be somewhat easier than counterfactual analysis since, although it requires the same rigid chain, the model still has relevant information from the given narrative, whereas in counterfactual analysis the model has to generate new information by itself (the original story may provide only a small hint). Our investigation scores GPT-3's capability at causal reasoning around 1.81, which still leans toward low quality.
The other reasoning dimensions, e.g. character, object or location description, are a bit easier, since the various objects generated by token sampling can be consistent with the given main story and often do not contradict each other; e.g. a person can possess almost arbitrary objects, so no matter which objects are generated, the sentence is often valid (although it may not be relevant to the story).
The scores in the other reasoning dimensions are around 2.0 (acceptable), with the prime exception of character identification, where GPT-3 usually did an amazing job (score 2.53), indicating that GPT-3 is very skillful at understanding characters in a story.
To summarize our ToU dimensional analysis: the total average score is 1.94, almost acceptable, but not great. This shows that while GPT-3 is indeed able to write impressive texts, its strong points lie elsewhere than solid reasoning, e.g. in creativity, vast knowledge and syntactically correct writing. GPT-3 still has room to improve its concrete logical reasoning, as shown by the ToU analysis.
Appendix on More Intensive Analysis of Causal and Counterfactual Inference
To really see the limits of GPT-3's causal and counterfactual inference, we conducted more extensive experiments focusing on these two reasoning dimensions only. The experiments add a much-higher-quality prompt with more shots in these two reasoning dimensions. However, due to GPT-3's 2048-token limitation, this extensive prompt comes at the cost of entirely abandoning the other 6 reasoning dimensions.
Readers can see the appendix here. Briefly, with a higher-quality prompt and more shots, GPT-3's reasoning in these two dimensions is better; nevertheless, more detailed analysis shows that it still often generates contradicting sentences. As mentioned above, this is because GPT-3 learned sentence correlation, rather than sensible sentence deduction (or what we might call sentence causation), in the current pretraining paradigm.
We may be able to reduce this kind of inconsistency given more shots and even-higher-quality examples. Therefore, it would be very interesting to see the power of GPT-3 if we could break the 2048-token limitation.
Looking Forward
In previous work evaluating transformers on multiple-choice New York science examinations [23], a transformer model got very high scores of 91.6% and 83% on the Grade 8 and Grade 12 exams, respectively. It would thus get an 'A' grade even by high-school standards, and this result tempts us to think that AI is at least on par with undergraduate-level intelligence already.
Nevertheless, when we move out of the multiple-choice regime and let the AI show its reasoning for each answer, the above ToU analysis of GPT-3, even at a small scale, hints that its understanding is likely well below that of high-school students (since high-school students are less likely to put contradicting sentences together in their reasoning).
In the appendix, we showed that GPT-3 can do better if we provide much-higher-quality prompts. The main limitation of this paradigm is the 2048-token limit, as explained at the beginning of Section 3.
Note also that in the experiments above, we manually chose the best generation. Readers may criticize that this manual process cannot scale. In fact, we could teach GPT-3 to choose the best generation automatically by giving it all candidates and our best selection with an explanation as a prompt. Again, we could not implement this idea due to the 2048-token limitation. Therefore, the 2048-token limitation is a significant bottleneck in our opinion.
So in future improvements of GPT-3, or even a GPT-4, one potential direction, in addition to increasing model parameters and training data, is to make the model able to attend to more tokens (e.g. 10x to 100x more); this would be another way to see GPT's full potential.
Note that there are also research works on commonsense reasoning from visual images [24]. In this article, however, we limited ourselves to commonsense knowledge and reasoning from text-only inputs.
This work was made possible by access to, and credits for, GPT-3 given by OpenAI. Feel free to discuss it on the discussion tab at the top of this article.