Generating Intermediate Steps for NLI with Next-Step Supervision

Deepanway Ghosal

^{1}

, Somak Aditya

^{2}

, Monojit Choudhury

^{3}

^{1}

ISTD, Singapore University of Technology and Design

^{2}

Department of CSE, IIT Kharagpur

^{3}

Turing India, Microsoft
deepanway_ghosal@mymail.sutd.edu.sg
saditya@cse.iitkgp.ac.in
monojitc@microsoft.com

Abstract

The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion. While the necessity of generating such intermediate steps (instead of a summary explanation) has gained popular support, it is unclear how to generate such steps without complete end-to-end supervision and how such generated steps can be further utilized. In this work, we train a sequence-to-sequence model to generate only the next step given an NLI premise and hypothesis pair (and previous steps); then enhance it with external knowledge and symbolic search to generate intermediate steps with only next-step supervision. We show the correctness of such generated steps through automated and human verification. Furthermore, we show that such generated steps can help improve end-to-end NLI task performance using simple data augmentation strategies, across multiple public NLI datasets.

1 Introduction

Premise (P)	Hypothesis (H)	Proof	Correctness		Minimality
Premise (P)	Hypothesis (H)	Proof	C1	C2	M1	M2
The wind propels a sailing ship on a group of cruisers.	There are many boats out.	P $\to$ I1: A group of cruisers are in the water. $\to$ I2: A group of boats are in the water. $\to$ H	✔	✔	✔	✔
A young girl wearing a red shirt and cap smiling and holding a small toy is standing in front of a group of children playing behind her.	The young girl is wearing a red shirt.	P $\to$ I1: A young girl in a red shirt and cap is in front of others. $\to$ I2: The young girl is wearing a cap and red shirt. $\to$ H	✔	✔	✖	✖
Numerous customers browsing for products in a market.	People are shopping.	P $\to$ I1: A group of people are shopping. $\to$ I2: There are a bunch of people shopping. $\to$ H	✔	✔	✖	✔
A gentleman with his eyes closed playing an old fluglehorn into a microphone.	A man plays an instruement with his eyes closed.	P $\to$ I1: The gentleman is listening to music. $\to$ H	✔	✖	✖	✔

Table 1: Correctness and minimality of proofs as outlined in Section 2.

Complex NLP tasks such as Natural Language Inference (NLI) and Question-Answering (QA) often requires reasoning over multiple steps using multiple facts and implicit commonsense knowledge trivedi-etal-2020-multihop; DBLP:conf/aaai/SapBABLRRSC19; camburu2018snli. It has been long argued that DBLP:journals/cacm/Lipton18, the state-of-the-art Deep Learning models should also output some sort of explanation (such as intermediate steps or a textual explanation) alongwith the final answer. The opaque performance and poor out-of-distribution generalization performance of Transformers-based models kaushik2019learning; ribeiro-etal-2020-beyond have refuelled this discussion. However, it is unclear how these intermediate steps can be generated for unconstrained natural language Premise-hypothesis pairs (such as in crowd-sourced NLI datasets) as it is non-trivial to collect crowd-sourced fine-grained explanations or generate them synthetically. Most importantly, it is unclear how such steps can be utilized further for the NLI task.

Despite some efforts camburu2018snli crowd-sourced collection of intermediate steps comes with complications, as human-written explanation can be subjective, and it is hard to automatically verify or utilize such explanations. Recently, researchers have explored synthetic generation of intermediate steps (or proof trees) DBLP:conf/ijcai/ClarkTR20; DBLP:conf/acl/TafjordDC21; DBLP:conf/emnlp/SahaGSB20, where the main goal was to test whether Transformers (the backbone for NLP models) can perform deductive reasoning over natural language statements. Provided their examples come from an underlying symbolic system with closed world rules and facts, it is unclear how this strategy of generating proof tree can be extended to unconstrained natural language premise-hypothesis pairs. In comparably structured domains, such as code snippet retrieval, knowledge-graph based QA and symbolic mathematics, DBLP:journals/corr/abs-2112-00114; DBLP:conf/wsdm/HeL0ZW21; agarwal2021analyzing have shown how to generate such intermediate steps automatically, and how utilizing such steps can enhance end-to-end task performance.

We take inspiration from these domains. We assume that an ideal NLI system reaches a series of intermediate conclusions to derive the final conclusion (entailment/contradiction/neutral) and we call these intermediate steps as proof. Since such proofs can widely vary linguistically, logically and in length; to make verification and generation easier, we impose various constraints of correctness, minimality, and atomicity on what we expect as natural language proofs for a given premise-hypothesis pair. Alongwith human verification, we propose corresponding automated metrics for verification. As human annotation and synthetic generation of such groundtruth proofs is non-trivial, we utilize single-step supervision to train a T5 encoder-decoder model raffel2019exploring on various available entailment datasets; learning various aspects of proof generation from SNLI, Monotonicity Entailment Dataset, and Entailment Bank. To enrich single-step generation with external knowledge, we build a fact retriever and sentence composition model; that retrieves facts from external knowledge bases and learns to deduce new facts. We then explore search-based methods that utilizes the fact-augmented T5 model to generate multiple-step proofs. Our automated and human verification results show the efficacy of the proof generation process. We further use these generated proofs as additional training data and show improvement in end-to-end NLI task performance.

Our contributions include, 1) using only next-step supervision to train a T5 encoder-decoder based model, and external fact composition model; and then use search to generate sequence of knowledge-enriched intermediate steps. We 2) propose and use automated and human verification metrics to show the correctness of the generated proofs. Lastly, 3) we also show that augmenting such proofs during training can help enhance an NLI system’s performance on MNLI, and MED (under low-training data regimes).

2 Constrained Intermediate Steps or Proofs in the NLI Context

In NLI, a premise-hypothesis pair is provided in natural language, and the task is to determine whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise. Ideally, an NLI system follows a logical sequence of steps (or proof) to come to such a conclusion. However, many valid proofs may exist due to linguistic variability and multiple ways of reaching a conclusion from the context. Hence, to make generation and verification easier, we resort to defining logical properties that valid proofs should have.

Given a premise ( $P$ ), a hypothesis ( $H$ ), and an implicit knowledge base $K B$ ¹¹1Following dagan2005pascal, we assume $K B$ encodes commonly assumed knowledge as facts and rules., an NLI proof $⟨ P, H ⟩$ (or $⟨ P, \neg H ⟩$ for contradiction) is a sequence of sentences ( $Y_{1}, \dots, Y_{m}$ ), where $Y_{j}$ is either an inferred intermediate step (denoted by $I_{j}$ ), or an external fact or rule (denoted by $F_{j}$ ). A valid proof should satisfy the following properties: i) (Correctness 1) each step is either a generic rule, or a fact which is a entailed by the premise; ii) (Correctness 2) hypothesis (or negated hypothesis) is a valid entailment of premise and intermediate steps; iii) (Minimality 1) each proof upto $Y_{j}$ (including $H$ ) should be minimal proof for $⟨ P, Y_{j} ⟩$ , iv) (Minimality 2) the sentences in the intermediate step ( $I_{j}$ ) should not be trivially decomposable (using common linguistic or logical constructs)²²2M1 captures redundancy (see E4 in Tab. 1). M2 captures some aspects of atomicity. Sentences should not be compound in nature. We prefer “John is going to Paris. Mia is going to Paris.” over “John and Mia are going to Paris”. It also entails that intermediate conclusions should correspond to a semantic frame (which can not be trivially decomposable without losing context). However the latter is quiet hard to verify. and consecutive inferred steps should be sufficiently different, and v) (Order) a step can only be generated with the help of previous steps.

We provide few example proofs in Table 1, highlighting the properties of correctness and minimality. These constraints motivate our method to generate proofs for arbitrary P-H pairs with only next-step supervision. In Figure 1, we show two types of proof that we generate, i) inference chains (i.e. only intermediate entailments) and ii) proofs with external (commonsense) facts (or rules) in sequence. For neutral cases, such properties are hard to define. Inspired from DBLP:conf/acl/KumarT20, we proposed a method for intermediate step generation of neutral instances which is used only for data augmentation experiments.

3 Provers: Generating Proofs with Next-Step Supervision

A typical possibility to build provers is by training on an NLI dataset with groundtruth proofs, collected using crowd-sourcing or created using formal logic. However, crowd-sourced explanations are highly subjective and diverse; hence not easily verifiable. Similarly, following ProofWriter, generating formal logic-based proof is possible but hard to scale for arbitrary P-H pairs. We instead rely on various NLI datasets, from which models can learn multiple ways of generating entailments, contradictions, and intermediate steps. We then enhance such provers with facts retrieved from knowledge bases and sentence composition methods. We use these techniques effectively to search for more generalized multiple-step proofs. We start from the premise P and generate the steps of the proof recursively with the Prover model. We use the term Prover to collectively denote the models in Section 3.1 and Section 3.2 as discussed next.

Figure 1: We illustrate two types of proofs that we explore: a) inference chains, and b) proofs with external facts. These proofs have been generated using unconstrained proof search (Section 4.1) and proof search with facts (Section 4.2).

3.1 Multi-Task Supervision

We train a single T5-Large model with various objectives to generate inferences from the premise. The objectives are specified by prefix tokens in the input text. The objectives are as follows.

Entailed Sentence Generation: The model is trained to generate possible entailments $E$ from the premise $P$ . The entailment instances of the SNLI dataset bowman-etal-2015-large is used for training. The hypothesis $H$ is considered as $E$ during training. The input to the model is entail: P and the output to be generated is H.

Contradictory and Neutral Sentence Generation: The model is trained to generate possible contradictions and neutral inferences from the premise $P$ . We use the relevant instances in the SNLI dataset for this objective. The input to the model is contradict: P or neutral: P and the output to be generated is the corresponding contradictory or neutral hypothesis H.

Monotonic Sentence Generation: The model is trained to generate monotonic inferences $M$ from the premise $P$ . The input is monotonic: P and the output to be generated is M. We use the Monotonicity Entailment Dataset (MED) yanaka-etal-2019-neural for this objective.

We merge and shuffle instances from the respective datasets to ensure that training is performed for all the objective functions simultaneously. More details can be found in the Appendix.

3.2 Fact Retriever and Sentence Composition

Fact Retriever

The generator model (section 3.1) can not generate proofs, which need reference to some external facts, or commonsense knowledge. Such knowledge may not be readily available in the generator model. It is thus necessary to assist the proof generation algorithm with relevant factual knowledge to generate accurate and complete proofs.

We use sentences in Open Mind Commonsense (OMCS) singh2002open and GenericsKB bhakthavatsalam2020genericskb as the knowledge base (KB). The sentence embedding model all-mpnet-base-v2 reimers-2019-sentence-bert is used to retrieve facts for a given (premise, hypothesis) pair from the KB. For a sentence $s$ , we retrieve the facts $F$ from KB based on highest embedding cosine similarity in the following way:

[leftmargin=*, topsep=0.2pt]
Retrieval with key-words: The list of noun tokens – $n_{p}, n_{h}$ are extracted from the premise and the hypothesis. The token list is then joined together to form $s$ to perform retrieval.
Retrieval with clustered key-words: $n_{p}, n_{h}$ are divided into small groups of related words using clustering with word embeddings. For instance, the following groups are created for the example in Figure 1: {dog, animal}, {snow, cold}, {frisbee, toy, plastic}. Each of the groups is then merged together in a single string $s$ for performing retrieval.

Sentence Composition

We train another T5-Large model to generate compositions from a pair of input sentences. The model is trained on sentence triplets from the Entailment Bank dalvi2021explaining and RuleTaker DBLP:conf/ijcai/ClarkTR20 datasets. Few example triplets $⟨ S_{1}, S_{2}, S_{3} ⟩$ are as follows: $⟨$ Bob is green, All green people are rough, Bob is rough $⟩$ , and $⟨$ Eruptions produce ash clouds, Ash blocks sunlight, Eruptions block sunlight $⟩$ . Here, $S_{1}$ and $S_{2}$ can be composed to conclude $S_{3}$ . In this setup, the input to the model is conclude: $S_{1}$ <sep> $S_{2}$ and the output to be generated is $S_{3}$ . The sentence composition model is then used with the fact retriever model to generate the proofs (section 4.2).

Method

Premise (P)

Hypothesis (H)

Proof

Unconstrained Proof Search

An old woman in a white hat and purple and blue clothes is sitting down by a wooden building.

There is a building.

\to

I1: A woman is sitting by a wooden building.

\to

I2: The building is made out of wood.

\to

Fact Composition

Bicyclist ride the course near the ocean as the day comes to an end.

The cyclist was riding near the ocean at sunset.

F1: Sunsets can happen at the end of the day.

&

\to

I1: A cyclist ride the course near the ocean during the sunset.

\to

A baby girl and little boy are standing next to a guitar and a drum.

The girl is near an instrument.

F1: A drum is a percussion instrument.

&

\to

I1: A baby girl and little boy are standing next to a percussion instrument.

\to

I2: A baby girl standing next to a percussion instrument.

\to

Table 2: Examples of generated proofs using different methods.

4 Proof Generation

We generate proofs with two different methods as described below, and provide some examples of generated proofs in Figure 1 and Table 2.

4.1 Unconstrained Proof Search (Inference Chains)

The entailment and monotonic sentence generation setup use the premise $P$ to create possible implications/inference chains. We use this generation setup in a recursive manner with level search (or beam search) to find multistep proofs $(I_{1}, I_{2})$ from the collection of generated chains. The recursive generation and search is performed as follows.

[leftmargin=*, topsep=0.2pt]
Given $P$ , we first generate one-step implications $_{1}$ from the T5 generator.
We denote the $n$ closest implications to the hypothesis $H$ as the filtered set $i_{1}$ . The closeness measure is performed by using highest cosine similarities with $H$ using the sentence embedding model all-mpnet-base-v2 reimers-2019-sentence-bert.
$i_{1}$ is then to generate the next set of implications $_{2}$ , which are filtered again to obtain $i_{2}$ .
The $n$ implications in $i_{2}$ and their respective source sentences in $i_{1}$ form the multistep proof set $(I_{1}, I_{2})$ for level search.
The top $n$ implications (according to closeness with $H$ ) from the merged set of $i_{1}$ and $(i_{1}, i_{2})$ form the proof set for beam search.

We use the terminology unconstrained with similarity for the above proof generation algorithm with sentence embedding based closeness measure. We use $n = 10$ in our experiments. The generation and filtering process can be performed repeatedly to form proofs with more steps: $(I_{1}, I_{2}, . . I_{m})$ . However, we observe diminishing results after $I_{2}$ , as steps tend to become repetitions of each other.

4.2 Proof Search with External Facts

Given a premise, hypothesis pair $(P, H)$ , we denote the facts retrieved from $K B$ as $F$ . Let $F$ consist of $m$ distinct facts ${F_{1}, F_{2}, . ., F_{m}}$ . We now use the sentence composition model as follows to generate the proof.

[leftmargin=*, topsep=0.2pt]
We compose premise $P$ with each of the facts ${F_{1}, F_{2}, . ., F_{m}}$ to generate the implications ${I_{1}, I_{2}, . ., I_{m}}$ .
If any composition $I_{k}$ is farther from $H$ than $P$ , then the respective fact $F_{k}$ is discarded. The distance is computed using embedding cosine similarity from all-mpnet-base-v2. We denote the filtered fact set as $¯ F$ .
Facts from $¯ F$ are first ordered based on their ranking from the retriever. This ordered set ${{¯ F}_{1}, {¯ F}_{2}, . ., {¯ F}_{j}}$ is composed with $P$ and the respective outputs recursively: $P & {¯ F}_{1} \to {¯ I}_{1}$ , followed by ${¯ I}_{1} & {¯ F}_{2} \to {¯ I}_{2}$ , … followed by ${¯ I}_{j - 1} & {¯ F}_{j} \to {¯ I}_{j}$ .
If ${¯ I}_{j}$ is not sufficiently close to $H$ , then further implications ${^I}_{j}$ are generated using the entailment and monotonic sentence generator model.

5 Experimental Results

We perform two sets of experiments – i) we use human and automatic evaluation to evaluate different aspects of the proofs generated from our proposed method, and ii) we use the generated proofs as additional labeled data for NLI and analyze its effect on the NLI classification task performance.

Algorithm	Correctness				Minimality
	$P - H$	$P - I_{1}$	$I_{1} - I_{n}$	$I_{n} - H$	$P - H$		$P - I_{1}$		$I_{1} - I_{n}$		$I_{n} - H$		$#$ Steps	$#$ Keywords
	$P - H$	$P - I_{1}$	$I_{1} - I_{n}$	$I_{n} - H$	B4	JS	B4	JS	B4	JS	B4	JS	$#$ Steps	P	$I_{1} - I_{n}$	H
Gold Proofs	95.27	94.59	98.42	94.09	0.0272	66.26	0.0981	68.34	0.2223	74.07	0.0911	65.86	2	6.82	2.94	2.54
Negated Gold Proofs		3.94	9.45	20.08			0.0189	61.61	0.0596	62.59	0.0502	59.91			2.83
Perturbed Gold Proofs		5.91	3.15	24.41			0.0139	62.37	0.0202	58.30	0.0147	56.80			3.10
T5 w/o Search	92.28	94.06	94.63	36.25	0.0751	71.07	0.0793	67.36	0.3070	78.82	0.1162	67.45	2	7.14	2.74	3.27
UC Level Search	92.28	92.46	89.85	67.85	0.0751	71.07	0.1367	70.97	0.2443	75.94	0.0987	68.76	2	7.14	3.25	3.27
UC Beam Search	92.28	91.99	82.55	72.94	0.0751	71.07	0.1308	69.81	0.2652	75.82	0.2240	76.09	1.71	7.14	3.21	3.27
Proof Search with Facts	92.42	68.69	67.68	90.15	0.0066	65.52	0.6261	90.10	0.3540	80.63	0.0826	71.05	2.93	7.36	4.44	2.92

Table 3: Evaluation of proofs with correctness and minimality metrics for: i) gold, noisy gold proofs ii) unconstrained (UC) level, beam search in all SNLI test set entailment pairs, and iii) proof search with facts in a subset of SNLI test set entailment pairs.

P, H, I_{1}, I_{n}

, B4, JS denotes the premise, hypothesis, first, last step of the proof, BLEU-4, and Jaccard similarity respectively.

I_{1} - I_{n}

denotes the average score of all intermediate pairs. Correctness scores indicate the percentage of pairs predicted as entailment by the roberta-large-mnli model. Minimality scores are computed with BLEU-4, Jaccard similarity, number of proof steps and number of keywords.

5.1 Human Verification of Proofs

We perform human verification on a subset of generated proofs in the SNLI dataset. We train four CS graduate students (trained in NLP) with explicit instructions (in Appendix). We select top-2 proofs from unconstrained proof search for 500 randomly selected SNLI entailment instances – resulting in a total of 1000 proofs. Human annotators score each proof based on their correctness and minimality. For a proof with two intermediate sentences, a correctness score of 3, 2, 1, or 0 is given, signifying all three steps are correct, the first two steps are correct, only the first step is correct, and all other cases, respectively. In this schema, steps at the beginning have more importance than a step towards the end. Annotators also give a minimality score of 1 or 0, signifying all steps are minimal or at least one step is non-minimal, respectively. The human verification process results in a normalized correctness score of 67.28% and a minimality score of 0.7517. More details can be found in the appendix.

5.2 Automatic Evaluation of Proofs

How to Evaluate the Generated Proofs?

We also automatically assess the quality of proofs generated using unconstrained proof search and proof search with facts methods. We design a number of metrics corresponding to the constraints over the proofs as introduced in Section 2. For Correctness, we use the entailment probability score from RoBERTa-MNLI. Minimality is more non-trivial to measure. We instead use proxy metrics: i) BLEU4 and Jaccard similarity to quantify distinctiveness of consecutive steps, ii) Avg. number of keywords (nouns, verbs and adjectives) to quantify atomicity of steps, and iii) Avg. number of steps for generated proofs.

Assume an entailment instance from SNLI: $(P r e, H y p)$ whose proof is the sequence $(I_{1}, I_{2})$ . We compute the correctness and minimality for all consecutive step pairs $(P r e - I_{1}), (I_{1} - I_{2}),$ and $(I_{2} - H y p)$ . A high percentage of step pairs to be predicted as entailment is desired for correctness. For minimality (BLEU-4 and Jaccard), we expect scores to be neither very high nor very low. A very high score (close to 1) would suggest that the pair is not sufficiently different. A very low score (close to 0) would suggest that the pair may not be related at all. Similarly, we expect length of proofs to be not too high nor too low (such as zero or single-step proofs). To calibrate this range and set a baseline, we first report these metrics in Table 3 on a subset of gold proofs that are evaluated to be correct by human annotators. We also systematically introduce noise to observe how these metrics behave.

Automatic Evaluation of Gold Proofs: The suitability of the above metrics can be questioned for two reasons. Firstly, we do not have any corresponding gold proofs (that satisfies the constraints in Section 2). Second, it is hard to guess the range (or the best value) for the metrics, as we use proxy metrics and neural network models that are prone to noise. To mitigate these issues, we select a set of 200 proofs from unconstrained proof search in SNLI, judged as correct and minimal by human annotators. We call this the Gold Proofs set and measure its correctness and minimality through the automatic metrics. The Gold Proofs are correct from human judgement and are also estimated as highly accurate from the automated correctness metric in Table 3. The BLEU4 and Jaccard similarity (for minimality) range between 0.09 to 0.23 and 65% to 75%, respectively. Together, they establish the range for the metrics that we expect valid proofs should exhibit. When gold proofs are negated or perturbed (more in Appendix), we observe significant drops in correctness scores, and deviation from the minimality range. We thus expect the proposed metrics to be suitable proxies to quantify the properties of proofs in Section 2.

Results for Unconstrained Proof Search: We report automatic evaluation results of proofs generated from unconstrained proof search (using level and beam search) in Table 3. We select top-2 proofs from unconstrained proof search for all entailment pairs in SNLI and report the mean scores in Table 3. Results suggest that the unconstrained search algorithm is capable of generating proofs with high correctness and ideal minimality. The minimality scores (BLEU4, Jaccard similarity) of the level search generated proofs are very close to that of the Gold Proofs. A general trend in the correctness measure is that the score decreases as we move from left to right pairs. This is expected as the hypothesis is not directly used for proof generation in unconstrained search, resulting in the final pair being less entailing compared to previous ones.

Results for T5 w/o Search: We also report results for the T5 model (Section 3.1) where we don’t use any search until the last $I_{n} \to H$ step. We obtain significantly lower correctness for the final step and mostly out of range minimality scores with this method. The T5 w/o search method only generates entailed statements from the premise. Thus we obtain high correctness scores in the initial steps, as reported in Table 3. However, without guiding the intermediate steps towards the hypothesis a complete proof cannot be generated. This is why the correctness score of the last step is abysmally low for T5 w/o search. The minimality scores of the intermediate steps $(I_{1} - I_{n}, I_{n} - H)$ for the unconstrained beam / level search method are also closer to the gold proofs compared to T5 w/o search.

Trained On		MNLI				MED
MNLI %	Proofs %	Entailment	Neutral	Contradict	Average	Entailment	Neutral	Average
1	0	83.86	75.84	85.49	81.84	44.84	35.15	40.02
1	1	85.05	77.16	83.98	82.19	49.98	35.72	42.89
2	0	86.26	79.38	86.22	84.06	48.40	39.33	43.89
2	2	86.08	79.93	86.68	84.32	51.34	39.68	45.54
10	0	87.65	80.87	88.47	85.76	48.81	35.61	42.24
5	5	89.14	83.89	89.96	87.74	53.81	34.22	44.07
10	10	89.72	83.56	89.89	87.80	54.76	39.82	47.32

Table 4: F1 scores for MNLI (val-matched) and MED datasets with the RoBERTa-Large model. Numbers on the MNLI % column indicate the percentage of the MNLI train set instance used for training. Numbers on the Proofs % column indicates the equivalent number of instances from the Prover generated data used for training. Hence, Proofs % of 1 implies the number of Prover generated instances is the same as 1% of MNLI train instances. Scores are average of 3 runs.

Data (MNLI %, Proofs %) $\to$	1%, 0%	2%, 0%	1%, 1%	10%, 0%	5%, 5%	10%, 10%
Category $↓$
Boolean	55.23	60.58	62.46	62.83	60.09	60.5
Causal	89.19	82.72	85.42	87.86	90.51	93.13
Comparative	49.05	48.68	49.66	57.33	55.90	59.12
Conditional	36.19	49.05	35.43	41.00	44.51	40.30
Coreference	51.08	52.62	51.91	50.00	50.00	53.82
Implicature	17.09	19.86	15.55	22.91	23.07	22.61
Lexical	92.44	96.43	95.62	93.99	97.41	96.15
Negation	97.86	97.61	91.42	99.61	99.81	99.95
Numerical	44.01	51.62	56.74	59.48	70.10	62.13
Presupposition	78.85	79.22	81.98	87.53	93.24	91.24
Quantifier	65.40	64.78	60.30	65.20	67.71	68.65
Relational	65.74	75.49	78.44	89.84	90.12	94.19
Spatial	50.92	46.86	50.56	50.26	44.95	57.60
Syntactic	83.37	81.69	76.08	93.88	99.30	99.92
Taxonomic	76.24	65.89	64.36	64.73	72.34	60.23
Temporal	28.63	27.26	37.89	45.62	44.78	47.29
World	93.82	98.98	97.92	99.75	99.50	99.24
Average	63.23	64.67	64.22	68.93	70.78	70.94

Table 5: Reasoning category wise F1 scores on the Lo-NLI dataset for the RoBERTa-Large model. Scores are average of 3 runs.

Results for Proof Search with Facts: We identify entailment instances in SNLI where the cosine similarity between premise and hypothesis is less than 0.65 with a low overlap between tokens. We use proof search with facts to generate proofs for these instances. In Table 3, the minimality scores suggest that generated steps become more minimal as we move towards the hypothesis, having progressively less difference with the ideal minimality scores (of Gold Proofs). Recursive composition of facts result in reduced correctness scores, as pairs are not always entailed because of the introduction of new knowledge³³3Aligns with observations from joshi2020taxinli; tarunesh2021trusting that RoBERTa-Large-MNLI model often fails on NLI pairs requiring external knowledge.. However, the final step $I_{n}$ reaches a high correctness score of 90.15%. Furthermore, the monotonic sentence generator setup can be used on $I_{n}$ to generate inferences which are even closer to $H$ with average cosine similarity of 0.9489. Thus, the proof search with facts algorithm is suitable for generating proofs that requires external commonsense knowledge.

5.3 Usefulness of the Prover for NLI Tasks

As illustrated in section 3.1, the Prover model is trained on various objective functions to generate inferences from the premise. Here, we show that these inferences could be used to improve end-to-end NLI task accuracy. We consider the entailed, contradictory, neutral, and monotonic sentence generation setup of section 3.1 and apply it on the premises of the SNLI dataset to create corresponding inferences. For a particular premise $P$ , the inferences generated from the above four setups are considered to have a label of entailment, contradiction, neutral, and entailment, respectively.

The inferences generated from the Prover are considered as additional labeled data for supervised learning in NLI tasks. In particular, we train RoBERTa-Large classification models for the MNLI williams2018broad dataset in a low-data regime. The models are trained with either i) MNLI-only data or ii) a mix of MNLI data and Prover generated data. We show that performance consistently improves when trained with additional Prover generated data, signifying the efficacy of our method. We also evaluate the models trained on MNLI in two other datasets – Monotonicity Entailment Dataset (MED) and LoNLI tarunesh2021trusting. The incorporation of Prover generated data helps in improving performance across most settings in these two datasets.

Results for MNLI and MED: We report results across various combination of amounts of MNLI data and Prover generated data in Table 4. In the very low data regime (1% MNLI + 0% Proofs vs. 1% MNLI + 1% Proofs), the additional data from Prover helps in improving performance across all the label categories in MNLI and MED. However, the performance of 1% MNLI + 1% Proofs setting is lesser compared to 2% MNLI + 0% Proofs, signifying that more MNLI training data helps in better generalization for very low data setting. The improvement in performance is more prominent in the 10% MNLI + 0% Proofs vs. 5% MNLI + 5% Proofs case. An identical number of training instances are used in both settings, and thus the improvement in performance can be directly attributed to the Prover generated data. We observe improvements in all but one metric with large gains in MNLI neutral and MED entailment for 5% MNLI + 5% Proofs.

Results for LoNLI: The LoNLI tarunesh2021trusting dataset consists of CheckList templates of (premise, hypothesis) pairs, associated examples, and corresponding entailment, neutral, or contradiction labels. The templates are categorized according to reasoning categories, such as causal, lexical, and syntactic. We report results for evaluation on LoNLI with models trained on subsets of MNLI and optionally the Prover generated data in Table 5. We observe a similar trend, where incorporation of Prover generated data helps in improving the overall performance with significant gains across a number of reasoning categories. The improvement in numerical, presupposition, relational, and temporal categories are most prominent. We also find an almost 2% overall improvement for 5% MNLI + 5% Proofs over 10% MNLI + 0% Proofs.

6 Related Work

We explore generation of multiple-step natural language proofs for the NLI task. For verification and generation, we take inspiration from the automated theorem proving and research in deductive reasoning with Transformers. We also explore how proofs can be further utilized.

Categories of Proofs: The NLP community has introduced several NLU tasks and datasets with accompanying explanations wiegreffe-marasovic-2021-review, ranging from highlighted context, natural language based or structured proof trees. Highlighted explanations hardly help in scenarios requiring external knowledge. Human-provided natural language explanations camburu2018snli are hard to validate; and structured explanations are hard to scale. For certain closed-world setting DBLP:conf/acl/TafjordDC21 and symbolic domains, synthetically generated explanations has been popular. Authors in DBLP:journals/corr/abs-2112-00114 explore NaturalProofs, where the proof consists of both natural language and mathematical symbols. However, the underlying reasoning task being mathematical, makes the task more well-defined. A large body of work in the ATP community DBLP:conf/aaai/PaliwalLRBS20; aygun2020learning; agarwal2021analyzing generates synthetic fine-grained steps as proofs for symbolic domains. Here, we define constraints over natural language proofs to ease validation and generation.

Iterative Proof Generation and Search: DBLP:conf/acl/TafjordDC21 utilize synthetically generated end-to-end proof trees as supervision. Authors train a strong encoder-decoder model such as T5 to iteratively generate the next step of a proof, and use search to generate full proofs. Similarly, in the Automated Theorem Proving literature, recent work DBLP:conf/aaai/PaliwalLRBS20; aygun2020learning; agarwal2021analyzing has shown that Graph Neural Networks and Transformers can be trained to perform impressively on the theorem-proving task as part of a neuro-symbolic system. Theorem-proving is a multiple-step search-based task, where the neural method learns to simplify an input goal into several sub-goals. This is done recursively using search, until all sub-goals are proven (i.e., simplified to empty goals). Further, These proof systems are easier to define as the number of operations (or theorems or tactics) comes from a known closed set. Thus, it is non-trivial to extend above methods to generate natural language proof without full end-to-end supervision.

Proofs to Improve End-task Accuracy: DBLP:conf/wsdm/HeL0ZW21 and DBLP:conf/acl/KumarT20 has demonstrated the utility of generating proofs (or natural language explanations). In Knowledge-graph based QA context, DBLP:conf/wsdm/HeL0ZW21 show how a teacher network trained on additional intermediate hops can be used to enhance the performance of a student network, that is exposed only to the final output supervision. Most importantly, DBLP:conf/acl/KumarT20 uses label-specific explanations and use them directly to generate the NLI conclusion. While the authors show how explanations are used to generate the conclusion, the authors do not generate fine-grained steps. Authors also compare with a non-recent baseline by camburu2018snli on MNLI, instead of SOTA methods such as RoBERTA-large.

We study multi-step reasoning for the NLI task; where connecting a premise and hypothesis may require external knowledge (including free-form fuzzy commonsense rules). The closest to our work is RuleTaker and ProofWriter DBLP:conf/acl/TafjordDC21; where authors explore deductive abilities of Transformers over natural language. We differ from their work, as we do not provide the required rules explicitly and do not restrict the natural language input in any form. Our fact retrieval method is inspired from DBLP:conf/icml/GuuLTPC20; NEURIPS2020_fc84ad56. NEURIPS2020_fc84ad56 proposes to combine a non-parametric retriever model with a parametric generator for knowledge-augmented NLP tasks. While the goals are similar, we use a custom retrieval method as we do not have knowledge-augmented sentence composition supervision for learning retrieval and generation in an end-to-end fashion.

7 Conclusion

We propose a method to generate knowledge-enriched multiple-step textual proofs (intermediate conclusions) for the NLI task utilizing only next-step supervision. We train a T5 model to generate the next step given a premise-hypothesis pair, and use external fact augmentation, search for more generalized proofs. To ease generation and automatic verification, we introduce constraints over expected proofs, and associated automated metrics. Both human and automated verification show the effectiveness of our proposed method. We also show that our generated proofs can be used to improve NLI task performance using standard data augmentation techniques (on low-data scenarios).

References

Appendix A Prover

The multitask T5 generator model( Section 3.1) can be trained on additional objective functions for improved performance and better generalization. The objectives are as follows:

Explanation Generation: The model is trained to generate an explanation $X$ given the premise $P$ and hypothesis $H$ . The input is explain: P <sep> H and the output to be generated is X. The non-neutral instances of the E-SNLI camburu2018snli dataset is used for this objective.

Entailment Bank Proof Generation: The model is also trained to generate one-step proofs given a premise and a conclusion. Suppose conclusion $c$ can be derived from premise sentences $s_{1}, s_{2}$ . For this instance, the input to the model is proof: $s_{1}$ <sep> c and the output to be generated is $s_{2}$ . Another instance is also created by interchanging $s_{1}$ and $s_{2}$ in the input and output. We consider all leaf sentences, intermediate conclusions, and the final hypothesis of the proof trees in the Entailment Bank dataset dalvi2021explaining to create these instances.

These two objectives are more useful for explanation / intermediate step generation for neutral premise-hypothesis pairs.

Appendix B Constrained Proof Generation

The explain: and proof: generation setup can be used to directly generate the proof in using the premise $P$ and hypothesis $H$ . We observe that the generator model learns to exploit the proof generation process by generating a single sentence as proof, due to the nature of the training data in E-SNLI and Entailment Bank. We thus evaluate the generated proofs against the gold test set annotations in E-SNLI with a number of text generation evaluation metrics – BLEU, METEOR, ROUGE, CIDER and semantic similarity (SIM) using all-mpnet-base-v2. The results are reported in Table 6.

Prefix	BLEU1	METEOR	ROUGE	CIDEr	SIM
explain	0.4063	0.2491	0.3898	1.6603	0.6372
proof	0.2631	0.1537	0.2698	0.7705	0.4720

Table 6: Results for constrained proof generation. The generated proofs are matched against the gold annotations in E-SNLI. SIM indicates semantic similarity.

Appendix C Datasets used

All datasets used and results reported in this paper are for English language.

LoNLI:

The dataset contains templates of (premise, hypothesis) pairs, associated examples, and corresponding entailment, neutral, or contradiction labels. For instance, one such entailment template is - premise: Name moved from Country1 to Country2; hypothesis: Name now lives in Country2. All examples created from this template are labeled as entailment. There are 363 templates in total (166 entailment, 163 contradiction, 34 neutral) each having 1000 examples. The templates are categorized according to reasoning categories, such as lexical, syntactic, boolean, causal, etc. The dataset was inspired by the behavioral testing methodology of NLP systems proposed in CheckList ribeiro-etal-2020-beyond.

Appendix D Automatic Evaluation of Gold Proofs

We report the automatic evaluation metrics of Gold Proofs, Negated Gold Proofs, and Perturbed Gold Proofs in Table 3. For Negated Gold Proofs, we take the intermediate steps of the Gold Proofs and transform them to a contradictory sentence. We perform this operation using the contradiction generator of the Prover model (Section 3.1). This resultant proof is now incorrect, which is accurately represented by the very low correctness scores in Table 3. For Perturbed Gold Proofs, we randomly replace 50% of the words in the intermediate steps with a DistilBERT contextual augmenter ma2019nlpaug. As expected, we observe significant drops across correctness, BLEU4, and Jaccard similarity metrics in Table 3. The drop in BLEU4 and Jaccard similarity is more prominent for Perturbed Gold Proofs compared to Negated Gold Proofs. This result is intuitive because of the word replacing strategy in Perturbed Gold Proofs. We conclude that the proposed automatic evaluation metrics capture the properties and constraints of proofs distinctly and are thus suitable for the task.

Appendix E Additional Results

Algorithm	Correctness				Minimality
	$P - H$	$P - I_{1}$	$I_{1} - I_{n}$	$I_{n} - H$	$P - H$		$P - I_{1}$		$I_{1} - I_{n}$		$I_{n} - H$		$#$ Steps	$#$ Keywords
	$P - H$	$P - I_{1}$	$I_{1} - I_{n}$	$I_{n} - H$	B4	JS	B4	JS	B4	JS	B4	JS	$#$ Steps	P	$I_{1} - I_{n}$	H
Gold Proofs	95.27	94.59	98.42	94.09	0.0272	66.26	0.0981	68.34	0.2223	74.07	0.0911	65.86	2	6.82	2.94	2.54
Negated Gold Proofs		3.94	9.45	20.08			0.0189	61.61	0.0596	62.59	0.0502	59.91			2.83
Perturbed Gold Proofs		5.91	3.15	24.41			0.0139	62.37	0.0202	58.30	0.0147	56.80			3.10
T5 w/o Search	92.28	94.06	94.63	36.25	0.0751	71.07	0.0793	67.36	0.3070	78.82	0.1162	67.45	2	7.14	2.74	3.27
UC Level Search	92.28	92.46	89.85	67.85	0.0751	71.07	0.1367	70.97	0.2443	75.94	0.0987	68.76	2	7.14	3.25	3.27
UC Beam Search	92.28	91.99	82.55	72.94	0.0751	71.07	0.1308	69.81	0.2652	75.82	0.2240	76.09	1.71	7.14	3.21	3.27
Proof Search with Facts	92.42	68.69	67.68	90.15	0.0066	65.52	0.6261	90.10	0.3540	80.63	0.0826	71.05	2.93	7.36	4.44	2.92
Proof Search with Facts*	92.42	98.23	93.43	88.89	0.0066	65.52	0.4457	89.83	0.2699	79.93	0.0796	70.33	2.93	7.36	5.45	2.92

Table 7: Additional results for proof search with facts. All the rows except the last row are the same results as reported in Table 3. The last row Proof search with facts* shows the additional results. Proof search with facts* indicate the setting where the fact being composed with a given step is concatenated with that particular step for the calculation of the correctness and minimality metrics. Proof search with facts* results in higher correctness and closer minimality scores relative to the gold proofs.

Trained On		MNLI				MED
MNLI %	Proofs %	Entailment	Neutral	Contradict	Average	Entailment	Neutral	Average
1	0	87.09	81.94	90.01	86.41	39.62	38.12	38.87
1	1	89.08	81.30	90.03	86.91	45.00	33.93	39.49
2	0	89.45	84.59	91.63	88.61	40.44	38.39	39.42
2	2	89.67	84.62	91.87	88.78	46.52	39.58	43.07
10	0	91.27	86.12	92.97	90.19	43.12	37.42	40.28
5	5	90.18	85.59	92.56	89.50	42.45	39.71	41.09
10	10	91.37	84.93	92.32	89.63	48.23	36.40	42.35

Table 8: F1 scores for MNLI (val-matched) and MED datasets for the DeBERTa-Large model. Scores are average of 3 runs.

Data (MNLI %, Proofs %) $\to$	1%, 0%	2%, 0%	1%, 1%	10%, 0%	5%, 5%	10%, 10%
Category $↓$
Boolean	47.19	64.66	62.51	66.33	64.98	62.29
Causal	96.74	97.23	97.13	98.76	97.78	97.53
Comparative	60.12	62.64	62.69	62.17	70.46	63.10
Conditional	42.62	54.49	56.10	40.90	65.53	53.41
Coreference	69.13	62.50	62.31	87.71	74.21	78.58
Implicature	29.70	33.23	42.34	33.44	39.41	35.73
Lexical	96.09	96.72	97.00	93.17	97.19	96.80
Negation	95.49	96.37	99.20	95.26	98.21	95.49
Numerical	50.94	62.46	61.78	78.06	71.23	68.35
Presupposition	78.79	90.64	88.37	95.41	95.17	95.45
Quantifier	53.81	71.44	58.44	59.48	71.17	73.67
Relational	96.50	98.25	96.10	97.03	96.74	99.15
Spatial	56.26	58.58	62.33	51.05	56.15	64.47
Syntactic	93.86	99.77	99.94	100.0	100.0	100.0
Taxonomic	66.67	69.84	72.60	75.14	68.56	73.65
Temporal	24.70	45.37	37.82	55.02	52.76	52.31
World	99.50	99.75	99.50	99.50	99.50	99.50
Average	68.13	74.35	73.89	75.79	77.59	77.03

Table 9: Reasoning category wise F1 scores on the Lo-NLI dataset for the DeBERTa-Large model. Scores are average of 3 runs.

Premise (P)

Hypothesis (H)

Proof

A black-haired man is entertaining a crowd with a hula hoop.

A man has black hair.

\to

I1: A man with black hair is performing.

\to

ii)

A police officer standing next his police motorbike holding a yellow safety jacket.

A police officer and his motorbike.

\to

I1: Police are riding a police motorbike.

\to

I2: A police motorbike is being rode by an officer.

\to

iii)

A person in a green robe sits on a couch with a blanket.

A person in a robe sits on a couch.

\to

I1: A woman in a green robe sits on a couch.

\to

I2: A girl in a green robe sits on a couch.

\to

iv)

A man and wife stand at the alter as they get married.

A couple is getting married.

\to

I1: A man and a woman are married.

\to

I2: A husband and a wife are married.

\to

A man gets a skateboard up on a big rock.

The skateboarder is near a rock.

\to

I1: A person puts his skateboard on a rock.

\to

I2: Person skateboarding on a rock.

\to

vi)

A professional swimmer spits water out after surfacing while grabbing the hand of someone helping him back to land.

A person is swimming.

F1: A hand is part of the arm.

&

\to

Monotone

\to

I1 A person is swimming.

\to

vii)

A female guitarist is playing on stage.

A woman is playing her instrument.

F1: A guitar is an instrument.

&

\to

Monotone

\to

I1 A woman is playing an instrument.

\to

viii)

Two middle aged police officers watch over a parking lot at night.

A couple cops keep an eye on the parking lot.

F1: An eye is used to see with.

&

\to

Monotone

\to

I1 Some cops look at the parking lot.

\to

Table 10: Some examples of generated proofs. Analysis can be found in Appendix F.

We also show additional results for Proof Search with Facts in table 7. We mentioned earlier in Section 5.2 that the recursive composition of facts result in reduced correctness scores, as pairs are not always entailed because of the introduction of new knowledge. This is demonstrated in the Proof Search with Facts row in table 7. We evaluate another method Proof search with facts*, where the fact being composed with a given step is concatenated with that particular step for the calculation of the correctness and minimality metrics. This strategy results in higher correctness and closer minimality scores relative to the gold proofs.

We show some additional results to show the usefulness of the Prover for various NLI tasks. This is an extension of the results reported in section 5.3. As described earlier, we use the Prover generated data as additional labeled data for supervised learning in NLI tasks. Here, we train a DeBERTa-Large model he2021debertav3 with i) MNLI-only data or ii) a mix of MNLI data and Prover generated data. The results are reported for MNLI and MED datasets in table 8, and for Lo-NLI dataset in table 9.

For the DeBERTa-Large model, we found similar conclusions that we made earlier for the RoBERTa-Large model in section 5.3. The incorporation of Prover generated data helps in improving performance across most settings in the three datasets. In the very low data regime (1% MNLI + 0% Proofs vs. 1% MNLI + 1% Proofs or 2% MNLI + 0% Proofs vs. 2% MNLI + 2% Proofs), the additional data from Prover helps in improving average performance in the MED dataset.

The incorporation of Prover generated data also helps in improving the overall performance with significant gains across a number of reasoning categories in the Lo-NLI dataset. We find an 1.8% overall improvement for 5% MNLI + 5% Proofs over 10% MNLI + 0% Proofs.

Appendix F Analysis of Generated Proofs

We manually analyze proofs from both unconstrainted search and proof search with facts. Some interesting example generated proofs are in Table 10. In example i) and ii), the premise and the hypothesis are lexical/syntactic paraphrases. Hence, no intermediate steps are required as a proof. In example iii), undue specialization or hallucinations are introduced in the intermediate steps, as the word person is changed to a woman and girl. In example iv), additional knowledge beyond the ones retrieved are required to proof the instance. In this case, the knowledge a man and his wife are called a couple, or a woman and his husband are called a couple would be required. Example v) requires spatial reasoning about physical objects skateboard and rock. In example vi), an unrelated fact is retrieved which is not useful for the proof. Example vii) and viii) are instances of correct proofs. We also observe that fact composition provides consistently superior results, especially when ontological knowledge (IsA, HasA, is-part-of relations) are involved. However, as both the search procedures does not involve any learning, in some case intermediate steps are generated that moves the reasoning away from the hypothesis – for example “men” converted to ”people’, then converted back to “male” in the following example; “a man in a black shirt and brown pants is jumping in the air performing a karate kick towards a man in a red shirt and black hat” $\to$ “two men are performing karate” $\to$ “two people are doing martial arts” $\to$ “two males are involved in martial arts.”

Appendix G Instructions for Human Verification

We performed human verification for proofs with two intermediate steps. The following instructions were given as it is to the human annotators.

g.1 Definition of Entailment

Entailment is a directional relation between two sentences - S1 and S2. The relation holds whenever the truth of the second sentence S2 follows from the first sentence S1. In other words, if a human reading S1 infers that S2 is true, then (S1, S2) is an entailment pair. Note that, (S1, S2) being an entailment pair does not necessarily mean that the reverse pair (S2, S1) is an entailment pair. Some examples are given below:

S1: A football game with multiple males playing.

S2: Some men are playing a sport.

Label: (S1, S2) is an entailment pair.

However, (S2, S1) is not an entailment pair because playing a sport does not necessarily mean playing football. More examples:

S1: A woman is walking outside.

S2: A person outdoors.

Label: (S1, S2) $\to$ Entailment.
S1: An older and younger man smiling.

S2: Two men are smiling and laughing at the cats playing on the floor.

Label: (S1, S2) $\to$ Not Entailment.

If S1 and S2 are the same sentences or very similar sentences with minimal difference in tokens then consider that as entailment.

S1: An older and younger man smiling.

S2: An older and younger man smiles.

Label: (S1, S2) $\to$ Entailment.

g.2 Definition of Uniqueness

S1 and S2 are two given sentences. If S1 and S2 are very similar in the token space then consider (S1, S2) as not unique. If they are not very similar then they are unique. This is a symmetrical relation. The uniqueness of (S1, S2) is the same as (S2, S1).

S1: A football game with multiple males playing.

S2: Some men are playing a sport.

Label: Unique.
S1: An older and younger man smiling.

S2: An older and younger man smiles.

Label: Not unique.
S1: A man is playing a guitar.

S2: A man is playing a musical instrument.

Label: Unique.
S1: A man is playing a guitar.

S2: A guitar is being played by a man.

Label: Not Unique.

Please note the difference between the last two examples carefully. The usage of the word musical instrument makes the third pair unique.

g.3 Instructions

Each instance has four elements: i) Premise, ii) Step 1, iii) Step 2, and iv) Hypothesis. We are aiming to find a proof of the hypothesis from the premise. In our case, Step 1 and Step 2 constitute the proof that lets us go from the premise to the hypothesis. In other words, Step 1 and Step 2 provide the gradual transition between the premise and the hypothesis. Considering this, we have 3 sentence pairs from each proof:

X1 : (Premise, Step 1)

X2 : (Step 1, Step 2)

X3 : (Step 2, Hypothesis)

Each instance has to be scored in two aspects using the three above pairs:

Correctness: Give a score among [3, 2, 1, 0]. This is assigned by checking the entailment label of the three pairs.
- X1, X2, X3 are all entailment: Score is 3
- Only X1, X2 are entailment: Score is 2
- Only X1 is entailment: Score is 1
- All other cases: Score is 0
Note that the priority of the pairs are: X1 ¿ X2 ¿ X3. If X2, X3 are both entailment but X1 is not then we will give a score of 0.
Minimality: Give a score among [1, 0].
- If X1, X2, X3 are all unique pairs: Score is 1
- If any pair is not unique: Score is 0

Appendix H Experimental Setup

We use beam search to generate outputs from the T5-Large models. A beam length of 10 is used. The T5-Large models were trained with the Adafactor optimizer shazeer2018adafactor with a learning rate of 5e-6. We retrieve top 8 facts (according to cosine similarity) from the retriever model for composition.

Appendix I Computational Resources

We use a single Quadro RTX 8000 GPU for our experiments. We train the Prover T5 model and composition T5 model for 15 and 5 hours in this GPU. The T5-Large and RoBERTa-Large models have 770M and 355M parameters, respectively.