Meta-Learning for Question Answering on SQuAD 2.0

This post is summarized from my final project for Stanford CS224N.

In a Question Answering (QA) system, the model learns to answer a question by properly understanding an associated paragraph. However, developing a QA system that performs robustly across all domains can be especially challenging because we do not always have an abundant amount of data in every domain. Therefore, one area of focus in this field has been training a model to learn a new task with only limited data available (Few-Shot Learning, FSL).

Meta-learning in supervised learning, in particular, is known to perform well in FSL: the idea is to teach a model to find initial parameters that enable it to learn a new task after seeing only a few samples of the associated data. In this study, we were given a large number of in-domain (IND) samples but only a limited number of out-of-domain (OOD) samples, along with a fine-tuned (FT) DistilBERT model that was known to perform well on the IND set. To improve the robustness of the FT baseline model on the OOD set, we trained: (1) a MAML DistilBERT model from scratch, and (2) a MAML DistilBERT model initialized from the FT baseline checkpoint.

Background

SQuAD 2.0 dataset. Three in-domain (SQuAD, NewsQA, Natural Questions) and three out-of-domain (DuoRC, RACE, RelationExtraction) datasets. Each in-domain (IND) dataset contains 50K question-passage-answer samples, while each out-of-domain (OOD) dataset contains only 127.

Model-Agnostic Meta-Learning (MAML). MAML was originally proposed by Finn et al. (2017) to learn a set of initial model parameters such that the model can perform well on a new task ("learn-to-learn") after only one or a few gradient updates on few-shot data.
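In the standard formulation from Finn et al. (2017), for each task T_i the inner loop adapts the parameters with one or a few gradient steps on that task's loss, and the outer loop updates the shared initialization using the loss of the adapted parameters, where α is the inner-loop learning rate and β is the meta learning rate:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)$$

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}\big(f_{\theta_i'}\big)$$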

Methods

Fine-tuned Baseline. A fine-tuned (FT) pre-trained transformer model, DistilBERT. The baseline QA model was trained on the full IND training set and validated on the IND validation set.
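This corresponds to standard extractive-QA fine-tuning. The sketch below is a rough illustration with Hugging Face Transformers, not the exact project code; the checkpoint name, learning rate, and training-step details are assumptions:

```python
import torch
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizerFast

# Assumed setup: standard extractive-QA fine-tuning of DistilBERT.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(question, context, start_pos, end_pos):
    # Encode the question-passage pair; the answer span is given as token positions.
    inputs = tokenizer(question, context, truncation=True, max_length=384,
                       return_tensors="pt")
    outputs = model(**inputs,
                    start_positions=torch.tensor([start_pos]),
                    end_positions=torch.tensor([end_pos]))
    outputs.loss.backward()  # cross-entropy over start/end span logits
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```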

MAML DistilBERT. We adapted MAML as a framework to train our robust QA system that performs well across different domains.

Figure 1. Model architecture of MAML DistilBERT. Training support and query sets can come from In‐domain or OOD datasets and are a factor we experimented on.
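A minimal first-order MAML sketch of this meta-training loop is below. It is illustrative rather than the project's exact code: qa_loss is a hypothetical helper standing in for the QA forward pass, each task supplies a (support, query) pair of batches that may come from IND or OOD data, and second-order terms of the meta-gradient are ignored for simplicity:

```python
import copy
import torch

def qa_loss(qa_model, batch):
    # Hypothetical helper: `batch` is a dict of tensors from the tokenizer,
    # including start_positions/end_positions for the answer span.
    return qa_model(**batch).loss

def meta_train_step(model, meta_optimizer, tasks, inner_lr=1e-4, inner_steps=1):
    """One first-order MAML meta-update over a batch of QA tasks.

    `tasks` is a list of (support_batch, query_batch) pairs; each batch may be
    drawn from an in-domain or out-of-domain dataset.
    """
    meta_optimizer.zero_grad()

    for support_batch, query_batch in tasks:
        # Inner loop: adapt a copy of the shared initialization on the support set.
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            qa_loss(adapted, support_batch).backward()
            inner_opt.step()

        # Outer loop: evaluate the adapted copy on the query set and accumulate
        # its gradients into the shared parameters (first-order approximation).
        query_loss = qa_loss(adapted, query_batch)
        grads = torch.autograd.grad(query_loss, tuple(adapted.parameters()))
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g

    meta_optimizer.step()  # update the shared initialization
```

With this structure, the choice of which datasets feed the support and query sets becomes the main experimental knob, as noted in Figure 1.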

FT Baseline + MAML DistilBERT. In addition to training the MAML model from scratch, we leveraged the FT DistilBERT (Baseline) model and trained the MAML models from the FT checkpoint.
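Only the initialization changes in this variant. A minimal sketch, assuming the fine-tuned baseline was saved with save_pretrained (the checkpoint path is a placeholder):

```python
from transformers import DistilBertForQuestionAnswering

# Start meta-training from the fine-tuned baseline instead of the raw
# pre-trained weights; "ft_baseline_checkpoint/" is a placeholder path.
model = DistilBertForQuestionAnswering.from_pretrained("ft_baseline_checkpoint/")
# ...then run the same meta_train_step loop as in the previous sketch.
```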

Experiments

Unless otherwise specified, the effective batch size for all experiments was 16. To avoid GPU out-of-memory issues, data was loaded in micro-batches of size 1 or 4 and the losses were accumulated; the model was updated once per effective batch of 16.
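A minimal sketch of this gradient-accumulation scheme, reusing qa_loss from the earlier sketch; `micro_batches`, `model`, and `optimizer` are assumed to exist, and the names are illustrative:

```python
micro_batch_size = 4                          # or 1, depending on GPU memory
accumulation_steps = 16 // micro_batch_size   # effective batch size of 16

optimizer.zero_grad()
for step, batch in enumerate(micro_batches):
    loss = qa_loss(model, batch) / accumulation_steps  # scale so the sum matches one batch of 16
    loss.backward()                                    # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # update the model once per effective batch of 16
        optimizer.zero_grad()
```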

Experiment #1: MAML DistilBERT without FT Baseline

Table 1. Experiment 1: Model configuration.

Experiment #2: Training MAML after FT Baseline

Table 2. Experiment 2: Model configuration.

Analysis

Key-takeaway #1: MAML DistilBERT without FT Baseline couldn’t achieve the same level of model performance as the FT Baseline.

Figure 2. Experiment #1 models, sorted in descending order by EM (OOD eval).

Key-takeaway #2: Training MAML after the FT Baseline occasionally outperformed the FT Baseline. More configurations of learning rate and domain variability could be explored.

Figure 3. Experiment #2 models, sorted in descending order by EM (OOD eval).

Conclusions

MAML was worth exploring for achieving cross-domain model robustness, but it may not be the best framework when a large IND set and only a small OOD set are available. Training MAML after baseline pre-training and fine-tuning occasionally performed better than the FT baseline model, likely because of the additional OOD tasks the MAML model learned from.

A full copy of the paper can be found here.

The GitHub repository for this project can be found here.