This post is a summary of my final project for Stanford CS224N.
In a Question Answering (QA) system, a model learns to answer a question by properly understanding an associated passage. However, developing a QA system that performs robustly across all domains can be especially challenging, as we do not always have abundant data in every domain. Therefore, one area of focus in this field has been training a model to learn a new task with limited data (i.e., Few-Shot Learning, FSL).
Meta-learning for supervised learning, in particular, is known to perform well in FSL: the idea is to teach the model to find initial parameters that allow it to learn a new task after seeing only a few samples of the associated data.
SQuAD 2.0 dataset. Three in-domain (SQuAD, NewsQA, Natural Questions) and three out-of-domain (DuoRC, RACE, RelationExtraction) datasets. Each in-domain (IND) dataset contains 50K question-passage-answer samples, while each out-of-domain (OOD) dataset contains 127 samples.
Model-Agnostic Meta-Learning (MAML). MAML was originally proposed by Finn et al. (2017); it learns a set of initial parameters from which a model can adapt to a new task with only a few gradient steps on a small amount of data.
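To make the idea concrete, below is a minimal first-order MAML (FOMAML) sketch in PyTorch on a toy sine-regression problem. The toy model and `sample_task` function are hypothetical stand-ins for DistilBERT and the QA task sampler, and full MAML would additionally backpropagate through the inner-loop updates; this is an illustration, not the project's exact implementation.

```python
import copy
import torch
import torch.nn as nn

def sample_task(n=10):
    # Hypothetical task sampler; stands in for drawing a QA task from p(T).
    amp, phase = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    xs, xq = torch.randn(n, 1), torch.randn(n, 1)
    return (xs, amp * torch.sin(xs + phase)), (xq, amp * torch.sin(xq + phase))

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn, inner_lr, inner_steps, meta_batch = nn.MSELoss(), 1e-2, 5, 4

for meta_step in range(1000):
    meta_opt.zero_grad()
    for _ in range(meta_batch):
        (xs, ys), (xq, yq) = sample_task()
        learner = copy.deepcopy(model)                 # task-specific copy
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # inner-loop adaptation on D_i
            inner_opt.zero_grad()
            loss_fn(learner(xs), ys).backward()
            inner_opt.step()
        inner_opt.zero_grad()
        loss_fn(learner(xq), yq).backward()            # query loss on D_i'
        # First-order approximation: copy the adapted model's gradients
        # back onto the meta-model's parameters.
        for p, lp in zip(model.parameters(), learner.parameters()):
            p.grad = lp.grad if p.grad is None else p.grad + lp.grad
    meta_opt.step()                                    # outer-loop meta-update
```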
Fine-tuned Baseline. Our baseline is a fine-tuned (FT) pre-trained transformer model, DistilBERT.
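As a rough sketch of the FT baseline setup with Hugging Face Transformers (the question, context, and span labels below are made up, and the real training loop, data pipeline, and hyperparameters are project-specific):

```python
import torch
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

question = "What does MAML learn?"
context = "MAML learns an initialization that adapts quickly to new tasks."
enc = tokenizer(question, context, return_tensors="pt", truncation=True)

# Supervision is the token span of the answer (indices here are illustrative).
out = model(**enc, start_positions=torch.tensor([5]), end_positions=torch.tensor([7]))
out.loss.backward()  # an optimizer.step() would follow in a real fine-tuning loop
```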
MAML DistilBERT. We adapted MAML to the QA setting.
We used the baseline DistilBERT architecture as the base model to be meta-trained.
We implemented a task-sampling method rather than pre-defining a K-shot task pool ($p(\mathcal{T})$), since the K-sample support ($\mathcal{D}_i$) and query ($\mathcal{D}_i'$) sets can come from IND or OOD training datasets in different experiments; a minimal sketch of such a sampler is shown below.
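Here the datasets are assumed to be lists of question-passage-answer examples keyed by name; the dataset names and the exact sampling scheme are hypothetical:

```python
import random

def sample_task(datasets, k_support=5, k_query=5, domains=None):
    """Draw one task: a K-shot support set D_i and a query set D_i'."""
    name = random.choice(domains or list(datasets))   # pick a source domain
    pool = random.sample(datasets[name], k_support + k_query)
    return pool[:k_support], pool[k_support:]         # (D_i, D_i')

# Depending on the experiment, `domains` can point at IND or OOD pools, e.g.:
# support, query = sample_task(all_datasets, domains=["duorc", "race"])
```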
We used the same loss function as the baseline, $\mathcal{L} = -\log p_{\text{start}}(i) - \log p_{\text{end}}(j)$, the negative log-likelihood of the true answer span's start position $i$ and end position $j$.
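Written out explicitly in PyTorch, this is cross-entropy over token positions for the true start $i$ and end $j$. This sketch assumes logits of shape `(batch, seq_len)`, as produced by `DistilBertForQuestionAnswering`; note that Hugging Face's built-in QA loss averages the two terms rather than summing them.

```python
import torch.nn.functional as F

def qa_span_loss(start_logits, end_logits, start_positions, end_positions):
    start_loss = F.cross_entropy(start_logits, start_positions)  # -log p_start(i)
    end_loss = F.cross_entropy(end_logits, end_positions)        # -log p_end(j)
    return start_loss + end_loss
```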
FT Baseline + MAML DistilBERT. In addition to training the MAML model from scratch, we leveraged the FT DistilBERT (Baseline) model and trained the MAML models from the FT checkpoint.
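In sketch form, assuming the baseline was saved with `save_pretrained` to a hypothetical local path:

```python
from transformers import DistilBertForQuestionAnswering

# Initialize meta-training from the fine-tuned baseline checkpoint
# ("checkpoints/ft_baseline" is a placeholder path).
model = DistilBertForQuestionAnswering.from_pretrained("checkpoints/ft_baseline")
# ...then run the MAML inner/outer loops starting from these weights.
```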
Unless otherwise specified, the effective batch size for all experiments was 16. To avoid GPU out-of-memory issues, data was loaded in micro-batches of size 1 or 4 and the loss was accumulated; the model was updated once per effective batch of 16, as sketched below.
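A sketch of the accumulation scheme, with a toy model and data standing in for DistilBERT and the QA loader:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                      # toy stand-in for DistilBERT
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(32)]  # micro-batches of 4
loss_fn = nn.MSELoss()
accum_steps = 4                              # 4 micro-batches x 4 samples = effective batch 16

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average correctly
    loss.backward()                            # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # single update at effective batch size 16
        optimizer.zero_grad()
```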
Experiment #1: MAML DistilBERT without FT Baseline
Experiment #2: Training MAML after FT Baseline
Key takeaway #1: MAML DistilBERT without the FT Baseline could not achieve the same level of performance as the FT Baseline.
Key takeaway #2: Training MAML after the FT Baseline occasionally outperformed the FT Baseline. Further experiment configurations, varying the learning rate and domain variability, could be explored.
MAML was worth exploring as a way to achieve cross-domain model robustness. However, MAML might not be the best framework when a large IND set and only a small OOD set are available. Training MAML after baseline pre-training and fine-tuning occasionally performed better than the FT baseline model, likely because of the additional OOD tasks the MAML model learned from.
A full copy of the paper can be found here.
The GitHub repository for this project can be found here.