# Tutorial #2: few-shot learning and meta-learning I

## Introduction

Humans can recognize new object classes from very few instances. However, most machine learning techniques require thousands of examples to achieve similar performance. The goal of *few-shot learning* is to classify new data having seen only a few training examples. In the extreme, there might only be a single example of each class (*one shot learning*). In practice, few-shot learning is useful when training examples are hard to find (e.g., cases of a rare disease), or where the cost of labelling data is high.

Few-shot learning is usually studied using *N-way-K-shot classification*. Here, we aim to discriminate between $N$ classes with $K$ examples of each. A typical problem size might be to discriminate between $N=10$ classes with only $K=5$ samples from each to train from. We cannot train a classifier using conventional methods here; any modern classification algorithm will depend on far more parameters than there are training examples, and will generalize poorly.

If the data is insufficient to constrain the problem, then one possible solution is to gain experience from other similar problems. To this end, most approaches characterize few-shot learning as a *meta-learning* problem.

## The meta learning framework

In the classical learning framework, we learn a how to classify from training data and evaluate the results using test data. In the meta-learning framework, we *learn how to learn* to classify given a set of *training tasks* and evaluate using a set of t*est tasks* (figure 1); In other words, we use one set of classification problems to help solve other unrelated sets.

Here, each task mimics the few-shot scenario, so for N-way-K-shot classification, each task includes $N$ classes with $K$ examples of each. These are known as the *support set* for the task and are used for learning how to solve this task. In addition, there are further examples of the same classes, known as a *query set*, which are used to evaluating the performance on this task. Each task can be completely non-overlapping; we may never see the classes from one task in any of the others. The idea is that the system repeatedly sees instances (tasks) during training that match the structure of the final few-shot task, but contain different classes.

At each step of meta-learning, we update the model parameters based on a randomly selected training task. The loss function is determined by the classification performance on the query set of this training task, based on knowledge gained from its support set. Since the network is presented with a different task at each time step, it must learn how to discriminate data classes in general, rather than a particular subset of classes.

To evaluate few-shot performance, we use a set of test tasks. Each contains only unseen classes that were not in any of the training tasks. For each, we measure performance on the query set based on knowledge of their support set.

## Approaches to meta-learning

Approaches to meta-learning are diverse and there is no consensus on the best approach. However, there are three distinct families, each of which exploits a different type of prior knowledge:

**Prior knowledge about similarity: **We learn embeddings in training tasks that tend to separate different classes even when they are unseen.

**Prior knowledge about learning:** We use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

**Prior knowledge of data:** We exploit prior knowledge about the structure and variability of the data and this allows us to learn viable models from few examples.

An overview these methods can be seen in figure 2. In this review, we will consider each family of methods in turn.

## Prior knowledge of similarity

This family of algorithms aims to learn compact representations (embeddings) in which the data vector is mostly unaffected by intra-class variations but retains information about class membership. Early work focused on pairwise comparators which aim to judge whether two data examples are from the same or different classes, even though the system may not have seen these classes before. Subsequent research focused on multi-class comparators which allow assignment of new examples to one of several classes.

## Pairwise comparators

Pairwise comparators take two examples and classify them as either belonging to the same or different classes. This differs from the standard N-way-K-shot configuration and does not obviously map onto the above description of meta-learning although as we will see later there is in fact a close relationship.

**Siamese networks**

Koch *et al.* (2015) trained a model that outputs the probability $Pr(y_a=y_{b})$ that two data examples $\mathbf{x}_{a}$ and $\mathbf{x}_{b}$ belong to the same class (figure 3a). The two examples are passed through identical multi-layer neural networks (hence Siamese) to create two embeddings. The component-wise absolute distance between the embeddings is computed and passed to a subsequent comparison network that reduces this distance vector to a single number. This is passed though a sigmoidal output for classification as being the same or different with a cross-entropy loss.

During training, each pair of examples are randomly drawn from a super-set of training classes. Hence, the system learns to discriminate between classes is general, rather than two classes in particular. In testing, completely different classes are used. Although this does not have the formal structure of the N-way-K-shot task, the spirit is similar.

#### Triplet networks

Triplet networks (Hoffer & Ailon 2015) consist of three identical networks that are trained by triplets $\{\mathbf{x}_{+},\mathbf{x}_{a},\mathbf{x}_{-}\}$ of the form (positive, anchor, negative). The positive and anchor samples are from the same class, whereas the negative sample is from a different class. The learning criterion is *triplet loss* which encourages the anchor to be closer to the positive example than it is to the negative example in the embedding space (figure 3b). Hence it is based on two pairwise comparisons.

After training, the system can take two examples and establish whether they are from the same or different classes, by thresholding the distance in the learned embedding space. This was employed in the context of face verification by Schroff *et al.* (2015). This line of work is part of a greater literature on learning distance metrics (see Suarez *et al.* 2018 for overview).

## Multi-class comparators

Pairwise comparators can be adapted to the N-way-K-shot setting by assigning the class for an example in the query set based on its maximum similarity to one of the examples in the support set. However, multi-class comparators attempt to do the same thing in a more principled way; here the representation and final classification are learned in an end-to-end fashion.

In this section, we'll use the notation $\mathbf{x}_{nk}$ to denote the $k$th support example from the $n$th class in the N-Way-K-Shot classification task, and $y_{nk}$ to denote the corresponding label. For simplicity, we'll assume there is a single query example $\hat{\mathbf{x}}$ and the goal is to predict the associated label $\hat{y}$.

#### Matching Networks

Matching networks (Vinyals *et al.* 2016) predict the one-hot encoded query-set label $\hat{\mathbf{y}}$ as a weighted sum of all of the one-hot encoded support-set labels $\{\mathbf{y}_{nk}\}_{n,k=1}^{NK}$. The weight is based on a computed similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$ between the query-set data $\hat{\mathbf{x}}$ and each training example $\{\mathbf{x}_{nk}\}_{n,k=1}^{N,K}$.

\begin{equation}

\hat{\mathbf{y}} = \sum_{n=1}^{N}\sum_{k=1}^{K} a[\mathbf{x}_{nk},\hat{\mathbf{x}}]\mathbf{y}_{nk} \tag{1.1}

\end{equation}

where the similarities have been constrained to be positive and sum to one.

To compute the similarity $a[\hat{\mathbf{x}},\mathbf{x}_{nk}]$, they pass each support example $\mathbf{x}_{nk}$ through a network $\mbox{ f}[\bullet]$ to produce an embedding and pass the query example $\hat{\mathbf{x}}$ through a different network $\mbox{ g}[\bullet]$ to produce a different embedding. They then compute the cosine similarity between these embeddings (figure 5a)

\begin{equation}

d[\mathbf{x}_{nk}, \hat{\mathbf{x}}] = \frac{\mbox{ f}[\mathbf{x}_{nk}]^{T}\mbox{ g}[\hat{\mathbf{x}}]} {||\mbox{ f}[\mathbf{x}_{nk}]||\cdot||\mbox{ g}[\hat{\mathbf{x}}]||}, \tag{1.2}

\end{equation}

and normalise using a softmax function:

\begin{equation}

a[\hat{\mathbf{x}}_{nk},\mathbf{x}] = \frac{\exp[d[\mathbf{x}_{nk},\hat{\mathbf{x}}]]}{\sum_{n=1}^{N}\sum_{k=1}^{K}\exp[d[\mathbf{x}_{nk},\hat{\mathbf{x}}]]}. \tag{1.3}

\end{equation}

to produce positive similarities that sum to one. This system can be trained end to end for the N-way-K-shot learning task.^{1 }At each learning iteration, the system is presented with a training task; the predicted labels are computed for the query set (the calculation is based on the support set) and the loss function is the cross entropy of the ground truth and predicted labels.

Matching networks compute similarities between the embeddings of each support example and the query example. This has the disadvantage that the algorithm is not robust to data imbalance; if there are more support examples for some classes than others (i.e., we have departed from the N-way-K-shot scenario), the ones with more frequent training data may dominate.

#### Prototypical Networks

Prototypical networks (Snell et al. 2017) are robust to data imbalance by construction; they average the embeddings $\{\mathbf{z}_{nk}\}_{k=1}^{K}$ of the examples for class $n$ to compute their mean embedding or *prototype* $\mathbf{p}_{n}$. They then use the similarity between each prototype and the query embedding (figures 4 and 5 b) as a basis for classification.

The similarity is computed as a negative multiple of the Euclidean distance (so that larger distances now give smaller numbers). They pass these similarities to a softmax function to give a probability over classes. This model effectively learns a metric space where the average of a few examples of a class is a good representation of that class and class membership can be assigned based on distance.

They noted that (i) the choice of distance function is vital as squared Euclidean distance outperformed cosine distance, (ii) having a higher number of classes in the support set helps to achieve better performance, and that (iii) the system works best when the support size of each class is matched in the training and test tasks.

Ren et al. (2018) extended this system to take advantage of additional unlabeled data which might be from the test task classes or from other distractor classes. Oreshkin et al. (2018) extended this approach by learning a task-dependent metric on the feature space, so that the distance metric changes from place to place in the embedding space.

#### Relation Networks

Matching networks and prototypical networks both focus on learning the embedding and compare examples using a pre-defined metric (cosine and Euclidean distance, respectively). Relation networks (Santoro et al. 2016) also learn a metric for comparison of the embeddings (figure 5c). Similarly to prototypical networks, the relation network averages the embeddings of each class in the support set together to form a single prototype. Each prototype is then concatenated with the query embedding and passed to a *relation module*. This is a learnable non-linear operator that produces a similarity score between 0 and 1 where 1 indicates that the query example belongs to this class prototype. This approach is clean and elegant and can be trained end-to-end.

## Comparison between models

All of the pairwise and multi-class comparators are closely related to one another. Each learns an embedding space for data examples. In matching networks, there are different embeddings for support and query examples, but in the other models, they are the same. For prototypical networks and relation networks, multiple embeddings from the same class are averaged to form prototypes. Distances between support set embeddings/prototypes and query set embeddings are computed using either pre-determined distance functions such as Euclidean or cosine distance (triplet networks, matching networks, prototypical networks) or by learning a distance metric (Siamese networks and relation networks).

The multi-class networks have the advantage that they can be trained end-to-end for the N-way-K-shot classification task. This is not true for the pairwise comparators which are trained to produce a similarity or distance between pairs of data examples (which could itself subsequently be used to support multi-class classification).

Although it is not obvious how the pairwise comparators map to the meta-learning framework, it is possible to consider their data as consisting of minimal training and test tasks. For Siamese networks, each pair of examples is a training task, consisting of one support example and one query example, where their classes may not necessarily match. For triplet networks, there are two support examples (from different classes) and one query example (from one of the classes).

## Conclusion

In part I of this tutorial we have described the few-shot and meta-learning problems and introduced a taxonomy of methods. We have also discussed methods that use a series of training tasks to learn prior knowledge about the similarity and dissimilarity of classes that can be exploited for future few-shot tasks. This knowledge takes the form of data embeddings that reduce within-class variance relative to between-class variance, and hence make it easier to learn from just a few data points.

In part II of this tutorial, we'll discuss methods that incorporate prior knowledge about how to learn models, and that incorporate prior knowledge about the data itself.

^{1}Vinyals et al. (2016). also introduced a novel *context embedding* method which took the full context of the support set $\mathcal{S}$ into account so that $\mbox{ g}[\bullet] = \mbox{ g}[\mathbf{x}, \mathcal{S}]$. Here, the support set was considered as a sequence and encoded by a bi-directional LSTM. Snell et al. (2017) later argued that this context embedding was problematic and redundant.