# This Looks Like That: Deep Learning for Interpretable Image Recognition

## Contents

## Presented by

Nouha Chatti

## Introduction

The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in a humanly understandable way when dealing with classification tasks. The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists in dissecting parts of the input images and comparing them to prototypical parts of training images of a given class: Thus the expression "this looks like that". In fact, this solution adds a transparency advantage to deep neural networks and allows the user to understand the actual process of decision making. It can intervene in many crucial problems that require understanding the actions that led to a particular output of the model. There are many fields that already rely on this case-based reasoning especially in the medical domain where diagnosis using X-ray scans is based on comparing these latter to other prototypical scans. [1]

## Previous Work

Interpretability in Deep neural networks has been a long-sought goal and seems to attract more and more attention recently. The opacity present in neural networks that leaves the user unaware of the exact process of how the model makes predictions has inspired many studies where their ultimate goal was to reach a certain transparency. As a matter of fact, there already exists posthoc interpretability methods that analyze the performance of a trained CNN.[2] Although this type of analysis do not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determine parts of the input they are looking at but without associating them to prototypes.[3]

## Network Architecture

The figure below represents ProtoPNet architecture. The first layers of ProtoPNet consist of commonly used convolutional layers [math]f[/math], whose parameters are denoted [math]w_{conv}[/math]. The layers used in this study are from the following known models **VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161** previously pre-trained on ImageNet which are followed by two additional 1 × 1 convolutional layers. A layer called prototype [math]g_p[/math], a fully connected layer h with weight [math]w_h[/math] and no bias that returns the output prediction using a softmax function unlike all the rest of the layers that use ReLU as the activation function. This network takes in an image [math]x[/math] which is propagated through the convolutional layers ([math]f[/math] of shape [math]H x W x D[/math]) where features are extracted and learn the porotypes P of shape ([math]1 x 1 x D[/math]). The number of prototypes [math]m_k[/math] is pre-defined for each class [math]k[/math] (10 per class in this study)
Each will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output [math]z = f(x)[/math], the j-th prototype unit [math]g_{p_{j}}[/math] in the prototype layer [math]g_p[/math] computes the squared L2 distances between the j-th prototype [math]p_j[/math] and all patches of [math]z[/math] that have the same shape as [math]p_j[/math] and returns the similarity scores.
These score values indicate the presence of the prototypical part in the image all while preserving the spatial relation of [math]z[/math]. It is possible to upsample it to the original size in order to obtain a heatmap with different parts that are most similar to the compared prototypes. The scores given by each unit are produced using max-pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, multiplied by the weight matrix [math]w_h[/math] in [math]h[/math] to produce the output logits as shown in Figure 1.

## Training Algorithm

The training of the Network is divide into 3 stages: Starting with stochastic gradient descent (SGD) of the layers (other than the last one) than the projection of the prototype and finally the convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k: [math] P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} [/math] During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p. In the last training stage, the convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other words it makes the model ignore the reasoning process of decision making of this kind: an image belongs to a given class because it does not have prototypes from another class. The optimization problem that they try to solve is:

## Datasets

The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two example of the classification task process of images from both datasets and the process of decision making.

**Examples of reasoning process:**
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.

## Results

The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below.

Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process.

## Conclusion

The aim of constructing the ProtopNet network was introduce the interpretability property to neural networks. As a matter of fact, the use of ProtopNet makes the process of classifying images clearer. It is able to dissect images to find prototypical parts. And the predictions of classes of an image are made based on a comparison of parts of this image and learned prototypes of each classes. One of the greatest advantage of this ProtopNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors for example. However one constraint of this Network can be the pre-determined number of prototypes as it is domain related and has to be set beforehand.

## Critique

I think that this is a really interesting approach to provide insights as to why a neural network made a certain prediction. Intuitively, based on the architecture, it seems that each Convolutional layer learns a certain "aspect" of the image (ie. wheel of a car, the beak of the bird, etc). It would be interesting to see how much further one can take this idea, especially in classifying images of things that appear very similar to the human eye (i.e. various insects).

## Source code

The code for this paper is available at https://github.com/cfchen-duke/ProtoPNet

## Refrences

[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep learning for interpretable image recognition, arXiv preprint, arXiv:1806.10574, 2018.

[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The Knowledge Engineering Review, 20:289–292, 09 2005.

[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning Representations (ICLR Workshop), 2014

[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929. IEEE, 2016