Class project proposal
The transformative impact of vision transformer (ViT) architectures on deep learning has been profound, with their applications swiftly extending to computer vision tasks, where they compete with traditional architectures such as convolutional neural networks (CNNs). Despite their success, the intricacies of how architectural variations within ViTs influence performance under different data conditions remain largely uncharted. Unraveling these subtleties is the focus of this project.
While much research has been devoted to finding the best choice of data augmentation or the best structural change in the model to increase performance, our project empirically compares two kinds of methods: data augmentation and explicitly induced bias through attention masking.
For explicit inductive bias, we use a general vision transformer architecture that allows us to change the number of attention heads and the layers where the mask is applied; this mask is what explicitly induces a bias in the model by forcing some layers to learn relationships only between nearby patches of the data.
Our goal with this comparison, and what differentiates it from previous work, is to determine to what extent one method can be better than the other by actually compensating for the lack of information when training a vision transformer.
Due to computational and time limitations, we would train our model on a simple image classification task based on CINIC-10. We also use a tiny model so we can iterate many times through different scenarios of inductive bias. The selection of methods also reflects these limitations, but it is a good starting point, since many projects that lack training data are likely in exploratory phases where lightweight tools such as Google Colab are used.
The results of this project contribute in two ways. First, they give a glance at how beneficial the proposed level of inductive bias can be for model performance; second, they contrast which method, and up to which point, performs better given different amounts of initial training data.
Data augmentation consists of applying certain transformations to the data in order to create new examples with the same semantic meaning as the original data. For images, these are typically spatial transformations such as cropping, zooming, or flipping. Although data augmentation is very popular among practitioners, previous works like
Data augmentation decisions can be difficult because of the many options available, but the technique is so popular that some researchers are trying to make it easier to use and more computationally efficient, one example being TrivialAugment
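As a concrete illustration, the following is a minimal sketch of what such an augmentation pipeline could look like using torchvision's TrivialAugmentWide implementation. The dataset path is a placeholder, and the normalization constants are the published CINIC-10 channel statistics; the rest is our own wiring, not a fixed part of the proposal.

```python
# Minimal sketch of an augmentation pipeline, assuming torchvision >= 0.12 and
# CINIC-10 stored in an ImageFolder layout (the path below is a placeholder).
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

train_transform = T.Compose([
    T.TrivialAugmentWide(),                       # one random transform per image, no tuning
    T.ToTensor(),
    T.Normalize(mean=(0.4789, 0.4723, 0.4305),    # published CINIC-10 channel statistics
                std=(0.2421, 0.2383, 0.2587)),
])

train_set = ImageFolder("cinic-10/train", transform=train_transform)
```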
To compensate for the lack of training data for vision transformers, an interesting approach from
Other authors in
Although these changes demonstrate that it is possible to get better performance with little data and without augmentation, it is not clear how to adjust the induced bias to identify up to which point it works. The use of pre-trained models is also not desirable here, given our premise that this experiment could inform decisions on new datasets and tasks.
The model proposed in
Following this path,
These works show that some layers of the transformer converge to convolutional behavior given the nature of the training data, but this requires a relatively large amount of data that may not be available. It is also notable that the inductive bias appears in the first layers of the model.
The model proposed in
The power of this masking method is also shown in
To explore and compare the benefits of data augmentation versus induced bias, we run three related experiments. All experiments would be run on CINIC-10
The goal of the first experiment is to get a glance of the overall differences in accuracy between the compared methods. The model used for this experiment is a basic vision transformer with six layers and linear positional embeddings. Each layer is a multi-head attention layer with only two heads. The schematic of the model can be seen in figure 1.
Figure 1
By default, the attention heads in the model are fully connected, giving them a global behavior, but the model can be configured to apply a local pattern mask or a sparse pattern mask to all heads in all layers.
Figure 2
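To make the masking concrete, below is a rough sketch of how the local and sparse patterns could be constructed over the patch grid. The 4x4 patch size (an 8x8 grid of 64 tokens for 32x32 CINIC-10 images), the window radius, and the stride are our own assumptions for illustration, not values fixed by the architecture above.

```python
# Sketch of the local and sparse attention masks over the patch grid. A True entry
# means attention is allowed between the corresponding pair of patches.
import torch

def local_mask(grid=8, radius=1):
    """Each patch may only attend to patches within `radius` in the 2D grid."""
    coords = torch.stack(torch.meshgrid(torch.arange(grid), torch.arange(grid),
                                        indexing="ij"), dim=-1).reshape(-1, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)  # Chebyshev distance
    return dist <= radius                                            # (grid*grid, grid*grid)

def sparse_mask(grid=8, stride=2):
    """Local neighbourhood plus a strided subset of patches visible to everyone."""
    mask = local_mask(grid, radius=1)
    strided = torch.arange(grid * grid) % stride == 0
    mask[:, strided] = True
    mask[strided, :] = True
    return mask

# Inside an attention head the mask is applied before the softmax, e.g.
#   scores = scores.masked_fill(~mask, float("-inf"))
```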
The model would be trained under different initial-data scenarios, specifically with 1000, 2000, 5000, 12500, and 20000 samples. In each scenario, we would train four different models: a baseline with global attention, one with local attention, one with sparse attention, and the baseline trained with data augmentation.
The data augmentation technique would be TrivialAugment and the metric would be accuracy on the validation dataset. We define these four models so as not to mix data augmentation with changes in the induced bias, keeping the transformer's default global attention as our baseline.
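The sketch below shows how we could wire this first experiment together, reusing the same random subset of CINIC-10 for all four variants in a scenario. The variant names and the shared-subset idea are ours; the model constructor itself (the six-layer, two-head ViT described above) is omitted.

```python
# Sketch of the first experiment's setup: fixed data scenarios, one shared subset
# per scenario, and four model variants differing only in mask and augmentation.
import torch
from torch.utils.data import Subset

scenarios = [1000, 2000, 5000, 12500, 20000]

def make_subset(dataset, n, seed=0):
    g = torch.Generator().manual_seed(seed)              # same samples for every variant
    idx = torch.randperm(len(dataset), generator=g)[:n]
    return Subset(dataset, idx.tolist())

variants = {
    "baseline_global":  dict(mask=None,     augment=False),
    "local":            dict(mask="local",  augment=False),
    "sparse":           dict(mask="sparse", augment=False),
    "global_augmented": dict(mask=None,     augment=True),   # TrivialAugment applied
}
```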
Having experimented with the case where all layers have the same mask, we now set up experiments that vary the level of induced bias applied to the model. The goal is to identify a relation between the level of induced bias applied to the model and its performance. For this experiment we modify our first model in the following ways:
Figure 3
With this new model, we can create one instance for each combination of global/local heads in any of the first four layers, generating a notion of "level of induced bias" based on the number and placement of attention heads treated as local.
Given computational limitations, we would use only two initial-data scenarios (10000 and 50000 samples) and train 16 models for each scenario:
Table 1
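A sketch of how the 16 induced-bias levels could be enumerated is shown below. It assumes the modified model has four heads per layer and that masks are restricted to the first four layers, which is our reading of the modifications above.

```python
# Sketch enumerating the 16 induced-bias configurations: h local heads placed in
# each of the first l layers, with all remaining heads and layers kept global.
from itertools import product

configs = [{"local_heads": h, "masked_layers": l}
           for h, l in product(range(1, 5), range(1, 5))]
assert len(configs) == 16
```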
We would analyze the differences in accuracy between levels of induced bias within the same initial-data scenario, and see whether we can select the best performing inductive bias levels to apply more broadly in the third experiment.
With this comparison we also want to capture the visual differences between the attention heads at the different levels of induced bias, to try to explain which configurations do better or worse than the baseline.
Our final experiment consists of comparing the accuracy and the effective additional data (EAD) that each method brings when applied to different initial amounts of data. The initial-data scenarios would be 1000, 5000, 10000, 20000, and 50000 samples. The comparison would be made between the data augmentation model for each scenario and the top three levels of induced bias from experiment 2.
The effective additional data (EAD) represents the extra amount of real data that the method compensates for; the higher it is, the more successful the method at solving lack-of-data problems. The metric is calculated by finding which initial-data scenario would make the baseline model perform the same as the method being analyzed.
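A minimal sketch of how EAD could be computed is given below, using piecewise-linear interpolation of the baseline's accuracy-versus-samples curve; the function name and interface are ours, not part of the metric's definition.

```python
# Sketch of the EAD computation: find how many real samples the baseline would
# need to match the method's accuracy, then subtract the samples the method used.
import numpy as np

def effective_additional_data(method_acc, n_train, baseline_n, baseline_acc):
    order = np.argsort(baseline_acc)                      # np.interp needs increasing x
    equivalent_n = np.interp(method_acc,
                             np.asarray(baseline_acc)[order],
                             np.asarray(baseline_n)[order])
    return float(equivalent_n) - n_train

# Made-up example: a method trained on 5000 samples that matches the baseline's
# accuracy at roughly 9000 samples has an EAD of about 4000.
```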
In our initial experiment, we compared performance across four model variations. Our baseline model uses a global attention mechanism, one uses a local attention mechanism, another uses a sparse attention mechanism, and the last uses the same global attention mechanism as the baseline except that data augmentation is applied during training. One notable callout for this initial experiment is that we took a naïve approach and placed the local and sparse attention heads in all six attention layers of the transformer. We trained each variation and collected the validation accuracy and training time for different numbers of base training samples, from 1000 to 20000. Below are the results.
Figure 4
Figure 5
There are a few notable observations to point out from the results. First, the two models using the local or sparse attention mechanism performed significantly worse than our baseline model with global attention. Although we expected this, since CINIC-10's classification task intuitively requires a global context of the image, we did not foresee the performance difference being so drastic. For example, with 5000 base training samples, the baseline model achieves a validation accuracy of 62.5% while the local attention model achieves just 13.97% and the sparse attention model 42.64%. We observe a similar pattern across the other levels of base samples. It is also worth calling out that the sparse attention models perform better than the local attention models. This makes sense, as sparse attention still takes the global context into consideration, just not across all patches. Nevertheless, the sparse attention model takes almost the same amount of time to train as the baseline model, so it does not make sense to use it in lieu of the baseline in practice. On the flip side, we verify that data augmentation improves performance, and the improvement is most significant when the number of base samples is small.
Our first experiment showed that simply setting all attention layers to contain only local or sparse attention heads does not produce good performance. As we were exploring additional datasets or tasks where applying a different attention mechanism might yield better performance, we came across the paper in
Figure 6
Here, the two matrices show the validation accuracies obtained when training the different local attention models on 10k and 50k base training samples. As a quick recap, 1 Local Head and 1 Layer means we use one local attention head in the first layer of the transformer. The color gradient in each matrix ranks the combinations from best (red) to worst (green).
A few patterns can be noticed. First, for both matrices, models in the bottom right corner, representing a high number of local heads across more layers, perform worse than the rest. This aligns with our intuition from the first experiment: having more local attention heads in deeper portions of the network prevents the models from capturing global context, resulting in worse performance.
Figure 7
Diving further, in figure 7 we visualize the attention weights to better compare different levels of induced bias. It seems that performance increases as we add more local heads, but the effect eventually fades and the model stops capturing the important characteristics of the data. In the 50k-samples scenario, it can be seen that with more local heads the attention spots converge to small parts of the image that contain no information about the object.
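For reference, the heatmaps in figure 7 can be produced with something like the sketch below. It assumes the model exposes per-layer attention weights of shape (batch, heads, tokens, tokens) and that token 0 is a class token, both of which are assumptions about our implementation rather than facts stated above.

```python
# Sketch of an attention-heatmap visualisation: average over heads, take the class
# token's attention to the image patches, and upsample it to the input resolution.
import torch
import torch.nn.functional as F

def cls_attention_map(attn, grid=8, image_size=32):
    attn = attn.mean(dim=1)                  # (batch, tokens, tokens), averaged over heads
    cls_to_patches = attn[:, 0, 1:]          # class token's attention to the patches
    maps = cls_to_patches.reshape(-1, 1, grid, grid)
    return F.interpolate(maps, size=(image_size, image_size),
                         mode="bilinear", align_corners=False).squeeze(1)
```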
Figure 8
Moreover, figure 8 shows that when local heads are used, the model correctly identifies smaller details of the image. In this case, with all heads being global, it is hard to identify the three different cows in the middle image, but when some local heads are used, we can capture them.
In summary, the major result of this experiment is that some models in the 10k samples sub-experiment produced better results than the base model. This is promising and validates our hypothesis from before. Though no combination produced better results in the 50k samples sub-experiment, we showed in Figure 8 that having local attentions can still be beneficial as it is able to capture some details that the baseline model misses.
From the second experiment, we were intrigued to see how some of the better performing models do under numbers of base samples other than just 10k and 50k. We therefore picked three combinations (2 local heads for 2 layers, 1 local head for 2 layers, 3 local heads for 1 layer) and tested their performance against the baseline model and the baseline + data augmentation for different numbers of base training samples from 1000 to 50k. Below are the results and analysis.
Figure 9
Figure 10
Here, we can observe two very interesting trends. First, it validates our hypothesis that using local attention heads early in the layers of the vision transformer can improve performance despite the fact that the task intuitively requires global context. This holds for all three variations of the local attention models when the number of base training samples is 1000, 5000, or 10000. However, the effect tapers off when the number of base samples is sufficiently large, and the baseline model performs better. This suggests that the benefit of the inductive bias coming from the local heads no longer outweighs the lack of information in the dataset. In other words, once there is sufficient data, the baseline model has enough information to learn a better representation on its own than the modified models do.
Figure 11
Another, perhaps more explicit and comparable, way of explaining the phenomenon is to look at the effective additional data (EAD) score. Essentially, it tells us how much extra (or less) training data the change in model architecture is worth, measured as the amount of data the baseline model would need to achieve the same accuracy. The graph clearly illustrates that data augmentation and tuning of local attention heads are very effective when the training datasets are relatively small, below 15000 samples. This is likely because the inductive bias of the local attention heads lets the models capture important characteristics of the image more efficiently and effectively than the baseline model does. However, once the number of base training samples exceeds 20000, the effect reverses and all methods perform worse than the baseline model, as illustrated by the negative effective additional data.
Note: we did not plot the effective additional data for the data augmentation scenario past 10000 base training samples because its performance dropped significantly and behaved erratically.
Across our experiments, both data augmentation and bias induced by discrete attention masking can compensate for the lack of data in a given problem, but this compensation is only noticeable when the amount of initial data is very low.
The maximum effective additional data that the data augmentation method creates is higher than that of the induced bias method, but there is a sweet spot where induced bias is better than both data augmentation and the baseline model.
Once the initial amount of data starts to increase, data augmentation is the first method to actually worsen the performance of the model. Induced bias, on the other hand, looks more stable as the initial data increases, but it is still not significantly better than the baseline model.
We have shown that induced bias can help identify local attributes of the image more easily than the baseline alone, but this is only leveraged when the task we want to solve is more specific; the benefit cannot be appreciated in a general task like image classification.
Given the restricted resources and amount of time available to execute this project, there is ample room for continued research on this topic: