June 10, 2020

Robustness and Repeatability of modern Deep Neural Networks: a review

Modern neural networks are sensitive to slight changes in their inputs. We review the literature on diagnosing the problem and on how to fix it.



Scortex automates visual inspection in the heart of factories using computer vision and deep learning techniques.

As a result, Scortex needs to constantly assess its system’s performance, using offline (validation and test sets) and online (live, at the factory) tests.

One of these tests is the Repeatability and Reproducibility test. Originally used in the metrology field, it can be performed to evaluate human performance at the visual inspection task. Repeatability measures whether, under the same conditions, the same operator gives the same quality assessment. That is, we are asking the following question: “If I give the same part to the same operator in one hour, will the answer be the same?” On the other hand, reproducibility measures consensus between different operators: if we give the same part to several people, will they give the same results?

At first sight, it could be assumed that a human is very repeatable. In fact, this is hardly the case. This can be partly because operators can get tired or distracted, but mostly because quality specifications can have some loopholes, leading to subjectivity in the quality decision. This is especially true for edge cases, which fall under what is usually called the “grey zone”. Typically, in “Visual inspection of products: a comparison of the methods used to evaluate surface anomalies”, the authors ask 5 experts to analyse 18 products, each containing one anomaly. They found that “only 22% of the products led to an agreement among the evaluations carried out by each expert independently”. For more information on these tests and human visual inspections, we refer to Nathalie Baudet’s thesis.

As manufacturers move toward automated production lines, we need to run the same assessment for automated systems. In Scortex’s case, that means the machine learning models. Are these repeatable? Obviously, the system cannot really get “tired” or “distracted”, and is not subjective (or is it? A recent study shows that dataset bias is a real concern). Moreover, given the exact same input (image/video), the network is expected to output the exact same results (plus or minus 1e-7 if you are using TensorFlow, because some CUDA GPU operations are non-deterministic).

However, Goodfellow et al. showed in 2016 that a slight perturbation of the input can lead to a drastic change in a network’s decision. This phenomenon has been widely studied under the umbrella of adversarial examples, where one crafts an attack to fool a neural network. Fortunately, this kind of attack is not the primary concern for models deployed on the production line. Rather, perturbations like small translations, rotations, changes in lighting, camera noise or dead pixels are very likely to occur and impact the network’s decision. Can these perturbations lead to a lack of repeatability and reproducibility?

This blog post focuses on repeatability of deep learning models against “natural perturbations” (perturbations that can happen in the factory). This review of literature is divided into five parts. First, we examine works showing and assessing these issues in deep neural networks. Then, we review papers proposing potential fixes which modify the architecture of networks in order to make them more robust. Finally, we have a look at how data augmentation strategies can be part of the solution, and outline the possibility of using self-supervised regularization.

Assessment of the problem


“Making Convolutional Networks Shift-Invariant again” is a great article and a good entry point into the repeatability field. The authors show that modern neural networks are not translation invariant. Typically, if we look at the probabilities output by a classifier while translating an image pixel by pixel, we can see that the network’s results change significantly as the image is translated.

Their chart depicted below shows that the output probability can vary by 100%! In practice, this would mean the following: if you take the exact same image with a defect and translate it by a few pixels, it is possible to obtain totally opposite results. A system based on such a network would not be very repeatable.

Note that this goes against the general intuition that convolutional neural networks are translation invariant. One may wonder why this intuition does not hold, since convolutions are shift-equivariant by construction (shifting the input shifts the output in the same way). In fact, this is because of the downsampling operations(!), against which the authors propose an anti-aliased network that we detail in the “Architecture related fixes” section.

A more recent paper, “Why do deep convolutional networks generalize so poorly to small image transformations?”, makes a similar assessment. The authors show that translation is not the only perturbation that can result in drastic changes of the output. For example, a slight zoom of one pixel can strongly perturb the network’s output while going unnoticed by the human eye. Moreover, CNNs do not seem consistent over time when applied to videos. Indeed, a network that classifies successive frames of a video independently generates output probabilities that change a lot over time (cf. the otter below).

To assess the vulnerability of networks, the authors come up with four perturbations (or tasks) leading to images that look exactly the same to the human eye:

  1. “Translation-Cropping”: choose a central crop and compare with the same crop shifted by 1 pixel
  2. “Translation-Embedding-Black”: embed the image in a black background and shift the position by 1 pixel
  3. “Translation-Embedding-Inpainting”: same as 2., but with a different padding strategy
  4. “Scale-Embedding-Black”: same as 2., except the second image is not shifted but bigger (resized) by 1 pixel (from 100 to 101).

The four cases are depicted in the chart below. The authors look at two metrics:

  • The probability that the top-1 predicted class changes
  • The average of probability change for all classes
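As a rough illustration, both metrics can be computed from the class probabilities predicted for the original and the perturbed images. This is a minimal sketch in plain Python: the function name and the toy numbers are made up, and the exact averaging used in the paper may differ from the one assumed here.

```python
def stability_metrics(probs_a, probs_b):
    """Compare classifier outputs for original and perturbed images.

    probs_a, probs_b: lists of per-image class-probability lists, where
    probs_b[i] is the prediction for the perturbed version of image i.
    Returns (fraction of images whose top-1 class changes,
             mean absolute probability change over all classes).
    """
    top1_changes = 0
    abs_change_sum = 0.0
    n_values = 0
    for pa, pb in zip(probs_a, probs_b):
        top1_a = max(range(len(pa)), key=pa.__getitem__)
        top1_b = max(range(len(pb)), key=pb.__getitem__)
        top1_changes += top1_a != top1_b
        abs_change_sum += sum(abs(a - b) for a, b in zip(pa, pb))
        n_values += len(pa)
    return top1_changes / len(probs_a), abs_change_sum / n_values


# Toy example: 2 images, 3 classes (made-up numbers).
original = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
shifted = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
p_top1_change, mean_abs_change = stability_metrics(original, shifted)
```

In this toy case the top-1 class flips for the second image only, so the first metric is 0.5.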

The results are impressive:

For the first task (the most likely to happen in reality), translating by only one pixel changes the top-1 class for 5% to 15% of images, and the average probabilities shift by 2% to 5%. This is huge!

For the three other tasks, the probability gaps are even larger. Why? Probably because the images created for these tasks lie outside the distribution the network was trained on. It is known that out-of-distribution images are more prone to sudden changes in probabilities. As a matter of fact, this can be used as a way to detect them (see the ODIN OOD detector or GeoTrans for example).

In the article, the authors also assess whether possible fixes, such as anti-aliased CNNs and data augmentation, can help: they show that these are only partial solutions against 1-pixel shift “attacks”.

Another paper, called “Using Videos to Evaluate Image Model Robustness”, comes to the same conclusion regarding videos. The authors make the case that videos are a great way to evaluate the robustness of deep neural networks to perturbations. Indeed, consecutive frames have nearly the same content… and thus should be scored the same way by a classifier.

Using the YouTube-BoundingBoxes dataset, they are able to compare the robustness of different architectures, but also to correlate the effect of “video perturbations” with that of synthetic perturbations.

According to the correlations above, it should be possible to test robustness to video perturbations using synthetic distortions such as random brightness, hue, saturation or translations as proxies. Interestingly, this also gives a good list of data augmentations that could be used to make the model more robust to natural perturbations!

A similar assessment can also be found in another paper we do not detail here: “Do Image Classifiers Generalize Across Time?”

Finally, for the review to be complete, we have to talk about “The Elephant in the Room”. In this paper, the authors extract an object from an MS-COCO image (below, an example with an elephant), then paste it onto another image and translate it. They report several issues when running an object detection algorithm on these images:

  • The detection of the pasted object is not stable. Below, the elephant is mostly not detected, but sometimes is.
  • The pasted object can be detected but not classified properly (in the 6th image, the elephant is detected as a chair).
  • The pasted object has non-local effects and disturbs other detections (the probabilities of chair and couch fluctuate at the bottom left of the image, and the cup detection disappears in the 4th and 6th images).

The authors advance several reasons for these issues to happen:

  • First, the copy-paste operation creates occlusion, which will necessarily lead to changes
  • Second, copy-pasting creates Out Of Distribution (OOD) images, so inference on such images is unstable. This is interesting as it is in line with one of the papers mentioned above.
  • Deep networks are not shift invariant (as mentioned above).
  • Deep networks learn contexts from images, and therefore do have context bias. As a result, it is more likely for an object in a living room to be a chair than an elephant.
  • Non-Maximum Suppression (NMS) post-processing can lead to non-local effects in detection tasks.
  • “Feature interference”: the authors claim that using rectangular regions of interest (ROI) and max-pooling in detection networks forces the network to use features belonging to objects other than the one being detected. According to them, this is the main issue in modern detectors. Could some spatial attention mechanism be used to temper this effect? Maybe.


The takeaway from this section is that current deep neural networks are in general not stable under small corruptions like translations or noise. Several attempts have been made to address this issue, which we detail below. Before that, we describe the benchmark datasets and metrics widely used in the field.

Dataset and Metrics

ImageNet-C and ImageNet-P

The best introductory paper on constructing robustness datasets is probably “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”. In this paper, the authors modify the ImageNet dataset to measure robustness against common perturbations. They create two datasets:


ImageNet-C is the validation data from ImageNet with additional corruptions of 15 types, depicted below. Each corruption has 5 possible levels of severity. As a result, the dataset is 75 times the size of the ImageNet validation set.


While ImageNet-C measures the accuracy of networks when the image is highly corrupted, the goal of ImageNet-P is to see how consistent predictions are when images are slightly perturbed by an increasing amount. This allows drawing curves very similar to the ones showing probability as a function of pixel translation in the previous part. To this end, the authors use perturbation sequences of 30 images, where the first image is the original one.

In order to make research easier, the authors also propose “C” and “P” versions of CIFAR, Tiny ImageNet and ImageNet 64×64. Since their introduction, these datasets have been used several times in the field.

The datasets can be recreated or downloaded from the authors’ GitHub repository.



In the same paper, the authors first define the “Error” (E) for corruption c and severity s as the top-1 error on the ImageNet-C dataset restricted to this corruption and severity.

As corruptions may not be equal in terms of network perturbation strength (Impulse noise may not have as much effect as elastic transformations), they normalise each error by the error of a baseline network (they use AlexNet) and define the Corruption Error (CE) as the average of this normalised Error over the 5 severities.

Now that the metrics are comparable, they can be averaged across corruptions, giving the mean Corruption Error metric, or mCE. This metric is now used by several papers, as you will see in some ablation experiments below.
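The computation can be sketched in a few lines of plain Python. This is only an illustration: the function names and the error values are made up, and the sketch assumes that errors are summed over severities before dividing by the baseline (averaging per-severity ratios instead would be a slightly different variant).

```python
def corruption_error(model_err, baseline_err):
    """CE for one corruption type.

    model_err, baseline_err: top-1 error rates, one entry per severity level.
    Assumption: errors are summed over severities, then normalised by the baseline.
    """
    return sum(model_err) / sum(baseline_err)


def mean_corruption_error(model_errs, baseline_errs):
    """mCE: average CE over corruption types (dicts keyed by corruption name)."""
    ces = [corruption_error(model_errs[c], baseline_errs[c]) for c in model_errs]
    return sum(ces) / len(ces)


# Made-up error rates for two corruption types and five severities.
baseline = {
    "gaussian_noise": [0.4, 0.5, 0.6, 0.7, 0.8],
    "motion_blur": [0.5, 0.5, 0.5, 0.5, 0.5],
}
model = {
    "gaussian_noise": [0.2, 0.25, 0.3, 0.35, 0.4],
    "motion_blur": [0.5, 0.5, 0.5, 0.5, 0.5],
}
mce = mean_corruption_error(model, baseline)
```

With this normalisation, the baseline network has an mCE of 1 by construction, and lower is better.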


In a perturbation sequence, what we would ideally like from a perfectly robust model is for the inferred label to remain the same. The more the predictions fluctuate, the less robust the model is. As a result, the authors propose the Flip Probability (FP): the probability that two consecutive images of a sequence receive different predictions. As in the mCE case, they normalise the FP by that of AlexNet to get what they call the “Flip Rate”. Averaging this over all perturbations gives the mFR (mean Flip Rate).
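A minimal sketch of these quantities, reading the flip probability as the fraction of consecutive predictions that differ within a sequence. The function names and the toy predictions are hypothetical, and details such as which pairs of frames are compared are simplified here.

```python
def flip_probability(sequences):
    """Fraction of consecutive frame pairs whose predicted label differs.

    sequences: list of perturbation sequences, each a list of predicted labels
    (the first entry is the prediction on the clean image).
    """
    flips = 0
    pairs = 0
    for labels in sequences:
        for prev, curr in zip(labels, labels[1:]):
            flips += prev != curr
            pairs += 1
    return flips / pairs


def flip_rate(model_sequences, baseline_sequences):
    """Flip probability normalised by the baseline network's flip probability."""
    return flip_probability(model_sequences) / flip_probability(baseline_sequences)


# Made-up predictions on two short sequences for one perturbation type.
model_preds = [["cat", "cat", "cat", "dog"], ["car", "car", "car", "car"]]
baseline_preds = [["cat", "dog", "cat", "dog"], ["car", "car", "bus", "car"]]
fr = flip_rate(model_preds, baseline_preds)
```

The mFR would then be the average of `flip_rate` over all perturbation types.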

Architecture related fixes

Anti-aliasing neural networks

The paper “Making Convolutional Networks Shift-Invariant again” proposes a fix to remove the internal aliasing issue of deep neural networks.

First, we need to understand why this happens. A function is called:

  • “shift-invariant” if shifting its input does not change its output
  • “shift-equivariant” if shifting its input shifts its output in the same way

A convolutional layer can be seen as shift-equivariant (padding issues aside). As a result, a CNN designed as a stack of convolutional layers followed by a global average pooling operation should be shift-invariant. Why are modern CNNs sensitive to translation perturbations then? Because of the subsampling operations: max-pooling, strided convolutions, and probably dilated convolutions. To understand this phenomenon, consider the image below (taken from their paper).

Let’s look at the left chart. We have a one-dimensional signal in grey. If we apply a max-pooling operator with a kernel size of 2 and a stride of 2, we get the blue squares. If we first shift the signal by 1 position to the left, we get the red squares, which are quite different!
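The same effect can be reproduced on a toy signal. The sketch below is plain Python; the signal values and helper names are illustrative, and the shift is circular, which is an assumption made to keep the example short.

```python
def max_pool_1d(signal, kernel=2, stride=2):
    """Max-pooling over a 1D list, without padding."""
    return [max(signal[i:i + kernel]) for i in range(0, len(signal) - kernel + 1, stride)]


def shift_left(signal, n=1):
    """Circular shift to the left (circular padding is an assumption of this sketch)."""
    return signal[n:] + signal[:n]


signal = [0, 0, 1, 1, 0, 0, 1, 1]
pooled = max_pool_1d(signal)                      # [0, 1, 0, 1]
pooled_shifted = max_pool_1d(shift_left(signal))  # [1, 1, 1, 1]
```

A shift of a single position turns an alternating output into a constant one: the pooled signal is neither invariant nor equivariant to the shift.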

So how can we solve this? In signal processing (and computer vision), it is standard practice to apply a low-pass filter before downsampling a signal. This is a consequence of the Nyquist-Shannon theorem. Inspired by this, the authors propose to use a “blur layer” to smooth the downsampling effect. If you are familiar with OpenCV, this concept should ring a bell. Indeed, when you use cv2.pyrDown, images are resized by:

  • first convolving the image with a 5×5 Gaussian kernel
  • then removing every even row and column from the image

They propose an equivalent for the max-pooling operator:

  • first applying the max-pooling with kernel k (usually k=2), but using stride of 1 instead of stride s (usually s = 2)
  • then applying the blur kernel
  • then performing the downsampling operation (stride s)
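The three steps above can be sketched on a 1D signal as follows. This is a simplified illustration rather than the authors’ implementation: it is plain Python, uses circular padding, and the function name and default blur filter [1, 2, 1] are choices made for this example.

```python
def max_blur_pool_1d(signal, kernel=2, stride=2, blur=(1, 2, 1)):
    """Anti-aliased max-pooling on a 1D list, using circular padding (an assumption)."""
    n = len(signal)
    # 1) Max-pooling evaluated densely (stride 1).
    dense_max = [max(signal[(i + j) % n] for j in range(kernel)) for i in range(n)]
    # 2) Low-pass filter with a normalised blur kernel centred on each position.
    half = len(blur) // 2
    norm = sum(blur)
    blurred = [
        sum(w * dense_max[(i + j - half) % n] for j, w in enumerate(blur)) / norm
        for i in range(n)
    ]
    # 3) Subsample with the original stride.
    return blurred[::stride]


signal = [0, 0, 1, 1, 0, 0, 1, 1]
shifted = signal[1:] + signal[:1]         # circular shift by one position
out = max_blur_pool_1d(signal)            # [0.5, 1.0, 0.5, 1.0]
out_shifted = max_blur_pool_1d(shifted)   # [0.75, 0.75, 0.75, 0.75]
```

On this toy signal, plain max-pooling with stride 2 gives [0, 1, 0, 1] and [1, 1, 1, 1] for the two inputs. The blurred version still changes under the shift, but the largest gap drops from 1 to 0.5.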

The authors show that a similar operation can be conducted for different pooling strategies: max-pooling, convolution with strides and average-pooling.

The authors describe which kernels can be used. Each kernel is the outer product of a vector F with itself. “Rectangle-2” is a simple moving average with F = [1,1], “Triangle-3” is a 3×3 kernel with F = [1,2,1] and “Binomial-5” is the same as the OpenCV image pyramid kernel described above (F = [1,4,6,4,1]). This gives the following matrices (note that they can be normalised).
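As a small illustration, these kernels can be built in plain Python from the vector F. The function name is hypothetical, and the kernels are normalised here so that their weights sum to 1.

```python
def blur_kernel(f):
    """2D blur kernel: outer product of the 1D filter f with itself, normalised to sum to 1."""
    total = sum(f) ** 2  # the entries of the outer product sum to (sum of f) squared
    return [[a * b / total for b in f] for a in f]


rectangle_2 = blur_kernel([1, 1])          # 2x2 moving average, every weight is 1/4
triangle_3 = blur_kernel([1, 2, 1])        # 3x3, centre weight 4/16
binomial_5 = blur_kernel([1, 4, 6, 4, 1])  # 5x5, centre weight 36/256
```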

Now looking at the results, the authors show that, for a wide range of network architectures, using the blur pool layers always improves robustness to translation, but also accuracy! It has to be noted that the additional layers increase the inference time; for deep networks, however, the overhead should be relatively small, as only one new layer per downsampling operation (5 in most classification CNNs) is needed. Plus, this layer can be implemented as a depth-wise convolution in most deep learning frameworks (that is, each blurring is performed independently per feature map).

Now, does this anti-aliasing layer (BlurPool) fix the translation issue? Not totally, as the networks are still not shift-invariant. The authors of “Why do deep convolutional networks generalize so poorly to small image transformations?” benchmarked the anti-aliasing layer and showed that, though it has an effect, it only partially fixes the issue. Indeed, in the translation-cropping case, it seems to lower the probability change by only 10% to 20%.

What could be the next steps? Theoretically, limiting the number of downsampling operations would help. But this would come at a very large cost in terms of network size and inference time, as networks would run on full resolution feature maps and would need many layers to get a comparable receptive field. Another idea could be to improve on the blur kernel (for example using “Translation Insensitive CNNs”?).

Other architectural changes

1) “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”

After creating the datasets we mentioned above, the authors discuss what architecture modifications can help. They show that:

  • Multiscale networks (such as Multigrid Neural Networks) can improve robustness
  • Feature aggregation (such as ResNeXt and DenseNet) and larger networks also help
  • Smaller models are usually less robust! The intuition behind using smaller models may have been that they would overfit less and thus be more robust. It seems however that this hypothesis is losing ground in light of results such as Deep double descent.


2) “Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network”

The authors present a “bag of tricks” to improve over a ResNet50 baseline. They distinguish between “network tweaks” (architecture) related tricks and “regularization” related tricks. They sum up the results in a very nice ablation study table.

In the chart above, we focus for now on the “network tweaks”. Looking at the top delta in mCE we can see that the top tweaks that make the model more robust are:

  • Using Squeeze and Excitation networks (-3.71 mCE)
  • Using Selective Kernels networks (-4.30 mCE)
  • Using a deeper ResNet-152 network (-5.62 mCE)
  • Interestingly, the use of anti-aliasing provides only a modest improvement (-1.08 mCE). This may be because the authors only apply the anti-aliasing kernel to the strided convolutions and not to the other downsampling operations, in order to maintain the highest throughput possible.


Data augmentation related fixes

Using data augmentation as a way to fight against input perturbations is very natural.

We can try to prevent shifts due to translations or camera noise by randomly translating the images or adding Poisson noise to them. In fact, you can probably come up with a data augmentation that fits the perturbation you expect. For example:

  • Expecting changes in the geometry of the acquisition system? Try flips, rotations, scale, projections,…
  • Expecting changes in the camera set-up?  Try random multiplication (gain), blur (shutter speed), exposure, …
  • Expecting light perturbations? Try random additions, or more structured light data augmentations, …
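To make this concrete, here is a minimal NumPy sketch that chains one augmentation of each kind on a grayscale image. The function name, the parameter ranges and the choice of a circular translation are all assumptions made for illustration; in practice a dedicated library would offer many more options.

```python
import numpy as np


def augment(image, rng, max_shift=4, gain_range=(0.8, 1.2), max_offset=20.0):
    """Apply a random translation, gain, brightness offset and shot noise.

    image: 2D uint8 array (grayscale). All ranges are illustrative defaults.
    """
    img = image.astype(np.float64)
    # Geometry: random circular translation of a few pixels.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    img = np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))
    # Camera set-up: random multiplicative gain.
    img = img * rng.uniform(*gain_range)
    # Lighting: random additive offset.
    img = img + rng.uniform(-max_offset, max_offset)
    # Sensor: Poisson (shot) noise, using the pixel value as the expected count.
    img = rng.poisson(np.clip(img, 0, None)).astype(np.float64)
    return np.clip(img, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = np.full((32, 32), 128, dtype=np.uint8)
augmented = augment(image, rng)
```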

To get a good idea of what can be done using data augmentation, we recommend having a look at imgaug. It’s an awesome library.

At Scortex, we noticed the following: some data augmentations seem useful when looking at evaluation set metrics, and some do not, but the latter can nevertheless increase robustness. Typically, because we cannot sample reality as much as we would like, it is hard to construct evaluation sets that are representative of all possible factory conditions (think variations in time of day, season, dust, lights, vapour, heat, …), especially in low data regimes. Thus, a lack of improvement on an evaluation set does not necessarily mean a lack of improvement on the production line.

It is possible to create synthetic versions of our evaluation set in the same way ImageNet-C was created. However, one must then be very careful to cross-check the data augmentations used for training against those used for evaluation: if they overlap, the evaluation no longer measures robustness to unseen perturbations!

Simple data augmentation

Looking at the literature, we can gather some information about which data augmentations are the most likely to help.

1) “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations”

The authors report results on data augmentation related approaches to augment network robustness:

2) “Using Videos to Evaluate Image Model Robustness”

The noise correlation matrix shown above seems to show that random Contrast, Brightness, Hue, Saturation and Translation data augmentation would probably increase robustness against natural perturbations (videos).

3) “Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network”

In addition to architectural changes (see the table above), the authors show that specific data augmentation and regularisation techniques can improve network robustness:

Interestingly, Mixup and AutoAugment are two building blocks of AugMix, a paper we describe later in the post.

4) “An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods”

The authors show that a combination of CutMix data augmentation, ShakeDrop and Label Smoothing makes the model more robust. However, they show that there is no silver bullet and that data augmentation regularisation may make the network less robust against adversarial perturbations, and vice versa.

5) “Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation”

The authors come up with a new data augmentation technique called Patch Gaussian. The idea is to apply gaussian noise only on a patch of the image. This technique lies between additive gaussian noise (adds robustness but may hurt accuracy because of underfitting) and Cutout (removing a patch of the image, a powerful data augmentation technique) and shows some mCE improvement over the baselines.
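A minimal NumPy sketch of the idea is given below. It is not the authors’ code: the function name, the patch size, the noise level and the way the patch location is sampled are assumptions made for this example.

```python
import numpy as np


def patch_gaussian(image, rng, patch_size=8, sigma=0.1):
    """Add Gaussian noise inside one randomly placed square patch.

    image: float array in [0, 1] of shape (H, W) or (H, W, C).
    The patch centre is sampled uniformly, so the patch may be clipped at the border.
    """
    h, w = image.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    half = patch_size // 2
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    out = image.copy()
    noise = rng.normal(0.0, sigma, size=out[y0:y1, x0:x1].shape)
    out[y0:y1, x0:x1] = np.clip(out[y0:y1, x0:x1] + noise, 0.0, 1.0)
    return out


rng = np.random.default_rng(0)
image = np.full((32, 32), 0.5)
noisy = patch_gaussian(image, rng)
```

Setting the patch size to the full image recovers plain additive Gaussian noise, while a very large sigma makes the patch look like a Cutout-style occlusion, which is one way to see why the technique sits between the two.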

Stability training

The idea of stability training was introduced in “Improving the Robustness of Deep Neural Networks via Stability Training”.

Instead of training with data augmentation, the authors propose to train on normal images while forcing the outputs for perturbed images to stay close to those of the original ones, in probability or embedding space. The approach is summarised in the graph below.

In a classification task setup:

  • An original image I is passed to the network. Cross entropy between the output probability and the target (or other classification loss) is used to train the network.
  • A (Gaussian) noise epsilon is added to the image I to produce I’. This image is also fed to the network, but the cross entropy with the target is not used to train the network. Rather, the KL-divergence between the two predictions is used.
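For a single image, the resulting objective can be sketched as below. This is plain Python operating on already-computed probabilities; the function names and the weighting hyper-parameter `alpha` are hypothetical, and in a real training loop the same quantities would be computed on tensors so that gradients can flow.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def stability_loss(probs_clean, probs_noisy, target, alpha=0.01):
    """Task loss on the clean image plus a weighted stability term.

    alpha is a hypothetical weighting hyper-parameter chosen for illustration.
    """
    task = cross_entropy(probs_clean, target)
    stability = kl_divergence(probs_clean, probs_noisy)
    return task + alpha * stability


# Made-up predictions for an image I and its noisy version I'.
loss_stable = stability_loss([0.8, 0.2], [0.8, 0.2], target=0)
loss_unstable = stability_loss([0.8, 0.2], [0.6, 0.4], target=0)
```

When the two predictions agree, the stability term vanishes and only the usual classification loss remains.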

If this approach seems familiar to you, it may be because it is very close to consistency loss used in semi-supervised learning methods (Pi-model, Mean teacher) or self-supervised learning contrastive objectives (SimCLR, MoCo).

Now, why would this be better than a simple data augmentation strategy that adds noise to the input image? The authors argue that augmenting data with random noise can quickly lead to under-fitting, while stability training does not. Rather, it trains on the original image and makes the predictions constant in a neighbourhood of this image. This point is very similar to the one made for the Patch Gaussian data augmentation above.

One issue with this technique, though, is the training time. Indeed, with plain data augmentation you pass one augmented version of each image per epoch, while stability training requires two versions of each image.

Some papers such as “Achieving Generalizable Robustness of Deep Neural Networks by Stability Training” and AugMix try to extend the methodology. We describe the latter below.


A logical extension of stability training could be to replace the gaussian noise with any kind of data augmentation.

In “AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty” the authors use several key components described above to improve model robustness to small natural perturbations:

  • data augmentation using Mixup and AutoAugment operations
  • a stability training like strategy

Regarding the data augmentation policy, the authors first explain that applying several transformations to an image may result in an out-of-distribution image, or in losing the information that enables proper classification. In the image below for example, the composition of 5 transformations ends up making the turtle barely visible.

To remedy this issue, they propose to combine several data augmented images using Mixup. That is:

  • generate k data augmented images by randomly sampling 3 operations in the AutoAugment operations list.
  • blend the k images together using random weights that sum to 1.
  • blend the resulting image with the original image using random weights summing to 1.
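The steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the reference implementation: the four operations are simple hypothetical stand-ins for the AutoAugment list, and the random weights are drawn from Dirichlet and Beta distributions as one convenient way to get weights that sum to 1.

```python
import numpy as np

# Hypothetical stand-ins for the augmentation operations (the real list is richer).
OPERATIONS = [
    lambda img: np.flip(img, axis=1),           # horizontal flip
    lambda img: np.roll(img, 2, axis=0),        # small vertical translation
    lambda img: np.clip(img * 1.2, 0.0, 1.0),   # brightness gain
    lambda img: 1.0 - img,                      # invert
]


def augment_and_mix(image, rng, k=3, chain_length=3):
    """Blend k augmentation chains together, then blend the result with the original.

    image: float array with values in [0, 1].
    """
    # Random convex weights over the k chains (they sum to 1).
    chain_weights = rng.dirichlet(np.ones(k))
    mixed = np.zeros_like(image)
    for weight in chain_weights:
        augmented = image.copy()
        for op_index in rng.integers(0, len(OPERATIONS), size=chain_length):
            augmented = OPERATIONS[int(op_index)](augmented)
        mixed += weight * augmented
    # Random convex weight between the original image and the mixture.
    m = rng.beta(1.0, 1.0)
    return m * image + (1.0 - m) * mixed


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(16, 16))
mixed_image = augment_and_mix(image, rng)
```

Because every blend is a convex combination, the output stays in the same value range as the input.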

In the paper, the authors show a realisation of the strategy.

When training, they then:

  • generate 2 Augmix augmented images
  • train with cross entropy loss on the original image
  • enforce that the probabilities output by the network for the original image and the two augmented images are similar, using the Jensen-Shannon divergence.
    This loss is equivalent to computing the average probability of the 3 images, and then averaging the KL-divergence of each probability compared to this average.
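This consistency term can be written in a few lines of plain Python, following the description above (average the three distributions, then average the KL-divergences to that mean). The function names and the toy probabilities are made up for the example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, with q positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p_orig, p_aug1, p_aug2):
    """Jensen-Shannon divergence between three predicted distributions."""
    dists = [p_orig, p_aug1, p_aug2]
    # Average probability of each class over the three predictions.
    mean = [sum(values) / len(dists) for values in zip(*dists)]
    # Average KL-divergence of each prediction to that mean.
    return sum(kl_divergence(d, mean) for d in dists) / len(dists)


# Made-up predictions for the original image and two augmented versions.
js_consistent = jensen_shannon([0.7, 0.3], [0.7, 0.3], [0.7, 0.3])
js_inconsistent = jensen_shannon([0.7, 0.3], [0.4, 0.6], [0.9, 0.1])
```

The divergence is zero when the three predictions agree and grows as they drift apart, which is what makes it usable as a consistency penalty.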

The whole method is summarised in the algorithm below.

The paper shows good results on robustness to corruptions on the CIFAR-10-C dataset (and ImageNet-C). Ablation experiments also show that the method is robust to small changes in hyper-parameters and that each component of the method is relevant.

An interesting question is how one could generalize this approach to segmentation or detection tasks. Indeed, composing images with random geometric transformations means that the target needs to be updated in some way. Additionally, there is little evidence in the literature of Mixup being used for segmentation tasks so far (though the generalisation seems fairly direct).

Data Augmentation limits

While it seems that data augmentation and related methods can give robustness for free, they have a few limitations.

Data Augmentation Tuning

Data augmentation is not so easy to tune. The search space can quickly become large and finding the right data augmentation amount may be challenging.

In order to avoid tuning your data augmentation by hand, you may want to use learned strategies such as AutoAugment or RandAugment. The search itself may be prohibitively expensive, so you may want to reuse the policies found by the authors of these papers. That would be the equivalent of using transfer learning from ImageNet to train a network, but for data augmentation.

It is however not obvious that such transformations would generalise well across domains, typically from ImageNet to biology or manufacturing datasets.

Data augmentation poorly generalizes outside training manifold

This is mostly advocated by the “Why do deep convolutional networks generalize so poorly to small image transformations?” paper.

The augmentations you use may help cover only perturbations in the range of the data augmentation used. This means that if you trained with slight -5° to +5° random rotations, there is no guarantee that the network will perform well if a 10° rotation happens in real life.

Additionally, the invariance seems to be learned only for data similar to that in the training set. This can quickly become an issue if the dataset has a strong bias. The authors take the example of dog pictures in ImageNet. Most of them are close-up, centered shots: the photographer bias. A picture of a dog in a small corner of an image will then be very sensitive to perturbations.

One can argue that this is expected as this kind of image would be in a way Out of Distribution (OOD). This idea has interesting ties with the self supervised learning and anomaly detection literature, which we detail below.

Self supervised learning

The self supervised learning field has recently exploded thanks to papers such as RotNet, MoCo or SimCLR. In self supervised learning, no label is available. As a result, a pretext task is used to train a network. It can be reconstructing the input image for autoencoder-like methods, or using data augmentation and contrastive learning to force an image to be closer to its data augmented version than to any other image.

Another example, from the RotNet paper, is to apply random rotations to the input and to predict which angle of rotation was applied (usually people use 4 rotations: 0°, 90°, 180°, 270°).


Indeed, to be able to predict whether an image is rotated or not, the network needs to learn semantic features such as what feet or the ground look like, for example.
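Generating the training data for such a rotation pretext task is straightforward. Below is a minimal NumPy sketch (the function name is hypothetical, and the image is assumed to be square so that the four rotated copies can be stacked):

```python
import numpy as np


def rotation_batch(image):
    """Return the four rotated copies of a square image and their pretext labels.

    Label k means the image was rotated by k * 90 degrees (counter-clockwise).
    """
    rotated = np.stack([np.rot90(image, k=k, axes=(0, 1)) for k in range(4)])
    labels = np.arange(4)
    return rotated, labels


image = np.arange(16).reshape(4, 4)
rotated, labels = rotation_batch(image)
```

The network is then trained to predict `labels` from `rotated`, with no human annotation involved.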

Arguably, the recent rise of self-supervised learning is partly due to advances in the use of data augmentation and transformations to train deep neural networks. As a matter of fact, using the right data transformations is paramount for contrastive methods to work well. For a broader introduction to self-supervised learning, consider reading this great blog post from Lilian Weng or the one from fastai.

Such self supervised task can be used along a supervised one, with an auxiliary loss function. In To Balance or Not to Balance, the authors use it to improve performance on long tail (unusual) classes while in S4L, prediction of the rotation is used to improve performances on ImageNet in a semi supervised fashion.

A recent paper, “Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty”, now claims that this kind of approach can make the network more robust against adversarial attacks, common corruptions and label noise. It makes a lot of sense for label noise, as the auxiliary head provides a clean though weak signal that regularises against overfitting the noise. For common corruptions, the authors get good results on CIFAR-C.

They do not compare the results to a pure data augmentation approach. However, in the ImageNet / CIFAR setup, using only 4 rotations as data augmentation while training would probably not change the performance much. One interpretation of why the method works is that deep neural networks are biased towards texture, which can be sensitive to small perturbations. Forcing the network to predict rotations may help it learn more semantic features and hence improve the model’s robustness.

Though training a supervised network with auxiliary heads can be tough, there will probably be a rise in the use of self-supervised methods as a way to regularize the training of deep networks.

Finally, let’s note that this paper, building on ideas from GeoTrans, also provides a way to do OOD detection. It works by looking at which rotation is predicted by the network. If the network cannot tell which rotation was applied, the image is very likely to be an outlier.

Hence, these ideas of robustness, data augmentation and out-of-distribution detection seem tied together, in such a way that a complete system could say: “I am stable whenever I know the data. I become unstable in unknown regions. But that’s ok because being unstable is actually my way to tell you the input images are not in my confidence zone anymore.”


Robustness against natural perturbations is still a very active field of research. At Scortex, we are working hard on this topic. We created our own repeatability datasets, and implemented and benchmarked several of the fixes mentioned above. This may be the subject of a future post. If you are interested in the subject (typically if you are a research lab looking for a collaboration), or if you think we missed a relevant paper, feel free to reach out.