August 4, 2020
Extending selective back-propagation to segmentation
One of the weaknesses of deep learning is its requirement for large amounts of data. Can we accelerate training by focusing on the most valuable images?
Introduction: why accelerate deep learning trainings?
Deep learning is known for working well in big-data regimes. In the computer vision field, that means having to work with a lot of images: as the data collection grows, training the models inevitably takes longer as epoch time grows. This becomes a problem when the need for reactivity (updating the models) increases, especially in a factory setting where things can change rapidly on the production line. At Scortex, we work with large numbers of high-resolution images filmed on production lines. Most of these images are normal, or “healthy”, while only rarer ones are considered “defective”. Also, defects often only represent a small area of the image.
This data imbalance strongly suggests that a naive training scheme (focusing on all images equally) may very well be sub-optimal: gradients are not equally important and valuable for all images. This is a common idea in the literature, which motivates techniques such as hard example mining (Shrivastava et al., 2016), or focal loss (Lin et al., 2017).
A simple yet efficient idea from the literature (Jiang et al., 2019) proposes only using the most valuable images for training, which they call selective back-propagation. At the cost of an additional forward pass on the whole training set in order to determine which images are most valuable, we can cut down training time by selectively back-propagating (“focusing”) on these most valuable images.
The original paper applies this idea to a classification task. In this series of experiments, we propose a natural extension of the technique to the semantic segmentation use case.
- We will first restate the ideas and techniques behind the Jiang et al. paper, before detailing how we adapt them to segmentation.
- After a preliminary study on our dataset, we will first conduct experiments with trainings focusing on a first defect (defect A), then focusing on two defects (defect A, and a rarer defect, defect B).
- Finally, after experimenting with the notion of selectivity and upsampling, we will understand some of the limitations of the technique for segmentation, and try to solve them.
The first question we might want to address now is: how do you determine how valuable an image is?
The technique: focusing on the biggest losers (Jiang et al., 2019)
The technique Jiang et al. propose is based on the following observation:
“We suspect, and confirm experimentally, that examples with low loss correspond to gradients with small norm and thus contribute little to the gradient update.”
They introduce selective back-propagation, a method for selecting training images. Images with valuable gradients (which correspond to the highest losses, the “biggest losers”) are stochastically kept, while the others are stochastically discarded. In a nutshell, selective back-propagation acts as a sort of smart, stochastic upsampling of the most important images.
They justify the above assumption experimentally, by looking at the similarity across gradients within a batch. On the figure below, similarity is calculated on an original batch and on subsampled batches when training a MobilenetV2 on CIFAR10. After the first epoch, images sampled with high loss have more similar gradients than randomly sampled images. Similarity can be measured with cosine similarity, or with the fraction of gradient components of the same sign.
Hence, for the acceleration experiments, they select the images with the highest loss for training. This cuts down forward and back-propagation time for discarded examples, at the cost of one additional forward pass on the whole training set. This additional forward pass is done at the end of each epoch, and serves to compute the loss on each training example. Once the loss has been computed for each training example, selection of examples is done stochastically, by assigning to each example a probability of selection based on the loss quantile to which it belongs. Keeping stochasticity in the method matters: it avoids over-fitting on the highest-loss examples.
The formula for selection is:

probability(selection) = percentile(loss) ^ beta

where percentile(loss) is the percentile of the example’s loss within the most recent loss snapshot over the training set.
Beta is a selectivity parameter. If selection is done with a beta value of 1 (red selection curve), an example with a loss in the 90th percentile will be selected with a 90% probability. With higher beta (blue or green curve), selection is more drastic, as examples need to be in higher percentiles to be selected.
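As an illustration, here is a minimal numpy sketch of this selection rule (our own illustration, not the authors’ implementation; percentiles are computed within the latest snapshot of per-example losses):

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_probability(losses, beta):
    """Selection probability = (loss percentile within the snapshot) ** beta."""
    ranks = np.argsort(np.argsort(losses))     # rank 0 (lowest loss) .. n-1 (highest)
    percentiles = (ranks + 1) / len(losses)    # percentiles in (0, 1]
    return percentiles ** beta

def select_examples(losses, beta):
    """Stochastically keep examples: the 'biggest losers' are kept most often."""
    probs = selection_probability(np.asarray(losses), beta)
    return np.flatnonzero(rng.random(len(probs)) < probs)
```

Note that with this rule, the expected fraction of kept examples is the integral of p^beta over [0, 1], i.e. 1/(beta+1): beta=1 keeps roughly half of the images, beta=8 roughly 11%.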
They also accelerate the technique further by only computing the losses every N epochs: they call this stale selective back-propagation. If N is not too high, examples will still be selected according to a recent loss value computation, while significantly reducing the average additional forward pass time.
They experiment on training a classification model on CIFAR10, CIFAR100, and SVHN. They manage to accelerate training and convergence by a factor 2 on CIFAR10 (by a factor 5 on SVHN), while maintaining performance. Below, an illustration of their results on CIFAR10 (S is the selectivity used, and Err is the final test error reached).
The following chart sums up how selective back-propagation and stale selective back-propagation reduce wall training time per epoch:
The question left is now: will the acceleration hold true for us? The answer will depend on the forward-propagation vs back-propagation time ratio, which varies with image resolution.
Selective back-propagation for the segmentation use case
Scortex aims to detect and locate defects on industrial parts. In this series of experiments, we will work with a standard semantic segmentation model, such as U-Net. For encoding the ground truth, target pixels will be encoded as 1 if they intersect with a defect annotation, while left with 0 if they don’t:
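As a toy illustration of this target encoding (a minimal sketch assuming rectangular annotations; real annotations may be polygons, in which case a rasterization step would replace the slicing):

```python
import numpy as np

def encode_target(height, width, defect_boxes):
    """Binary target map: a pixel is encoded as 1 if it intersects a defect
    annotation, and left at 0 otherwise.

    `defect_boxes` is a list of (y0, x0, y1, x1) rectangles (hypothetical
    format, for illustration only)."""
    target = np.zeros((height, width), dtype=np.uint8)
    for y0, x0, y1, x1 in defect_boxes:
        target[y0:y1, x0:x1] = 1
    return target
```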
For our dataset, we use a training set of XXX high-resolution images, with XXX defective images. We evaluate on XXX. Defects are of two types:
- one is more common, and is present on the majority of our defective images (defect A, “door scratch”)
- one is rarer (defect B, “door broken”)
We will start by focusing on the more common defect A. In order to adapt the idea of selective back-propagation to segmentation, a simple idea is to aggregate the output loss map (a loss tensor of shape [output_height, output_width, defects_number] for our semantic segmentation network) spatially, in order to get one loss value per image, which we can then use to select images. For starters, we aggregate the loss map by taking its maximum loss value (global max pooling) or its mean (global average pooling).
To sum up:
- At the end of an epoch, we compute the loss map for defect A, for each image of the training set
- We aggregate this output loss map spatially (in a max or mean fashion, for now)
- At the start of the next epoch, we sub-sample the training set according to the loss distribution, with a certain selectivity parameter (beta)
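The aggregation step above can be sketched as follows (illustrative numpy code, not our production pipeline; `loss_maps` is assumed to be a list of per-image loss tensors):

```python
import numpy as np

def aggregate_loss_map(loss_map, mode="max"):
    """Collapse a [output_height, output_width, defects_number] loss map
    into a single scalar: global max pooling or global average pooling."""
    return loss_map.max() if mode == "max" else loss_map.mean()

def image_scores(loss_maps, mode="max"):
    """One aggregated 'loser' score per training image, ready for the
    percentile-based stochastic selection."""
    return np.array([aggregate_loss_map(m, mode) for m in loss_maps])
```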
Wait... will selective back-propagation work for this use case?
Before diving into acceleration experiments, we decide to start with a preliminary study in order to ensure the technique makes sense for our trainings. It will also be critical for tuning selectivity.
We will train the model regularly, except that we will compute and aggregate the loss on the whole training set at the end of each epoch (additional forward pass). For each epoch, we then plot the distribution of these aggregated losses on the training set. We expect two separate modes (especially for the max-aggregation scheme):
- one, with higher losses, corresponding mostly to defective images
- the other, with lower losses, corresponding mostly to healthy images
For the first epochs, because of the pixel imbalance, the model learns to output every pixel as healthy: this results in the model outputting high loss values for defective images. We expect these modes to merge as the model learns to differentiate defects from healthy pixels while converging.
Above, we plot the evolution of the loss distribution on the training set across time (epochs), using max and mean aggregation. We observe the following:
- As expected, loss distribution is shifting towards lower values (this is more obvious when looking at mean histogram): as the model converges, mean training loss decreases
- We do not observe a clear merging of the two modes: they seem to stay relatively separated (colors and values stay distinct)
Our hypothesis to explain this absence of merging is that the error could stay high on the target pixels located on the border of imprecise annotations, which are encoded as 1, but should actually be 0 (and predicted as such). We will test this hypothesis by sampling random images from the top light blue bins (high loss images).
What we actually observe is either the hypothesized phenomenon (which we call the “border effect”), or missing annotations leading to a constant high error rate (target encoded as 0, but should actually be 1, and is predicted as such).
This “border effect” exists because of variabilities and discrepancies in the labelled data (annotation noise), resulting in a non-perfect target.
As a consequence, the error (loss) stays high on the border of annotations, even if the error on the rest of the actual defect is low. This is a problem for the “biggest losers” technique: these images will stay “losers” forever (in the case of a “max” aggregation strategy). For now, we decide to ignore the problem and neglect its impact on the following experiments (we will see later that this is not so easy to do).
We decide to go for a selectivity of 10% of the training set (beta=8), which should cut down pure wall training time by a factor 10.
First experiment: regular selective back-propagation for defect A
We start by running two “focus” trainings with regular selective back-propagation on defect A, with 10% selectivity: one with max-aggregation, the other with mean aggregation.
We expect the “focus on biggest losers” training to be faster epoch-wise than the baseline, and hope that they reach the same performance as the baseline (here, our metric of choice for measuring performance of the model is mAP).
We observe the following:
- Max-aggregation training mAP converges faster than its mean-aggregation counterpart
- After convergence, baseline mAP can be reached for both focus trainings!
- We notice a small “warming-up” time at the beginning (probably time needed for “finding the right losers”)
Since the defects are relatively small compared to the size of the image, with mean-aggregation the high-loss pixels (around defects) are “drowned” in the majority of low-loss pixels. Hence, the model focuses slightly less on defective images than with max-aggregation (fewer defective images are sampled). This phenomenon, which would not hold for very large defects, explains why we observe faster convergence for max-aggregation than for mean-aggregation.
In terms of evaluation loss, we observe that it is lower for both focus trainings (data not shown). That could mean that the objective is better minimized. It also seems the “focus” technique reduces over-fitting, but that could very well be because epochs are not directly comparable (we would need to wait 10 times more epochs on the “focus” trainings for them to be comparable). Evaluation loss should be split per defect for a more complete analysis.
Are we faster epoch-wise (wall training time)?
Actually, not (yet). Even though we reduce our pure training time (forward and backward) by a factor 10, epochs still have the same duration for the baseline and our focus trainings. This is because our additional forward pass lasts approximately as long as the forward and back-propagation training passes: this means the back-propagation / forward-propagation ratio is low in our case (much lower than 2), probably because we are using a smaller network than the one used in the “focus” paper, and higher-resolution images. For a fully-connected neural network, for example, without parallelization, the back-propagation / forward-propagation ratio varies linearly with the number of hidden layers.
Increasing selectivity might not only cost performance (lower mAP); it also cannot solve the problem (the additional pass is done on the whole dataset, and is for now incompressible). We could optimize the process by parallelizing the additional forward pass rather than computing it sequentially (not done in this work). For now, we focus on finding tricks to make the additional forward pass shorter on average, such as stale selective back-propagation.
Two strategies for making the selective back-propagation more efficient
We envision two orthogonal strategies for making the forward pass faster on average:
- “Stale” selective back-propagation: do the additional forward pass (loss computation) only every N epochs (strategy 1)
- “Smart recompute”: only recompute the loss for the datapoints that were trained on since the last recompute (strategy 2)
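A minimal sketch of the two strategies (hypothetical helper names; `forward_fn` stands for the additional forward pass plus loss computation on one datapoint):

```python
def needs_loss_recompute(epoch, stale_period):
    """Strategy 1 ('stale'): run the additional forward pass only
    every `stale_period` epochs."""
    return epoch % stale_period == 0

def smart_recompute(cached_losses, trained_indices, forward_fn):
    """Strategy 2 ('smart recompute'): refresh cached losses only for the
    datapoints trained on since the last recompute; the rest keep their
    previous (slightly stale) loss value."""
    for i in trained_indices:
        cached_losses[i] = forward_fn(i)
    return cached_losses
```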
Here are the results:
One epoch (wall training time) can effectively become approximately 4 times faster than the baseline training with either of the strategies mentioned above! Using both strategies can even hit a x6 speed-up compared to the baseline training, which is what we aimed for. Still, the end-goal is not to speed up the pure epoch time, but rather to accelerate convergence of performance on defects A and B. The question left is now: is defect mAP convergence faster?
On these “relative” curves (the x-axis is time in hours, not epochs), we can see that both strategies (stale back-propagation and smart recompute) have similar effects on defect A mAP convergence. When using both techniques at the same time, we further accelerate training, while maintaining performance and proper convergence. As a result, we will now use both strategies by default in the following experiments.
However, convergence is only faster than the baseline at the beginning of the training (before the black line): after a couple of epochs, both trainings reach convergence at the same rate. Defect A final mAP convergence is not faster when using the technique.
We see that the rarer defect B is not learnt when using “focus” techniques. This is because, so far, the “focus” is only done on defect A. In comparison, the baseline learns defect B since defect B images are upsampled during training (this is what we will call an “upsampled baseline”). Now, can we learn defect B by focusing on both defects?
Selective back-propagation on the two defects
Selective back-propagation did not allow proper learning of the rarer defect B. In the baseline case (no “focus”), this rare defect needs to be upsampled in order to be learnt properly. Can we, by performing selective back-propagation on defect B as well, improve convergence? Can we do as well as (or better than) upsampling?
In practice, we aggregate the loss map by taking the max over both loss maps (defect A and defect B). As for the baselines, we have two: the regular one (no upsampling), and the one trained with upsampling on defect B. Note that focus trainings do not benefit from defect B upsampling here. Here are the results:
We observe that:
- Defect B can be learnt by the baseline without upsampling, but only very slowly (purple curve)
- Defect B is learnt quickly by the baseline with upsampling (see evaluation curves)
- Defect B is learnt when focusing on both defects, but not immediately. Probably, what happens is that defect A images are focused on first, leaving space for defect B images to be trained on later (once some defect A images are no longer among the biggest losers). The mAP curve for defect B is lower than its upsampled baseline counterpart, even though its evaluation loss is lower.
- Defect B, when learnt, is learnt at the cost of an apparent overfitting (few images, seen over and over again by the model). This seems in contradiction with Buda et al., 2018, where it is stated that “oversampling does not cause overfitting of CNNs“.
- Defect A convergence and behavior does not change much when focusing on both defects (data not shown)
- Defect B over-sampling is mainly responsible for the observed overfitting on the upsampled baseline discussed in the First experiment paragraph (see evaluation loss curves)
- Evaluation losses for “focus” training are again lower than for the non-upsampled baseline:
- Explanation 1: “focus” strategies are improving the training by improving the gradients. The loss is better minimized as the flowed gradients are of better quality, since they are less drowned in the noise of gradients coming from less valuable images.
- Explanation 2: there is a distribution imbalance between the training and the evaluation set. Defects are much rarer in the training set than in the evaluation set. When training with focus strategies, we are mainly training on defective images: the distribution of the validation data hence becomes closer to that of the training set, resulting in a lower loss compared to the baseline.
The approach works: we overall learn both defect A and B rapidly without manual upsampling, and with lower evaluation loss. Convergence for defect A is still similar to what we observed when focusing only on defect A (not faster than baseline). For defect B, we are not doing better than the upsampled baseline either.
NB: If you are puzzled by the counter-intuitive experimental observation that, on defect B, evaluation loss is increasing (overfitting) while mAP is increasing as well, consider the following idea. In a multi-class setup, classes are not learnt simultaneously: classes with more images (and more variability within images) are learnt first. As the network learns to classify defect A, a small number of defect B images will be predicted incorrectly with increasing confidence (the model becomes more and more over-confident on these images): the binary cross-entropy loss for defect B becomes dominated by these highly-confident probabilities, resulting in an apparently exploding evaluation loss. Nevertheless, the model is still learning and predicting correctly most defect B images, resulting in an increase of our evaluation metric (here mAP). If you wish to read more on the topic, we recommend the LossUpAccUp GitHub repository, as well as the associated Twitter thread.
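A toy numeric illustration of this effect, with a hand-built binary cross-entropy (the probabilities are invented for illustration only):

```python
import math

def bce(y, p):
    """Binary cross-entropy of a single prediction p for target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# 9 out of 10 defect B images predicted correctly, 1 incorrectly.
# Early in training, predictions are unconfident; later, over-confident.
early = [bce(1, 0.7)] * 9 + [bce(1, 0.3)]      # accuracy 90%
late = [bce(1, 0.99)] * 9 + [bce(1, 0.001)]    # accuracy still 90%

mean_early = sum(early) / len(early)
mean_late = sum(late) / len(late)
# mean_late > mean_early: the single over-confident mistake dominates the
# mean loss, even though most predictions improved.
```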
Playing with selectivity
We will now experiment on the impact of selectivity on convergence of the rarer defect B. Indeed, we might be too selective in our selective back-propagation, thus not taking enough defect B images for training (if defect B loss is smaller than defect A loss). We therefore set beta to 2 (focusing on 30% of the data) instead of 8 (10% selectivity). Here are the results:
We observe a slower convergence on defect B. An explanation could be that, if the loss is generally higher on defect B than on defect A, lowering beta (increasing the percentage of focused images) dilutes the defect B images (which could already be in the top 1%) among the added defect A and healthy images. This could be tested by being more selective (using a higher beta), and would be confirmed by witnessing a faster and higher mAP convergence (data not shown).
Overall, it seems tuning beta should have been a priority from the beginning. A good rule of thumb could be to set selectivity so that slightly more images are sampled than the percentage of defective images in the training set.
Combining selective back-propagation and upsampling
Combining upsampling with selective back-propagation should help: by upsampling defect B images, we increase the probability that these images are focused on. This combination should solve the problem for defect B and allow the model to focus properly on it from the start.
In practice, this is what happens:
The convergence for defect B mAP is now equivalent to the upsampled baseline. While this works (we manage to get fast convergence of defect B), it also kind of defeats the purpose of selective back-propagation, which is supposed to act as a smart upsampling.
This experiment confirms the “dilution issue” for defect B. This issue could be aggravated by the “border effect” discussed above, especially in the max-aggregation setting (which we are using). Indeed, this effect might directly cause defect A images to monopolize the sampled “biggest losers”, at the cost of learning defect B images. Thus, we will now explore alternative aggregation schemes, in order to avoid the “border effect” and defect B dilution.
Fighting the "border effect" with new aggregation schemes
If the model starts by learning defect A, images with defect A will tend to stay in the biggest losers because of the “border effect”. But we want some “space” in the biggest losers for the rarer defect B images, and avoid having too few of them (dilution).
We propose three schemes for fighting the “border effect” in order to improve convergence on defect B:
- One way is to indirectly ignore the border: we take the minimum loss value on the defective target zones, and the maximum loss value on the background, and take the maximum of these two values as the aggregation “loser” score for our image: we call this first scheme “agg_mindefects_maxbackground”.
- Another way is to directly ignore the border: we take the maximum loss value on the whole image, except that we ignore the border pixels of the ground truth target: we call this second scheme “agg_maskdefectborders”.
- Finally, we also consider a last aggregation strategy: before predicting, shifting the image by a random (small) amount in x and y. The loss map is hence computed on a prediction of this shifted image (aggregation is still done with the maximum). Statistically, we should “erase” the loss at the border of the defects and increase randomization in the selection of the losers: we call this third scheme “shift”.
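The first two schemes can be sketched as follows (a minimal numpy illustration; the function names mirror the schemes above, and the border mask uses a crude shift-based dilation/erosion, where a real implementation would use proper morphology):

```python
import numpy as np

def agg_mindefects_maxbackground(loss_map, target):
    """Scheme 1: min loss over defect pixels, max loss over background pixels,
    then the max of these two values is the image's 'loser' score."""
    defect = target.astype(bool)
    scores = []
    if defect.any():
        scores.append(loss_map[defect].min())
    if (~defect).any():
        scores.append(loss_map[~defect].max())
    return max(scores)

def defect_border_mask(target, border=1):
    """Pixels near an annotation edge (shift-based dilation minus erosion;
    np.roll wraps around image edges, acceptable for interior defects)."""
    defect = target.astype(bool)
    dilated, eroded = defect.copy(), defect.copy()
    for _ in range(border):
        d, e = dilated, eroded
        dilated = d | np.roll(d, 1, 0) | np.roll(d, -1, 0) | np.roll(d, 1, 1) | np.roll(d, -1, 1)
        eroded = e & np.roll(e, 1, 0) & np.roll(e, -1, 0) & np.roll(e, 1, 1) & np.roll(e, -1, 1)
    return dilated & ~eroded

def agg_maskdefectborders(loss_map, target, border=1):
    """Scheme 2: max loss over the image, ignoring ground-truth border pixels."""
    return loss_map[~defect_border_mask(target, border)].max()
```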
Results are as follows:
For defect B, we do not manage to beat the upsampled baseline with the new aggregation schemes. Also, we do not do better than simple max aggregation. It could be that the “border effect” is negligible compared to annotation noise in the data. Also, taking min value on defect (scheme 1) or ignoring the border (scheme 2) could prevent the network from learning proper contours, making the task more difficult.
Exotic aggregation strategies failed to deliver a better convergence. That means:
- either that the aggregation strategies were not effective against the “border effect”,
- or that the problem is not about this “border effect”: it could be that defect B images are already in the top losers (see selectivity study), and that they are already diluted too strongly, putting a limit to the convergence rate we can obtain. A direct solution would therefore be to increase selectivity. That increase in selectivity would probably be at the cost of a slower defect A mAP convergence.
In this article, we introduced a generalization of selective back-propagation to the segmentation task.
Selective back-propagation strategies act as “smart upsampling”, and can effectively accelerate wall training time. We had to rely on additional tricks such as “stale” selective back-propagation and “smart recompute” of the loss in order to cope with the fact that, in our case, the back-propagation / forward-propagation ratio is lower than in the “Focus on biggest losers” paper. In these experiments, we are using small networks; we are convinced larger networks would benefit more from the method.
Overall, even though it seems the objective is better optimized (lower evaluation losses, avoiding over-fitting), we still had difficulties catching up with the mAP convergence speed of the baseline training: mAP convergence is indeed only faster for the first epochs for defect A. Proper optimization (parallelization) of the additional forward pass could help accelerate the training wall time even further, though we already made the forward pass time quite low on average.
For defect B, we had difficulties catching up with the mAP convergence speed of the upsampled baseline training. A solution could be to be more selective, using a larger beta in order to reduce defect B images “dilution” in defect A images. This would also make the training faster epoch-wise.
Other aggregation schemes for extending the technique to segmentation could be explored, such as taking the average of the top K loss values on the images. One would need to be careful with such schemes, as they can be subject to imbalance in defect sizes.
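For instance, a top-K aggregation could look like this (an illustrative sketch; note that a fixed k in pixels makes the score sensitive to defect size, which is exactly the imbalance caveat mentioned above):

```python
import numpy as np

def agg_topk_mean(loss_map, k):
    """Average of the k highest pixel losses: a middle ground between
    global max pooling (k=1) and global average pooling (k = all pixels)."""
    flat = np.sort(loss_map.ravel())
    return flat[-k:].mean()
```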
Our experimental training dataset is already well constructed, since we subsampled a lot of healthy images when creating it. In the wild, such a method would make it easier to work with a very large number of healthy images.
The selective back-propagation technique is a very nice way to accelerate trainings by focusing only on the most valuable images.
One could imagine extending the technique further for learning new data in continuous or incremental learning setups: a model is trained on an original batch of data, before being re-trained with “focus” techniques on additional new data (data with potentially new classes), while still having access to the original data. That way, the model would learn the new data fast by stochastically focusing on the biggest losers (i.e. mostly the most valuable images from the new data), without forgetting features learnt on the original data.
The technique could also be used for mining valuable unlabelled images for further annotation prioritization (“label the biggest losers”), in an active learning fashion.