September 20, 2018

Our ECCV 2018 recap

Scortex Machine Learners booked their tickets for the 2018 edition of the European Conference on Computer Vision (ECCV).


New NN Architectures

Group Norm

Group Norm (GN) is another normalisation scheme, in a way similar to BatchNorm (BN, widely used at Scortex).

BatchNorm has several failure cases:

  • small batch size (typically for segmentation of large images)
  • when the batch can vary or the notion of batch is ill-posed:
    • varying batch size
    • pre-training vs fine-tuning
    • head of a detector, for example the Single Shot Multibox Detector (SSD), where the number of elements might not be fixed

The authors' proposal is a normalisation that is independent of the batch size.

Here is a schema representing the different possible normalisations.


• While Batch Norm normalises each feature map over the whole batch,
• Group Norm normalises over predefined groups of features for each instance (image) separately.
• Other normalisations exist (Layer Norm, LN, or Instance Norm, IN), but their performance is not quite on par with BN and GN.
Though Batch Norm with batch size 32 tends to beat Group Norm on the ImageNet dataset, it is less robust to changes in batch size.
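
To make the difference concrete, here is a minimal NumPy sketch of the two normalisations. It is our own illustration, not the authors' code: the learnable scale and shift are left out, and the function names and the (N, C, H, W) layout are assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x has shape (N, C, H, W). One mean/variance per channel,
    # computed over the whole batch and all spatial positions.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    # One mean/variance per image and per group of channels,
    # so the result never depends on the other images of the batch.
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))

# Normalising the first image alone or inside the batch:
gn_alone = group_norm(x[:1], num_groups=2)
gn_in_batch = group_norm(x, num_groups=2)[:1]
bn_alone = batch_norm(x[:1])
bn_in_batch = batch_norm(x)[:1]
```

With Group Norm, `gn_alone` and `gn_in_batch` are identical, while the two Batch Norm results differ because the statistics change with the batch content.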

Group Norm seems to improve detection algorithms:

• It’s been used by the COCO 2018 winner
• It enables good performance when training Faster R-CNN from scratch (without ImageNet pretraining)
The authors found empirically that 32 groups per layer (of 16 channels each) lead to the best performance. They note that fixing either the number of groups per layer or the number of filters per group results in similar performance.

Nota Bene:

• for more information, refer to the paper.
• a keras implementation is available here.
• One can learn a mix of group norm and batch norm (and other norms) by using: switchnorm

The Hourglass Network (not new but used everywhere)

Hourglass network is not new per se (first paper in 2016) but was extremely popular at ECCV so we thought it was a good idea to do a quick recap.

Originally introduced for human pose estimation, it is now used for other tasks such as detection (see CornerNet below, for instance).

Below we illustrate the module as depicted in the paper.

Note the resemblance with the classic encoder-decoder architecture, close to an auto encoder or U-Net. One difference is that the authors use nearest neighbour upsampling instead of unpooling or transposed convolution. Another difference is that some computation is done in each “skip layer”.
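
As an illustration of the upsampling step, here is a small NumPy sketch of nearest neighbour upsampling followed by a merge with the skip branch. The element-wise addition and the function names are assumptions made for this example.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # x has shape (C, H, W): every pixel is repeated factor x factor times,
    # there is no learned parameter (unlike a transposed convolution).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def merge_with_skip(low_res, skip):
    # Assumption: decoder and skip features are merged by element-wise addition.
    return upsample_nearest(low_res) + skip

low_res = np.array([[[1.0, 2.0],
                     [3.0, 4.0]]])      # shape (1, 2, 2)
skip = np.zeros((1, 4, 4))              # output of the "skip layer" computation
merged = merge_with_skip(low_res, skip)
```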

Hourglass modules can be stacked together to create a deeper architecture. In this case, it’s useful to add intermediate supervision.

Typically, if you want to predict maps (such as segmentation masks), you add a loss after each decoder part of the hourglass module.

The idea of concatenating these modules is to be able to refine at each step the prediction that you generated.
In the same way, if you do multi-task learning, the feature maps of both predictions can be merged with the original image to refine the predictions.
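
Intermediate supervision boils down to summing one loss per stacked module. The sketch below assumes a mean squared error on the predicted maps; both the loss choice and the function name are ours.

```python
import numpy as np

def intermediate_supervision_loss(stack_predictions, target):
    # One loss per hourglass module, all compared to the same target map.
    per_stack = [float(np.mean((pred - target) ** 2)) for pred in stack_predictions]
    return sum(per_stack), per_stack

target = np.zeros((2, 2))
predictions = [np.ones((2, 2)), 0.5 * np.ones((2, 2))]  # second stack is closer
total, per_stack = intermediate_supervision_loss(predictions, target)
```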


Corner Net

There are two issues with single shot detectors:

• anchor boxes don’t perfectly match the shape of the boxes
• they add a lot of extra parameters to tune such as grid size and aspect ratio (sad)
Instead of predicting a score for every anchor box and applying non-max suppression, CornerNet tries to detect objects as paired keypoints.

For each pixel, it predicts whether it is a top-left corner, whether it is a bottom-right corner, the class, as well as an embedding that is used to match the complementary keypoint of the box.
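
The matching step can be sketched as follows. This is a simplified greedy pairing written for this recap, assuming scalar embeddings and a hypothetical distance threshold `max_dist`; it is not the authors' implementation.

```python
import numpy as np

def pair_corners(top_left, bottom_right, tl_embed, br_embed, max_dist=0.5):
    # top_left: (K, 2) and bottom_right: (M, 2) arrays of (x, y) positions.
    # tl_embed: (K,) and br_embed: (M,) scalar embeddings (assumption).
    top_left = np.asarray(top_left, dtype=float)
    bottom_right = np.asarray(bottom_right, dtype=float)
    tl_embed = np.asarray(tl_embed, dtype=float)
    br_embed = np.asarray(br_embed, dtype=float)
    boxes = []
    if len(bottom_right) == 0:
        return boxes
    for i, (x1, y1) in enumerate(top_left):
        dists = np.abs(br_embed - tl_embed[i])
        # A bottom-right corner must lie below and to the right of the top-left one.
        valid = (bottom_right[:, 0] > x1) & (bottom_right[:, 1] > y1)
        dists = np.where(valid, dists, np.inf)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            x2, y2 = bottom_right[j]
            boxes.append((float(x1), float(y1), float(x2), float(y2)))
    return boxes

boxes = pair_corners(
    top_left=[[10, 10], [50, 40]],
    bottom_right=[[80, 90], [30, 30]],
    tl_embed=[0.1, 2.0],
    br_embed=[2.1, 0.0],
)
```

Corners whose embeddings are close end up in the same box, so no anchor shape has to be chosen beforehand.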

They use the hourglass network as the backbone architecture with an intermediary supervision (between the two hourglass networks).

Predicting for each pixel would be too costly. Instead, the prediction is done at a lower resolution and offsets are predicted for a higher box precision (just like in the SSD paper).

More details are available in the paper.

Receptive Field Block Net (RFB Net)

The RFB block architecture is motivated by the receptive field of the human eye.

It is composed of the combination (concatenation + 1×1 convolution) of three branches:

• 1×1 convolution followed by a standard 3×3 convolution
• 3×3 convolution followed by a dilated convolution with rate 3
• 5×5 convolution followed by a dilated convolution with rate 5
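
To get a feeling for what each branch sees, we can compute its receptive field. The sketch below assumes stride 1 everywhere and a 3×3 kernel for the dilated convolutions (the kernel size is our assumption, it is not stated above).

```python
def effective_kernel(kernel, dilation=1):
    # A dilated kernel covers kernel + (kernel - 1) * (dilation - 1) pixels.
    return kernel + (kernel - 1) * (dilation - 1)

def receptive_field(layers):
    # layers: list of (kernel, dilation) pairs, all with stride 1.
    rf = 1
    for kernel, dilation in layers:
        rf += effective_kernel(kernel, dilation) - 1
    return rf

branches = {
    "1x1 then 3x3 (rate 1)": [(1, 1), (3, 1)],
    "3x3 then 3x3 (rate 3)": [(3, 1), (3, 3)],
    "5x5 then 3x3 (rate 5)": [(5, 1), (3, 5)],
}
fields = {name: receptive_field(layers) for name, layers in branches.items()}
for name, rf in fields.items():
    print(f"{name}: {rf}x{rf}")
```

Under these assumptions the three branches cover 3×3, 9×9 and 15×15 pixels respectively.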

A comparison of the receptive field with common architectures is shown below.

It shows that while atrous / dilated convolutions provide large receptive fields, the proposed RFB block leads to a smoother and more “human” receptive field.

The authors use their blocks in an SSD-like approach depicted below.

They use a different version of the block in shallow layers.

They obtain a very competitive architecture in terms of trade-off between speed and accuracy (faster than RetinaNet for the same accuracy budget).


• the use of 5×5 convolution is interesting and goes against the trend of using only 3×3 convolutions. Maybe this 5×5 could be replaced by two successive 3×3 convolutions?
• it’s nice to see dilated convolution in detection architectures, given their success in segmentation tasks
• the paper is available here.

Other detection papers

IoU-Net: the authors propose to output a predicted IoU as well as the class confidence. This enables a modified version of non-max suppression that makes much more sense: choose the box with the highest predicted IoU and set its class probability to the max of the confidences.
Deep Feature Pyramid Reconfiguration: the authors improve the Feature Pyramid Network (introduced in FPN, used in RetinaNet) by using attention modules.
PFPNet: another way to reconfigure FPN networks. They get better mAP for similar speed compared to YOLO V3.
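
Coming back to IoU-Net, the modified non-max suppression can be sketched as below. It is a simplified version written for this recap: the threshold value and the function names are assumptions, and the max of the confidences is taken over the boxes suppressed by the kept box.

```python
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (K, 4), all as (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def iou_guided_nms(boxes, cls_scores, pred_ious, threshold=0.5):
    boxes = np.asarray(boxes, dtype=float)
    cls_scores = np.asarray(cls_scores, dtype=float)
    pred_ious = np.asarray(pred_ious, dtype=float)
    # Rank by predicted IoU (localisation quality), not by class confidence.
    remaining = [int(i) for i in np.argsort(-pred_ious)]
    kept = []
    while remaining:
        best, others = remaining[0], remaining[1:]
        cluster = [best]
        if others:
            overlaps = iou(boxes[best], boxes[others])
            cluster += [i for i, o in zip(others, overlaps) if o >= threshold]
        # The kept box inherits the best class confidence of its cluster.
        kept.append((best, float(cls_scores[cluster].max())))
        remaining = [i for i in remaining if i not in cluster]
    return kept

kept = iou_guided_nms(
    boxes=[[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
    cls_scores=[0.9, 0.6, 0.8],
    pred_ious=[0.7, 0.95, 0.5],
)
```

Here box 1 is kept because it has the highest predicted IoU, even though box 0 has the higher class confidence, and it inherits that confidence.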



BiSeNet: a lightweight segmentation architecture

The overall idea is simple: to have a good segmentation, we need fine grained details and global shapes.

In traditional architectures such as U-Net, global information is obtained by chaining convolution layers with stride or max pooling. That is, we need to compute fine-grained features before we can compute the global shapes. BiSeNet separates the two components to increase the network speed.

As a result, BiSeNet is composed of two branches:

• a spatial path that keeps local features. It’s composed of three 3×3 convolutions with stride 2 followed by batchnorm and ReLU
• a context path that is designed to add a large receptive field for context. It downsamples and uses a “lightweight” model, but the paper is not crystal clear about how.
The paths are merged together with a concatenation followed by an attention-like mechanism.

BiSeNet shows interesting performance for the speed it can achieve. The authors test it on relatively high resolution images (1920×1080 pixels), which is worth noting.

The paper is available here.

DeepLab v3+

The DeepLab team improved (again) their architecture by merging the best of two worlds: Atrous Spatial Pyramid Pooling (ASPP) and the encoder-decoder approach (e.g. U-Net).

Instead of directly using a ×8 upsampling, they merge information from a shallow layer after a ×4 upsampling.

Focus, Segment and Erase

The authors manage to get interesting performances in segmenting tumors on medical images with a sequential approach:

• extract ROIs (Regions of Interest)
• iteratively, for several classifiers:
  • segment the class
  • erase it: multiply the image by (1 − segmentation mask)
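
The segment-then-erase loop can be sketched as follows. The `segmenters` are hypothetical callables standing in for the per-class networks, and the ROI extraction is left out.

```python
import numpy as np

def erase(image, mask):
    # image: (H, W, C), mask: (H, W) with values in [0, 1].
    return image * (1.0 - mask)[..., None]

def segment_and_erase(image, segmenters):
    # Each segmenter only sees what the previous ones have not erased.
    masks, current = [], image
    for segment in segmenters:
        mask = segment(current)
        masks.append(mask)
        current = erase(current, mask)
    return masks, current

# Toy segmenters (assumptions): the first one finds the top-left pixel,
# the second one segments whatever is still visible.
first = lambda img: np.array([[1.0, 0.0], [0.0, 0.0]])
second = lambda img: (img[..., 0] > 0).astype(float)

image = np.ones((2, 2, 3))
masks, leftover = segment_and_erase(image, [first, second])
```

Because the first class has been erased, the second segmenter cannot claim the same pixels again.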

The idea is to be able to segment tiny regions of the brain, such as necrosis, with good performance.

Paper available here.

Pose Estimation

The second subject we wanted to detail is pose estimation using only RGB and a 3D model. Pose estimation is important for us at Scortex because it enables us to position each defect / defect detection in a common frame of reference for human verification.

Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image

The paper presents an extension of the Mask R-CNN network that also outputs the pose.

That is (cf. below), for each anchor box you predict the class, box, segmentation mask and pose coefficients.

It runs at 10 fps on 480×640 pixel images.

They use a Lie algebra representation for the pose to be regressed. This is the main difference with other pose estimation papers from RGB images only.
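
For readers not familiar with this representation, here is a small sketch of how a regressed 3-vector of so(3) maps to a rotation matrix through the exponential map (Rodrigues' formula). We assume the network outputs an axis-angle style vector; how the paper handles the translation is not covered here.

```python
import numpy as np

def so3_exp(omega):
    # omega: 3-vector, direction = rotation axis, norm = rotation angle (radians).
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)
    k = omega / theta
    # Skew-symmetric matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A quarter turn around the z axis.
R = so3_exp([0.0, 0.0, np.pi / 2])
```

The 3 regressed numbers are unconstrained, yet the output is always a valid rotation matrix, which is what makes this parametrisation convenient for regression.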

One example to be compared with would be: “Fast Single Shot Detection and Pose Estimation”.

The training was done on the LINEMOD dataset. For more information, here is the paper.

DeepIM: Deep Iterative Matching for 6D Pose Estimation


• Methods that regress the pose directly from images (example above) have limited accuracy.
• Matching rendered images of an object against an observed image can produce accurate results.
→ let’s iteratively refine a pose by matching a rendered image with the observed image. This leverages the additional 3D model knowledge.


1. Inputs:
  • the 3D model (CAD)
  • the image
  • an initial pose estimation (typically obtained via direct pose regression)
2. The initial pose is used to render the 3D model in this pose.
3. A neural network takes the image and the rendered scene as input and outputs the delta of the pose.
4. The pose estimation is updated with the delta of the pose.
Repeat steps 2 to 4 until convergence.
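
The loop described above can be sketched as follows. `render`, `predict_delta` and `apply_delta` are hypothetical callables standing in for the renderer, the network and the pose update; the stopping rule on the norm of the delta is also our assumption.

```python
import numpy as np

def refine_pose(image, model_3d, initial_pose, render, predict_delta,
                apply_delta, max_iters=10, tol=1e-3):
    pose = initial_pose
    for _ in range(max_iters):
        rendered = render(model_3d, pose)          # step 2
        delta = predict_delta(image, rendered)     # step 3
        pose = apply_delta(pose, delta)            # step 4
        if np.linalg.norm(delta) < tol:            # "until convergence"
            break
    return pose

# Toy stand-ins: the "image" directly encodes the true pose as a vector and
# the "network" recovers half of the remaining error at each iteration.
true_pose = np.array([1.0, 2.0])
estimate = refine_pose(
    image=true_pose,
    model_3d=None,
    initial_pose=np.zeros(2),
    render=lambda model, pose: pose,
    predict_delta=lambda image, rendered: 0.5 * (image - rendered),
    apply_delta=lambda pose, delta: pose + delta,
    max_iters=50,
)
```

Even with an imperfect delta predictor, repeating the render / compare / update cycle brings the estimate close to the true pose.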

Training of the network

What’s left to explain is how they get this delta pose network.

Since training would require images annotated with 6D poses, the network is trained using synthetic data only.

The backbone of the architecture is FlowNet, a network that takes two images and outputs the optical flow.

The network takes as input (concatenated in an 8-channel tensor):

• the image and its detection mask. During training, these are simulated images. During inference, it’s the image of interest
• the rendered image and its segmentation mask
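
The 8 channels are simply 2 × (3 colour channels + 1 mask channel). A minimal sketch, where the channel order and the channels-last layout are assumptions:

```python
import numpy as np

def build_network_input(image, image_mask, rendered, rendered_mask):
    # image, rendered: (H, W, 3) RGB arrays; masks: (H, W) arrays.
    return np.concatenate(
        [image, image_mask[..., None], rendered, rendered_mask[..., None]],
        axis=-1,
    )

h, w = 4, 5
network_input = build_network_input(
    image=np.zeros((h, w, 3)),
    image_mask=np.ones((h, w)),
    rendered=np.zeros((h, w, 3)),
    rendered_mask=np.ones((h, w)),
)
print(network_input.shape)  # (4, 5, 8)
```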

To make training more stable, the authors also propose to predict the segmentation mask of the first image as well as the optical flow.

For more information, check the paper (implementation can be found here).

Learning Implicit Representations of 3D Object Orientations from RGB

The original paper is here.

The authors propose a method to estimate the pose of an object. The method uses an encoder-decoder approach trained using only synthetic data from a 3D model.


At inference time:

• detect objects on an image using a standard detection algorithm (YOLO, RCNN, etc.)
• for each detection:
  • crop around the object
  • feed the crop to an encoder-decoder architecture
  • retrieve the bottleneck encoded values
  • compare them to the encoded values of a range of known poses, which the authors call the codex
→ the closest match gives the pose estimation

What does codex mean? The authors render images of the 3D model, centered and without background, while varying its 3D pose.

Then, they feed them to the auto encoder to get a codex of latent vectors.
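
The lookup in the codex is a nearest neighbour search in the latent space. The sketch below assumes cosine similarity as the matching criterion and uses a toy `encoder`; both are our assumptions for illustration.

```python
import numpy as np

def build_codex(encoder, rendered_views):
    # One latent vector per rendered view of the 3D model (one view per known pose).
    return np.stack([encoder(view) for view in rendered_views])

def nearest_pose(latent, codex, poses):
    # Cosine similarity between the test latent vector and every codex entry.
    codex_n = codex / np.linalg.norm(codex, axis=1, keepdims=True)
    z = latent / np.linalg.norm(latent)
    similarities = codex_n @ z
    best = int(np.argmax(similarities))
    return poses[best], float(similarities[best])

# Toy example: the "encoder" is the identity and the views are 2D vectors.
poses = ["front", "side", "back"]
views = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
codex = build_codex(lambda view: view, views)

pose, similarity = nearest_pose(np.array([0.1, 0.9]), codex, poses)
```

The finer the sampling of poses in the codex, the more precise the estimated orientation, at the cost of a larger search.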


The objective is to learn implicit representations of 3D object orientations.

Using only simulated data (collecting real pose data is very costly), the authors train the following auto encoder:

• the input is randomly rendered scenes (with variable poses, positions, background, lights, crops to simulate occlusions, …)
• while the output is the rendered part without any background and centered (same as the codex construction).
This way, each entry is re-mapped to the simplest rendering of the 3D model.

Here are some reconstruction examples from real images.

The authors say that the latent representations are robust against occlusions, clutter and the differences between synthetic and real data.

One of the key advantages of matching against the closest representation is that it prevents issues related to ambiguous object views (for example when an object contains symmetries).

To our knowledge, the implementation is not available yet on the internet.

Deep Model-Based 6D Pose Refinement in RGB

The goal of the authors is to propose a 6D pose estimation that

• only uses RGB images (and 3D model)
• is robust
• is precise
• is ambiguity free


The approach is similar to the DeepIM methodology:

• The inputs are:
  • the image
  • the 3D model
  • an estimation of the pose
• From the pose, we extract a cropped part of the image where the object lies
• From the 3D model, an image is rendered with the pose hypothesis
• Both of these images are fed into a neural network that predicts an update (quaternion and translation).
The network is trained using simulated/rendered images only.
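
To illustrate what applying such an update looks like, here is a small sketch that converts a predicted quaternion to a rotation matrix and composes it with the current pose. The (w, x, y, z) ordering, the left-multiplication and the additive translation are assumptions; the paper may use a different convention.

```python
import numpy as np

def quat_to_rot(q):
    # q = (w, x, y, z); it is normalised first, since a network output
    # is not guaranteed to be a unit quaternion.
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def apply_update(rotation, translation, q_delta, t_delta):
    # Assumed convention: the update rotation is applied on the left.
    return quat_to_rot(q_delta) @ rotation, translation + t_delta

# A quarter turn around z and a small shift along x, applied to the identity pose.
q_delta = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
new_R, new_t = apply_update(np.eye(3), np.zeros(3), q_delta, np.array([0.1, 0.0, 0.0]))
```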

Note that the authors use a model pretrained on ImageNet as the base feature extractor. This is to limit the gap between real images and the simulated data they use for training.

At the end of the talk, one of the authors showed a video where the inference is done in real time and the results are quite impressive!

Their implementation is available here. The paper is available here.


That’s it! We hope you enjoyed this quick summary as much as we enjoyed going to this conference.

The next ECCV will take place in 2020 in Edinburgh. See you there!