Specular-to-Diffuse Translation for Multi-View Reconstruction

European Conference on Computer Vision

Shihao Wu1    Hui Huang2*      Tiziano Portenier1      Matan Sela3      Daniel Cohen-Or4      Ron Kimmel3      Matthias Zwicker5

1University of Bern     2Shenzhen University     3Technion - Israel Institute of Technology    4Tel Aviv University    5University of Maryland

Fig. 1: Specular-to-diffuse translation of multi-view images. We show eleven views of a glossy object (top), and the specular-free images generated by our network (bottom).


Most multi-view 3D reconstruction algorithms, especially when shape-from-shading cues are used, assume that object appearance is predominantly diffuse. To alleviate this restriction, we introduce S2Dnet, a generative adversarial network for translating multiple views of objects with specular reflection into diffuse ones, so that multi-view reconstruction methods can be applied more effectively. Our network extends unsupervised image-to-image translation to multi-view “specular to diffuse” translation. To preserve object appearance across multiple views, we introduce a Multi-View Coherence loss (MVC) that evaluates the similarity and faithfulness of local patches after the view-transformation. Our MVC loss ensures that the similarity of local correspondences among multi-view images is preserved under the image-to-image translation. As a result, our network yields significantly better results than several single-view baseline techniques. In addition, we carefully design and generate a large synthetic training data set using physically-based rendering. During testing, our network takes only the raw glossy images as input, without extra information such as segmentation masks or lighting estimation. Results demonstrate that multi-view reconstruction can be significantly improved using the images filtered by our network. We also show promising performance on real-world training and testing data.

Fig. 2: Overview of S2Dnet. Two generators and two discriminators are trained simultaneously to learn cross-domain translations between the glossy and the diffuse domains. In each training iteration, the model randomly picks and forwards a real glossy and a real diffuse image sequence, computes the loss functions, and updates the model parameters.
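The two-generator setup in Fig. 2 can be illustrated with a toy cycle-consistency term. This is only a minimal sketch, assuming a cycleGAN-style reconstruction penalty, which the caption itself does not spell out. The names `l1`, `cycle_loss`, `clamp_highlights`, and `identity` are hypothetical, images are flat lists of intensities, and the "generators" are plain functions rather than trained networks.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(g_to_diffuse, g_to_glossy, glossy_seq, diffuse_seq):
    """Average reconstruction error after translating each view to the
    other domain and back again (assumed cycle-consistency term)."""
    total = 0.0
    for view in glossy_seq:
        # glossy -> diffuse -> glossy should reproduce the input view
        total += l1(g_to_glossy(g_to_diffuse(view)), view)
    for view in diffuse_seq:
        # diffuse -> glossy -> diffuse should reproduce the input view
        total += l1(g_to_diffuse(g_to_glossy(view)), view)
    return total / (len(glossy_seq) + len(diffuse_seq))


# Hypothetical stand-ins for the two generators.
def clamp_highlights(img):
    """Toy 'specular removal': clip bright pixels to 0.8."""
    return [min(p, 0.8) for p in img]


def identity(img):
    return list(img)


glossy_seq = [[0.2, 1.0], [0.4, 0.6]]   # two toy glossy views
diffuse_seq = [[0.2, 0.5]]              # one toy diffuse view

perfect = cycle_loss(identity, identity, glossy_seq, diffuse_seq)
lossy = cycle_loss(clamp_highlights, identity, glossy_seq, diffuse_seq)
```

In this toy setting, a pair of generators that invert each other gives zero loss, while the clamping generator is penalized because the clipped highlight cannot be recovered on the way back. The adversarial terms from the two discriminators are omitted here.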

Fig. 5: Illustration of the generator and discriminator network. The generator uses the U-net architecture and both input and output are a multi-view sequence consisting of three views. A random SIFT correspondence is sampled during training to compute the correspondence loss. The multi-scale joint discriminator examines three scales of the image sequence and two scales of corresponding local patches. The width and height of each rectangular block indicate the channel size and the spatial dimension of the output feature map, respectively.
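As a rough illustration of the inputs described in Fig. 5, the sketch below builds three image scales per view and crops patches at two sizes around one correspondence. It is a sketch under stated assumptions, not the released implementation: the helper names, the 2x2 average pooling, and the patch sizes `(4, 2)` are all hypothetical, and the correspondence is given by hand instead of being sampled from SIFT matches.

```python
def downsample2(img):
    """Halve the resolution by averaging non-overlapping 2x2 blocks.
    `img` is a list of rows with even height and width."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]


def pyramid(img, levels=3):
    """Return `levels` copies of the image, each half the size of the last."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample2(out[-1]))
    return out


def crop(img, row, col, size):
    """Crop a size x size patch centred near (row, col), shifted where
    necessary so that it stays inside the image."""
    h, w = len(img), len(img[0])
    top = min(max(row - size // 2, 0), h - size)
    left = min(max(col - size // 2, 0), w - size)
    return [r[left:left + size] for r in img[top:top + size]]


def discriminator_inputs(views, match, patch_sizes=(4, 2)):
    """views: three images; match: one (row, col) location per view,
    all assumed to show the same surface point."""
    image_scales = [pyramid(v, 3) for v in views]
    patches = [[crop(v, r, c, s) for v, (r, c) in zip(views, match)]
               for s in patch_sizes]
    return image_scales, patches


# Three toy 8x8 views and one hand-picked correspondence.
views = [[[float(r * 8 + c + k) for c in range(8)] for r in range(8)]
         for k in range(3)]
match = [(0, 0), (4, 4), (7, 7)]
image_scales, patches = discriminator_inputs(views, match)
```

Here `image_scales[v]` holds the 8x8, 4x4, and 2x2 versions of view `v`, and `patches[s][v]` holds the patch of size `patch_sizes[s]` from view `v`. A joint discriminator in the sense of the caption would consume all of these together.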

Fig. 6: Qualitative translation results on a synthetic input sequence consisting of 8 views. From top down: the glossy input sequence, the ground truth diffuse rendering, and the translation results for the baselines pix2pix and cycleGAN, and our S2Dnet. The output of pix2pix is generally blurry. The cycleGAN output, although sharp, lacks inter-view consistency. Our S2Dnet produces both crisp and coherent translations.

Fig. 7: Qualitative surface reconstruction results on 10 different scenes. From top to bottom: glossy input, ground truth diffuse renderings, cycleGAN translation outputs, our S2Dnet translation outputs, reconstructions from glossy images, reconstructions from ground truth diffuse images, reconstructions from cycleGAN output, and reconstructions from our S2Dnet output. All sequences are excluded from our training set, and the objects in columns 3 and 4 have not even been seen during training.

Fig. 8: Qualitative translation results on a real-world input sequence consisting of 11 views. The first row shows the glossy input sequence and the remaining rows show the translation results of pix2pix, cycleGAN, and our S2Dnet. All networks are trained on synthetic data only. Similar to the synthetic case, cycleGAN outperforms pix2pix, but it produces high-frequency artifacts that are not consistent along the views. Our S2Dnet is able to remove most of the specular effects and preserves all the geometric details in a consistent manner.

Fig. 10: a), b) A sample of our real-world dataset. c) Translation result of cycleGAN when trained from scratch on our real-world dataset. d) S2Dnet output, trained from scratch on our real-world dataset. e) S2Dnet output, trained on synthetic data only. f) S2Dnet output, trained on synthetic data and fine-tuned on real-world data.

Data & Code

Note that the DATA and CODE are free for Research and Education Use ONLY. 

Please cite our paper (BibTeX entry below) if you use any part of our ALGORITHM, CODE, DATA or RESULTS in any publication.



Link:  https://arxiv.org/abs/1807.05439


We thank the anonymous reviewers for their constructive comments. This work was supported in part by the Swiss National Science Foundation (169151), NSFC (61522213, 61761146002, 61861130365), 973 Program (2015CB352501), Guangdong Science and Technology Program (2015A030312015), ISF-NSFC Joint Research Program (2472/17) and Shenzhen Innovation Program (KQJSCX20170727101233642).

  title={Specular-to-Diffuse Translation for Multi-View Reconstruction},
  author={Shihao Wu and Hui Huang and Tiziano Portenier and Matan Sela and Daniel Cohen-Or and Ron Kimmel and Matthias Zwicker},




Downloads (faster for people in China)

Downloads (faster for people in other places)