A Large Scale Urban Scene Dataset and Simulator

{whatsevenlyl; fullcyxuc; hhzhiyan}@gmail.com

Visual Computing Research Center

Shenzhen University, China

Figure 1: Overview of UrbanScene3D.




The ability to perceive an environment in different ways is essential to robotics research. This involves analyzing various data sources, such as depth maps, visual images, and LiDAR data. Related work in the 2D/2.5D image domains [1, 2] has been proposed. However, a comprehensive understanding of 3D scenes requires 3D data (e.g., point clouds and textured polygon meshes), which is still far from sufficient in the community.

We present UrbanScene3D, a large-scale urban scene dataset with a handy simulator based on Unreal Engine 4 [3] and AirSim [4]. It consists of both man-made and reconstructed real-world scenes at different scales. The manually made scene models have compact structures and are carefully constructed by professional modelers according to images and maps of the target areas; see the first row of Figure 1. In contrast, UrbanScene3D also offers dense, detailed scene models reconstructed from aerial images through multi-view stereo (MVS) techniques. These scenes have realistic textures and meticulous structures; see, e.g., the second part of Figure 1. We have also released the original aerial images used to reconstruct the 3D scene models, as well as a set of 4K video sequences that facilitate the design of algorithms such as SLAM and MVS; see the samples in the third and fourth parts of Figure 1.

Although there are 3D instance segmentation datasets, e.g., S3DIS [7], ScanNet [6], NYUv2 [5], and SceneNN [8], they are all collected from indoor scenes and are insufficient for deep learning-based methods. Note that there is essentially no adequate data for learning 3D instance segmentation in outdoor scenes, especially in complicated urban regions. In this context, UrbanScene3D provides rich, large-scale 3D building annotations of urban scenes for outdoor instance segmentation research. To segment and label 3D urban architecture, we extract every individual building model from the entire urban scene. Each building model is then assigned a unique label, forming an instance segmentation map that serves as the ground truth for the instance segmentation task. The 3D ground-truth textured models with instance segmentation labels in UrbanScene3D allow users to obtain all kinds of data they need: instance segmentation maps, depth maps at arbitrary resolution, 3D point clouds/meshes of both visible and invisible regions, etc. In addition, with the help of AirSim [4], users can simulate robots (cars/drones) to test a variety of autonomous tasks in the proposed city environments; see, e.g., the bottom row of Figure 1.
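To illustrate how a rendered depth map relates to the 3D point clouds mentioned above, the following is a minimal sketch that back-projects a depth map into camera-frame points. It assumes an ideal pinhole camera with square pixels, a 90-degree horizontal field of view, and planar depth (distance along the optical axis) in metres. The function names `intrinsics_from_fov` and `depth_to_points` are hypothetical and are not part of UrbanScene3D or AirSim.

```python
import math


def intrinsics_from_fov(width, height, fov_deg=90.0):
    """Return pinhole intrinsics (fx, fy, cx, cy) from a horizontal FOV.

    Assumes square pixels and a principal point at the image centre.
    """
    fx = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return fx, fx, width / 2.0, height / 2.0


def depth_to_points(depth, fov_deg=90.0, max_depth=float("inf")):
    """Back-project a planar depth map into camera-frame 3D points.

    `depth` is a list of rows of depth values in metres.
    The camera frame used here is x right, y down, z forward.
    Pixels with non-positive depth or depth >= max_depth are skipped.
    """
    height = len(depth)
    width = len(depth[0])
    fx, fy, cx, cy = intrinsics_from_fov(width, height, fov_deg)
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if not (0.0 < z < max_depth):
                continue
            # Use the pixel centre (u + 0.5, v + 0.5) as the ray origin.
            x = (u + 0.5 - cx) * z / fx
            y = (v + 0.5 - cy) * z / fy
            points.append((x, y, z))
    return points
```

If the depth map instead stores the distance along each viewing ray (perspective depth), each value would first need to be converted to planar depth before applying this back-projection.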

Features and Requirements

We show the statistics and features of our dataset below.

Table 1: Statistics of our 3D urban dataset. We provide six virtual scenes with CAD models constructed by professional artists according to images or satellite maps. In addition, we share five reconstructed real-world scenes, including their aerial images and detailed mesh models. All models are annotated with building-level instance segmentation. Area (km²) is the area covered by the scene; Model (Mb) is the size of the model; Texture (#) is the number of texture images in the model; Texture (Mb) is the size of the textures; Object (#) is the number of objects in the scene.


Table 2: Features of the data sources of different datasets. Compared to existing datasets, UrbanScene3D has both CAD and detailed models, as well as the corresponding 2D/3D original data.


The released zip file on our website contains the Unreal project of the urban scenes described above. Users can use either pure Unreal Engine or the AirSim client (in C++ or Python) to capture the data they need. The ground-truth textured meshes and their poses are also provided in the Unreal project.

Required packages:
  • Unreal Engine 4 (4.24 is recommended)
  • C++ or Python
  • AirSim (optional)
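As a rough illustration of capturing data through the AirSim Python client, the sketch below requests an RGB image, a segmentation image, and a floating-point depth map from one camera and writes them to disk. It assumes the `airsim` Python package is installed and the Unreal project is already running with AirSim enabled. The camera name "0", the output directory, and the file naming scheme in `frame_paths` are illustrative assumptions, not conventions of the dataset.

```python
import os


def frame_paths(out_dir, index):
    """Hypothetical naming scheme for the files of one captured frame."""
    stem = os.path.join(out_dir, f"{index:06d}")
    return {
        "rgb": stem + "_rgb.png",
        "seg": stem + "_seg.png",
        "depth": stem + "_depth.pfm",
    }


def capture_frame(client, out_dir, index, camera="0"):
    """Grab RGB, segmentation and depth from one camera and save them."""
    import airsim  # imported lazily so frame_paths works without AirSim

    responses = client.simGetImages([
        # Compressed PNG of the rendered scene.
        airsim.ImageRequest(camera, airsim.ImageType.Scene),
        # Compressed PNG of the segmentation view.
        airsim.ImageRequest(camera, airsim.ImageType.Segmentation),
        # Uncompressed float depth (distance along each viewing ray).
        airsim.ImageRequest(camera, airsim.ImageType.DepthPerspective, True),
    ])
    paths = frame_paths(out_dir, index)
    airsim.write_file(paths["rgb"], responses[0].image_data_uint8)
    airsim.write_file(paths["seg"], responses[1].image_data_uint8)
    airsim.write_pfm(paths["depth"], airsim.get_pfm_array(responses[2]))
    return paths


def main():
    import airsim

    client = airsim.VehicleClient()
    client.confirmConnection()  # fails if the simulator is not running
    os.makedirs("captures", exist_ok=True)
    print(capture_frame(client, "captures", 0))
```

Calling `main()` while the simulator is running should write three files under `captures/`. The same requests can be issued from a `MultirotorClient` or `CarClient` when a vehicle is being simulated.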

Advantages and Applications

UrbanScene3D offers not only CAD models with sharp edges and compact primitive structures for man-made virtual environments, but also reconstructed mesh models with detailed, realistic features for real-world urban scenes. The videos below tour different scenes to demonstrate the corresponding CAD and detailed models.

In addition, UrbanScene3D releases the raw acquisition data, including high-resolution aerial images (6000x4000) and aerial videos (4K), along with their poses and detailed reconstructed 3D models; see the video samples below. These data can be used to train and test algorithms such as SLAM, MVS, or single-view reconstruction in the wild.

Moreover, based on the UrbanScene3D simulator, users can design and test autonomous driving and flying tasks in various environments. Below we demonstrate some popular applications of our simulator.

Video 1: Touring a virtual city scene. Note that users can obtain the corresponding CAD models associated with instance segmentation.

Video 2: Touring the real-world scenes. Note that these scenes are reconstructed from real-world aerial images, so they have detailed geometry, structures, and realistic textures. The instance segmentation is available as well.

Video 3: A real-world model reconstructed with aerial images captured from optimized 3D views.

Video 4: A sample of aerial videos aimed at 3D urban scene acquisition.

Video 5: Applications of our dataset and simulator, e.g., autonomous driving, drone navigation, and intelligent aerial acquisition.
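As one illustration of the drone navigation and aerial acquisition use cases listed above, the following sketch generates a simple back-and-forth (lawnmower) sweep over a rectangular area and flies it with the AirSim multirotor client. It is not the acquisition planner used for the dataset. The helper `lawnmower_waypoints`, the area bounds, the line spacing, the altitude, and the speed are all hypothetical. Coordinates follow AirSim's NED convention, in which a negative z is above the start point.

```python
def lawnmower_waypoints(x_min, x_max, y_min, y_max, spacing, altitude):
    """Return a back-and-forth sweep as a list of (x, y, z) NED waypoints.

    Sweep lines run along x and are `spacing` metres apart in y.
    z is set to -altitude because NED z points downward.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    waypoints = []
    y = y_min
    forward = True
    while y <= y_max + 1e-9:
        start, end = (x_min, x_max) if forward else (x_max, x_min)
        waypoints.append((start, y, -altitude))
        waypoints.append((end, y, -altitude))
        forward = not forward  # reverse direction on the next line
        y += spacing
    return waypoints


def fly(waypoints, speed=5.0):
    """Fly the waypoints in order with an AirSim multirotor."""
    import airsim  # imported lazily so the planner works without AirSim

    client = airsim.MultirotorClient()
    client.confirmConnection()
    client.enableApiControl(True)
    client.armDisarm(True)
    client.takeoffAsync().join()
    for x, y, z in waypoints:
        client.moveToPositionAsync(x, y, z, speed).join()
    client.landAsync().join()
    client.armDisarm(False)
    client.enableApiControl(False)
```

For example, `fly(lawnmower_waypoints(0, 100, 0, 100, 20, 50))` would sweep a 100 m by 100 m area at 50 m above the start point, provided the simulator is running in multirotor mode.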

Download (70G)

Note that all DATA and CODE are free for Research and Education Use ONLY.
Please cite our paper (see the BibTeX below) if you use any part of our ALGORITHM, CODE, DATA or RESULTS in any publication.

  • Download via FTP
  • Download via HTTP
  • Download via Baidu Netdisk (code: vccc)
  • Download via Dropbox (faster outside China)
  • (On Baidu Netdisk and Dropbox, the archive is split into multiple volumes. Please download all the volumes and then extract the main archive.)

  • Dataset Description Download via arXiv
  • Github Page for issues and comments


This work was supported in part by NSFC (U2001206), Guangdong Talent Program (2019JC05X328), Guangdong Science and Technology Program (2020A0505100064, 2015A030312015), DEGP Key Project (2018KZDXM058), and Shenzhen Science and Technology Program (RCJC20200714114435012).


title={UrbanScene3D: A Large Scale Urban Scene Dataset and Simulator},
author={Yilin Liu and Fuyou Xue and Hui Huang},


  1. T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. Euro. Conf. on Computer Vision, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., 2014, pp. 740–755.
  2. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proc. IEEE Conf. on Computer Vision & Pattern Recognition, 2012, pp. 3354–3361.
  3. Epic Games, “Unreal Engine.” [Online]. Available: https://www.unrealengine.com
  4. S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017.
  5. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” Proc. Euro. Conf. on Computer Vision, vol. 7576 LNCS, no. PART 5, pp. 746–760, 2012.
  6. A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pp. 2432–2443, 2017.
  7. I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3D Semantic Parsing of Large-Scale Indoor Spaces,” Proc. IEEE Conf. on Computer Vision & Pattern Recognition, pp. 1534–1543, 2016.
  8. B. S. Hua, Q. H. Pham, D. T. Nguyen, M. K. Tran, L. F. Yu, and S. K. Yeung, “SceneNN: A scene meshes dataset with aNNotations,” Proc. Int. Conf. on 3D Vision, pp. 92–101, 2016.
  9. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. on Computer Vision & Pattern Recognition, 2016, pp. 3213–3223.
  10. X. Song, P. Wang, D. Zhou, R. Zhu, C. Guan, Y. Dai, H. Su, H. Li, and R. Yang, “ApolloCar3D: A large 3D car instance understanding benchmark for autonomous driving,” in Proc. IEEE Conf. on Computer Vision & Pattern Recognition, 2019, pp. 5447–5457.
  11. Y. Zhou, J. Huang, X. Dai, S. Liu, L. Luo, Z. Chen, and Y. Ma, “HoliCity: A City-Scale Data Platform for Learning Holistic 3D Structures,” 2020.