The PASCAL VOC and ImageNet ILSVRC challenges have enabled significant progress in object recognition over the past decade. Beginning with CVPR 2015, we borrowed this mechanism to accelerate progress in scene understanding via the LSUN workshop. Complementary to the object-centric ImageNet ILSVRC Challenge hosted at ICCV/ECCV every year, we propose to continue hosting this scene-centric challenge at CVPR every year. Our challenge will focus on major tasks in scene understanding, including scene object retrieval, outdoor scene segmentation, RGB-D 3D object detection and saliency prediction. Inspired by recent successes of data-driven methods such as deep learning, we focus on providing benchmarks that are significantly bigger and more diverse than the existing ones, to support training these data-hungry algorithms. By providing a set of large-scale benchmarks in an annual challenge format, we expect significant progress in scene understanding to continue in the coming years. Based on the experience of our previous workshops, we are updating all of our existing tasks and rolling out new tasks.

Scene Classification

In this task, an algorithm needs to report the single most likely scene category for each image. Preliminary data can now be downloaded from the links below; the final training set may differ. Besides the training set, we also provide 300 images per category for validation. The testing set contains 1,000 images per category. The data can be downloaded with the provided script. Please check the README for documentation and demo code. Contact Fisher Yu to request original images or with other questions. The submission deadline is July 15, 2017.

LSUN Dataset

More information about the LSUN dataset can be found at the project webpage.

Category           Training set                 Validation set
Bedroom            3,033,042 images (43 GB)     300 images
Bridge             818,687 images (16 GB)       300 images
Church Outdoor     126,227 images (2.3 GB)      300 images
Classroom          168,103 images (3.1 GB)      300 images
Conference Room    229,069 images (3.8 GB)      300 images
Dining Room        657,571 images (11 GB)       300 images
Kitchen            2,212,277 images (34 GB)     300 images
Living Room        1,315,802 images (22 GB)     300 images
Restaurant         626,331 images (13 GB)       300 images
Tower              708,264 images (12 GB)       300 images

Testing Set        10,000 images (173 MB)
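
Since each image receives exactly one predicted category in the scene classification task, a natural way to score a submission is top-1 accuracy. The sketch below is only an illustration under that assumption: the page does not state the official scoring rule, and the image IDs, label spellings, and the function name top1_accuracy are hypothetical.

```python
from typing import Mapping

# The ten scene categories from the dataset listing.
# The lowercase, underscore-separated spelling is an assumption made for this sketch.
CATEGORIES = [
    "bedroom", "bridge", "church_outdoor", "classroom", "conference_room",
    "dining_room", "kitchen", "living_room", "restaurant", "tower",
]


def top1_accuracy(predictions: Mapping[str, str],
                  ground_truth: Mapping[str, str]) -> float:
    """Fraction of ground-truth images whose predicted category is correct.

    Images without a prediction count as errors.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one image")
    correct = sum(
        1 for image_id, label in ground_truth.items()
        if predictions.get(image_id) == label
    )
    return correct / len(ground_truth)


# Hypothetical image IDs, for illustration only.
truth = {"img_001": "bedroom", "img_002": "kitchen",
         "img_003": "tower", "img_004": "bridge"}
preds = {"img_001": "bedroom", "img_002": "dining_room", "img_003": "tower"}

score = top1_accuracy(preds, truth)
print(f"top-1 accuracy: {score:.2f}")  # 2 of 4 correct
```

Counting a missing prediction as an error keeps the denominator fixed at the size of the evaluation set, so a submission cannot raise its score by skipping images.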

Segmentation Task on Street Images

This task comprises two separate challenges based on the novel Mapillary Vistas Dataset: semantic image segmentation and instance-specific semantic image segmentation of street-level images. The Mapillary Vistas Research edition contains 25,000 densely annotated street-level images (66 object classes, pixel-accurate, polygon-based annotations, with instance-specific object annotations for 37 categories), featuring locations from all around the world. The image data visually covers parts of Europe, North and South America, Asia and Australia and consequently spans a broad range of object appearances. Performance is assessed with commonly used metrics: average intersection-over-union scores for pixel-level segmentation and average precision for instance-specific segmentation. We expect strong interest from the object recognition community and hope these challenges will push the boundaries of state-of-the-art models. Participation details for these challenges can be found at
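
The average intersection-over-union score mentioned above for pixel-level segmentation can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function name mean_iou, the flattened label lists, and the optional ignore_label parameter are assumptions, and the challenge toolkit may accumulate counts or handle unlabeled pixels differently.

```python
from typing import Optional, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_label: Optional[int] = None) -> float:
    """Mean per-class IoU over flattened per-pixel class labels.

    Classes absent from both prediction and ground truth are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as c
    target_count = [0] * num_classes   # pixels labeled as c

    for p, t in zip(pred, target):
        if ignore_label is not None and t == ignore_label:
            continue  # pixel excluded from evaluation
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six pixels, three classes.
target_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
miou = mean_iou(pred_labels, target_labels, num_classes=3)
print(f"mean IoU: {miou:.3f}")
```

For a dataset-level score, the three count lists would typically be accumulated over all images before the per-class division, rather than averaging per-image scores.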

Saliency Prediction

In this task, an algorithm needs to predict where humans look in a scene. The SALICON dataset (based on mouse tracking) is provided. An evaluation toolkit in both Matlab and Python will be released with the benchmark. This challenge is co-hosted with the UMN VIP Lab.

SALICON

The data was collected by the UMN VIP Lab on Amazon Mechanical Turk, using mouse cursor tracking in a new psychophysical paradigm. All the images are from the MS COCO dataset. For each image, we provide the image content in JPG format, the image resolution, and the ground truth (mouse trajectory, fixation points, and saliency mask; for training and validation sets only). Please refer to the SALICON page for more details.


July 26, 2017, Room 304 AB

13:30 - 13:35    Welcome                                                           Fisher Yu
13:35 - 14:05    Keynote Talk                                                      Prof. Hao Jiang
14:05 - 14:25    Classification Winner Talk - Deep Pyramidal Residual Networks     Jiwhan Kim
14:25 - 14:40    Mapillary Scene Parsing Task                                      Peter Kontschieder
14:40 - 15:10    Keynote Talk                                                      Prof. Devi Parikh
15:10 - 15:30    Contributed Talk                                                  Scene Parsing Winner
15:30 - 16:15    Coffee break
16:15 - 16:20    Saliency Detection Task                                           Qi Zhao
16:20 - 16:40    Saliency Winner Talk 1                                            Prof. Roberto Vezzani
16:40 - 17:00    Saliency Winner Talk 2                                            Samuel Dodge


  • Fisher Yu - Princeton University
  • Peter Kontschieder - Mapillary
  • Shuran Song - Princeton University
  • Ming Jiang - University of Minnesota, Twin Cities
  • Yinda Zhang - Princeton University
  • Catherine Qi Zhao - University of Minnesota, Twin Cities
  • Thomas Funkhouser - Princeton University
  • Jianxiong Xiao - AutoX, Inc.