The PASCAL VOC and ImageNet ILSVRC challenges have enabled significant progress in object recognition over the past decade. We plan to borrow this mechanism to speed up progress in scene understanding as well. Complementary to the object-centric ImageNet ILSVRC Challenge hosted at ICCV/ECCV every year, we are hosting a scene-centric challenge at CVPR every year. Our challenge focuses on four major tasks in scene understanding: scene classification, saliency prediction, room layout estimation, and caption generation (hosted by MS COCO). Inspired by the recent success of approaches that rely on big data, such as deep learning, we focus on providing benchmarks that are at least several times bigger than the existing ones, to support training these data-hungry algorithms. By providing a set of large-scale benchmarks in an annual challenge format, we expect significant progress in scene understanding in the coming years. Details of last year's challenge can be found at LSUN 2015.
Submission: The details of each task and its submission format are provided below. You can submit results once every 5 days, and the best-performing submission from each team will appear in the final ranking. Please email the results and the completed submission form, with "LSUN2016" in the subject line, to firstname.lastname@example.org. For the classification task, you can attach the text file containing the results to the email. For the other tasks, please upload the results to cloud storage such as Dropbox and send us the download link, because the submission files can be large.
Results: The challenge results are listed in our leaderboard.
In this task, an algorithm needs to report the single most likely scene category for each image. A preliminary release of the data can be downloaded from the links below; the final training set may differ. Besides the training set, we also provide 300 images per category for validation. The testing set contains 1,000 images per category.
The data can be downloaded with the provided script.
Please check the README for documentation and demo code. Contact Fisher Yu to request the original images or with other questions.
LSUN Dataset: More information about the LSUN dataset can be found on the project webpage, lsun.yf.io.
|Category||Training Set||Validation Set|
|Bedroom||3,033,042 images (43 GB)||300 images|
|Bridge||818,687 images (16 GB)||300 images|
|Church Outdoor||126,227 images (2.3 GB)||300 images|
|Classroom||168,103 images (3.1 GB)||300 images|
|Conference Room||229,069 images (3.8 GB)||300 images|
|Dining Room||657,571 images (11 GB)||300 images|
|Kitchen||2,212,277 images (34 GB)||300 images|
|Living Room||1,315,802 images (22 GB)||300 images|
|Restaurant||626,331 images (13 GB)||300 images|
|Tower||708,264 images (12 GB)||300 images|
|Testing Set||10,000 images (173 MB)|
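For the scene classification task, picking the single most likely category per image and scoring it against ground truth can be sketched as below. This is a minimal illustration, not the organizers' evaluation code: the category identifiers are assumed spellings of the names in the table, and the function names and the per-image score layout are hypothetical.

```python
# Category names follow the table above; the exact identifier spellings
# are an assumption made for this sketch.
CATEGORIES = [
    "bedroom", "bridge", "church_outdoor", "classroom", "conference_room",
    "dining_room", "kitchen", "living_room", "restaurant", "tower",
]


def top1_predictions(scores):
    """Return the index of the highest-scoring category for each image.

    `scores` is a list with one row per image, each row holding one
    score per category (hypothetical layout).
    """
    predictions = []
    for row in scores:
        if len(row) != len(CATEGORIES):
            raise ValueError("expected one score per category")
        predictions.append(max(range(len(row)), key=lambda i: row[i]))
    return predictions


def top1_accuracy(predictions, labels):
    """Fraction of images whose predicted category matches the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)
```

With ten categories, 300 validation images per category give 3,000 validation images, and 1,000 testing images per category give the 10,000 testing images listed in the table.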
In this task, an algorithm needs to predict where humans look in a scene. Two datasets are provided: iSUN (eye-tracking based) and SALICON (mouse-tracking based). All submissions will be evaluated on each dataset separately, and there will be a winner for each dataset. An evaluation toolkit in both Matlab and Python will be released with the benchmark. This task is co-hosted with NUS VIP Lab and Bethge Lab.
iSUN: The data is collected by webcam-based gaze tracking on Amazon Mechanical Turk. All our images are from the SUN database. For each image, we provide the image content in JPG, the image resolution, the scene category, and the ground truth (gaze trajectory, fixation points, and saliency mask; for the training and validation sets only). Please refer to the iSUN project page for more details about how this data is collected.
SALICON: The data is collected by NUS VIP Lab via mouse cursor tracking on Amazon Mechanical Turk, using a new psychophysical paradigm. All the images are from the MS COCO dataset. For each image, we provide the image content in JPG, the image resolution, and the ground truth (mouse trajectory, fixation points, and saliency mask; for the training and validation sets only). Please refer to the SALICON page for more details.
iSUN data:
|Training Set (6000 images)||Image List and Labels|
|Validation Set (926 images)||Image List and Labels|
|Testing Set (2000 images)||Image List|
|Fixation Ground Truth||Zip File|
|Saliency Map Ground Truth||Zip File (12GB)|
|All Images in JPG||Zip File (2GB)|
SALICON data:
|Training Set (10000 images)||Image List and Labels|
|Validation Set (5000 images)||Image List and Labels|
|Testing Set (5000 images)||Image List|
|Fixation Ground Truth||Zip File|
|Saliency Map Ground Truth||Zip File (19GB)|
|All Images in JPG||Zip File (3GB)|
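To illustrate how a predicted saliency map might be compared against the fixation points in the ground truth, here is a minimal sketch of Normalized Scanpath Saliency (NSS), a commonly used saliency metric. The text above does not say which metrics the released toolkit computes, so the choice of NSS, the function name, and the (row, column) fixation format are all assumptions of this example.

```python
import numpy as np


def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency (illustrative sketch).

    saliency_map: 2D array of predicted saliency values.
    fixations: iterable of (row, col) pixel coordinates (assumed format).
    Returns the mean z-scored saliency value at the fixated pixels.
    """
    s = np.asarray(saliency_map, dtype=np.float64)
    fixations = list(fixations)
    if not fixations:
        raise ValueError("at least one fixation is required")
    std = s.std()
    if std == 0:
        # A constant map carries no information about fixation locations.
        return 0.0
    z = (s - s.mean()) / std
    rows, cols = zip(*fixations)
    return float(z[list(rows), list(cols)].mean())
```

A higher value means the prediction assigns above-average saliency to the locations people actually looked at; a value near zero means it does no better than a constant map.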
In this task, an algorithm needs to estimate the room layout from a single indoor scene image. All the images are indoor scenes from the SUN database and our LSUN scene classification database. We assume that a room shown in an image can be represented by part of a 3D box. Room layout estimation is therefore formulated as predicting the positions of the intersections between the planar walls, ceiling and floor. There are 4000 images for training, 394 images for validation and 1000 images for testing. All the images have a valid room layout that can be clearly annotated by a human. The annotation was done in-house by the organizers from the Princeton Vision Group. For each image, we provide the image content, the scene category and the room layout annotation (for the training and validation sets only). There are eight scene categories in our dataset: bedroom, hotel room, dining room, dinette home, living room, office, conference room and classroom. The scene categories for the images in the testing set are also provided. A Matlab toolkit is provided for visualization and evaluation.
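Since the layout is described by the positions where walls, ceiling and floor intersect, one plausible way to score a prediction is the average distance between predicted and annotated corner points, normalized by the image diagonal. The sketch below is only an illustration under that assumption: it is not the provided Matlab toolkit, the function name is hypothetical, and it assumes the predicted corners are already listed in the same order as the annotated ones.

```python
import math


def corner_error(predicted, ground_truth, width, height):
    """Mean corner distance normalized by the image diagonal (sketch).

    predicted, ground_truth: lists of (x, y) pixel coordinates, assumed
    to be in matching order.
    width, height: image size in pixels.
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("corner lists must be non-empty and the same length")
    diagonal = math.hypot(width, height)
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )
    return total / (len(ground_truth) * diagonal)
```

Normalizing by the diagonal keeps the score comparable across images of different resolutions; a perfect prediction scores 0.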
June 26, 2016 at Augustus I - II
|13:30 - 13:35||Welcome||Jianxiong Xiao|
|13:35 - 13:50||Introduction to LSUN dataset and classification task||Fisher Yu|
|13:50 - 14:05||Classification winner talk 1||Bowen Zhang (Team SIAT-MMLAB)|
|14:05 - 14:20||Classification winner talk 2||Wen-Sheng Chu (Team SJTU-ReadSense)|
|14:20 - 14:25||Introduction to saliency prediction task||Yinda Zhang|
|14:25 - 14:35||Saliency evaluation and toolkit||Matthias Kümmerer|
|14:35 - 14:50||Saliency winner talk||Srinivas Kruthiventi (Team VAL)|
|14:50 - 14:55||Introduction to room layout task||Yinda Zhang|
|14:55 - 15:10||Room layout winner talk||Yuzhuo Ren (Team CF)|
|15:10 - 15:45||Coffee break|
|15:45 - 16:15||Keynote Talk||Jitendra Malik|
|16:15 - 16:45||Keynote Talk||Yann LeCun|
|16:45 - 16:50||Award Session|
|16:50 - 17:00||Closing remarks|