The PASCAL VOC and ImageNet ILSVRC challenges have enabled significant progress in object recognition over the past decade. We plan to borrow this mechanism to accelerate progress in scene understanding as well. Complementary to the object-centric ImageNet ILSVRC challenge hosted at ICCV/ECCV every year, we are hosting an annual scene-centric challenge at CVPR. Our challenge focuses on four major tasks in scene understanding: scene classification, saliency prediction, room layout estimation, and caption generation (hosted by MS COCO). Inspired by the recent success of data-hungry methods such as deep learning, we focus on providing benchmarks that are at least several times larger than existing ones, to support training such algorithms. By providing a set of large-scale benchmarks in an annual challenge format, we expect significant progress in scene understanding in the coming years.
In this task, an algorithm needs to report the single most likely scene category for each image. You can now download the preliminary release of the data from the links below. Besides the training set, we also provide 300 images per category for validation, and there are 1,000 images per category in the testing set. The data can be downloaded with the provided script; please check the README for documentation and demo code. Contact Fisher Yu to request the original images or with other questions. A minimal scoring sketch follows the data table below.
| Category | Training Set | Validation Set |
|---|---|---|
| Bedroom | 3,033,042 images (43 GB) | 300 images |
| Bridge | 818,687 images (16 GB) | 300 images |
| Church Outdoor | 126,227 images (2.3 GB) | 300 images |
| Classroom | 168,103 images (3.1 GB) | 300 images |
| Conference Room | 229,069 images (3.8 GB) | 300 images |
| Dining Room | 657,571 images (11 GB) | 300 images |
| Kitchen | 2,212,277 images (34 GB) | 300 images |
| Living Room | 1,315,802 images (22 GB) | 300 images |
| Restaurant | 626,331 images (13 GB) | 300 images |
| Tower | 708,264 images (12 GB) | 300 images |
| Testing Set (all categories) | 10,000 images (173 MB) | |
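Since each image receives a single predicted category, the natural score is top-1 accuracy. Below is a minimal, illustrative Python sketch, assuming predictions and ground truth are dicts keyed by a hypothetical image ID; the authoritative submission format and scoring code are defined by the development kit.

```python
# Minimal sketch of top-1 accuracy scoring for the classification task.
# Assumes predictions and ground truth are dicts mapping image ID -> label;
# the official submission format is defined by the development kit.

def top1_accuracy(predictions, ground_truth):
    """Fraction of images whose predicted category matches the label."""
    correct = sum(
        1 for key, label in ground_truth.items()
        if predictions.get(key) == label
    )
    return correct / len(ground_truth)

if __name__ == "__main__":
    # Hypothetical image IDs and labels, for illustration only.
    gt = {"img_0001": "bedroom", "img_0002": "kitchen", "img_0003": "tower"}
    pred = {"img_0001": "bedroom", "img_0002": "living_room", "img_0003": "tower"}
    print(f"Top-1 accuracy: {top1_accuracy(pred, gt):.3f}")  # 0.667
```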
In this task, an algorithm needs to predict where humans look in a scene. Two datasets are provided: iSUN (eye-tracking based) and SALICON (mouse-tracking based). All submissions will be evaluated on both datasets separately, and we will name a winner for each dataset. A Matlab toolkit is provided for evaluation. This task is co-hosted by the NUS VIP Lab. An illustrative metric sketch follows the download tables below.
iSUN: The data is collected by gaze tracking on Amazon Mechanical Turk using a webcam. All the images are from the SUN database. For each image, we provide the image content in JPG, the image resolution, the scene category, and the ground truth (including gaze trajectory, fixation points, and saliency mask, for the training and validation sets only). Please refer to the iSUN project page for more details about how this data was collected.
SALICON: The data is collected via mouse-cursor tracking in a new psychophysical paradigm on Amazon Mechanical Turk by the NUS VIP Lab. All the images are from the MS COCO dataset. For each image, we provide the image content in JPG, the image resolution, and the ground truth (including mouse trajectory, fixation points, and saliency mask, for the training and validation sets only). Please refer to the SALICON page for more details.
| iSUN | Download |
|---|---|
| Training Set (6,000 images) | Image List and Labels |
| Validation Set (926 images) | Image List and Labels |
| Testing Set (2,000 images) | Image List |
| Fixation Ground Truth | Zip File |
| Saliency Map Ground Truth | Zip File (12 GB) |
| All Images in JPG | Zip File (2 GB) |

| SALICON | Download |
|---|---|
| Training Set (10,000 images) | Image List and Labels |
| Validation Set (5,000 images) | Image List and Labels |
| Testing Set (5,000 images) | Image List |
| Fixation Ground Truth | Zip File |
| Saliency Map Ground Truth | Zip File (19 GB) |
| All Images in JPG | Zip File (3 GB) |
| Evaluation Toolkit | Matlab Toolkit |
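The official scores come from the Matlab toolkit above. As an illustration only, the following Python/NumPy sketch computes one widely used saliency metric, the linear correlation coefficient (CC) between a predicted saliency map and a ground-truth map; the toolkit's actual metric suite may differ.

```python
import numpy as np

# Illustrative sketch of one common saliency metric: the linear correlation
# coefficient (CC) between a predicted map and the ground-truth map.
# This is not the official scoring code, only a NumPy approximation of
# that style of comparison.

def correlation_coefficient(pred, gt):
    """Pearson correlation between two saliency maps of the same shape."""
    pred = (pred - pred.mean()) / (pred.std() + 1e-8)
    gt = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((pred * gt).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((480, 640))                # stand-in for a ground-truth map
    pred = gt + 0.1 * rng.random((480, 640))   # a noisy "prediction"
    print(f"CC: {correlation_coefficient(pred, gt):.3f}")
```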
In this task, an algorithm needs to estimate the room layout from a single indoor scene image. All the images are indoor scenes from the SUN database and our LSUN scene classification database. We assume that the room shown in an image can be represented by part of a 3D box, so room layout estimation is formulated as predicting the positions of the intersections between the planar walls, ceiling, and floor. There are 4,000 images for training, 394 images for validation, and 1,000 images for testing. All the images have a valid room layout that can be clearly annotated by a human. The annotation was done in-house by the organizers from the Princeton Vision Group. For each image, we provide the image content, the scene category, and the room layout annotation (for the training and validation sets only). There are eight scene categories in our dataset: bedroom, hotel room, dining room, dinette home, living room, office, conference room, and classroom. The scene categories for the images in the testing set are also provided. A Matlab toolkit is provided for visualization and evaluation.
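For intuition, the sketch below shows one plausible way to score a layout prediction, assuming a layout is represented by the 2D image coordinates of the wall/ceiling/floor intersection corners: the mean Euclidean distance between predicted and annotated corners, normalized by the image diagonal. The authoritative representation and metric are defined by the Matlab toolkit.

```python
import numpy as np

# Sketch of a corner-distance error for room layout, assuming a layout is
# given as 2D corner positions in corresponding order. This only illustrates
# the idea; the official metric and formats are in the Matlab toolkit.

def corner_error(pred_corners, gt_corners, image_size):
    """Mean Euclidean corner distance, normalized by the image diagonal."""
    pred = np.asarray(pred_corners, dtype=float)
    gt = np.asarray(gt_corners, dtype=float)
    diagonal = np.hypot(*image_size)  # length of the image diagonal
    return float(np.linalg.norm(pred - gt, axis=1).mean() / diagonal)

if __name__ == "__main__":
    # Hypothetical corners for a 640x480 image, for illustration only.
    gt = [(100, 80), (520, 90), (110, 400), (530, 410)]     # annotated corners
    pred = [(105, 85), (515, 95), (120, 395), (525, 415)]   # predicted corners
    print(f"Normalized corner error: {corner_error(pred, gt, (640, 480)):.4f}")
```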
| Time | Program |
|---|---|
| 8:45 - 9:05 | Introduction, Speakers: Fisher Yu, Yinda Zhang |
| 9:05 - 9:20 | "Google" team, Speaker: Christian Szegedy |
| 9:20 - 9:35 | "isia_ICT" team, Speaker: Luis Herranz |
| 10:30 - 10:45 | iSUN introduction, Speaker: Yinda Zhang |
| 10:45 - 11:00 | SALICON introduction, Speaker: Qi Zhao |
| 11:00 - 11:15 | "UPC" team, Speaker: Xavier Giró i Nieto |
| 11:15 - 11:30 | "Rare" team, Speaker: Pierre Marighetto |
| 11:30 - 11:45 | Opening, Speakers: Larry Zitnick, Matteo Ruggero Ronchi, Yin Cui |
| 11:45 - 11:55 | "Montreal/Toronto" team, Speaker: Kelvin Xu |
| 11:55 - 12:05 | "MSR Captivator" team, Speaker: Saurabh Gupta |
| 12:05 - 12:15 | "Google" team, Speaker: Oriol Vinyals |
| 12:15 - 12:20 | Closing remarks |
| Date | Milestone |
|---|---|
| April 1, 2015 | Initial data, development kit, and evaluation software made available |
| May 1, 2015 | Final data release |
| June 4, 2015 | Submission deadline at 11:59pm Pacific Time |
| June 5, 2015 | Challenge results release |
| June 12, 2015 | Workshop at CVPR 2015 |