Large-scale Scene Understanding Challenge: Leaderboard

>> Scene Classification

* Performance

Team Name Top 1 Accuracy
Google 0.9120
SIAT_MMLAB 0.9030
Contextual CNN 0.8821
isia_ICT 0.8799

* Metric

Top 1 Accuracy: The percentage of test images whose predicted label matches the ground-truth label.
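A minimal sketch of this computation, assuming predictions and ground-truth labels are plain integer arrays (the function name and example values are illustrative, not part of the challenge toolkit):

```python
import numpy as np

def top1_accuracy(predicted_labels, true_labels):
    """Fraction of test samples whose predicted label equals the ground-truth label."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted_labels == true_labels))

# Example: 3 of 4 predictions match the ground truth -> 0.75
print(top1_accuracy([2, 0, 7, 5], [2, 0, 7, 3]))
```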

* Team

Google
Members: Julian Ibarz, Christian Szegedy, Vincent Vanhoucke
Method: A single Inception-based convnet with a 151x151 receptive field.

SIAT_MMLAB
Members: Sheng Guo, Tong He, Limin Wang, Weilin Huang, Yu Qiao

Contextual CNN
Members: Mandar Dixit, Nuno Vasconcelos, Si Chen
Method: A convolutional GoogLeNet is trained separately to classify the entire image as well as image regions. Contextual cues from the region-based classifier are pooled to generate a scene representation, which is classified with a linear SVM. These scores are combined with the scores of the conventional full-image classification net to produce the final decisions.

isia_ICT
Members: Xiangyang Li, Xinhang Song, Luis Herranz, Shuqiang Jiang



>> Saliency Prediction

* Performance: iSUN

Method Similarity CC AUC_shuffled AUC_Borji AUC_Judd
UPC 0.6833 0.8230 0.6650 0.8463 0.8693
Xidian 0.5713 0.6167 0.6484 0.7949 0.8207
WHU_IIP 0.5593 0.6263 0.6307 0.7960 0.8197
LCYLab 0.5474 0.5699 0.6259 0.7921 0.8133
Rare 2012 Improved 0.5199 0.5199 0.6283 0.7582 0.7846
Baseline: BMS[1] 0.5026 0.3465 0.5885 0.6560 0.6914
Baseline: GBVS[2] 0.4798 0.5087 0.6208 0.7913 0.8115
Baseline: Itti[3] 0.4251 0.3728 0.6024 0.7262 0.7489

* Performance: SALICON

Method Similarity CC AUC_shuffled AUC_Borji AUC_Judd
UPC 0.5198 0.5957 0.6698 0.8291 0.8364
WHU_IIP 0.4908 0.4569 0.6064 0.7759 0.7923
Rare 2012 Improved 0.5017 0.5108 0.6644 0.8047 0.8148
Xidian 0.4617 0.4811 0.6809 0.7990 0.8051
Baseline: BMS[1] 0.4542 0.4268 0.6935 0.7699 0.7899
Baseline: GBVS[2] 0.4460 0.4212 0.6303 0.7816 0.7899
Baseline: Itti[3] 0.3777 0.2046 0.6101 0.6603 0.6669

* Metric

We used exactly the same metrics as the MIT Saliency Benchmark [6,7,8,9], together with their source code, for evaluation. For completeness and the readers' convenience, we copy their descriptions of the metrics from the MIT website here.
Similarity: This similarity measure is also called histogram intersection and measures the similarity between two different saliency maps when viewed as distributions (SIM=1 means the distributions are identical).
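A minimal NumPy sketch of this measure, assuming both maps are non-negative arrays of the same shape with positive sums (illustrative only; the benchmark's released code remains the reference implementation):

```python
import numpy as np

def similarity(saliency_map, fixation_map):
    """Histogram intersection: sum of element-wise minima after each map
    is normalized to sum to 1 (SIM=1 for identical distributions)."""
    s = saliency_map / saliency_map.sum()
    f = fixation_map / fixation_map.sum()
    return float(np.minimum(s, f).sum())
```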
CC: This is also called Pearson's linear coefficient and is the linear correlation coefficient between two different saliency maps (CC=0 for uncorrelated maps).
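A correspondingly minimal sketch, again assuming two same-shaped arrays:

```python
import numpy as np

def cc(saliency_map, fixation_map):
    """Pearson's linear correlation coefficient between two saliency maps
    (CC=0 for uncorrelated maps)."""
    s = saliency_map.ravel().astype(np.float64)
    f = fixation_map.ravel().astype(np.float64)
    return float(np.corrcoef(s, f)[0, 1])
```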
AUC_shuffled: This shuffled AUC (introduced in Zhang et al. 2008) is based on Ali Borji's code, used in Borji et al. 2013. The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive (tp) rate is the proportion of saliency map values above threshold at fixation locations. The false positive (fp) rate is the proportion of saliency map values above threshold sampled at random pixels (as many samples as fixations, sampled uniformly from FIXATIONS ON OTHER IMAGES). In this implementation, threshold values are sampled at a fixed step size.
AUC_Borji: This version of the Area Under ROC curve measure is based on Ali Borji's code, used in Borji et al. 2013 - not to be confused with shuffled AUC (provided as a separate metric, sAUC). The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive (tp) rate is the proportion of saliency map values above threshold at fixation locations. The false positive (fp) rate is the proportion of saliency map values above threshold sampled at random pixels (as many samples as fixations, sampled uniformly from ALL IMAGE PIXELS). In this implementation, threshold values are sampled at a fixed step size.
AUC_Judd: This version of the Area Under ROC curve measure has been called AUC-Judd in Riche et al. 2013. The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive (tp) rate is the proportion of saliency map values above threshold at fixation locations. The false positive (fp) rate is the proportion of saliency map values above threshold at non-fixated pixels. In this implementation, the thresholds are sampled from saliency map values.
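The three AUC variants share the same scheme and differ mainly in how negatives are drawn: fixations from other images (shuffled), random image pixels (Borji), or all non-fixated pixels (Judd). Below is a hedged sketch of the AUC-Judd variant only, assuming a real-valued saliency map and a boolean fixation mask of the same shape; it is illustrative, not the benchmark's official code.

```python
import numpy as np

def auc_judd(saliency_map, fixation_mask):
    """AUC-Judd sketch: the saliency map is treated as a binary classifier.
    Thresholds are taken from the saliency values at fixated pixels;
    tp rate = fraction of fixated pixels above threshold,
    fp rate = fraction of non-fixated pixels above threshold."""
    s = saliency_map.ravel().astype(np.float64)
    fix = fixation_mask.ravel().astype(bool)
    thresholds = np.sort(s[fix])[::-1]        # thresholds from fixation locations
    nonfix_values = s[~fix]

    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        tpr.append(float(np.mean(s[fix] >= t)))
        fpr.append(float(np.mean(nonfix_values >= t)))
    tpr.append(1.0)
    fpr.append(1.0)
    # Area under the ROC curve via the trapezoidal rule.
    return float(np.trapz(tpr, fpr))

# AUC_Borji and AUC_shuffled follow the same recipe, but draw the negative
# samples from random image pixels or from fixations on other images,
# and sweep thresholds at a fixed step size.
```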

* Team

WHU_IIP
Members: Yifan Ding, Shidong Ke, Xiaoying Ding, Zhenzhong Chen
Method: We analyze low-level features of the retinal image, taking into account visual sensitivity and visual search theory. Features are extracted from the perceived retinal image and attention shifts are modeled accordingly. Visual memory effects and fixation priorities are also modeled to generate the final saliency map.

UPC
Members: Junting Pan and Xavier Giro-i-Nieto
Method: Our solution is based on a convolutional neural network that consists of 3 convolutional and 2 fully connected layers. Its parameters were learned by solving a regression problem on the saliency maps of the training images. (A purely schematic sketch of such a network is given after the team list.)

LCYLab
Members: CongYan Lang, ZunLi, Yeliang
Method: For this challenge, we first train object classes with RCNN and use the resulting object tags as a prior. Our model fuses this prior with existing saliency models, and the final saliency map is produced by our model.
External data: The object tags used to obtain bounding boxes, which serve as the model's prior.

Xidian
Members: Fei Qi, Chen Xia, Yuancheng Huang, Guangming Shi, Wei Liu, Chong Shen
Method: We model visual saliency as a center-surround reconstruction problem in which a larger patch is used to reconstruct its center part. The underlying mechanism is that the reconstruction error is small in background regions and large in foreground regions, so regions with a large center-surround reconstruction error are predicted as salient. The reconstruction model is learned from the given image (visual stimulus) by training a deep neural network composed of a 5-layer deep auto-encoder and an additional inference layer.

Rare 2012 Improved
Members: Pierre Marighetto, Nicolas Riche, Matei Mancas
Method: This model is an improvement of the Rare 2012 model, made for the challenge, based on the paper "RARE2012: A multi-scale rarity-based saliency detection with its comparative statistical analysis" [Signal Processing: Image Communication, 2013]. It also includes an objectness computation, based on the CALVIN group project, which can be found at http://groups.inf.ed.ac.uk/calvin/objectness/.
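The UPC entry above describes only the overall shape of its network (3 convolutional plus 2 fully connected layers, trained by regressing saliency maps). The PyTorch sketch below is purely schematic: the input resolution, channel counts, kernel sizes, and output map size are our own assumptions for illustration and are not taken from the team's description.

```python
import torch
import torch.nn as nn

class SaliencyRegressor(nn.Module):
    """Schematic 3-conv / 2-FC network that regresses a low-resolution
    saliency map from an RGB image (all layer sizes are assumptions)."""
    def __init__(self, out_size=48):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 2048), nn.ReLU(),
            nn.Linear(2048, out_size * out_size),
        )
        self.out_size = out_size

    def forward(self, x):                      # x: (N, 3, 96, 96)
        h = self.features(x)                   # -> (N, 64, 12, 12)
        y = self.regressor(h)                  # -> (N, 48*48)
        return y.view(-1, 1, self.out_size, self.out_size)

# Training would minimize a regression loss between the predicted and
# ground-truth saliency maps, e.g.:
# loss = nn.MSELoss()(model(images), target_maps)
```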



>> Room Layout Estimation

* Performance

Method Pixelwise Error Corner Error
DeLay 0.1063 0.0820
UIUC 0.1671 0.1102
Varsha Hedau, et al. [4] 0.2423 0.1548

* Metric

Pixelwise Error: Convert the room layout into a segmentation mask and count the percentage of pixels whose labels disagree with the ground truth.
Corner Error: The distance between corresponding predicted and ground-truth room corners, normalized by the length of the image diagonal.
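A hedged sketch of both measures, assuming layouts are given as per-pixel label masks and as corresponding ordered corner coordinates; averaging the corner distances is our own assumption about the aggregation, not something stated above.

```python
import numpy as np

def pixelwise_error(pred_mask, gt_mask):
    """Fraction of pixels whose predicted layout label differs from the ground truth."""
    return float(np.mean(pred_mask != gt_mask))

def corner_error(pred_corners, gt_corners, image_shape):
    """Distance between corresponding predicted and ground-truth room corners,
    normalized by the image diagonal (averaged over corners here)."""
    h, w = image_shape
    diagonal = np.hypot(h, w)
    dists = np.linalg.norm(np.asarray(pred_corners, dtype=float)
                           - np.asarray(gt_corners, dtype=float), axis=1)
    return float(dists.mean() / diagonal)
```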

* Team

DeLay
Members: Anonymous
External data: None, apart from ImageNet and PASCAL VOC, which were used to pre-train the network that we fine-tuned on LSUN.

UIUC
Members: Arun Mallya, Svetlana Lazebnik
Method: We use a fully convolutional deep network to generate informative edge features and then use a conventional structured SVM to rank prospective layouts generated by an adaptive proposal mechanism.
External data: The SVM is trained on the Hedau indoor dataset.



>> Reference

[1] Zhang, Jianming, and Stan Sclaroff. "Saliency detection: A boolean map approach." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
[2] Harel, Jonathan, Christof Koch, and Pietro Perona. "Graph-based visual saliency." Advances in neural information processing systems. 2006.
[3] Itti, Laurent, Christof Koch, and Ernst Niebur. "A model of saliency-based visual attention for rapid scene analysis." IEEE Transactions on pattern analysis and machine intelligence 20.11 (1998): 1254-1259.
[4] Hedau, Varsha, Derek Hoiem, and David Forsyth. "Recovering the spatial layout of cluttered rooms." Computer vision, 2009 IEEE 12th international conference on. IEEE, 2009.
[5] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[6] Judd, Tilke, Fredo Durand, and Antonio Torralba. "A benchmark of computational models of saliency to predict human fixations." (2012).
[7] Borji, Ali, Dicky N. Sihite, and Laurent Itti. "Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study." Image Processing, IEEE Transactions on 22.1 (2013): 55-69.
[8] Borji, Ali, and Laurent Itti. "CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research." arXiv preprint arXiv:1505.03581 (2015).
[9] Bylinskii, Zoya, Tilke Judd, Ali Borji, Laurent Itti, Fredo Durand, Aude Oliva, and Antonio Torralba. "MIT saliency benchmark".