Large-scale Scene Understanding Challenge: Leaderboard

>> Scene Classification

* Performance

Team Name Top 1 Accuracy (%)
SIAT_MMLAB 91.61
SJTU-ReadSense 90.43
TEG Rangers 88.70
ds-cube 83.02

* Metric

Top 1 Accuracy The percentage of test images whose predicted label matches the ground-truth label.
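
For concreteness, a minimal sketch of how this metric can be computed (the NumPy helper below is our own illustration; the official scoring script may differ in details such as label encoding):

```python
import numpy as np

def top1_accuracy(predictions, ground_truth):
    """Percentage of test images whose predicted label equals the ground-truth label."""
    predictions = np.asarray(predictions)
    ground_truth = np.asarray(ground_truth)
    return 100.0 * np.mean(predictions == ground_truth)

# Example: 3 of 4 predictions match the ground truth -> 75.0
print(top1_accuracy([2, 0, 5, 1], [2, 0, 5, 3]))
```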

* Team

Team Name | Members | Method | Additional
SIAT_MMLAB | Sheng Guo, Limin Wang, Bowen Zhang and Yu Qiao | Multi-resolution CNNs with model fusion. | Fine-tuned from an ImageNet-pretrained model.
SJTU-ReadSense | Qinchuan Zhang, Junxuan Chen, Faming Xu, Leon Ding, Hongtao Lu | An ensemble of three Inception models and three VGG-19-based models; multi-resolution crops are used for inference. | No
TEG Rangers | Yongqiang Gao, Na Li, Weidong Chen | We propose a new scene recognition system with deep convolutional models. Specifically, we address the problem from three aspects: (i) we mainly use VGG nets as the fine-tuning models, with the last output convolutional layers divided into several object-centric parts; (ii) we introduce a multi-scale CNN framework, training CNNs on image patches of three resolutions (112 cropped from 128, 224 cropped from 256, and 336 cropped from 384); (iii) we combine CNNs of different architectures: considering the complementarity of networks with different architectures, we also fuse the prediction results of VGGNet13-256, VGGNet16-128, VGGNet16-256, VGGNet16-384, VGGNet19-256 and a fine-tuned denscap net. | No
ds-cube | Ayush Rai, Shaivi Kochar, Anil Nelakanti | Fine-tuned a VGG-19 network on the LSUN training set. | No



>> Saliency Prediction

* Performance: iSUN

Team AUC CC IG sAUC
Donders 0.834654 0.668005 0.001817 0.543768
SDU_VSISLab 0.849714 0.788234 0.050751 0.520509
XRCE 0.855237 0.786927 0.101901 0.538404
UPC-Microsoft-BSC 0.859838 0.798076 0.135869 0.541129
NPU_HanLab 0.860600 0.815104 0.156153 0.549970
VAL 0.861729 0.814632 0.178770 0.550059
DEEPATTENT 0.861991 0.814404 0.174019 0.549960

* Performance: SALICON

Team AUC CC IG sAUC
VLL 0.732196 0.766254 0.160525 0.598952
SDU_VSISLab 0.744363 0.735466 0.179225 0.599414
HUCVL 0.747197 0.825767 0.197146 0.598397
Donders 0.747980 0.767043 0.247464 0.627419
ML-Net 0.748492 0.724379 0.273812 0.632655
UPC-Microsoft-BSC 0.754524 0.796821 0.291961 0.635668
XRCE 0.755746 0.821661 0.303806 0.631685
NPU_HanLab 0.755892 0.774942 0.317763 0.636591
VAL 0.761066 0.804374 0.315320 0.629928
DEEPATTENT 0.766941 0.890133 0.325582 0.631224

* Metric

We used and modified the CC, AUC_shuffled, and AUC_Judd metrics from the MIT Saliency Benchmark [1,2,3,4], and include their source code as part of the toolkit. For completeness and the readers' convenience, we reproduce their metric descriptions from the MIT website here, followed by our modifications. For the IG metric, please refer to [5] for more details. All submissions are officially evaluated with the Python toolkit from the pysaliency GitHub repository.
CC This is also called Pearson's linear coefficient and is the linear correlation coefficient between two different saliency maps (CC=0 for uncorrelated maps).
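
As an illustration, a minimal NumPy sketch of CC, assuming both maps are given as 2-D arrays with non-zero variance; the official toolkit implementation may normalize the maps differently:

```python
import numpy as np

def cc(saliency_map, fixation_map):
    """Pearson's linear correlation coefficient between two saliency maps.

    CC = 0 for uncorrelated maps, 1 for perfectly linearly related maps.
    """
    s = (saliency_map - saliency_map.mean()) / saliency_map.std()
    f = (fixation_map - fixation_map.mean()) / fixation_map.std()
    # The mean of the product of z-scored maps is the Pearson correlation.
    return float(np.mean(s * f))
```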
sAUC (AUC_shuffled) This shuffled AUC (introduced in Zhang et al. 2008) is based on Ali Borji's code, used in Borji et al. 2013. The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive (tp) rate is the proportion of saliency map values above threshold at fixation locations. The false positive (fp) rate is the proportion of saliency map values above threshold sampled at random pixels (as many samples as fixations, sampled uniformly from FIXATIONS ON OTHER IMAGES). In this implementation, threshold values are sampled at a fixed step size. Instead of using 10 randomly sampled images as the source of negative fixations, we use all other images as the source of negative fixations.
AUC (AUC_Judd) This version of the Area Under ROC curve measure has been called AUC-Judd in Riche et al. 2013. The saliency map is treated as a binary classifier to separate positive from negative samples at various thresholds. The true positive (tp) rate is the proportion of saliency map values above threshold at fixation locations. The false positive (fp) rate is the proportion of saliency map values above threshold at non-fixated pixels. In this implementation, the thresholds are sampled from saliency map values. Instead of averaging over images, we averaged over fixations.
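
The two AUC variants above differ only in how the negative samples and thresholds are chosen. Below is a minimal sketch of both, assuming fixations are given as (row, column) index arrays; the threshold count for sAUC and other details are our own choices, and the official pysaliency-based evaluation may differ:

```python
import numpy as np

def _auc(pos_values, neg_values, thresholds):
    """Area under the ROC curve for the given positive/negative score sets."""
    tp = np.array([np.mean(pos_values >= t) for t in thresholds])
    fp = np.array([np.mean(neg_values >= t) for t in thresholds])
    # Add the (0, 0) and (1, 1) end points of the ROC curve, then integrate
    # with the trapezoidal rule.
    tp = np.concatenate(([0.0], tp, [1.0]))
    fp = np.concatenate(([0.0], fp, [1.0]))
    order = np.argsort(fp)
    return float(np.trapz(tp[order], fp[order]))

def auc_judd(saliency_map, fixations):
    """AUC_Judd: positives are saliency values at fixated pixels, negatives are
    all non-fixated pixels, and thresholds are taken from the positive values."""
    pos = saliency_map[fixations[:, 0], fixations[:, 1]]
    fixated = np.zeros(saliency_map.shape, dtype=bool)
    fixated[fixations[:, 0], fixations[:, 1]] = True
    neg = saliency_map[~fixated]
    return _auc(pos, neg, np.unique(pos))

def auc_shuffled(saliency_map, fixations, other_image_fixations):
    """sAUC: negatives are saliency values at fixation locations taken from
    OTHER images; thresholds are sampled at a fixed step (count assumed here)."""
    pos = saliency_map[fixations[:, 0], fixations[:, 1]]
    neg = saliency_map[other_image_fixations[:, 0], other_image_fixations[:, 1]]
    thresholds = np.linspace(saliency_map.min(), saliency_map.max(), 20)
    return _auc(pos, neg, thresholds)
```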
IG (Information Gain) If a model predicts a fixation density for each image, information-theoretic metrics can be applied. Information gain is the difference in average log-likelihood between the model and a center-bias baseline model, quantifying how much more information the model explains than the baseline. An information gain of 1 bit per fixation means that, on average, the model found each fixation to be twice as likely as the baseline did. To formulate the submitted models as probabilistic models, we fitted a pointwise nonlinearity and a center bias to each model. For more details on the metric, the baseline and the evaluation we used, see Kümmerer et al. (PNAS 2015) [5].
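
A minimal sketch of the IG computation, assuming the model and the center-bias baseline have already been converted to per-image probability densities (the pointwise nonlinearity and center-bias fitting described above are not shown; see [5] for the official procedure):

```python
import numpy as np

def information_gain(model_density, baseline_density, fixations, eps=1e-20):
    """Information gain (bits per fixation) of a probabilistic saliency model
    over a center-bias baseline, evaluated at the ground-truth fixations.

    Both densities are 2-D arrays that sum to 1 over the image; fixations is
    an array of (row, column) indices.
    """
    rows, cols = fixations[:, 0], fixations[:, 1]
    ll_model = np.log2(model_density[rows, cols] + eps)
    ll_baseline = np.log2(baseline_density[rows, cols] + eps)
    # A gain of +1 bit means each fixation is, on average, twice as likely
    # under the model as under the baseline.
    return float(np.mean(ll_model - ll_baseline))
```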

* Team

Team Name | Members | Method | Additional
Donders | Umut Güçlü*, Yağmur Güçlütürk*, Rob van Lier, Marcel A. J. van Gerven | We used deep neural networks comprising convolutional layers, deconvolutional layers and residual blocks. The networks were trained with mini-batch stochastic gradient descent by minimizing the Euclidean distance between the target and predicted saliency maps. The images and target saliency maps in the datasets were resized to fixed sizes, and those in each mini-batch were augmented by cropping and mirroring. The predicted saliency maps were smoothed with a Gaussian kernel and resized to the corresponding original sizes. |
SDU_VSISLab | Jianjie Gao, Wei Zhang, Weidong Zhang | We trained a deep network consisting of two branches to predict the saliency maps; one branch is initialized with JuntingNet. | No
XRCE | Naila Murray | A fully convolutional deep model trained to predict an output saliency map using a Bhattacharyya distance-based loss. | External data: ImageNet was used to pre-train the early convolutional layers of the model.
UPC-Microsoft-BSC | Junting Pan, Cristian Canton Ferrer, Xavier Giró-i-Nieto, Elisa Sayrol, Jordi Torres | We propose a novel fully convolutional network for end-to-end saliency prediction. The network has two stages: an encoding stage, whose parameters are fine-tuned from a combination of different pretrained models, and a decoding stage, whose parameters are learned from scratch. | None, except that the encoding stage uses models pretrained on ImageNet and Places.
VAL | Srinivas S S Kruthiventi, Kumar Ayush and R. Venkatesh Babu | Our model is DeepFix, a fully convolutional neural network for end-to-end saliency prediction. DeepFix is designed to capture semantics at multiple scales while taking global context into account. The network introduces spatial variance in its layers to handle center bias. | The network is pre-trained on ILSVRC 2014 image classification data for the classification task.
VLL | R-Tavakoli Hamed | The model consists of two pipelines, each capable of acting as a self-contained model. One pipeline can act as a baseline for a class of models. If the challenge permits, we can provide the output of each pipeline for more detailed analysis. |
HUCVL | Cagdas Bak, Aykut Erdem, Erkut Erdem | We adapted a dynamic saliency model trained on the DIEM dataset to static saliency estimation. | We did not use any other external data during training on the SALICON dataset.
ML-Net | Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara | Our saliency model is composed of three main parts. The first is a fully convolutional network with 16 layers, inspired by the VGG-19 model, which takes the input image and produces feature maps for the encoding network. The second is an encoding network which learns how to weight features extracted at different levels of the previous layers and produces saliency-specific features. Finally, a prior is learned and applied to produce the final predicted saliency map. | We initialized some layers of our model with the weights of a VGG-19 model trained on the ImageNet dataset.
DEEPATTENT | Kshitij Dwivedi, Nitin Kumar Singh, Sabari Raju Shanmugam (Submitted Anonymously) | Our model uses a network with a residual architecture, pretrained on ImageNet and then fine-tuned on the SALICON train set for fixation map generation. The fixations are further refined by adding layers that integrate rich information from features at multiple scales. | Some layers of the model are initialized with ImageNet-pretrained weights.



>> Room Layout Estimation

* Performance

Method Pixelwise Error Corner Error
ILC-ST-PIO 0.0529 0.0384
SDU_VSISLab 0.0658 0.0517
CFILE 0.0757 0.0523
UIUC (last year winner) 0.1671 0.1102

* Metric

Pixelwise Error Convert the room layout to a segmentation mask and compute the percentage of pixels whose labels disagree with the ground truth.
Corner Error The average distance between predicted and ground-truth room corners, normalized by the diagonal of the image.
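
For concreteness, a minimal sketch of the two layout metrics, assuming the layouts are given as per-pixel face-label masks and as corner coordinate arrays with known correspondence; the official evaluation may handle label and corner matching differently:

```python
import numpy as np

def pixelwise_error(pred_mask, gt_mask):
    """Fraction of pixels whose predicted layout label differs from the ground truth."""
    return float(np.mean(pred_mask != gt_mask))

def corner_error(pred_corners, gt_corners, image_shape):
    """Mean distance between corresponding corners, normalized by the image diagonal."""
    h, w = image_shape[:2]
    diagonal = np.hypot(h, w)
    dists = np.linalg.norm(np.asarray(pred_corners) - np.asarray(gt_corners), axis=1)
    return float(np.mean(dists) / diagonal)
```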

* Team

Team Name | Members | Method | Additional
SDU_VSISLab | Weidong Zhang, Wei Zhang | We trained a deconvolution network to simultaneously predict the room edges (corner representation) and the different faces (segmentation representation). The parameterized layout results were generated and iteratively refined based on the predicted edge maps and segmentation masks. | External data: none. The network was initialized with the weights of the VGG-16-layer model trained on the ILSVRC dataset.
CFILE | Yuzhuo Ren, C.-C. Jay Kuo | A CNN-based approach that estimates a coarse layout and then optimizes it into a finer result. | External data: Hedau indoor dataset.
UIUC | Arun Mallya, Svetlana Lazebnik | We use a fully convolutional deep network to generate informative edge features and then use a conventional structured SVM to rank prospective layouts generated by an adaptive proposal mechanism. | External data: the SVM is trained on the Hedau indoor dataset.
ILC-ST-PIO | Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang | We propose ST-PIO, a compact room layout estimation framework that enjoys the benefits of two novel techniques: semantic transfer and physics-inspired optimization. | External data: SUN-RGBD.



>> Reference

[1] Judd, Tilke, Frédo Durand, and Antonio Torralba. "A benchmark of computational models of saliency to predict human fixations." (2012).
[2] Borji, Ali, Dicky N. Sihite, and Laurent Itti. "Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study." IEEE Transactions on Image Processing 22.1 (2013): 55-69.
[3] Borji, Ali, and Laurent Itti. "CAT2000: A large scale fixation dataset for boosting saliency research." arXiv preprint arXiv:1505.03581 (2015).
[4] Bylinskii, Zoya, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. "MIT Saliency Benchmark."
[5] Kümmerer, Matthias, Thomas S. A. Wallis, and Matthias Bethge. "Information-theoretic model comparison unifies saliency metrics." Proceedings of the National Academy of Sciences (2015).