Research
Our research has been generously supported by ARO, NSF, AFRL, IARPA, BlueHalo, and Salesforce.
2019
Wu, Tianfu; Song, Xi
Towards Interpretable Object Detection by Unfolding Latent Structures Proceedings Article
In: International Conference on Computer Vision (ICCV), 2019.
@inproceedings{iRCNN,
title = {Towards Interpretable Object Detection by Unfolding Latent Structures},
author = {Tianfu Wu and Xi Song},
year = {2019},
date = {2019-10-28},
booktitle = {International Conference on Computer Vision (ICCV)},
abstract = {This paper first proposes a method of formulating model interpretability in visual understanding tasks based on the idea of unfolding latent structures. It then presents a case study in object detection using popular two-stage region-based convolutional network (i.e., R-CNN) detection systems. The proposed method focuses on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection without using any supervision for part configurations. It utilizes a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent part configurations of regions of interest (RoIs). It presents an AOGParsing operator that seamlessly integrates with the RoIPooling/RoIAlign operator widely used in R-CNN and is trained end-to-end. In object detection, a bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is treated as the qualitatively extractive rationale generated for interpreting detection. In experiments, Faster R-CNN is used to test the proposed method on the PASCAL VOC 2007 and the COCO 2017 object detection datasets. The experimental results show that the proposed method can compute promising latent structures without hurting the performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}