Research
Our research has been generously supported by ARO, NSF, AFRL, IARPA, BlueHalo, and Salesforce.
2021
Sun, Wei; Wu, Tianfu
Deep Consensus Learning Online
arXiv preprint, 2021.
@online{DCL,
title = {Deep Consensus Learning},
author = {Wei Sun and Tianfu Wu},
url = {https://arxiv.org/abs/2103.08475},
year = {2021},
date = {2021-03-15},
organization = {arXiv preprint},
abstract = {Both generative learning and discriminative learning have recently witnessed remarkable progress using Deep Neural Networks (DNNs). For structured input synthesis and structured output prediction problems (e.g., layout-to-image synthesis and image semantic segmentation respectively), they are often studied separately. This paper proposes deep consensus learning (DCL) for joint layout-to-image synthesis and weakly-supervised image semantic segmentation. The former is realized by a recently proposed LostGAN approach, and the latter by introducing an inference network as the third player joining the two-player game of LostGAN. Two deep consensus mappings are exploited to facilitate training the three networks end-to-end: Given an input layout (a list of object bounding boxes), the generator generates a mask (label map) and then uses it to help synthesize an image. The inference network infers the mask for the synthesized image. Then, the latent consensus is measured between the mask generated by the generator and the one inferred by the inference network. For the real image corresponding to the input layout, its mask is also computed by the inference network, and then used by the generator to reconstruct the real image. Then, the data consensus is measured between the real image and its reconstructed image. The discriminator still plays the role of an adversary by computing the realness scores for a real image, its reconstructed image and a synthesized image. In experiments, our DCL is tested on the COCO-Stuff dataset. It obtains compelling layout-to-image synthesis results and weakly-supervised image semantic segmentation results.},
keywords = {},
pubstate = {published},
tppubtype = {online}
}
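To make the two consensus terms in the abstract concrete, here is a minimal PyTorch-style sketch of how the latent and data consensus losses could be wired up. The network interfaces (a generator returning a mask and an image, an inference network returning mask logits, the optional mask argument, and the specific loss choices) are illustrative assumptions, not the authors' code.

import torch.nn.functional as F

# Hypothetical interfaces (assumptions, not the released code):
#   G(layout)         -> (mask_logits, synthesized_image)
#   G(layout, mask=m) -> (mask_logits, image reconstructed using m)
#   I(image)          -> mask_logits
def consensus_losses(G, I, layout, real_img):
    # Generator: layout -> mask -> synthesized image.
    gen_mask, fake_img = G(layout)
    # Inference network infers the mask of the synthesized image.
    inf_mask = I(fake_img)
    # Latent consensus: agreement between the generated mask and the
    # inferred mask (cross-entropy is one plausible measure).
    latent = F.cross_entropy(inf_mask, gen_mask.argmax(dim=1))
    # For the real image, infer its mask and let the generator use it
    # to reconstruct the real image.
    real_mask = I(real_img)
    _, recon_img = G(layout, mask=real_mask)
    # Data consensus: agreement between the real image and its
    # reconstruction (L1 as a plausible choice).
    data = F.l1_loss(recon_img, real_img)
    return latent, data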
2020
Li, Xilai; Sun, Wei; Wu, Tianfu
Attentive Normalization Proceedings Article
In: European Conference on Computer Vision (ECCV), 2020.
@inproceedings{AttnNorm,
title = {Attentive Normalization},
author = {Xilai Li and Wei Sun and Tianfu Wu},
url = {https://arxiv.org/abs/1908.01259},
year = {2020},
date = {2020-08-23},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {Batch Normalization (BN) is a vital pillar in the development of deep learning with many recent variations such as Group Normalization (GN) and Switchable Normalization. Channel-wise feature attention methods such as the squeeze-and-excitation (SE) unit have also shown impressive performance improvement. BN and its variants take into account different ways of computing the mean and variance within a mini-batch for feature normalization, followed by a learnable channel-wise affine transformation. SE explicitly learns how to adaptively recalibrate channel-wise feature responses. They have been studied separately, however. In this paper, we propose a novel and lightweight integration of feature normalization and feature channel-wise attention. We present Attentive Normalization (AN) as a simple and unified alternative. AN absorbs SE into the affine transformation of BN. AN learns a small number of scale and offset parameters per channel (i.e., different affine transformations). Their weighted sums (i.e., mixture) are used in the final affine transformation. The weights are instance-specific and learned in a way that channel-wise attention is considered, similar in spirit to the squeeze module in the SE unit. AN is complementary and applicable to existing variants of BN. In experiments, we test AN on the ImageNet-1K classification dataset and the MS-COCO object detection and instance segmentation dataset, obtaining significantly better performance than the vanilla BN. Our AN also outperforms two state-of-the-art variants of BN, GN and SN.},
howpublished = {arXiv preprint},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
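The abstract's core idea, absorbing SE-style attention into the affine transformation of a normalization layer, fits in a few lines. Below is a hedged sketch, assuming BN-style standardization, K candidate per-channel affine transforms, and a sigmoid squeeze module; the module name and hyperparameters are illustrative, not the released implementation.

import torch
import torch.nn as nn

class AttentiveNorm2d(nn.Module):
    def __init__(self, channels, k=5):
        super().__init__()
        # Standardization without a learnable affine (a BN variant here;
        # the paper notes AN also applies to other BN variants).
        self.norm = nn.BatchNorm2d(channels, affine=False)
        # K candidate scale/offset vectors per channel.
        self.gamma = nn.Parameter(torch.ones(k, channels))
        self.beta = nn.Parameter(torch.zeros(k, channels))
        # Squeeze-style predictor: pooled features -> K mixture weights.
        self.fc = nn.Linear(channels, k)

    def forward(self, x):
        xn = self.norm(x)
        # Instance-specific mixture weights, one set per example.
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))        # (B, K)
        # Weighted sums of the K affine transforms.
        gamma = (w @ self.gamma).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = (w @ self.beta).unsqueeze(-1).unsqueeze(-1)
        return gamma * xn + beta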
Xue, Nan; Wu, Tianfu; Bai, Song; Wang, Fudong; Xia, Gui-Song; Zhang, Liangpei; Torr, Philip H. S.
Holistically-Attracted Wireframe Parsing Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
@inproceedings{HAWP,
title = {Holistically-Attracted Wireframe Parsing},
author = {Nan Xue and Tianfu Wu and Song Bai and Fudong Wang and Gui-Song Xia and Liangpei Zhang and Philip H.S. Torr},
year = {2020},
date = {2020-02-23},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {This paper presents a fast and parsimonious parsing method to accurately and robustly detect a vectorized wireframe in an input image with a single forward pass. The proposed method is end-to-end trainable, consisting of three components: (i) line segment and junction proposal generation, (ii) line segment and junction matching, and (iii) line segment and junction verification.
For computing line segment proposals, a novel exact dual representation is proposed which exploits a parsimonious geometric reparameterization for line segments and forms a holistic 4-dimensional attraction field map for an input image. Junctions can be treated as the ``basins'' in the attraction field. The proposed method is thus called Holistically-Attracted Wireframe Parser (HAWP). In experiments, the proposed method is tested on two benchmarks, the Wireframe dataset and the YorkUrban dataset. On both benchmarks, it obtains state-of-the-art performance in terms of accuracy and efficiency. For example, on the Wireframe dataset, compared to the previous state-of-the-art method L-CNN, it improves the challenging mean structural average precision (msAP) by a large margin (2.8% absolute improvement), and achieves 29.5 FPS on a single GPU (89% relative improvement). A systematic ablation study is performed to further justify the proposed method.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
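As a geometric illustration of the 4-dimensional reparameterization the abstract mentions, the sketch below encodes a line segment, from the viewpoint of a pixel p, by the distance and direction to its perpendicular foot on the line plus the angles to the two endpoints. This is one plausible reading of the attraction field idea; the exact parameterization used in HAWP may differ.

import numpy as np

def angle(v):
    return np.arctan2(v[1], v[0])

def attraction_4d(p, x1, x2):
    """4-D encoding of segment (x1, x2) as seen from point p (all 2-D)."""
    p, x1, x2 = map(np.asarray, (p, x1, x2))
    u = x2 - x1
    # Perpendicular foot of p on the (infinite) line through x1 and x2.
    t = np.dot(p - x1, u) / np.dot(u, u)
    foot = x1 + t * u
    d = np.linalg.norm(foot - p)   # distance from p to the line
    theta = angle(foot - p)        # direction from p to the foot
    # Angles to the two endpoints, relative to the foot direction.
    theta1 = angle(x1 - p) - theta
    theta2 = angle(x2 - p) - theta
    return d, theta, theta1, theta2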
2019
Li, Xilai; Song, Xi; Wu, Tianfu
AOGNets: Compositional Grammatical Architectures for Deep Learning Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
@inproceedings{AOGNets,
title = {AOGNets: Compositional Grammatical Architectures for Deep Learning},
author = {Xilai Li and Xi Song and Tianfu Wu},
url = {http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_AOGNets_Compositional_Grammatical_Architectures_for_Deep_Learning_CVPR_2019_paper.pdf
https://github.com/iVMCL/AOGNets
https://www.wraltechwire.com/2019/05/21/ncsu-researchers-create-framework-for-a-smarter-ai-are-seeking-patent/
https://www.technologynetworks.com/tn/news/new-framework-enhances-neural-network-performance-319704
},
year = {2019},
date = {2019-06-18},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {Neural architectures are the foundation for improving performance of deep neural networks (DNNs). This paper presents deep compositional grammatical architectures which harness the best of two worlds: grammar models and DNNs. The proposed architectures integrate compositionality and reconfigurability of the former and the capability of learning rich features of the latter in a principled way. We utilize AND-OR Grammar (AOG) as the network generator in this paper and call the resulting networks AOGNets. An AOGNet consists of a number of stages, each of which is composed of a number of AOG building blocks. An AOG building block splits its input feature map into N groups along feature channels and then treats it as a sentence of N words. It then jointly realizes a phrase structure grammar and a dependency grammar in parsing the “sentence” bottom-up for better feature exploration and reuse. It provides a unified framework for the best practices developed in state-of-the-art DNNs. In experiments, AOGNet is tested on the ImageNet-1K classification benchmark and the MS-COCO object detection and segmentation benchmark. In ImageNet-1K, AOGNet obtains better performance than ResNet and most of its variants, ResNeXt and its attention-based variants such as SENet, DenseNet and DualPathNet. AOGNet also obtains the best model interpretability score using network dissection. AOGNet further shows better potential in adversarial defense. In MS-COCO, AOGNet obtains better performance than the ResNet and ResNeXt backbones in Mask R-CNN.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
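The AOG building block described in the abstract can be illustrated with a toy module: the input channels are split into N "word" groups, terminal nodes transform each word, an AND-node concatenates adjacent sub-phrases, and an OR-node sums alternative parses. This is a deliberately simplified sketch with a single fixed parse structure; the real blocks use richer node operations and explore many parse structures.

import torch
import torch.nn as nn

class TinyAOGBlock(nn.Module):
    """Parses a 4-group 'sentence' with one fixed parse structure."""
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        gc = channels // groups
        # Terminal nodes: one 1x1 conv per "word" group.
        self.terminals = nn.ModuleList(
            nn.Conv2d(gc, gc, 1) for _ in range(groups))
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # Split the feature map into N words along channels.
        words = torch.chunk(x, self.groups, dim=1)
        words = [t(w) for t, w in zip(self.terminals, words)]
        # AND-nodes: concatenate adjacent sub-phrases.
        left = torch.cat(words[:2], dim=1)
        right = torch.cat(words[2:], dim=1)
        # OR-node: sum two alternative parses of the full sentence
        # (the AND composition and the identity "parse").
        phrase = torch.cat([left, right], dim=1) + x
        return self.out(phrase)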
Wu, Tianfu; Song, Xi
Towards Interpretable Object Detection by Unfolding Latent Structures Proceedings Article
In: International Conference on Computer Vision (ICCV), 2019.
@inproceedings{iRCNN,
title = {Towards Interpretable Object Detection by Unfolding Latent Structures},
author = {Tianfu Wu and Xi Song},
year = {2019},
date = {2019-10-28},
booktitle = {International Conference on Computer Vision (ICCV)},
abstract = {This paper first proposes a method of formulating model interpretability in visual understanding tasks based on the idea of unfolding latent structures. It then presents a case study in object detection using popular two-stage region-based convolutional network (i.e., R-CNN) detection systems. The proposed method focuses on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection without using any supervision for part configurations. It utilizes a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent part configurations of regions of interest (RoIs). It presents an AOGParsing operator that seamlessly integrates with the RoIPooling/RoIAlign operator widely used in R-CNN and is trained end-to-end. In object detection, a bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is treated as the qualitatively extractive rationale generated for interpreting detection. In experiments, Faster R-CNN [50] is used to test the proposed method on the PASCAL VOC 2007 and the COCO 2017 object detection datasets. The experimental results show that the proposed method can compute promising latent structures without hurting the performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
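To illustrate what unfolding latent part configurations of an RoI means operationally, here is a toy sketch that scores two candidate part decompositions of pooled RoI features and keeps the best one as the extractive rationale. The real AOGParsing operator explores a much larger AND-OR space and is trained end-to-end; the two-way split and the scorer interface here are illustrative assumptions.

import torch

def best_parse(roi_feat, scorer):
    """roi_feat: (C, H, W) pooled RoI features (as from RoIAlign);
    scorer: maps a part tensor to a scalar score."""
    _, h, w = roi_feat.shape
    # Two candidate part configurations of the RoI grid.
    configs = {
        "horizontal": [roi_feat[:, : h // 2], roi_feat[:, h // 2 :]],
        "vertical":   [roi_feat[:, :, : w // 2], roi_feat[:, :, w // 2 :]],
    }
    # Score each configuration as the sum of its part scores,
    # then keep the best-scoring parse.
    scores = {name: sum(scorer(p) for p in parts)
              for name, parts in configs.items()}
    return max(scores, key=scores.get)

# Usage with a trivial stand-in scorer:
#   best_parse(torch.randn(256, 7, 7), scorer=lambda p: p.mean())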