Research
Our research has been generously supported by ARO, NSF, AFRL, IARPA, BlueHalo, and Salesforce.
2022
Grainger, Ryan; Paniagua, Thomas; Song, Xi; Wu, Tianfu
Learning Patch-to-Cluster Attention in Vision Transformer Working paper
arXiv preprint, 2022.
@workingpaper{PaCaViT,
title = {Learning Patch-to-Cluster Attention in Vision Transformer},
author = {Ryan Grainger and Thomas Paniagua and Xi Song and Tianfu Wu},
url = {https://arxiv.org/abs/2203.11987},
year = {2022},
date = {2022-03-23},
abstract = {The Vision Transformer (ViT) model is built on the assumption of treating image patches as "visual tokens" and learning patch-to-patch attention. The patch embedding based tokenizer is a workaround in practice and has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViT models. To address these issues in ViT models, this paper proposes to learn patch-to-cluster attention (PaCa) based ViT models. Queries in our PaCaViT are based on patches, while keys and values are based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and realizing joint clustering-for-attention and attention-for-clustering when deployed in ViT models. The quadratic complexity is relaxed to linear complexity. Also, directly visualizing the learned clusters can reveal how a trained ViT model learns to perform a task (e.g., object detection). In experiments, the proposed PaCa-ViT is tested on CIFAR-100 and ImageNet-1000 image classification, and MS-COCO object detection and instance segmentation. Compared with prior arts, it obtains better performance in classification and comparable performance in detection and segmentation. It is significantly more efficient in COCO due to the linear complexity. The learned clusters are also semantically meaningful and shed light on designing more discriminative yet interpretable ViT models.},
howpublished = {arXiv preprint},
keywords = {},
pubstate = {published},
tppubtype = {workingpaper}
}
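The patch-to-cluster attention described in the abstract above can be illustrated with a short sketch: queries come from the N patch tokens, while keys and values come from a small number of learned cluster tokens, so the attention cost grows linearly in N rather than quadratically. The module below is a minimal PyTorch reading of that idea; the names (PaCaAttention, num_clusters) and the single-head, softmax-based clustering are assumptions for illustration, not the authors' released code.

```python
# Minimal patch-to-cluster attention sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaCaAttention(nn.Module):
    def __init__(self, dim, num_clusters=49):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # Cluster assignment head: maps each patch to soft cluster memberships.
        self.cluster_logits = nn.Linear(dim, num_clusters)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, N, C) patch tokens
        # Soft clustering: (B, N, M) memberships, normalized over patches.
        assign = F.softmax(self.cluster_logits(x), dim=1)
        clusters = torch.einsum('bnm,bnc->bmc', assign, x)  # (B, M, C) cluster tokens
        q = self.to_q(x)                                     # queries from patches
        k, v = self.to_kv(clusters).chunk(2, dim=-1)         # keys/values from clusters
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, M)
        return attn @ v                                      # (B, N, C), cost O(N * M)

x = torch.randn(2, 196, 256)
print(PaCaAttention(256)(x).shape)                           # torch.Size([2, 196, 256])
```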
2020
Xue, Nan; Wu, Tianfu; Bai, Song; Wang, Fudong; Xia, Gui-Song; Zhang, Liangpei; Torr, Philip H. S.
Holistically-Attracted Wireframe Parsing Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
@inproceedings{HAWP,
title = {Holistically-Attracted Wireframe Parsing},
author = {Nan Xue and Tianfu Wu and Song Bai and Fudong Wang and Gui-Song Xia and Liangpei Zhang and Philip H.S. Torr},
year = {2020},
date = {2020-02-23},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {This paper presents a fast and parsimonious parsing method to accurately and robustly detect a vectorized wireframe in an input image with a single forward pass. The proposed method is end-to-end trainable, consisting of three components: (i) line segment and junction proposal generation, (ii) line segment and junction matching, and (iii) line segment and junction verification.
For computing line segment proposals, a novel exact dual representation is proposed which exploits a parsimonious geometric reparameterization for line segments and forms a holistic 4-dimensional attraction field map for an input image. Junctions can be treated as the "basins" in the attraction field. The proposed method is thus called Holistically-Attracted Wireframe Parser (HAWP). In experiments, the proposed method is tested on two benchmarks, the Wireframe dataset and the YorkUrban dataset. On both benchmarks, it obtains state-of-the-art performance in terms of accuracy and efficiency. For example, on the Wireframe dataset, compared to the previous state-of-the-art method L-CNN, it improves the challenging mean structural average precision (msAP) by a large margin (2.8% absolute improvement), and achieves 29.5 FPS on a single GPU (89% relative improvement). A systematic ablation study is performed to further justify the proposed method.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
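As a rough illustration of the attraction-field idea in the abstract above, the sketch below computes, for every pixel, its displacement to the closest point on a line segment, so that junctions act as the "basins" where attractions converge. It only illustrates the concept; the paper's exact 4-dimensional dual reparameterization of line segments is not reproduced here, and the function name is hypothetical.

```python
# Conceptual attraction-field sketch (not HAWP's exact 4-D reparameterization).
import numpy as np

def attraction_field(h, w, p1, p2):
    """Per-pixel displacement to the closest point on segment p1 -> p2."""
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys], axis=-1).astype(float)     # (h, w, 2) pixel coordinates
    d = np.asarray(p2, float) - np.asarray(p1, float)   # segment direction
    t = ((pix - p1) @ d) / (d @ d)                      # projection parameter
    t = np.clip(t, 0.0, 1.0)                            # clamp to stay on the segment
    closest = p1 + t[..., None] * d                     # (h, w, 2) attraction points
    return closest - pix                                # attraction vectors

field = attraction_field(8, 8, p1=(1.0, 1.0), p2=(6.0, 5.0))
print(field.shape)                                      # (8, 8, 2)
```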
Xing, Xianglei; Wu, Tianfu; Zhu, Song-Chun; Wu, Ying Nian
Towards Interpretable Image Synthesis by Learning Sparsely Connected AND-OR Networks Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
@inproceedings{iGenerativeM,
title = {Towards Interpretable Image Synthesis by Learning Sparsely Connected AND-OR Networks},
author = {Xianglei Xing and Tianfu Wu and Song-Chun Zhu and Ying Nian Wu},
url = {https://arxiv.org/abs/1909.04324},
year = {2020},
date = {2020-02-23},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
journal = {CoRR},
abstract = {This paper proposes interpretable image synthesis by learning hierarchical AND-OR networks of sparsely connected semantically meaningful nodes. The proposed method is based on the compositionality and interpretability of scene-objects-parts-subparts-primitives hierarchy in image representation. A scene has different types (i.e., OR) each of which consists of a number of objects (i.e., AND). This can be recursively formulated across the scene-objects-parts-subparts hierarchy and is terminated at the primitive level (e.g., Gabor wavelets-like basis). To realize this interpretable AND-OR hierarchy in image synthesis, the proposed method consists of two components: (i) Each layer of the hierarchy is represented by an over-completed set of basis functions. The basis functions are instantiated using convolution to be translation covariant. Off-the-shelf convolutional neural architectures are then exploited to implement the hierarchy. (ii) Sparsity-inducing constraints are introduced in end-to-end training, which facilitate a sparsely connected AND-OR network to emerge from initially densely connected convolutional neural networks. A straightforward sparsity-inducing constraint is utilized, that is to only allow the top-k basis functions to be active at each layer (where k is a hyperparameter). The learned basis functions are also capable of image reconstruction to explain away input images. In experiments, the proposed method is tested on five benchmark datasets. The results show that meaningful and interpretable hierarchical representations are learned with better qualities of image synthesis and reconstruction obtained than state-of-the-art baselines.},
howpublished = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
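The sparsity-inducing constraint in the abstract above (keeping only the top-k basis responses active at each layer) can be sketched in a few lines. The per-spatial-location top-k shown below is one plausible reading, and the helper name is made up for illustration; it is not the authors' implementation.

```python
# Top-k sparsification sketch: keep the k strongest channel responses per location.
import torch

def topk_sparsify(feat, k):
    """Zero out all but the top-k channels at every spatial position."""
    _, idx = feat.abs().topk(k, dim=1)                   # indices of strongest responses
    mask = torch.zeros_like(feat).scatter_(1, idx, 1.0)  # binary keep-mask
    return feat * mask

x = torch.randn(2, 64, 16, 16)
y = topk_sparsify(x, k=8)
print((y != 0).sum(dim=1).float().mean())                # ~8 active channels per location
```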
2019
Li, Xilai; Song, Xi; Wu, Tianfu
AOGNets: Compositional Grammatical Architectures for Deep Learning Proceedings Article
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
@inproceedings{AOGNets,
title = {AOGNets: Compositional Grammatical Architectures for Deep Learning},
author = {Xilai Li and Xi Song and Tianfu Wu},
url = {http://openaccess.thecvf.com/content_CVPR_2019/papers/Li_AOGNets_Compositional_Grammatical_Architectures_for_Deep_Learning_CVPR_2019_paper.pdf
https://github.com/iVMCL/AOGNets
https://www.wraltechwire.com/2019/05/21/ncsu-researchers-create-framework-for-a-smarter-ai-are-seeking-patent/
https://www.technologynetworks.com/tn/news/new-framework-enhances-neural-network-performance-319704
},
year = {2019},
date = {2019-06-18},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {Neural architectures are the foundation for improving performance of deep neural networks (DNNs). This paper presents deep compositional grammatical architectures which harness the best of two worlds: grammar models and DNNs. The proposed architectures integrate compositionality and reconfigurability of the former and the capability of learning rich features of the latter in a principled way. We utilize AND-OR Grammar (AOG) as network generator in this paper and call the resulting networks AOGNets. An AOGNet consists of a number of stages each of which is composed of a number of AOG building blocks. An AOG building block splits its input feature map into N groups along feature channels and then treats it as a sentence of N words. It then jointly realizes a phrase structure grammar and a dependency grammar in bottom-up parsing the “sentence” for better feature exploration and reuse. It provides a unified framework for the best practices developed in state-of-the-art DNNs. In experiments, AOGNet is tested in the ImageNet-1K classification benchmark and the MS-COCO object detection and segmentation benchmark. In ImageNet-1K, AOGNet obtains better performance than ResNet and most of its variants, ResNeXt and its attention based variants such as SENet, DenseNet and DualPathNet. AOGNet also obtains the best model interpretability score using network dissection. AOGNet further shows better potential in adversarial defense. In MS-COCO, AOGNet obtains better performance than the ResNet and ResNeXt backbones in Mask R-CNN.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
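A toy reading of the AOG building block described above: the input's channels are split into N groups ("words"), every contiguous sub-sequence of groups is a node, AND nodes concatenate the outputs of a split into two sub-phrases, and OR nodes pool over the alternative splits. The 1x1 convolutions, the averaging at OR nodes, and the naive recursive evaluation below are simplifying assumptions, not the published architecture.

```python
# Toy AOG building block sketch (illustrative only; not the AOGNets implementation).
import torch
import torch.nn as nn

class ToyAOGBlock(nn.Module):
    def __init__(self, channels, n_groups=4):
        super().__init__()
        self.n = n_groups
        self.gc = channels // n_groups                     # channels per group ("word")
        # One 1x1 conv per sub-sequence node [i, j].
        self.node_ops = nn.ModuleDict({
            f'{i}_{j}': nn.Conv2d((j - i + 1) * self.gc, (j - i + 1) * self.gc, 1)
            for i in range(n_groups) for j in range(i, n_groups)
        })

    def node(self, groups, i, j):
        if i == j:                                         # terminal node: one word
            x = groups[i]
        else:                                              # OR over AND decompositions
            alts = [torch.cat([self.node(groups, i, k),
                               self.node(groups, k + 1, j)], dim=1)
                    for k in range(i, j)]
            x = torch.stack(alts).mean(0)                  # pool alternative parses
        return self.node_ops[f'{i}_{j}'](x)

    def forward(self, x):
        groups = torch.split(x, self.gc, dim=1)            # the N "words"
        return self.node(groups, 0, self.n - 1)            # root covers the sentence

blk = ToyAOGBlock(64, n_groups=4)
print(blk(torch.randn(2, 64, 8, 8)).shape)                 # torch.Size([2, 64, 8, 8])
```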
Sun, Wei; Wu, Tianfu
Image Synthesis from Reconfigurable Layout and Style Proceedings Article
In: International Conference on Computer Vision (ICCV), 2019.
@inproceedings{LostGAN,
title = {Image Synthesis from Reconfigurable Layout and Style},
author = {Wei Sun and Tianfu Wu},
url = {https://arxiv.org/abs/1908.07500
https://github.com/iVMCL/LostGANs},
year = {2019},
date = {2019-10-28},
booktitle = {International Conference on Computer Vision (ICCV)},
abstract = {Despite remarkable recent progress on both unconditional and conditional image synthesis, it remains a long-standing problem to learn generative models that are capable of synthesizing realistic and sharp images from reconfigurable spatial layout (i.e., bounding boxes + class labels in an image lattice) and style (i.e., structural and appearance variations encoded by latent vectors), especially at high resolution. By reconfigurable, it means that a model can preserve the intrinsic one-to-many mapping from a given layout to multiple plausible images with different styles, and is adaptive with respect to perturbations of a layout and style latent code. In this paper, we present a layout- and style-based architecture for generative adversarial networks (termed LostGANs) that can be trained end-to-end to generate images from reconfigurable layout and style. Inspired by the vanilla StyleGAN, the proposed LostGAN consists of two new components: (i) learning fine-grained mask maps in a weakly-supervised manner to bridge the gap between layouts and images, and (ii) learning object instance-specific layout-aware feature normalization (ISLA-Norm) in the generator to realize multi-object style generation. In experiments, the proposed method is tested on the COCO-Stuff dataset and the Visual Genome dataset with state-of-the-art performance obtained.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
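The instance-specific layout-aware normalization (ISLA-Norm) mentioned in the abstract can be sketched as follows: each object instance carries a style code that is projected to channel-wise scale and shift and spread over the feature map through that instance's layout mask. The normalization choice (instance norm), the soft masks, and the class name below are illustrative assumptions rather than the LostGAN implementation.

```python
# Toy instance-specific, layout-aware feature modulation in the spirit of ISLA-Norm.
import torch
import torch.nn as nn

class ToyISLANorm(nn.Module):
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, feat, styles, masks):
        # feat: (B, C, H, W); styles: (B, K, S), one code per instance;
        # masks: (B, K, H, W) soft layout masks for the K instances.
        gamma = torch.einsum('bkc,bkhw->bchw', self.to_gamma(styles), masks)
        beta = torch.einsum('bkc,bkhw->bchw', self.to_beta(styles), masks)
        return self.norm(feat) * (1 + gamma) + beta        # spatially varying modulation

layer = ToyISLANorm(channels=64, style_dim=128)
feat = torch.randn(2, 64, 16, 16)
styles = torch.randn(2, 3, 128)                            # 3 object instances
masks = torch.rand(2, 3, 16, 16).softmax(dim=1)            # soft per-instance masks
print(layer(feat, styles, masks).shape)                    # torch.Size([2, 64, 16, 16])
```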
Li, Xilai; Zhou, Yingbo; Wu, Tianfu; Socher, Richard; Xiong, Caiming
Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting Proceedings Article
In: International Conference on Machine Learning (ICML), 2019.
@inproceedings{Learn2grow,
title = {Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting},
author = {Xilai Li and Yingbo Zhou and Tianfu Wu and Richard Socher and Caiming Xiong},
url = {https://arxiv.org/abs/1904.00310
https://news.ncsu.edu/2019/05/ai-continual-learning-framework/
https://www.army.mil/article/222090/army_funded_research_boosts_memory_of_ai_systems
https://news.science360.gov/archives/20190517
https://techxplore.com/news/2019-05-framework-artificial-intelligence.html
https://www.wraltechwire.com/2019/05/15/researchers-create-framework-to-help-artificial-intelligence-systems-be-less-forgetful/},
year = {2019},
date = {2019-06-11},
booktitle = {International Conference on Machine Learning (ICML)},
abstract = {Addressing catastrophic forgetting is one of the key challenges in continual learning where machine learning systems are trained with sequential or streaming tasks. Despite recent remarkable progress in state-of-the-art deep learning, deep neural networks (DNNs) are still plagued with the catastrophic forgetting problem. This paper presents a conceptually simple yet general and effective framework for handling catastrophic forgetting in continual learning with DNNs. The proposed method consists of two components: a neural structure optimization component and a parameter learning and/or fine-tuning component. By separating the explicit neural structure learning and the parameter estimation, not only is the proposed method capable of evolving neural structures in an intuitively meaningful way, but also shows strong capabilities of alleviating catastrophic forgetting in experiments. Furthermore, the proposed method outperforms all other baselines on the permuted MNIST dataset, the split CIFAR100 dataset and the Visual Domain Decathlon dataset in continual learning setting.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
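The structure-search space described in the abstract (deciding, per layer and per task, whether to reuse a shared layer, adapt it, or add a new one) can be sketched as below. The 1x1 adapter and the explicit choice strings are assumptions for illustration; in the paper these per-layer choices are found by a differentiable structure optimization rather than picked by hand.

```python
# Sketch of the "reuse / adapt / new" per-layer choices in continual structure learning.
import torch
import torch.nn as nn

class AdaptedLayer(nn.Module):
    """'adapt': keep the shared layer and add a small task-specific adapter."""
    def __init__(self, shared, channels):
        super().__init__()
        self.shared = shared
        self.adapter = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.shared(x) + self.adapter(x)

def build_task_layer(choice, shared, channels):
    if choice == 'reuse':                                  # share the existing layer as-is
        return shared
    if choice == 'adapt':                                  # shared weights plus a 1x1 adapter
        return AdaptedLayer(shared, channels)
    return nn.Conv2d(channels, channels, 3, padding=1)     # 'new': fresh task-specific layer

shared = nn.Conv2d(32, 32, 3, padding=1)
task_layers = [build_task_layer(c, shared, 32) for c in ('reuse', 'adapt', 'new')]
x = torch.randn(1, 32, 8, 8)
for layer in task_layers:
    print(layer(x).shape)                                  # torch.Size([1, 32, 8, 8]) each
```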
Wu, Tianfu; Song, Xi
Towards Interpretable Object Detection by Unfolding Latent Structures Proceedings Article
In: International Conference on Computer Vision (ICCV), 2019.
@inproceedings{iRCNN,
title = {Towards Interpretable Object Detection by Unfolding Latent Structures},
author = {Tianfu Wu and Xi Song},
year = {2019},
date = {2019-10-28},
booktitle = {International Conference on Computer Vision (ICCV)},
abstract = {This paper first proposes a method of formulating model interpretability in visual understanding tasks based on the idea of unfolding latent structures. It then presents a case study in object detection using popular two-stage region-based convolutional network (i.e., R-CNN) detection systems. The proposed method focuses on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection without using any supervision for part configurations. It utilizes a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent part configurations of regions of interest (RoIs). It presents an AOGParsing operator that seamlessly integrates with the RoIPooling/RoIAlign operator widely used in R-CNN and is trained end-to-end. In object detection, a bounding box is interpreted by the best parse tree derived from the AOG on-the-fly, which is treated as the qualitatively extractive rationale generated for interpreting detection. In experiments, Faster R-CNN is used to test the proposed method on the PASCAL VOC 2007 and the COCO 2017 object detection datasets. The experimental results show that the proposed method can compute promising latent structures without hurting the performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
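A toy version of deriving the "best parse tree" from an AND-OR graph over an RoI, as described in the abstract: OR nodes choose the highest-scoring alternative, and AND nodes combine the scores of the two sub-configurations they bind. The 1-D grid of cells, the additive scoring, and the memoized recursion below are illustrative assumptions, not the paper's AOGParsing operator.

```python
# Toy best-parse derivation over an AND-OR graph of sub-configurations.
from functools import lru_cache

import numpy as np

rng = np.random.default_rng(0)
N = 4                                      # RoI split into N cells ("parts")
cell_scores = rng.normal(size=N)           # stand-in per-part detection scores

@lru_cache(maxsize=None)
def best_parse(i, j):
    """Best score and parse tree for the sub-configuration covering cells i..j."""
    terminal = float(cell_scores[i:j + 1].sum())
    best, tree = terminal, (i, j)          # OR choice 1: keep [i, j] as one part
    for k in range(i, j):                  # OR choices: AND of two sub-spans
        ls, lt = best_parse(i, k)
        rs, rt = best_parse(k + 1, j)
        if ls + rs > best:
            best, tree = ls + rs, (lt, rt)
    return best, tree

score, parse = best_parse(0, N - 1)
print(score, parse)                        # best latent part configuration
```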