Research
Our research has been generously supported by ARO, NSF, AFRL, IARPA, BlueHalo, and Salesforce.
2024
Liu, Xianpeng; Zheng, Ce; Qian, Ming; Xue, Nan; Chen, Chen; Zhang, Zhebin; Li, Chen; Wu, Tianfu
Multi-View Attentive Contextualization for Multi-View 3D Object Detection Proceedings Forthcoming
In: CVPR'24, Forthcoming.
@proceedings{mvacon,
title = {Multi-View Attentive Contextualization for Multi-View 3D Object Detection},
author = {Xianpeng Liu and Ce Zheng and Ming Qian and Nan Xue and Chen Chen and Zhebin Zhang and Chen Li and Tianfu Wu},
year = {2024},
date = {2024-06-18},
urldate = {2024-06-18},
abstract = {We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in the field of query-based MV3D object detection, prior art often suffers either from the lack of exploiting high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in
sparse attention-based lifting. Our proposed MvACon hits two birds with one stone using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as the PETR, showing consistent detection performance improvement, especially in enhancing performance in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision \textendash “(contextualized) feature matters”.},
howpublished = {In: CVPR'24},
keywords = {},
pubstate = {forthcoming},
tppubtype = {proceedings}
}
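The core mechanism, attentive contextualization against a small set of global clusters, is simple enough to summarize in code. Below is a minimal PyTorch sketch of the idea as described in the abstract; it is not the authors' implementation, and the layer names, cluster count, and residual form are our own assumptions:

import torch
import torch.nn as nn

class ClusterContextualizer(nn.Module):
    """Contextualize flattened multi-view 2D features against K global clusters."""
    def __init__(self, dim=256, num_clusters=8):
        super().__init__()
        self.to_logits = nn.Linear(dim, num_clusters)  # learned soft cluster assignment
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, feats):                    # feats: (B, N, C), N = all view tokens
        a = self.to_logits(feats).softmax(dim=1)              # (B, N, K), sums over N
        centers = torch.einsum('bnk,bnc->bkc', a, feats)      # (B, K, C) cluster contexts
        q = self.q(feats)                                     # (B, N, C)
        k, v = self.kv(centers).chunk(2, dim=-1)              # (B, K, C) each
        attn = (q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5).softmax(dim=-1)  # (B, N, K)
        return feats + attn @ v      # contextualized features for any 2D-to-3D lifting

feats = torch.randn(2, 4096, 256)                # e.g. six camera views, flattened
print(ClusterContextualizer()(feats).shape)      # torch.Size([2, 4096, 256])

Because the contexts are only K cluster vectors, the per-token attention is dense in what it summarizes but sparse in what it computes, which is the trade-off the abstract refers to.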
Xue, Nan; Tan, Bin; Xiao, Yuxi; Dong, Liang; Xia, Gui-Song; Wu, Tianfu; Shen, Yujun
NEAT: Distilling 3D Wireframes from Neural Attraction Fields Proceedings Forthcoming
In: CVPR'24, Forthcoming.
@proceedings{neat,
title = {NEAT: Distilling 3D Wireframes from Neural Attraction Fields},
author = {Nan Xue and Bin Tan and Yuxi Xiao and Liang Dong and Gui-Song Xia and Tianfu Wu and Yujun Shen},
year = {2024},
date = {2024-06-18},
urldate = {2024-06-18},
abstract = {This paper studies the problem of structured 3D reconstruction using wireframes that consist of line segments and junctions, focusing on the computation of structured boundary geometries of scenes. Instead of leveraging matching-based solutions from 2D wireframes (or line segments) for 3D wireframe reconstruction as done in prior art, we present NEAT, a \textbf{rendering-distilling} formulation using neural fields to represent 3D line segments with 2D observations, and bipartite matching for perceiving and distilling a sparse set of 3D global junctions. The proposed {NEAT} enjoys the joint optimization of the neural fields and the global junctions from scratch, using view-dependent 2D observations without precomputed cross-view feature matching.
Comprehensive experiments on the DTU and BlendedMVS datasets demonstrate our NEAT's superiority over state-of-the-art alternatives for 3D wireframe reconstruction. Moreover, the 3D global junctions distilled by NEAT are a better initialization than SfM points for the recently emerged 3D Gaussian Splatting for high-fidelity novel view synthesis, using about 20 times fewer initial 3D points.},
howpublished = {In: CVPR'24},
keywords = {},
pubstate = {forthcoming},
tppubtype = {proceedings}
}
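The bipartite-matching step for grounding a sparse set of 3D global junctions on 2D junction observations can be illustrated with a few lines of NumPy/SciPy. The sketch below is ours, not the paper's code; the pinhole camera model and plain reprojection cost are simplifying assumptions:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_junctions(junctions_3d, K, R, t, junctions_2d):
    # Pinhole projection of candidate 3D junctions into the current view.
    cam = R @ junctions_3d.T + t[:, None]            # (3, M) camera-frame points
    proj = (K @ cam)[:2] / (K @ cam)[2]              # (2, M) pixel coordinates
    # Pairwise reprojection cost, then one-to-one Hungarian assignment.
    cost = np.linalg.norm(proj.T[:, None, :] - junctions_2d[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)         # matches min(M, N) pairs
    return rows, cols, cost[rows, cols].mean()

M, N = 16, 12
junctions_3d = np.random.randn(M, 3) + [0.0, 0.0, 5.0]   # points in front of camera
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
junctions_2d = np.random.rand(N, 2) * [640, 480]         # detected 2D junctions
rows, cols, err = match_junctions(junctions_3d, K, np.eye(3), np.zeros(3), junctions_2d)
print(rows[:5], cols[:5], round(err, 2))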
2023
Paniagua, Thomas; Grainger, Ryan; Wu, Tianfu
QuadAttacK: A Quadratic Programming Approach to Learning Ordered Top-K Adversarial Attacks Proceedings
In: NeurIPS'23, 2023.
@proceedings{quadattack,
title = {QuadAttacK: A Quadratic Programming Approach to Learning Ordered Top-K Adversarial Attacks},
author = {Thomas Paniagua and Ryan Grainger and Tianfu Wu},
url = {https://arxiv.org/abs/2312.11510},
year = {2023},
date = {2023-12-19},
urldate = {2023-12-19},
abstract = {The adversarial vulnerability of Deep Neural Networks (DNNs) has been well-known and widely concerning, often under the context of learning top-$1$ attacks (e.g., fooling a DNN to classify a cat image as dog). This paper shows that the concern is much more serious by learning significantly more aggressive ordered top-$K$ clear-box~\footnote{This is often referred to as white/black-box attacks in the literature. We choose to adopt neutral terminology, clear/opaque-box attacks in this paper, and omit the prefix clear-box for simplicity.} targeted attacks proposed in~\citep{zhang2020learning}. We propose a novel and rigorous quadratic programming (QP) method of learning ordered top-$K$ attacks with low computing cost, dubbed as \textbf{QuadAttac$K$}. Our QuadAttac$K$ directly solves the QP to satisfy the attack constraint in the feature embedding space (i.e., the input space to the final linear classifier), which thus exploits the semantics of the feature embedding space (i.e., the principle of class coherence). With the optimized feature embedding vector perturbation, it then computes the adversarial perturbation in the data space via the vanilla one-step back-propagation. In experiments, the proposed QuadAttac$K$ is tested in the ImageNet-1k classification using ResNet-50, DenseNet-121, and Vision Transformers (ViT-B and DEiT-S). It successfully pushes the boundary of successful ordered top-$K$ attacks from $K=10$ up to $K=20$ at a cheap budget ($1\times 60$) and further improves attack success rates for $K=5$ for all tested models, while retaining the performance for $K=1$.},
howpublished = {In: NeurIPS'23},
keywords = {},
pubstate = {published},
tppubtype = {proceedings}
}
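The heart of the method, a QP in the feature embedding space, is easy to state concretely. Here is a hedged cvxpy sketch of what we take the attack constraint to be from the abstract: rank K chosen classes, in order, above all other logits of the final linear classifier, with the smallest feature perturbation. The margin and problem sizes are illustrative assumptions:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
C, D = 20, 64                          # number of classes, feature dimension
W, b = rng.normal(size=(C, D)), rng.normal(size=C)   # final linear classifier
z = rng.normal(size=D)                 # clean feature embedding of the input
targets = [3, 7, 11]                   # desired ordered top-K labels (K = 3)
margin = 0.1                           # illustrative ranking margin

dz = cp.Variable(D)                    # feature-space perturbation to optimize
logits = W @ (z + dz) + b
cons = [logits[targets[i]] >= logits[targets[i + 1]] + margin
        for i in range(len(targets) - 1)]            # enforce order among targets
cons += [logits[targets[-1]] >= logits[c] + margin
         for c in range(C) if c not in targets]      # targets beat all other classes
cp.Problem(cp.Minimize(cp.sum_squares(dz)), cons).solve()
print(np.argsort(W @ (z + dz.value) + b)[::-1][:3])  # prints [ 3  7 11]
# In the full method, dz is mapped back to an image perturbation by one step of
# back-propagation through the feature extractor.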
2022
Grainger, Ryan; Paniagua, Thomas; Song, Xi; Wu, Tianfu
Learning Patch-to-Cluster Attention in Vision Transformer Working paper
arXiv preprint, 2022.
@workingpaper{PaCaViT,
title = {Learning Patch-to-Cluster Attention in Vision Transformer},
author = {Ryan Grainger and Thomas Paniagua and Xi Song and Tianfu Wu},
url = {https://arxiv.org/abs/2203.11987},
year = {2022},
date = {2022-03-23},
abstract = {The Vision Transformer (ViT) model is built on the assumption of treating image patches as "visual tokens" and learning patch-to-patch attention. The patch embedding based tokenizer is a workaround in practice and has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViT models. To address these issues in ViT models, this paper proposes to learn patch-to-cluster attention (PaCa) based ViT models. Queries in our PaCa-ViT are based on patches, while keys and values are based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and realizing joint clustering-for-attention and attention-for-clustering when deployed in ViT models. The quadratic complexity is relaxed to linear complexity. Also, directly visualizing the learned clusters can reveal how a trained ViT model learns to perform a task (e.g., object detection). In experiments, the proposed PaCa-ViT is tested on CIFAR-100 and ImageNet-1000 image classification, and MS-COCO object detection and instance segmentation. Compared with prior arts, it obtains better performance in classification and comparable performance in detection and segmentation. It is significantly more efficient in COCO due to the linear complexity. The learned clusters are also semantically meaningful and shed light on designing more discriminative yet interpretable ViT models.},
howpublished = {arXiv preprint},
keywords = {},
pubstate = {published},
tppubtype = {workingpaper}
}
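A minimal sketch of patch-to-cluster attention, assuming the formulation in the abstract (queries from patches, keys/values from a small number of end-to-end learned clusters), is given below in PyTorch. Layer sizes and head counts are illustrative, not the paper's configuration:

import torch
import torch.nn as nn

class PaCaAttention(nn.Module):
    def __init__(self, dim=192, num_clusters=49, heads=3):
        super().__init__()
        self.heads = heads
        self.cluster = nn.Linear(dim, num_clusters)    # end-to-end learned clustering
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, N, C) patch tokens
        B, N, C = x.shape
        z = self.cluster(x).softmax(dim=1)             # (B, N, M) soft assignments
        c = torch.einsum('bnm,bnc->bmc', z, x)         # (B, M, C) cluster tokens
        q = self.q(x).view(B, N, self.heads, -1).transpose(1, 2)      # (B, h, N, d)
        kv = self.kv(c).view(B, -1, 2, self.heads, C // self.heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                              # (B, h, M, d) each
        attn = (q @ k.transpose(-2, -1) / (C // self.heads) ** 0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)   # O(N*M) cost, not O(N^2)
        return self.proj(out)

print(PaCaAttention()(torch.randn(2, 196, 192)).shape)  # torch.Size([2, 196, 192])

Since the number of clusters M is fixed and small, attention cost grows linearly with the number of patches N, which is the complexity relaxation the abstract claims.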
2020
Zhang, Zekun; Wu, Tianfu
Learning Ordered Top-k Adversarial Attacks via Adversarial Distillation Workshop
CVPRW 2020 Adversarial Machine Learning in Computer Vision, 2020.
@workshop{AdvDistillation,
title = {Learning Ordered Top-k Adversarial Attacks via Adversarial Distillation},
author = {Zekun Zhang and Tianfu Wu},
url = {https://openaccess.thecvf.com/content_CVPRW_2020/papers/w47/Zhang_Learning_Ordered_Top-k_Adversarial_Attacks_via_Adversarial_Distillation_CVPRW_2020_paper.pdf},
year = {2020},
date = {2020-06-14},
booktitle = {CVPRW 2020 Adversarial Machine Learning in Computer Vision},
journal = {CoRR},
volume = {abs/1905.10695},
abstract = {Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, especially white-box targeted attacks. One scheme of learning attacks is to design a proper adversarial objective function that leads to the imperceptible perturbation for any test image (e.g., the Carlini-Wagner (C\&W) method). Most methods address targeted attacks in the Top-1 manner. In this paper, we propose to learn ordered Top-k attacks (k ≥ 1) for image classification tasks, that is, to enforce the Top-k predicted labels of an adversarial example to be the k (randomly) selected and ordered labels (the ground-truth label is excluded). To this end, we present an adversarial distillation framework: First, we compute an adversarial probability distribution for any given ordered Top-k targeted labels with respect to the ground-truth of a test image. Then, we learn adversarial examples by minimizing the Kullback-Leibler (KL) divergence together with the perturbation energy penalty, similar in spirit to the network distillation method. We explore how to leverage label semantic similarities in computing the targeted distributions, leading to knowledge-oriented attacks. In experiments, we thoroughly test Top-1 and Top-5 attacks in the ImageNet-1000 validation dataset using two popular DNNs trained with the clean ImageNet-1000 train dataset, ResNet-50 and DenseNet-121. For both models, our proposed adversarial distillation approach outperforms the C\&W method in the Top-1 setting, as well as other baseline methods. Our approach shows significant improvement in the Top-5 setting against a strong modified C\&W method.},
howpublished = {CVPRW20 Adversarial Machine Learning in Computer Vision},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
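The adversarial-distillation objective described above, KL divergence to a target distribution over ordered Top-k labels plus a perturbation-energy penalty, can be sketched in a few lines of PyTorch. The geometric target distribution, tiny linear model, and step sizes below are our assumptions for illustration; the paper also explores semantics-aware target distributions:

import torch
import torch.nn.functional as F

def adv_distill_step(model, x, delta, targets, lam=1e-2, lr=0.01):
    num_classes = model(x).shape[-1]
    # Target distribution: the ordered Top-k labels receive geometrically decaying
    # probability mass; every other class (incl. the ground truth) gets 1e-4.
    p = torch.full((num_classes,), 1e-4)
    w = torch.tensor([0.5 ** i for i in range(len(targets))])
    p[targets] = (1.0 - 1e-4 * (num_classes - len(targets))) * w / w.sum()
    logits = model(x + delta)
    loss = F.kl_div(F.log_softmax(logits, dim=-1), p.expand_as(logits),
                    reduction='batchmean') + lam * delta.pow(2).sum()
    loss.backward()
    with torch.no_grad():              # plain gradient step on the perturbation
        delta -= lr * delta.grad
        delta.grad.zero_()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
delta = torch.zeros_like(x, requires_grad=True)
for _ in range(100):
    adv_distill_step(model, x, delta, targets=[7, 2, 5])
print(model(x + delta).argsort(descending=True)[0, :3])  # aims at tensor([7, 2, 5])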
2019
Sun, Wei; Bappy, Jawadul H; Yang, Shanglin; Xu, Yi; Wu, Tianfu; Zhou, Hui
Pose Guided Fashion Image Synthesis Using Deep Generative Model Workshop
The fourth international workshop on fashion and KDD, 2019.
@workshop{PoseGuidedSynthesis,
title = {Pose Guided Fashion Image Synthesis Using Deep Generative Model},
author = {Wei Sun and Jawadul H Bappy and Shanglin Yang and Yi Xu and Tianfu Wu and Hui Zhou},
url = {http://arxiv.org/abs/1906.07251},
year = {2019},
date = {2019-08-05},
journal = {The fourth international workshop on fashion and KDD},
abstract = {Generating a photorealistic image with intended human pose is a promising yet challenging research topic for many applications such as smart photo editing, movie making, virtual try-on, and fashion display. In this paper, we present a novel deep generative model to transfer an image of a person from a given pose to a new pose while keeping fashion item consistent. In order to formulate the framework, we employ one generator and two discriminators for image synthesis. The generator includes an image encoder, a pose encoder and a decoder. The two encoders provide good representation of visual and geometrical context which will be utilized by the decoder in order to generate a photorealistic image. Unlike existing pose-guided image generation models, we exploit two discriminators to guide the synthesis process where one discriminator differentiates between generated image and real images (training samples), and another discriminator verifies the consistency of appearance between a target pose and a generated image. We perform end-to-end training of the network to learn the parameters through back-propagation given ground-truth images. The proposed generative model is capable of synthesizing a photorealistic image of a person given a target pose. We have demonstrated our results by conducting rigorous experiments on two data sets, both quantitatively and qualitatively.},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
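The one-generator, two-discriminator layout the abstract describes can be written down as a skeleton. The block below is an illustrative PyTorch sketch only, with assumed channel counts and 18-channel keypoint heatmaps as the pose input; the actual encoders and decoder are much deeper:

import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2))

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_enc = conv_block(3, 64)      # visual context of the source person
        self.pose_enc = conv_block(18, 64)    # 18 keypoint heatmaps for target pose
        self.dec = nn.Sequential(nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh())

    def forward(self, img, pose):
        return self.dec(torch.cat([self.img_enc(img), self.pose_enc(pose)], 1))

real_fake_D = nn.Sequential(conv_block(3, 64), nn.Conv2d(64, 1, 1))          # realism
consistency_D = nn.Sequential(conv_block(3 + 18, 64), nn.Conv2d(64, 1, 1))   # pose match

G = Generator()
img, pose = torch.randn(1, 3, 128, 128), torch.randn(1, 18, 128, 128)
fake = G(img, pose)
print(fake.shape, real_fake_D(fake).shape,
      consistency_D(torch.cat([fake, pose], 1)).shape)

The second discriminator sees the generated image concatenated with the target pose maps, which is what lets it verify appearance-pose consistency rather than realism alone.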
2018
Lanka, Sameera; Wu, Tianfu
ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay Workshop
NeurIPS 2018 Deep RL workshop, 2018.
@workshop{ARCHER,
title = {ARCHER: Aggressive Rewards to Counter bias in Hindsight Experience Replay},
author = {Sameera Lanka and Tianfu Wu},
url = {https://arxiv.org/abs/1809.02070},
year = {2018},
date = {2018-01-01},
booktitle = {NeurIPS 2018 Deep RL workshop},
abstract = {Experience replay is an important technique for addressing sample-inefficiency in deep reinforcement learning (RL), but faces difficulty in learning from binary and sparse rewards due to disproportionately few successful experiences in the replay buffer. Hindsight experience replay (HER) was recently proposed to tackle this difficulty by manipulating unsuccessful transitions, but in doing so, HER introduces a significant bias in the replay buffer experiences and therefore achieves a suboptimal improvement in sample-efficiency. In this paper, we present an analysis on the source of bias in HER, and propose a simple and effective method to counter the bias, to most effectively harness the sample-efficiency provided by HER. Our method, motivated by counter-factual reasoning and called ARCHER, extends HER with a trade-off to make rewards calculated for hindsight experiences numerically greater than real rewards. We validate our algorithm on two continuous control environments from DeepMind Control Suite - Reacher and Finger, which simulate manipulation tasks with a robotic arm - in combination with various reward functions, task complexities and goal sampling strategies. Our experiments consistently demonstrate that countering bias using more aggressive hindsight rewards increases sample efficiency, thus establishing the greater benefit of ARCHER in RL applications with limited computing budget.},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
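ARCHER's trade-off sits entirely in the replay-buffer relabeling, so a short NumPy sketch captures it. Under the sparse {-1, 0} reward convention and the 'final' goal-sampling strategy, one simple way to make hindsight rewards numerically greater than real rewards is to shrink their magnitude; the scale factor here is an illustrative assumption:

import numpy as np

def sparse_reward(achieved, goal, eps=0.05):
    # Binary sparse reward: 0 on success, -1 otherwise.
    return 0.0 if np.linalg.norm(achieved - goal) < eps else -1.0

def relabel_episode(episode, hindsight_scale=0.5):
    """episode: list of (state, action, achieved_goal, goal); returns replay tuples."""
    final_achieved = episode[-1][2]               # the 'final' goal-sampling strategy
    replay = []
    for state, action, achieved, goal in episode:
        # Real transition with its real reward.
        replay.append((state, action, goal, sparse_reward(achieved, goal)))
        # Hindsight transition: pretend the finally-achieved state was the goal.
        # Shrinking the magnitude of the (negative) hindsight reward makes it
        # numerically greater than the real reward, ARCHER's counter-bias trade-off.
        r_h = hindsight_scale * sparse_reward(achieved, final_achieved)
        replay.append((state, action, final_achieved, r_h))
    return replay

episode = [(np.zeros(4), np.zeros(2), np.random.rand(3), np.ones(3)) for _ in range(5)]
print(len(relabel_episode(episode)))   # 10: one real + one hindsight tuple per step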
2017
Zhao, Bo; Wu, Botong; Wu, Tianfu; Wang, Yizhou
Zero-Shot Learning Posed as a Missing Data Problem Workshop
2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, 2017.
@workshop{Zhao_ZeroShot,
title = {Zero-Shot Learning Posed as a Missing Data Problem},
author = {Bo Zhao and Botong Wu and Tianfu Wu and Yizhou Wang},
url = {https://arxiv.org/abs/1612.00560},
doi = {10.1109/ICCVW.2017.310},
year = {2017},
date = {2017-01-01},
booktitle = {2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017},
pages = {2616--2622},
abstract = {This paper presents a method of zero-shot learning (ZSL) which poses ZSL as the missing data problem, rather than the missing label problem. While most popular methods in ZSL focus on learning the mapping function from the image feature space to the label embedding space, the proposed method explores a simple yet effective transductive framework in the reverse mapping. Our method estimates data distribution of unseen classes in the image feature space by transferring knowledge from the label embedding space. It assumes that data of each seen and unseen class follow Gaussian distribution in the image feature space and utilizes Gaussian mixture model to model data. The signature is introduced to describe the data distribution of each class. In experiments, our method obtains 87.38% and 61.08% mean accuracies on the Animals with Attributes (AwA) and the Caltech-UCSD Birds-200-2011 (CUB) datasets respectively, which outperforms the runner-up methods significantly by 4.95% and 6.38%. In addition, we also investigate the extension of our method to open-set classification.},
howpublished = {arXiv preprint},
keywords = {},
pubstate = {published},
tppubtype = {workshop}
}
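The transductive pipeline the abstract describes, per-class Gaussians in image-feature space whose unseen-class parameters are transferred from the label embedding space and then refined on unlabeled test data, can be sketched with NumPy/SciPy. The synthetic data, the similarity-weighted transfer rule, and the identity covariance are our assumptions:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, n_seen, n_unseen = 16, 5, 2
seen_means = rng.normal(size=(n_seen, D)) * 3.0      # per-class Gaussians (seen)
seen_emb = rng.normal(size=(n_seen, 8))              # label embeddings, e.g. attributes
unseen_emb = rng.normal(size=(n_unseen, 8))

# Transfer: initialize each unseen-class 'signature' (here just the mean) as a
# similarity-weighted combination of seen-class means in image-feature space.
sim = unseen_emb @ seen_emb.T
w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
unseen_means = w @ seen_means

# Transductive refinement: EM over the unlabeled test pool with a Gaussian mixture
# (identity covariance for simplicity).
true_means = rng.normal(size=(n_unseen, D)) * 3.0
X = np.vstack([m + rng.normal(size=(50, D)) for m in true_means])   # unlabeled pool
for _ in range(10):
    resp = np.stack([multivariate_normal(m, np.eye(D)).pdf(X) for m in unseen_means])
    resp /= resp.sum(axis=0)                          # (n_unseen, n_points)
    unseen_means = (resp @ X) / resp.sum(axis=1, keepdims=True)
print(resp.argmax(axis=0)[:10], resp.argmax(axis=0)[50:60])  # two recovered clusters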