|
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li,
Ramprasaath R. Selvaraju,
Akhilesh Deepak Gotmare,
Shafiq Joty,
Caiming Xiong,
Steven Hoi,
NeurIPS, 2021 (Spotlight)
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders-of-magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art, while enjoying faster inference speed.
code /
blog
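For intuition, here is a minimal, hedged sketch of a symmetric image-text contrastive (InfoNCE-style) loss of the kind used to align representations before fusion. The projection dimension, temperature, and random stand-in features are assumptions; ALBEF's full objective additionally uses momentum distillation and feature queues that are not shown.

```python
# Minimal sketch of an image-text contrastive (ITC) alignment loss, in the spirit
# of pre-fusion alignment. Feature extractors, projection dim, and the temperature
# are assumptions; momentum distillation and queues are omitted.
import torch
import torch.nn.functional as F

def itc_loss(image_feats: torch.Tensor,
             text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (batch, dim) projections of [CLS] embeddings."""
    # L2-normalize so dot products become cosine similarities.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE: match each image to its text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random features standing in for encoder outputs.
loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```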
|
|
SOrTing VQA Models: Improving Consistency via Gradient Alignment
Sameer Dharur,
Purva Tendulkar,
Dhruv Batra,
Devi Parikh,
Ramprasaath R. Selvaraju
NAACL, 2021
|
|
CASTing Your Model: Learning to Localize Improves Self-Supervised Representations
Ramprasaath R. Selvaraju,
Karan Desai,
Justin Johnson,
Nikhil Naik
CVPR, 2021
Introducing CAST, a generic training recipe that fixes the visual grounding ability of contrastive SSL models (e.g., MoCo) and makes them work on complex scene images (e.g., COCO)!
|
|
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju,
Purva Tendulkar,
Devi Parikh,
Eric Horvitz,
Marco Tulio Ribeiro,
Besmira Nushi,
Ece Kamar
CVPR, 2020 (Oral)
We investigate the capabilities of VQA models for solving tasks that differ in nature and in complexity. We notice that existing VQA models have consistency issues: they answer complex reasoning questions correctly but fail on the associated low-level perception sub-questions. We quantify the extent of this phenomenon by creating a new Reasoning split and collecting Sub-VQA, a new dataset of the perception sub-questions needed to effectively answer each main reasoning question. We also propose SQuINT, an approach that encourages models to be right for the right reasons.
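As a rough illustration of the consistency issue quantified above, here is a hedged sketch of one way to score it: the fraction of correctly answered reasoning questions whose associated perception sub-questions are also all answered correctly. The record layout and field names are illustrative, not the Sub-VQA format.

```python
# Sketch of a consistency measure: among reasoning questions the model gets right,
# how often does it also get every associated perception sub-question right?
from typing import Dict, List

def consistency(records: List[Dict]) -> float:
    """Each record: {'main_correct': bool, 'sub_correct': [bool, ...]}."""
    main_right = [r for r in records if r["main_correct"]]
    if not main_right:
        return 0.0
    consistent = sum(all(r["sub_correct"]) for r in main_right)
    return consistent / len(main_right)

print(consistency([
    {"main_correct": True,  "sub_correct": [True, True]},
    {"main_correct": True,  "sub_correct": [True, False]},  # inconsistent
    {"main_correct": False, "sub_correct": [True]},
]))  # -> 0.5
```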
|
|
Visual Explanations from Deep Networks
Ramprasaath R. Selvaraju,
Michael Cogswell,
Abhishek Das,
Ramakrishna Vedantam,
Devi Parikh,
Dhruv Batra
IJCV, 2019
arxiv /
blogpost /
code /
demo
|
|
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Ramprasaath R. Selvaraju,
Stefan Lee,
Yilin Shen,
Hongxia Jin,
Shalini Ghosh,
Larry Heck,
Dhruv Batra,
Devi Parikh
ICCV, 2019
arxiv /
blogpost
We notice that many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. To tackle this, we propose Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding. We show that encouraging these models to look at the same regions as humans helps them generalize better to new distributions.
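Below is a hedged sketch of a ranking-style alignment loss in the spirit of HINT: penalize region pairs that the network ranks differently from human importance. The plain score tensors and margin-free hinge are simplifying assumptions; the paper derives network importance from gradient-based explanations.

```python
# Hedged sketch of a HINT-style ranking loss: wherever humans rank one region above
# another, ask the network's importance scores to agree, and penalize violations.
import torch

def ranking_alignment_loss(net_importance: torch.Tensor,
                           human_importance: torch.Tensor) -> torch.Tensor:
    """Both tensors: (num_regions,) importance scores for the same regions."""
    # Pairwise score differences over all ordered region pairs.
    diff_net = net_importance.unsqueeze(0) - net_importance.unsqueeze(1)
    diff_hum = human_importance.unsqueeze(0) - human_importance.unsqueeze(1)

    # Pairs where humans prefer one region; hinge is positive when the net disagrees.
    human_prefers = (diff_hum > 0).float()
    hinge = torch.clamp(-diff_net, min=0.0)
    return (human_prefers * hinge).sum() / human_prefers.sum().clamp(min=1.0)

loss = ranking_alignment_loss(torch.rand(6), torch.rand(6))
print(loss.item())
```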
|
|
Trick or Treat: Thematic Reinforcement for Artistic Typography
Purva Tendulkar,
Kalpesh Krishna,
Ramprasaath R. Selvaraju,
Devi Parikh
ICCC, 2019
arxiv /
code /
demo
We propose an approach to making text visually appealing and memorable through semantic reinforcement: the use of visual cues alluding to the context or theme in which a word is used, to reinforce its message (e.g., Google Doodles). Given an input word (e.g., exam) and a theme (e.g., education), the individual letters of the input word are replaced by theme-relevant cliparts that visually resemble the letters, adding creative context to the potentially boring input word.
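Purely for illustration, here is a hedged sketch of the letter-to-clipart substitution described above: for each letter, pick the clipart that best trades off visual resemblance to the letter against relevance to the theme. The scoring functions and the mixing weight are hypothetical placeholders, not the paper's actual models.

```python
# Toy sketch: per-letter clipart selection by combining shape similarity and
# theme relevance. Both scoring callables are stand-ins, not learned models.
from typing import Callable, Dict, List

def stylize_word(word: str,
                 cliparts: List[str],
                 shape_sim: Callable[[str, str], float],
                 theme_rel: Callable[[str], float],
                 alpha: float = 0.7) -> Dict[str, str]:
    """Return the best clipart for each letter of `word`."""
    best = {}
    for letter in word:
        best[letter] = max(
            cliparts,
            key=lambda c: alpha * shape_sim(letter, c) + (1 - alpha) * theme_rel(c),
        )
    return best

# Usage with hand-rolled scoring stubs.
print(stylize_word("exam",
                   ["pencil", "owl", "book"],
                   shape_sim=lambda letter, c: float(letter in c),
                   theme_rel=lambda c: {"pencil": .9, "owl": .6, "book": .8}[c]))
```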
|
|
Choose Your Neuron: Incorporating Domain Knowledge into Deep Networks through Neuron Importance
Ramprasaath R. Selvaraju*,
Prithvijit Chattopadhyay*,
Mohamed Elhoseiny,
Tilak Sharma,
Dhruv Batra,
Devi Parikh,
Stefan Lee
ECCV, 2018
arxiv /
blogpost /
code
Individual neurons in CNNs implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole objects. We introduce an efficient zero-shot learning approach that learns to map domain knowledge about novel classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts. We also show how our approach can provide visual and textual explanations including neuron names.
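A minimal, hedged sketch of the zero-shot idea described above: map a class's semantic description to importance weights over final-layer "concept" activations, then score an image by a weighted combination of those activations. The attribute dimensionality, two-layer mapper, and feature shapes are assumptions for illustration, not the paper's exact parameterization.

```python
# Sketch: predict per-neuron importance from domain knowledge (attributes / text)
# and combine pooled activations with those importances to score a novel class.
import torch
import torch.nn as nn

attr_dim, num_neurons = 85, 512          # attribute vector -> conv-feature channels (assumed sizes)

importance_net = nn.Sequential(           # learns attribute -> neuron-importance mapping
    nn.Linear(attr_dim, 256), nn.ReLU(), nn.Linear(256, num_neurons)
)

def zero_shot_score(image_feats: torch.Tensor, class_attrs: torch.Tensor) -> torch.Tensor:
    """image_feats: (batch, num_neurons) pooled activations; class_attrs: (attr_dim,)."""
    neuron_importance = importance_net(class_attrs)   # (num_neurons,)
    return image_feats @ neuron_importance             # (batch,) scores for the novel class

scores = zero_shot_score(torch.randn(4, num_neurons), torch.rand(attr_dim))
print(scores.shape)
```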
|
|
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Ashwin Kalyan,
Michael Cogswell,
Ramprasaath R. Selvaraju,
Qing Sun,
Stefan Lee,
David Crandall,
Dhruv Batra
AAAI, 2018
arxiv /
code /
demo
We propose Diverse Beam Search (DBS), an alternative to beam search (BS) that decodes a list of diverse outputs by optimizing a diversity-augmented objective. We observe that our method finds better top-1 solutions by controlling the exploration and exploitation of the search space, implying that DBS is a better search algorithm. We also study the role of diversity in image-grounded language generation tasks as the complexity of the image changes.
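Here is a hedged, single-step sketch of the diversity-augmented idea: split the beam into groups and, when a later group picks tokens, subtract a Hamming-diversity penalty for tokens already chosen by earlier groups at that step. Each group holds a single hypothesis for brevity, and a toy log-probability vector stands in for a real sequence model.

```python
# Toy sketch of a diversity-augmented decoding step: later groups are penalized in
# proportion to how often earlier groups already chose each token at this step.
import torch

def diverse_step(step_logprobs: torch.Tensor,
                 num_groups: int = 3,
                 div_strength: float = 0.5):
    """step_logprobs: (vocab,) log-probabilities for the next token (shared across
    groups purely to keep the example small)."""
    chosen, counts = [], torch.zeros_like(step_logprobs)
    for _ in range(num_groups):
        augmented = step_logprobs - div_strength * counts  # diversity-augmented objective
        tok = int(torch.argmax(augmented))
        chosen.append(tok)
        counts[tok] += 1.0
    return chosen

vocab_logprobs = torch.log_softmax(torch.randn(10), dim=-1)
print(diverse_step(vocab_logprobs))  # later groups are pushed toward different tokens
```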
|
|
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju,
Michael Cogswell,
Abhishek Das,
Ramakrishna Vedantam,
Devi Parikh,
Dhruv Batra
ICCV, 2017
arxiv /
blogpost /
code /
demo
We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers, (2) CNNs used for structured outputs, and (3) CNNs used in tasks with multimodal inputs or reinforcement learning, without any architectural changes or re-training. We apply Grad-CAM to off-the-shelf image classification, captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into their failure modes, (b) are robust to adversarial images, (c) outperform previous methods on localization, (d) are more faithful to the underlying model, and (e) help achieve generalization by identifying dataset bias. For captioning and VQA, we show that even non-attention-based models can localize inputs. Finally, we design and conduct human studies to measure whether Grad-CAM helps users establish appropriate trust in model predictions, and show that it helps untrained users successfully discern a 'stronger' model from a 'weaker' one even when both make identical predictions.
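A minimal sketch of the Grad-CAM computation described above, assuming a torchvision ResNet-18 backbone with its last convolutional block as the target layer and a random tensor as a stand-in image: global-average-pool the gradients of the target class score with respect to the final convolutional feature maps to get per-channel weights, take the weighted sum of the maps, and apply ReLU.

```python
# Grad-CAM sketch: gradients of the top class score w.r.t. final conv features
# -> channel weights -> weighted sum of feature maps -> ReLU -> coarse heatmap.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in backbone; any CNN works
activations = {}

def save_activation(module, inputs, output):
    output.retain_grad()                       # keep the gradient of this non-leaf tensor
    activations["a"] = output

model.layer4.register_forward_hook(save_activation)   # final convolutional block

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
scores = model(image)
scores[0, scores.argmax()].backward()          # gradient of the top predicted class

acts = activations["a"]                                   # (1, C, h, w) feature maps
alpha = acts.grad.mean(dim=(2, 3), keepdim=True)          # GAP of gradients -> channel weights
cam = F.relu((alpha * acts).sum(dim=1, keepdim=True))     # weighted combination + ReLU
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)                                          # heatmap upsampled to image size
```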
|
|
Counting Everyday Objects in Everyday Scenes
Prithvijit Chattopadhyay*,
Ramakrishna Vedantam*,
Ramprasaath R. Selvaraju,
Dhruv Batra,
Devi Parikh
CVPR, 2017 (Spotlight)
arxiv /
code
We study the numerosity of object classes in natural, everyday images and build dedicated counting models designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. We propose a contextual counting approach inspired by subitizing, the ability of humans to make quick assessments of small counts from a perceptual signal.
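As a hedged sketch of the subitizing-inspired "count small pieces, then aggregate" intuition above: a per-cell predictor produces small fractional counts for each grid cell, and the image-level count is their sum. The grid size and tiny per-cell regressor are assumptions, not the paper's architecture.

```python
# Toy sketch: predict a small count per image cell and sum the cells to get the
# image-level count. Cell features would come from a CNN in practice.
import torch
import torch.nn as nn

cell_feat_dim, grid = 128, 3                           # 3x3 grid of image cells (assumed)

per_cell_counter = nn.Sequential(                      # predicts a small non-negative count per cell
    nn.Linear(cell_feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.ReLU()
)

def image_count(cell_features: torch.Tensor) -> torch.Tensor:
    """cell_features: (grid*grid, cell_feat_dim) features of non-overlapping cells."""
    per_cell = per_cell_counter(cell_features)          # (grid*grid, 1) cell counts
    return per_cell.sum()                               # aggregate to an image-level count

print(image_count(torch.randn(grid * grid, cell_feat_dim)))
```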
|
|
The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
Ondrej Miksik,
Vibhav Vineet,
Morten Lidegaard,
Ramprasaath R. Selvaraju,
Matthias Nießner,
Stuart Golodetz,
Stephen L. Hicks,
Patrick Pérez,
Shahram Izadi,
Philip H. S. Torr
CHI, 2015 (Oral)
paper /
video
We present an augmented reality system for large scale 3D reconstruction and recognition in outdoor scenes. We use a purely passive stereo setup, allowing for outdoor use. In addition to producing a map of the 3D environment in real-time, it also allows the user to draw (or 'paint') with a laser pointer directly onto the reconstruction to segment the model into objects.
|