|
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li,
Ramprasaath R. Selvaraju,
Akhilesh Deepak Gotmare,
Shafiq Joty,
Caiming Xiong,
Steven Hoi,
NeurIPS, 2021 (Spotlight)
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders-of-magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art, while enjoying faster inference speed.
code /
blog
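For intuition, here is a minimal, hedged sketch of a symmetric image-text contrastive (InfoNCE-style) loss of the kind used to align representations before fusion. The projection dimension, temperature, and random stand-in features are assumptions; ALBEF's full objective additionally uses momentum distillation and feature queues that are not shown.

```python
# Minimal sketch of an image-text contrastive (ITC) alignment loss, in the spirit
# of pre-fusion alignment. Feature extractors, projection dim, and the temperature
# are assumptions; momentum distillation and queues are omitted.
import torch
import torch.nn.functional as F

def itc_loss(image_feats: torch.Tensor,
             text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (batch, dim) projections of [CLS] embeddings."""
    # L2-normalize so dot products become cosine similarities.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE: match each image to its text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random features standing in for encoder outputs.
loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```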
|
|
SOrTing VQA Models: Improving Consistency via Gradient Alignment
Sameer Dharur,
Purva Tendulkar,
Dhruv Batra,
Devi Parikh,
Ramprasaath R. Selvaraju
NAACL, 2021
|
|
CASTing Your Model: Learning to Localize Improves Self-Supervised Representations
Ramprasaath R. Selvaraju,
Karan Desai,
Justin Johnson,
Nikhil Naik
CVPR, 2021
Introducing CAST, a generic training recipe that fixes the visual grounding ability of contrastive SSL models (e.g., MoCo) and makes them work on complex scene images (e.g., COCO)!
|
|
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju,
Purva Tendulkar,
Devi Parikh,
Eric Horvitz,
Marco Tulio Ribeiro,
Besmira Nushi,
Ece Kamar
CVPR, 2020 (Oral)
We investigate the capabilities of VQA models for solving tasks that differ in nature and in complexity. We notice that existing VQA models have consistency issues: they answer complex reasoning questions correctly but fail on the associated low-level perception sub-questions. We quantify the extent of this phenomenon by creating a new Reasoning split and collecting Sub-VQA, a new dataset of the perception sub-questions needed to effectively answer each main reasoning question. We also propose SQuINT, an approach that encourages models to be right for the right reasons.
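As a rough illustration of the consistency issue quantified above, here is a hedged sketch of one way to score it: the fraction of correctly answered reasoning questions whose associated perception sub-questions are also all answered correctly. The record layout and field names are illustrative, not the Sub-VQA format.

```python
# Sketch of a consistency measure: among reasoning questions the model gets right,
# how often does it also get every associated perception sub-question right?
from typing import Dict, List

def consistency(records: List[Dict]) -> float:
    """Each record: {'main_correct': bool, 'sub_correct': [bool, ...]}."""
    main_right = [r for r in records if r["main_correct"]]
    if not main_right:
        return 0.0
    consistent = sum(all(r["sub_correct"]) for r in main_right)
    return consistent / len(main_right)

print(consistency([
    {"main_correct": True,  "sub_correct": [True, True]},
    {"main_correct": True,  "sub_correct": [True, False]},  # inconsistent
    {"main_correct": False, "sub_correct": [True]},
]))  # -> 0.5
```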
|
|
Visual Explanations from Deep Networks
Ramprasaath R. Selvaraju,
Michael Cogswell,
Abhishek Das,
Ramakrishna Vedantam,
Devi Parikh,
Dhruv Batra
IJCV, 2019
arxiv /
blogpost /
code /
demo
|
|
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Ramprasaath R. Selvaraju,
Stefan Lee,
Yilin Shen,
Hongxia Jin,
Shalini Ghosh,
Larry Heck,
Dhruv Batra,
Devi Parikh
ICCV, 2019
arxiv /
blogpost
We notice that many vision and language models suffer from poor visual grounding, often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. To tackle this, we propose Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding. We show that encouraging these models to look at the same regions as humans helps them generalize better to new distributions.
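Below is a hedged sketch of a ranking-style alignment loss in the spirit of HINT: penalize region pairs that the network ranks differently from human importance. The plain score tensors and margin-free hinge are simplifying assumptions; the paper derives network importance from gradient-based explanations.

```python
# Hedged sketch of a HINT-style ranking loss: wherever humans rank one region above
# another, ask the network's importance scores to agree, and penalize violations.
import torch

def ranking_alignment_loss(net_importance: torch.Tensor,
                           human_importance: torch.Tensor) -> torch.Tensor:
    """Both tensors: (num_regions,) importance scores for the same regions."""
    # Pairwise score differences over all ordered region pairs.
    diff_net = net_importance.unsqueeze(0) - net_importance.unsqueeze(1)
    diff_hum = human_importance.unsqueeze(0) - human_importance.unsqueeze(1)

    # Pairs where humans prefer one region; hinge is positive when the net disagrees.
    human_prefers = (diff_hum > 0).float()
    hinge = torch.clamp(-diff_net, min=0.0)
    return (human_prefers * hinge).sum() / human_prefers.sum().clamp(min=1.0)

loss = ranking_alignment_loss(torch.rand(6), torch.rand(6))
print(loss.item())
```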
|
|
Trick or Treat: Thematic Reinforcement for Artistic Typography
Purva Tendulkar,
Kalpesh Krishna,
Ramprasaath R. Selvaraju,
Devi Parikh
ICCC, 2019
arxiv /
code /
demo
We propose an approach to making text visually appealing and memorable through semantic reinforcement: the use of visual cues alluding to the context or theme in which a word is used, to reinforce its message (e.g., Google Doodles). Given an input word (e.g., exam) and a theme (e.g., education), the individual letters of the input word are replaced by theme-relevant cliparts that visually resemble the letters, adding creative context to the potentially boring input word.
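Purely for illustration, here is a hedged sketch of the letter-to-clipart substitution described above: for each letter, pick the clipart that best trades off visual resemblance to the letter against relevance to the theme. The scoring functions and the mixing weight are hypothetical placeholders, not the paper's actual models.

```python
# Toy sketch: per-letter clipart selection by combining shape similarity and
# theme relevance. Both scoring callables are stand-ins, not learned models.
from typing import Callable, Dict, List

def stylize_word(word: str,
                 cliparts: List[str],
                 shape_sim: Callable[[str, str], float],
                 theme_rel: Callable[[str], float],
                 alpha: float = 0.7) -> Dict[str, str]:
    """Return the best clipart for each letter of `word`."""
    best = {}
    for letter in word:
        best[letter] = max(
            cliparts,
            key=lambda c: alpha * shape_sim(letter, c) + (1 - alpha) * theme_rel(c),
        )
    return best

# Usage with hand-rolled scoring stubs.
print(stylize_word("exam",
                   ["pencil", "owl", "book"],
                   shape_sim=lambda letter, c: float(letter in c),
                   theme_rel=lambda c: {"pencil": .9, "owl": .6, "book": .8}[c]))
```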
|
|
Choose Your Neuron: Incorporating Domain Knowledge into Deep Networks through Neuron Importance
Ramprasaath R. Selvaraju*,
Prithvijit Chattopadhyay*,
Mohamed Elhoseiny,
Tilak Sharma,
Dhruv Batra,
Devi Parikh,
Stefan Lee
ECCV, 2018
arxiv /
blogpost /
code
Individual neurons in CNNs implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole objects. We introduce an efficient zero-shot learning approach that learns to map domain knowledge about novel classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts. We also show how our approach can provide visual and textual explanations including neuron names.
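A minimal, hedged sketch of the zero-shot idea described above: map a class's semantic description to importance weights over final-layer "concept" activations, then score an image by a weighted combination of those activations. The attribute dimensionality, two-layer mapper, and feature shapes are assumptions for illustration, not the paper's exact parameterization.

```python
# Sketch: predict per-neuron importance from domain knowledge (attributes / text)
# and combine pooled activations with those importances to score a novel class.
import torch
import torch.nn as nn

attr_dim, num_neurons = 85, 512          # attribute vector -> conv-feature channels (assumed sizes)

importance_net = nn.Sequential(           # learns attribute -> neuron-importance mapping
    nn.Linear(attr_dim, 256), nn.ReLU(), nn.Linear(256, num_neurons)
)

def zero_shot_score(image_feats: torch.Tensor, class_attrs: torch.Tensor) -> torch.Tensor:
    """image_feats: (batch, num_neurons) pooled activations; class_attrs: (attr_dim,)."""
    neuron_importance = importance_net(class_attrs)   # (num_neurons,)
    return image_feats @ neuron_importance             # (batch,) scores for the novel class

scores = zero_shot_score(torch.randn(4, num_neurons), torch.rand(attr_dim))
print(scores.shape)
```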
|
|
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Ashwin Kalyan,
Michael Cogswell,
Ramprasaath R. Selvaraju,
Qing Sun,
Stefan Lee,
David Crandall,
Dhruv Batra
AAAI, 2018
arxiv /
code /
demo
We propose Diverse Beam Search (DBS), an alternative to beam search (BS) that decodes a list of diverse outputs by optimizing a diversity-augmented objective. We observe that our method finds better top-1 solutions by controlling the exploration and exploitation of the search space, implying that DBS is a better search algorithm. We also study the role of diversity in image-grounded language generation tasks as the complexity of the image changes.
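Here is a hedged, single-step sketch of the diversity-augmented idea: split the beam into groups and, when a later group picks tokens, subtract a Hamming-diversity penalty for tokens already chosen by earlier groups at that step. Each group holds a single hypothesis for brevity, and a toy log-probability vector stands in for a real sequence model.

```python
# Toy sketch of a diversity-augmented decoding step: later groups are penalized in
# proportion to how often earlier groups already chose each token at this step.
import torch

def diverse_step(step_logprobs: torch.Tensor,
                 num_groups: int = 3,
                 div_strength: float = 0.5):
    """step_logprobs: (vocab,) log-probabilities for the next token (shared across
    groups purely to keep the example small)."""
    chosen, counts = [], torch.zeros_like(step_logprobs)
    for _ in range(num_groups):
        augmented = step_logprobs - div_strength * counts  # diversity-augmented objective
        tok = int(torch.argmax(augmented))
        chosen.append(tok)
        counts[tok] += 1.0
    return chosen

vocab_logprobs = torch.log_softmax(torch.randn(10), dim=-1)
print(diverse_step(vocab_logprobs))  # later groups are pushed toward different tokens
```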
|
|
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju,
Michael Cogswell,
Abhishek Das,
Ramakrishna Vedantam,
Devi Parikh,
Dhruv Batra
ICCV, 2017
arxiv /
blogpost /
code /
demo
We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers, (2) CNNs used for structured outputs, and (3) CNNs used in tasks with multimodal inputs or reinforcement learning, without any architectural changes or re-training. We apply Grad-CAM to off-the-shelf image classification, captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into their failure modes, (b) are robust to adversarial images, (c) outperform previous methods on localization, (d) are more faithful to the underlying model, and (e) help achieve generalization by identifying dataset bias. For captioning and VQA, we show that even non-attention-based models can localize inputs. Finally, we design and conduct human studies to measure whether Grad-CAM helps users establish appropriate trust in model predictions, and show that it helps untrained users successfully discern a 'stronger' model from a 'weaker' one even when both make identical predictions.
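A minimal sketch of the Grad-CAM computation described above, assuming a torchvision ResNet-18 backbone with its last convolutional block as the target layer and a random tensor as a stand-in image: global-average-pool the gradients of the target class score with respect to the final convolutional feature maps to get per-channel weights, take the weighted sum of the maps, and apply ReLU.

```python
# Grad-CAM sketch: gradients of the top class score w.r.t. final conv features
# -> channel weights -> weighted sum of feature maps -> ReLU -> coarse heatmap.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in backbone; any CNN works
activations = {}

def save_activation(module, inputs, output):
    output.retain_grad()                       # keep the gradient of this non-leaf tensor
    activations["a"] = output

model.layer4.register_forward_hook(save_activation)   # final convolutional block

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
scores = model(image)
scores[0, scores.argmax()].backward()          # gradient of the top predicted class

acts = activations["a"]                                   # (1, C, h, w) feature maps
alpha = acts.grad.mean(dim=(2, 3), keepdim=True)          # GAP of gradients -> channel weights
cam = F.relu((alpha * acts).sum(dim=1, keepdim=True))     # weighted combination + ReLU
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)                                          # heatmap upsampled to image size
```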
|
|
Counting Everyday Objects in Everyday Scenes
Prithvijit Chattopadhyay*,
Ramakrishna Vedantam*,
Ramprasaath R. Selvaraju,
Dhruv Batra,
Devi Parikh
CVPR, 2017 (Spotlight)
arxiv /
code
We study the numerosity of object classes in natural, everyday images and build dedicated counting models designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. We propose a contextual counting approach inspired by subitizing, the ability of humans to make quick assessments of small counts from a perceptual signal.
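As a hedged sketch of the subitizing-inspired "count small pieces, then aggregate" intuition above: a per-cell predictor produces small fractional counts for each grid cell, and the image-level count is their sum. The grid size and tiny per-cell regressor are assumptions, not the paper's architecture.

```python
# Toy sketch: predict a small count per image cell and sum the cells to get the
# image-level count. Cell features would come from a CNN in practice.
import torch
import torch.nn as nn

cell_feat_dim, grid = 128, 3                           # 3x3 grid of image cells (assumed)

per_cell_counter = nn.Sequential(                      # predicts a small non-negative count per cell
    nn.Linear(cell_feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.ReLU()
)

def image_count(cell_features: torch.Tensor) -> torch.Tensor:
    """cell_features: (grid*grid, cell_feat_dim) features of non-overlapping cells."""
    per_cell = per_cell_counter(cell_features)          # (grid*grid, 1) cell counts
    return per_cell.sum()                               # aggregate to an image-level count

print(image_count(torch.randn(grid * grid, cell_feat_dim)))
```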
|
|
The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
Ondrej Miksik,
Vibhav Vineet,
Morten Lidegaard,
Ramprasaath R. Selvaraju,
Matthias Nießner,
Stuart Golodetz,
Stephen L. Hicks,
Patrick Pérez,
Shahram Izadi,
Philip H. S. Torr
CHI, 2015 (Oral)
paper /
video
We present an augmented reality system for large scale 3D reconstruction and recognition in outdoor scenes. We use a purely passive stereo setup, allowing for outdoor use. In addition to producing a map of the 3D environment in real-time, it also allows the user to draw (or 'paint') with a laser pointer directly onto the reconstruction to segment the model into objects.
|