Ramprasaath R. Selvaraju

I am a Senior Research Scientist at Salesforce Research. Prior to joining Salesforce, I did my PhD at Georgia Tech advised by Prof. Devi Parikh and Prof. Dhruv Batra. My research focuses on building algorithms that provide explanations for decisions emanating from deep networks in order to —

  • understand/interpret why the model did what it did,
  • diagnose network failures,
  • help users build appropriate trust,
  • enable knowledge transfer between humans and AI,
  • encourage human-like reasoning in AI,
  • learn grounded representations, and
  • correct unwanted biases learned by AI models.

During my undergrad, I had the opportunity to work with Prof. Philip Torr at Oxford University and Prof. Benjamin Kimia at Brown University on projects helping visually impaired people navigate everyday environments.

I love playing all kinds of racquet sports, particularly badminton, table tennis, tennis, squash, and racquetball.

Interested? Visit my Google Scholar profile to find out more about my work!

Feel free to reach out to me at ramprasaath.21@gmail.com

News


  • [Dec 2021] New preprint introducing an information-efficient method for visual representation learning through textual annotations available.
  • [Dec 2021] New preprint on grounded video representation learning available.
  • [Nov 2021] We are looking to hire interns at Salesforce. Please reach out to me if you are interested in working on explainability or representation learning.
  • [Sep 2021] Our paper has been accepted at NeurIPS'21 as a Spotlight (top 3% of submissions).
  • [Apr 2021] Excited to announce the first edition of the "Towards Robust, Trustworthy, and Explainable Computer Vision" tutorial at ICCV'21.
  • [Mar 2021] Our paper has been accepted at NAACL'21.
  • [Feb 2021] Our paper has been accepted at CVPR'21.
  • [Dec 2020] I will be hosting Salesforce's Computer Vision round table session at NeurIPS'20. Join me on December 8th at 11 am Eastern Time to learn more about our efforts towards solving computer vision.
  • [Dec 2020] New preprint on fixing the visual grounding ability of Contrastive Self-supervised models available.
  • [Nov 2020] Our paper will be presented at Interpretable inductive biases and physically structured learning Workshop at NeurIPS'20.
  • [Oct 2020] New preprint on improving the consistency of VQA models available. Work led by Sameer Dharur.
  • [Jun 2020] I am joining Salesforce Research (formerly MetaMind) as a Sr. Research Scientist.
  • [Mar 2020] Our paper has been accepted as an Oral presentation at CVPR'20.
  • [Jan 2020] New preprint analyzing reasoning in VQA models through a newly collected Sub-VQA dataset available.
  • [Oct 2019] Presenting our paper at ICCV'19.
  • [Sep 2019] I was invited to present my research at Microsoft Research AI breakthroughs event.
  • [Jun 2019] Our paper has been published in IJCV.

Achievements


  • Finalist for 2019 Adobe Research Fellowship
  • Finalist for 2019 Snap Research Fellowship
  • Won the 2016 Virginia Divisionals and placed second at US Mid-Atlantic Table-Tennis Championship
  • Represented Virginia Tech at the 2016 US-Canada National Table-Tennis Championship

Publications

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi
NeurIPS, 2021 (Spotlight)
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR2, ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-of-the-art, while enjoying faster inference speed.
code / blog
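The image-text alignment step ALBEF describes rests on a symmetric contrastive (InfoNCE-style) loss over matched image-text pairs. A minimal NumPy sketch, under simplifying assumptions (single positive per pair, no momentum distillation, an illustrative temperature value):

```python
import numpy as np

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs share a row index."""
    # L2-normalize both sets of embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(img))        # positives lie on the diagonal

    def xent(l):
        # numerically stable softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

Perfectly aligned embeddings drive the loss toward zero, while mismatched pairs are pushed apart; in the full model this signal shapes the unimodal encoders before cross-modal fusion.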

SOrTing VQA Models: Improving Consistency via Gradient Alignment
Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath R. Selvaraju
NAACL, 2021


CASTing Your Model: Learning to Localize Improves Self-Supervised Representations
Ramprasaath R. Selvaraju, Karan Desai, Justin Johnson, Nikhil Naik
CVPR, 2021
Introducing CAST, a generic training recipe that fixes the visual grounding ability of contrastive SSL models (e.g., MoCo) and makes them work on complex scene images (e.g., COCO)!

SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, Ece Kamar
CVPR, 2020 (Oral)
We investigate the capabilities of VQA models for solving tasks that differ in nature and in complexity. We notice that existing VQA models have consistency issues -- they answer complex reasoning questions correctly but fail on associated low-level perception sub-questions. We quantify the extent to which this phenomenon occurs by creating a new Reasoning split and collecting Sub-VQA, a new dataset consisting of associated perception sub-questions needed to effectively answer the main reasoning question. Additionally, we propose SQuINT, an approach that enforces models to be right for the right reasons.

Visual Explanations from Deep Networks
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra
IJCV, 2019
arxiv / blogpost / code / demo

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh
ICCV, 2019
arxiv / blogpost
We notice that many vision and language models suffer from poor visual grounding - often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. To tackle this, we propose Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding. We show that encouraging these models to look at the same regions as humans makes them generalize better to new distributions.

Trick or Treat: Thematic Reinforcement for Artistic Typography
Purva Tendulkar, Kalpesh Krishna, Ramprasaath R. Selvaraju, Devi Parikh
ICCC, 2019
arxiv / code / demo
One approach to making text visually appealing and memorable is semantic reinforcement - the use of visual cues alluding to the context or theme in which the word is being used to reinforce the message (e.g., Google Doodles). Given an input word (e.g. exam) and a theme (e.g. education), the individual letters of the input word are replaced by cliparts relevant to the theme which visually resemble the letters - adding creative context to the potentially boring input word.

Choose Your Neuron: Incorporating Domain Knowledge into Deep Networks through Neuron Importance
Ramprasaath R. Selvaraju*, Prithvijit Chattopadhyay*, Mohamed Elhoseiny, Tilak Sharma, Dhruv Batra, Devi Parikh, Stefan Lee
ECCV, 2018
arxiv / blogpost / code
Individual neurons in CNNs implicitly learn semantically meaningful concepts ranging from simple textures and shapes to whole objects. We introduce an efficient zero-shot learning approach that learns to map domain knowledge about novel classes onto this dictionary of learned concepts and then optimizes for network parameters that can effectively combine these concepts. We also show how our approach can provide visual and textual explanations including neuron names.

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Ashwin Kalyan, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, Dhruv Batra
AAAI, 2018
arxiv / code / demo
We propose Diverse Beam Search (DBS), an alternative to beam search (BS) that decodes a list of diverse outputs by optimizing for a diversity-augmented objective. We observe that our method finds better top-1 solutions by controlling for the exploration and exploitation of the search space - implying that DBS is a better search algorithm. We study the role of diversity for image-grounded language generation tasks as the complexity of the image changes.
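The diversity-augmented objective can be illustrated with a toy sketch. This is a deliberate simplification (one beam per group, and per-step log-probs that ignore decoding history): each group decodes greedily while paying a Hamming-diversity penalty for tokens that earlier groups already picked at the same step.

```python
import numpy as np

def diverse_beam_search(step_logprobs, num_groups, lam):
    """Toy diverse decoding over a fixed (T, V) table of per-step
    token log-probs. Returns one decoded sequence per group."""
    T, V = step_logprobs.shape
    sequences = [[] for _ in range(num_groups)]
    for t in range(T):
        used = np.zeros(V)  # how often each token was picked at step t
        for g in range(num_groups):
            # diversity-augmented score: penalize tokens chosen by
            # earlier groups at this step (Hamming diversity)
            scores = step_logprobs[t] - lam * used
            tok = int(np.argmax(scores))
            sequences[g].append(tok)
            used[tok] += 1
    return sequences
```

With the penalty strength `lam` set high enough, later groups are pushed off the greedy choice onto the next-best token, which is the mechanism that yields diverse output lists.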

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra
ICCV, 2017
arxiv / blogpost / code / demo
We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localization map highlighting important regions in the image for predicting the concept. Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers, (2) CNNs used for structured outputs, (3) CNNs used in tasks with multimodal inputs or reinforcement learning, without any architectural changes or re-training. We apply Grad-CAM to off-the-shelf image classification, captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into their failure modes, (b) are robust to adversarial images, (c) outperform previous methods on localization, (d) are more faithful to the underlying model and (e) help achieve generalization by identifying dataset bias. For captioning and VQA, we show that even non-attention based models can localize inputs. Finally, we design and conduct human studies to measure if Grad-CAM helps users establish appropriate trust in predictions from models and show that Grad-CAM helps untrained users successfully discern a 'stronger' model from a 'weaker' one even when both make identical predictions.
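The core computation described above - global-average-pool the gradients to get per-channel importance weights, combine the activation maps with those weights, then apply a ReLU - fits in a few lines of NumPy. This is an illustrative sketch only; in practice the activations and gradients would come from hooks on a trained CNN's final convolutional layer.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations: (K, H, W) feature maps of the final conv layer.
    gradients:   (K, H, W) gradients of the target class score
                 w.r.t. those feature maps.
    """
    # alpha_k: global-average-pool the gradients over spatial dims
    weights = gradients.mean(axis=(1, 2))              # (K,)
    # weighted combination of activation maps
    cam = np.tensordot(weights, activations, axes=1)   # (H, W)
    # ReLU: keep only features with a positive influence on the class
    return np.maximum(cam, 0.0)
```

The final ReLU is what restricts the map to regions that push the class score up, which is why the heatmap highlights evidence *for* the concept rather than against it.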

Counting Everyday Objects in Everyday Scenes
Prithvijit Chattopadhyay*, Ramakrishna Vedantam*, Ramprasaath R. Selvaraju, Dhruv Batra, Devi Parikh
CVPR, 2017 (Spotlight)
arxiv / code
We study the numerosity of object classes in natural, everyday images and build dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes. We propose a contextual counting approach inspired by the phenomenon of subitizing - the ability of humans to make quick assessments of counts given a perceptual signal, for small count values.

The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
Ondrej Miksik, Vibhav Vineet, Morten Lidegaard, Ramprasaath R. Selvaraju, Matthias Nießner, Stuart Golodetz, Stephen L. Hicks, Patrick Pérez, Shahram Izadi, Philip H. S. Torr
CHI, 2015 (Oral)
paper / video
We present an augmented reality system for large scale 3D reconstruction and recognition in outdoor scenes. We use a purely passive stereo setup, allowing for outdoor use. In addition to producing a map of the 3D environment in real-time, it also allows the user to draw (or 'paint') with a laser pointer directly onto the reconstruction to segment the model into objects.

Experience


Research Intern, Adaptive Systems and Interaction Group, Microsoft Research

Towards evaluating and encouraging human-like reasoning abilities in deep models.

(Summer 2019)


Research Intern, Tesla Autopilot

Preventing failures of autonomous systems in case of rarely occurring scenarios.

(Spring 2019)


Research Intern, Samsung Research America

Leveraging explanations to make AI models more grounded.

(Summer 2018)


Research Intern, Applied Machine Learning, Facebook

Developing framework for interpreting and visualizing Facebook's deep models.

(Spring 2017)


PhD Research Assistant, Visual Intelligence Lab, Georgia Tech

Towards building AI systems that are Interpretable, Transparent, and Unbiased.

(2015 - 2017)

Research Intern, Visual Intelligence Lab, Georgia Tech

Building curious systems that ask open-ended natural language questions about an image.

(Spring 2015)

Research Intern, Oxford University

Developing interactive augmented reality systems for the visually impaired.

(Fall 2014)

Research Intern, Brown University

Designing a vision-based navigation system to help the visually impaired navigate indoor environments.

(Summer 2013)

Educational Qualifications


Ph.D. in Computer Science

Georgia Institute of Technology, Atlanta

Thesis title: Explaining model decisions and fixing them via human feedback

(August 2015 - May 2020)

Master of Science in Physics and Bachelor of Engineering in Electrical & Electronics Engineering

Birla Institute of Technology and Science (BITS-Pilani), Hyderabad, India

(August 2010 - May 2015)

Contact


- Primary Email: rselvaraju@salesforce.com

(Design and CSS courtesy: Jon Barron and Amlaan Bhoi)