UCL ELLIS

Question Answering in Realistic Visual Environments: Challenges and Approaches


Cătălina Cangea


University of Cambridge


Friday, 10 January 2020






Malet Place Engineering Building 1.03

Event series

DeepMind/ELLIS CSML Seminar Series


The Embodied Question Answering (EQA) and Interactive Question Answering (IQA) tasks were recently introduced as a means to study the capabilities of agents in rich, realistic 3D environments, where success requires both navigation and reasoning. Each of these skills typically needs a different approach, which should nevertheless be smoothly integrated with the rest of the agent's system. However, initial approaches either risk performing worse than a language-only model or are preceded by additional hand-engineered steps. This talk will provide an overview of existing work in this area and describe in more detail our recent study, VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering (published at BMVC 2019, spotlight talk at ViGIL@NeurIPS 2019). In this study, we investigate the feasibility of EQA-type tasks by building a novel benchmark that contains pairs of questions and videos generated in the House3D environment. While removing the navigation and action selection requirements from EQA, we increase the difficulty of the visual reasoning component via a much larger question space, tackling the sort of complex reasoning questions that make QA tasks challenging. By designing and evaluating several VQA-style models on the dataset, we establish a novel way of evaluating EQA feasibility given existing methods, while highlighting the difficulty of the problem even in the most ideal setting.


Cătălina Cangea is a second-year PhD student at the Department of Computer Science and Technology, University of Cambridge. Her research focuses on multimodal, visual reasoning and relational learning tasks. She interned with Aaron Courville at Mila last summer and was an AI Resident at (Google) X, the moonshot factory, this summer. Her work has been presented at venues including the British Machine Vision Conference (BMVC), NeurIPS workshops (ViGIL, R2L) and ICLR workshops (RLGM, AISG). Before starting her PhD, Cătălina obtained her BA and MPhil degrees in Computer Science from the University of Cambridge.