Visual Question Answering: An Analysis of Various AI Models and Datasets
Abstract
Visual Question Answering (VQA) is considered to be one of the latest advances in the field of Artificial Intelligence
(AI). It is a unique task that combines three of the most important realms of AI, namely Computer Vision (CV), Natural
Language Processing (NLP), and Knowledge Representation and Reasoning (KR), each of which is being researched
extensively. Given an image and an open-ended natural language question about the image, a VQA model needs to provide
an open-ended natural language answer. To achieve this, the model must develop an understanding of the different
entities in the image and the question, as well as their dependencies; for this reason, VQA is regarded as a true AI task. In this review, we detail the
various algorithms proposed for building VQA models, classifying them by the mechanisms used to extract the
input visual and natural language features and map them to a common feature vector space. Finally, we analyze the correctness of these
models and propose some alternatives based on Capsule Networks (CapsNet) as future directions.