Student: Niccolò Bellaccini, 2020. Thesis developed partially during a stage at Verizon Connect Research Italy, Florence

Image Captioning tackles the problem of generating textual descriptions
from pictures, and thus lies in the intersection of the Computer Vision (CV)
and Natural Language Processing (NLP) disciplines.
We proposed a template based Image Captioning method for textual
description generation of road scenes.
For this task we first selected a set of environmental aspects about each
picture. Then, a set of templates containing tags was defined in an automated
way. Each tag refers to a single, representative aspect of the scene.
Convolutional Neural Networks (CNNs) were used to predict these aspects. Multiple classification approaches were tested, considering both single-
attribute prediction and multi-attribute, by putting together the most correlated ones.
We obtained the clusters in an automatic way, leveraging clustering methods on the original annotations.
The entire project was based on the dataset SHRP2 NDS. Our solution
showed good results providing coherent captions with respect to images.

image_print
Image Captioning road scenes: a multitask deep learning approach