Image captioning with Transformers



Image captioning is the task of automatically generating natural language descriptions of the content of an image. It is an important part of scene understanding, combining knowledge from computer vision and natural language processing [1].

In recent years, applications of image captioning have grown rapidly, for example in sensitive action recognition, human-computer interaction for visually impaired people, image indexing, and recommendation systems.

Several image captioning approaches follow a translation-style design that pairs a visual feature extractor (encoder) with a recurrent natural language decoder [2].
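A minimal sketch of this encoder-decoder setup, using PyTorch (one of the frameworks listed below): a small CNN stands in for a pretrained visual backbone and produces a feature vector, which is prepended to the caption embeddings fed to an LSTM decoder. All dimensions and the vocabulary size are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hid_dim=256):
        super().__init__()
        # Stand-in for a pretrained CNN backbone (e.g. ResNet features).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is prepended as a pseudo-token that
        # conditions the recurrent decoder on the visual content.
        feats = self.encoder(images).unsqueeze(1)    # (B, 1, feat_dim)
        tokens = self.embed(captions)                # (B, T, feat_dim)
        inputs = torch.cat([feats, tokens], dim=1)   # (B, T+1, feat_dim)
        out, _ = self.decoder(inputs)
        return self.head(out)                        # (B, T+1, vocab_size)

model = CaptionModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```

At training time the logits would be compared against the shifted caption with cross-entropy; at inference the decoder is run token by token.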

More recent approaches employ attention-based networks [3].

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, has only recently begun to be explored [4, 5, 6].
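To make the multi-modal setting concrete, here is a hedged sketch of how a standard Transformer decoder can be repurposed for captioning: image region features act as the cross-attention "memory", while a causal mask keeps caption generation autoregressive. The region count, model width, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_regions = 1000, 128, 49  # e.g. a 7x7 feature grid

layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

regions = torch.randn(2, num_regions, d_model)   # image features from a CNN/ViT
caption = torch.randint(0, vocab_size, (2, 7))   # caption token ids so far

# Causal mask: each position may only attend to earlier caption tokens.
causal = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)

out = decoder(embed(caption), regions, tgt_mask=causal)  # cross-attends to regions
logits = head(out)
print(logits.shape)  # torch.Size([2, 7, 1000])
```

The same decoder stack works whether the memory comes from a CNN feature grid or from the patch embeddings of a pre-trained vision Transformer, which is what makes pre-trained Transformer architectures attractive here.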

The goal of this thesis is to study the recent advances in the field of image captioning, with particular emphasis on the use of pre-trained Transformer architectures and their application to practical problems.

Planned Activities

  • Initial research on history and state-of-the-art models for image captioning;
  • Research on Transformers and how they can be applied to multi-modal contexts;
  • Implementation of state-of-the-art models for image captioning;
  • Application of Transformer-based image captioning algorithms to a real problem.

Who we’re looking for

Students who are about to obtain their Master's degree in: computer science, computer engineering, mechatronic engineering, mathematical engineering, mathematics, physics, or informatics.


  • Proficiency in at least one programming language (Python, Lua, Matlab, C++, Java); Python is preferred;
  • Basic knowledge of machine learning and deep learning algorithms (CNNs, RNNs);
  • Basic knowledge of one of these deep learning frameworks: Tensorflow, Pytorch, fastAI;
  • Good knowledge of linear algebra.

Duration of this project: 6-8 months


Contact Us

Directly by email to: [email protected]

By LinkedIn: ò-sonia-66a95467