Microsoft, as part of its new research into storytelling by artificial intelligence, has released CaptionBot, an AI designed to recognise images and add an appropriate descriptive caption. Like its previous attempt at AI – the chatbot Tay – CaptionBot isn’t entirely successful. The results, though, are once again hilarious (and this time without any fascistic or incestuous overtones).
The accompanying academic paper, titled Visual Storytelling [PDF], describes how the Microsoft Sequential Image Narrative Dataset (SIND) applies value judgements to picture content, setting, composition, and human expression in an attempt to describe the scene. The paper adds:
“There is a significant difference, yet unexplored, between remarking that a visual scene shows “sitting in a room” – typical of most image captioning work – and that the same visual scene shows “bonding”. The latter description is grounded in the visual signal, yet it brings to bear information about social relations and emotions that can be additionally inferred in context.”
To set CaptionBot’s base level, workers on Amazon Mechanical Turk ploughed through 10,117 CC-licensed Flickr albums, assigning traditional captions to a series of pictures. An ‘average’ description of each picture was derived from the multitude of entries, and that average was reduced to an algorithm that CaptionBot could apply to fresh images in order to evaluate them.
“Captioning is about taking concrete objects and putting them together in a literal description,” Margaret Mitchell, lead researcher on the project, said in a Microsoft blog post. “What I’ve been calling visual storytelling is about inferring conceptual and abstract ideas from those concrete objects.”