CS 709 Text Analytics Seminar (HWS 2017: Vision and Language)

Much recent work in Natural Language Processing and Computer Vision addresses problems and tasks that require new approaches to content understanding and generation in both modalities. Prime examples include integrated models of language and vision, as well as exploiting data and methods from one modality to help the other. In this seminar, we will accordingly focus on a recent body of research at the intersection of vision and language computing, and aim at understanding current trends and methods.

UPDATE: no more slots!

All slots for this seminar (12 students) have been filled. Additional slots may open in the coming days only if some of the students who have already applied decide to drop out.


Seminar kick-off meeting

The kick-off meeting for the Seminar in Text Analytics (topic "Vision and Language") will be held on 7.9.2017 at 2 pm in our seminar room C1.01 in part C (first floor) of the B6 building.

In this seminar, you will

  • Read, understand, and explore scientific literature
  • Summarize a current research topic in a concise report
  • Give a focused presentation about a scientific publication


Formal requirements:

  • The report has to be written in LaTeX (beginners are welcome)
  • Report and presentation have to be in English


Timeline:

  • Attend the kick-off meeting on 7.9.2017
  • Select your preferred paper and topic by 15.9.2017
  • Work on your presentation and report during the rest of the semester (details to be presented at the kick-off meeting)


Explore the list of topics below and select at least three topics of your preference. Send a ranked list of your selected topics via email to Goran Glavaš by September 6, 2017. We will confirm your registration as soon as possible. The actual topic assignment takes place at the kick-off meeting; our goal is, of course, to assign you one of your preferred topics.


The topics we offer are listed below. Next to each topic you'll find references that serve as an entry point for your literature research; you are expected to explore further relevant literature on your own.

  • Grounding & Multimodal Semantics
  1. Multimodal Distributional Semantics. E. Bruni, N. Tran, and M. Baroni. JAIR 2014.
  2. Visually Grounded Meaning Representations. C. Silberer, V. Ferrari, and M. Lapata. PAMI 2016.
  3. Learning Visually Grounded Sentence Representations. D. Kiela, A. Conneau, A. Jabri, and M. Nickel. 2017.
  4. Grounding of Textual Phrases in Images by Reconstruction. A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. ECCV 2016.
  • Captioning (Image to Text)
  1. Deep Visual-Semantic Alignments for Generating Image Descriptions. A. Karpathy and L. Fei-Fei. CVPR 2015.
  2. Multi-Task Video Captioning with Video and Entailment Generation. R. Pasunuru and M. Bansal. ACL 2017.
  3. Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks. Y. Liu, J. Fu, T. Mei, and C. W. Chen. AAAI 2017.
  • Image Generation (Text to Image) and Analysis
  1. Generative Adversarial Text to Image Synthesis. S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. ICML 2016.
  2. From Red Wine to Red Tomato: Composition with Context. I. Misra, A. Gupta, and M. Hebert. CVPR 2017.
  • Visual Question Answering
  1. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. H. Xu and K. Saenko. ECCV 2016.
  2. Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images. M. Malinowski, M. Rohrbach, and M. Fritz. ICCV 2015.
  • Multimodal NLP
  1. Incorporating Global Visual Features into Attention-Based Neural Machine Translation. I. Calixto, Q. Liu, and N. Campbell. EMNLP 2017.
  2. Black Holes and White Rabbits: Metaphor Identification with Visual Features. E. Shutova, D. Kiela, and J. Maillard. NAACL 2016.