Spring 2021

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos, and action taking.

Yonatan Bisk
Instructor

Torsten Wörtwein
Teaching Assistant

Jielin Qiu
Teaching Assistant

Slack and Canvas

All course communication will happen via Slack and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class/teaching will be interactive during the live Zoom sessions.


Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout the course, which are graded only via participation.

Project Timeline and Assignments:
Feb 18 Group Formed and Dataset Chosen
Mar 04 R1 Task Definition and Data Analysis (10%)
Mar 11 R2 Related Work and Background (10%)
Mar 18 R3 Baselines, Metrics, and Empty Results Table (10%)
Apr 01 R4 Analysis of Baselines (10%)
Apr 22 R5 Proposed Approach (10%)
May 11 Presentation (10%)
May 13 R6 Completed Report (20%)

Participation in Class or Slack (20%)
Participation is evaluated as "actively asking/answering questions based on the lectures, readings, and/or assisting other teams with project issues". Concretely, this means that every novel question or helpful answer provided in Slack will count for 1%, up to a total of 20% of your grade.

Submission Policies:


The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something not listed here, please reach out so we can discuss it and put together a proposal.
Paper (Code) and Domain:
  • Embodied instruction following with interaction
  • MEmoR (Code; ask the TA for data): Emotion in conversational context on The Big Bang Theory
  • Natural Language for Visual Reasoning (Code v1, Code v2): Visual reasoning about pairs of images and language descriptions. For NLVR2, look at the Contrastive Sets fold.
  • Room-Across-Room (Code): Embodied instruction following (with views and multilingual)
  • Social-IQ (Code; project page): Video Question Answering focused on social interactions
  • VizWiz Challenge (Challenge): Image Captioning and Question Answering for the blind and visually impaired
Limited AWS and Google Cloud compute credits will be made available to each group, so please consider both your interests and available compute resources when deciding on a dataset/project.


Schedule (Tuesday / Thursday)
Feb 2: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Feb 4: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Feb 9: Basics: "Deep learning"
  • Language, Vision, Audio
  • Loss functions and neural networks
Feb 11: Basics: Optimization
  • Gradients and backprop
  • Practical deep learning optimization
Feb 16: Unimodal representations (Vision)
  • CNNs
  • Residuals and Skip connections
Feb 18: Unimodal representations (Language)
  • Gating and LSTMs
  • Transformers
  • Groups Formed and Dataset Chosen
Feb 23: -- NO CLASS --
Feb 25: Project Hours (Reports 1 & 2)

Mar 2: Multimodal Representations
  • Auto-encoders
  • Joint representations
Mar 4: Coordinated Representation
  • Deep CCA
  • Matrix factorization
  • R1: Task Definition and Data Analysis
Mar 9: Alignment
  • Explicit - Dynamic time warping
  • Implicit - Attention models
Mar 11: Project Hours (Report 3)
  • R2: Related Work and Background
Mar 16: Alignment + Representation
  • Self-attention models
  • Multimodal Transformers
Mar 18: Alignment + Translation
  • Module networks
  • Tree-based & Stack models
  • R3: Baselines, Metrics, and Empty Results Table
Mar 23: Probabilistic Graphical Models
  • Dynamic Bayesian networks
  • Coupled and factor HMMs
Mar 25: Project Hours (Report 4)
Mar 30: Discriminative Graphical Models
  • Conditional Random Fields
  • Continuous and fully-connected CRFs
Apr 1: Reinforcement Learning
  • Markov Decision Process
  • Q learning and policy gradients
  • R4: Analysis of Baselines
Apr 6: Multimodal RL
  • Deep Q learning
  • Multimodal applications
Apr 8: Project Hours (Report 5)
Apr 13: Fusion and co-learning
  • Multi-kernel learning and fusion
  • Few shot learning and co-learning
Apr 15: -- NO CLASS --
Apr 20: New research directions
  • Recent approaches in MMML
Apr 22: Embodiment
  • Action as a modality
  • R5: Proposed Approach
Apr 27: Project Hours (Final)

Apr 29: Project Hours (Final)
May 4: Guest Lecture (Mark Yatskar - UPenn)

May 6: Guest Lecture (Chris Paxton - NVIDIA)
May 11: Project Presentations (live)
May 13: -- NO CLASS --
  • R6: Final Reports