Spring 2022

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Yonatan Bisk


Li-Wei Chen


Ta-Chung Chi


Hyukjae (Alex) Kwark


Dong Won Lee


Yuchen Xu


Slack and Canvas

All course communication will happen via Slack and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class/teaching will be interactive in the Zoom sessions.


Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout the course, which are graded only via participation.

Project Timeline and Assignments: (see links for more details)
Feb 03 Groups Formed
Feb 10 R1 Dataset Proposal and Analysis (as a group) (10%)
Mar 03 R2 Related Work and Model Proposal (15%)
Mar 31 R3 Baseline Analysis (15%)
Finals Week Presentation (10%)
May 6 Final Completed Report (20%)

Participation in Class or Slack (20%)
Participation is evaluated as "actively asking/answering questions based on the lectures, readings, and/or assisting other teams with project issues". Concretely, this means that every novel question or helpful answer provided in Slack will count for 1%, up to a total of 20% of your grade. Two bonus points can be earned (22%).

Paper Summaries (10%)
Writing a three-sentence summary of the paper you read earns you 1pt. The summary is submitted in three text boxes: A. State the goal of the paper, B. Explain the key insight, C. State a key limitation or important extension. There will be 11 opportunities, so one bonus point can be earned (11%). Paper summaries are due the following Tuesday night (1 week after being assigned).
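
As a rough illustration of how the percentages above add up, the sketch below totals a grade from the listed weights. It is not an official grading script: the function name `course_grade`, the idea of scoring each project component as a fraction between 0 and 1, and the simple additive combination are all assumptions made for the example. Only the weights and caps come from the text above.

```python
# Project component weights (percent of final grade), as listed in the timeline.
PROJECT_WEIGHTS = {
    "R1": 10,
    "R2": 15,
    "R3": 15,
    "presentation": 10,
    "final_report": 20,
}
PARTICIPATION_CAP = 22  # 20% plus two bonus points
SUMMARY_CAP = 11        # 10% plus one bonus point (11 opportunities)


def course_grade(project_scores, contributions, summaries):
    """Hypothetical grade total in percent.

    project_scores: component name -> fraction earned in [0, 1] (assumed scale)
    contributions:  number of novel questions / helpful answers (1% each)
    summaries:      number of paper summaries submitted (1pt each)
    """
    project = sum(
        weight * project_scores.get(name, 0.0)
        for name, weight in PROJECT_WEIGHTS.items()
    )
    participation = min(contributions, PARTICIPATION_CAP)
    summary_points = min(summaries, SUMMARY_CAP)
    return project + participation + summary_points


full_marks = {name: 1.0 for name in PROJECT_WEIGHTS}
print(course_grade(full_marks, contributions=20, summaries=10))
print(course_grade(full_marks, contributions=30, summaries=11))
```

The first call reflects full credit with no bonus points; the second shows that contributions beyond the cap do not add further credit.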

Submission Policies:

Tasks & Datasets

The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Simulator Based
Room-Across-Room (Code): Multilingual Embodied Navigation
ALFRED (Code): Embodied instruction following with interaction
TEACh (Code): Embodied Teaching (and Dialogue)

Question Answering & Captioning
TextVQA (Code): Text in images (referring expressions and reading)
WebQA (Code): Multihop Visual QA
VizWiz (VQA and Captioning): Visual models for blind users
Social-IQ (Code, Proj page): Video Question Answering focused on social interactions

Multi-turn QA
CompGuessWhat?!: Visual Guessing Game and Attribute Prediction
PhotoBook (Dialogue Data): Visual reference game via dialogue

Spoken Image Captions: A series of audio corpora and corresponding images for connecting audio directly to image regions.

TVQA: Video Question Answering Dataset
VATEX: Multilingual Video Captioning and Translation

Physical hardware / robots / sensors ...
What about physical hardware? robots? tasks not datasets? Let's talk.

Compute
Limited AWS and Google Cloud compute credits will be made available to each group, so please consider both your interests and available compute resources when deciding on a dataset/project.


Tuesday Thursday
Jan 18: Course Structure
  • Research and technical challenges
  • Syllabus and requirements
Jan 20: Multimodal applications and datasets
  • Research tasks and datasets
  • Team projects
Jan 25: Basics: "Deep learning"
  • Language, Vision, Audio
  • Loss functions and neural networks
Jan 27: Basics: Optimization
  • Gradients and backprop
  • Practical deep learning optimization
Readings: A listed or proposed dataset/task
Feb 1: Unimodal representations (Vision)
  • CNNs
  • Residuals and Skip connections
Feb 3: Unimodal representations (Language)
  • Gating and LSTMs
  • Transformers
  • Groups formed, sign up for project hours
Feb 8: Project Hours (Project ideas)
Feb 10: Project Hours (Project ideas)
  • R1: Dataset Proposal and Analysis
Readings: A paper of your choosing which is relevant to your project.
Note: Team members must choose different papers.
Feb 15: Multimodal & Coordinated Representations
  • Auto-encoders
  • CCA
  • Multi-view Clustering
Feb 17: Alignment and Attention
  • Explicit: Dynamic Time Warping
  • Implicit: Attention
Feb 22: Alignment + Representation
  • Self-attention
  • Multimodal Transformers
Feb 24: Alignment + Representation (Cont)
  • Self-attention models
  • Multimodal Transformers
Mar 1: Alignment + Representation (Cont)
  • Video Transformers
  • Self-Attention for Vision
Mar 3: Ethics (Guest: Emma Strubell)
  • R2: Related Work and Model Proposal
Readings: None
Mar 8: Spring Break!
Mar 10: Spring Break!
Readings: None
Mar 15: Project Hours (Research Discussion)
Mar 17: Project Hours (Research Discussion)
Readings: A paper of your choosing which is relevant to your project.
Note: Team members must choose different papers.
Mar 22: Alignment + Translation
  • Module Networks
  • Tree-based & Stack models
Mar 24: Fusion and co-learning
  • Multi-kernel learning and fusion
  • Few shot learning and co-learning
Mar 29: Reinforcement Learning
  • Markov Decision Processes
  • Q-learning and policy gradients
Mar 31: Multimodal RL
  • Deep Q learning
  • Multimodal applications
  • R3: Baseline analysis
Apr 5: Embodiment
  • Action as a modality
Apr 7: -- NO CLASS --
Readings: None
Apr 12: Embodiment (cont)
  • Language to Control
Apr 14: New research directions
  • Recent publications
Apr 19: Project Hours (Final)

Apr 21: Project Hours (Final)
Readings: None
Apr 26: Daniel Fried
Apr 28: Chris Paxton
May 5 (5:30-8:30pm): Project Presentations (Hybrid: PH 100)
May 6: Final Reports Due