11-777 MultiModal Machine Learning

Spring 2022 Previous Projects

This course focuses on core techniques and modern advances for integrating different "modalities" into a shared representation or reasoning system. Specifically, these include text, audio, images/videos and action taking.

Time & Place: 10:10am - 11:30am on Tu/Th (Doherty Hall 2210)
Canvas: Lectures and additional details (coming soon)
Course questions and discussion: Slack
Registered students will be invited daily during the first week of class
GitHub Template: https://github.com/ybisk/11-777-template

Instructor

Yonatan Bisk

ybisk@cs.cmu

TA

Li-Wei Chen

liweiche@cs.cmu.edu

TA

Ta-Chung Chi

tachungc@andrew

TA

Hyukjae (Alex) Kwark

hkwark@andrew

TA

Dong Won Lee

dongwonl@cs

TA

Yuchen Xu

yuchenxu@andrew

Slack and Canvas

All course communication will happen via Slack and Canvas. All videos will be posted to Canvas for offline viewing, though aspects of the class/teaching will be interactive in the Zoom sessions.

Slack

#general: For questions about lectures, the course, or help from others on class projects
#team-N-X: Each team should come up with a name and create their own private channel (invite TAs and instructor). Use the same name for your GitHub fork and pin the link to the channel. Please also invite us to the GitHub. Example: #team-fun-vizwiz
#dataset-XYZ: Each core dataset will also have its own Slack channel that anyone can join (across teams) to ask for help on setup, preprocessing, and other issues that might arise.
Private Messages: If there is a question you would like to address to the instructors, please send a DM on Slack. Please check #general-questions first and post there when possible.

Assignments Timeline and Grading

The course is primarily project-based, but there will be readings throughout the course, which are graded only via participation.

Project Timeline and Assignments: (see links for more details)

Feb 03		Groups Formed
Feb 10	R1	Dataset Proposal and Analysis (as a group)	(10%)
Mar 03	R2	Related Work and Model Proposal	(15%)
Mar 31	R3	Baseline Analysis	(15%)
Finals Week		Presentation	(10%)
May 6	Final	Completed Report	(20%)

Participation:
Participation in Class or Slack (20%)
Participation is evaluated as "actively asking/answering questions based on the lectures, readings, and/or assisting other teams with project issues". Concretely, this means that every novel question or helpful answer provided in Slack will count for 1%, up to a total of 20% of your grade. Two bonus points can be earned (22%).

Paper Summaries:
Paper Summaries (10%)
Writing a three-sentence summary of the paper you read earns you 1pt. The summary is submitted in three text boxes: A. state the goal of the paper, B. explain the key insight, C. state a key limitation or important extension. There will be 11 opportunities, so one bonus point can be earned (11%). Paper summaries are due the following Tuesday night (1 week after being assigned).
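
As a rough illustration of how the percentages above add up, here is a minimal Python sketch. It is not the course's grading tool: the function name `course_total`, its argument names, and the assumption that each project component is scored as a fraction between 0 and 1 are all hypothetical. Only the weights and caps come from the text above.

```python
# Illustrative sketch only: names and the 0-1 scoring scale are assumptions.

# Component weights in percent, taken from the project timeline above.
PROJECT_WEIGHTS = {
    "R1": 10,            # Dataset Proposal and Analysis
    "R2": 15,            # Related Work and Model Proposal
    "R3": 15,            # Baseline Analysis
    "Presentation": 10,
    "Final": 20,         # Completed Report
}

PARTICIPATION_CAP = 22   # 20% plus two bonus points
SUMMARY_CAP = 11         # 10% plus one bonus point


def course_total(project_scores, contributions, summaries):
    """Return the course percentage under the stated weights.

    project_scores: dict mapping component name to a fraction in [0, 1]
    contributions:  number of novel questions or helpful answers (1% each)
    summaries:      number of paper summaries submitted (1 point each)
    """
    project = sum(PROJECT_WEIGHTS[name] * project_scores.get(name, 0.0)
                  for name in PROJECT_WEIGHTS)
    participation = min(contributions, PARTICIPATION_CAP)
    summary_points = min(summaries, SUMMARY_CAP)
    return project + participation + summary_points
```

Under these assumptions, full marks on every project component plus the maximum participation and summary credit would total 103%, reflecting the three available bonus points.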

Submission Policies:

All deadlines are midnight EST (determined by Canvas submission)
Everyone must submit a PDF of the report to Canvas so we can give individual grades
Late days: Every team has a budget of 6 late days. Late days are calculated automatically; once the budget is exhausted, 2% absolute is removed from the maximum grade.
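
The late-day rule can be sketched in Python as follows. The policy above does not say whether the 2% applies once or for each day beyond the budget; this sketch assumes it applies per extra day, and the function name `max_grade_after_late_days` and its parameters are hypothetical.

```python
# Illustrative sketch only. ASSUMPTION: the 2% absolute penalty is applied
# for each late day used beyond the team's budget; the policy text does not
# state this explicitly.

LATE_DAY_BUDGET = 6      # late days available to each team
PENALTY_PER_DAY = 2      # percent (absolute) removed from the max grade


def max_grade_after_late_days(late_days_used, max_grade=100):
    """Return the maximum attainable grade after late-day penalties."""
    extra_days = max(0, late_days_used - LATE_DAY_BUDGET)
    return max(0, max_grade - PENALTY_PER_DAY * extra_days)
```

Under this reading, a team that has used 8 late days in total would have its maximum grade reduced from 100 to 96.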

Tasks & Datasets

The course will be primarily centered on a few datasets/tasks to facilitate cross-team collaboration and technical assistance. If your team has a good reason to work on something else, please reach out so we can discuss it and put together a proposal.

Simulator Based

Room-Across-Room	Code	Multilingual Embodied Navigation
ALFRED	Code	Embodied instruction following with interaction
TEACh	Code	Embodied Teaching (and Dialogue)

Question Answering & Captioning

TextVQA	Code	Text in images (referring expressions and reading)
WebQA	Code	Multihop Visual QA
VizWiz	VQA and Captioning	Visual models for blind users
Social-IQ	Code Proj page	Video Question Answering focused on social interactions

Multi-turn QA

CompGuessWhat?!		Visual Guessing Game and Attribute Prediction
PhotoBook Dialogue	Data	Visual reference game via dialogue

Audio

Spoken Image Captions		A series of audio corpora and corresponding images for connecting audio directly to image regions.

Video

TVQA		Video Question Answering Dataset
VATEX		Multilingual Video Captioning and Translation

Physical hardware / robots / sensors ...

What about physical hardware? Robots? Tasks rather than datasets? Let's talk.

Compute

Limited AWS and Google Cloud compute credits will be made available to each group, so please consider both your interests and available compute resources when deciding on a dataset/project.

Lectures

Tuesday	Thursday
Jan 18: Course Structure. Research and technical challenges; Syllabus and requirements	Jan 20: Multimodal applications and datasets. Research tasks and datasets; Team projects
Readings: Multimodal Machine Learning: A Survey and Taxonomy (Sections 1-4); Representation Learning: A Review and New Perspectives (Sections 1-3, 6-8, 11)
Jan 25: Basics: "Deep learning". Language, Vision, Audio; Loss functions and neural networks	Jan 27: Basics: Optimization. Gradients and backprop; Practical deep learning optimization
Readings: A listed or proposed dataset/task
Feb 1: Unimodal representations (Vision). CNNs; Residuals and Skip connections	Feb 3: Unimodal representations (Language). Gating and LSTMs; Transformers. Groups formed, sign up for project hours
Readings: Visualizing and Understanding Convolutional Networks; Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization; Visualizing and Understanding Recurrent Networks; Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
Feb 8: Project Hours (Project ideas)	Feb 10: Project Hours (Project ideas). R1: Dataset Proposal and Analysis
Readings: A paper of your choosing which is relevant to your project. Note: Team members must choose different papers.
Feb 15: Multimodal & Coordinated Representations. Auto-encoders; CCA; Multi-view Clustering	Feb 17: Alignment and Attention. Explicit: Dynamic Time Warping; Implicit: Attention
Readings: Every Picture Tells a Story: Generating Sentences from Images; Detecting Visual Text; From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Neural Module Networks
Feb 22: Alignment + Representation. Self-attention; Multimodal Transformers	Feb 24: Alignment + Representation (Cont). Self-attention models; Multimodal Transformers
Readings: Multimodal Transformer for Unaligned Multimodal Language Sequences; Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks; Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs; ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Mar 1: Alignment + Representation (Cont). Video Transformers; Self-Attention for Vision	Mar 3: Ethics (Guest: Emma Strubell). R2: Related Work and Model Proposal
Readings: None
Mar 8: Spring Break!	Mar 10: Spring Break!
Readings: None
Mar 15: Project Hours (Research Discussion)	Mar 17: Project Hours (Research Discussion)
Readings: A paper of your choosing which is relevant to your project. Note: Team members must choose different papers.
Mar 22: Alignment + Translation. Module Networks; Tree-based & Stack models	Mar 24: Fusion and co-learning. Multi-kernel learning and fusion; Few shot learning and co-learning
Readings: PixL2R: Guiding Reinforcement Learning Using Natural Language by Mapping Pixels to Rewards; Multimodal sentiment analysis with word-level fusion and reinforcement learning; Language Conditioned Imitation Learning Over Unstructured Data
Mar 29: Reinforcement Learning. Markov Decision Processes; Q-learning and policy gradients	Mar 31: Multimodal RL. Deep Q learning; Multimodal applications. R3: Baseline analysis
Readings: Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight; FILM: Following Instructions in Language with Modular Methods; CLIPort: What and Where Pathways for Robotic Manipulation
Apr 5: Embodiment. Action as a modality	Apr 7: -- NO CLASS --
Readings: None
Apr 12: Embodiment (cont). Language to Control	Apr 14: New research directions. Recent publications
Readings: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances; GroupViT: Semantic Segmentation Emerges from Text Supervision; Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality; Image Retrieval from Contextual Descriptions; DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers
Apr 19: Project Hours (Final)	Apr 21: Project Hours (Final)
Readings: None
Apr 26: Daniel Fried	Apr 28: Chris Paxton
Readings: Unified Pragmatic Models for Generating and Following Instructions; Speaker-Follower Models for Vision-and-Language Navigation; StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects
May 5 (5:30-8:30pm): Project Presentations (Hybrid: PH 100)	May 6: Final Reports Due
