Experience Grounds Language
Yonatan Bisk* Ari Holtzman* Jesse Thomason*
Jacob Andreas Yoshua Bengio Joyce Chai Mirella Lapata
Angeliki Lazaridou Jonathan May Aleksandr Nisnevich Nicolas Pinto Joseph Turian
Abstract
Language understanding research is held back
by a failure to relate language to the physical
world it describes and to the social interactions
it facilitates. Despite the incredible effectiveness of language processing models at tackling tasks after being trained on text alone, successful linguistic communication relies on a shared
experience of the world. It is this shared expe-
rience that makes utterances meaningful.
Natural language processing is a diverse field,
and progress throughout its development has
come from new representational theories, mod-
eling techniques, data collection paradigms,
and tasks. We posit that the present success
of representation learning approaches trained
on large, text-only corpora requires the paral-
lel tradition of research on the broader physi-
cal and social context of language to address
the deeper questions of communication.
Improvements in hardware and data collection
have galvanized progress in NLP across many
benchmark tasks. Impressive performance has been
achieved in language modeling (Radford et al.,
2019; Zellers et al., 2019b; Keskar et al., 2019) and
span-selection question answering (Devlin et al.,
2019; Yang et al., 2019b; Lan et al., 2020) through
massive data and massive models. With models
exceeding human performance on such tasks, now
is an excellent time to reflect on a key question:
Where is NLP going?
In this paper, we consider how the data and world a language learner is exposed to define and constrain the scope of that learner’s semantics. Mean-
ing does not arise from the statistical distribution
of words, but from their use by people to communi-
cate. Many of the assumptions and understandings
on which communication relies lie outside of text.
We must consider what is missing from models trained solely on text corpora, even when those corpora are meticulously annotated or Internet-scale.

“Meaning is not a unique property of language, but a general characteristic of human activity ... We cannot say that each morpheme or word has a single or central meaning, or even that it has a continuous or coherent range of meanings ... there are two separate uses and meanings of language – the concrete ... and the abstract.”
Zellig S. Harris (Distributional Structure, 1954)
You can’t learn language from the radio. Nearly
every NLP course will at some point make this
claim. The futility of learning language from lin-
guistic signal alone is intuitive, and mirrors the
belief that humans lean deeply on non-linguistic
knowledge (Chomsky, 1965, 1980). However, as a field we attempt exactly this futile exercise: learning language from the Internet, which stands in as the modern radio delivering limitless language. In this
piece, we argue that the need for language to attach
to “extralinguistic events” (Ervin-Tripp, 1973) and
the requirement for social context (Baldwin et al.,
1996) should guide our research.
Drawing inspiration from previous work in NLP,
Cognitive Science, and Linguistics, we propose the
notion of a World Scope (WS) as a lens through
which to audit progress in NLP. We describe five
WSs, and note that most trending work in NLP
operates in the second (Internet-scale data).
We define five levels of World Scope:
WS1. Corpus (our past)
WS2. Internet (most of current NLP)
WS3. Perception (multimodal NLP)
WS4. Embodiment
WS5. Social
These World Scopes go beyond text to consider
the contextual foundations of language: grounding,
embodiment, and social interaction. We describe a
brief history and ongoing progression of how con-
textual information can factor into representations
and tasks. We conclude with a discussion of how
this integration can move the field forward. We be-
lieve this World Scope framing serves as a roadmap
for truly contextual language understanding.
1 WS1: Corpora and Representations
The story of data-driven language research begins
with the corpus. The Penn Treebank (Marcus et al.,
1993) is the canonical example of a clean subset of
naturally generated language, processed and anno-
tated for the purpose of studying representations.
Such corpora and the model representations built
from them exemplify WS1. Community energy
was initially directed at finding formal linguistic
structure, such as recovering syntax trees. Recent
success on downstream tasks has not required such
explicitly annotated signal, leaning instead on un-
structured fuzzy representations. These representa-
tions span from dense word vectors (Mikolov et al.,
2013) to contextualized pretrained representations
(Peters et al., 2018; Devlin et al., 2019).
Word representations have a long history predat-
ing the recent success of deep learning methods.
Outside of NLP, philosophy (Austin, 1975) and lin-
guistics (Lakoff, 1973; Coleman and Kay, 1981)
recognized that meaning is flexible yet structured.
Early experiments on neural networks trained with
sequences of words (Elman, 1990; Bengio et al.,
2003) suggested that vector representations could
capture both syntax and semantics. Subsequent
experiments with larger models, documents, and
corpora have demonstrated that representations
learned from text capture a great deal of informa-
tion about meaning in and out of context (Collobert
and Weston, 2008; Turian et al., 2010; Mikolov
et al., 2013; McCann et al., 2017).
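As a minimal sketch of this paradigm (using gensim’s Word2Vec with its gensim ≥ 4 parameter names; the toy corpus and hyperparameters are our illustrative choices, not those of the cited works):

from gensim.models import Word2Vec

# Toy corpus; the cited systems train on billions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# Skip-gram (sg=1) learns dense vectors from word co-occurrence alone.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Neighbors in vector space approximate distributional similarity.
print(model.wv.most_similar("cat", topn=3))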
The intuition of such embedding representations,
that context lends meaning, has long been acknowl-
edged (Firth, 1957; Turney and Pantel, 2010). Ear-
lier on, discrete, hierarchical representations, such
as agglomerative clustering guided by mutual in-
formation (Brown et al., 1992), were constructed
with some innate interpretability. A word’s position
in such a hierarchy captures semantic and syntac-
tic distinctions. When the Baum–Welch algorithm
(Welch, 2003) is applied to unsupervised Hidden
Markov Models, it assigns a class distribution to
every word, and that distribution is a partial rep-
resentation of a word’s “meaning.” If the set of
classes is small, syntax-like classes are induced;
if the set is large, classes become more semantic.
These representations are powerful in that they cap-
Academic interest in Firth and Harris increases dramatically
around 2010, perhaps due to the popularization of Firth (1957)
“You shall know a word by the company it keeps."
ture linguistic intuitions without supervision, but
they are constrained by the structure they impose
with respect to the number of classes chosen.
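As a rough sketch of such class induction (assuming hmmlearn’s CategoricalHMM, which fits an unsupervised HMM via Baum–Welch; older releases call it MultinomialHMM, and the corpus and class count here are toy choices):

import numpy as np
from hmmlearn import hmm  # EM (Baum-Welch) training for unsupervised HMMs

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
X = np.array([[vocab.index(t)] for t in tokens])  # integer-encoded sequence

# Hidden states act as induced word classes.
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths=[len(tokens)])

# A word's average posterior over states is a partial, distributional
# representation of its "meaning."
posteriors = model.predict_proba(X)
for w in vocab:
    occurrences = [i for i, t in enumerate(tokens) if t == w]
    print(w, posteriors[occurrences].mean(axis=0).round(2))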
The intuition that meaning requires a larger context, that “you shall know a word by the company it keeps” (Firth, 1957), manifested early via Latent Semantic Indexing/Analysis (Deerwester et al., 1988, 1990; Dumais, 2004) and later in the generative framework of Latent Dirichlet Allocation (Blei et al., 2003). LDA represents a document as a bag-of-words conditioned on latent topics, while LSI/A uses singular value decomposition to project a co-occurrence matrix to low-dimensional word vectors that preserve locality. These methods discard sentence structure in favor of the document.
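A minimal LSA sketch (using scikit-learn; the toy documents and dimensionality are our illustrative choices, not from the cited work):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog lay on the rug",
    "stocks fell as markets closed",
]

# Bag-of-words document-term matrix: sentence structure is discarded.
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD projects counts to a low-dimensional latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(counts)  # documents in the latent space
term_vectors = svd.components_.T         # terms in the same space
print(doc_vectors.round(2))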
Representing words through other words is a
comfortable proposition, as it provides the illusion
of definitions by implicit analogy to thesauri and
related words in a dictionary definition. However,
the recent trends in deep learning approaches to
language modeling favor representing meaning in
fixed-length vectors with no obvious interpretation.
The question of where meaning resides in “connec-
tionist” systems like Deep Neural Networks is an
old one (Pollack, 1987; James and Miikkulainen,
1995). Are concepts distributed through edges or
local to units in an artificial neural network?
“... there has been a long and unresolved debate between those who favor localist representations in which each processing element corresponds to a meaningful concept and those who favor distributed representations.”
Geoffrey E. Hinton (Preface to the Special Issue on Connectionist Symbol Processing, 1990)
In connectionism, words were no longer defined
over interpretable dimensions or symbols, which
were perceived as having intrinsic meaning. The
tension of modeling symbols and distributed repre-
sentations is articulated by Smolensky (1990), and
alternative representations (Kohonen, 1984; Hinton
et al., 1986; Barlow, 1989) and approaches to struc-
ture and composition (Erk and Padó, 2008; Socher
et al., 2012) span decades of research.
The Brown Corpus (Francis, 1964) and Penn
Treebank (Marcus et al., 1993) defined context and
structure in NLP for decades. Only relatively recently (Baroni et al., 2009) have annotation costs decreased and large-scale web-crawls become viable enough to enable the introduction of
more complex text-based tasks. This transition to
larger, unstructured context (WS2) induced a richer
semantics than was previously believed possible
under the distributional hypothesis.
2 WS2: The Written World
Corpora in NLP have broadened to include large
web-crawls. The use of unstructured, unlabeled,
multi-domain, and multilingual data broadens our
world scope, in the limit, to everything humanity
has ever written.[1] We are no longer constrained to
a single author or source, and the temptation for
NLP is to believe everything that needs knowing
can be learned from the written world. But, a large
and noisy text corpus is still a text corpus.
This move towards using large-scale raw data
has led to substantial advances in performance on
existing and novel community benchmarks (Devlin
et al., 2019; Brown et al., 2020). Scale in data and
modeling has demonstrated that a single represen-
tation can discover both rich syntax and semantics
without our help (Tenney et al., 2019). This change
is perhaps best seen in transfer learning enabled
by representations in deep models. Traditionally,
transfer learning relied on our understanding of
model classes, such as English grammar. Domain
adaptation simply required sufficient data to cap-
ture lexical variation, by assuming most higher-
level structure would remain the same. Unsuper-
vised representations today capture deep associ-
ations across multiple domains, and can be used to successfully transfer knowledge into surprisingly
diverse contexts (Brown et al., 2020).
These representations require scale in terms of
both data and parameters. Concretely, Mikolov
et al. (2013) trained on 1.6 billion tokens, while
Pennington et al. (2014) scaled up to 840 billion
tokens from Common Crawl. Recent approaches have made progress by substantially increasing the number of model parameters to better consume these vast quantities of data. Where Peters et al. (2018) introduced ELMo with 10^8 parameters, Transformer models (Vaswani et al., 2017) have continued to scale by orders of magnitude between papers (Devlin et al., 2019; Radford et al., 2019; Zellers et al., 2019b), reaching 10^11 parameters (Brown et al., 2020).

[1] A parallel discussion would focus on the hardware required to enable advances to higher World Scopes. Playstations (Pinto et al., 2009) and then GPUs (Krizhevsky et al., 2012) made many WS2 advances possible. Perception, interaction, and robotics leverage other new hardware.
Current models are the next (impressive) step in a language modeling tradition that started with Good (1953), continued with the smoothing weights of Kneser and Ney (1995) and Chen and Goodman (1996), and the power-law distributions of Teh (2006). Modern approaches
to learning dense representations allow us to bet-
ter estimate these distributions from massive cor-
pora. However, modeling lexical co-occurrence,
no matter the scale, is still modeling the written
world. Models constructed this way blindly search
for symbolic co-occurrences devoid of meaning.
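For concreteness, the interpolated Kneser–Ney estimate for a bigram model can be written as follows (a standard textbook formulation, not reproduced from the cited papers):

P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)

where the discount d shaves probability mass off observed bigrams, \lambda(w_{i-1}) redistributes that mass, and P_{\mathrm{cont}}(w_i) \propto \lvert \{ w' : c(w' w_i) > 0 \} \rvert scores a word by how many distinct contexts it follows rather than by its raw frequency.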
How can models yield both “impressive results”
and “diminishing returns”? Language modeling—
the modern workhorse of neural NLP systems—is
a canonical example. Recent pretraining literature
has produced results that few could have predicted,
crowding leaderboards with “super-human” accuracy (Rajpurkar et al., 2018). However, there are
diminishing returns. For example, on the LAM-
BADA dataset (Paperno et al., 2016), designed
to capture human intuition, GPT2 (Radford et al.,
2019) (1.5B), Megatron-LM (Shoeybi et al., 2019)
(8.3B), and TuringNLG (Rosset, 2020) (17B) per-
form within a few points of each other and very far
from perfect (<68%). When adding another order of magnitude of parameters (175B), Brown et al. (2020) gain 8 percentage points: impressive, but still leaving 25% unsolved. Continuing to expand
hardware, data sizes, and financial compute cost
by orders of magnitude will yield further gains, but
the slope of the increase is quickly decreasing.
The aforementioned approaches for learning
transferable representations demonstrate that sen-
tence and document context provide powerful sig-
nals for learning aspects of meaning, especially se-
mantic relations among words (Fu et al., 2014) and
inferential relationships among sentences (Wang
et al., 2019a). The extent to which they capture
deeper notions of contextual meaning remains an
open question. Past work has found that pretrained
word and sentence representations fail to capture
many grounded features of words (Lucy and Gau-
thier, 2017) and sentences, and current NLU sys-
tems fail on the thick tail of experience-informed in-
ferences, such as hard coreference problems (Peng
et al., 2015). “I parked my car in the compact parking space because it looked (big/small) enough.” still presents problems for text-only learners.
As text pretraining schemes seem to be reach-
ing the point of diminishing returns, even for some
syntactic phenomena (van Schijndel et al., 2019),
we posit that other forms of supervision, such as
multimodal perception (Ilharco et al., 2019), are
necessary to learn the remaining aspects of mean-
ing in context. Learning by observation should not
be a purely linguistic process, since leveraging and
combining the patterns of multimodal perception
can combinatorially boost the amount of signal in
data through cross-referencing and synthesis.
3 WS3: The World of Sights and Sounds
Language learning needs perception, because per-
ception forms the basis for many of our semantic
axioms. Learned, physical heuristics, such as the
fact that a falling cat will land quietly, are general-
ized and abstracted into language metaphors like “as nimble as a cat” (Lakoff, 1980). World knowl-
edge forms the basis for how people make entail-
ment and reasoning decisions, commonly driven
by mental simulation and analogy (Hofstadter and
Sander, 2013). Perception is the foremost source
of reporting bias. The assumption that we all see
and hear the same things informs not just what we
name, but what we choose to assume and leave un-
written. Further, there exists strong evidence that
children require grounded sensory perception, not
just speech, to learn language (Sachs et al., 1981;
O’Grady, 2005; Vigliocco et al., 2014).
Perception includes auditory, tactile, and visual
input. Even restricted to purely linguistic sig-
nals, sarcasm, stress, and meaning can be implied
through prosody. Further, tactile senses lend mean-
ing, both physical (Sinapov et al., 2014; Thomason
et al., 2016) and abstract, to concepts like heavy and
soft. Visual perception is a rich signal for modeling
a vastness of experiences in the world that cannot
be documented by text alone (Harnad, 1990).
For example, frames and scripts (Schank and
Abelson, 1977; Charniak, 1977; Dejong, 1981;
Mooney and Dejong, 1985) require understand-
ing often unstated sets of pre- and post-conditions
about the world. To borrow from Charniak (1977),
how should we learn the meaning, method, and im-
plications of painting? A web crawl of knowledge
Eugene Charniak (A Framed PAINTING: The Representation
of a Common Sense Knowledge Fragment 1977)
from an exponential number of possible how-to,
text-only guides and manuals (Bisk et al., 2020)
is misdirected without some fundamental referents
to which to ground symbols. Models must be able
to watch and recognize objects, people, and activi-
ties to understand the language describing them (Li
et al., 2019b; Krishna et al., 2017; Yatskar et al.,
2016; Perlis, 2016) and access fine-grained notions
of causality, physics, and social interactions.
While the NLP community has played an im-
portant role in the history of grounding (Mooney,
2008), remarkable progress has recently taken
place in the Computer Vision community. It is
tempting to assume that vision models trained
to identify 1,000 ImageNet classes (Russakovsky
et al., 2015)[2] are limited to extracting a bag of visual words. In reality, Computer Vision has been making in-roads into complex visual, physical, and social phenomena, while providing reusable infrastructure.[3] The stability of these architectures allows
for new research into more challenging world mod-
eling. Mottaghi et al. (2016) predict the effects of forces on objects in images. Bakhtin et al. (2019) extend this physical reasoning to complex puzzles of cause and effect. Sun et al. (2019b,a) model
scripts and actions, and alternative unsupervised
training regimes (Bachman et al., 2019) open up
research towards automatic concept formation.
Advances in computer vision have enabled build-
ing semantic representations rich enough to inter-
act with natural language. In the last decade of
work descendant from image captioning (Farhadi
et al., 2010; Mitchell et al., 2012), a myriad of
tasks on visual question answering (Antol et al.,
2015; Das et al., 2018; Yagcioglu et al., 2018),
natural language and visual reasoning (Suhr et al.,
2019b), visual commonsense (Zellers et al., 2019a),
and multilingual captioning/translation via video
(Wang et al., 2019b) have emerged. These com-
bined text and vision benchmarks are rich enough
to train large-scale, multimodal transformers (Li
et al., 2019a; Lu et al., 2019; Zhou et al., 2019)
without language pretraining (e.g., via Conceptual Captions (Sharma et al., 2018)) or further broad-
ened to include audio (Tsai et al., 2019). Vision can
also help ground speech signals (Srinivasan et al.,
2020; Harwath et al., 2019) to facilitate discovery
of linguistic concepts (Harwath et al., 2020).

[2] Or the 1,600 classes of Anderson et al. (2017).
[3] Torchvision/Detectron2 include dozens of trained models.
At the same time, NLP resources contributed
to the success of these vision backbones. Hierar-
chical semantic representations emerge from Im-
ageNet classification pretraining partially due to
class hypernyms owed to that dataset’s WordNet
origins. For example, the person class sub-divides
into many professions and hobbies, like firefighter,
gymnast, and doctor. To differentiate such sibling
classes, learned vectors can also encode lower-level
characteristics like clothing, hair, and typical sur-
rounding scenes. These representations allow for
pixel level masks and skeletal modeling, and can be
extended to zero-shot settings targeting all 20K Im-
ageNet categories (Chao et al., 2016; Changpinyo
et al., 2017). Modern architectures also learn to dif-
ferentiate instances within a general class, such as
face. For example, facial recognition benchmarks
require distinguishing over 10K unique faces (Liu
et al., 2015). While vision is by no means “solved,”
benchmarks have led to off-the-shelf tools for build-
ing representations rich enough to identify tens of
thousands of objects, scenes, and individuals.
A WS3 agent, having access to potentially end-
less hours of video data showing the intricate de-
tails of daily comings and goings, procedures, and
events, reduces susceptibility to the reporting bias
of WS2. An ideal WS3 agent will exhibit bet-
ter long-tail generalization and understanding than
any language-only system could. This generaliza-
tion should manifest in existing benchmarks, but
would be most prominent in a test of zero-shot cir-
cumstances, such as “Will this car fit through that
tunnel?”, and rarely documented behaviors as ex-
amined in script learning. Yet the WS3 agent will
likely fail to answer, “Would a ceramic or paper plate make a better frisbee?” The agent has not tried
to throw various objects and understand how their
velocity and shape interact with the atmosphere to
create lift. The agent cannot test novel hypotheses
by intervention and action in the world.
“If A and B have some environments in common and some not ... we say that they have different meanings, the amount of meaning difference corresponding roughly to the amount of difference in their environments ...”
Zellig S. Harris (Distributional Structure, 1954)
4 WS4: Embodiment and Action
In human development, interactive multimodal sen-
sory experience forms the basis of action-oriented
categories (Thelen and Smith, 1996) as children
learn how to manipulate their perception by ma-
nipulating their environment. Language grounding
enables an agent to connect words to these action-
oriented categories for communication (Smith and
Gasser, 2005), but requires action to fully discover
such connections. Embodiment—situated action
taking—is therefore a natural next broader context.
An embodied agent, whether in a virtual world,
such as a 2D Maze (MacMahon et al., 2006), a
grid world (Chevalier-Boisvert et al., 2019), a sim-
ulated house (Anderson et al., 2018; Thomason
et al., 2019b; Shridhar et al., 2020), or the real
world (Tellex et al., 2011; Matuszek, 2018; Thoma-
son et al., 2020; Tellex et al., 2020) must translate
from language to action. Control and action taking
open several new dimensions to understanding and
actively learning about the world. Queries can be
resolved via dialog-based exploration with a hu-
man interlocutor (Liu and Chai, 2015), even as new
object properties, like texture and weight (Thoma-
son et al., 2017), or feedback, like muscle activa-
tions (Moro and Kennington, 2018), become avail-
able. We see the need for embodied language with
complex meaning when thinking deeply about even
the most innocuous of questions:
Is an orange more like a baseball or more
like a banana?
WS1 likely has no answer beyond that both objects are common nouns and can be held.
WS2 may capture that oranges and baseballs both
roll, but not the deformation strength, surface texture, or relative sizes of these objects (Elazar et al.,
2019). WS3 may realize the relative deformability
of these objects, but is likely to confuse how much
force is necessary given that baseballs are used
much more roughly than oranges. WS4 can appre-
ciate the nuances of the question—the orange and
baseball afford similar manipulation because they
have similar texture and weight, while the orange
and banana both contain peels, deform, and are
edible. People can reason over rich representations
of common objects that these words evoke.
Planning is where people first learn abstraction
and simple examples of post-conditions through
trial and error. The most basic scripts humans learn
start with moving our own bodies and achieving
simple goals as children, such as stacking blocks.
In this space, we have unlimited supervision from
the environment and can learn to generalize across
plans and actions. In general, simple worlds do
not entail simple concepts: even in a block world
concepts like “mirroring” appear (Bisk et al., 2018).
Humans generalize and apply physical phenomena
to abstract concepts with ease.
In addition to learning basic physical proper-
ties of the world from interaction, WS4 also al-
lows the agent to construct rich pre-linguistic rep-
resentations from which to generalize. Hespos and
Spelke (2004) show pre-linguistic category formation in children; these categories are later codified by
social constructs. Mounting evidence seems to indi-
cate that children have trouble transferring knowl-
edge from the 2D world of books (Barr, 2013) and
iPads (Lin et al., 2017) to the physical 3D world.
So while we might choose to believe that we can en-
code parameters (Chomsky, 1981) more effectively
and efficiently than evolution provided us, develop-
mental experiments indicate doing so without 3D
interaction may prove difficult.
Part of the problem is that much of the knowl-
edge humans hold about the world is intuitive,
possibly incommunicable by language, but still
required to understand language. Much of this
knowledge revolves around physical realities that
real-world agents will encounter. Consider how
many explicit and implicit metaphors are based on
the idea that far-away things have little influence
on manipulating local space: “a distant concern”
and “we’ll cross that bridge when we come to it.”
Robotics and embodiment are not available in
the same off-the-shelf manner as computer vision
models. However, there is rapid progress in simu-
lators and commercial robotics, and as language re-
searchers we should match these advances at every
step. As action spaces grow, we can study complex
language instructions in simulated homes (Shrid-
har et al., 2020) or map language to physical robot
control (Blukis et al., 2019; Chai et al., 2018). The
last few years have seen massive advances in both
high-fidelity simulators for robotics (Todorov et al., 2012; Coumans and Bai, 2016–2019; NVIDIA, 2019; Xiang et al., 2020) and the cost and availability of commodity hardware (Fitzgerald, 2013; Campeau-Lecours et al., 2019; Murali et al., 2019).

“In order to talk about concepts, we must understand the importance of mental models... we set up a model of the world which serves as a framework in which to organize our thoughts. We abstract the presence of particular objects, having properties, and entering into events and relationships.”
Terry Winograd (1971)
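As a taste of what such off-the-shelf simulation looks like (a minimal PyBullet sketch; the bundled assets are the library’s own, and the loop is a placeholder for an actual language-conditioned policy):

import pybullet as p
import pybullet_data

client = p.connect(p.DIRECT)  # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.loadURDF("plane.urdf")
robot = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])

# An embodied agent would interleave perception, language grounding,
# and action selection inside this loop; here we only step physics.
for _ in range(240):
    p.stepSimulation()

print(p.getBasePositionAndOrientation(robot))
p.disconnect()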
As computers transition from desktops to perva-
sive mobile and edge devices, we must make and
meet the expectation that NLP can be deployed in
any of these contexts. Current representations have
very limited utility in even the most basic robotic
settings (Scalise et al., 2019), making collaborative
robotics (Rosenthal et al., 2010) largely a domain
of custom engineering rather than science.
5 WS5: The Social World
Interpersonal communication is the foundational
use case of natural language (Dunbar, 1993). The
physical world gives meaning to metaphors and
instructions, but utterances come from a source
with a purpose. Take J.L. Austin’s classic example
of “BULL” being written on the side of a fence in
a large field (Austin, 1975). It is a fundamentally
social inference to realize that this word indicates
the presence of a dangerous creature, and that the
word is written on the opposite side of the fence
from where that creature lives.
Interpersonal dialogue as a grand test for AI is
older than the term “artificial intelligence,” begin-
ning at least with Turing (1950)’s Imitation Game.
Turing was careful to show how easily a naïve tester
could be tricked. Framing, such as suggesting that a
chatbot speaks English as a second language (Sam-
ple and Hern, 2014), can create the appearance of
genuine content where there is none (Weizenbaum,
1966). This phenomenon has been noted countless
times, from criticisms of Speech Recognition as
“deceit and glamour” (Pierce, 1969) to complaints
of humanity’s “gullibility gap” (Marcus and Davis,
2019). We instead focus on why the social world
is vital to language learning.
Language that Does Something
Work in the
philosophy of language has long suggested that
function is the source of meaning, as famously il-
lustrated through Wittgenstein’s “language games”
(Wittgenstein, 1953, 1958). In linguistics, the
usage-based theory of language acquisition sug-
gests that constructions that are useful are the build-
ing blocks for everything else (Langacker, 1987,
1991). The economy of this notion of use has
been the subject of much inquiry and debate (Grice,
1975). In recent years, these threads have begun to
shed light on what use-cases language presents in
both acquisition and its initial origins in our species
(Tomasello, 2009; Barsalou, 2008), indicating the
fundamental role of the social world.
WS1, WS2, WS3, and WS4 expand the fac-
torizations of information available to linguistic
meaning; WS5 allows language to be a cause instead of
just a source of data. This is the ultimate goal for
a language learner: to generate language that does
something to the world.
Passive creation and evaluation of generated lan-
guage separates generated utterances from their
effects on other people, and while the latter is
a rich learning signal it is inherently difficult to
annotate. In order to learn the effects language
has on the world, an agent must participate in lin-
guistic activity, such as negotiation (Yang et al.,
2019a; He et al., 2018; Lewis et al., 2017), collab-
oration (Chai et al., 2017), visual disambiguation
(Anderson et al., 2018; Lazaridou et al., 2017; Liu
and Chai, 2015), or providing emotional support
(Rashkin et al., 2019). These activities require in-
ferring mental states and social outcomes—a key
area of interest in itself (Zadeh et al., 2019).
What “lame” means in terms of discriminative
information is always at question: it can be defined
as “undesirable,” but what it tells one about the
processes operating in the environment requires
social context to determine (Bloom, 2002). It is
the toddler’s social experimentation with “You’re
so lame!” that gives the word weight and definite
intent (Ornaghi et al., 2011). In other words, the
discriminative signal for the most foundational part
of a word’s meaning can only be observed by its ef-
fect on the world, and active experimentation is key
to learning that effect. Active experimentation with
language starkly contrasts with the disembodied
chat bots that are the focus of the current dialogue
community (Roller et al., 2020; Adiwardana et al.,
2020; Zhou et al., 2020; Chen et al., 2018; Serban
et al., 2017), which often do not learn from individ-
ual experiences and whose environments are not
persistent enough to learn the effects of actions.
Theory of Mind
When attempting to get what
we want, we confront people who have their own
desires and identities. The ability to consider the
feelings and knowledge of others is now com-
monly referred to as the “Theory of Mind” (Ne-
matzadeh et al., 2018). This paradigm has also
been described under the “Speaker-Listener” model
(Stephens et al., 2010), and a rich theory to describe
this computationally is being actively developed
under the Rational Speech Act Model (Frank and
Goodman, 2012; Bergen et al., 2016).
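The core of the Rational Speech Act recursion is small enough to state directly (a minimal sketch with the classic scalar-implicature lexicon; the toy lexicon and uniform state prior are our illustrative assumptions):

import numpy as np

# Literal truth conditions: rows = utterances ("some", "all"),
# columns = world states (some-but-not-all, all).
lexicon = np.array([[1.0, 1.0],
                    [0.0, 1.0]])

def rownorm(m):
    return m / m.sum(axis=1, keepdims=True)

alpha = 1.0                   # speaker rationality
L0 = rownorm(lexicon)         # literal listener: P(state | utterance)
S1 = rownorm(L0.T ** alpha)   # pragmatic speaker: P(utterance | state)
L1 = rownorm(S1.T)            # pragmatic listener: P(state | utterance)

# Hearing "some," the pragmatic listener infers "not all."
print(L1[0])  # -> [0.75 0.25]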
A series of challenges that attempt to address this
fundamental aspect of communication has been
introduced (Nematzadeh et al., 2018; Sap et al.,
2019). These works are a great start towards deeper
understanding, but static datasets can be problem-
atic due to the risk of embedding spurious patterns
and bias (de Vries et al., 2020; Le et al., 2019;
Gururangan et al., 2018; Glockner et al., 2018),
especially because examples where annotators can-
not agree (which are usually thrown out before
the dataset is released) still occur in real use cases.
More flexible, dynamic evaluations (Zellers et al.,
2020; Dinan et al., 2019) are a partial solution, but
true persistence of identity and adaption to change
are both necessary and still a long way off.
Training data in WS1-4, complex and large as
it can be, does not offer the discriminatory signals
that make the hypothesizing of consistent identity
or mental states an efficient path towards lowering
perplexity or raising accuracy (Liu et al., 2016; De-
Vault et al., 2006). First, there is a lack of inductive
bias (Martin et al., 2018). Models learn what they
need to discriminate between potential labels, and
it is unlikely that universal function approximators
such as neural networks would ever reliably posit
that people, events, and causality exist without be-
ing biased towards such solutions (Mitchell, 1980).
Second, current cross entropy training losses ac-
tively discourage learning the tail of the distribu-
tion properly, as statistically infrequent events are
drowned out (Pennington et al., 2014; Holtzman
et al., 2020). Meanwhile, it is precisely humans’
ability to draw on past experience and make zero-
shot decisions that AI aims to emulate.
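The second point can be made concrete: the corpus-level cross-entropy objective is (in our notation, for illustration)

\mathcal{L}(\theta) = -\sum_{x} \hat{p}(x) \log q_\theta(x),

where \hat{p} is the empirical frequency of event x. Tail events receive weight proportional to their rarity, so a model can drive the loss down while representing them arbitrarily poorly.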
Language in a Social Context
Whenever lan-
guage is used between people, it exists in a concrete
social context: status, role, intention, and countless
other variables intersect at a specific point (Ward-
haugh, 2011). These complexities are overlooked
through selecting labels on which crowd workers
agree. Current notions of ground truth in dataset
construction are based on crowd consensus bereft
of social context. We posit that ecologically valid
evaluation of generative models will require the
construction of situations where artificial agents are
considered to have enough identity to be granted
social standing for these interactions.
Social interaction is a precious signal, but ini-
tial studies have been strained by the training-
validation-test set scenario and reference-backed
evaluations. Collecting data about rich natural sit-
uations is often impossible. To address this gap,
learning by participation, where users can freely
interact with an agent, is a necessary step to the
ultimately social venture of communication. By
exhibiting different attributes and sending varying
signals, the sociolinguistic construction of identity
(Ochs, 1993) could be examined more deeply. Such
experimentation in social intelligence is simply not
possible with a fixed corpus. Once models are ex-
pected to be interacted with at test time, probing their decision boundaries for simplifications of reality and a lack of commonsense knowledge, as in Gardner et al. (2020) and Kaushik et al. (2020), will become natural.
6 Self-Evaluation
We use the notion of World Scopes to make the
following concrete claims:
You can’t learn language ...

... from the radio (Internet). (WS2 → WS3)
A task learner cannot be said to be in WS3 if it can succeed without perception (e.g., visual, auditory).

... from a television. (WS3 → WS4)
A task learner cannot be said to be in WS4 if the space of its world actions and consequences can be enumerated.

... by yourself. (WS4 → WS5)
A task learner cannot be said to be in WS5 unless achieving its goals requires cooperating with a human in the loop.
By these definitions, most of NLP research still
resides in WS2. This fact does not invalidate the
utility or need for any of the research within NLP,
but it is to say that much of that existing research
targets a different goal than language learning.
“These problems include the need to bring meaning and reasoning into systems that perform natural language processing, the need to infer and represent causality, the need to develop computationally-tractable representations of uncertainty and the need to develop systems that formulate and pursue long-term goals.”
Michael Jordan (Artificial Intelligence: The Revolution Hasn’t Happened Yet, 2019)
Where Should We Start?
Many in our commu-
nity are already examining phenomena in WSs
3-5. Note that research can explore higher WS
phenomena without a resultant learner being in a
higher WS. For example, a chatbot can investigate
principles of the social world, but still lack the un-
derlying social standing required for WS5. Next
we describe four language use contexts that we believe both pose research questions to be tackled and help illustrate the need to move beyond WS2.
Second language acquisition
when visiting a
foreign country leverages a shared, social world
model that allows pointing to referent objects and
miming internal states like hunger. The interlingua
is physical and experiential. Such a rich internal
world model should also be the goal for MT models:
starting with images (Huang et al., 2020), moving
through simulation, and then to the real world.
Coreference and WSD
leverage a shared scene
and theory of mind. To what extent are current
coreference resolution issues resolved if an agent
models the listener’s desires and experiences explic-
itly rather than looking solely for adjacent lexical
items? This setting is easiest to explore in embod-
ied environments, but is not exclusive to them (e.g.,
TextWorld (Côté et al., 2018)).
Novel word learning
from tactile knowledge
and use: What is the instrument that you wear like
a guitar but play like a piano? Objects can be de-
scribed with both gestures and words about appear-
ance and function. Such knowledge could begin
to tackle physical metaphors that current NLP sys-
tems struggle with.
Personally charged language
How should a
dialogue agent learn what is hurtful to a specific
person? To someone who is sensitive about their
grades because they had a period of struggle in
school, the sentiment of “Don’t be a fool!” can be
hurtful, while for others it may seem playful. Social
knowledge is requisite for realistic understanding
of sentiment in situated human contexts.
Relevant recent work
The move from WS2 to
WS3 requires rethinking existing tasks and investi-
gating where their semantics can be expanded and
grounded. This idea is not new (Chen and Mooney,
2008; Feng and Lapata, 2010; Bruni et al., 2014;
Lazaridou et al., 2016) and has accelerated in the
last few years. Elliott et al. (2016) reframe ma-
chine translation with visual observations, a trend
extended into videos (Wang et al., 2019b). Regneri
et al. (2013) introduce a foundational dataset align-
ing text descriptions and semantic annotations of
actions with videos. Vision can even inform core
tasks like syntax (Shi et al., 2019) and language
modeling (Ororbia et al., 2019). Careful design is
key, as visually augmented tasks can fail to require
sensory perception (Thomason et al., 2019a).
Language-guided, embodied agents invoke many
of the challenges of WS4. Language-based nav-
igation (Anderson et al., 2018) and task comple-
tion (Shridhar et al., 2020) in simulation environ-
ments ground language to actions, but even com-
plex simulation action spaces can be discretized
and enumerated. Real world, language-guided
robots for task completion (Tellex et al., 2014) and
learning (She et al., 2014) face challenging, con-
tinuous perception and control (Tellex et al., 2020).
Consequently, research in this space is often re-
stricted to small grammars (Paul et al., 2018; Walter
et al., 2013) or controlled dialog responses (Thoma-
son et al., 2020). These efforts to translate language
instructions to actions build towards using language
for end-to-end, continuous control (WS4).
Collaborative games have long served as a
testbed for studying language (Werner and Dyer,
1991) and emergent communication (Schlangen,
2019a; Lazaridou et al., 2018; Chaabouni et al.,
2020). Suhr et al. (2019a) introduced an environ-
ment for evaluating language understanding in the
service of a shared goal, and Andreas and Klein
(2016) use a visual paradigm for studying pragmat-
ics. Such efforts help us examine how inductive
biases and environmental pressures build towards
socialization (WS5), even if full social context is
still too difficult and expensive to be practical.
Most of these works provide resources such as
data, code, simulators and methodology for evaluat-
ing the multimodal content of linguistic representa-
tions (Schlangen, 2019b; Silberer and Lapata, 2014;
Bruni et al., 2012). Moving forward, we encourage
a broad re-examination of how NLP frames the rela-
tionship between meaning and context (Bender and
Koller, 2020) and how pretraining obfuscates our
ability to measure generalization (Linzen, 2020).
7 Conclusions
Our World Scopes are steep steps. WS5 implies a
persistent agent experiencing time and a personal-
ized set of experiences. With few exceptions (Carl-
son et al., 2010), machine learning models have
been confined to IID datasets that lack the structure
in time from which humans draw correlations about
long-range causal dependencies. What if a machine
were allowed to participate consistently? This is dif-
ficult to test under current evaluation paradigms for
generalization. Yet, this is the structure of gener-
alization in human development: drawing analo-
gies to episodic memories and gathering new data
through non-independent experiments.
As with many who have analyzed the history
of NLP, its trends (Church, 2007), its maturation
toward a science (Steedman, 2008), and its major
challenges (Hirschberg and Manning, 2015; Mc-
Clelland et al., 2019), we hope to provide momen-
tum for a direction many are already heading. We
call for and embrace the incremental, but purpose-
ful, contextualization of language in human expe-
rience. With all that we have learned about what
words can tell us and what they keep implicit, now
is the time to ask: What tasks, representations, and
inductive biases will fill the gaps?
Computer vision and speech recognition are ma-
ture enough for investigation of broader linguistic
contexts (WS3). The robotics industry is rapidly
developing commodity hardware and sophisticated
software that both facilitate new research and ex-
pect to incorporate language technologies (WS4).
Simulators and videogames provide potential envi-
ronments for social language learners (WS5). Our
call to action is to encourage the community to lean
in to trends prioritizing grounding and agency, and
explicitly aim to broaden the corresponding World
Scopes available to our models.
Acknowledgements
Thanks to Raymond Mooney for suggestions, Paul
Smolensky for disagreements, Catriona Silvey for
developmental psychology help, and to a superset
of: Emily Bender, Ryan Cotterell, Jesse Dunietz,
Edward Grefenstette, Dirk Hovy, Casey Kenning-
ton, Ajay Divakaran, David Schlangen, Diyi Yang,
and Semih Yagcioglu for pointers and suggestions.
References
Daniel Adiwardana, Minh-Thang Luong, David R So,
Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang,
Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,
et al. 2020. Towards a human-like open-domain
chatbot. arXiv preprint arXiv:2001.09977.
Peter Anderson, Xiaodong He, Chris Buehler, Damien
Teney, Mark Johnson, Stephen Gould, and Lei
Zhang. 2017. Bottom-up and top-down attention for
image captioning and visual question answering. Vi-
sual Question Answering Challenge at CVPR 2017.
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce,
Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen
Gould, and Anton van den Hengel. 2018. Vision-
and-Language Navigation: Interpreting visually-
grounded navigation instructions in real environ-
ments. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Jacob Andreas and Dan Klein. 2016. Reasoning about
pragmatics with neural listeners and speakers. In
Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pages
1173–1182, Austin, Texas.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-
garet Mitchell, Dhruv Batra, C Lawrence Zitnick,
and Devi Parikh. 2015. VQA: Visual question an-
swering. In Proceedings of the IEEE international
conference on computer vision, pages 2425–2433.
John Langshaw Austin. 1975. How to do things with
words. Oxford university press.
Philip Bachman, R Devon Hjelm, and William Buch-
walter. 2019. Learning representations by maximiz-
ing mutual information across views. In Advances
in Neural Information Processing Systems 32.
Anton Bakhtin, Laurens van der Maaten, Justin John-
son, Laura Gustafson, and Ross Girshick. 2019.
Phyre: A new benchmark for physical reasoning. In
Advances in Neural Information Processing Systems
32 (NIPS 2019).
Dare A. Baldwin, Ellen M. Markman, Brigitte Bill, Re-
nee N. Desjardins, Jane M. Irwin, and Glynnis Tid-
ball. 1996. Infants’ reliance on a social criterion for
establishing word-object relations. Child Develop-
ment, 67(6):3135–3153.
H.B. Barlow. 1989. Unsupervised learning. Neural
Computation, 1(3):295–311.
Marco Baroni, Silvia Bernardini, Adriano Ferraresi,
and Eros Zanchetta. 2009. The wacky wide web: a
collection of very large linguistically processed web-
crawled corpora. Language resources and evalua-
tion, 43(3):209–226.
Rachel Barr. 2013. Memory constraints on infant learn-
ing from picture books, television, and touchscreens.
Child Development Perspectives, 7(4):205–210.
Lawrence W Barsalou. 2008. Grounded cognition.
Annu. Rev. Psychol., 59:617–645.
Emily M Bender and Alexander Koller. 2020. Climb-
ing towards NLU: On meaning, form, and understand-
ing in the age of data. In Association for Computa-
tional Linguistics (ACL).
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and
Christian Jauvin. 2003. A neural probabilistic lan-
guage model. Journal of Machine Learning Re-
search, 3:1137–1155.
Leon Bergen, Roger Levy, and Noah Goodman. 2016.
Pragmatic reasoning through semantic inference.
Semantics and Pragmatics, 9.
Yonatan Bisk, Kevin Shih, Yejin Choi, and Daniel
Marcu. 2018. Learning Interpretable Spatial Oper-
ations in a Rich 3D Blocks World . In Proceedings
of the Thirty-Second Conference on Artificial Intelli-
gence (AAAI-18).
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jian-
feng Gao, and Yejin Choi. 2020. PIQA: Reasoning
about physical commonsense in natural language. In
Thirty-Fourth AAAI Conference on Artificial Intelli-
gence.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
2003. Latent dirichlet allocation. Journal of Ma-
chine Learning Research, 3:993–1022.
Paul Bloom. 2002. How children learn the meanings
of words. MIT press.
Valts Blukis, Yannick Terme, Eyvind Niklasson,
Ross A. Knepper, and Yoav Artzi. 2019. Learning to
map natural language instructions to physical quad-
copter control using simulated flight. In 3rd Confer-
ence on Robot Learning (CoRL).
Peter F Brown, Peter V deSouza, Robert L Mercer, Vin-
cent J Della Pietra, and Jenifer C Lai. 1992. Class-
based n-gram models of natural language. Compu-
tational Linguistics, 18.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam Mc-
Candlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot learn-
ers. In preprint.
Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-
Khanh Tran. 2012. Distributional semantics in tech-
nicolor. In Proceedings of the 50th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 136–145, Jeju Is-
land, Korea.
Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014.
Multimodal distributional semantics. Journal of Ar-
tificial Intelligence Research, 49:1–47.
Alexandre Campeau-Lecours, Hugo Lamontagne, Si-
mon Latour, Philippe Fauteux, Véronique Maheu,
François Boucher, Charles Deguire, and Louis-
Joseph Caron L’Ecuyer. 2019. Kinova modular
robot arms for service robotics applications. In
Rapid Automation: Concepts, Methodologies, Tools,
and Applications, pages 693–719. IGI Global.
Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr
Settles, Estevam R Hruschka, and Tom M Mitchell.
2010. Toward an architecture for never-ending lan-
guage learning. In Twenty-Fourth AAAI Conference
on Artificial Intelligence.
Rahma Chaabouni, Eugene Kharitonov, Diane Boucha-
court, Emmanuel Dupoux, and Marco Baroni. 2020.
Compositionality and generalization in emergent
languages. In Association for Computational Lin-
guistics (ACL).
Joyce Y. Chai, Rui Fang, Changsong Liu, and Lanbo
She. 2017. Collaborative language grounding to-
ward situated human-robot dialogue. AI Magazine,
37(4):32–45.
Joyce Y. Chai, Qiaozi Gao, Lanbo She, Shaohua Yang,
Sari Saba-Sadiya, and Guangyue Xu. 2018. Lan-
guage to action: Towards interactive task learning
with physical agents. In Proceedings of the Twenty-
Seventh International Joint Conference on Artificial
Intelligence (IJCAI-18).
Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. 2017.
Predicting visual exemplars of unseen classes for
zero-shot learning. In ICCV.
Wei-Lun Chao, Soravit Changpinyo, Boqing Gong,
and Fei Sha. 2016. An empirical study and analysis
of generalized zero-shot learning for object recog-
nition in the wild. In ECCV, pages 52–68, Cham.
Springer International Publishing.
Eugene Charniak. 1977. A framed painting: The rep-
resentation of a common sense knowledge fragment.
Cognitive Science, 1(4):355–394.
Chun-Yen Chen, Dian Yu, Weiming Wen, Yi Mang
Yang, Jiaping Zhang, Mingyang Zhou, Kevin Jesse,
Austin Chau, Antara Bhowmick, Shreenath Iyer,
et al. 2018. Gunrock: Building a human-like social
bot by leveraging large scale real user data. Alexa
Prize Proceedings.
David L. Chen and Raymond J. Mooney. 2008. Learn-
ing to sportscast: A test of grounded language ac-
quisition. In Proceedings of the 25th International
Conference on Machine Learning (ICML), Helsinki,
Finland.
SF Chen and Joshua Goodman. 1996. An empirical
study of smoothing techniques for language model-
ing. In Association for Computational Linguistics,
pages 310–318.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau,
Salem Lahlou, Lucas Willems, Chitwan Saharia,
Thien Huu Nguyen, and Yoshua Bengio. 2019.
Babyai: First steps towards grounded language
learning with a human in the loop. In ICLR’2019.
Noam Chomsky. 1965. Aspects of the Theory of Syntax.
MIT Press.
Noam Chomsky. 1980. Language and learning: the de-
bate between Jean Piaget and Noam Chomsky. Har-
vard University Press.
Noam Chomsky. 1981. Lectures on Government and
Binding. Mouton de Gruyter.
Kenneth Church. 2007. A pendulum swung too far.
Linguistic Issues in Language Technology – LiLT, 2.
L. Coleman and P. Kay. 1981. The english word “lie".
Linguistics, 57.
Ronan Collobert and Jason Weston. 2008. A unified
architecture for natural language processing: deep
neural networks with multitask learning. In ICML.
Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben
Kybartas, Tavian Barnes, Emery Fine, James Moore,
Ruo Yu Tao, Matthew Hausknecht, Layla El Asri,
Mahmoud Adada, Wendy Tay, and Adam Trischler.
2018. Textworld: A learning environment for text-
based games. ArXiv, abs/1806.11532.
Erwin Coumans and Yunfei Bai. 2016–2019. Pybullet,
a python module for physics simulation for games,
robotics and machine learning. http://pybullet.org.
Abhishek Das, Samyak Datta, Georgia Gkioxari, Ste-
fan Lee, Devi Parikh, and Dhruv Batra. 2018. Em-
bodied question answering. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 2054–2063.
Scott Deerwester, Susan T. Dumais, George W. Furnas,
Thomas K. Landauer, and Richard Harshman. 1988.
Improving information retrieval with latent semantic
indexing. In Proceedings of the 51st Annual Meet-
ing of the American Society for Information Science
25, pages 36 – 40.
Scott Deerwester, Susan T. Dumais, George W. Fur-
nas, Thomas K. Landauer, and Richard Harshman.
1990. Indexing by latent semantic analysis. Jour-
nal of the American Society for Information Science,
41(6):391–407.
Gerald Dejong. 1981. Generalizations based on expla-
nations. In Proceedings of the 7th international joint
conference on Artificial intelligence (IJCAI).
David DeVault, Iris Oved, and Matthew Stone. 2006.
Societal grounding is essential to meaningful lan-
guage use. In Proceedings of the National Confer-
ence on Artificial Intelligence, volume 21, page 747.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
Deep Bidirectional Transformers for Language Un-
derstanding. In North American Chapter of the As-
sociation for Computational Linguistics (NAACL).
Emily Dinan, Samuel Humeau, Bharath Chintagunta,
and Jason Weston. 2019. Build it break it fix it for
dialogue safety: Robustness from adversarial human
attack. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
4529–4538.
Susan T. Dumais. 2004. Latent semantic analysis. An-
nual Review of Information Science and Technology,
38(1):188–230.
Robin IM Dunbar. 1993. Coevolution of neocortical
size, group size and language in humans. Behav-
ioral and brain sciences, 16(4):681–694.
Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran,
Tania Bedrax-Weiss, and Dan Roth. 2019. How
large are lions? inducing distributions over quanti-
tative attributes. In Proceedings of the 57th Annual
Meeting of the Association for Computational Lin-
guistics, pages 3973–3983.
Desmond Elliott, Stella Frank, Khalil Sima’an, and Lu-
cia Specia. 2016. Multi30k: Multilingual English-German image descriptions. In Workshop on Vision and Language at ACL ’16.
J Elman. 1990. Finding structure in time. Cognitive
Science, 14(2):179–211.
Katrin Erk and Sebastian Padó. 2008. A structured
vector space model for word meaning in context.
In Proceedings of the 2008 Conference on Empiri-
cal Methods in Natural Language Processing, pages
897–906, Honolulu, Hawaii.
Susan Ervin-Tripp. 1973. Some strategies for the first
two years. In Timothy E. Moore, editor, Cognitive
Development and Acquisition of Language, pages
261 – 286. Academic Press, San Diego.
Ali Farhadi, M Hejrati, M Sadeghi, Peter Young, Cyrus
Rashtchian, Julia Hockenmaier, and David Forsyth.
2010. Every picture tells a story: Generating sen-
tences from images. In European Conference on
Computer Vision. Springer.
Yansong Feng and Mirella Lapata. 2010. Topic models
for image annotation and text illustration. In Human
Language Technologies: The 2010 Annual Confer-
ence of the North American Chapter of the Associa-
tion for Computational Linguistics, pages 831–839,
Los Angeles, California.
J. R. Firth. 1957. A synopsis of linguistic theory, 1930-
1955. Studies in Linguistic Analysis.
Cliff Fitzgerald. 2013. Developing baxter. In 2013
IEEE Conference on Technologies for Practical
Robot Applications (TePRA).
W. Nelson Francis. 1964. A standard sample of
present-day english for use with digital computers.
Report to the U.S Office of Education on Coopera-
tive Research Project No. E-007.
Michael C Frank and Noah D Goodman. 2012. Pre-
dicting pragmatic reasoning in language games. Sci-
ence, 336(6084):998–998.
Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng
Wang, and Ting Liu. 2014. Learning semantic hier-
archies via word embeddings. In Proceedings of the
52nd Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
1199–1209.
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan
Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi,
Dheeru Dua, Yanai Elazar, Ananth Gottumukkala,
Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco,
Daniel Khashabi, Kevin Lin, Jiangming Liu, Nel-
son F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer
Singh, Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou.
2020. Evaluating NLP Models via Contrast Sets.
arXiv:2004.02709.
Max Glockner, Vered Shwartz, and Yoav Goldberg.
2018. Breaking NLI systems with sentences that re-
quire simple lexical inferences. In Proceedings of
the 56th Annual Meeting of the Association for Com-
putational Linguistics (Volume 2: Short Papers),
pages 650–655.
I J Good. 1953. The population frequencies of
species and the estimation of population parameters.
Biometrika, 40:237–264.
Herbert P Grice. 1975. Logic and conversation. In
Speech acts, pages 41–58. Brill.
Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and Noah A
Smith. 2018. Annotation artifacts in natural lan-
guage inference data. In Proceedings of the 2018
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, Volume 2 (Short Papers),
pages 107–112.
Stevan Harnad. 1990. The symbol grounding problem.
Physica D, 42:335–346.
Zellig S Harris. 1954. Distributional structure. Word,
10:146–162.
David Harwath, Wei-Ning Hsu, and James Glass. 2020.
Learning hierarchical discrete linguistic units from
visually-grounded speech. In ICLR 2020.
David Harwath, Adrià Recasens, Dídac Surís, Galen
Chuang, Antonio Torralba, and James Glass. 2019.
Jointly discovering visual objects and spoken words
from raw sensory input. International Journal of
Computer Vision.
He He, Derek Chen, Anusha Balakrishnan, and Percy
Liang. 2018. Decoupling strategy and generation in
negotiation dialogues. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 2333–2343.
Susan J. Hespos and Elizabeth S. Spelke. 2004. Con-
ceptual precursors to language. Nature, 430.
G. E. Hinton, J. L. McClelland, and D. E. Rumelhart.
1986. Distributed representations. Parallel Dis-
tributed Processing: Explorations in the Microstruc-
ture of Cognition, Volume 1: Foundations.
Geoffrey E. Hinton. 1990. Preface to the special issue
on connectionist symbol processing. Artificial Intel-
ligence, 46(1):1 – 4.
Julia Hirschberg and Christopher D Manning. 2015.
Advances in natural language processing. Science,
349(6245):261–266.
Douglas Hofstadter and Emmanuel Sander. 2013. Sur-
faces and essences: Analogy as the fuel and fire of
thinking. Basic Books.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
Po-Yao Huang, Junjie Hu, Xiaojun Chang, and Alexan-
der Hauptmann. 2020. Unsupervised multimodal
neural machine translation with pseudo visual piv-
oting. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 8226–8237, Online.
Gabriel Ilharco, Yuan Zhang, and Jason Baldridge.
2019. Large-scale representation learning from visu-
ally grounded untranscribed speech. In Proceedings
of the 23rd Conference on Computational Natural
Language Learning (CoNLL), pages 55–65, Hong
Kong, China.
Daniel L. James and Risto Miikkulainen. 1995. SARDNET: A self-organizing feature map for sequences. In Advances in Neural Information Processing Systems 7 (NIPS’94), pages 577–584, Denver, CO. Cambridge, MA: MIT Press.
Michael I Jordan. 2019. Artificial intelligence: The revolution hasn’t happened yet. Harvard Data Science Review.
Divyansh Kaushik, Eduard Hovy, and Zachary Lipton.
2020. Learning the difference that makes a differ-
ence with counterfactually-augmented data. In Inter-
national Conference on Learning Representations.
Nitish Shirish Keskar, Bryan McCann, Lav R
Varshney, Caiming Xiong, and Richard Socher.
2019. CTRL: A conditional transformer language
model for controllable generation. arXiv preprint
arXiv:1909.05858.
Reinhard Kneser and Hermann Ney. 1995. Improved
backing-off for m-gram language modeling. In Pro-
ceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing.
Teuvo Kohonen. 1984. Self-Organization and Associa-
tive Memory. Springer.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin John-
son, Kenji Hata, Joshua Kravitz, Stephanie Chen,
Yannis Kalantidis, Li-Jia Li, David A Shamma,
Michael S. Bernstein, and Fei-Fei Li. 2017. Vi-
sual genome: Connecting language and vision us-
ing crowdsourced dense image annotations. Interna-
tional Journal of Computer Vision, 123(1):32–73.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
George Lakoff. 1973. Hedges: A study in meaning
criteria and the logic of fuzzy concepts. Journal of
Philosophical Logic, 2:458–508.
George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Ronald W Langacker. 1987. Foundations of Cognitive Grammar: Theoretical Prerequisites, volume 1. Stanford University Press.
Ronald W Langacker. 1991. Foundations of Cognitive Grammar: Descriptive Application, volume 2. Stanford University Press.
Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations.
Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations.
Angeliki Lazaridou, Nghia The Pham, and Marco Ba-
roni. 2016. The red one!: On learning to refer to
things based on discriminative properties. In Pro-
ceedings of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2: Short
Papers), pages 213–218, Berlin, Germany.
Matthew Le, Y-Lan Boureau, and Maximilian Nickel.
2019. Revisiting the evaluation of theory of mind
through question answering. In Proceedings of the
2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 5871–5876, Hong Kong,
China.
Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? End-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2443–2453.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019a. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue
Xu, Mingyang Chen, Ze Ma, Shiyi Wang, Hao-Shu
Fang, and Cewu Lu. 2019b. HAKE: Human Activ-
ity Knowledge Engine. arXiv:1904.06539.
Ling-Yi Lin, Rong-Ju Cherng, and Yung-Jung Chen.
2017. Effect of touch screen tablet use on fine motor
development of young children. Physical & Occupa-
tional Therapy In Pediatrics, 37(5):457–467. PMID:
28071977.
Tal Linzen. 2020. How can we accelerate progress to-
wards human-like linguistic generalization? In As-
sociation for Computational Linguistics (ACL).
Changsong Liu and Joyce Yue Chai. 2015. Learning
to mediate perceptual differences in situated human-
robot dialogue. In Proceedings of the 29th AAAI
Conference on Artificial Intelligence, pages 2288–
2294.
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Nose-
worthy, Laurent Charlin, and Joelle Pineau. 2016.
How not to evaluate your dialogue system: An em-
pirical study of unsupervised evaluation metrics for
dialogue response generation. In Proceedings of the
2016 Conference on Empirical Methods in Natural
Language Processing, pages 2122–2132.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou
Tang. 2015. Deep learning face attributes in the wild.
In Proceedings of International Conference on Com-
puter Vision (ICCV).
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
Li Lucy and Jon Gauthier. 2017. Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages 76–85, Vancouver, Canada. Association for Computational Linguistics.
Matt MacMahon, Brian Stankiewicz, and Benjamin
Kuipers. 2006. Walk the talk: Connecting language,
knowledge, and action in route instructions. In Pro-
ceedings of the 21st National Conference on Artifi-
cial Intelligence (AAAI-2006), Boston, MA, USA.
Gary Marcus and Ernest Davis. 2019. Rebooting AI:
Building Artificial Intelligence We Can Trust. Pan-
theon.
Mitchell P Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
Lara J Martin, Prithviraj Ammanabrolu, Xinyu Wang,
William Hancock, Shruti Singh, Brent Harrison, and
Mark O Riedl. 2018. Event representations for au-
tomated story generation with deep neural nets. In
Thirty-Second AAAI Conference on Artificial Intelli-
gence.
Cynthia Matuszek. 2018. Grounded language learning: Where robotics and NLP meet (early career spotlight). In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden.
Bryan McCann, James Bradbury, Caiming Xiong, and
Richard Socher. 2017. Learned in translation: Con-
textualized word vectors. In Advances in Neural In-
formation Processing Systems, pages 6297–6308.
James L. McClelland, Felix Hill, Maja Rudolph, Ja-
son Baldridge, and Hinrich Schütze. 2019. Ex-
tending Machine Language Models toward Human-
Level Language Understanding. arXiv:1912.05877.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-
rado, and Jeffrey Dean. 2013. Distributed represen-
tations of words and phrases and their composition-
ality. Advances in Neural Information Processing
Systems, 26.
Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Ya-
maguchi, Karl Stratos, Xufeng Han, Alyssa Men-
sch, Alexander C. Berg, Tamara L. Berg, and Hal
Daumé III. 2012. Midge: Generating image descrip-
tions from computer vision detections. In European
Chapter of the Association for Computational Lin-
guistics (EACL).
Tom M Mitchell. 1980. The need for biases in learning generalizations. Technical report, Department of Computer Science, Laboratory for Computer Science Research, Rutgers University.
Raymond J. Mooney. 2008. Learning to connect lan-
guage and perception. In Proceedings of the 23rd
AAAI Conference on Artificial Intelligence (AAAI),
pages 1598–1601, Chicago, IL. Senior Member Pa-
per.
Raymond J Mooney and Gerald DeJong. 1985. Learning schemata for natural language processing. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence (IJCAI-85).
Daniele Moro and Casey Kennington. 2018. Multi-
modal visual and simulated muscle activations for
grounded semantics of hand-related descriptions. In
Workshop on the Semantics and Pragmatics of Dia-
logue. SEMDIAL.
Roozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. 2016. “What happens if...” Learning to predict the effect of forces in images. In Computer Vision – ECCV 2016, pages 269–285, Cham. Springer International Publishing.
Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. 2019. PyRobot: An open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236.
Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison
Gopnik, and Tom Griffiths. 2018. Evaluating theory
of mind in question answering. In Proceedings of
the 2018 Conference on Empirical Methods in Natu-
ral Language Processing, pages 2392–2400.
NVIDIA. 2019. NVIDIA Isaac software develop-
ment kit. https://developer.nvidia.com/
isaac-sdk. Accessed 2019-12-09.
Elinor Ochs. 1993. Constructing social identity: A language socialization perspective. Research on Language and Social Interaction, 26(3):287–306.
William O’Grady. 2005. How Children Learn Lan-
guage. Cambridge University Press.
Veronica Ornaghi, Jens Brockmeier, and Ilaria Grazzani Gavazzi. 2011. The role of language games in children’s understanding of mental states: A training study. Journal of Cognition and Development, 12(2):239–259.
Alexander Ororbia, Ankur Mali, Matthew Kelly, and
David Reitter. 2019. Like a baby: Visually situated
neural language acquisition. In Proceedings of the
57th Annual Meeting of the Association for Com-
putational Linguistics, pages 5127–5136, Florence,
Italy.
Denis Paperno, Germán Kruszewski, Angeliki Lazari-
dou, Ngoc-Quan Pham, Raffaella Bernardi, San-
dro Pezzelle, Marco Baroni, Gemma Boleda, and
Raquel Fernández. 2016. The LAMBADA dataset:
Word prediction requiring a broad discourse context.
In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers), pages 1525–1534.
Rohan Paul, Jacob Arkin, Derya Aksaray, Nicholas
Roy, and Thomas M Howard. 2018. Efficient
grounding of abstract spatial concepts for nat-
ural language interaction with robot platforms.
The International Journal of Robotics Research,
37(10):1269–1299.
Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015.
Solving hard coreference problems. In Proceedings
of the 2015 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 809–819.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.
Don Perlis. 2016. Five dimensions of reasoning in the
wild. In Association for the Advancement of Artifi-
cial Intelligence (AAAI).
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word repre-
sentations. In North American Chapter of the Asso-
ciation for Computational Linguistics (NAACL).
John R Pierce. 1969. Whither speech recognition? The Journal of the Acoustical Society of America, 46(4B):1049–1051.
Nicolas Pinto, David Doukhan, James J DiCarlo, and
David D Cox. 2009. A high-throughput screening
approach to discovering good forms of biologically
inspired visual representation. PLoS computational
biology, 5(11):e1000579.
Jordan B. Pollack. 1987. On Connectionist Models of
Natural Language Processing. Ph.D. thesis, Univer-
sity of Illinois.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI technical report.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.
Know what you don’t know: Unanswerable ques-
tions for SQuAD. In Proceedings of the 56th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 2: Short Papers), pages 784–789.
Hannah Rashkin, Eric Michael Smith, Margaret Li, and
Y-Lan Boureau. 2019. Towards empathetic open-
domain conversation models: A new benchmark and
dataset. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 5370–5381, Florence, Italy.
Michaela Regneri, Marcus Rohrbach, Dominikus Wet-
zel, Stefan Thater, Bernt Schiele, and Manfred
Pinkal. 2013. Grounding action descriptions in
videos. Transactions of the Association for Compu-
tational Linguistics (TACL), 1:25–36.
Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot. arXiv preprint.
Stephanie Rosenthal, Joydeep Biswas, and Manuela Veloso. 2010. An effective personal mobile robot agent through symbiotic human-robot interaction. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, pages 915–922. International Foundation for Autonomous Agents and Multiagent Systems.
Corby Rosset. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Research Blog.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause,
Sanjeev Satheesh, Sean Ma, Zhiheng Huang, An-
drej Karpathy, Aditya Khosla, Michael Bernstein,
Alexander C. Berg, and Li Fei-Fei. 2015. Ima-
geNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV),
115(3):211–252.
Jacqueline Sachs, Barbara Bard, and Marie L Johnson.
1981. Language learning with restricted input: Case
studies of two hearing children of deaf parents. Ap-
plied Psycholinguistics, 2(1):33–54.
Ian Sample and Alex Hern. 2014. Scientists dispute whether computer ‘Eugene Goostman’ passed Turing test. The Guardian, 9.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan
Le Bras, and Yejin Choi. 2019. Social IQa: Com-
monsense reasoning about social interactions. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 4462–
4472, Hong Kong, China.
Rosario Scalise, Jesse Thomason, Yonatan Bisk, and
Siddhartha Srinivasa. 2019. Improving robot suc-
cess detection using static object data. In Proceed-
ings of the 2019 IEEE/RSJ International Conference
on Intelligent Robots and Systems.
Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. L. Erlbaum, Hillsdale, NJ.
Marten van Schijndel, Aaron Mueller, and Tal
Linzen. 2019. Quantity doesn’t buy quality syn-
tax with neural language models. arXiv preprint
arXiv:1909.00111.
David Schlangen. 2019a. Grounded agreement games: Emphasizing conversational grounding in visual dialogue settings. arXiv preprint.
David Schlangen. 2019b. Language tasks and language games: On methodology in current natural language processing research. arXiv preprint.
Iulian V. Serban, Chinnadhurai Sankar, Mathieu Ger-
main, Saizheng Zhang, Zhouhan Lin, Sandeep Sub-
ramanian, Taesup Kim, Michael Pieper, Sarath
Chandar, Nan Rosemary Ke, Sai Rajeshwar, Alexan-
dre de Brebisson, Jose M. R. Sotelo, Dendi
Suhubdy, Vincent Michalski, Alexandre Nguyen,
Joelle Pineau, and Yoshua Bengio. 2017. A deep
reinforcement learning chatbot. arXiv preprint
arXiv:1709.02349.
Piyush Sharma, Nan Ding, Sebastian Goodman, and
Radu Soricut. 2018. Conceptual captions: A
cleaned, hypernymed, image alt-text dataset for au-
tomatic image captioning. In Proceedings of the
56th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
2556–2565, Melbourne, Australia.
Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Y. Chai, and Ning Xi. 2014. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the 15th SIGDIAL Meeting on Discourse and Dialogue.
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen
Livescu. 2019. Visually grounded neural syntax ac-
quisition. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 1842–1861, Florence, Italy.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053.
Mohit Shridhar, Jesse Thomason, Daniel Gordon,
Yonatan Bisk, Winson Han, Roozbeh Mottaghi,
Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED:
A benchmark for interpreting grounded instructions
for everyday tasks. Computer Vision and Pattern
Recognition (CVPR).
Carina Silberer and Mirella Lapata. 2014. Learn-
ing grounded meaning representations with autoen-
coders. In Proceedings of the 52nd Annual Meet-
ing of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 721–732, Balti-
more, Maryland.
Jivko Sinapov, Connor Schenck, and Alexander
Stoytchev. 2014. Learning relational object cate-
gories using behavioral exploration and multimodal
perception. In IEEE International Conference on
Robotics and Automation.
Linda Smith and Michael Gasser. 2005. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29.
Paul Smolensky. 1990. Tensor product variable bind-
ing and the representation of symbolic structures
in connectionist systems. Artificial Intelligence,
46:159–216.
Richard Socher, Brody Huval, Christopher Manning,
and Andrew Ng. 2012. Semantic compositional-
ity through recursive matrix-vector spaces. In Em-
pirical Methods in Natural Language Processing
(EMNLP).
Tejas Srinivasan, Ramon Sanabria, and Florian Metze. 2020. Looking enhances listening: Recovering missing speech using images. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6304–6308.
Mark Steedman. 2008. Last words: On becoming a dis-
cipline. Computational Linguistics, 34(1):137–144.
Greg J Stephens, Lauren J Silbert, and Uri Hasson.
2010. Speaker–listener neural coupling underlies
successful communication. Proceedings of the Na-
tional Academy of Sciences, 107(32):14425–14430.
Alane Suhr, Claudia Yan, Jack Schluger, Stanley Yu,
Hadi Khader, Marwa Mouallem, Iris Zhang, and
Yoav Artzi. 2019a. Executing instructions in situ-
ated collaborative interactions. In Proceedings of
the 2019 Conference on Empirical Methods in Nat-
ural Language Processing and the 9th International
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2119–2130, Hong Kong,
China.
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang,
Huajun Bai, and Yoav Artzi. 2019b. A corpus for
reasoning about natural language grounded in pho-
tographs. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 6418–6428, Florence, Italy.
Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive bidirectional transformer for temporal representation learning. arXiv:1906.05743.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A joint model for video and language representation learning. In International Conference on Computer Vision.
Yee-Whye Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney, Australia.
Stefanie Tellex, Nakul Gopalan, Hadas Kress-Gazit,
and Cynthia Matuszek. 2020. Robots that use lan-
guage. The Annual Review of Control, Robotics, and
Autonomous Systems, 15.
Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus,
and Nicholas Roy. 2014. Asking for help using in-
verse semantics. In Proceedings of Robotics: Sci-
ence and Systems (RSS), Berkeley, California.
Stefanie Tellex, Thomas Kollar, Steven Dickerson,
Matthew R Walter, Ashis Gopal Banerjee, Seth
Teller, and Nicholas Roy. 2011. Understanding nat-
ural language commands for robotic navigation and
mobile manipulation. In Proceedings of the Na-
tional Conference on Artificial Intelligence.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019.
BERT rediscovers the classical NLP pipeline. In
Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 4593–
4601, Florence, Italy.
Esther Thelen and Linda B. Smith. 1996. A Dynamic
Systems Approach to the Development of Cognition
and Action. MIT Press.
Jesse Thomason, Daniel Gordon, and Yonatan Bisk.
2019a. Shifting the baseline: Single modality perfor-
mance on visual navigation & QA. In North Amer-
ican Chapter of the Association for Computational
Linguistics (NAACL).
Jesse Thomason, Michael Murray, Maya Cakmak, and
Luke Zettlemoyer. 2019b. Vision-and-dialog navi-
gation. In Conference on Robot Learning (CoRL).
Jesse Thomason, Aishwarya Padmakumar, Jivko
Sinapov, Justin Hart, Peter Stone, and Raymond J.
Mooney. 2017. Opportunistic active learning for
grounding natural language descriptions. In Pro-
ceedings of the 1st Annual Conference on Robot
Learning (CoRL).
Jesse Thomason, Aishwarya Padmakumar, Jivko
Sinapov, Nick Walker, Yuqian Jiang, Harel Yedid-
sion, Justin Hart, Peter Stone, and Raymond J.
Mooney. 2020. Jointly improving parsing and per-
ception for natural language commands through
human-robot dialog. The Journal of Artificial Intel-
ligence Research (JAIR), 67.
Jesse Thomason, Jivko Sinapov, Maxwell Svetlik, Pe-
ter Stone, and Raymond J. Mooney. 2016. Learning
multi-modal grounded linguistic semantics by play-
ing “I spy”. In International Joint Conference on
Artificial Intelligence (IJCAI).
Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE.
Michael Tomasello. 2009. Constructing a Language. Harvard University Press.
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang,
J. Zico Kolter, Louis-Philippe Morency, and Rus-
lan Salakhutdinov. 2019. Multimodal transformer
for unaligned multimodal language sequences. In
Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 6558–
6569, Florence, Italy.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio.
2010. Word representations: A simple and general
method for semi-supervised learning. In Proceed-
ings of the 48th Annual Meeting of the Association
for Computational Linguistics, pages 384–394.
Alan M Turing. 1950. Computing machinery and intelligence. Mind, 59(236):433–460.
Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Gabriella Vigliocco, Pamela Perniss, and David Vinson. 2014. Language as a multimodal phenomenon: Implications for language learning, processing and evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 369.
Harm de Vries, Dzmitry Bahdanau, and Christopher Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint.
Matthew Walter, Sachithra Hemachandra, Bianca
Homberg, Stefanie Tellex, and Seth Teller. 2013.
Learning semantic maps from natural language de-
scriptions. In Proceedings of Robotics: Science and
Systems (RSS), Berlin, Germany.
Alex Wang, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel R Bowman. 2019a.
GLUE: A multi-task benchmark and analysis plat-
form for natural language understanding. In Inter-
national Conference on Learning Representations.
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019b. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In The IEEE International Conference on Computer Vision (ICCV).
Ronald Wardhaugh. 2011. An Introduction to Sociolinguistics, volume 28. John Wiley & Sons.
Joseph Weizenbaum. 1966. ELIZA: A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
Lloyd R Welch. 2003. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53(4):1–24.
Gregory M Werner and Michael G Dyer. 1991. Evolu-
tion of communication in artificial organisms. ALife.
Terry Winograd. 1971. Procedures as a representation
for data in a computer program for understanding
natural language. Technical report, Massachusetts
Institute of Technology, Project MAC.
Ludwig Wittgenstein. 1953. Philosophical Investiga-
tions. Macmillan.
Ludwig Wittgenstein. 1958. The Blue and Brown Books. Basil Blackwell.
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia,
Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao
Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang,
Leonidas J. Guibas, and Hao Su. 2020. SAPIEN:
A simulated part-based interactive environment. In
Computer Vision and Pattern Recognition (CVPR).
Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Na-
zli Ikizler-Cinbis. 2018. RecipeQA: A challenge
dataset for multimodal comprehension of cooking
recipes. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 1358–1368, Brussels, Belgium.
Diyi Yang, Jiaao Chen, Zichao Yang, Dan Jurafsky, and
Eduard Hovy. 2019a. Let’s make your request more
persuasive: Modeling persuasive strategies via semi-
supervised neural nets on crowdfunding platforms.
In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers).
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi.
2016. Situation recognition: Visual semantic role
labeling for image understanding. In Conference on
Computer Vision and Pattern Recognition.
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. 2019. Social-IQ: A question answering benchmark for artificial social intelligence. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin
Choi. 2019a. From recognition to cognition: Vi-
sual commonsense reasoning. In The IEEE Confer-
ence on Computer Vision and Pattern Recognition
(CVPR).
Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui
Qin, Ali Farhadi, and Yejin Choi. 2020. Evaluating
machines by their real-world language use. arXiv
preprint arXiv:2004.03607.
Rowan Zellers, Ari Holtzman, Hannah Rashkin,
Yonatan Bisk, Ali Farhadi, Franziska Roesner, and
Yejin Choi. 2019b. Defending against neural fake
news. In Thirty-third Conference on Neural Infor-
mation Processing Systems.
Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2019. Unified vision-language pre-training for image captioning and VQA. In Thirty-Fourth AAAI Conference on Artificial Intelligence.