Physical Interaction: Question Answering

PIQA was designed to investigate the physical knowledge of existing models. To what extent are current approaches actually learning about the world?

Submitting to the leaderboard

Submission is simple. Please email your predictions.

To: ybisk--_--cs.cmu.edu

Subject: [PIQA Leaderboard Submission]

Body:

A predictions .lst file (one prediction per line)
A name for your model
Your team name (including your affiliation)
Optionally: a GitHub repo or paper link.
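As a rough illustration of the predictions file described above, the following Python sketch writes one prediction per line. It assumes each prediction is the integer 0 or 1 (the index of the chosen solution); that encoding, the function name write_predictions, and the file name predictions.lst are assumptions for illustration, not requirements stated on this page.

```python
from pathlib import Path


def write_predictions(predictions, path):
    """Write one prediction per line to `path`.

    Assumes each prediction is 0 or 1 (index of the chosen solution);
    this label encoding is an assumption, not confirmed by the page.
    """
    lines = []
    for i, p in enumerate(predictions):
        if p not in (0, 1):
            raise ValueError(f"prediction {i} is {p!r}, expected 0 or 1")
        lines.append(f"{int(p)}\n")
    Path(path).write_text("".join(lines), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical predictions for three test examples.
    write_predictions([0, 1, 1], "predictions.lst")
```

Validating each value before writing keeps a malformed file from being emailed and using up the once-every-7-days submission.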

I'll try to get back to you within a few days, usually sooner. Each team may submit results for a given model only once every 7 days. Additionally, we reserve the right not to score any of your submissions if you cheat, for instance by using fake names or email addresses to make multiple submissions.

Citation


          @inproceedings{Bisk2020,
            author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
            title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
            booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
            year = {2020},
          }

License

Academic Free License ("AFL") v. 3.0

Questions?

Please email me

PIQA Leaderboard

Physical IQA is a binary-choice task, often better viewed as a set of two (Goal, Solution) pairs.

Goal: To separate egg whites from the yolk using a water bottle, you should ...
Solution 1: Squeeze the water bottle and press it against the yolk. Release, which creates suction and lifts the yolk.
Solution 2: Place the water bottle and press it against the yolk. Keep pushing, which creates suction and lifts the yolk.
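One way to read the "two (Goal, Solution) pairs" view is to score each pair independently and pick the higher-scoring one. The Python sketch below illustrates that idea only. The function choose_solution and the score callable are hypothetical placeholders (any model that maps a goal and a solution to a plausibility score could be plugged in), and returning 0 or 1 as the index of the chosen solution is an assumption.

```python
from typing import Callable


def choose_solution(
    goal: str,
    solution_1: str,
    solution_2: str,
    score: Callable[[str, str], float],
) -> int:
    """Score both (goal, solution) pairs and return the index of the better one.

    Returns 0 for solution_1 and 1 for solution_2 (ties go to solution_1).
    `score` is a hypothetical stand-in for a model's plausibility score.
    """
    score_1 = score(goal, solution_1)
    score_2 = score(goal, solution_2)
    return 0 if score_1 >= score_2 else 1
```

Scoring the pairs separately means the same scoring function can be reused unchanged whether a model is fine-tuned or applied zero-shot.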

Evaluation is simple accuracy over this binary task.
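For reference, accuracy here is the percentage of examples where the predicted solution matches the gold label. A minimal Python sketch follows, assuming predictions and labels are equal-length lists of 0/1 values; the function name and the label encoding are assumptions, and this is not the official scoring script.

```python
def accuracy(predictions, labels):
    """Return the percentage of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # Three of four hypothetical predictions are correct.
    print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 75.0
```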

Rank	Model	Team / Details	Accuracy (%)
	Human Performance (Bisk et al. '20)		94.9
1	DeBERTa-xxlarge	Alibaba Group ICBU Tech	83.5
2	GPT-3	OpenAI	82.8*
3	Anonymous	Anonymous	79.0
4	RoBERTa-Large	Baseline: lr 1e-5, 8 ep, 4/batch/GPU (4x V100), max seq 150	77.1
5	Zero-shot GPT-XL self-talk with GPT-medium	AI2	69.5
6	OpenAI GPT	Baseline	69.2
7	Zero-shot GPT-XL with COMET	AI2	68.4
8	BERT-Large	Baseline: lr 1e-5, 8 ep, 6/batch/GPU (4x V100), max seq 150	66.8
9	Zero-shot GPT-XL	AI2	63.4
10	Majority Class		50.4
	Random Performance		50.0