Physical Interaction: Question Answering

PIQA was designed to investigate the physical knowledge of existing models. To what extent are current approaches actually learning about the world?



Submitting to the leaderboard

Submission is simple. Please email your predictions.

To: ybisk--_--cs.cmu.edu

Subject: [PIQA Leaderboard Submission]

Body:

  1. A predictions .lst file (one prediction per line; see the sketch after this list)
  2. A name for your model
  3. Your team name (including your affiliation)
  4. Optionally: a GitHub repo or paper link.

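For reference, here is a minimal sketch of writing the predictions file, assuming your model's outputs are already collected as a list of 0/1 choices in the same order as the test set (the variable and file names are placeholders, not requirements):

    # Write one binary prediction (0 or 1) per line, in test-set order.
    # `predictions` stands in for your model's outputs.
    predictions = [0, 1, 1, 0]  # placeholder outputs

    with open("predictions.lst", "w") as f:
        for choice in predictions:
            f.write(f"{choice}\n")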

I'll try to get back to you within a few days, usually sooner. Teams may only submit results from a given model once every 7 days. Additionally, we reserve the right not to score any of your submissions if you cheat -- for instance, by using fake names or email addresses and submitting multiple times under those names.


Citation

@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and
            Ronan Le Bras and Jianfeng Gao
            and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in
           Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on
               Artificial Intelligence},
  year = {2020},
}

License

Academic Free License ("AFL") v. 3.0


Questions?

Please email me at the address above.

PIQA Leaderboard

Physical IQA is a binary-choice task, often better viewed as a set of two (Goal, Solution) pairs. For example:

  • Goal To separate egg whites from the yolk using a water bottle, you should ...
  • Solution 1 Squeeze the water bottle and press it against the yolk. Release, which creates suction and lifts the yolk.
  • Solution 2 Place the water bottle and press it against the yolk. Keep pushing, which creates suction and lifts the yolk.
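
Since each example can be viewed as two (Goal, Solution) pairs, one common setup scores each pair independently and predicts the higher-scoring solution. A minimal sketch follows; score_pair is a hypothetical stand-in for whatever model you use, and the field names (goal, sol1, sol2) are assumed to match the dataset's JSONL files:

    # Score each (goal, solution) pair and predict whichever solution scores higher.
    # `score_pair` is a hypothetical helper standing in for your model
    # (e.g. a fine-tuned sequence classifier returning a real-valued score).
    def predict(example, score_pair):
        pairs = [(example["goal"], example["sol1"]),
                 (example["goal"], example["sol2"])]
        scores = [score_pair(goal, solution) for goal, solution in pairs]
        return 0 if scores[0] >= scores[1] else 1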

Evaluation is simple accuracy over this binary choice.
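
Concretely, that means comparing the predicted choice (0 or 1) against the gold label for each example. A minimal sketch, where the file names are assumptions:

    # Simple accuracy: fraction of examples whose predicted choice matches the gold label.
    # Both files contain one 0/1 label per line.
    def load_labels(path):
        with open(path) as f:
            return [int(line.strip()) for line in f]

    preds = load_labels("predictions.lst")
    gold = load_labels("valid-labels.lst")
    accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    print(f"accuracy = {accuracy:.3f}")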

Rank  Model                                        Team                     Accuracy
----  -------------------------------------------  -----------------------  --------
  --  Human Performance (Bisk et al. '20)                                    94.9
   1  DeBERTa-xxlarge                              Alibaba Group ICBU Tech   83.5
   2  GPT-3                                        OpenAI                    82.8*
   3  Anonymous                                    Anonymous                 79.0
   4  RoBERTa-Large                                Baseline                  77.1
   5  Zero-shot GPT-XL self-talk with GPT-medium   AI2                       69.5
   6  OpenAI GPT                                   Baseline                  69.2
   7  Zero-shot GPT-XL with COMET                  AI2                       68.4
   8  BERT-Large                                   Baseline                  66.8
   9  Zero-shot GPT-XL                             AI2                       63.4
  10  Majority Class                                                         50.4
  --  Random Performance                                                     50.0

Baseline hyperparameters:

  • RoBERTa-Large: lr 1e-5, 8 epochs, 4 per batch per GPU (4x V100), max seq length 150
  • BERT-Large: lr 1e-5, 8 epochs, 6 per batch per GPU (4x V100), max seq length 150