Physical Interaction: Question Answering
PIQA was designed to investigate the physical knowledge of existing models. To what extent are current approaches actually learning about the world?
Submitting to the leaderboard
Submission is simple. Please email your predictions.
- To: ybisk--_--cs.cmu.edu
- Subject: [PIQA Leaderboard Submission]
- Body: your predictions
I'll try to get back to you within a few days, usually sooner. Teams may only submit results from a given model once every 7 days. We also reserve the right not to score submissions that attempt to game the system -- for instance, fake names or email addresses, or multiple submissions under those names.
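The exact prediction-file format is not spelled out on this page, so the snippet below is only a minimal sketch: it assumes the test set is a JSON-lines file (one example per line) and that predictions are written as one 0/1 label per line, in the same order as the test examples. Check with the organizers if you are unsure.

```python
# Minimal sketch of writing a prediction file for an email submission.
# Assumptions (not from this page): the test set is a JSON-lines file with
# one example per line, and predictions are one 0/1 label per line in the
# same order as the test examples.
def write_predictions(test_path: str, labels: list[int], out_path: str = "predictions.lst") -> None:
    with open(test_path) as f:
        n_examples = sum(1 for line in f if line.strip())
    if len(labels) != n_examples:
        raise ValueError("need exactly one prediction per test example")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("predictions must be 0 (solution 1) or 1 (solution 2)")
    with open(out_path, "w") as f:
        f.writelines(f"{label}\n" for label in labels)
```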
Citation
@inproceedings{Bisk2020,
  author    = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
  title     = {PIQA: Reasoning about Physical Commonsense in Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year      = {2020},
}
License
Academic Free License ("AFL") v. 3.0
Questions?
Please email me
PIQA Leaderboard
Physical IQA is a binary choice task, often better viewed as a set of two (Goal, Solution) pairs:
- Goal To separate egg whites from the yolk using a water bottle, you should ...
- Solution 1 Squeeze the water bottle and press it against the yolk. Release, which creates suction and lifts the yolk.
- Solution 2 Place the water bottle and press it against the yolk. Keep pushing, which creates suction and lifts the yolk.
Evaluation is simple accuracy over this binary task.
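As a concrete illustration of the metric, here is a minimal accuracy computation. The 0/1 label convention (0 = Solution 1, 1 = Solution 2) is an assumption for illustration, not something specified on this page.

```python
# Minimal sketch of the evaluation: plain accuracy over binary predictions.
# Label convention (0 = Solution 1, 1 = Solution 2) is assumed, not specified here.
def accuracy(predictions: list[int], gold: list[int]) -> float:
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# e.g. accuracy([0, 1, 1, 0], [0, 1, 0, 0]) == 0.75
```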
Rank | Model | Accuracy (%)
---|---|---
 | Human Performance (Bisk et al. '20) | 94.9
1 | DeBERTa-xxlarge (Alibaba Group ICBU Tech) | 83.5
2 | GPT-3 (OpenAI) | 82.8*
3 | Anonymous | 79.0
4 | RoBERTa-Large (Baseline) | 77.1
5 | Zero-shot GPT-XL self-talk with GPT-medium (AI2) | 69.5
6 | OpenAI GPT (Baseline) | 69.2
7 | Zero-shot GPT-XL with COMET (AI2) | 68.4
8 | BERT-Large (Baseline) | 66.8
9 | Zero-shot GPT-XL (AI2) | 63.4
10 | Majority Class | 50.4
 | Random Performance | 50.0