The project is currently in its early stages: the benchmark for measuring performance is still being written.
First, create a benchmark for multi-task learning and transfer learning. The benchmark should measure the improvement in learning that is directly attributable to knowledge transfer between games. It should also measure the performance of a single agent on multiple games. Finally, it should use cross-validation to mitigate the effects of having only a small sample of games.
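As a rough illustration of the cross-validation idea, the sketch below splits a list of games into folds so that every game is held out exactly once, and computes one possible transfer score: the mean relative improvement of a pre-trained agent over a from-scratch agent on the held-out games. The function names (`game_folds`, `transfer_gain`), the relative-improvement formula, and the example scores are all assumptions made for illustration, not the benchmark's actual definition.

```python
import random
from typing import Dict, List, Tuple


def game_folds(games: List[str], k: int, seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Split games into k (train, held_out) folds; each game is held out exactly once."""
    shuffled = games[:]
    random.Random(seed).shuffle(shuffled)
    folds = []
    for i in range(k):
        held_out = shuffled[i::k]  # every k-th game, starting at offset i
        train = [g for g in shuffled if g not in held_out]
        folds.append((train, held_out))
    return folds


def transfer_gain(scratch: Dict[str, float], pretrained: Dict[str, float],
                  eps: float = 1e-8) -> float:
    """Mean relative improvement on held-out games (hypothetical metric).

    scratch[g]    : score on game g when trained from scratch
    pretrained[g] : score on game g after pre-training on the other games
    """
    gains = [
        (pretrained[g] - scratch[g]) / max(abs(scratch[g]), eps)
        for g in scratch
    ]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    games = ["Pong", "Breakout", "Seaquest", "Asteroids", "Enduro", "Qbert"]
    for train, held_out in game_folds(games, k=3):
        print("train on", train, "-> evaluate transfer on", held_out)

    # Made-up scores, only to show how the metric is computed.
    print(transfer_gain({"Pong": 100.0, "Qbert": 200.0},
                        {"Pong": 150.0, "Qbert": 200.0}))
```

Holding each game out in turn means that no single lucky choice of target games can dominate the result, which is the point of cross-validating over a small pool of games.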
Second, design and implement deep reinforcement learning architectures that do well on the benchmark. For methodological reasons, we think it’s important to design the ideal benchmark before getting too attached to a particular architecture. It’s important that we’re sure the benchmark is measuring the crux of the transfer and multi-task problem rather than measuring something our architecture is good at.
Generalizing across tasks is a crucial component of human intelligence. Current deep RL architectures become less effective as the number of tasks they are trained on grows, whereas for humans, diversity of experience is a strength that improves performance on new tasks. Overcoming catastrophic forgetting and achieving one-shot learning are abilities that should fall out naturally if this task is solved convincingly.
At a more meta level, this problem is out of reach of current reinforcement learning architectures, yet it seems reasonably likely to come within reach in a year or two. Much like ImageNet spurred innovation by creating a common target for researchers to aim for, this project could provide a common idea of success for multi-task and transfer learning. Many papers that study multi-task and transfer learning on Atari do so in ad hoc ways, cherry-picking the games that give good results.
Success comes in degrees, since an architecture could (in principle) surpass human ability in multi-task Atari, both getting higher scores on all games and picking up new games faster than a human does. Ideally, a good waterline would be human-level performance on the benchmark, but creating a robust dataset of human performance is beyond the scope of this project.
The fundamental benchmark then will be two measures:
In addition to the scores, the benchmark will also make some strict demands on the architecture itself due to the testing/training regime:
There are currently no datasets, but the dataset being created at atarigrandchallenge.com may be a useful comparison once it’s available. Measuring human performance needs to be done with a large sample size, both to control for pre-training (some people have played Atari games, or other video games, before) and to control for individual skill levels (which could be seen as pre-training on non-Atari games, generalization from real life, natural ability, etc.).
The benchmark framework itself will play a role akin to a dataset. Since this is a reinforcement learning problem, the data comes from the testing environment rather than from a static dataset.
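To make the point about the environment providing the data concrete, here is a minimal sketch of an agent-environment loop. `ToyEnv` is a hypothetical stand-in for an Atari emulator, and its `reset`/`step` interface and reward rule are assumptions chosen for illustration; the actual benchmark framework may look quite different.

```python
from typing import Callable, Tuple


class ToyEnv:
    """Hypothetical stand-in for an Atari emulator (not a real game)."""

    def __init__(self, episode_length: int = 10) -> None:
        self.episode_length = episode_length
        self.t = 0

    def reset(self) -> int:
        """Start a new episode and return the first observation."""
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Advance one step; action 1 earns a reward of 1.0, anything else 0.0."""
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return self.t, reward, done


def run_episode(env: ToyEnv, policy: Callable[[int], int]) -> float:
    """Play one episode; every observation and reward is generated on the fly."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    print(run_episode(ToyEnv(episode_length=5), lambda obs: 1))
```

Nothing is read from a fixed file here: the score depends entirely on how the policy interacts with the environment, which is why the framework itself stands in for the dataset.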
Since the original paper by Mnih et al., the Atari 2600 environment has been a popular target for testing RL architectures.