# rl-experiments

**Repository Path**: gsj2021/rl-experiments

## Basic Information

- **Project Name**: rl-experiments
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-05-14
- **Last Updated**: 2021-09-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## RLlib Reference Results

Benchmarks of [RLlib](https://rllib.io) algorithms against published results. These benchmarks are a work in progress. For other results to compare against, see [yarlp](https://github.com/btaba/yarlp) and [more plots](https://github.com/openai/baselines-results/blob/master/acktr_ppo_acer_a2c_atari.ipynb) from OpenAI.

#### Ape-X Distributed Prioritized Experience Replay

`rllib train -f atari-apex/atari-apex.yaml`

Comparison of RLlib Ape-X to Async DQN after 10M time-steps (**40M frames**). Results compared to learning curves from [Mnih et al, 2016](https://arxiv.org/pdf/1602.01783.pdf) extracted at 10M time-steps from Figure 3.

|env|RLlib Ape-X 8-workers|Mnih et al Async DQN 16-workers|Mnih et al DQN 1-worker|
|---|---|---|---|
|BeamRider|6134|~6000|~3000|
|Breakout|123|~50|~10|
|QBert|15302|~1200|~500|
|SpaceInvaders|686|~600|~500|

Here we use only eight workers per environment in order to run all experiments concurrently on a single g3.16xl machine. Further speedups may be obtained by using more workers.

Comparing wall-time performance after 1 hour of training:

|env|RLlib Ape-X 8-workers|Mnih et al Async DQN 16-workers|Mnih et al DQN 1-worker|
|---|---|---|---|
|BeamRider|4873|~1000|~300|
|Breakout|77|~10|~1|
|QBert|4083|~500|~150|
|SpaceInvaders|646|~300|~160|

Ape-X plots:

![apex](/atari-apex/apex.png)

#### IMPALA and A2C

`rllib train -f atari-impala/atari-impala.yaml`

`rllib train -f atari-a2c/atari-a2c.yaml`

RLlib IMPALA and A2C after 10M time-steps (**40M frames**). Results compared to learning curves from [Mnih et al, 2016](https://arxiv.org/pdf/1602.01783.pdf) extracted at 10M time-steps from Figure 3.

|env|RLlib IMPALA 32-workers|RLlib A2C 5-workers|Mnih et al A3C 16-workers|
|---|---|---|---|
|BeamRider|2071|1401|~3000|
|Breakout|385|374|~150|
|QBert|4068|3620|~1000|
|SpaceInvaders|719|692|~600|

IMPALA and A2C vs A3C after 1 hour of training:

|env|RLlib IMPALA 32-workers|RLlib A2C 5-workers|Mnih et al A3C 16-workers|
|---|---|---|---|
|BeamRider|3181|874|~1000|
|Breakout|538|268|~10|
|QBert|10850|1212|~500|
|SpaceInvaders|843|518|~300|

IMPALA plots:

![tensorboard](/atari-impala/atari-impala.png)

A2C plots:

![tensorboard](/atari-a2c/atari-a2c.png)

#### Pong in 3 minutes

With a bit of tuning, RLlib IMPALA can solve Pong in ~3 minutes:

`rllib train -f pong-speedrun/pong-impala-fast.yaml`

![tensorboard](/pong-speedrun/pong-impala.png)
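Every experiment above (and below) is launched from a Ray Tune experiment file via `rllib train -f <experiment>.yaml`. The sketch below shows roughly what such a file looks like for the Pong speedrun; the hyperparameter values are illustrative assumptions, not the tuned settings shipped in `pong-speedrun/pong-impala-fast.yaml`.

```yaml
# Minimal sketch of an `rllib train -f` experiment file (Ray Tune format).
# All values are illustrative assumptions, not this repository's tuned settings.
pong-impala-fast:
    env: PongNoFrameskip-v4       # Atari Pong with the usual frameskip wrappers
    run: IMPALA                   # which RLlib algorithm to train
    stop:
        episode_reward_mean: 18   # stop once Pong is essentially solved
    config:
        num_workers: 32           # parallel rollout workers
        num_envs_per_worker: 5    # vectorized envs per worker
        num_gpus: 1               # GPUs used by the central learner
        clip_rewards: True
        rollout_fragment_length: 50
        train_batch_size: 500
        lr_schedule: [[0, 0.0005], [20000000, 0.000000000001]]
```

The top-level key names the Tune experiment, `run` selects the algorithm, `stop` gives the termination criteria, and everything under `config` is passed to that algorithm's trainer.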
#### DQN / Rainbow

`rllib train -f atari-dqn/basic-dqn.yaml`

`rllib train -f atari-dqn/duel-ddqn.yaml`

`rllib train -f atari-dqn/dist-dqn.yaml`

RLlib DQN after 10M time-steps (**40M frames**). Note that RLlib evaluation scores include the 1% random actions of epsilon-greedy exploration. You can expect slightly higher rewards when rolling out the policies without any exploration at all.

|env|RLlib Basic DQN|RLlib Dueling DDQN|RLlib Distributional DQN|Hessel et al. DQN|Hessel et al. Rainbow|
|---|---|---|---|---|---|
|BeamRider|2869|1910|4447|~2000|~13000|
|Breakout|287|312|410|~150|~300|
|QBert|3921|7968|15780|~4000|~20000|
|SpaceInvaders|650|1001|1025|~500|~2000|

Basic DQN plots:

![tensorboard](/atari-dqn/basic-dqn.png)

Dueling DDQN plots:

![tensorboard](/atari-dqn/dueling-ddqn.png)

Distributional DQN plots:

![tensorboard](/atari-dqn/dist-dqn.png)

#### Proximal Policy Optimization

`rllib train -f atari-ppo/atari-ppo.yaml`

`rllib train -f halfcheetah-ppo/halfcheetah-ppo.yaml`

##### *2018-09*

RLlib PPO with 10 workers (5 envs per worker) after 10M and 25M time-steps (**40M/100M frames**). Note that RLlib does not use clip parameter annealing.

|env|RLlib PPO @10M|RLlib PPO @25M|Baselines PPO @10M|
|---|---|---|---|
|BeamRider|2807|4480|~1800|
|Breakout|104|201|~250|
|QBert|11085|14247|~14000|
|SpaceInvaders|671|944|~800|

![tensorboard](/atari-ppo/2018-09/atari-ppo.png)

RLlib PPO wall-time performance vs other implementations using a single Titan XP and the same number of CPUs. Results compared to learning curves from [Fan et al, 2018](https://surreal.stanford.edu/img/surreal-corl2018.pdf) extracted at 1 hour of training from Figure 7. Here we get optimal results with a vectorization of 32 environment instances per worker:

|env|RLlib PPO 16-workers|Fan et al PPO 16-workers|TF BatchPPO 16-workers|
|---|---|---|---|
|HalfCheetah|9664|~7700|~3200|

![tensorboard](/halfcheetah-ppo/halfcheetah-ppo.png)

##### *2020-01*

Same as 2018-09, comparing only RLlib PPO-tf vs PPO-torch.

|env|RLlib PPO @20M (tf)|RLlib PPO @20M (torch)|plot|
|---|---|---|---|
|BeamRider|4142|3850|![tensorboard](/atari-ppo/BeamRiderNoFrameskip-v4/episode_reward_mean_tf_vs_torch_timesteps.png)|
|Breakout|132|166|![tensorboard](/atari-ppo/BreakoutNoFrameskip-v4/episode_reward_mean_tf_vs_torch_timesteps.png)|
|QBert|7987|14294|![tensorboard](/atari-ppo/QbertNoFrameskip-v4/episode_reward_mean_tf_vs_torch_timesteps.png)|
|SpaceInvaders|956|1016|![tensorboard](/atari-ppo/SpaceInvadersNoFrameskip-v4/episode_reward_mean_tf_vs_torch_timesteps.png)|

#### Soft Actor Critic

`rllib train -f halfcheetah-sac/halfcheetah-sac.yaml`

RLlib SAC after 3M time-steps. RLlib SAC versus the SoftLearning implementation of [Haarnoja et al, 2018](https://arxiv.org/pdf/1801.01290.pdf), benchmarked at 500k and 3M timesteps respectively.

|env|RLlib SAC @500K|Haarnoja et al SAC @500K|RLlib SAC @3M|Haarnoja et al SAC @3M|
|---|---|---|---|---|
|HalfCheetah|9000|~9000|13000|~15000|

![tensorboard](/halfcheetah-sac/halfcheetah-sac.PNG)

#### MAML

MAML uses additional metrics to measure performance: `episode_reward_mean` measures the agent's returns before adaptation, `episode_reward_mean_adapt_N` measures the agent's returns after N gradient steps of inner adaptation, and `adaptation_delta` measures the difference in performance before and after adaptation.

`rllib train -f maml/halfcheetah-rand-direc-maml.yaml`

![tensorboard](/maml/halfcheetah-rand-direc.png)

`rllib train -f maml/ant-rand-goal-maml.yaml`

![tensorboard](/maml/ant-rand-goal.png)

`rllib train -f maml/pendulum-mass-maml.yaml`

![tensorboard](/maml/pendulum-mass.png)
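The number of `episode_reward_mean_adapt_N` metrics you see is controlled by the MAML config's inner-adaptation settings. A hedged sketch of such a config follows; the env class path and all values are assumptions for illustration, not the contents of `maml/halfcheetah-rand-direc-maml.yaml`.

```yaml
# Illustrative MAML experiment sketch; the env path and all values are
# assumptions, not the settings from this repository's maml/*.yaml files.
halfcheetah-rand-direc-maml:
    env: ray.rllib.examples.env.halfcheetah_rand_direc.HalfCheetahRandDirecEnv
    run: MAML
    stop:
        training_iteration: 500
    config:
        num_workers: 20             # each worker samples a different task
        rollout_fragment_length: 100
        inner_adaptation_steps: 1   # yields episode_reward_mean_adapt_1
        maml_optimizer_steps: 5     # outer-loop meta-update steps per iteration
        inner_lr: 0.1               # learning rate for the inner adaptation step
        clip_param: 0.3
```

With one inner step, `adaptation_delta` then compares `episode_reward_mean_adapt_1` against the pre-adaptation `episode_reward_mean`.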
#### MB-MPO

`rllib train -f mbmpo/halfcheetah-mbmpo.yaml`

`rllib train -f mbmpo/hopper-mbmpo.yaml`

MBMPO uses additional metrics to measure performance. For each MBMPO iteration, MBMPO samples fake data from the transition-dynamics workers and steps through MAML for `N` iterations. `MAMLIter$i$_DynaTrajInner_$j$_episode_reward_mean` corresponds to the agent's performance across the dynamics models at the `i`th iteration of MAML and the `j`th step of inner adaptation.

RLlib MBMPO versus [Clavera et al, 2018](https://arxiv.org/pdf/1809.05214.pdf), benchmarked at 100k timesteps. The results reported below were run on RLlib and on the master branch of the [original codebase](https://github.com/jonasrothfuss/model_ensemble_meta_learning), respectively.

|env|RLlib MBMPO @100K|Clavera et al MBMPO @100K|
|---|---|---|
|HalfCheetah|520|~550|
|Hopper|620|~650|

![tensorboard](/mbmpo/mbmpo-mujoco.png)

#### Dreamer

`rllib train -f dreamer/dreamer-deepmind-control.yaml`

RLlib Dreamer after 1M time-steps. RLlib Dreamer versus the Google implementation of [Hafner et al, 2020](https://arxiv.org/pdf/1912.01603.pdf), benchmarked at 100k and 1M timesteps respectively.

|env|RLlib Dreamer @100K|Hafner et al Dreamer @100K|RLlib Dreamer @1M|Hafner et al Dreamer @1M|
|---|---|---|---|---|
|Walker|320|~250|920|~930|
|Cheetah|300|~250|640|~800|

![tensorboard](/dreamer/deepmind-dreamer.png)

RLlib Dreamer also logs gifs of Dreamer's imagined trajectories (top: ground truth, middle: model prediction, bottom: delta).

![Alt Text](/dreamer/walker_dreamer.gif)

![Alt Text](/dreamer/halfcheetah_dreamer.gif)

#### CQL

`rllib train -f halfcheetah-cql/halfcheetah-cql.yaml`

`rllib train -f halfcheetah-cql/halfcheetah-bc.yaml`

Since CQL is an offline RL algorithm, CQL's returns are evaluated only during the evaluation loop (once every 1000 gradient steps for MuJoCo-based envs). RLlib CQL versus Behavior Cloning (BC), benchmarked at 1M gradient steps over datasets derived from the D4RL benchmark ([Fu et al, 2020](https://arxiv.org/abs/2004.07219)). The results reported below were run on RLlib. The only difference between BC and CQL is the `bc_iters` parameter in CQL (how many iterations to run the BC loss).

RLlib's CQL is evaluated on four different environments: `HalfCheetah-Random-v0` and `Hopper-Random-v0` contain datasets collected by a random policy, while `HalfCheetah-Medium-v0` and `Hopper-Medium-v0` contain datasets collected by a policy trained 1/3 of the way through. In all envs, CQL does better than BC by a significant margin (especially on `HalfCheetah-Random-v0`).

|env|RLlib BC @1M|RLlib CQL @1M|
|---|---|---|
|HalfCheetah-Random-v0|-320|3000|
|Hopper-Random-v0|290|320|
|HalfCheetah-Medium-v0|3450|3850|
|Hopper-Medium-v0|1000|2000|

`rllib train -f cql/halfcheetah-cql.yaml` & `rllib train -f cql/halfcheetah-bc.yaml`

![tensorboard](/cql/halfcheetah-random-cql.png)

![tensorboard](/cql/halfcheetah-medium-cql.png)

`rllib train -f cql/hopper-cql.yaml` & `rllib train -f cql/hopper-bc.yaml`

![tensorboard](/cql/hopper-random-cql.png)

![tensorboard](/cql/hopper-medium-cql.png)

#### Transformers

`rllib train -f vizdoom-attention/vizdoom-attention.yaml`

RLlib's model catalog implements a variety of models for the policy and value networks, one of which supports using attention in RL. In particular, RLlib implements the Gated Transformer-XL ([Parisotto et al, 2019](https://arxiv.org/pdf/1910.06764.pdf)), abbreviated as GTrXL. GTrXL is benchmarked in the VizDoom environment, where the goal is to shoot a monster as quickly as possible. With PPO as the algorithm and GTrXL as the model, RLlib can successfully solve the VizDoom environment and reach human-level performance.

|env|RLlib Transformer @2M|
|---|---|
|VizdoomBasic-v0|~75|

![tensorboard](/vizdoom-attention/vizdoom-attention.png)
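For reference, the attention model is selected through RLlib's `model` config rather than a separate algorithm. Below is a hedged sketch of a PPO + GTrXL setup; the values are illustrative assumptions and not the tuned settings in `vizdoom-attention/vizdoom-attention.yaml`.

```yaml
# Illustrative sketch of enabling GTrXL via RLlib's model config; values are
# assumptions, not the tuned settings from vizdoom-attention/vizdoom-attention.yaml.
vizdoom-attention:
    env: VizdoomBasic-v0              # assumes a VizDoom gym wrapper is installed
    run: PPO
    stop:
        timesteps_total: 2000000      # matches the @2M benchmark point above
    config:
        framework: torch
        num_workers: 8
        model:
            use_attention: true               # swap the default net for GTrXL
            attention_num_transformer_units: 1
            attention_dim: 64
            attention_num_heads: 2
            attention_head_dim: 32
            attention_memory_inference: 100   # memory length (timesteps) at inference
            attention_memory_training: 50     # memory length (timesteps) in training
            attention_position_wise_mlp_dim: 64
```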