Migration Guide #123

Merged
merged 25 commits on Oct 11, 2020
Commits
794622d
Start migration guide
araffin Jul 22, 2020
7cd7c0b
Update guide
araffin Jul 22, 2020
3050073
Merge branch 'master' into doc/migration
araffin Jul 31, 2020
b6490bc
Add comment on RMSpropTFLike plus PPO/A2C migrations
Miffyli Aug 3, 2020
5d30f4c
Merge branch 'master' into doc/migration
araffin Aug 3, 2020
c04673e
Merge branch 'master' into doc/migration
araffin Aug 6, 2020
82724ca
Merge branch 'master' into doc/migration
araffin Aug 23, 2020
e2b8d71
Merge branch 'master' into doc/migration
araffin Aug 23, 2020
dd2e2f9
Merge branch 'master' into doc/migration
araffin Aug 27, 2020
7c6c9f8
Merge branch 'master' into doc/migration
araffin Aug 29, 2020
ce8b68b
Merge branch 'master' into doc/migration
araffin Sep 1, 2020
7559c64
Merge branch 'master' into doc/migration
araffin Sep 20, 2020
a0435e3
Add note about set/get-parameters
Miffyli Sep 23, 2020
9087f83
Merge branch 'master' into doc/migration
araffin Sep 24, 2020
e713b40
Merge branch 'master' into doc/migration
araffin Sep 24, 2020
342fc7c
Merge branch 'master' into doc/migration
araffin Sep 26, 2020
002eae3
Merge branch 'master' into doc/migration
araffin Sep 30, 2020
3366e94
Merge branch 'master' into doc/migration
araffin Oct 3, 2020
59b9122
Update migration guide
araffin Oct 3, 2020
7ca9511
Merge branch 'master' into doc/migration
araffin Oct 4, 2020
ef478e5
Merge branch 'master' into doc/migration
araffin Oct 7, 2020
0befe28
Update changelog and readme
araffin Oct 7, 2020
397a52b
Merge branch 'doc/migration' of github.com:DLR-RM/stable-baselines3 i…
araffin Oct 7, 2020
b0c6e9b
Update doc + clean changelog
araffin Oct 9, 2020
0b42647
Address comments
araffin Oct 11, 2020
4 changes: 2 additions & 2 deletions README.md
@@ -50,9 +50,9 @@ Planned features:
- [ ] TRPO


## Migration guide
## Migration guide: from Stable-Baselines (SB2) to Stable-Baselines3 (SB3)

**TODO: migration guide from Stable-Baselines in the documentation**
A migration guide from SB2 to SB3 can be found in the [documentation](https://stable-baselines3.readthedocs.io/en/master/guide/migration.html).

## Documentation

2 changes: 1 addition & 1 deletion docs/guide/install.rst
@@ -29,7 +29,7 @@ To install Stable Baselines3 with pip, execute:

pip install stable-baselines3[extra]

This includes an optional dependencies like Tensorboard, OpenCV or ```atari-py``` to train on atari games. If you do not need those, you can use:
This includes optional dependencies like Tensorboard, OpenCV or ``atari-py`` to train on Atari games. If you do not need those, you can use:

.. code-block:: bash

182 changes: 181 additions & 1 deletion docs/guide/migration.rst
@@ -9,4 +9,184 @@ This is a guide to migrate from Stable-Baselines to Stable-Baselines3.

It also references the main changes.

**TODO**
.. warning::
    This section is still a work in progress (WIP). Things might be added in the future, before the 1.0 release.



Overview
========

Overall, Stable-Baselines3 (SB3) keeps the high-level API of Stable-Baselines (SB2).
Most of the changes are internal ones, made to ensure more consistency.
Because of the backend change from Tensorflow to PyTorch, the internal code is much more readable and easier to debug,
at the cost of some speed (dynamic graph vs static graph, see `Issue #90 <https://github.com/DLR-RM/stable-baselines3/issues/90>`_).
However, the algorithms were extensively benchmarked on Atari games and continuous control PyBullet envs
(see `Issue #48 <https://github.com/DLR-RM/stable-baselines3/issues/48>`_ and `Issue #49 <https://github.com/DLR-RM/stable-baselines3/issues/49>`_),
so you should not expect a performance drop when switching from SB2 to SB3.

Breaking Changes
================

- SB3 requires Python 3.6+ (instead of Python 3.5+ for SB2)
- Dropped MPI support
- Dropped layer normalized policies (e.g. ``LnMlpPolicy``)
- Dropped parameter noise for DDPG and DQN
- PPO is now closer to the original implementation (no clipping of the value function by default), cf the PPO section below
- Orthogonal initialization is only used by A2C/PPO
- The features extractor (CNN extractor) is shared between the policy and the Q-networks for DDPG/SAC/TD3, and only the policy loss is used to update it (much faster)
- Tensorboard legacy logging was dropped in favor of having one logger for the terminal and Tensorboard (cf :ref:`Tensorboard integration <tensorboard>`)
- We dropped ACKTR/ACER support because of their complexity compared to simpler alternatives (PPO, SAC, TD3) that perform as well
- We dropped GAIL support as we are focusing on model-free RL only; you can however take a look at the `Imitation Learning Baseline Implementations <https://github.com/HumanCompatibleAI/imitation>`_,
  which are based on SB3.

TODO: change to deterministic predict for SAC/TD3

TODO: state API breaking changes and implementation differences (e.g. PPO clip range and renaming of parameters)

Moved Files
-----------

- ``bench/monitor.py`` -> ``common/monitor.py``
- ``logger.py`` -> ``common/logger.py``
- ``results_plotter.py`` -> ``common/results_plotter.py``

Utility functions are no longer exported from the ``common`` module; you should import them with their full path, e.g.:

.. code-block:: python

from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
from stable_baselines3.common.utils import set_random_seed

instead of ``from stable_baselines3.common import make_atari_env``



Parameters Change and Renaming
------------------------------

Base-class (all algorithms)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``load_parameters`` -> ``set_parameters``

- ``get/set_parameters`` return a dictionary mapping object names
  to their respective PyTorch tensors and other objects representing
  their parameters, instead of a simpler mapping of parameter names
  to NumPy arrays. These functions also return PyTorch tensors rather
  than NumPy arrays (see the sketch below).

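A minimal sketch of the new interface (the key names shown, such as ``policy`` and ``policy.optimizer``, are illustrative and may differ between algorithms):

.. code-block:: python

    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "CartPole-v1")

    # SB3: a dict mapping object names to their PyTorch state dicts,
    # e.g. {"policy": ..., "policy.optimizer": ...}, with torch.Tensor values
    params = model.get_parameters()

    # Load the parameters back (this replaces the old load_parameters)
    model.set_parameters(params, exact_match=True)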

Policies
^^^^^^^^

- ``cnn_extractor`` -> ``feature_extractor``, as ``feature_extractor`` is now used with ``MlpPolicy`` too

A2C
^^^

- ``epsilon`` -> ``rms_prop_eps``
- ``lr_schedule`` is part of ``learning_rate`` (it can be a callable).
- ``alpha`` and ``momentum`` can be modified through the ``optimizer_kwargs`` key of ``policy_kwargs``.

.. warning::

    The PyTorch implementation of RMSprop `differs from Tensorflow's <https://github.com/pytorch/pytorch/issues/23796>`_,
    which leads to `different and potentially more unstable results <https://github.com/DLR-RM/stable-baselines3/pull/110#issuecomment-663255241>`_.
    Use the ``stable_baselines3.common.sb2_compat.rmsprop_tf_like.RMSpropTFLike`` optimizer to match the results
    of Tensorflow's implementation. This can be done through ``policy_kwargs``: ``A2C(policy_kwargs=dict(optimizer_class=RMSpropTFLike))``

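For instance, a minimal sketch of switching the optimizer (other hyperparameters are left at their defaults):

.. code-block:: python

    from stable_baselines3 import A2C
    from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

    # Use the TF-like RMSprop to better reproduce SB2 results
    model = A2C(
        "MlpPolicy",
        "CartPole-v1",
        policy_kwargs=dict(optimizer_class=RMSpropTFLike),
    )
    model.learn(total_timesteps=10_000)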

PPO
^^^

- ``cliprange`` -> ``clip_range``
- ``cliprange_vf`` -> ``clip_range_vf``
- ``nminibatches`` -> ``batch_size``

.. warning::

    ``nminibatches`` gave a different batch size depending on the number of environments: ``batch_size = (n_steps * n_envs) // nminibatches``


- ``clip_range_vf`` behavior for PPO is slightly different: set it to ``None`` (default) to deactivate clipping (in SB2, you had to pass ``-1``; ``None`` meant using ``clip_range`` for the clipping)
- ``lam`` -> ``gae_lambda``
- ``noptepochs`` -> ``n_epochs``

PPO default hyperparameters are the ones tuned for continuous control environments.
We recommend taking a look at the :ref:`RL Zoo <rl_zoo>` for hyperparameters tuned for Atari games.

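As an illustration, a hedged sketch of mapping old PPO2 keyword arguments onto the new names (the values are only an example, not a recommendation):

.. code-block:: python

    from stable_baselines3 import PPO
    from stable_baselines3.common.cmd_util import make_vec_env

    # SB2: PPO2(..., n_steps=128, nminibatches=4, noptepochs=4, cliprange=0.2, lam=0.95)
    # with 8 envs gave batch_size = (128 * 8) // 4 = 256
    env = make_vec_env("CartPole-v1", n_envs=8)
    model = PPO(
        "MlpPolicy",
        env,
        n_steps=128,
        batch_size=256,      # now explicit, no longer derived from nminibatches
        n_epochs=4,          # was noptepochs
        clip_range=0.2,      # was cliprange
        clip_range_vf=None,  # None deactivates value clipping (SB2 used -1 for that)
        gae_lambda=0.95,     # was lam
    )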

DQN
^^^

Only the vanilla DQN is implemented right now but extensions will follow (cf planned features).
Default hyperparameters are taken from the Nature paper, except for the optimizer and learning rate, which were taken from Stable Baselines defaults.

DDPG
^^^^

DDPG now follows the same interface as SAC/TD3.
For state/reward normalization, you should use ``VecNormalize`` as for all other algorithms.

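A possible sketch (the environment and normalization settings are only an illustration):

.. code-block:: python

    from stable_baselines3 import DDPG
    from stable_baselines3.common.cmd_util import make_vec_env
    from stable_baselines3.common.vec_env import VecNormalize

    # Wrap the vectorized env to normalize observations and rewards
    env = VecNormalize(make_vec_env("Pendulum-v0", n_envs=1), norm_obs=True, norm_reward=True)

    model = DDPG("MlpPolicy", env)
    model.learn(total_timesteps=10_000)
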
SAC/TD3
^^^^^^^

SAC/TD3 now accept any number of critics, e.g. ``policy_kwargs=dict(n_critics=3)``, instead of only two before.


.. note::

SAC/TD3 default hyperparameters (including network architecture) now match the ones from the original papers.
DDPG is using TD3 defaults.


New logger API
--------------

- Methods were renamed in the logger (see the sketch after this list):

  - ``logkv`` -> ``record``, ``writekvs`` -> ``write``, ``writeseq`` -> ``write_sequence``,
  - ``logkvs`` -> ``record_dict``, ``dumpkvs`` -> ``dump``,
  - ``getkvs`` -> ``get_log_dict``, ``logkv_mean`` -> ``record_mean``

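A short sketch of the renamed calls (this assumes the module-level logger functions; the values are illustrative):

.. code-block:: python

    from stable_baselines3.common import logger

    # SB2: logger.logkv("train/reward", 3.0); logger.dumpkvs()
    logger.record("train/reward", 3.0)
    logger.record_mean("train/reward_mean", 3.0)
    logger.dump(step=1000)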

Internal Changes
----------------

Please read the :ref:`Developer Guide <developer>` section.


New Features
============

- much cleaner and more consistent base code (and no more warnings =D!) and static type checks
- independent saving/loading/predict for policies
- A2C now supports Generalized Advantage Estimation (GAE) and advantage normalization (both are deactivated by default)
- generalized State-Dependent Exploration (gSDE) is available for A2C/PPO/SAC. It allows using RL directly on real robots (cf https://arxiv.org/abs/2005.05719)
- proper evaluation (using a separate env) is included in the base class (using ``EvalCallback``);
  if you pass the environment as a string, you can pass ``create_eval_env=True`` to the algorithm constructor.
- better saving/loading: optimizers are now included in the saved parameters, and there are two new methods, ``save_replay_buffer`` and ``load_replay_buffer``, for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3), as shown in the sketch after this list
- you can pass ``optimizer_class`` and ``optimizer_kwargs`` to ``policy_kwargs`` in order to easily
  customize optimizers
- seeding now works properly to have deterministic results
- the replay buffer does not grow: everything is allocated at build time (faster)
- we added a memory efficient replay buffer variant (pass ``optimize_memory_usage=True`` to the constructor), which drastically reduces the memory used, especially when using images
- you can specify an arbitrary number of critics for SAC/TD3 (e.g. ``policy_kwargs=dict(n_critics=3)``)

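A hedged sketch of the new replay buffer helpers (paths and timesteps are arbitrary):

.. code-block:: python

    from stable_baselines3 import SAC

    model = SAC("MlpPolicy", "Pendulum-v0")
    model.learn(total_timesteps=5_000)

    # The replay buffer is saved/loaded separately from the model
    model.save("sac_pendulum")
    model.save_replay_buffer("sac_replay_buffer")

    loaded_model = SAC.load("sac_pendulum")
    loaded_model.load_replay_buffer("sac_replay_buffer")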

How to migrate?
===============

In most cases, replacing ``from stable_baselines`` with ``from stable_baselines3`` will be sufficient.
Some files were moved to the ``common`` folder (cf above), which could result in import errors.
We recommend looking at the `rl-zoo3 <https://github.com/DLR-RM/rl-baselines3-zoo>`_ and comparing the imports
to those of the SB2 `rl-zoo <https://github.com/araffin/rl-baselines-zoo>`_ for a concrete example of a successful migration.

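For example, a minimal before/after sketch (SB2's ``PPO2`` becomes ``PPO`` in SB3):

.. code-block:: python

    # SB2
    # from stable_baselines import PPO2
    # model = PPO2("MlpPolicy", "CartPole-v1").learn(10000)

    # SB3
    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "CartPole-v1").learn(10_000)
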
Planned Features
================

- Recurrent (LSTM) policies
- DQN extensions (the current implementation is a vanilla DQN)

cf `roadmap <https://github.com/DLR-RM/stable-baselines3/issues/1>`_
1 change: 1 addition & 0 deletions docs/misc/changelog.rst
@@ -26,6 +26,7 @@ Others:

Documentation:
^^^^^^^^^^^^^^
- Added first draft of migration guide


Pre-Release 0.9.0 (2020-10-03)