Update documentation #199

Merged · 14 commits · Oct 28, 2020
3 changes: 3 additions & 0 deletions README.md
@@ -19,6 +19,9 @@ These algorithms will make it easier for the research community and industry to

## Main Features

**The performance of each algorithm was tested** (see the *Results* section on their respective pages);
you can take a look at issues [#48](https://github.com/DLR-RM/stable-baselines3/issues/48) and [#49](https://github.com/DLR-RM/stable-baselines3/issues/49) for more details.


| **Features** | **Stable-Baselines3** |
| --------------------------- | ----------------------|
2 changes: 1 addition & 1 deletion docs/guide/callbacks.rst
@@ -15,7 +15,7 @@ To build a custom callback, you need to create a class that derives from ``BaseCallback``.
This will give you access to events (``_on_training_start``, ``_on_step``) and useful variables (like ``self.model`` for the RL model).


.. You can find two examples of custom callbacks in the documentation: one for saving the best model according to the training reward (see :ref:`Examples <examples>`), and one for logging additional values with Tensorboard (see :ref:`Tensorboard section <tensorboard>`).
You can find two examples of custom callbacks in the documentation: one for saving the best model according to the training reward (see :ref:`Examples <examples>`), and one for logging additional values with Tensorboard (see :ref:`Tensorboard section <tensorboard>`).


.. code-block:: python
7 changes: 7 additions & 0 deletions docs/guide/custom_policy.rst
@@ -7,6 +7,13 @@ Stable Baselines3 provides policy networks for images (CnnPolicies)
and other types of input features (MlpPolicies).


.. warning::
For A2C and PPO, continuous actions are clipped during training and testing
(to avoid out-of-bound errors). SAC, DDPG and TD3 squash the action, using a ``tanh()`` transformation,
which handles bounds more correctly.



Custom Policy Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^

167 changes: 167 additions & 0 deletions docs/guide/examples.rst
@@ -19,6 +19,7 @@ notebooks:
- `RL Baselines zoo`_
- `PyBullet`_
- `Hindsight Experience Replay`_
- `Advanced Saving and Loading`_

.. _Getting Started: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb
.. _Training, Saving, Loading: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/saving_loading_dqn.ipynb
@@ -28,6 +29,7 @@ notebooks:
.. _Hindsight Experience Replay: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_her.ipynb
.. _RL Baselines zoo: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/rl-baselines-zoo.ipynb
.. _PyBullet: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/pybullet.ipynb
.. _Advanced Saving and Loading: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/advanced_saving_loading.ipynb

.. |colab| image:: ../_static/img/colab.svg

@@ -417,6 +419,171 @@ The parking env is a goal-conditioned continuous control task, in which the vehicle
obs = env.reset()


Advanced Saving and Loading
---------------------------------

In this example, we show how to use some advanced features of Stable-Baselines3 (SB3):
how to easily create a test environment to evaluate an agent periodically,
use a policy independently from a model (and how to save it, load it) and save/load a replay buffer.

By default, the replay buffer is not saved when calling ``model.save()``, in order to save space on the disk (a replay buffer can be up to several GB when using images).
However, SB3 provides ``save_replay_buffer()`` and ``load_replay_buffer()`` methods to save and load it separately.


.. image:: ../_static/img/colab-badge.svg
:target: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/advanced_saving_loading.ipynb

Stable-Baselines3 also supports the automatic creation of an environment for evaluation.
For that, you only need to specify ``create_eval_env=True`` when passing the Gym ID of the environment while creating the agent.
Behind the scenes, SB3 uses an :ref:`EvalCallback <callbacks>`.

.. code-block:: python

from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.sac.policies import MlpPolicy

# Create the model, the training environment
# and the test environment (for evaluation)
model = SAC('MlpPolicy', 'Pendulum-v0', verbose=1,
            learning_rate=1e-3, create_eval_env=True)

# Evaluate the model every 1000 steps on 5 test episodes
# and save the evaluation to the "logs/" folder
model.learn(6000, eval_freq=1000, n_eval_episodes=5, eval_log_path="./logs/")

# save the model
model.save("sac_pendulum")

# the saved model does not contain the replay buffer
loaded_model = SAC.load("sac_pendulum")
print(f"The loaded_model has {loaded_model.replay_buffer.size()} transitions in its buffer")

# now save the replay buffer too
model.save_replay_buffer("sac_replay_buffer")

# load it into the loaded_model
loaded_model.load_replay_buffer("sac_replay_buffer")

# now the loaded replay is not empty anymore
print(f"The loaded_model has {loaded_model.replay_buffer.size()} transitions in its buffer")

# Save the policy independently from the model
# Note: if you don't save the complete model with `model.save()`
# you cannot continue training afterward
policy = model.policy
policy.save("sac_policy_pendulum")

# Retrieve the environment
env = model.get_env()

# Evaluate the policy
mean_reward, std_reward = evaluate_policy(policy, env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

# Load the policy independently from the model
saved_policy = MlpPolicy.load("sac_policy_pendulum")

# Evaluate the loaded policy
mean_reward, std_reward = evaluate_policy(saved_policy, env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



Accessing and modifying model parameters
----------------------------------------

You can access a model's parameters via the ``load_parameters`` and ``get_parameters`` functions,
or via ``model.policy.state_dict()`` (and ``load_state_dict()``),
which use dictionaries that map variable names to PyTorch tensors.

These functions are useful when you need to e.g. evaluate a large set of models with the same network structure,
visualize different layers of the network or modify parameters manually.

Policies also offer a simple way to save/load weights as a NumPy vector, using the ``parameters_to_vector()``
and ``load_from_vector()`` methods.
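
For instance, a minimal sketch of the vector-based API (assuming the vector returned by ``parameters_to_vector()`` is a flat NumPy array, as described above):

.. code-block:: python

import numpy as np

from stable_baselines3 import A2C

model = A2C("MlpPolicy", "CartPole-v1")

# Flatten all policy weights into a single NumPy vector
vector = model.policy.parameters_to_vector()

# Perturb the weights and load them back into the policy
model.policy.load_from_vector(vector + 0.01 * np.random.randn(*vector.shape))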

The following example demonstrates reading parameters, modifying some of them and loading them back into the model
by implementing an `evolution strategy (ES) <http://blog.otoro.net/2017/10/29/visual-evolution-strategies/>`_
for solving the ``CartPole-v1`` environment. The initial guess for the parameters is obtained by running
A2C policy gradient updates on the model.

.. code-block:: python

from typing import Dict

import gym
import numpy as np
import torch as th

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy


def mutate(params: Dict[str, th.Tensor]) -> Dict[str, th.Tensor]:
    """Mutate parameters by adding normal noise to them"""
    return dict((name, param + th.randn_like(param)) for name, param in params.items())


# Create policy with a small network
model = A2C(
    "MlpPolicy",
    "CartPole-v1",
    ent_coef=0.0,
    policy_kwargs={"net_arch": [32]},
    seed=0,
    learning_rate=0.05,
)

# Use traditional actor-critic policy gradient updates to
# find good initial parameters
model.learn(total_timesteps=10000)

# Include only variables with "policy", "action" (policy) or "shared_net" (shared layers)
# in their name: only these ones affect the action.
# NOTE: you can retrieve those parameters using model.get_parameters() too
mean_params = dict(
    (key, value)
    for key, value in model.policy.state_dict().items()
    if ("policy" in key or "shared_net" in key or "action" in key)
)

# Population size of 50 individuals
pop_size = 50
# Keep top 10%
n_elite = pop_size // 10
# Retrieve the environment
env = model.get_env()

for iteration in range(10):
    # Create population of candidates and evaluate them
    population = []
    for population_i in range(pop_size):
        candidate = mutate(mean_params)
        # Load new policy parameters to agent.
        # Tell function that it should only update parameters
        # we give it (policy parameters)
        model.policy.load_state_dict(candidate, strict=False)
        # Evaluate the candidate
        fitness, _ = evaluate_policy(model, env)
        population.append((candidate, fitness))
    # Take top 10% and use average over their parameters as next mean parameter
    top_candidates = sorted(population, key=lambda x: x[1], reverse=True)[:n_elite]
    mean_params = dict(
        (
            name,
            th.stack([candidate[0][name] for candidate in top_candidates]).mean(dim=0),
        )
        for name in mean_params.keys()
    )
    mean_fitness = sum(top_candidate[1] for top_candidate in top_candidates) / n_elite
    print(f"Iteration {iteration + 1:<3} Mean top fitness: {mean_fitness:.2f}")
print(f"Best fitness: {top_candidates[0][1]:.2f}")



Record a Video
--------------

67 changes: 67 additions & 0 deletions docs/guide/export.rst
@@ -0,0 +1,67 @@
.. _export:


Exporting models
================

After training an agent, you may want to deploy/use it in another language
or framework, like `tensorflowjs <https://github.com/tensorflow/tfjs>`_.
Stable Baselines3 does not include tools to export models to other frameworks, but
this document aims to cover parts that are required for exporting along with
more detailed stories from users of Stable Baselines3.


Background
----------

In Stable Baselines3, the controller is stored inside policies which convert
observations into actions. Each learning algorithm (e.g. DQN, A2C, SAC)
contains a policy object which represents the currently learned behavior,
accessible via ``model.policy``.

Policies hold enough information to do the inference (i.e. predict actions),
so it is enough to export these policies (cf :ref:`examples <examples>`)
to do inference in another framework.

.. warning::
When using CNN policies, the observation is normalized during pre-processing.
This pre-processing is done *inside* the policy (dividing by 255 to have values in [0, 1]).
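
As a minimal sketch, loading a policy saved with ``policy.save()`` (see the :ref:`examples <examples>` page) is enough to run inference:

.. code-block:: python

import numpy as np

from stable_baselines3.sac.policies import MlpPolicy

# Load only the policy (no replay buffer or optimizer state is needed for inference)
policy = MlpPolicy.load("sac_policy_pendulum")

# predict() converts the observation and returns a NumPy action
obs = np.zeros(policy.observation_space.shape, dtype=np.float32)
action, _states = policy.predict(obs, deterministic=True)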


Export to ONNX
-----------------

TODO: help is welcome!


Export to C++
-----------------

(using PyTorch JIT)
TODO: help is welcome!


Export to tensorflowjs / ONNX-JS
--------------------------------

TODO: help from contributors is welcome!
Probably a good starting point: https://github.com/elliotwaite/pytorch-to-javascript-with-onnx-js



Manual export
-------------

You can also manually export the required parameters (weights) and construct the
network in your desired framework.

You can access the parameters of the model via the agent's
:func:`get_parameters <stable_baselines3.common.base_class.BaseAlgorithm.get_parameters>` function.
As policies are also PyTorch modules, you can also access ``model.policy.state_dict()`` directly.
To find the architecture of the networks for each algorithm, the best is to check the ``policies.py`` file located
in their respective folders.

.. note::

In most cases, we recommend using PyTorch methods ``state_dict()`` and ``load_state_dict()`` from the policy,
unless you need to access the optimizers' state dict too. In that case, you need to call ``get_parameters()``.
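
For example, a minimal sketch that dumps the policy weights to a NumPy ``.npz`` archive (file names are illustrative):

.. code-block:: python

import numpy as np

from stable_baselines3 import PPO

# Load or create the trained agent whose policy you want to export
model = PPO("MlpPolicy", "CartPole-v1")

# state_dict() maps parameter names to PyTorch tensors
params = model.policy.state_dict()

# Convert each tensor to a NumPy array and store everything in a single .npz file
np.savez("policy_weights.npz", **{name: tensor.cpu().numpy() for name, tensor in params.items()})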
1 change: 1 addition & 0 deletions docs/guide/migration.rst
@@ -59,6 +59,7 @@ Moved Files
- ``bench/monitor.py`` -> ``common/monitor.py``
- ``logger.py`` -> ``common/logger.py``
- ``results_plotter.py`` -> ``common/results_plotter.py``
- ``common/cmd_util.py`` -> ``common/env_util.py``

Utility functions are no longer exported from the ``common`` module; you should import them with their absolute path, e.g.:

22 changes: 11 additions & 11 deletions docs/guide/rl_tips.rst
@@ -146,17 +146,17 @@ for continuous actions problems (cf *Bullet* envs).



.. Goal Environment
.. -----------------
..
.. If your environment follows the ``GoalEnv`` interface (cf `HER <../modules/her.html>`_), then you should use
.. HER + (SAC/TD3/DDPG/DQN) depending on the action space.
..
..
.. .. note::
..
.. The number of workers is an important hyperparameters for experiments with HER
..
Goal Environment
-----------------

If your environment follows the ``GoalEnv`` interface (cf :ref:`HER <her>`), then you should use
HER + (SAC/TD3/DDPG/DQN) depending on the action space.


.. note::

The number of workers is an important hyperparameter for experiments with HER



Tips and Tricks when creating a custom environment
59 changes: 59 additions & 0 deletions docs/guide/save_format.rst
@@ -0,0 +1,59 @@
.. _save_format:


On saving and loading
=====================

Stable Baselines3 (SB3) stores both neural network parameters and algorithm-related parameters such as
the exploration schedule, the number of environments and the observation/action space. This allows continual learning and easy
use of trained agents without training, but it is not without its issues. The following describes the format
used to save agents in SB3 along with its pros and shortcomings.

Terminology used in this page:

- *parameters* refer to neural network parameters (also called "weights"). This is a dictionary
mapping variable names to PyTorch tensors.
- *data* refers to RL algorithm parameters, e.g. learning rate, exploration schedule, action/observation space.
These depend on the algorithm used. This is a dictionary mapping class variable names to their values.
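
For illustration, a minimal sketch showing both kinds of values on a freshly created model:

.. code-block:: python

from stable_baselines3 import A2C

model = A2C("MlpPolicy", "CartPole-v1")

# *parameters*: neural network weights, a dict mapping variable names to PyTorch tensors
print(list(model.policy.state_dict().keys())[:3])

# *data*: algorithm-related parameters, stored as attributes of the model
print(model.learning_rate, model.n_steps, model.observation_space, model.action_space)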


Zip-archive
-----------

A zip archive containing a JSON dump, PyTorch state dictionaries and PyTorch variables. The data dictionary (class parameters)
is stored as a JSON file, the model parameters and optimizers are serialized with the ``torch.save()`` function, and these files
are stored in a single ``.zip`` archive.

Any objects that are not JSON serializable are serialized with cloudpickle and stored as a base64-encoded
string in the JSON file, along with some information about the serialized object. This allows
inspecting stored objects without deserializing the object itself.

This format allows skipping elements in the file, i.e. we can skip deserializing objects that are
broken/non-serializable.

.. This can be done via ``custom_objects`` argument to load functions.


File structure:

::

saved_model.zip/
├── data JSON file of class-parameters (dictionary)
├── *.optimizer.pth PyTorch optimizers serialized
├── policy.pth PyTorch state dictionary of the policy saved
├── pytorch_variables.pth Additional PyTorch variables
└── _stable_baselines3_version contains the SB3 version with which the model was saved
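
The archive can be inspected without SB3, e.g. with the standard ``zipfile`` module (a minimal sketch, file name is illustrative):

.. code-block:: python

import zipfile

# List the files stored inside a saved SB3 model archive
with zipfile.ZipFile("saved_model.zip") as archive:
    print(archive.namelist())
    # e.g. ['data', 'policy.pth', 'policy.optimizer.pth', 'pytorch_variables.pth', '_stable_baselines3_version']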


Pros:

- More robust to unserializable objects (one bad object does not break everything).
- Saved files can be inspected/extracted with zip-archive explorers and by other languages.


Cons:

- More complex implementation.
- Still relies partly on cloudpickle for complex objects (e.g. custom functions),
which can lead to `incompatibilities <https://github.com/DLR-RM/stable-baselines3/issues/172>`_ between Python versions.
Loading