cambrian.ml.model

Custom model class for Cambrian. This class is a subclass of the PPO model from Stable Baselines 3. It overrides the save and load methods to save only the policy weights, and it adds a method to load rollout data from a previous training run. When rollout data is loaded, the predict method is overridden to return the next action in the rollout. This is useful for testing the evolutionary loop without having to train the agent each time.
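A minimal usage sketch of this workflow, assuming a stand-in Gymnasium environment; the file paths and rollout file format below are placeholders, not part of the documented API:

import gymnasium as gym

from cambrian.ml.model import MjCambrianModel

# Stand-in environment for illustration only.
env = gym.make("CartPole-v1")

# Train briefly, then persist only the policy weights.
model = MjCambrianModel("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=2_048)
model.save_policy("checkpoints")  # hypothetical path

# A fresh model can pick up just those weights...
fresh = MjCambrianModel("MlpPolicy", env)
fresh.load_policy("checkpoints")  # hypothetical path

# ...or replay a previously recorded rollout, in which case predict() returns
# the next recorded action instead of querying the policy.
fresh.load_rollout("rollout.pkl")  # hypothetical path and format
obs, _info = env.reset()
action, _state = fresh.predict(obs)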

Classes

MjCambrianModel

Proximal Policy Optimization algorithm (PPO) (clip version)

Module Contents

class MjCambrianModel(*args, **kwargs)[source]

Bases: stable_baselines3.PPO

Proximal Policy Optimization algorithm (PPO) (clip version)

Paper: https://arxiv.org/abs/1707.06347

Code: This implementation borrows code from OpenAI Spinning Up (https://github.com/openai/spinningup/), https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, and Stable Baselines (PPO2 from https://github.com/hill-a/stable-baselines).

Introduction to PPO: https://spinningup.openai.com/en/latest/algorithms/ppo.html

Parameters:
  • policy – The policy model to use (MlpPolicy, CnnPolicy, …)

  • env – The environment to learn from (if registered in Gym, can be str)

  • learning_rate – The learning rate; it can be a function of the current progress remaining (from 1 to 0)

  • n_steps – The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel) NOTE: n_steps * n_envs must be greater than 1 (because of the advantage normalization) See https://github.com/pytorch/pytorch/issues/29372

  • batch_size – Minibatch size

  • n_epochs – Number of epochs when optimizing the surrogate loss

  • gamma – Discount factor

  • gae_lambda – Factor for trade-off of bias vs variance for Generalized Advantage Estimator

  • clip_range – Clipping parameter; it can be a function of the current progress remaining (from 1 to 0).

  • clip_range_vf – Clipping parameter for the value function; it can be a function of the current progress remaining (from 1 to 0). This is a parameter specific to the OpenAI implementation. If None is passed (default), no clipping will be done on the value function. IMPORTANT: this clipping depends on the reward scaling.

  • normalize_advantage – Whether or not to normalize the advantage

  • ent_coef – Entropy coefficient for the loss calculation

  • vf_coef – Value function coefficient for the loss calculation

  • max_grad_norm – The maximum value for the gradient clipping

  • use_sde – Whether to use generalized State Dependent Exploration (gSDE) instead of action noise exploration (default: False)

  • sde_sample_freq – Sample a new noise matrix every n steps when using gSDE. Default: -1 (only sample at the beginning of the rollout)

  • rollout_buffer_class – Rollout buffer class to use. If None, it will be automatically selected.

  • rollout_buffer_kwargs – Keyword arguments to pass to the rollout buffer on creation

  • target_kl – Limit the KL divergence between updates, because the clipping is not enough to prevent large updates; see issue #213 (cf. https://github.com/hill-a/stable-baselines/issues/213). By default, there is no limit on the KL divergence.

  • stats_window_size – Window size for the rollout logging, specifying the number of episodes to average the reported success rate, mean episode length, and mean reward over

  • tensorboard_log – the log location for tensorboard (if None, no logging)

  • policy_kwargs – additional arguments to be passed to the policy on creation. See ppo_policies

  • verbose – Verbosity level: 0 for no output, 1 for info messages (such as device or wrappers used), 2 for debug messages

  • seed – Seed for the pseudo random generators

  • device – Device (cpu, cuda, …) on which the code should be run. If set to auto, the code will run on the GPU if possible.

  • _init_setup_model – Whether or not to build the network at the creation of the instance
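Since learning_rate and clip_range accept either constants or schedules of the remaining progress, here is a hedged configuration sketch; the hyperparameter values are illustrative only, not recommendations from the Cambrian codebase, and the environment is a stand-in:

import gymnasium as gym

from cambrian.ml.model import MjCambrianModel

def linear_schedule(progress_remaining: float) -> float:
    # Anneals linearly from 3e-4 to 0 as progress_remaining goes from 1 to 0.
    return 3e-4 * progress_remaining

model = MjCambrianModel(
    "MlpPolicy",
    gym.make("CartPole-v1"),        # stand-in environment
    learning_rate=linear_schedule,  # callable of the remaining progress
    n_steps=2048,                   # rollout buffer holds n_steps * n_envs transitions
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,                 # could also be a schedule like learning_rate
    ent_coef=0.0,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
    device="auto",
)
model.learn(total_timesteps=10_000)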

save_policy(path)[source]

Override the save method. Instead of saving the entire state, we’ll just save the policy weights.
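A plausible implementation is simply a torch.save of the policy’s state_dict; the helper name, file name, and directory layout below are assumptions rather than the documented behaviour:

import pathlib

import torch
from stable_baselines3 import PPO

def save_policy_weights(model: PPO, path: str) -> None:
    # Persist only the policy's parameters, not the optimizer or training state.
    out = pathlib.Path(path)
    out.mkdir(parents=True, exist_ok=True)
    torch.save(model.policy.state_dict(), out / "policy.pt")  # assumed file name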

load_policy(path)[source]

Override the load method. Instead of loading the entire state, we’ll just load the policy weights.

There are four cases to consider (see the sketch after this list):
  • A layer in the saved policy is identical in shape to the current policy
    • Do nothing for this layer

  • A layer is present in both the saved policy and the current policy, but the shapes are different
    • Delete the layer from the saved policy

  • A layer is present in the saved policy but not the current policy
    • Delete the layer from the saved policy

  • A layer is present in the current policy but not the saved policy
    • Do nothing for this layer. By setting strict=False in the call to load_state_dict, we can ignore this layer.
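The four cases can be handled by filtering the saved state_dict against the current policy’s shapes before a non-strict load. A sketch under the assumption that the weights were stored as a plain state_dict (the helper name and file name are hypothetical):

import pathlib

import torch
from stable_baselines3 import PPO

def load_policy_weights(model: PPO, path: str) -> None:
    # Load policy weights while tolerating architecture differences.
    saved = torch.load(pathlib.Path(path) / "policy.pt")  # assumed file name
    current = model.policy.state_dict()

    for name in list(saved.keys()):
        if name not in current:
            # Case 3: layer only exists in the saved policy -> drop it.
            del saved[name]
        elif saved[name].shape != current[name].shape:
            # Case 2: present in both, but shapes differ -> drop it.
            del saved[name]
        # Case 1: identical shape -> keep as-is.

    # Case 4: layers only in the current policy are ignored via strict=False.
    model.policy.load_state_dict(saved, strict=False)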

load_rollout(path)[source]

Load the rollout data from a previous training run. The rollout is a list of actions indexed by the current step. The model.predict call is then overridden to return the next action. This loader is “dumb” in the sense that it doesn’t actually process the observations while a rollout is in use; it simply keeps track of the current step and returns the next action in the rollout.
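A sketch of this “dumb” replay behaviour, assuming the rollout was pickled as a list of actions; the file format, class name, and attribute names are assumptions, not the actual Cambrian implementation:

import pickle

from stable_baselines3 import PPO

class RolloutReplayModel(PPO):

    def load_rollout(self, path):
        # Assumed format: a pickled list of actions, one per step.
        with open(path, "rb") as f:
            self._rollout = pickle.load(f)
        self._rollout_step = 0

    def predict(self, *args, **kwargs):
        if getattr(self, "_rollout", None) is not None:
            # Ignore the observation entirely; just return the next recorded action.
            action = self._rollout[self._rollout_step]
            self._rollout_step += 1
            return action, None
        return super().predict(*args, **kwargs)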

classmethod load_weights(weights, **kwargs)[source]

Load the weights for the policy. This is useful for testing the evolutionary loop without having to train the agent each time.
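Speculative usage only: the expected structure of weights (assumed here to be a policy state_dict) and the accepted keyword arguments (assumed to be forwarded to the constructor) are not documented above:

import gymnasium as gym
import torch

from cambrian.ml.model import MjCambrianModel

weights = torch.load("policy.pt")  # assumed: a policy state_dict saved earlier

model = MjCambrianModel.load_weights(
    weights,
    policy="MlpPolicy",           # assumption: kwargs go to the constructor
    env=gym.make("CartPole-v1"),  # stand-in environment
)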

predict(*args, **kwargs)[source]

Get the policy action from an observation (and optional hidden state). Includes sugar-coating to handle different observations (e.g. normalizing images).

Parameters:
  • observation – the input observation

  • state – The last hidden states (can be None, used in recurrent policies)

  • episode_start – The last masks (can be None, used in recurrent policies); this corresponds to the beginning of episodes, where the hidden states of the RNN must be reset.

  • deterministic – Whether or not to return deterministic actions.

Returns:

the model’s action and the next hidden state (used in recurrent policies)
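A standard evaluation loop using predict, with a stand-in Gymnasium environment and an untrained model for illustration:

import gymnasium as gym

from cambrian.ml.model import MjCambrianModel

env = gym.make("CartPole-v1")              # stand-in environment
model = MjCambrianModel("MlpPolicy", env)  # untrained, for illustration only

obs, _info = env.reset()
state = None  # last hidden state; only used by recurrent policies
done = False
while not done:
    # deterministic=True returns the mode of the action distribution.
    action, state = model.predict(obs, state=state, deterministic=True)
    obs, reward, terminated, truncated, _info = env.step(action)
    done = terminated or truncated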