RLlib Utilities
The following tools are for training and evaluating policies with RLlib. They handle boilerplate tasks, such as finding the newest results directories and checkpoints, as well as more Phantom-specific tasks, such as populating supertypes with Samplers and Ranges.
Training
- phantom.utils.rllib.train(algorithm, env_class, policies, iterations, checkpoint_freq=None, num_workers=None, env_config=None, rllib_config=None, metrics=None, results_dir='~/ray_results', show_training_metrics=False)[source]
Performs training of a Phantom experiment using the RLlib library.
Any objects that inherit from BaseSampler in the env_supertype or agent_supertypes parameters will be automatically sampled from and fed back into the environment at the start of each episode.
- Parameters:
  - algorithm (str) – RL algorithm to use.
  - env_class (Type[PhantomEnv]) – A PhantomEnv subclass.
  - policies (Mapping[str, Union[Type[Agent], List[AgentID], Tuple[PolicyClass, Type[Agent]], Tuple[PolicyClass, Type[Agent], Mapping[str, Any]], Tuple[PolicyClass, List[AgentID]], Tuple[PolicyClass, List[AgentID], Mapping[str, Any]]]]) – A mapping of policy IDs to policy configurations.
  - iterations (int) – Number of training iterations to perform.
  - checkpoint_freq (Optional[int]) – The iteration frequency at which to save policy checkpoints.
  - num_workers (Optional[int]) – Number of Ray rollout workers to use (defaults to 'NUM CPU - 1').
  - env_config (Optional[Mapping[str, Any]]) – Configuration parameters to pass to the environment init method.
  - rllib_config (Optional[Mapping[str, Any]]) – Optional algorithm parameters dictionary to pass to RLlib.
  - metrics (Optional[Mapping[str, Metric]]) – Optional set of metrics to record and log.
  - results_dir (str) – A custom results directory (default is ~/ray_results/).
  - show_training_metrics (bool) – Set to True to print training metrics every iteration.
The policies parameter defines which agents will use which policy. This is key to performing shared policy learning. The function expects a mapping of {<policy_id>: <policy_setup>}. The policy setup values can take one of the following forms:

- Type[Agent]: All agents that are an instance of this class will learn the same RLlib policy.
- List[AgentID]: All agents with an ID in this list will learn the same RLlib policy.
- Tuple[PolicyClass, Type[Agent]]: All agents that are an instance of this class will use the same fixed/learnt policy.
- Tuple[PolicyClass, Type[Agent], Mapping[str, Any]]: All agents that are an instance of this class will use the same fixed/learnt policy, configured with the given options.
- Tuple[PolicyClass, List[AgentID]]: All agents with an ID in this list will use the same fixed/learnt policy.
- Tuple[PolicyClass, List[AgentID], Mapping[str, Any]]: All agents with an ID in this list will use the same fixed/learnt policy, configured with the given options.
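As an illustration, a policies mapping combining several of these forms might look like the following sketch. BuyerAgent, SellerAgent, and FixedSpreadPolicy are hypothetical user-defined classes standing in for real Phantom Agent and Policy subclasses; the policy IDs and agent IDs are likewise made up.

```python
# Hypothetical stand-ins for user-defined Phantom Agent/Policy subclasses.
class BuyerAgent: ...
class SellerAgent: ...
class FixedSpreadPolicy: ...

policies = {
    # Type[Agent]: every BuyerAgent instance shares one learned RLlib policy.
    "buyer_policy": BuyerAgent,
    # List[AgentID]: these two specific agents share one learned RLlib policy.
    "vip_policy": ["seller_1", "seller_2"],
    # Tuple[PolicyClass, Type[Agent], Mapping[str, Any]]: every SellerAgent
    # uses the same fixed policy, configured with the given options.
    "seller_policy": (FixedSpreadPolicy, SellerAgent, {"spread": 0.5}),
}
```

A mapping like this would then be passed as the policies argument to phantom.utils.rllib.train().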
- Return type:
Algorithm
- Returns:
The Ray Tune experiment results object.
Note
It is the user's responsibility to invoke training via the provided phantom command, or to ensure the PYTHONHASHSEED environment variable is set before starting the Python interpreter that runs this code. Not setting this may lead to reproducibility issues.
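When not using the phantom command, the seed can be fixed from the launching shell, since PYTHONHASHSEED only takes effect if set before the interpreter starts. In this sketch, train.py is a hypothetical entry script that calls phantom.utils.rllib.train():

```shell
# PYTHONHASHSEED must be exported before the Python interpreter starts;
# setting it from inside an already-running script has no effect.
export PYTHONHASHSEED=0
echo "PYTHONHASHSEED=$PYTHONHASHSEED"
# python train.py   # 'train.py' is a hypothetical training entry script
```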
Rollouts
- phantom.utils.rllib.rollout(directory, env_class=None, env_config=None, custom_policy_mapping=None, num_repeats=1, num_workers=None, checkpoint=None, metrics=None, record_messages=False, show_progress_bar=True, policy_inference_batch_size=1)[source]
Performs rollouts for a previously trained Phantom experiment.
Any objects that inherit from the Range class in the env_config parameter will be expanded out into a multidimensional space of rollouts.
For example, if two distinct UniformRanges are used, one with a length of 10 and one with a length of 5, 10 * 5 = 50 rollouts will be performed.
If num_repeats is also given, say with a value of 2, then each of the 50 rollouts will be repeated twice, each time with a different random seed.
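The rollout-count arithmetic described above can be checked directly; this small calculation uses the same example numbers (two ranges of lengths 10 and 5, with num_repeats=2):

```python
import math

# Lengths of the Range objects found in env_config, e.g. two
# UniformRanges of length 10 and 5 respectively.
range_lengths = [10, 5]
num_repeats = 2

# One rollout per point in the cross-product of all ranges...
base_rollouts = math.prod(range_lengths)        # 10 * 5 = 50
# ...each repeated num_repeats times, with a different random seed.
total_rollouts = base_rollouts * num_repeats    # 50 * 2 = 100
```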
- Parameters:
  - directory (Union[str, Path]) – Results directory containing trained policies. By default, this is located within ~/ray_results/. If LATEST is given as the last element of the path, the parent directory will be scanned for the most recent run and this will be used.
  - env_class (Optional[Type[PhantomEnv]]) – Optionally pass the environment class to use. If not given, the copy of the environment class saved during training will be used.
  - env_config (Optional[Dict[str, Any]]) – Configuration parameters to pass to the environment init method.
  - custom_policy_mapping (Optional[Mapping[Hashable, Type[Policy]]]) – Optionally replace agent policies with custom fixed policies.
  - num_workers (Optional[int]) – Number of rollout worker processes to initialise (defaults to 'NUM CPU - 1').
  - num_repeats (int) – Number of rollout repeats to perform, distributed over all workers.
  - checkpoint (Optional[int]) – Checkpoint to use (defaults to most recent).
  - metrics (Optional[Mapping[str, Metric]]) – Optional set of metrics to record and log.
  - record_messages (bool) – If True, the full list of episode messages for each of the rollouts will be recorded. Only applies if save_trajectories is also True.
  - show_progress_bar (bool) – If True, shows a progress bar in the terminal output.
  - policy_inference_batch_size (int) – Number of policy inferences to perform in one go.
- Return type:
- Returns:
A Generator of Rollouts.
Note
It is the user's responsibility to invoke rollouts via the provided phantom command, or to ensure the PYTHONHASHSEED environment variable is set before starting the Python interpreter that runs this code. Not setting this may lead to reproducibility issues.
Policy Evaluation
- phantom.utils.rllib.evaluate_policy(directory, policy_id, obs, explore, batch_size=100, checkpoint=None, show_progress_bar=True)[source]
Evaluates a given pre-trained RLlib policy over a one-or-more-dimensional observation space.
- Parameters:
  - directory (Union[str, Path]) – Results directory containing trained policies. By default, this is located within ~/ray_results/. If LATEST is given as the last element of the path, the parent directory will be scanned for the most recent run and this will be used.
  - policy_id (str) – The ID of the trained policy to evaluate.
  - obs (Any) – The observation space to evaluate the policy with, which can include Range class instances to evaluate the policy over multiple dimensions, in a similar fashion to the ph.utils.rllib.rollout() function.
  - explore (bool) – Parameter passed to the policy.
  - batch_size (int) – Number of observations to evaluate at a time.
  - checkpoint (Optional[int]) – Checkpoint to use (defaults to most recent).
  - show_progress_bar (bool) – If True, shows a progress bar in the terminal output.
- Return type:
- Returns:
A generator of tuples of the form (params, obs, action).
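Because the return value is a generator of (params, obs, action) tuples, results are typically consumed in a loop or comprehension. The following sketch shows one way such a stream might be processed, using a made-up stand-in generator rather than a real trained policy:

```python
def fake_evaluate_policy():
    # Stand-in for phantom.utils.rllib.evaluate_policy: yields one
    # (params, obs, action) tuple per point in the evaluated space.
    # The "price" parameter and the threshold policy are invented
    # purely for illustration.
    for price in (1.0, 2.0, 3.0):
        params = {"price": price}
        obs = [price]
        action = 0 if price < 2.0 else 1
        yield (params, obs, action)

# Collect the action chosen at each evaluated parameter setting.
actions_by_price = {p["price"]: a for p, _, a in fake_evaluate_policy()}
```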
Rollout Trajectories & Steps
- class phantom.utils.rollout.AgentStep(i, observation, reward, done, info, action, stage=None)[source]
Describes a step taken by a single agent in an episode.
- class phantom.utils.rollout.Step(i, observations, rewards, terminations, truncations, infos, actions, messages=None, stage=None)[source]
Describes a step taken in an episode.
- class phantom.utils.rollout.Rollout(rollout_id, repeat_id, env_config, rollout_params, steps, metrics)[source]
- actions_for_agent(agent_id, drop_nones=False, stages=None)[source]
Helper method to filter all actions for a single agent.
- count_actions(stages=None)[source]
Helper method to count the occurrences of all actions for all agents.
- count_agent_actions(agent_id, stages=None)[source]
Helper method to count the occurrences of all actions for a single agent.
- infos_for_agent(agent_id, drop_nones=False, stages=None)[source]
Helper method to filter all ‘infos’ for a single agent.
- observations_for_agent(agent_id, drop_nones=False, stages=None)[source]
Helper method to filter all observations for a single agent.
- rewards_for_agent(agent_id, drop_nones=False, stages=None)[source]
Helper method to filter all rewards for a single agent.
- steps_for_agent(agent_id, stages=None)[source]
Helper method to filter all steps for a single agent.
- terminations_for_agent(agent_id, drop_nones=False, stages=None)[source]
Helper method to filter all ‘terminations’ for a single agent.