Trainers
Phantom provides a simple Trainer interface and class for developing and implementing new learning algorithms. Two example implementations are provided: Q-Learning and Proximal Policy Optimisation (PPO).
Note: This is a new feature in Phantom and is subject to change in the future.
The two implementations should be seen as examples of how to implement a Trainer rather than fully tested and optimised trainers. It is recommended that RLlib or other more mature implementations are used in practice. The example implementations can be found in the examples/trainers directory.
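As a rough sketch of the intended workflow, the following assumes the Q-Learning example trainer plus a hypothetical PhantomEnv subclass and agent class (QLearningTrainer's import path, SupplyChainEnv, and ShopAgent are all illustrative, not part of Phantom's public API):

```python
# Illustrative only: the example trainer lives in examples/trainers and its
# exact module path and constructor arguments may differ.
from qlearning_trainer import QLearningTrainer

from my_project import SupplyChainEnv, ShopAgent  # hypothetical env and agent

trainer = QLearningTrainer(tensorboard_log_dir="./logs")

results = trainer.train(
    env_class=SupplyChainEnv,
    num_iterations=1_000,
    # All ShopAgent instances share the trainer's default policy:
    policies={"shop_policy": ShopAgent},
    policies_to_train=["shop_policy"],
    env_config={"num_steps": 100},
)
```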
Base Trainer
- class phantom.trainer.Trainer(tensorboard_log_dir=None)[source]
Base Trainer class providing interfaces and common functions for subclassed trainers.
Some basic tensorboard logging via tensorboardX is included.
Subclasses must set the policy_class class property and implement either the train() or training_step() method.
- Parameters:
tensorboard_log_dir (Optional[str]) – If provided, will save metrics to the given directory in a format that can be viewed with tensorboard.
Note: These classes and interfaces are new in Phantom and are subject to change in the future.
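For example, a minimal subclass might look like the following sketch (TabularPolicy and the Policy import path are placeholders used for illustration):

```python
from phantom import Policy  # import path assumed for illustration
from phantom.trainer import Trainer


class TabularPolicy(Policy):
    """Placeholder policy class used only for this sketch."""


class MyTrainer(Trainer):
    # Required: the Policy type this trainer creates and updates.
    policy_class = TabularPolicy

    def training_step(self, env, policy_mapping, policies, policies_to_train):
        # One unit of training, invoked repeatedly by the default train():
        # roll out `env` and update each policy in `policies_to_train`.
        ...
```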
- log_vec_metrics(envs)[source]
Logs the metrics set on the trainer from a provided list of envs.
- Return type:
None
- setup_policy_specs_and_mapping(env, policies)[source]
Parses a policy mapping object, validates it against an env instance and returns mappings of AgentID -> PolicyID and PolicyID -> Policy.
Useful when defining custom train() methods, as in the sketch below.
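For instance, a custom train() might use it along these lines (a sketch only; the exact return values and the env construction are assumptions based on the description above):

```python
from phantom.trainer import TrainingResults


def train(self, env_class, num_iterations, policies, policies_to_train,
          env_config=None, metrics=None):
    env = env_class(**(env_config or {}))  # assumes config maps to init kwargs

    # Assumed unpacking: AgentID -> PolicyID and PolicyID -> Policy mappings.
    policy_mapping, policy_instances = self.setup_policy_specs_and_mapping(
        env, policies
    )

    for step in range(num_iterations):
        self.training_step(env, policy_mapping, policy_instances, policies_to_train)

    return TrainingResults(policy_instances)
```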
- tbx_write_scalar(name, value, step)[source]
Writes a custom scalar value to tensorboard.
- Return type:
None
- tbx_write_values(step)[source]
Writes logged metrics and rewards to tensorboardX and flushes the cache.
- Return type:
None
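A custom train() loop might record an arbitrary per-iteration value and then flush, for example (mean_td_error is a placeholder computed by the algorithm, not a Phantom value):

```python
# Inside a custom train() loop:
for step in range(num_iterations):
    self.training_step(env, policy_mapping, policies, policies_to_train)
    self.tbx_write_scalar("trainer/mean_td_error", mean_td_error, step)
    self.tbx_write_values(step)  # flush logged metrics and rewards
```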
- train(env_class, num_iterations, policies, policies_to_train, env_config=None, metrics=None)[source]
Entry point to training.
For some algorithms this implementation is sufficient and only the training_step() method needs to be implemented by the subclass (for example, see the Q-Learning Trainer). For other algorithms it may be necessary to override this implementation (for example, see the PPO Trainer).
- Parameters:
env_class (Type[PhantomEnv]) – The environment class to train the policy/policies with.
num_iterations (int) – The number of units of training, defined by each algorithm, to perform.
policies (Mapping[Hashable, Union[Type[Agent], List[Hashable], Tuple[Type[Policy], Type[Agent]], Tuple[Type[Policy], Type[Agent], Mapping[str, Any]], Tuple[Type[Policy], List[Hashable]], Tuple[Type[Policy], List[Hashable], Mapping[str, Any]]]]) – A mapping of policy IDs to the agents that will use them, along with any configuration options.
policies_to_train (Sequence[Hashable]) – A list of IDs of policies to train (must be of the Policy type related to the Trainer).
env_config (Optional[Mapping[str, Any]]) – Configuration parameters to pass to the environment init method.
metrics (Optional[Mapping[str, Metric]]) – Optional set of metrics to record and log.
- Return type:
TrainingResults
- Returns:
A TrainingResults object containing all policies (including those not trained with the Trainer).
- Policy Mapping Usage:
```python
policies = {
    # Type[Agent]
    # (all agents of this class will use the default policy of the trainer,
    # policy config options are handled by the trainer)
    "PolicyID1": SomeAgentClass,

    # List[AgentID]
    # (all agents with the given IDs will use the default policy of the trainer)
    "PolicyID2": ["Agent1", "Agent2"],

    # Tuple[Type[Policy], Type[Agent]]
    # (all agents of this class will use this custom policy class with no
    # provided config options)
    "PolicyID3": (CustomPolicyClass1, SomeAgentClass),

    # Tuple[Type[Policy], Type[Agent], Mapping[str, Any]]
    # (all agents of this class will use this custom policy class with the
    # provided config options)
    "PolicyID4": (CustomPolicyClass1, SomeAgentClass, {...}),

    # Tuple[Type[Policy], List[AgentID]]
    # (all agents with the given IDs will use this custom policy class with no
    # provided config options)
    "PolicyID5": (CustomPolicyClass1, ["Agent3", "Agent4"]),

    # Tuple[Type[Policy], List[AgentID], Mapping[str, Any]]
    # (all agents with the given IDs will use this custom policy class with the
    # provided config options)
    "PolicyID6": (CustomPolicyClass1, ["Agent5", "Agent6"], {...}),
}
```
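A mapping in any of these forms can be passed directly as the policies argument to train().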
- training_step(env, policy_mapping, policies, policies_to_train)[source]
Performs one unit of policy training.
- Parameters:
env (PhantomEnv) – The environment instance to use.
policy_mapping (Mapping[Hashable, Hashable]) – A mapping of agent IDs to policy IDs.
policies (Mapping[Hashable, Policy]) – A mapping of policy IDs to policy class instances.
policies_to_train (Sequence[Hashable]) – A list of IDs of policies to train.
- Return type:
None
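For orientation, a tabular Q-learning style implementation of this method might look roughly like the sketch below. The env.reset()/env.step() dict-based interface, env.is_done(), and the policy's compute_action()/update() methods are all assumptions made for the sketch, not Phantom's confirmed API:

```python
def training_step(self, env, policy_mapping, policies, policies_to_train):
    # Assumed multi-agent interface: reset()/step() return dicts keyed
    # by agent ID; is_done() signals episode termination (hypothetical).
    observations = env.reset()

    while not env.is_done():
        actions = {
            agent_id: policies[policy_mapping[agent_id]].compute_action(obs)
            for agent_id, obs in observations.items()
        }

        next_observations, rewards, _dones, _infos = env.step(actions)

        for agent_id, reward in rewards.items():
            policy_id = policy_mapping[agent_id]
            if policy_id in policies_to_train:
                # `update` stands in for the algorithm's learning rule.
                policies[policy_id].update(
                    observations[agent_id],
                    actions[agent_id],
                    reward,
                    next_observations[agent_id],
                )

        observations = next_observations
```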
- class phantom.trainer.TrainingResults(policies)[source]
Returned when Trainer.train() is run. By default this contains only the policy objects; Trainer subclasses can extend it to return additional info.
- policies
A mapping of policy IDs to policy objects for all policies, not just trained policies.
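Continuing the earlier sketch, a trained policy can then be pulled out of the results:

```python
# `results` is the TrainingResults returned by trainer.train() above.
shop_policy = results.policies["shop_policy"]
```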