Trainers
Phantom provides a simple Trainer interface and class for developing and implementing new learning algorithms. Two example implementations are provided: Q-Learning and Proximal Policy Optimisation (PPO).
Note: This is a new feature in Phantom and is subject to change in the future.
The two implementations should be seen as examples of how to implement a Trainer rather than fully tested and optimised trainers. It is recommended that RLlib or other more mature implementations are used in practice. The example implementations can be found in the examples/trainers directory.
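As a rough sketch of the intended workflow, the following assumes the Q-Learning example trainer plus a hypothetical PhantomEnv subclass and agent class (QLearningTrainer's import path, SupplyChainEnv, and ShopAgent are all illustrative, not part of Phantom's public API):

```python
# Illustrative only: the example trainer lives in examples/trainers and its
# exact module path and constructor arguments may differ.
from qlearning_trainer import QLearningTrainer

from my_project import SupplyChainEnv, ShopAgent  # hypothetical env and agent

trainer = QLearningTrainer(tensorboard_log_dir="./logs")

results = trainer.train(
    env_class=SupplyChainEnv,
    num_iterations=1_000,
    # All ShopAgent instances share the trainer's default policy:
    policies={"shop_policy": ShopAgent},
    policies_to_train=["shop_policy"],
    env_config={"num_steps": 100},
)
```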
Base Trainer
- class phantom.trainer.Trainer(tensorboard_log_dir=None)[source]
Base Trainer class providing interfaces and common functions for subclassed trainers.
Some basic tensorboard logging via tensorboardX is included.
Subclasses must set the policy_class class property and implement either the train() or training_step() method.
- Parameters:
tensorboard_log_dir (Optional[str]) – If provided, will save metrics to the given directory in a format that can be viewed with tensorboard.
Note: These classes and interfaces are new in Phantom and are subject to change in the future.
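For example, a minimal subclass might look like the following sketch (TabularPolicy and the Policy import path are placeholders used for illustration):

```python
from phantom import Policy  # import path assumed for illustration
from phantom.trainer import Trainer


class TabularPolicy(Policy):
    """Placeholder policy class used only for this sketch."""


class MyTrainer(Trainer):
    # Required: the Policy type this trainer creates and updates.
    policy_class = TabularPolicy

    def training_step(self, env, policy_mapping, policies, policies_to_train):
        # One unit of training, invoked repeatedly by the default train():
        # roll out `env` and update each policy in `policies_to_train`.
        ...
```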
- log_vec_metrics(envs)[source]
Logs the metrics set on the trainer from a provided list of envs.
- Return type:
None
- setup_policy_specs_and_mapping(env, policies)[source]
Parses a policy mapping object, validates it against an env instance and returns mappings of AgentID -> PolicyID and PolicyID -> Policy.
Useful when defining custom train() methods, as in the sketch below.
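For instance, a custom train() might use it along these lines (a sketch only; the exact return values and the env construction are assumptions based on the description above):

```python
from phantom.trainer import TrainingResults


def train(self, env_class, num_iterations, policies, policies_to_train,
          env_config=None, metrics=None):
    env = env_class(**(env_config or {}))  # assumes config maps to init kwargs

    # Assumed unpacking: AgentID -> PolicyID and PolicyID -> Policy mappings.
    policy_mapping, policy_instances = self.setup_policy_specs_and_mapping(
        env, policies
    )

    for step in range(num_iterations):
        self.training_step(env, policy_mapping, policy_instances, policies_to_train)

    return TrainingResults(policy_instances)
```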
- tbx_write_scalar(name, value, step)[source]
Writes a custom scalar value to tensorboard.
- Return type:
None
- tbx_write_values(step)[source]
Writes logged metrics and rewards to tensorboardX and flushes the cache.
- Return type:
None
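A custom train() loop might record an arbitrary per-iteration value and then flush, for example (mean_td_error is a placeholder computed by the algorithm, not a Phantom value):

```python
# Inside a custom train() loop:
for step in range(num_iterations):
    self.training_step(env, policy_mapping, policies, policies_to_train)
    self.tbx_write_scalar("trainer/mean_td_error", mean_td_error, step)
    self.tbx_write_values(step)  # flush logged metrics and rewards
```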
- train(env_class, num_iterations, policies, policies_to_train, env_config=None, metrics=None)[source]
Entry point to training.
For some algorithms this implementation is sufficient and only the training_step() method needs to be implemented by the subclass (for example, see the Q-Learning Trainer). For other algorithms it may be necessary to override this implementation (for example, see the PPO Trainer).
- Parameters:
env_class (Type[PhantomEnv]) – The environment class to train the policy/policies with.
num_iterations (int) – The number of units of training, defined by each algorithm, to perform.
policies (Mapping[Hashable, Union[Type[Agent], List[Hashable], Tuple[Type[Policy], Type[Agent]], Tuple[Type[Policy], Type[Agent], Mapping[str, Any]], Tuple[Type[Policy], List[Hashable]], Tuple[Type[Policy], List[Hashable], Mapping[str, Any]]]]) – A mapping of policy IDs to the agents that will use them, along with any configuration options.
policies_to_train (Sequence[Hashable]) – A list of IDs of policies to train (must be of the Policy type related to the Trainer).
env_config (Optional[Mapping[str, Any]]) – Configuration parameters to pass to the environment init method.
metrics (Optional[Mapping[str, Metric]]) – Optional set of metrics to record and log.
- Return type:
TrainingResults
- Returns:
A TrainingResults object containing all policies (including those not trained with the Trainer).
- Policy Mapping Usage:
```python
policies = {
    # Type[Agent]
    # (all agents of this class will use the default policy of the trainer,
    # policy config options are handled by the trainer)
    "PolicyID1": SomeAgentClass,

    # List[AgentID]
    # (all agents with the given IDs will use the default policy of the trainer)
    "PolicyID2": ["Agent1", "Agent2"],

    # Tuple[Type[Policy], Type[Agent]]
    # (all agents of this class will use this custom policy class with no
    # provided config options)
    "PolicyID3": (CustomPolicyClass1, SomeAgentClass),

    # Tuple[Type[Policy], Type[Agent], Mapping[str, Any]]
    # (all agents of this class will use this custom policy class with the
    # provided config options)
    "PolicyID4": (CustomPolicyClass1, SomeAgentClass, {...}),

    # Tuple[Type[Policy], List[AgentID]]
    # (all agents with the given IDs will use this custom policy class with no
    # provided config options)
    "PolicyID5": (CustomPolicyClass1, ["Agent3", "Agent4"]),

    # Tuple[Type[Policy], List[AgentID], Mapping[str, Any]]
    # (all agents with the given IDs will use this custom policy class with the
    # provided config options)
    "PolicyID6": (CustomPolicyClass1, ["Agent5", "Agent6"], {...}),
}
```
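A mapping in any of these forms can be passed directly as the policies argument to train().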
- training_step(env, policy_mapping, policies, policies_to_train)[source]
Performs one unit of policy training.
- Parameters:
env (PhantomEnv) – The environment instance to use.
policy_mapping (Mapping[Hashable, Hashable]) – A mapping of agent IDs to policy IDs.
policies (Mapping[Hashable, Policy]) – A mapping of policy IDs to policy class instances.
policies_to_train (Sequence[Hashable]) – A list of IDs of policies to train.
- Return type:
None
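For orientation, a tabular Q-learning style implementation of this method might look roughly like the sketch below. The env.reset()/env.step() dict-based interface, env.is_done(), and the policy's compute_action()/update() methods are all assumptions made for the sketch, not Phantom's confirmed API:

```python
def training_step(self, env, policy_mapping, policies, policies_to_train):
    # Assumed multi-agent interface: reset()/step() return dicts keyed
    # by agent ID; is_done() signals episode termination (hypothetical).
    observations = env.reset()

    while not env.is_done():
        actions = {
            agent_id: policies[policy_mapping[agent_id]].compute_action(obs)
            for agent_id, obs in observations.items()
        }

        next_observations, rewards, _dones, _infos = env.step(actions)

        for agent_id, reward in rewards.items():
            policy_id = policy_mapping[agent_id]
            if policy_id in policies_to_train:
                # `update` stands in for the algorithm's learning rule.
                policies[policy_id].update(
                    observations[agent_id],
                    actions[agent_id],
                    reward,
                    next_observations[agent_id],
                )

        observations = next_observations
```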
- class phantom.trainer.TrainingResults(policies)[source]
Returned when Trainer.train() is run. By default this contains only the policy objects; Trainer subclasses can extend it to return additional info.
- policies
A mapping of policy IDs to policy objects for all policies, not just trained policies.
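Continuing the earlier sketch, a trained policy can then be pulled out of the results:

```python
# `results` is the TrainingResults returned by trainer.train() above.
shop_policy = results.policies["shop_policy"]
```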