Tutorial - Part 2
Part 1 of the tutorial showed how to set up a simple Phantom experiment. This next part covers some additional features of Phantom that will help make your experiments even better!
Metrics
In the latter part of the previous tutorial setup we demonstrated the use of metrics for recording data from the environment and agents.
Metric values are recorded at the end of every step. During training, a single float or integer value must be returned for the whole episode, so a reduction operation is applied to the per-step values. The default is to take the most recent metric value, i.e. that of the last step in the episode. The values seen in TensorBoard are the average of these reduced values over each batch of episodes.
When performing rollouts, every value for every step is recorded, giving fine-grained information on each step in each episode.
In our supply chain example we want to monitor the average amount of stock the shop is
holding onto as the experiment progresses. Phantom provides a base Metric class,
but for the majority of use cases the provided helper classes SimpleAgentMetric
and SimpleEnvMetric are enough.
metrics = {
    "SHOP/stock": SimpleAgentMetric(agent_id="SHOP", agent_property="stock", reduce_action="last"),
}
The SimpleAgentMetric will record the given property on the agent. Similarly
the SimpleEnvMetric records a given property that exists on the environment
instance.
As well as the last reduction operation, there are also sum and mean.
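The reduction behaviour described above can be illustrated with a minimal pure-Python sketch. It does not use Phantom: the reduce_steps helper and the sample stock values are hypothetical, and only mirror the last, sum and mean operations named in the text, followed by the per-batch averaging of the reduced values.

```python
from statistics import mean


def reduce_steps(values, reduce_action="last"):
    """Collapse one episode's per-step metric values into a single number.

    Hypothetical helper for illustration; not part of Phantom.
    """
    if reduce_action == "last":
        return values[-1]
    if reduce_action == "sum":
        return sum(values)
    if reduce_action == "mean":
        return mean(values)
    raise ValueError(f"Unknown reduce_action: {reduce_action!r}")


# Made-up per-step stock levels for a batch of two episodes.
episodes = [[10, 8, 6], [10, 9, 2]]

# Training: reduce each episode to one value, then average over the batch.
per_episode_last = [reduce_steps(ep, "last") for ep in episodes]
batch_average = mean(per_episode_last)

# Rollouts: every per-step value is kept, so no reduction is needed.
rollout_values = episodes
```

With the "last" reduction, only the final stock level of each episode contributes to the batch average; "mean" would be the closer match if the quantity of interest is the average stock over an episode.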
We register metrics using the metrics argument of the
ph.utils.rllib.train() function. The name can be whatever the user wants; however,
it is sensible to include the name of the agent and the property that is being measured,
e.g. SHOP/stock.
Modular Encoders, Decoders & Reward Functions
So far we have used the decode_action(), encode_observation() and
compute_reward() methods in our ShopAgent definition. However Phantom
also provides an alternative set of interfaces for more advanced use cases. We can
create custom Encoder, Decoder and RewardFunction classes
that perform the same functionality and attach them to agents.
This provides two key benefits:
Code reuse - Functionality that is shared across multiple agent types only has to be implemented once.
Composability - Using the ChainedEncoder and ChainedDecoder classes we can cleanly combine multiple encoders and decoders into complex objects, whilst keeping the individual functionality of each sub-encoder separated.
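To make the composability idea concrete, here is a minimal sketch in plain Python. It does not use Phantom's ChainedEncoder; ToyChainedEncoder, StockEncoder, SalesEncoder and the dictionary-based agent state are all hypothetical stand-ins that only illustrate how small, single-purpose encoders can be combined while staying separate.

```python
class StockEncoder:
    """Hypothetical encoder: stock level scaled to [0, 1]."""

    def __init__(self, max_stock):
        self.max_stock = max_stock

    def encode(self, agent_state):
        return agent_state["stock"] / self.max_stock


class SalesEncoder:
    """Hypothetical encoder: sales scaled to [0, 1]."""

    def __init__(self, max_sales):
        self.max_sales = max_sales

    def encode(self, agent_state):
        return agent_state["sales"] / self.max_sales


class ToyChainedEncoder:
    """Combines several encoders; each one keeps its own logic."""

    def __init__(self, encoders):
        self.encoders = list(encoders)

    def encode(self, agent_state):
        # The combined observation is a tuple with one entry per sub-encoder.
        return tuple(enc.encode(agent_state) for enc in self.encoders)


encoder = ToyChainedEncoder([StockEncoder(max_stock=100), SalesEncoder(max_sales=20)])
observation = encoder.encode({"stock": 50, "sales": 5})
```

Because each sub-encoder is an independent object, the same StockEncoder could be reused by any other agent type that tracks stock.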
Phantom StrategicAgents will first check to see if a custom
decode_action(), encode_observation() or compute_reward() method
has been implemented on the class. If not, the agent will then check to see if a custom
Encoder, Decoder or RewardFunction class has been provided
for the agent. If neither is provided for any of the three, an exception will be raised!
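The lookup order described above can be sketched in plain Python. This is not Phantom's implementation; ToyStrategicAgent, ConstantReward and OverridingAgent are hypothetical names, and the sketch only shows one simple way such a fallback can be arranged: a subclass override takes precedence, otherwise the supplied object is used, otherwise an exception is raised.

```python
class ToyStrategicAgent:
    """Hypothetical stand-in for an agent base class (not Phantom's code)."""

    def __init__(self, agent_id, reward_function=None):
        self.agent_id = agent_id
        self.reward_function = reward_function

    def compute_reward(self, ctx):
        # Reached only when a subclass has not overridden compute_reward().
        if self.reward_function is None:
            raise NotImplementedError(
                f"{self.agent_id}: no compute_reward() override and no reward function"
            )
        return self.reward_function.reward(ctx)


class ConstantReward:
    """Hypothetical reward function object that always returns 2.0."""

    def reward(self, ctx):
        return 2.0


class OverridingAgent(ToyStrategicAgent):
    def compute_reward(self, ctx):
        # The method on the class wins, even if a reward function is supplied.
        return 1.0


via_method = OverridingAgent("A", ConstantReward()).compute_reward(ctx=None)
via_object = ToyStrategicAgent("B", ConstantReward()).compute_reward(ctx=None)

try:
    ToyStrategicAgent("C").compute_reward(ctx=None)
    raised = False
except NotImplementedError:
    raised = True
```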
Let's say we want to introduce a second type of ShopAgent, one with a different
type of reward function – this new ShopAgent may not be concerned about the
amount of missed sales it has.
One option is to copy the entire ShopAgent and edit its
compute_reward() method. However a better option is to remove the
compute_reward() method from the ShopAgent and create two different
RewardFunction objects and initialise each type of agent with one:
class ShopRewardFunction(ph.RewardFunction):
    def reward(self, ctx: ph.Context) -> float:
        return ctx.agent.sales - 0.1 * ctx.agent.stock

class SimpleShopRewardFunction(ph.RewardFunction):
    def reward(self, ctx: ph.Context) -> float:
        return ctx.agent.sales
Note that we now access the ShopAgent’s state through the ctx.agent
property.
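As a worked example of how the two reward functions differ, the sketch below evaluates both formulas on the same made-up agent state. It runs without Phantom: the classes here are plain stand-ins that do not inherit from ph.RewardFunction, and the context is imitated with SimpleNamespace, so only the arithmetic is being illustrated.

```python
from types import SimpleNamespace


class ShopRewardFunction:
    """Stand-in: rewards sales, penalises held stock."""

    def reward(self, ctx):
        return ctx.agent.sales - 0.1 * ctx.agent.stock


class SimpleShopRewardFunction:
    """Stand-in: rewards sales only."""

    def reward(self, ctx):
        return ctx.agent.sales


# Made-up state: the shop sold 5 units and still holds 10 in stock.
ctx = SimpleNamespace(agent=SimpleNamespace(sales=5, stock=10))

full_reward = ShopRewardFunction().reward(ctx)          # 5 - 0.1 * 10
simple_reward = SimpleShopRewardFunction().reward(ctx)  # 5
```

For the same state, the first function gives 4.0 and the second gives 5, so the second type of shop has no incentive to keep its stock low.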
We modify our ShopAgent class so that it takes a RewardFunction object
as an initialisation parameter and passes it to the underlying Phantom
StrategicAgent class.
class ShopAgent(ph.StrategicAgent):
    def __init__(self, agent_id: str, factory_id: str, reward_function: ph.RewardFunction):
        super().__init__(agent_id, reward_function=reward_function)
        ...
Next we modify our SupplyChainEnv to allow the creation of a mix of shop types:
NUM_SHOPS_TYPE_1 = 1
NUM_SHOPS_TYPE_2 = 1

class SupplyChainEnv(ph.PhantomEnv):
    def __init__(self):
        ...
        shop_t1_ids = [f"SHOP_T1_{i+1}" for i in range(NUM_SHOPS_TYPE_1)]
        shop_t2_ids = [f"SHOP_T2_{i+1}" for i in range(NUM_SHOPS_TYPE_2)]
        shop_ids = shop_t1_ids + shop_t2_ids
        ...
        # Each shop type is initialised with its own reward function.
        shop_agents = [
            ShopAgent(sid, factory_id, ShopRewardFunction())
            for sid in shop_t1_ids
        ] + [
            ShopAgent(sid, factory_id, SimpleShopRewardFunction())
            for sid in shop_t2_ids
        ]
        ...
Types & Supertypes
Now let’s say we want to develop a well-rounded policy during training that works
with a range of reward functions, each of which slightly modifies the weight of the
stock factor. Doing this manually would be cumbersome. Instead we can use the
Phantom types and supertypes feature.
For the ShopAgent we define a nested class named Supertype that inherits from
the ph.Supertype class. This class defines the supertype of the agent. In our
case it only contains the excess_stock_weight parameter we want to vary. When
defining our supertype it is good practice to give all fields a default value!
from dataclasses import dataclass

MAX_EXCESS_STOCK_WEIGHT = 0.2

class ShopAgent(ph.StrategicAgent):
    @dataclass(frozen=True)
    class Supertype(ph.Supertype):
        excess_stock_weight: float = 0.1
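The effect of the frozen dataclass and the default value can be seen with a plain standard-library dataclass. The sketch below is an assumption-laden stand-in: ToySupertype does not inherit from ph.Supertype, so it only demonstrates the dataclass behaviour (default construction and immutability), not anything Phantom adds on top.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ToySupertype:
    # A default value means the class can be built with no arguments.
    excess_stock_weight: float = 0.1


default_type = ToySupertype()
custom_type = ToySupertype(excess_stock_weight=0.2)

# frozen=True makes instances read-only after construction.
try:
    default_type.excess_stock_weight = 0.5
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Giving every field a default means an agent can still be constructed when no explicit supertype values are supplied.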
We no longer need to pass in a custom RewardFunction class to the
ShopAgent:
    def __init__(self, agent_id: str, factory_id: str):
        super().__init__(agent_id)
        ...
As we are using the RLlib backend to train, we don’t need to provide the
ShopAgent with the new supertype ourselves. This is handled by the included training and
evaluation functions, which also allows the use of Samplers and Ranges.
In this example for the sake of simplicity we go back to using the
compute_reward() method on the ShopAgent. We modify it to take the
excess_stock_weight value from the agent’s type:
    def compute_reward(self, ctx: ph.Context) -> float:
        # We reward the agent for making sales.
        # We penalise the agent for holding onto excess stock.
        return self.sales - self.type.excess_stock_weight * self.stock
We also need to modify the ShopAgent’s observation space to include its type
values. This is key to allowing the ShopAgent to learn a generalised policy.
    def __init__(self, agent_id: str, factory_id: str):
        ...
        # = [Stock, Sales, Missed Sales, Type.Excess Stock Weight]
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(4,))
        ...

    def encode_observation(self, ctx: ph.Context):
        max_sales_per_step = NUM_CUSTOMERS * CUSTOMER_MAX_ORDER_SIZE
        return np.array(
            [
                self.stock / SHOP_MAX_STOCK,
                self.sales / max_sales_per_step,
                self.missed_sales / max_sales_per_step,
                self.type.excess_stock_weight / MAX_EXCESS_STOCK_WEIGHT,
            ],
            dtype=np.float32,
        )
    @property
    def observation_space(self):
        return gym.spaces.Tuple(
            [
                # We include the agent's type in its observation space to allow it to learn
                # a generalised policy.
                self.type.to_obs_space(),
                # We also encode the shop's current stock in the observation.
                gym.spaces.Discrete(SHOP_MAX_STOCK + 1),
            ]
        )
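As a worked example of the normalisation performed in encode_observation() above, the sketch below computes the four observation entries by hand. The constant values (NUM_CUSTOMERS, CUSTOMER_MAX_ORDER_SIZE, SHOP_MAX_STOCK) and the shop state are assumptions chosen for illustration, since their real values come from Part 1, and a plain list is used in place of a NumPy array so the example has no dependencies.

```python
# Assumed constants for illustration only; the real values are set in Part 1.
NUM_CUSTOMERS = 5
CUSTOMER_MAX_ORDER_SIZE = 5
SHOP_MAX_STOCK = 100
MAX_EXCESS_STOCK_WEIGHT = 0.2

# Made-up shop state and type value.
stock, sales, missed_sales = 50, 10, 5
excess_stock_weight = 0.1

max_sales_per_step = NUM_CUSTOMERS * CUSTOMER_MAX_ORDER_SIZE  # 25

# Each entry is divided by its maximum so that it falls within [0, 1],
# matching the bounds of the Box observation space.
observation = [
    stock / SHOP_MAX_STOCK,
    sales / max_sales_per_step,
    missed_sales / max_sales_per_step,
    excess_stock_weight / MAX_EXCESS_STOCK_WEIGHT,
]

within_bounds = all(0.0 <= value <= 1.0 for value in observation)
```

Dividing the type value by MAX_EXCESS_STOCK_WEIGHT keeps the last entry on the same scale as the others, whichever weight is sampled for the episode.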
To sample from a distribution of values for the supertypes whilst training we add the
agent_supertypes argument to the train function:
ph.utils.rllib.train(
    ...
    agent_supertypes={
        "SHOP1": {"excess_stock_weight": UniformFloatSampler(0.0, MAX_EXCESS_STOCK_WEIGHT)},
        "SHOP2": {"excess_stock_weight": UniformFloatSampler(0.0, MAX_EXCESS_STOCK_WEIGHT)},
    },
    ...
)
At the start of each episode in training, each shop agent’s excess_stock_weight
type value will be independently sampled from a random uniform distribution between 0.0
and 0.2.
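The per-episode sampling described above can be imitated with the standard library. This is only a sketch of the idea, not of Phantom's sampler machinery: sample_supertypes is a hypothetical helper, and random.Random.uniform stands in for UniformFloatSampler.

```python
import random

MAX_EXCESS_STOCK_WEIGHT = 0.2


def sample_supertypes(agent_ids, rng):
    """Draw an independent excess_stock_weight for each agent.

    Hypothetical helper for illustration; not part of Phantom.
    """
    return {
        agent_id: {"excess_stock_weight": rng.uniform(0.0, MAX_EXCESS_STOCK_WEIGHT)}
        for agent_id in agent_ids
    }


rng = random.Random(0)  # Seeded so that repeated runs draw the same values.

# One fresh draw per agent at the start of each of three episodes.
episode_types = [sample_supertypes(["SHOP1", "SHOP2"], rng) for _ in range(3)]

for episode, types in enumerate(episode_types):
    print(episode, types)
```

Each shop receives its own draw in every episode, so over training the policy is exposed to the whole range of weights rather than a single fixed value.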
The supertype system in Phantom is very powerful. For a full guide to its features, see the Supertypes page.