Tutorial - Part 2
Part 1 of the tutorial showed how to set up a simple Phantom experiment. This next part covers some additional features of Phantom that will help make your experiments even better!
Metrics
At the end of the previous tutorial we demonstrated the use of metrics for recording data from the environment and agents.
Metric values are recorded at the end of every step. When performing training, a single float/integer value must be returned for the whole episode, so a reduction operation must be performed. The default is to take the most recent metric value, i.e. that of the last step in the episode. The values seen in TensorBoard are the average of these reduced values over each batch of episodes.
When performing rollouts, every value for every step is recorded, giving fine grained information on each step in each episode.
In our supply chain example we want to monitor the average amount of stock the shop is holding onto as the experiment progresses. Phantom provides a base Metric class, but for the majority of use cases the provided helper classes SimpleAgentMetric and SimpleEnvMetric are enough.
metrics = {
    "SHOP/stock": SimpleAgentMetric(agent_id="SHOP", agent_property="stock", reduce_action="last"),
}
The SimpleAgentMetric will record the given property on the agent. Similarly, the SimpleEnvMetric records a given property that exists on the environment instance.
As well as the last reduction operation, there is also sum and mean.
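For example, if the environment itself tracked a total_orders value, a similar metric could be recorded with a different reduction. The following is only a sketch: total_orders and the ENV/avg_orders name are illustrative, and SimpleEnvMetric is assumed to take an env_property argument analogous to SimpleAgentMetric's agent_property:

metrics = {
    # Sum of the shop's sales over the episode.
    "SHOP/sales": SimpleAgentMetric(agent_id="SHOP", agent_property="sales", reduce_action="sum"),
    # Mean of an assumed `total_orders` attribute on the environment instance.
    "ENV/avg_orders": SimpleEnvMetric(env_property="total_orders", reduce_action="mean"),
}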
We register metrics using the metrics argument of the ph.utils.rllib.train() function. The name can be whatever the user wants; however, it is sensible to include the name of the agent and the property that is being measured, e.g. SHOP/stock.
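Putting this together, a minimal sketch of passing the metrics dictionary to the training function (with all other arguments elided) looks like:

ph.utils.rllib.train(
    ...
    # Metric values are recorded every step and reduced per episode for TensorBoard.
    metrics=metrics,
    ...
)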
Modular Encoders, Decoders & Reward Functions
So far we have used the decode_action(), encode_observation() and compute_reward() methods in our ShopAgent definition. However, Phantom also provides an alternative set of interfaces for more advanced use cases. We can create custom Encoder, Decoder and RewardFunction classes that perform the same functionality and attach them to agents.
This provides two key benefits:
Code reuse - Functionality that is shared across multiple agent types only has to be implemented once.
Composability - Using the ChainedEncoder and ChainedDecoder classes we can cleanly combine multiple encoders and decoders into complex objects, whilst keeping the individual functionality of each sub-encoder separate.
Phantom StrategicAgents will first check to see if a custom decode_action(), encode_observation() or compute_reward() method has been implemented on the class. If not, the agent will then check to see if a custom Encoder, Decoder or RewardFunction class has been provided for the agent. If neither is provided for any of the three, an exception will be raised!
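As a rough illustration of the modular style, the sketch below shows what a small stand-alone encoder might look like. The exact Encoder interface (the observation_space property, the encode() method and the import paths) should be checked against the Phantom API reference; treat the names here as assumptions:

# Sketch only: assumes Encoder subclasses provide an `observation_space`
# property and an `encode(ctx)` method -- check the Phantom docs for the
# exact interface and import paths.
class StockObservationEncoder(ph.Encoder):
    @property
    def observation_space(self):
        return gym.spaces.Box(low=0.0, high=1.0, shape=(1,))

    def encode(self, ctx: ph.Context):
        return np.array([ctx.agent.stock / SHOP_MAX_STOCK], dtype=np.float32)

# Several such encoders can then be combined, e.g.:
# encoder = ChainedEncoder([StockObservationEncoder(), ...])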
Let's say we want to introduce a second type of ShopAgent, one with a different type of reward function. This new ShopAgent may not be concerned about the amount of missed sales it has.
One option is to copy the entire ShopAgent and edit its compute_reward() method. However, a better option is to remove the compute_reward() method from the ShopAgent, create two different RewardFunction objects and initialise each type of agent with one:
class ShopRewardFunction(ph.RewardFunction):
    def reward(self, ctx: ph.Context) -> float:
        # Reward sales, but penalise holding onto excess stock.
        return ctx.agent.sales - 0.1 * ctx.agent.stock

class SimpleShopRewardFunction(ph.RewardFunction):
    def reward(self, ctx: ph.Context) -> float:
        # Reward sales only; excess stock is not penalised.
        return ctx.agent.sales
Note that we now access the ShopAgent's state through the ctx.agent property.
We modify our ShopAgent class so that it takes a RewardFunction object as an initialisation parameter and passes it to the underlying Phantom StrategicAgent class.
class ShopAgent(ph.Agent):
    def __init__(self, agent_id: str, factory_id: str, reward_function: ph.RewardFunction):
        super().__init__(agent_id, reward_function=reward_function)

        ...
Next we modify our SupplyChainEnv
to allow the creation of a mix of shop types:
NUM_SHOPS_TYPE_1 = 1
NUM_SHOPS_TYPE_2 = 1

class SupplyChainEnv(ph.PhantomEnv):
    def __init__(self):
        ...

        shop_t1_ids = [f"SHOP_T1_{i+1}" for i in range(NUM_SHOPS_TYPE_1)]
        shop_t2_ids = [f"SHOP_T2_{i+1}" for i in range(NUM_SHOPS_TYPE_2)]

        shop_ids = shop_t1_ids + shop_t2_ids

        ...

        shop_agents = [
            ShopAgent(sid, factory_id, ShopRewardFunction())
            for sid in shop_t1_ids
        ] + [
            ShopAgent(sid, factory_id, SimpleShopRewardFunction())
            for sid in shop_t2_ids
        ]

        ...
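Since the two shop types now optimise different rewards, it is usually sensible to train them with separate policies rather than a single shared one. The sketch below assumes the train function's policies argument accepts a mapping from a policy name to the agent IDs that share it; the exact accepted forms should be checked against the Phantom RLlib documentation:

ph.utils.rllib.train(
    ...
    # Assumed form: each policy name maps to the agent IDs that share that policy.
    policies={
        "shop_t1_policy": [f"SHOP_T1_{i+1}" for i in range(NUM_SHOPS_TYPE_1)],
        "shop_t2_policy": [f"SHOP_T2_{i+1}" for i in range(NUM_SHOPS_TYPE_2)],
    },
    ...
)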
Types & Supertypes
Now let's say we want to train a single, well-rounded policy that works across a range of reward functions, each of which slightly modifies the weight of the stock factor. Doing this manually would be cumbersome. Instead, we can use the Phantom types and supertypes feature.
For the ShopAgent we define a nested class named Supertype that inherits from the ph.Supertype class and defines the supertype of the agent. In our case this only contains the excess_stock_weight parameter we want to vary. When defining our supertype it is good practice to give all fields a default value!
MAX_EXCESS_STOCK_WEIGHT = 0.2

class ShopAgent(ph.Agent):
    @dataclass(frozen=True)
    class Supertype(ph.Supertype):
        excess_stock_weight: float = 0.1
We no longer need to pass in a custom RewardFunction class to the ShopAgent:
def __init__(self, agent_id: str, factory_id: str):
    super().__init__(agent_id)

    ...
As we are using the RLlib backend to train, we don't need to provide the ShopAgent with the new supertype ourselves; this is handled by the included training and evaluation functions and allows the use of Samplers and Ranges.
In this example, for the sake of simplicity, we go back to using the compute_reward() method on the ShopAgent. We modify it to take the excess_stock_weight value from the agent's type:
def compute_reward(self, ctx: ph.Context) -> float:
    # We reward the agent for making sales.
    # We penalise the agent for holding onto excess stock.
    return self.sales - self.type.excess_stock_weight * self.stock
We also need to modify the ShopAgent's observation space to include its type values. This is key to allowing the ShopAgent to learn a generalised policy.
def __init__(self, agent_id: str, factory_id: str):
    ...

    # = [Stock, Sales, Missed Sales, Type.Excess Stock Weight]
    self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(4,))

    ...

def encode_observation(self, ctx: ph.Context):
    max_sales_per_step = NUM_CUSTOMERS * CUSTOMER_MAX_ORDER_SIZE

    return np.array(
        [
            self.stock / SHOP_MAX_STOCK,
            self.sales / max_sales_per_step,
            self.missed_sales / max_sales_per_step,
            self.type.excess_stock_weight / MAX_EXCESS_STOCK_WEIGHT,
        ],
        dtype=np.float32,
    )
Alternatively, the observation space can be defined as a property that combines the agent's type values with the shop's stock directly:

@property
def observation_space(self):
    return gym.spaces.Tuple(
        [
            # We include the agent's type in its observation space to allow it to learn
            # a generalised policy.
            self.type.to_obs_space(),
            # We also encode the shop's current stock in the observation.
            gym.spaces.Discrete(SHOP_MAX_STOCK + 1),
        ]
    )
To sample from a distribution of values for the supertypes whilst training, we add the agent_supertypes argument to the train function:
ph.utils.rllib.train(
    ...
    agent_supertypes={
        "SHOP1": {"excess_stock_weight": UniformFloatSampler(0.0, MAX_EXCESS_STOCK_WEIGHT)},
        "SHOP2": {"excess_stock_weight": UniformFloatSampler(0.0, MAX_EXCESS_STOCK_WEIGHT)},
    },
    ...
)
At the start of each episode in training, each shop agent’s excess_stock_weight
type value will be independently sampled from a random uniform distribution between 0.0
and 0.2.
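The same mechanism can be used for evaluation rollouts, but with a Range instead of a Sampler so that the parameter is swept over a fixed grid of values rather than sampled randomly. The sketch below assumes a rollout helper and a linearly spaced range type exist alongside train(); the exact names and signatures should be checked against the Phantom documentation:

# Sketch only: `rollout` and `LinspaceRange` are assumed counterparts of
# `train` and the samplers -- check the Phantom docs for the exact names.
ph.utils.rllib.rollout(
    ...
    agent_supertypes={
        "SHOP1": {"excess_stock_weight": LinspaceRange(0.0, MAX_EXCESS_STOCK_WEIGHT, n=5)},
        "SHOP2": {"excess_stock_weight": LinspaceRange(0.0, MAX_EXCESS_STOCK_WEIGHT, n=5)},
    },
    ...
)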
The supertype system in Phantom is very powerful. To see a full guide to its features see the Supertypes page.