Deep Reinforcement Learning for the Cannabis Retail Market

Micheal Lanham
Sep 4, 2018

The legalization of marijuana has brought many first-time retailers looking for that golden opportunity, only to soon realize that government taxation and heavy competition, often just down the block, quickly shrink those opportunities and profits. In short, many in the early legal states (Colorado, for instance) soon realized that shrinking profit margins required a bit more intelligence in what and how to sell. Fortunately, the cannabis industry happened to be emerging during a new tech explosion, known as machine learning and data science, which in turn produced startups looking to provide insight and analytics to those now struggling retailers. At one such startup, Zefyr, I helped produce GoZefyr.com, a cannabis spatial analytics search engine. That work later led me to develop several machine learning methods, the more interesting of which I present below.


Why Reinforcement Learning?

At Zefyr, I had access to virtually any form of data you would want for analyzing retail: point of sale purchases, shipping shrinkage, customer demographics and, of course, dispensary menu listings. Yet what we found is that regular machine learning, the kind where you label or classify data, just would not fit globally. You could build a model for Washington state, but that model would not carry over to Colorado, and vice versa. In fact, in many states a model would not even predict well across cities. We did eventually determine that demographics play a bigger hand. Yet it was about more than identifying big sellers like edibles for Colorado, flower for Washington and concentrates for California. It was about giving the little guy, the single dispensary owner, the power to stand out and compete. With that goal, I looked to games and simulation, and of course ended up at reinforcement learning.

Reinforcement learning is a form of learning that, unlike supervised learning, does not require the data to be pre-labelled before it is presented. It is the form of machine learning most often associated with general artificial intelligence because the goal is to train a computer much the same way we train a dog, yes a dog, with rewards and everything. In RL, the reward is the goal of the computer or agent, not, I repeat not, the immediate action or prediction. This means that it is often a collection of actions or predictions that provides the agent with a reward. Nothing is better than a quick and powerful demonstration of RL in action. Below is an example of the Google DeepMind agent playing the classic Atari game Breakout.

Demonstration of RL playing an Atari game

After looking at RL for some time I realized we were looking at the problem backwards. It wasn’t the dispensary but rather the customer that could be modelled using RL, with the goal of determining optimum customer happiness or reward. Being able to identify satisfied or unsatisfied consumer markets would be extremely beneficial to retailers. Therefore, all I needed was a system to reward a consumer based on their purchases over a time period; enter reinforcement learning.

Since reinforcement learning is somewhat time dependent, I wanted the purchases evaluated over a period, perhaps a week or a month, and decided that each purchase the agent makes would be considered an action. Each purchase would be evaluated later, summed, and the total reward used as the value to train the customer agent against, with the reward for each purchase determined by a random Monte Carlo selection process I will discuss further below.
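To make the structure of a run concrete, here is a minimal sketch of that loop with throwaway stand-ins for the agent, the purchase and the evaluation; none of the names come from the actual system.

```python
import random

def simulate_run(act, execute, evaluate, learn, days=30):
    """Skeleton of one run: buy every day, then score all purchases at the end."""
    purchases = [execute(act(day)) for day in range(days)]
    total_reward = sum(evaluate(p) for p in purchases)
    learn(total_reward)                          # feed the summed reward back for training
    return total_reward

# Trivial stand-ins so the skeleton runs end to end:
total = simulate_run(
    act=lambda day: random.randrange(3),         # pick one of 3 dispensaries at random
    execute=lambda a: {"dispensary": a},         # record the purchase
    evaluate=lambda p: random.random(),          # placeholder 0-1 purchase reward
    learn=lambda r: None,                        # no-op; the real agent would train here
)
print(total)
```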

Reinforcement Learning in a Nutshell or so

Reinforcement learning is really about describing the path an agent takes to receive a reward. Along the way the agent will encounter many things, what we call the environment or state in RL. State is often described as the agent’s observations, and in the case of the video example above, actual game screenshots. This is what we use as input to our RL model, and yes, it often is very raw input. The output of an RL model is the action the agent will take, and in this case the action is the purchase an agent makes. Finally, the reward is used to provide feedback to the agent in order to refine its decisions in the future. A far better introduction to deep reinforcement learning is given in the video below; it may be worth a quick diversion if I start to make less and less sense.

A quick primer on reinforcement learning

The Technical Stuff

The discussion below is still intended to be simple and digestible by most people, but several new terms will be introduced quickly and with no background. If some terms are foreign, Google them. Most of the algorithms I discuss here have been around for a while and information is readily available in all forms. Unfortunately, I am unable to provide my early samples due to concerns over data.

Determining Observations and State

For the first few versions I used a model with a simplistic DQN (an early form of DRL; watch the video for more) using a static state described by 3 dispensaries located on a 5x5 grid. Each dispensary is identified by its spot on the grid and its menu across 3 or 4 distinct categories (flower (sativa/hybrid), edible, concentrate), with each menu category representing a selection of items ordered by price. Each item is represented by its price and the normal distributions that best match that product’s rating and amount of THC, each given by a mean and standard deviation.

A sample representation of input state for one dispensary
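To make that concrete, here is a rough sketch of what one dispensary’s slice of the state might look like; the field names and numbers are illustrative assumptions, not the real schema.

```python
import numpy as np

# Illustrative only: one dispensary's slice of the input state.
dispensary = {
    "grid_xy": (2, 4),                      # location on the 5x5 grid
    "menu": {
        "flower": [                         # sativa/hybrid lumped under flower
            # (price, rating_mean, rating_std, thc_mean, thc_std)
            (35.0, 4.2, 0.6, 0.21, 0.03),
            (50.0, 4.6, 0.4, 0.25, 0.02),
        ],
        "edible":      [(20.0, 3.9, 0.8, 0.10, 0.04)],
        "concentrate": [(60.0, 4.4, 0.5, 0.70, 0.05)],
    },
}

def flatten_dispensary(d, items_per_category=2):
    """Flatten one dispensary into a fixed-length vector for the network."""
    vec = list(d["grid_xy"])
    for cat in ("flower", "edible", "concentrate"):
        items = sorted(d["menu"].get(cat, []))[:items_per_category]   # order by price
        items += [(0.0,) * 5] * (items_per_category - len(items))     # pad empty slots
        for item in items:
            vec.extend(item)
    return np.array(vec, dtype=np.float32)

print(flatten_dispensary(dispensary).shape)   # (32,) with these placeholder sizes
```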

Enter Monte Carlo Methods

In 2005 or so I worked on developing a risk analysis model for a wellbore geomechanics software package called STABView. Quantitative risk analysis, as it is called, is based on using probability distributions to model the inputs and provide answers in the form of a probability or risk. The underlying math, Markov chain Monte Carlo methods, coincidentally shares a lot in common with the foundations of RL math. Monte Carlo methods are an excellent way to model uncertainty, which is rampant when discussing customer ratings or the amount of THC found in a joint. Intuitively, our brains model these distributions all the time when making decisions, so it seemed natural to present the same distributions to the model, since the state will be fed into a deep neural network that can learn to model against those distributions. As for building the distributions, that is simply done using descriptive statistics on online customer ratings and various reference sources for THC values. In many cases the THC values were estimated, but this only reinforced the benefit of using Monte Carlo methods.
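As a small illustration, drawing a single Monte Carlo sample for an item’s rating and THC content might look like the sketch below; the distribution parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_item_outcome(rating_mean, rating_std, thc_mean, thc_std):
    """Draw one plausible 'experience' of an item from its normal distributions,
    clipped to sensible ranges (ratings 0-5, THC as a 0-1 fraction)."""
    rating = np.clip(rng.normal(rating_mean, rating_std), 0.0, 5.0)
    thc = np.clip(rng.normal(thc_mean, thc_std), 0.0, 1.0)
    return rating, thc

# e.g. an item with a 4.2-star average rating and roughly 21% THC
print(sample_item_outcome(4.2, 0.6, 0.21, 0.03))
```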

Modelling Consumer Demographics

In most RL systems a single agent is used to solve the problem, but in this case I use multiple agents, each representing a customer. All of the agents in the model share the same RL brain. Each square in the 5x5 map grid from above holds several agents based on that square’s population, and each square represents a demographic postal code region. In fact, in the real Zefyr data we model customers using real demographic data. This demographic data determines the amount of money a customer has to spend as well as their likely preferences. Again, to be fair, all of these values are modelled as distributions, which are used to evaluate an agent’s remaining wealth and denote a preference. As part of the input state we then add the agent’s remaining wealth after each buy/action and a value representing the customer’s demographic region. We base our demographic model on the ESRI Tapestry dataset, which also happens to be generated using ML techniques.
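A hedged sketch of how agents could be spawned from per-square demographic distributions follows; the numbers and field names are placeholders, and the real system derives these values from the ESRI Tapestry segments rather than hand-set constants.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-square demographic parameters (placeholder values only).
demographics = {
    (0, 0): {"population": 40, "income_mean": 120.0, "income_std": 35.0, "pref": 0.2},
    (3, 3): {"population": 25, "income_mean": 300.0, "income_std": 80.0, "pref": 0.7},
}

def spawn_agents(square):
    """Create one agent per person in the square, sampling their spending money."""
    d = demographics[square]
    agents = []
    for _ in range(d["population"]):
        wealth = max(0.0, rng.normal(d["income_mean"], d["income_std"]))
        agents.append({"square": square, "wealth": wealth, "pref": d["pref"]})
    return agents

print(len(spawn_agents((0, 0))))   # 40 agents, all sharing the same RL brain
```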

Making the Buy, How it Works

With the raw state modelled, I built the rest of the model using a simple deep Q network. Without getting into specifics, the model took as input the flattened state and processed it through a few hidden layers to spit out the action in the form of 3 outputs representing the dispensary, menu category and item. Each agent/consumer was modelled against the same network but varied by demographics. Therefore, at the start of each simulation run, which lasted 7 or 30 virtual days, each agent was given a randomly selected sum of money based on their demographic (this is the Monte Carlo part). This sum represented their available funds for buying product, and the agent would receive no more money until after the run. During each day, the agent was fed the menu state, including their remaining income and demographic preference, and generated the outputs representing an action or buy. The buy is made for the agent, deducting the purchase amount and any travel expenses. The agent is charged a cost for each grid square it travels to reach a location; for instance, an agent in square 1,1 would need to pay 2x travel cost units in order to reach a dispensary at square 3,3. In the original example I used Denver as my test case. At this point, if the customer did not have enough money to make a successful buy, I penalized the agent with a -0.1 reward. This is done in order to encourage the agent to purchase product successfully every day. Any such small penalty is fed back to the agent immediately, but the total evaluation of purchases is done later.
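The article does not show the original architecture, but a minimal Keras sketch along the lines described, with three Q-value heads for dispensary, menu category and item, might look like this; all layer sizes and counts are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder sizes: 3 dispensaries of 32 values each (see the state sketch above),
# plus remaining wealth and a demographic code.
STATE_SIZE = 3 * 32 + 2
N_DISPENSARIES, N_CATEGORIES, N_ITEMS = 3, 3, 8

state_in = layers.Input(shape=(STATE_SIZE,))
x = layers.Dense(128, activation="relu")(state_in)
x = layers.Dense(64, activation="relu")(x)

# Three output heads: which dispensary, which menu category, which item.
# Each head predicts Q-values over its choices.
q_dispensary = layers.Dense(N_DISPENSARIES, name="dispensary")(x)
q_category = layers.Dense(N_CATEGORIES, name="category")(x)
q_item = layers.Dense(N_ITEMS, name="item")(x)

model = tf.keras.Model(state_in, [q_dispensary, q_category, q_item])
model.compile(optimizer="adam", loss="mse")   # Q-learning targets regressed with MSE
model.summary()
```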

Evaluating Rewards, Pleasing the Consumer

The evaluation of all purchases is done at the end of the run, allowing the consumer agent to wait for the total deferred reward and thus providing better cross-product comparison. For each buy, a review and THC value are randomly selected using Monte Carlo methods based on the chosen item. These values are then combined with the customer demographic and normalized to produce a purchase reward. The mean is taken over all purchase rewards, giving a total average reward that is fed back to the DQN for training. The secret sauce in this, and the part I won’t get into, is the evaluation of the purchase reward, which ends up being a number from 0 to 1, with 1 being a really good buy.
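The actual reward formula is the part kept private; purely to show the mechanics, a toy stand-in that samples a review and THC value for the purchased item and squashes the combination into 0-1 could look like this, with entirely arbitrary weights.

```python
import numpy as np

rng = np.random.default_rng(2)

def purchase_reward(item, demographic_pref):
    """Toy stand-in: sample a review and THC value for the purchased item and
    squash the combination into a 0-1 reward. The weights are arbitrary."""
    rating = np.clip(rng.normal(item["rating_mean"], item["rating_std"]), 0.0, 5.0)
    thc = np.clip(rng.normal(item["thc_mean"], item["thc_std"]), 0.0, 1.0)
    score = 0.7 * (rating / 5.0) + 0.3 * (1.0 - abs(thc - demographic_pref))
    return float(np.clip(score, 0.0, 1.0))

def run_reward(purchases, demographic_pref):
    """Mean of the per-purchase rewards, fed back to the DQN after the run."""
    return float(np.mean([purchase_reward(p, demographic_pref) for p in purchases]))

item = {"rating_mean": 4.2, "rating_std": 0.6, "thc_mean": 0.21, "thc_std": 0.03}
print(run_reward([item, item], demographic_pref=0.3))
```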

What the Output Looks Like

Of course, when this model first runs the consumer is just buying anything and everything. Over time though, the agent gets much smarter and starts to evaluate better buys for the consumer in its demographic region, again using the same shared customer agent model. As the consumer model gets smarter, each purchase is also smarter and rewards increase. Now we can quickly see regions in our consumer grid where average customer satisfaction is particularly good or bad; see the image below:

Recreation of simple experiment

The example image above is a simple replication of the original example that was run over Denver. Again, in order to avoid a whole data issue, I replicated the results in a couple of screenshots. One of the things I immediately noticed was a couple of regions with very low average customer satisfaction values next to multiple dispensary locations. Everywhere else the model seemed to predict quite accurately, so what was the problem in these regions? Not being from Denver, and being Canadian, I was a little surprised, until I learned through speaking to my American colleagues that those were in fact very depressed regions and my results were not that surprising. It seems that even the dispensaries close to the depressed regions were still marketing to higher-end consumers, which reinforced my model further. Overall I was pleasantly surprised at my results and I am currently doing further research. Simulating accurate consumer purchases also provides an excellent model to quickly test alternate dispensary menus, products, locations and even perhaps an RL dispensary agent. There are also other ways to build on from DQN, perhaps going to a full PPO model.

The Source

I am unable to provide the source to the model, but if there is enough interest I may provide a more thorough and robust demonstration of the DQN model built with Keras. An alternative I have also considered is using the PPO model that is built into Unity, with a full 3D map interface for running and demonstrating the simulation. Considering my other time commitments and ongoing research, that is likely only a dream, but I would also be very interested to hear how this model may have worked for you…

Written by

Micheal Lanham

Micheal Lanham is a proven software and tech innovator with 20 years of experience developing games, graphics and machine learning AI apps.

Data Driven Investor

