Since I learned about super-normal profits and black markets in microeconomics 101, I’ve been fascinated by cannabis pricing. How could something grown so easily be worth so much money?
In graduate school, I came across a clever website that anonymously crowd-sources how much money people pay for different quantities of cannabis. It also details the transaction location and perceived cannabis quality. I was so fascinated I wrote my master’s thesis using their data to estimate the price elasticity of cannabis.
Five years later, cannabis is legal for medicinal use in 31 states and recreational use in 9 states. There are thousands of dispensaries from which one can obtain pricing data to analyze. I thought it might be a good time to revisit cannabis pricing to build a model that outputs a price benchmark for dispensaries (a “dankstimate” in the vein of Zillow’s “zestimate”).
How do dispensaries currently set prices?
Based on anecdotes from dispensary employees, flower product (i.e., the actual cannabis flower, not a concentrate or edible) pricing isn’t especially scientific. Dispensary managers evaluate the flower based on nose (i.e., smell), density, and strain name. While these are important features consumers look to when making their purchasing decisions, we may be able to create an additional pricing benchmark by aggregating pricing data from the majority of dispensaries in the states in which cannabis is legal for both medicinal and recreational use.
Building the “Dankstimate”
While companies like New Frontier Data and BDS Analytics can build benchmarking models on rich retail tracking data sets from POS systems, I wanted to build an accurate model using publicly available information. Price prediction is a classic machine learning regression problem, so that is the approach I took.
Using the weedmaps.com API, I was able to obtain menu data from over 20,000 dispensaries in the states that have legalized cannabis for recreational use.
The data set included information like the cost for varying quantities of the products, THC and CBD content, strain name, the sub-type of flower (Indica, Sativa, or Hybrid), and other information related to the dispensary.
Data Processing/Exploratory Data Analysis
The data from the Weedmaps API came in typical JSON format. Some of the more significant munging exercises included:
- Initially, I only wanted to look at flower products, so I had to filter out non-flower products from the data set.
- The price data varied by product and dispensary, so I had to transform it into a price per gram for each product. I also created quantity indicator variables for each product to account for quantity discounts as the quantities available for each product varied (a weighted average is probably another option).
- The text associated with the strain names for each product was quite messy and required the use of regular expressions to clean.
- THC and CBD content was not listed for a lot of products and the structure of the strings in which it was listed varied significantly between menus. In order to ensure I was able to include cannabinoid content for a majority of the products, I fuzzy matched each product to the average lab testing results from this data set. I only kept observations that matched 100%.
- As with any data set, there were some values that I would consider outliers. I removed values that were more than three standard deviations from the mean.
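As a rough illustration of the munging steps above (not the code from the repository), the sketch below cleans strain names with regular expressions, converts menu prices to a price per gram with quantity indicators, and drops values more than three standard deviations from the mean. The column names (`name`, `quantity`, `price`), the quantity labels in `GRAMS`, and the helper function names are all assumptions made for the example.

```python
import re

import pandas as pd

# Assumed quantity labels and their weights in grams (hypothetical menu vocabulary).
GRAMS = {"gram": 1.0, "eighth": 3.5, "quarter": 7.0, "half_ounce": 14.0, "ounce": 28.0}


def clean_strain_name(raw: str) -> str:
    """Lowercase a menu name, drop bracketed notes and punctuation, collapse spaces."""
    name = raw.lower()
    name = re.sub(r"[\(\[].*?[\)\]]", " ", name)  # remove "(...)" and "[...]" notes
    name = re.sub(r"[^a-z0-9\s]", " ", name)      # remove punctuation and symbols
    return re.sub(r"\s+", " ", name).strip()


def to_price_per_gram(menu: pd.DataFrame) -> pd.DataFrame:
    """Add a cleaned strain name, price per gram, and one indicator per quantity."""
    out = menu.copy()
    out["strain"] = out["name"].map(clean_strain_name)
    out["price_per_gram"] = out["price"] / out["quantity"].map(GRAMS)
    indicators = pd.get_dummies(out["quantity"], prefix="qty", dtype=int)
    return pd.concat([out, indicators], axis=1)


def drop_outliers(df: pd.DataFrame, col: str = "price_per_gram", n_std: float = 3.0) -> pd.DataFrame:
    """Keep rows whose value lies within n_std standard deviations of the mean."""
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] - mean).abs() <= n_std * std]


menu = pd.DataFrame({
    "name": ["*OG Kush* (Top Shelf)!!", "Blue Dream"],
    "quantity": ["eighth", "ounce"],
    "price": [35.0, 224.0],
})
print(to_price_per_gram(menu)[["strain", "price_per_gram", "qty_eighth", "qty_ounce"]])
```

The quantity indicators let a model learn a separate discount for each package size, so an ounce priced at $8/gram is not treated as directly comparable to an eighth priced at $10/gram.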
The bar charts below illustrate average pricing by the major regions in the states that have legalized cannabis for recreational use (only regions with more than 1,000 observations are included).
Interestingly, when looking at the data from Priceofweed.com back in 2012, Washington D.C. was also the most expensive place to purchase cannabis.
According to the data, the cheapest place to buy cannabis is the tiny Washington town of Cathlamet. Given the density of agriculture in the area, perhaps there are a number of larger cannabis farms driving price down.
The table below details the average cost per gram for 15 historically popular strains as detailed in this article.
The pie chart below details the composition of sub types in the data set.
Based on insights from industry reports, dispensary employees, and growers, I believe the following are the most relevant features:
- Nose, or the fragrance of the flower, is certainly one of the most important determinants of price. While the genetics or strain of a cannabis flower certainly contribute to its nose, the specific conditions under which each product is grown are likely the biggest factor. Strain and geographic fixed-effects helped to capture some of the variation associated with the nose; however, I had to live with some omitted variable bias related to this determinant.
- Density of the buds is also a common characteristic that helps set price. Like the nose, the density of a product is a tangible characteristic that consumers can evaluate prior to making a purchasing decision. Greater density is typically associated with higher quality. Unfortunately, this is data that is not readily available and I was unable to include it in any model.
- Geographic location is important for a number of reasons. Distance from major manufacturing centers like northern California, regional economic conditions (e.g., the cost of living, electricity prices, etc.), and amount of time cannabis has been legal all contribute to pricing. My model incorporated geographic fixed effects (I evaluated both regional and zip-code based indicators) to help capture variation associated with geographic location.
- The type of cannabis is another important pricing determinant. Similar to nose and density, the strain name and type (i.e., Indica or Sativa) are additional factors that consumers can evaluate prior to purchase. Given the dominance of a cultivar approach to understanding cannabis (i.e., classification based on cultural vernacular, whether it is accurate or not), the majority of consumers believe the effects and flavors of cannabis flower are dictated by the strain and/or type (e.g., Indica helps you sleep; “in da couch”). I was able to include both strain and type fixed-effects in my model. Additionally, certain strains have garnered significant recognition over the years and often fetch a premium (e.g., “White Widow” or “OG Kush”). Again utilizing fuzzy string matching, I created a popular strain score (a ‘fuzz ratio’ of 0-100) for each product based on how closely its name matched the closest strain name in this article.
- The potency of cannabis is typically measured by its content of the cannabinoids THC and CBD. The higher the concentration of these cannabinoids, the stronger the effects of the flower will be when consumed (this is more of a chemovar approach to classifying cannabis, i.e., based on its chemical composition; more on cultivar vs. chemovar in future posts). In most states, the legalization legislation required that all products be laboratory tested for their concentration of THC and CBD (among other things) and that the results be displayed for consumers.
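To make the popular strain score concrete, here is a minimal sketch of a 0-100 similarity score against a list of well-known strains. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whichever fuzzy matching library produced the original ‘fuzz ratio’, so the exact scores may differ from those in the analysis. The `POPULAR_STRAINS` list and the function names are illustrative assumptions, not the list from the article.

```python
from difflib import SequenceMatcher

# Illustrative subset only; the real list of popular strains is not reproduced here.
POPULAR_STRAINS = ["white widow", "og kush", "blue dream"]


def fuzz_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (100 means identical)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def popular_strain_score(name: str, popular=POPULAR_STRAINS):
    """Return the closest popular strain and its 0-100 similarity to `name`."""
    best = max(popular, key=lambda strain: fuzz_ratio(name, strain))
    return best, fuzz_ratio(name, best)


for product in ["OG Kush", "og kushh", "Purple Mystery"]:
    print(product, popular_strain_score(product))
```

The same scoring function can serve the stricter use case described for the lab testing data: keeping only rows whose best score is exactly 100 is equivalent to requiring a perfect match after normalizing case.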
As detailed above, most features in the model were categorical and were one-hot encoded to create indicators. I also explored creating some polynomial permutations of the potency and popular strain features.
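A minimal sketch of this encoding step with scikit-learn is shown below. The column names (`region`, `subtype`, `thc`, `popular_score`) and the sample rows are made up for illustration, and the degree-2 polynomial expansion is just one possible reading of “polynomial permutations”.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures


def build_preprocessor(categorical, numeric, degree=2):
    """One-hot encode categorical columns and expand numeric ones polynomially."""
    return ColumnTransformer(
        [
            # Unseen categories at prediction time become all-zero indicator rows.
            ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
            # degree=2 adds squares and pairwise products of the numeric columns.
            ("poly", PolynomialFeatures(degree=degree, include_bias=False), numeric),
        ],
        sparse_threshold=0,  # always return a dense array for easy inspection
    )


# Hypothetical rows standing in for the menu data.
df = pd.DataFrame({
    "region": ["Denver", "Seattle", "Denver"],
    "subtype": ["Indica", "Sativa", "Hybrid"],
    "thc": [18.0, 22.5, 20.0],
    "popular_score": [100, 62, 75],
})

preprocess = build_preprocessor(["region", "subtype"], ["thc", "popular_score"])
X = preprocess.fit_transform(df)
print(X.shape)  # rows x (indicator columns + polynomial columns)
```

With thousands of strain and zip-code indicators the dense output would become very wide, so leaving `sparse_threshold` at its default is likely the better choice on the full data set.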
All modeling was completed using Python’s scikit-learn library. I explored popular supervised machine learning regression models such as linear, lasso, ridge, XGBoost, and random forest. I used GridSearchCV for some basic hyperparameter tuning, with cross-validation to combat overfitting. The models were evaluated via the R² and RMSE scoring metrics.
The random forest regressor performed the best by a significant margin. One might worry that a flexible ensemble algorithm like random forest would overfit; however, the R² on the training set was 0.82 (RMSE of 2.48) and the R² on the test set was 0.84 (RMSE of 2.40), so the model performed no worse on data it had not seen.
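The tuning and evaluation loop can be sketched roughly as follows. This is an illustration on synthetic data, not the repository code: the parameter grid, the 75/25 split, the three cross-validation folds, and the function name `tune_random_forest` are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_random_forest(X, y, seed=0):
    """Grid-search a random forest, then report R² and RMSE on train and test."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    grid = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
        scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
        cv=3,
    )
    grid.fit(X_train, y_train)  # refits the best model on the full training set

    scores = {}
    for label, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = grid.predict(X_part)
        scores[label] = {
            "r2": float(r2_score(y_part, pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
        }
    return grid.best_params_, scores


# Synthetic stand-in for the feature matrix and price per gram.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 9.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

best_params, scores = tune_random_forest(X, y)
print(best_params)
print(scores)
```

Comparing the train and test rows of `scores` is the overfitting check: a training R² far above the test R² would suggest the forest is memorizing the training data.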
The table below illustrates the predicted prices vs. the actual prices:
It appears the model is overpredicting prices below $8/gram and underpredicting prices greater than $8/gram.
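One simple way to check for this pattern is to compare the average signed error on either side of a price threshold. The sketch below uses made-up numbers and a hypothetical helper name; the $8/gram threshold is taken from the observation above.

```python
import numpy as np


def mean_error_by_band(actual, predicted, threshold=8.0):
    """Average signed error (predicted - actual) below and at/above a threshold.

    A positive value means the model overpredicts on average in that band.
    Both bands are assumed to contain at least one observation.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - actual
    low = actual < threshold
    return {
        "below": float(error[low].mean()),
        "at_or_above": float(error[~low].mean()),
    }


# Made-up prices per gram, for illustration only.
bands = mean_error_by_band(actual=[5, 6, 10, 12], predicted=[7, 7, 9, 10])
print(bands)  # {'below': 1.5, 'at_or_above': -1.5}
```

A positive error in the lower band together with a negative error in the upper band means predictions are being pulled toward the middle of the price range.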
It is clear that the model could use some more work: the average price was $9.24/gram, so a model with a typical error of more than $2 (roughly a quarter of the average price) isn’t particularly useful. For Part II and Part III of this project, I hope to explore the following:
- Add more features and data: Perhaps there is a better way to capture the variation in pricing associated with the nose and density of the flower products? These features are unfortunately very case-specific, since they are highly contingent on the grower. Weedmaps does provide pictures for each flower product, so perhaps some kind of computer vision tools could be utilized. Also, websites like Cannabisreports.org and Leafly.com appear to be rich data sources for this model.
- Additional feature engineering: I would like to do a deeper dive into potential feature transformations and interactions in an attempt to improve the accuracy of the model.
- Explore the use of neural nets: Even though I was able to utilize over 400,000 observations in this analysis, significantly more data is available in Weedmaps and it is growing every day. I believe I may be able to improve prediction accuracy further by obtaining a larger data set and employing some of the latest deep learning models via Amazon Web Services (AWS).
- Create models for concentrates and edibles: Pricing data is also available for non-flower products. Concentrates are becoming increasingly popular, per BDS Analytics’ most recent report.
- Create a web application: Once the model is finalized, I hope to build a simple web application that dispensaries could use to easily benchmark their product pricing.
- Black market comparisons: While recent attempts to obtain new data from priceofweed.com haven’t been fruitful, it would be great to conduct a similar analysis on cannabis pricing in the black market.
All of the code related to this post can be found in my GitHub repository.