I reverse-engineered a $500M Artificial Intelligence company in one week. Here’s the full story.
Why I did that:
Many of the papers, data sets, and software tools related to deep learning have been open sourced. This has had a democratizing effect, allowing individuals and small organizations to build powerful applications. WhatsApp was able to build a global messaging system that served 900M users with just 50 engineers, compared to the thousands of engineers that were needed for prior generations of messaging systems.
This “WhatsApp effect” is now happening in AI. Software tools like Theano and TensorFlow, combined with cloud data centers for training, and inexpensive GPUs for deployment, allow small teams of engineers to build state-of-the-art AI systems.
The message is simple: nowadays everyone can take advantage of Artificial Intelligence without spending years and tons of money in R&D.
I strongly believe this holds both when you want to push the boundaries of technology (George Hotz style) and when you want to apply well-established AI methods to new problems.
Since some people were not 100% convinced by this, I decided to prove it and invest some of my time in actually doing it. As my business professor at Santa Clara University used to say, “you’ve got to taste your dog’s food”.
Plus, I have to be honest that it was a lot of fun :)
Finding the right experiment:
I decided to pick the case of a company which I really admire (being an energy engineer): Opower.
What Opower does is simple yet very powerful: it uses smart meter data to help utility companies target their customers by giving them insights into consumption patterns. This is very important for utilities, which need to keep their customers’ consumption as stable as possible during the day (the so-called “peaks” and “valleys” are tough and expensive to manage for an energy producer, for reasons we won’t cover here).
Once the consumption pattern of an energy user is known, it’s possible to target them with a custom offer to lower their consumption during critical moments.
Opower made the perfect case since I found this very cool dataset made of roughly 15000 .csv files containing the hourly load of 15000 buildings in the US, collected during 2004. It looked close enough to what Opower actually has available.
The case was perfect and I was ready to go ;)
Looking into the data:
The first thing I did was import single csv files and make some very quick and dirty plots of the energy consumption profiles to check what I could pick up from them.
I made some “hairball plots”: basically, I plotted the energy consumption profiles of many days on a single graph, each as a black line at 0.1 opacity. That makes it easy to see the most common energy usage patterns in a building.
I also made three separate plots for the whole year, workdays and weekends, to look for different patterns.
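For anyone who wants to try it, here is a minimal sketch of how a hairball plot like this could be drawn. It assumes the load is already a flat array of hourly values, uses made-up synthetic data, and the function names (`to_daily_matrix`, `hairball_plot`) are mine for illustration, not the ones in my repository.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def to_daily_matrix(hourly_load):
    """Reshape a flat hourly series into a (days, 24) matrix.

    Any incomplete trailing day is dropped.
    """
    hourly_load = np.asarray(hourly_load, dtype=float)
    n_days = hourly_load.size // 24
    return hourly_load[: n_days * 24].reshape(n_days, 24)


def hairball_plot(hourly_load, ax=None, alpha=0.1):
    """Overlay every day's 24-hour profile as a faint black line."""
    days = to_daily_matrix(hourly_load)
    if ax is None:
        _, ax = plt.subplots()
    # days.T has shape (24, n_days): matplotlib draws one line per column
    ax.plot(np.arange(24), days.T, color="black", alpha=alpha)
    ax.set_xlabel("hour of day")
    ax.set_ylabel("load")
    return ax


# Synthetic stand-in for one building: 366 days (2004 was a leap year)
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 366)
fake_load = (1.0 + np.sin((hours - 6) * np.pi / 12).clip(0)
             + rng.normal(0, 0.05, hours.size))

ax = hairball_plot(fake_load)
# ax.figure.savefig("hairball.png")  # uncomment to write the image to disk
```

To get the separate workday and weekend plots, the same function can be called on the rows of the daily matrix that fall on the days you care about.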
Here are two examples:
They may look useless at first sight, but we can already extract some cool insights:
- The restaurant operates at breakfast, lunch, and dinner (look at the three peaks)
- The restaurant probably opens between 5am and 6am (first increase in consumption due to workers getting in)
- The restaurant probably closes between 1am and 2am
- The restaurant works on weekends and holidays.
- People stay in the restaurant longer for lunch than for dinner or breakfast.
- The restaurant probably has some fridges that are on during the night (the consumption is never zero).
- The school opens around 6am and closes around 10pm.
- Around 6pm most people leave the school, but someone stays in (professors? post-school activities?)
- The school has some kind of appliance that is never turned off (fridges? IT stuff? Or maybe they keep the lights on at night for safety?)
- There’s a problem in the dataset! We should expect to have a stable and close-to-zero consumption on weekends (the horizontal lines). Is the dataset shifted? Turns out that YES it is. By moving everything 3 days in the past the situation gets closer to what we expect, and most weekends and holidays present a stable consumption. :)
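The 3-day correction from the last point can be sketched like this. It is only an illustration using Python's standard `datetime` module; the helper names are hypothetical and the actual fix in my code may look different.

```python
from datetime import datetime, timedelta


def shift_timestamps(timestamps, days=-3):
    """Move every timestamp by a whole number of days (negative = into the past)."""
    return [t + timedelta(days=days) for t in timestamps]


def is_weekend(t):
    """Saturday and Sunday have weekday() values 5 and 6."""
    return t.weekday() >= 5


# 6 January 2004 was a Tuesday; after the shift it lands on Saturday 3 January.
original = [datetime(2004, 1, 6, hour) for hour in range(24)]
shifted = shift_timestamps(original, days=-3)

print(is_weekend(original[0]), is_weekend(shifted[0]))
```

With the corrected timestamps, the flat profiles line up with Saturdays and Sundays, which is what we expect for a school.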
Once I knew what I could find in this data, I started the process of data cleaning and preparing some functions to make my model. As usual, this was the most time-consuming part.
Machine Learning time:
What I needed to start finding patterns in energy consumption was a dataset with a “model” of each building. I built the model simply by averaging the building’s energy consumption for each hour of the day across all days of the year, which gives 24 values per building (plus some other features I thought might be interesting, such as the state, city, and so on).
To make it possible to compare buildings with one another, I also divided the consumption of every hour by the highest registered consumption, so that it ranges between 0 and 1 for every building.
I also built my “make_building_model” function so that I can easily change the dataset and consider just the working days or just the holidays to create the model of every single building, if I want to look for more specific patterns.
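Here is a rough sketch of the idea behind the building model. The function name `building_profile` and the `day_mask` parameter are made up for this example and are not the signature of my `make_building_model`. I also assume here that the scaling divides by the peak of the averaged profile, and that the series contains only complete days.

```python
import numpy as np


def building_profile(hourly_load, day_mask=None):
    """Return a 24-value profile scaled so that its peak equals 1.

    hourly_load: flat sequence of hourly readings, length a multiple of 24.
    day_mask: optional boolean sequence, one entry per day, selecting the
              days to keep (for example only workdays or only holidays).
    """
    days = np.asarray(hourly_load, dtype=float).reshape(-1, 24)
    if day_mask is not None:
        days = days[np.asarray(day_mask, dtype=bool)]
    profile = days.mean(axis=0)        # average each hour across the kept days
    return profile / profile.max()     # assumption: scale by the profile's own peak


# Two made-up days: the second is the first plus a constant offset of 2
two_days = np.concatenate([np.arange(24), np.arange(24) + 2])

whole = building_profile(two_days)                     # uses both days
first_only = building_profile(two_days, [True, False]) # uses only day one
```

Passing a different mask is all it takes to rebuild the model on workdays or weekends only.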
I decided to start by considering the whole year, and ran my script over all the ~15000 csv files I had, ending up with a dataset of 15000 rows. Then I started with the clustering.
I used a very simple (yet powerful and efficient) KMeans algorithm, and fed it with the 15000 samples using as input features a list containing the 24 scaled hourly consumption values.
I played with the number of clusters, and the one that allowed me to get the most significant clusters was 6 (this was a trial and error approach, for brevity I’ll report just the final outcome).
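The clustering step itself is only a few lines. Below is a minimal sketch using scikit-learn's `KMeans` with 6 clusters on 24 scaled hourly values per building. The data is synthetic (noisy bumps centred on different hours), and parameters such as `n_init` and `random_state` are my choices for the example, not necessarily the ones I used on the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
hours = np.arange(24)

# Six made-up prototype shapes: a bump centred on a different hour each
peak_hours = [4, 8, 12, 16, 20, 23]
prototypes = [np.exp(-0.5 * ((hours - p) / 2.0) ** 2) for p in peak_hours]

# 50 noisy copies of each prototype, clipped to the 0..1 range
X = np.vstack([
    proto + rng.normal(0, 0.05, size=(50, 24)) for proto in prototypes
]).clip(0, 1)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index of each building
centers = kmeans.cluster_centers_   # one 24-hour pattern per cluster

print(X.shape, centers.shape)
```

Each row of `centers` is the kind of curve I plot as a cluster center, and `labels` tells which buildings go into each hairball.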
To get some insights I plotted the cluster centers (basically the 6 major patterns recognized by the algorithm), together with a hairball of the buildings that are part of that cluster so that you can appreciate the similarities (notice that there are thousands of lines with 1% opacity in those graphs! That’s the “ghost-like” effect you see).
Looks pretty cool, right? We basically now know that:
- There are users with a steadily growing consumption that peaks around 4pm and then falls (Cluster 1)
- Some users have two daily peaks, around 7am and 9pm (Cluster 2)
- Some users have three daily peaks, around 8am, 1pm and 6pm (Cluster 3)
- Some users have a “two-mode” kind of operation: high consumption from more or less 9am to 5pm, and low consumption for the rest of the day (Cluster 4)
- Some users have a consumption similar to cluster 4, but with a modest (10% of the peak consumption) yet sharp drop around 1pm (Cluster 5)
- Some users have a rather stable consumption, with a minimum consumption of 60% of the peak (Cluster 6).
Can we go further? What kind of value could a company get out of this?
This is what we get if we print out the description of what kind of buildings belong to each of the clusters:
total elements: 3042
n Stand-aloneRetailNew = 936 out of 936
n StripMallNew = 936 out of 936
n SecondarySchoolNew = 923 out of 936
n PrimarySchoolNew = 235 out of 936
n MediumOfficeNew = 4 out of 936
n LargeOfficeNew = 8 out of 936
total elements: 2808
n LargeHotelNew = 936 out of 936
n MidriseApartmentNew = 936 out of 936
n SmallHotelNew = 936 out of 936
total elements: 1872
n FullServiceRestaurantNew = 936 out of 936
n QuickServiceRestaurantNew = 936 out of 936
total elements: 5381
n LargeOfficeNew = 928 out of 936
n MediumOfficeNew = 932 out of 936
n OutPatientNew = 935 out of 936
n PrimarySchoolNew = 701 out of 936
n SecondarySchoolNew = 13 out of 936
n SmallOfficeNew = 936 out of 936
n SuperMarketNew = 936 out of 936
total elements: 936
n WarehouseNew = 936 out of 936
total elements: 937
n HospitalNew = 936 out of 936
n OutPatientNew = 1 out of 936
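A summary in this format can be produced by counting building types per cluster. The sketch below uses only the standard library; `describe_clusters` is a hypothetical helper written for this example, and the tiny input is made up.

```python
from collections import Counter, defaultdict


def describe_clusters(labels, building_types):
    """Return summary lines: cluster size, then per-type counts out of the type's total."""
    totals = Counter(building_types)          # how many buildings of each type overall
    per_cluster = defaultdict(Counter)
    for label, btype in zip(labels, building_types):
        per_cluster[label][btype] += 1

    lines = []
    for label in sorted(per_cluster):
        counts = per_cluster[label]
        lines.append(f"total elements: {sum(counts.values())}")
        for btype in sorted(counts):
            lines.append(f"n {btype} = {counts[btype]} out of {totals[btype]}")
    return lines


# Toy example: 5 buildings assigned to 2 clusters
toy_labels = [0, 0, 0, 1, 1]
toy_types = ["Hospital", "Hospital", "Warehouse", "Warehouse", "Warehouse"]

summary = describe_clusters(toy_labels, toy_types)
print("\n".join(summary))
```

In the real run, `labels` comes from the clustering and the building type comes from the file each row was built from.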
Do you notice anything cool? Here’s what I’d say:
- All stand-alone retailers, strip malls, and almost all secondary schools belong to cluster 1. If we were a utility company using this tool, when signing a new contract with this kind of customer we could consider offering incentives to shift their consumption away from that single peak.
- All hotels and “Midrise” apartments belong to cluster 2. Utility companies could target those customers with offers that push them towards reducing consumption around 7am and 9pm (especially 9pm). Or at least be aware that if they sign up a new customer who is a big hotel chain, they have to expect those peaks.
- All restaurants belong to cluster 3, possibly because the dataset didn’t have any restaurant open for just two of the three meals (breakfast, lunch, dinner). Utilities should pay particular attention to the last peak of those users, which happens at roughly the same time as that of users from cluster 2. A tailored promotion could, for instance, incentivize a shift in opening hours and therefore a shift in consumption (I’m not sure that makes much sense business-wise though).
- Cluster 4 is the most common profile, with offices of any size, supermarkets and most primary schools. We expected that, since it’s a classic “9to5” kind of profile.
- Cluster 5 is typical just of warehouses. The drop in consumption around 1pm is probably a shift change. We could incentivize the use of energy in night hours with tailored promotions, so that these peaks don’t add up to the ones of Cluster 4 users.
- New hospitals all have a profile like Cluster 6. Being hospitals would explain the high night consumption and steady profile, since they probably have machines that can’t be turned off at night. There’s not much we can tell a hospital to change its behaviour, and they’re pretty steady users too, so we’ll leave them alone :).
Yes, it’s possible to draw some very interesting conclusions using Machine Learning even if you’re not a $500M company, but a single person with some free time. This is mainly due to the availability of open datasets, as well as open source software that allowed me to build a rough but working model at an incredible speed.
I’d also say that clustering is an extremely powerful and easy to apply technique which has a lot of untapped potential. Imagine what you could do in marketing with the same approach.
I just scratched the surface of what’s possible to get from this dataset. We could keep on looking at differences between climatic areas within the US, differences between working days and weekends, or do the same work with other kinds of consumption (the original dataset also has heating and some other cool things!).
If anyone wants to start digging, I made my work public on GitHub, where you can use my functions to save a lot of time in data cleaning, and immediately start doing the fun ML stuff :).
Ben Packer, an ex-Opower Data Scientist, pointed me to this very interesting article on Opower’s original work (thanks Ben!). It’s striking how similar their approach is to mine (even the graph representation!). Since they refer to residential customers, the cluster shapes are slightly different: for example, there’s no one with three peaks in their data, whereas we noticed that all restaurants have this consumption pattern.
Moreover, they give a glimpse of the techniques they used: of course they’re more sophisticated than what I used, having hundreds of times more data, more money, time, engineers, and…being a $500M company (I guess that’s why my deliberately provocative title offended so many on Hacker News).
Anyhow, it seems to me that the conclusions and the value (in terms of information, not $$$ of course…) we extracted are definitely comparable.
If you liked my work, a click on the recommend button is well appreciated, as well as sharing it. Sharing is caring ❤
Also, if you have some interesting data you’d like to extract some value from, you can reach me at firstname.lastname@example.org :)