How to Build a Data Warehouse: What We’ve Learned So Far at Glossier
When I started at Glossier in 2017, I became part of a small team working on building a data warehouse prototype. Coming from an e-commerce background, I was already comfortable with a range of technologies and relational databases, but I had no idea where to start in terms of building a data warehouse. I was surprised by how little information there was online and how many different types of roles there are in the field.
I read what I could find, we met with other direct to consumer e-commerce startups working on data warehouses, and since then we’ve done A LOT of prototyping and rewriting. Nine months into the project, we must have rewritten how our data transformation works a good 10 times.
In this post, we’ll discuss how we were originally using data at Glossier and why we needed to evolve. We’ll also go through the various benefits of using a data warehouse based on real-life examples and our main accomplishments in the past year.
While I don’t consider myself an expert, I’ve learned a lot during this process. I hope you’ll find this post valuable and informative for projects you may be working on or considering. One of our goals in writing this is that we want the data world to have more readable resources. It would be great if this inspired our fellow data friends (you?) to share their thoughts and write about their successes and failures so others can learn from it, with the hopes of nurturing a strong community in what is often a closed world.
So let’s get into it!
A little bit about Glossier
Glossier is a rapidly growing digitally native beauty company with a strong tech-focus, made up of over 150 employees and 80 seasonal retail workers across 3 countries, and counting.
The company has its origins in the beauty blog Into The Gloss (ITG), founded by Emily Weiss in late 2010, with the goal of talking about beauty as an element of personal style in the same way fashion is. Blog articles focus on beauty tips, makeup tutorials, product reviews and the well-known Top Shelf franchise. After several years of interviewing women for ITG, Emily saw room for a brand to actually listen to its customers and meet the needs of women today; enter Glossier, which was founded in 2014.
Fast forward to 2017: as sales increased exponentially year over year, we needed a way to collect multiple metrics in order to understand our customers’ behavior to allow us to better interact and connect with them.
In this post, we’ll mainly focus on Glossier.com’s tech stack and less on IntoTheGloss.com’s. While the way these two platforms collect metrics is similar, there are some key distinctions that are out of the scope of this post.
Finally, we use Shopify for our international retail locations and temporary pop-ups. We’ve created a bridge between Solidus and Shopify that allows us to export our Glossier products directly into Shopify, thereby easing the process of opening ad-hoc stores. It also enables the CX team to see retail orders from Shopify directly in Solidus’s admin dashboard.
How we were using data in the past
Monitoring our platform and collecting customer metrics wasn’t very well defined in the early days. Delivering the best possible customer experience on Glossier.com and building a successful e-commerce platform were our main focus. Over the years, we’ve been able to refine our solutions and processes to be more metric-focused.
During the first year of Glossier, in 2015, we were sending data directly to Google Analytics without the use of Segment and we were extracting raw data from our databases so that we could create reports, graphs and infer information using Excel.
As this was not scalable, we gradually migrated towards a tool called Looker in 2016. This platform allows us to easily query our databases without most of the Excel hassle that you may be familiar with. Looker acts as a presentation layer on top of your data and abstracts SQL, allowing you to define your own objects and query them fairly easily. It also offers the possibility to visualize your data without having to know much about how everything works behind the scenes, allowing multiple departments to fetch their own sets of answers.
In the early days, our analyst Kate Caputo was managing all the underlying model definitions within Looker and maintaining the different reports. She was principally using our transactional data — meaning that we were unable to infer information based on our retail POS, customer behavior information or any other external sources of data. This was mainly due to the fact that, at the time, we weren’t consistent about where our data was living, i.e. there wasn’t one single source of truth containing everything.
The marketing team relied exclusively on a sole analyst querying Google Analytics for multifaceted behavioral data, a process that quickly proved unmanageable. While Looker is a powerful tool for building comprehensive dashboards and reports, our data needs became too much for one person to handle on their own.
On the one hand, this early system had its advantages. With only one owner of the platform, we always knew who to reach out to when we had questions. There were processes in place for information requests, which meant that it was quick to get what you wanted. Kate had very strong knowledge of our data and was efficient at navigating all the various system quirks.
On the other hand, it also had its disadvantages. It was really hard to give proper insight on some metrics because we were only using one data source out of the twenty that we had. Also, since our data analyst was extremely busy, it was hard to maintain a clean and scalable LookML codebase. Our data model was also constantly evolving, which meant that our data could become inconsistent over time, making it hard to clean up and keep uniform.
While these were all points that could have been improved, the most critical downside was that we had Looker directly hooked into our production instance instead of a read replica, for a variety of reasons. Under heavy load, long queries could negatively impact the performance of the order checkout page. At the end of the day, we knew we needed a different approach.
The benefits of a data warehouse
The most obvious solution was to build our own data warehouse. In this section, I’ll detail how we came to this decision, explaining the benefits of building such an infrastructure. I’ll use real Glossier examples based on pain points that we experienced, and how a data warehouse has ultimately made our lives better.
Decoupling from our production website
As I mentioned above, originally we had our Looker instance directly associated with our production database. While we were aware that this could have been improved, we were constrained by elements that were beyond our control at the time. Just before building the original version of the data warehouse, we decided to have a read-only replica where a copy of the data would reside, meaning that making queries against this database wouldn’t cause problems. A couple of weeks later, we got the go-ahead to build the data warehouse and started the process of moving all of our sources to this database.
One of the first sources we set up in the data warehouse was the read-only replica of our production database, saved under a schema named prod (more on our naming convention later). Since Amazon offers a database syncing service that can replicate a schema from one database into another, this was a simple change for us. The fact that our data analyst could now run long queries without impacting customer experience on Glossier.com was a big win in and of itself.
Having the right infrastructure for scale and speed
Beyond running long queries, there were other advantages that we were seeking with our data warehouse, such as the ability to scale. We’ve now been collecting data for just three years (at the time this was published), and we’re already talking in terabytes in terms of the amount of data we hold. It’s also growing very rapidly, so we needed to have something that could adapt to our needs without having a dedicated dev-ops engineer constantly on the case.
When we were at the point of choosing a data warehouse platform, after investigating multiple solutions, we decided to go with Redshift for our first iteration. We chose it for several reasons, primarily because it was very similar to what we were used to with PostgreSQL, and because it’s offered by Amazon, whose services power our whole tech infrastructure.
Other benefits of Redshift include easy scalability and its columnar storage, which makes it performant for our type of queries. On the other hand, while Redshift has a lot of tooling and logs to help you make a table efficient to query, it doesn’t really hold your hand throughout the process: it takes time and dedication to tune a table properly. You can set the compression encoding for each individual column, define sort keys, and choose a distribution style per table. For more on this, check out this tutorial on tuning table design and this article on optimizing Redshift performance with dynamic schemas.
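To make this concrete, here’s a rough sketch of what a tuned table definition could look like. The table and column names are invented for illustration, not our actual schema:

```python
# Sketch of a Redshift DDL for a hypothetical orders fact table, tuned with
# a distribution key, a sort key, and per-column compression encodings.
# Table and column names are made up for illustration purposes.

def orders_ddl(schema: str = "analytics") -> str:
    """Build a CREATE TABLE statement tuned for Redshift."""
    return f"""
CREATE TABLE {schema}.orders (
    order_id     BIGINT       ENCODE zstd,
    customer_id  BIGINT       ENCODE zstd,
    channel      VARCHAR(32)  ENCODE bytedict,  -- low-cardinality text column
    completed_at TIMESTAMP    ENCODE raw        -- sort key column, best left uncompressed
)
DISTKEY (customer_id)   -- co-locate a customer's orders on one slice for faster joins
SORTKEY (completed_at); -- most reporting queries filter on a date range
""".strip()

print(orders_ddl())
```

The distribution key determines which node slice each row lands on, so picking a column that you join or group on frequently keeps that work local to a slice.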
Another benefit of using Redshift is that it allows us to easily export data (unload) to Amazon S3 and import data (copy) from S3, giving us the possibility of increasing the efficiency of our processes.
Let’s imagine a scenario where we want to get data from Google Analytics and insert it into our data warehouse. Here’s how it would break down:

1. An ETL process fetches the data we need from the Google Analytics API.
2. The process writes the results as a raw CSV file to S3.
3. A copy command loads the CSV from S3 into its own schema in Redshift.
We now have our data residing in our data warehouse without too much effort. With more polishing, this also allows us to save a historical version of the raw CSV file so that we can rollback to that version at any time.
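A minimal sketch of that flow might look like the following. The bucket name, IAM role, schema, and table names are invented for illustration; the real pipeline would also handle credentials and error cases:

```python
# Sketch of the Google Analytics -> S3 -> Redshift flow described above.
# Bucket, role, schema, and table names are hypothetical.

import csv
import io

def rows_to_csv(rows):
    """Serialize API rows (a list of dicts) into a CSV string for upload to S3."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def copy_statement(schema, table, s3_key, iam_role):
    """Build the Redshift COPY command that loads the CSV from S3."""
    return (
        f"COPY {schema}.{table} "
        f"FROM 's3://glossier-example-bucket/{s3_key}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

# Example: a day's worth of (fake) Google Analytics sessions.
ga_rows = [{"date": "2018-01-01", "sessions": 1200, "channel": "organic"}]
csv_body = rows_to_csv(ga_rows)  # this string would be uploaded to S3
stmt = copy_statement("google_analytics", "sessions_daily",
                      "ga/2018-01-01.csv",
                      "arn:aws:iam::123456789012:role/redshift-load")
```

Because the raw CSV lives on S3 independently of the warehouse, re-running the COPY against an older file is what gives us the rollback ability mentioned above.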
While we’re happy with Redshift for now, we may put some time aside and re-evaluate what’s on the market in late 2018, since there seems to be a lot of new stuff coming out. Something like Athena seems fun to use.
A single location containing all data
Previously, we had all of our data residing in multiple places. Some parts were in Excel spreadsheets, while others were in our transactional databases, on third-party APIs, or even in a totally different database.
This made retrieving specific pieces of data complicated: you had to either remember how to fetch it, or read documentation to refresh your memory. While this wasn’t the worst flow, we wanted to improve on it and consolidate everything that we could into a single space.
This single space ended up being, as you have likely guessed, our data warehouse. Since we have a lot of data coming from different sources, we had to create a set of rules to define how this data would reside in the warehouse. Based on the experiences of different people on the team, we quickly decided that each source of data should reside in its own schema. We would have a schema named prod representing a read-only copy of our production database, shopify_usa for our Shopify data, segment_glossier for our Segment data, etc. Our core schema is named analytics and contains all the tables that are clean and used as a reference for the company’s critical reports, like sales numbers.
In this way, we could easily connect to the database with whatever tool we felt like using and easily access different types of data. In order to save the different data sources into our warehouse, we had to combine multiple tools together. Segment offers the possibility to sync third-party data into the DB with Warehouses, while Stitch supports some data sources that Segment doesn’t. We also have our own set of ETL processes that fetch data daily from Google Analytics’ API and save it into its own schema. All in all, we have about 19 different sources of data flowing into our data warehouse, which segues nicely into our next benefit.
Reconciling data from multiple sources
While it’s all fun and dandy to have data, it’s even more fun to cross-reference multiple sources of data together in order to build a more robust view of our customers’ experiences.
Before having a data warehouse, it was impossible for us to combine data coming from different sources. Looker is a strong tool for querying data, but it only supports one connection, i.e. one source of data, at a time.
Now that we have everything in the same bucket, we can cross-reference our POS data from Shopify with our transactional data from Glossier.com, allowing us to better understand our customers’ behavior. Another example is joining Google Analytics data with Glossier’s order information in order to understand marketing channels based on last-click data.
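As an illustration, a cross-channel query built on the schema convention above could look roughly like this. The schema names follow our convention, but the column names are invented:

```python
# Hypothetical cross-schema query: union a customer's Shopify retail orders
# with their Glossier.com orders from the production replica. Column names
# are illustrative, not our real tables.

def cross_channel_query(email: str) -> str:
    # In real code the email would be passed as a bound query parameter,
    # never interpolated into the SQL string like this.
    return f"""
SELECT 'retail' AS channel, o.created_at AS ordered_at, o.total_price AS total
FROM shopify_usa.orders o
WHERE o.email = '{email}'
UNION ALL
SELECT 'online' AS channel, o.completed_at AS ordered_at, o.total AS total
FROM prod.orders o
WHERE o.email = '{email}';
""".strip()

print(cross_channel_query("jane@example.com"))
```

The whole point is that both schemas live in one Redshift cluster, so a single connection can join or union across sources that used to live in entirely separate systems.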
THIS IS REALLY POWERFUL.
It allows us to expand our data coverage area and answer many questions that we couldn’t before.
We’ve talked about having multiple sources of data flowing into one single database. Now let’s talk about managing historical data.
Every company goes through growing pains as it transitions from a young startup to a more established entity. It comes as no surprise that schemas change, data gets lost, or its meaning changes over time. Tech debt accumulates and is hard to go back and fix. Having a data warehouse allows us to fix the historical problems that are resolvable and export the results into another schema where we consider the data to be clean and healthy.
Recently, we had to reconcile all the SKUs from all the orders on both Glossier.com and all of our retail stores in order to answer stakeholder questions. At the time, we were creating new product SKUs several times a month, which quickly became messy. It was really hard to know how much we had sold of a certain product within a certain time frame because the SKU would change over time. This is not the case anymore: our product management flow has dramatically changed.
We’ve created an ETL process that builds multiple mapping tables, which allows us to map any product in the past to its current equivalent. This process was painful. It took us a long time to sanitize everything, but now that our ETL process is complete, supporting SKU changes is not a big deal. Without a data warehouse, we wouldn’t have been able to apply the same transformation logic to both our Glossier.com and Shopify product data and export the results into a clean schema for our analyst to query against.
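To make the mapping idea concrete, here’s a toy version of it. The SKU values and the shape of the mapping are invented for illustration:

```python
# Toy version of the SKU mapping approach: a mapping table translates every
# historical SKU to its current equivalent, so units sold can be rolled up
# per product regardless of when the SKU changed. SKU values are invented.

from collections import Counter

SKU_MAP = {
    "BALM-001": "GBDC-ORIG",   # old hypothetical SKU -> current equivalent
    "BALM-002": "GBDC-ORIG",
    "GBDC-ORIG": "GBDC-ORIG",  # current SKUs map to themselves
}

def units_sold_by_current_sku(order_lines):
    """order_lines: iterable of (sku, quantity) tuples from any sales channel."""
    totals = Counter()
    for sku, qty in order_lines:
        totals[SKU_MAP.get(sku, sku)] += qty  # unmapped SKUs pass through as-is
    return dict(totals)

lines = [("BALM-001", 2), ("BALM-002", 1), ("GBDC-ORIG", 3)]
print(units_sold_by_current_sku(lines))  # {'GBDC-ORIG': 6}
```

In the warehouse this lookup is a join against the mapping table rather than a Python dict, but the logic is the same for Glossier.com and Shopify order lines alike.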
I’ve briefly talked about us having built a bridge between Solidus and Shopify. While the implementation works, it’s hard to maintain. Instead of having to maintain this bridge, we’ve decided to use the raw Shopify data.
Having a data warehouse doesn’t mean that all of our problems are solved. It still needs a lot of maintenance and support to get everything up and running and stable enough to be usable.
We started building our data warehouse early 2017 and it’s an ongoing challenge to define our priorities and focus on them. Our stakeholders understand how powerful it is, and everyone wants their share of clean data to get business insights.
Glossier’s various departments are still using Looker, now hooked up to the data warehouse, to deliver reports, create graphs and infer information for our business intelligence. Having a data warehouse like this allows multiple parties to be actively involved in the ETL process, transforming and polishing the data so that it can tell us a story.
What we’ve accomplished in the past 9 months
As you may have inferred by now, we started this project with a pretty bare bones solution, which we’ve slowly built up to contain all of our various sources of data. Our data warehouse now also includes our core dataset, which is cleaned and goes back to the beginning of Glossier.com.
This may not seem like a lot, but it’s a tremendous accomplishment for our first nine months. We’ve created a scalable infrastructure that aggregates all of our data sources. We now have the ability to transform said data at regular intervals to keep key reports up-to-date, while saving historical information between each transformation step on S3 for easier debugging and rollback.
We’ve also created a staging flow, where we can run tasks that transform our raw data without affecting our core reporting. We’ve implemented tasks that ensure data integrity by monitoring duplicated data, missing rolled-up report dates, and failed data imports. All of our monitoring and metrics are currently sent to DataDog, giving us easier visibility into problems and notifying the parties involved. Lastly, we can now generate multiple key reports that were not previously feasible.
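The first two integrity checks can be sketched like this, assuming simplified row and date inputs; the real tasks run against warehouse tables:

```python
# Sketch of two of the integrity checks mentioned above: detecting duplicated
# rows and finding missing dates in a rolled-up daily report. The input
# shapes (dicts and date lists) are simplified stand-ins for warehouse rows.

from datetime import date, timedelta

def duplicate_keys(rows, key="order_id"):
    """Return every key value that appears more than once in `rows`."""
    seen, dupes = set(), []
    for row in rows:
        if row[key] in seen:
            dupes.append(row[key])
        seen.add(row[key])
    return dupes

def missing_report_dates(report_dates, start, end):
    """Return every calendar day in [start, end] absent from the daily rollup."""
    have = set(report_dates)
    day, missing = start, []
    while day <= end:
        if day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing

rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 1}]
print(duplicate_keys(rows))  # [1]
print(missing_report_dates([date(2018, 1, 1), date(2018, 1, 3)],
                           date(2018, 1, 1), date(2018, 1, 3)))
```

When either check returns a non-empty result, the task emits a metric, which is what flows into DataDog and triggers the notifications.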
While building all of this, we’ve always kept in mind that we wanted our work to be open source so that we could give back to the community. We’ve forked an open-source project called Luigi by Spotify and tailored its implementation to fit our custom needs. We’re still experimenting with our workflow, but once we’re confident that it’s ready, we’ll release our contribution back to the community so that other companies can learn from it and have a workflow that they know will work in a production environment.
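To give a sense of the model Luigi provides, here’s a minimal pure-Python approximation. This is not Luigi’s actual API, just a sketch of its core idea: each task declares its dependencies and an output target, and only runs when its target doesn’t exist yet, so completed work is never redone:

```python
# Minimal approximation of the Luigi task model (not Luigi's real API):
# tasks declare dependencies and an output target; a task runs only when
# its target is missing, so re-running a pipeline skips finished steps.

class Task:
    requires = []   # upstream Task classes
    done = set()    # stand-in for output targets on disk / in S3

    @classmethod
    def target(cls):
        return cls.__name__

    @classmethod
    def run(cls):
        pass  # real tasks would fetch, transform, or load data here

def build(task, log):
    """Run `task` and its dependencies in order, skipping completed targets."""
    for dep in task.requires:
        build(dep, log)
    if task.target() not in Task.done:
        task.run()
        Task.done.add(task.target())
        log.append(task.target())

class ExtractOrders(Task):
    pass

class TransformOrders(Task):
    requires = [ExtractOrders]

log = []
build(TransformOrders, log)
print(log)  # ['ExtractOrders', 'TransformOrders']
build(TransformOrders, log)  # second run is a no-op: targets already exist
```

Luigi itself expresses the same idea through `Task.requires()`, `Task.output()` targets, and a central scheduler; the sketch above only mimics the shape of that model.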
What we’ve learned
We weren’t sure what we were getting ourselves into when we started working on this project. Building a proper data warehouse is a long and hard process, but it’s also really REALLY fun. There are a lot of critical decisions with long-term impact, and you really have to think through solutions to ensure that everything will keep working as time progresses.
In my opinion, there is also a substantial lack of resources online, so succeeding as a junior in the field involves a lot of trial and error. We’d like this situation to change, which is why we’ll continue to write more articles on the subject, encouraging further discussion.
I’ve realized that having a data warehouse really helps the business answer critical questions. While stakeholders are eager to use the warehouse to inform data-driven insights, it’s imperative to build it out well, rather than just building fast.
So what’s next?
While we’ve made huge strides in our transactional data, there is still a lot of greenfield with regards to behavioral data. This will be our next big undertaking so that we can unearth patterns in customer behavior on site.
Another goal for this quarter is to have better monitoring on Glossier.com to reduce potential friction or pain points for customers, perhaps even before they are aware of them.
And of course, maintenance, maintenance, maintenance. There are always improvements that can be made in terms of monitoring, accessibility, and the speed at which data can be accessed.
“Data must be managed properly before a data warehouse can provide a real return on investment”
— Stan Becton, President of Acton Burnell Inc.
At Glossier, one of our main goals is to be a data-driven company in order to provide an optimal customer experience. However, it takes time, effort and hard work to provide a single source of truth to help guide decision making. Sometimes a combination of experience, intuition and insights can go a long way.
So I hope this long read gave you a good idea of our experience so far building our data warehouse. Stay tuned for my next post, where I’ll go into more detail about how we use Luigi as the pipe for our data transformations. As a sneak peek, here is a screenshot of an update to one of our core tables, containing all of our orders data.
In the meantime, let me know what you think in the comments below! Have you had similar experiences building a data warehouse?
UPDATE: I’ve published part II of this article, which goes into the intricacies of how we’ve built our current pipeline using Luigi.
Charles-André Bouffard, or “CAB” as he is known to many, is a Senior Software Engineer within Glossier’s international Montreal-based team. A native of Quebec’s beautiful Laurentians, he’s a passionate developer, hackathon aficionado, and dreams of one day becoming a base-jumper.