A Startup’s Guide on Data Warehousing… to Do Big Data Stuff
Whether you’re an individual with a side project or a fledgling startup, one of the next steps you’ll probably want to take is looking at all the data you’re collecting. Immediately, there are some potential problems:
- Your data is stuck in your internal DB.
- Running large or complex SQL queries directly against the production database can slow down the experience for your users.
- The same issues apply if you want to point a dashboard at production.
- While there are a handful of vendor solutions and this is a problem being actively worked on, they may cost more than you're willing or able to spend at the time, or may still not be fully developed.
- If you use a mix of databases (Hologram.io uses both MySQL and PostgreSQL), a vendor may not support your database of choice.
Below I'll outline some simple techniques Hologram used to get all our internal and external data into one place.
Amazon Redshift as Data Warehouse
For the actual data warehouse, we chose Amazon Redshift, primarily for its speed and its number of third-party integrations. Redshift is becoming a data warehouse standard, which only encourages companies to centralize all of their disparate data sources. To learn more about using Redshift, check out this quick FAQ, Getting Started guide, and a wealth of additional documentation.
External Data Sources via Segment and Stitch
Like many tech companies, we use a variety of third-party tools for tasks like payment processing (Stripe) and customer support (Intercom). For piping data from all these apps into Redshift, Segment and Stitch are two of the top data pipeline companies. Hologram uses both, since they support different data sources and map third-party APIs into database tables and columns in different ways.
Internal Data Sources via Python Scripting and Amazon’s COPY Command
Now that we’ve got all our data from other apps into Redshift, we need to match it all up with our proprietary data.
To get our data into Redshift, I:
- Did an initial dump of the production database to Amazon’s S3 storage service, then used the Redshift COPY command to insert a copy of production into Redshift.
- Put together a quick Python script, run on a schedule via a cron job, that updates Redshift with new production rows throughout the day.
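As a sketch of that initial load, the COPY statement can be built up in Python before being executed against Redshift. The bucket name, file prefix, and IAM role below are hypothetical placeholders, not Hologram's actual setup:

```python
# Build a Redshift COPY statement for the initial bulk load from S3.
# The bucket, prefix, and IAM role are illustrative placeholders.
def build_copy_statement(table, bucket, prefix, iam_role):
    return (
        "COPY {table} "
        "FROM 's3://{bucket}/{prefix}' "
        "IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV "
        "IGNOREHEADER 1;"
    ).format(table=table, bucket=bucket, prefix=prefix, iam_role=iam_role)

print(build_copy_statement(
    "hologram.connections",
    "my-dump-bucket",
    "dumps/connections.csv",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
))
```

You'd then execute the returned string through your Redshift connection (e.g. a psycopg2 cursor).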
There are two major types of tables and approaches to updating data in Redshift:
- After seeding Redshift with a copy of production, only append new rows to tables. This works well for tables whose rows are never updated.
- Doing a full table drop and recreation of production data into Redshift. This keeps the data warehouse up to date on any rows that were updated and fixes any random rows missing between production and Redshift, but can take longer to run.
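For the append-only approach, one simple way to find which production rows are new is to look up the highest id already loaded into Redshift and only fetch rows beyond it. This is a minimal sketch; the table and column names are illustrative, and `redshift_cur` is assumed to be an open cursor on your Redshift connection:

```python
# Append-only sync sketch: fetch only rows newer than the highest id
# already loaded into Redshift. Names here are illustrative.
def get_max_loaded_id(redshift_cur):
    # COALESCE handles the first run, when the warehouse table is empty.
    redshift_cur.execute("SELECT COALESCE(MAX(id), 0) FROM hologram.connections")
    return redshift_cur.fetchone()[0]

def build_incremental_query(max_loaded_id):
    # Query to run against production for just the new rows.
    return (
        "SELECT id, created_at, user_id FROM connections "
        "WHERE id > {}".format(int(max_loaded_id))
    )
```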
Let's run through an example of copying a large batch of rows from an internal table into Redshift. Say we have a "connections" table in our production database with the columns id, created_at, and user_id, and we want to copy a few hundred thousand rows into Redshift.
In English, here is how we approached this with a Python script:
- Connect to your production database and get the rows from production to be copied into Redshift.
- Establish the Redshift connection.
- Set a batch size. These batches are how many rows from production will be inserted into Redshift at any one time.
- Generate the query string with all the values to insert into your table. For Redshift, this StackOverflow discussion has some examples of generating one long query string that inserts multiple rows in a single query, instead of multiple queries inserting one row at a time, for use with the Python "psycopg2" Postgres package.
- Execute the statement to insert your production data into Redshift.
For developers, here’s a sample code snippet of what this might look like:
```python
import MySQLdb
import MySQLdb.cursors

# These are methods on our ETL helper class; connection settings
# (self.mysql_host, etc.) are stored on the instance.

def get_linkscellular_data(self):
    # Pull the rows to be copied from the production MySQL database.
    conn = MySQLdb.connect(self.mysql_host, self.mysql_user,
                           self.mysql_password, self.mysql_dbname)
    query = """SELECT id, created_at, user_id FROM connections"""
    cursor = conn.cursor(MySQLdb.cursors.DictCursor)
    cursor.execute(query)
    connection_rows = cursor.fetchall()
    conn.close()
    return connection_rows

def insert_connections(self, connection_rows):
    # Insert the production rows into Redshift in batches.
    conn = self.start_connection()  # opens the Redshift (psycopg2) connection
    cur = conn.cursor()
    batchsize = 10000
    for i in range(0, len(connection_rows), batchsize):
        batch = connection_rows[i:i + batchsize]
        # Build one long VALUES string so each batch is a single INSERT.
        args_str = ','.join(
            cur.mogrify("(%s,%s,%s)",
                        (row['id'], row['created_at'], row['user_id'])).decode('utf-8')
            for row in batch)
        cur.execute("""INSERT INTO hologram.connections (id, created_at, user_id) VALUES """ + args_str)
    conn.commit()
    conn.close()

def etl_connections(self):
    connection_rows = self.get_linkscellular_data()
    self.insert_connections(connection_rows)
```
Success! Now you’ve got all your data centralized in one Redshift database. You can add dashboard and visualization apps on top of this data without slowing down your customer experience.
If you’ve gotten your data warehouse working with your internal data, some additional areas to look into are:
- Handling large tables whose rows are updated frequently.
- Using a CSV dump of production data to S3 plus the Redshift COPY command on an ongoing basis, instead of batched INSERTs.
- Handling casting between the different data types of your production database and data warehouse.
- Logging the speed of your script and any errors that occur.
- If your data needs to be real-time, or grows so large that this technique no longer works for you, you may be at the stage where an enterprise ETL (extract-transform-load) application makes sense.
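On the type-casting point, a small mapping from MySQL column types to Redshift types can drive warehouse table creation. The mapping below is a partial, illustrative sketch, not an exhaustive list; Redshift has no TEXT or BLOB storage, so long strings become bounded VARCHARs:

```python
# Partial, illustrative mapping of MySQL types to Redshift equivalents.
MYSQL_TO_REDSHIFT = {
    "tinyint": "SMALLINT",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
    "datetime": "TIMESTAMP",
    "text": "VARCHAR(65535)",  # Redshift's maximum VARCHAR length
}

def redshift_type(mysql_type):
    # Fall back to a generous VARCHAR when a type isn't mapped.
    return MYSQL_TO_REDSHIFT.get(mysql_type.lower(), "VARCHAR(256)")
```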
If you have any questions or want to share your data warehouse setup or ETL process, we’d love to hear from you at email@example.com.
Additional thanks to Brock over at http://www.strong.io/ for notes and review.