Image for post
Image for post

Many people working in engineering, or engineering-adjacent, fields (including data) may have heard of the Go programming language. If you haven’t, Go is a compiled, statically typed programming language with garbage collection, memory safety, and a built-in system for CSP-style concurrency. Go was developed as an open source project, largely designed and directed by a team at Google, including some very well-known names like Rob Pike and Ken Thompson.

While Go is unlikely to displace languages like Python (especially), R, and Java/Scala from their central role in the data ecosystem in the immediate future, many teams have found it especially well-suited to infrastructure and service-oriented work—important parts of the data world. …


Our Sibyl API can serve AI predictions with a 50 millisecond response time. A developer calling the Sibyl API only has to specify an object ID (e.g., give me the prediction for person 0989k3dd84) instead of specifying all of the model’s input parameters. Sibyl supports many predictors with many versions each, and it offers data scientists a robust workflow for model deployment.

Introducing Sibyl

Image for post
Image for post
In this vague illustration of what it takes to build an AI model, Sibyl is focused on the last step (deploying AI models for production)

Creating and productionizing a useful AI system involves at least three stages:

  1. Prepare the data that the AI will learn from. This is more complicated than merely pointing an AI system at your database and will likely involve a Data Scientist carefully processing the data to maximize the AI’s ability to learn from it. …

In February, we published Automatically create a pivot table in Redshift. This is an updated version of that article, now that we’ve switched to Snowflake, where we use the same technique.

We’ve open sourced the solution described in this article here.

Creating pivot tables is a relatively common need, as seen by its popularity in software like Microsoft Excel and Google Sheets. Some databases (including Snowflake) include a pivot function, but they’re too static — you have to provide an explicit list of values rather than deriving them dynamically from the data. We built a small tool to create pivoted database tables. …


Image for post
Image for post

TL;DR

We published a small tool you can use to import semi-structured data from Google Sheets to Snowflake, taking advantage of Snowflake’s variant type.

Background

There are a variety of cases where you might want to import data provided by business users into your data warehouse — for example, inventory cost information or logs of marketing spend. These probably won’t come directly from your application, or from a third-party integration, but rather from the operations team, the finance team, the growth team, etc.

If your organization uses Google Sheets a lot, that’s probably a fairly convenient way for business users and input and manage the data. …


An Analysis of Causality

N.B., this is just for fun. We don’t really believe that Mercury being in retrograde caused these events, even though that is exactly what happened.

Image for post
Image for post
true causality

TL;DR

Mercury was in retrograde recently and it really made our lives difficult here on the Data team at Ro.

Background

When a planet is “in retrograde,” it means that it appears to be moving backward through the sky. Of course, it’s not really moving backwards — it’s just a trick of the planet’s speed and position relative to us. …


In Redshift, the type of LISTAGG is varchar(65535), which can cause
large aggregations using it to consume a lot of memory and spill to disk during processing.

Image for post
Image for post

Background

LISTAGG is a convenient way to aggregate a group of values by concatenating
them together. It supports DISTINCT, a delimiter, and a within-group ordering.

We’ll go through a few examples of using LISTAGG and then talk about a
situation we ran into where it resulted in much higher memory usage than you might expect without knowing any of its implementation details.

Given this example table:

Image for post
Image for post

Here are a few examples using LISTAGG:

select
some_id,
listagg(fruit) as fruit_list
from
fruits
group by
some_id
order by…


(article #3 in series)

For background on our Ro Looker Code Generation System, please see our intro article.

Image for post
Image for post

TL;DR

Views in Looker are often based closely on underlying database tables, with
predictable relationships between database columns and view dimensions. This means we can automatically generate large portions of views.

Background

In our last post about our system for generating LookML, we gave a
high-level overview of our motivation and goals. This time, we’ll go into a bit
more detail about the basic functionality we use to generate a Looker view
corresponding to a database table.

Looker actually has something similar to this built in — Create View From
Table — but it’s not exactly what we want. Create View From Table is a one-time thing: it produces a LookML file which you then maintain as LookML going forward. We want a process that we can configure and re-run at any time to pull in new columns without needing to copy over our changes and additions. As we’ll get into a bit in this post, and more in later posts, we also leverage this process to add several other important features. However, today we’ll focus on the basics of generating and customizing a view. …


(article #2 in series)

For background on our Ro Looker Code Generation System, please see our intro article.

TL;DR

Dimension time groups are important and useful, so we wanted to make them as easy as possible to define. In our LookML-generating system, we merely specify whether to generate a “low”, “medium”, or “high” number of dimensions, rather than listing each timeframe value individually.

Background

In Looker, you usually expose a date via dimension_group with type:
time
. As part of this, you specify timeframes, and Looker generates a
separate dimension for each of them.

For example, given a dimension_group like:

view: orders {
# ...
dimension_group: date_fulfilled {
group_label: "WHEN"
type: time
datatype: datetime
sql: ${TABLE}.created_at ;;
timeframes: [
date,
day_of_month,
day_of_week,
hour,
month,
quarter,
week,
year
]
}…


The open-source repo for this tool can be found here.

Creating pivot tables is a relatively common need, as seen by its popularity in software like Microsoft Excel and Google Sheets. We built a small tool to create pivoted database tables. It can be used as a standalone tool or integrated into ETL pipelines.

Here’s an example of what a simple pivot table transformation might look like:

Image for post
Image for post

Background

At Ro, our business and operations are growing quickly in terms of both scale and complexity. We launched at the end of October 2017 with a single brand (Roman) treating a single condition (erectile dysfunction), but in the last few weeks we’ve added a new brand (Zero, for smoking cessation) and several conditions under Roman (hair loss, cold sores, genital herpes, and premature ejaculation), with more on the way. …


The open-source repo for this tool can be found here.

When building a data warehouse or datamart, you often want to define derived tables in terms of SELECT statements, preferably without needing explicit DDL (i.e. CREATE TABLE) but without giving up the ability to use primary keys, foreign keys, or important database-specific features like Amazon Redshift’s DISTSTYLE, DISTKEY, and SORTKEY. We built a small tool at Ro that makes this easy to do without needing to pull in a larger data pipeline-management tool.

Image for post
Image for post

Background

At Ro, our business and operations are growing quickly in terms of both scale and complexity. We launched at the end of October 2017 with a single brand (Roman) treating a single condition (erectile dysfunction), but in the last few weeks we’ve added a new brand (Zero, for smoking cessation) and several conditions under Roman (hair loss, cold sores, genital herpes, and premature ejaculation), with more on the way. …

John Mastro

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store