In part one, I defined the concept of discovery from data, outlined how the discovery process is currently grossly inefficient, and proposed a software solution for automating and accelerating discovery across all industries.
Our concept is to enable domain experts to ask their own deep questions and interpret the answers, without waiting to communicate with data scientists or waiting for them to run the analysis. Domain experts are thereby empowered to iterate quickly through question-and-answer cycles, maximizing the probability of high-impact discoveries. In other words, it takes minutes, not months.
The user interface must speak the language of the domain expert
This is where Excel, Tableau, SAS, and other business intelligence platforms fail immediately. Those platforms, while useful, provide only a generic user interface. As a result, they require extensive training and never relieve the user's fear of doing the wrong thing.
Fear of doing the wrong thing is a major blocker for many domain experts using data analytics tools. It cannot be overstated. Notice that industry-specific analytics platforms (e.g. Google Analytics or Ingenuity) are somewhat better at limiting this fear, because they directly speak the language of their users.
Our tag.cortex platform is designed to be used either as a full stack (our own user interface), or as a component inside another software stack (a customer’s or a partner’s user interface). Like Excel and Tableau, tag.cortex is industry-agnostic. However, unlike other generic analytics tools, tag.cortex is able to speak the industry-specific languages of domain experts. Here’s how.
Protocols are analysis modules, specific to each dataset, that are added into our platform. Protocols form the business layer that defines how the data are structured and how an analysis is run. Each protocol can answer a suite of related questions for a domain expert, depending on how the expert chooses to ask them.
To use an example from the NFL, a protocol can perform an analysis of an upcoming opponent. It answers the question “in which game situations does a particular opponent blitz?” The user gets to select the opponent team, and perhaps some optional game criteria (e.g. only on 2nd down, or only in scenarios where they are losing). The user then runs the protocol, and views the results.
At no point in this process does the protocol require the user to know how to query and structure the data, or know which statistical method or algorithm to run. Everything is handled for them automatically. The protocol uses NFL-specific language for describing what it does and how the user can specify optional game criteria.
Each protocol takes between a few minutes and a few hours to develop. For some datasets, our platform contains hundreds of protocols, each one able to answer a suite of high-value questions for the domain expert.
So who writes the protocols and adds them into the system? Software engineers and advanced domain experts do. It’s not very complicated, thanks to our tag.script system.
tag.script enables our users to design high-level scripts and protocols that answer deep questions — all without needing to know how to access the data or perform the statistics/algorithms required.
In a sense, tag.script is like SQL, except that it is written in JSON, has access to sophisticated statistical methods and algorithms, and its author never needs to worry about the complexity of a JOIN or a WHERE clause. Here's an example:
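A minimal sketch of what such a tag.script definition might look like. The field names (method, background, focus) come from the description that follows; the exact syntax, filter keys, and the `$user.*` placeholder convention are illustrative assumptions, not tag.cortex's actual schema:

```json
{
  "method": "tag-comparison",
  "background": {
    "description": "All plays against the chosen opponent",
    "filter": { "defense_team": "$user.opponent" }
  },
  "focus": {
    "description": "Plays where the defense blitzed",
    "filter": { "defense_team": "$user.opponent", "defense_tactic": "blitz" }
  }
}
```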
- method — this is set to “tag-comparison”, which will compare the set of NFL plays defined in the focus, to the set of NFL plays defined in the background.
- background — this is the set of all NFL plays to compare against. It will be set by the user when they choose an opponent team.
- focus — this is the set of all NFL plays about which we want to ask “what is special about these?”, in comparison to the background plays. In this case, we’re focusing on plays where the defense used a blitz tactic.
When run, this tag.script method will automatically scan all other variables in the dataset to find those that are most distinctively different when the user-specified defense blitzed, compared to all plays against that defense.
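Conceptually, that scan is a standard statistical comparison. Here is a minimal Python sketch of the idea, not the actual tag.cortex implementation; the play representation (a dict per play) and the choice of a two-proportion z-test are illustrative assumptions:

```python
import math
from collections import Counter

def tag_comparison(focus, background, variables):
    """Rank each (variable, value) pair by how distinctively its rate
    in `focus` differs from its rate in `background`, using a
    two-proportion z-test as the measure of confidence."""
    results = []
    n_f, n_b = len(focus), len(background)
    for var in variables:
        f_counts = Counter(play[var] for play in focus)
        b_counts = Counter(play[var] for play in background)
        for value in set(f_counts) | set(b_counts):
            p_f = f_counts[value] / n_f          # rate in focus plays
            p_b = b_counts[value] / n_b          # rate in background plays
            # Pooled proportion and standard error for the z statistic.
            p = (f_counts[value] + b_counts[value]) / (n_f + n_b)
            se = math.sqrt(p * (1 - p) * (1 / n_f + 1 / n_b))
            z = (p_f - p_b) / se if se > 0 else 0.0
            results.append((var, value, p_f, p_b, z))
    # The most distinctive differences "bubble to the top".
    return sorted(results, key=lambda r: abs(r[4]), reverse=True)
```

For example, if 2nd-down plays are far more common among blitzes than among all plays against that defense, the pair ("down", 2) would rank near the top.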
The domain expert must be able to understand the answers
Not only must domain experts be able to configure and run an analysis, they must also be able to understand the answers. Each result therefore has to be designed to speak the language of the domain expert, with no hidden complexity or subjectivity involved.
To accomplish this, tag.cortex uses a combination of features:
- Clearly displayed averages and percentages. Data visualization can seem useful, but when it comes to generating objective insight, nothing is more effective than numbers that the domain expert can clearly understand.
- Results ranked by strength of confidence. This lets the user know that the most significant results will “bubble to the top”. Alternatively, results can be sorted and filtered by other metrics, including size of effect.
- Dynamic text that explains the result in the domain expert’s language. Each protocol is able to define sentence templates that adjust dynamically with each variable analyzed and with parameters specified by the user. These sentences are displayed next to each result, and greatly improve interpretability.
- Heat map visualization for spatial results. In many situations the results from an analysis have some form of spatial meaning — this is particularly the case in sports data (e.g. zones of the field). For spatial results our platform employs a heat map visualization, with color coding for statistical patterns.
- The ability to drill down into specific events. Users can examine the factors that contributed to the significance of each result, and use links to connect to external resources for further investigation (e.g. third party references, video databases, Google searches, etc.).
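To illustrate the dynamic-text idea, here is a minimal sketch of how a protocol-defined sentence template might be filled in per result. The template wording, field names, and numbers are hypothetical, not tag.cortex's actual API:

```python
# A hypothetical sentence template a protocol author might define.
TEMPLATE = ("Against {opponent}, the defense blitzed on {pct:.0f}% of "
            "{situation} plays, compared to {base_pct:.0f}% overall.")

def render_result(opponent, situation, pct, base_pct):
    """Fill the template with values from one analysis result."""
    return TEMPLATE.format(opponent=opponent, situation=situation,
                           pct=pct, base_pct=base_pct)

print(render_result("the Bears", "2nd-down", 38.0, 24.0))
# Against the Bears, the defense blitzed on 38% of 2nd-down plays,
# compared to 24% overall.
```

Because the template adjusts with each variable and each user-specified parameter, every result reads as a plain-language statement rather than a bare statistic.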
The platform must be able to connect to any data source
Every dataset has its own intricacies, and every data warehouse is designed and optimized in its own way. We knew from the start that developing one-size-fits-all data source connectors would cause headaches in implementation and underperform at scale.
Instead, the tag.cortex platform offers the tag.data SDK, which can be customized for the specific contours of each data source. The tag.data SDK even allows analysis computations to be performed closer to where the data lives, outside of the tag.cortex platform.
Alternatively, one can simply use our standard in-memory data structure which has been tuned for memory consumption and computing performance.
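To make the connector idea concrete, here is a hypothetical sketch of what a tag.data-style contract could look like. The class and method names are assumptions for illustration, not the actual SDK; the point is that every source implements the same small interface, and a source can override methods to push computation down to where the data lives:

```python
from abc import ABC, abstractmethod

class DataConnector(ABC):
    """Hypothetical connector contract: the platform only ever talks
    to this interface, never to the warehouse's own layout."""

    @abstractmethod
    def fetch(self, filters):
        """Return the rows matching `filters` (a dict of key/value pairs)."""

    def count(self, filters):
        """Default: pull matching rows and count them. A warehouse
        connector could override this to run the count at the source."""
        return len(self.fetch(filters))

class InMemoryConnector(DataConnector):
    """Default backing store: a simple in-memory list of row dicts."""
    def __init__(self, rows):
        self.rows = rows

    def fetch(self, filters):
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in filters.items())]
```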
Putting this all together, the architecture of tag.cortex can be represented by the image below:
There’s still a lot more to discuss, including:
- Using the tag.cortex platform as an API
- Sharing of reports and collaboration between domain experts
- Reproducibility of reports and analyses
- Extending analysis functionality with the tag.algo SDK
- Graph structures and sequences
I’ll have to save those for another time. Thanks for reading, and please let me know if you have any questions.