Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterprise: The SEAL Method
Like many, I was introduced to the field of data science through a combination of schooling and on-the-job learning — including work in the precursory space of data mining before artificial intelligence and machine learning became mainstream. When we first built analytic profiling, prediction, and recommendation engines for Fortune 500 enterprises such as Comcast, Verizon, and GEICO, for example, we blended internally available consumer clickstreams, demographics, and account info to determine which clientele generated the most sales, the highest profits, the least maintenance, and the least attrition. Then we defined patterns and personas for those customers and targeted more of them (or fewer, depending on the metrics). We also identified which businesses or consumers needed specific kinds of help or collaboration in order to convert or retain them.
The stories go on, of course. Heaps of data like this seemed quite large at the time — so “data mining” was suitable lingo to explain how value was derived from a sea of disparate data. Let’s bookmark this picture for later reference.
Now fast-forward some years: data science has matured to cover a wider variety of prediction and recommendation use cases spanning much broader KPIs, many of which thrive in real time, e.g., dynamic personalization, fraud detection, risk management, and other situations where instant feedback is critical to mission objectives. Sophisticated goals must often be addressed at massive scale, where performance is crucial whether or not real time is required. Fulfillment is largely driven by improved selection of statistical models based on speed and accuracy (core tenets of data science) and by augmenting the scope of data processing with big data engineering. Much of the data beyond our “bookmark” (above) is typically available within an organization — think customer service records, marketing campaign results, financial payment history, etc. And where budgets allow, especially in medium to large enterprises, data is extended with third-party brokered sources that vary by market and use case: LiveRamp/Acxiom, Experian, CoreLogic, IRI, Datalogix/Oracle, and Nielsen, to name a few.
Above and beyond all this, data science outcomes can also drive actionable workflows via alerting and improvement schemes that can be autonomous, perpetual, and adaptive. Let’s consider the AI/ML example below as one of many possible scenarios for a data science lifecycle.
Hybrid Machine Learning Example — Airport Security
This happens to be a hybrid use case involving both insight and action. The insight is rooted in deep learning image recognition. Action stems from a recommendation engine to expose and quell an airport security threat. There is also a self-learning loop to improve over time.
Our basic goal is to detect aggregate risk and perform effective remediation commensurate with that risk and other circumstances. Say different parts of a weapon pass through X-ray scanners without alarm because their shapes and materials appear “relatively” harmless at the individual item level. Let’s emphasize the word “relatively,” since a risk index can be applied across all bags admitted on a given route. As a third person’s bag is scanned, however, the cumulative index is exceeded, revealing a macro pattern that triangulates with two earlier bags. Analysis should span multiple airports since passengers can board a multi-leg flight midway. Once onboard, the three parts could be assembled to form a weapon. Based on the object type and perhaps other factors such as location and associated passengers (automatically excluding a known security marshal, for instance), a threat level and recommended actions are escalated to security personnel who verify and remediate as needed. Automated lockdown or other triggers can also be activated. Findings can then be validated and included in a feedback loop to improve the machine learning process over time.
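To make the cumulative-index idea concrete, here is a minimal Python sketch. Everything in it — the class, the thresholds, and the risk scores — is an illustrative assumption, not drawn from any real screening system:

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; a real screening system would tune these.
ITEM_ALARM = 0.7    # a single item above this would alarm on its own
ROUTE_ALARM = 1.5   # aggregate threshold across all bags on one route

@dataclass
class RouteRiskIndex:
    """Accumulates item-level risk scores across bags admitted on a route."""
    cumulative: float = 0.0
    bags: list = field(default_factory=list)

    def scan(self, bag_id: str, item_risk: float) -> bool:
        """Record one scanned item; return True once the route threshold is exceeded."""
        self.cumulative += item_risk
        self.bags.append(bag_id)
        return self.cumulative > ROUTE_ALARM

route = RouteRiskIndex()
route.scan("bag-1", 0.6)          # harmless alone: 0.6 < ITEM_ALARM
route.scan("bag-2", 0.6)          # cumulative 1.2, still below ROUTE_ALARM
alert = route.scan("bag-3", 0.6)  # cumulative 1.8 > 1.5: triangulates all three bags
print(alert, route.bags)
```

Note that the alert carries the full list of contributing bags, which is what lets downstream remediation target all three passengers rather than only the last one scanned.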
This is a movie-caliber example of an artificial intelligence ecosystem that is light years ahead of the “bookmark” we set above. Pardon my combined background with Warner Bros (entertainment) and Government/DoD. This dual backdrop occasionally seeps into whitepapers and other daily work :) But there are countless real examples along these lines in the general business world of KPI alerting and more. With this in mind, our stage is now set to discuss how to solutionize both simple and complex analytic challenges.
Old and New Approaches
Initial data mining projects required some kind of methodology to drive reproducible success. Most will agree the de facto leader in this arena has been the CRoss Industry Standard Process for Data Mining (CRISP-DM), depicted in the first diagram below.
CRISP-DM was devised in 1996 to handle the typical flow of one-off studies that executive stakeholders often required of statisticians and early data scientists. Demand has clearly become more advanced over the past 20+ years, and what was “typical” has evolved considerably.
Let’s identify and fill the critical gaps in CRISP-DM with a modern, proven approach called the Scalable Enterprise Analytics Lifecycle (see SEAL diagram below).
Let’s break down SEAL into the four major sections labeled with blue titles. A couple of additional feedback loops are also addressed below, e.g., to ensure overall accuracy and continual improvement.
Define Charter including Goals & SLAs
Beyond the understanding of business and data (this is where CRISP-DM stops), SEAL squarely defines analytic goals and any applicable SLAs in an analytics Charter.
Wrangle & Refine
CRISP-DM was founded on one-off data mining initiatives based largely on study-driven data sets which must be cleaned and “prepared.” SEAL formally adds a step to integrate and “blend” a wide variety of data as discussed in “The Backdrop” above. With the advent and sustained success of Democratized or Citizen Data Science (à la Gartner), SEAL also adds an “Analytic Solutioning” step to incorporate conventional analytics into our approach. This promotes sharing of data assets and relevant workflows — avoiding the classic data silos and swamps that otherwise arise. Citizen Data Science also maximizes the economy of scale, productivity, and job satisfaction of your workforce.
Fulfill & Validate Charter
In addition to model evaluation (in CRISP-DM), SEAL adds steps to assess and validate performance in light of whatever scale is anticipated in the foreseeable future. If that horizon is too far out or impractical to forecast, an elastic environment with auto-scaling is strongly preferred. Using cloud infrastructure does not guarantee sufficient scalability, which can vary by architecture or component type, so alternatives must often be considered and balanced with other objectives such as data consistency, durability, security, and more.

Participants are also encouraged to tag their code, configuration files, and agile story cards (e.g., with #performance) to keep track of potential technical debt that will need to be addressed as growth occurs. This prevents the typical syndrome where things slow down and you need to engage an expensive expert to perform the heroic deed of speeding things up. By tracking our dust as it occurs, we can assign more economical resources to these tasks — and do so proactively. Tagging often allows a junior person to accomplish senior feats — increasing job satisfaction, which ripples into employee retention. This may seem way beyond the scope of a technical methodology, but it is highly relevant in today’s enterprises, where employee tenure is often limited to a few years, if that.
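As an illustration of the tagging practice, a small script can keep an inventory of #performance tags so the debt backlog stays visible to sprint planning. The function name and the set of scanned file extensions are assumptions you would adapt to your own stack:

```python
import re
from pathlib import Path

PERF_TAG = re.compile(r"#performance\b")
SCAN_SUFFIXES = {".py", ".sql", ".yaml", ".yml", ".cfg"}  # adjust to your stack

def find_performance_debt(root):
    """Return (file, line number, line text) for every #performance tag under root."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in SCAN_SUFFIXES or not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PERF_TAG.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

# Example usage: feed the inventory into planning before growth bites.
# for file, lineno, text in find_performance_debt("repo/"):
#     print(f"{file}:{lineno}: {text}")
```

The same idea extends to story cards via your tracker’s API; the point is that the debt is enumerable on demand rather than rediscovered during a crisis.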
Another callout in this section is the Actualize/Visualize box. Neither is included in CRISP-DM. The airport scenario described above includes actualization/workflow steps that are not conventional visuals (analytics charts or graphs).
Implement & Improve
This section is clearly way beyond the CRISP-DM foundation which dead-ends at Deployment. SEAL adds DevSecOps to the picture, including everything a formal enterprise needs to ensure and sustain success — from integration and security to monitoring for model drift and other data anomalies. At first glance, quality and security look stapled onto the end here, but a proper agile card in SEAL incorporates provisions to address Quality by Design (QbD), Security by Design (SbD), and Privacy by Design (PbD) from the inception of each story.
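As one concrete example of the post-deployment monitoring SEAL calls for, the Population Stability Index (PSI) is a widely used metric for detecting drift between a model’s training-time distribution and live traffic. The bucket proportions and the 0.2 rule of thumb below are illustrative assumptions:

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two distributions given as bucket proportions.

    'expected' is the baseline (e.g., training data); 'actual' is live data.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty buckets
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
live     = [0.10, 0.20, 0.30, 0.40]   # same feature in production today
psi = population_stability_index(baseline, live)
print(f"PSI = {psi:.3f}")
# Common rule of thumb (varies by team): PSI > 0.2 warrants investigation
```

A scheduled job computing this per feature, with alerts wired into DevSecOps monitoring, is one small instance of the “monitoring for model drift and other data anomalies” described above.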
Additional Feedback Loops
CRISP-DM already has reasonably placed iteration/feedback loops, but if you look closer, there are gaps that prevent growth and continual improvement. SEAL adds upstream feedback directly from the Wrangle & Refine box to the Define Charter box at the top. Data Prep and Blend activities are generally intensive, and there is no reason to wait until Model Evaluation (at the bottom) to update our Business and Data Understanding.
SEAL also adds two broad “Optimize” loops. The one branching from Fulfill & Validate Charter (at the bottom) ensures that discoveries made during model, scale, and visual assessments can streamline directly to the core analytic activities without restarting from scratch at the top (restarting remains an option, as it is in CRISP-DM, which only loops back from model evaluation).
The other optimization loops from the Implement & Improve box back to the Charter at the top. This becomes obvious in SEAL with the addition of post-deployment operations and monitoring. It goes well beyond a basic retrospective, and can include automated feedback in addition to human centered learning. For example, data anomaly monitoring can raise flags that can generate upstream data corrections (ideally) and/or visual alerts on a Data Readiness Dashboard of some kind. This way, inaccurate data is not inadvertently used until remediation is performed. This is another typical pitfall that most systems overlook.
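A minimal sketch of that “do not consume until remediated” gate follows; the class, method names, and flag reasons are hypothetical:

```python
from datetime import datetime, timezone

class DataReadinessGate:
    """Quarantines datasets flagged by anomaly monitors until remediation."""

    def __init__(self):
        self._flags = {}  # dataset name -> (reason, flagged_at)

    def flag(self, dataset, reason):
        # Called by a monitor, e.g., on a row-count drop or schema drift
        self._flags[dataset] = (reason, datetime.now(timezone.utc))

    def clear(self, dataset):
        # Called once the upstream correction has landed
        self._flags.pop(dataset, None)

    def is_ready(self, dataset):
        return dataset not in self._flags

    def dashboard(self):
        """Rows for a simple Data Readiness Dashboard view."""
        return [(name, reason, at.isoformat())
                for name, (reason, at) in self._flags.items()]

gate = DataReadinessGate()
gate.flag("sales_daily", "row count dropped 40% vs 7-day average")
if not gate.is_ready("sales_daily"):
    pass  # downstream jobs skip the dataset instead of consuming bad data
gate.clear("sales_daily")
```

Checking the gate at the top of each downstream job is what keeps inaccurate data from being used inadvertently; the dashboard rows make the quarantine visible to humans in the meantime.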
Conclusion & Info on Other Industry Methods
SEAL is not the only method that has surfaced the need to modernize. It is, however, the only proven approach to date that fills all major gaps in CRISP-DM, and it is especially relevant where big data (volume, variety, velocity, etc.), DevSecOps, integrations, and support are vital to ongoing success.
In recent years (earlier for some), people have been supplementing CRISP-DM with a collection of bolt-on techniques. Because the core of CRISP-DM provides an intuitive fulcrum, it remains a popular centerpiece even today. SEAL has moved the centerpiece to a cornerstone, so to speak. Other authors have published enhancements as well, most notably SAS Institute’s SEMMA (Sample, Explore, Modify, Model, Assess) and IBM’s ASUM-DM (Analytics Solutions Unified Method). These approaches have not widely caught on in a generic sense — many say because people associate them with the proprietary tool sets of their respective vendors. Another reason could be that, while they fill certain gaps in CRISP-DM, more remain to be addressed, especially for use cases in today’s medium to large enterprises. While SEMMA adds visualization and some details in the data exploration and modeling phases, it lacks a formal deployment process and other scalability, integration, security, and operational activities needed in a holistic enterprise. ASUM-DM includes visualization (like SEMMA) plus operational support as well; it is significantly ahead of SEMMA, imho. But ASUM-DM lacks the scalability and security assessments required in most big data environments, as well as post-production specifics such as monitoring for model drift and other data anomalies. Both SEMMA and ASUM-DM lack support for SLAs (they have no formal Charter), as well as Democratized/Citizen Data Science, which has a plethora of critical benefits as noted above.
Imho, Microsoft has arguably the most detailed modernization attempt so far, and it is excellent overall. It is called TDSP (Team Data Science Process lifecycle). While not part of the acronym, they append the word “lifecycle”… because that’s what it is. Lifecycles imply the delivery of something valuable, beyond a series of steps to an end. SEAL is the only other mainstream analytics lifecycle to date with this built-in theme.
Aside from TDSP’s lack of support for SLAs (it has nothing along the lines of a Charter) and Democratized/Citizen Data Science, there is a lot to like here. For example, the specifics offered in its sub-boxes convey a lot of information in a small space — especially the purple ones pertaining to AI/ML feature engineering and model refinement. The gray section on data acquisition and understanding is also relatively detailed. But for some reason it retains the limited CRISP-DM mindset of a single “data source.”
The Intelligent Applications cog in the blue section is an innovation worth calling out. With that one object, TDSP represents the hypothetical airport scenario described above. SEAL currently requires narrative to express the idea. So you see… everything can be improved :)
Future Improvements and Credits
(Article Version: 1.1.10)
Additional details are available in the deck for a related live presentation at DataCon Oct 23–25, 2020 (www.DataConLA.com). My presentation is on Oct 25 @ 10am PDT and is called “Modernizing the Analytics and Data Science Lifecycle for the Scalable Enterprise: The SEAL Method.”
Future versions of this article will incorporate content from the DataCon deck (above) and elaborate on certain innovations and best practices, as shown in the following diagram with “Callouts” in purple.
A few of the callouts will align with the purple boxes in TDSP, as noted in the Conclusion above. I also like TDSP’s Intelligent Applications cog. That single visual represents my hypothetical airport scenario in a nutshell (see The Backdrop section). I love nutshells :) Ideas like this that have significant and diverse use cases are often worth depicting visually.
Data Sharing is a modern core of virtually every enterprise and a personal favorite topic (see, e.g., Gartner’s coverage and the older academic Wikipedia definition). I’ll probably include deeper coverage of data sharing practices in SEAL’s model. This consumption-based sharing will transcend the “sharing of data assets and relevant workflows” between analysts and data scientists currently touched on in the Wrangle & Refine section. Enterprise and multi-enterprise consumer sharing starts in the Charter, but is most prevalent in the Fulfillment and Implementation sections of SEAL. From a governance perspective, “need to share” is typically balanced by “need to know.”
Anyone and everyone is encouraged to share ideas for the refinement and evolution of SEAL. Future updates to this article will contain a Credits section to highlight contributors. Please see contact info below.
About the Author
Jeffrey Bertman is a “data everything” specialist with major success stories in organizations ranging from small firms to Fortune 100/1000 and government, including Warner Brothers, Verizon, Comcast, GEICO, Airlines, Cigna, DoD, DoJ, FDA, and Gov Intel. He serves as CTO and lead data scientist/engineer for Dfuse Technologies, a world class consultancy with a boutique data practice and a nationwide footprint across the private and public sectors, e.g., Apple, Wells Fargo, Amgen, CVS Health, Deloitte, and numerous government agencies.
From strategy through operations and mentoring, Mr. Bertman leverages leading edge technologies to deliver high yield, enduring success stories that actualize real world improvement in market share, revenue, profit, quality, security, efficiencies, effectiveness, and more.
Mr. Bertman is a popular speaker on a wide range of technical and management topics. Business disciplines include accounting/financials, marketing, sales, social media, entertainment, digital transformation, legal, telecom, health care, e-collaboration, manufacturing, distribution, and inventory optimization.
Jeff Bertman • mobile +1 818–321–3111 • Jeff.TechBreeze@gmail.com • www.linkedin.com/in/jeffbertman • www.medium.com/@techbreeze • WhatsApp, Zoom, MS Teams, Slack, Discord, etc available upon request • Dfuse Technologies (www.dfusetech.com) work email and other info also available.