Photo from Mashable article on Nvidia’s machine learning model for celebrity face generation. http://mashable.com/2017/10/30/ai-machine-uses-celebrity-photos-to-create-fake-faces/#nXT75nSPbmq0

10 Themes Observed in Data & Analytics in 2017

Kyle Roemer · Published in State of Analytics · 8 min read · Dec 18, 2017


2017 has come and gone. Looking back, the year was filled with promising technologies and a feverish hype pitch around AI and Machine Learning. We saw the big 3 cloud providers (AWS, GCP, Azure) release major steps forward in Data Science and Machine Learning tooling (more on that below). We continue to see a talent shortage in this space, and it’s only getting worse as Data & Analytics becomes an integral fiber of an organization (but universities are catching up!).

When I look back at my original predictions for 2017, there were some (e.g., GCP playing catch-up, Data Quality & Governance, Experimentation) that have rung true and others that have not (Combining Clickstream + Rev / Sales Data). While those predictions are interesting, there is much more happening in the space…

Now, let’s move on to those broader themes from 2017 that will carry into 2018:

1. Data Science without Data Engineering is a Problem

Does this sound familiar in your organization? Data scientists build their models and then hand them off to a separate engineering group for deployment and productionalization.

We see this EVERYWHERE. This model of development can be successful, but it often is not, given the potential challenges: models change over time, engineering has to interpret work it didn’t create, data scientist and engineer must collaborate closely, and models are often built without scale in mind.

Encourage your data scientists to better understand data engineering and the deployment of their models → up-level your data scientists into data savants.

The good news is that cloud providers are making this easier than ever with services (like Google Cloud ML Engine, AWS SageMaker & Azure ML), but that doesn’t remove the need for data scientists to understand both how deployments work and the performance / scale considerations in their model and approach. These services strip out a lot of the “work,” but it is yet to be determined to what degree, and how difficult it will be to bring more complex modeling and scoring scenarios to these services.
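To make the handoff problem concrete, here is a minimal sketch of what “productionalizing” even a trivial model involves, assuming scikit-learn and Flask; the endpoint path, feature count, and file name are hypothetical illustrations, not anyone’s actual deployment:

# A minimal sketch of serving a model, assuming scikit-learn and Flask.
# Endpoint path, feature count, and model file name are hypothetical.
import joblib
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model -- in practice this is the data scientist's artifact.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
joblib.dump(LogisticRegression().fit(X, y), "model.pkl")

app = Flask(__name__)
model = joblib.load("model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects {"features": [f1, f2, f3, f4]}. Input validation, versioning,
    # and scale are exactly the parts that get lost in the handoff.
    features = request.get_json()["features"]
    score = model.predict_proba([features])[0][1]
    return jsonify({"score": float(score)})

if __name__ == "__main__":
    app.run(port=8080)

Everything the managed services add on top of this sketch (autoscaling, model versioning, monitoring) is the “work” being stripped out, which is why data scientists still need to understand what it replaces.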

2. “With great power comes great responsibility” — stats education paramount in the age of drag & drop machine learning

Tools like SageMaker, Google Cloud ML and Azure ML are bringing Machine Learning (ML) to the masses, but foundational knowledge of how to interpret and assess models is more important than ever.

We’ve seen this trend before in other areas of data, like Data Visualization. Tools like Tableau and Qlik made it very easy to drag and drop visualizations and analysis. We then saw a proliferation of bad data viz, requiring more education and training on principles.

This will become just as important with these types of ML services, as interpreting the results of a model and assessing models requires education in stats and math. Formal university study isn’t necessarily required, as there are some strong on-demand courses out there. These short-term courses, however, only go so far in preparing a data scientist for the actual work itself. My fear here is that training becomes watered down given how accessible these services are making the field.

The punchline: if your data scientists or analysts are using these services, please question the approach, the validation, and how they’re interpreting the results of their model.
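As a sketch of the kind of scrutiny I mean, here is what questioning a model’s validation might look like in scikit-learn; the dataset is synthetic and the numbers are illustrative, not from any real project:

# A sketch of validating a model beyond headline accuracy, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# A synthetic, imbalanced dataset (90% / 10% classes).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Accuracy alone is misleading here: always predicting the majority
# class would already score about 90%.
print("accuracy:", model.score(X_test, y_test))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("confusion matrix:\n", confusion_matrix(y_test, model.predict(X_test)))

# Cross-validation guards against one optimistic train/test split.
print("5-fold AUC:", cross_val_score(model, X, y, cv=5,
                                     scoring="roc_auc").mean())

None of this is exotic, which is the point: it’s the baseline stats literacy that drag-and-drop services quietly assume you have.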

3. The Pace of Innovation at AWS, GCP and Azure is Staggering

2017 was the year where major competition and service parity took shape across AWS, GCP and Azure. Each of these cloud providers threw haymakers at one another at their respective conferences and in multiple product announcements throughout the year. This competition is fantastic for everyone, and in a world where enterprises are moving to hybrid / multi-cloud ecosystems, it’s reassuring to know you’re in good hands with each provider.

If you’re eager to learn more about what was released this year, check out the article by my colleague Andrew Tubley on GCP, or the one by my colleague Jeremy Gilmore on AWS.

This brings me to an interesting question, given all this innovation from the cloud providers: how do tools solving a single step in the process compete and innovate to stay relevant? Looking at tools like Tableau, Alteryx, DataRobot, [insert all other competitors doing data viz, data eng, or data science], how do they augment their offerings to become more “platform-like”?

4. If you Lead a Data or BI or Analytics Group, Here’s a Friendly Warning…

You need to ensure you’ve created a data ecosystem that is cloud-first and provides all the leading tools to your data engineers, data scientists and analysts. Otherwise, you will be out of work in 2018.

I’ve started to see this in multiple places this year, even at “Tech” companies, where IT leaders are clinging to legacy technologies and not meeting the pace needed from their stakeholders / users. Look, I get it: it’s what you know, and you have a whole large team, onshore and offshore, who built a platform over a decade in [Insert legacy DB & Tool]. It’s time to move on…

If you are promising your users that you’ll be enabling a new ecosystem, but the POCs are taking months, you have another problem to deal with. If you don’t more rapidly enable these services, then you’ll find yourself in a situation where those stakeholders / users are “going rogue” and deploying their own. I don’t blame them, it’s quite easy to get something up and running with AWS, GCP or Azure.

If your team lacks the skills with these new technologies, get them trained up, and quickly! (It’s critical; you can’t take a group well versed only in Oracle and Informatica and expect them to understand the paradigm changes with BigQuery and Airflow, as an example.) Otherwise, reach out and I’m happy to have a chat about how you can move more quickly.

5. Clean, Governed Data is Still the Biggest Challenge for Companies

I’ve written about this a few times, but I would love to see source / business applications get better and smarter at ensuring clean data is created, updated and removed in their applications. Could you imagine if Salesforce were smarter (out of the box / cloud) at ensuring the data entered (new accts, opptys, etc.) was correct?

That would allow companies to be less reliant on downstream tools for that cleanup. While cleanup tools will still be needed in some form, cleaner upstream data should mean cleaner downstream data.

Now, making source business applications smarter about clean data is just one step. Data also needs to be governed in those systems and across the business processes that touch it. This is hard for many reasons, but a big one is that we are undisciplined and don’t make it easy for business folks to be disciplined. So, as you start to govern and put controls in place for clean data, remember that everyone will lack some degree of discipline, so make it easy for them.
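As a toy sketch of what validating at the point of entry might look like, here is a simple rule check on an account record; the field names (account_name, annual_revenue, stage) are hypothetical and not tied to Salesforce or any particular CRM:

# A toy sketch of upstream validation: catch bad records at entry
# rather than cleaning them downstream. Field names are hypothetical.
VALID_STAGES = {"prospect", "qualified", "closed_won", "closed_lost"}

def validate_account(record):
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    if not record.get("account_name", "").strip():
        problems.append("account_name is required")
    revenue = record.get("annual_revenue")
    if revenue is not None and revenue < 0:
        problems.append("annual_revenue cannot be negative")
    if record.get("stage") not in VALID_STAGES:
        problems.append("stage must be one of %s" % sorted(VALID_STAGES))
    return problems

# Rejecting or flagging this at entry is far cheaper than reconciling it
# later in the warehouse.
print(validate_account({"account_name": " ", "annual_revenue": -5,
                        "stage": "won"}))

The rules themselves are trivial; the discipline is in agreeing on them across business processes and enforcing them where the data is born.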

6. Pendulum Swings between Code-first Data Engineering and GUI based Data Engineering Tools

We see these pendulum swings frequently across database and analytics tooling, and the current trend is that developer- and engineer-friendly tools are in. This is primarily because these tools work nicely with cloud provider environments and offer flexibility in integration patterns. One of the most common combinations we see at Tech companies is Python + SQL with Airflow as the orchestrator.
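For a flavor of that pattern, here is a minimal sketch of a Python + SQL pipeline in Airflow; the operator import paths follow Airflow 1.x (current as of this writing), and the DAG, table, and task names are hypothetical:

# A minimal sketch of the Python + SQL + Airflow pattern (Airflow 1.x
# import paths). DAG, task, and table names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def extract_clickstream(**context):
    # Pull raw events for the execution date; a real pipeline would land
    # files in cloud storage rather than print.
    print("extracting events for", context["ds"])

default_args = {"owner": "data-eng", "retries": 1,
                "retry_delay": timedelta(minutes=5)}

dag = DAG(
    dag_id="daily_clickstream",
    default_args=default_args,
    start_date=datetime(2017, 12, 1),
    schedule_interval="@daily",
)

extract = PythonOperator(task_id="extract",
                         python_callable=extract_clickstream,
                         provide_context=True, dag=dag)

# The SQL transform step; swap in a warehouse-specific operator
# (e.g., for BigQuery) in a real deployment.
transform = BashOperator(
    task_id="transform",
    bash_command="echo 'INSERT INTO daily_sessions SELECT ...'",
    dag=dag,
)

extract >> transform

The appeal is that the whole pipeline is just code: versionable, testable, and readable by the same people who write the models.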

This approach is not for everyone or every company, but as data engineering and data science become more symbiotic, it’s important that the tooling is aligned to what practitioners like to use. Data scientists increasingly are using Python, so having data integration patterns that leverage Python is a useful approach. (See also theme #1.)

There are some good GUI-first data engineering and integration tools out there (see Matillion, SSIS, Informatica, etc.), but if you’re building a new platform, the choice should ultimately align to speed of creation, cost, your team’s makeup now and in the future, source connectivity, and flexibility to change patterns.

7. The “Best Practices” of DevOps/SRE Are Now the New Norm in Data

Everyone let out a collective YAY! I remember that the concepts of DevOps in data and analytics were quite abstract and challenging a few years ago, but that has changed dramatically.

Everything has to be versionable, trackable, reproducible, testable, and deployed seconds after leaving a developer’s fingers. This is the new norm, and the cloud providers’ ecosystems make it easier than ever, and in containers no less!

This matters for a few reasons:

  1. Reduction of manual deployment and code out-of-sync errors
  2. The time to deploy data transformation and data model changes drops drastically. Long gone are the manual, week-long deployment activities.
  3. Automated unit and data quality tests greatly reduce the onus on developers and QA functions for spot checking; their testing can now be more focused on value-added data validity tests (see the sketch below).
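Here is a sketch of what an automated data quality test can look like, written for pytest; the transformation and the rules it checks are hypothetical illustrations of checks that used to be manual spot checks:

# A sketch of automated data quality tests, runnable with pytest.
# The transform and rules are hypothetical.
def dedupe_orders(rows):
    """Keep the latest record per order_id (a typical transform step)."""
    latest = {}
    for row in rows:
        key = row["order_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

def check_non_negative_amounts(rows):
    """Return rows that violate the non-negative amount rule."""
    return [row for row in rows if row["amount"] < 0]

def test_dedupe_keeps_latest_record():
    rows = [
        {"order_id": 1, "updated_at": "2017-12-01", "amount": 10},
        {"order_id": 1, "updated_at": "2017-12-02", "amount": 12},
    ]
    result = dedupe_orders(rows)
    assert len(result) == 1
    assert result[0]["amount"] == 12

def test_validity_check_catches_negative_amounts():
    bad = check_non_negative_amounts(
        [{"order_id": 2, "updated_at": "2017-12-01", "amount": -1}])
    assert len(bad) == 1

Wire tests like these into the deployment pipeline and the spot check happens on every change, not just when someone remembers.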

8. The Advent of the Chief Analytics Officer, Who Now Owns these Data Functions (Vision, Strategy) with IT as Support

Chief Data Officers came into vogue a number of years ago, primarily in the Financial and Healthcare industries, given regulation and compliance. I won’t exhaust how well this is working, but an interesting theme of late is having a Chief Analytics Officer. These folks have less of a compliance and governance focus and more of a data-as-a-strategic-asset focus.

An important aspect of this role is having both an internal and external purview. They should be internal champions of data, breaking down walls between Product, IT and Biz Ops groups, while also being the external evangelist for your organization’s data.

We’ll see how many organizations start to adopt this role in 2018, stay tuned…

9. Data Sharing (and soon Data Exchanges) via New Tooling is Finally Becoming Mature Enough to Use

Snowflake, BigQuery, and Athena/Presto are finally making sharing data simple and easy, hooray! The developers and product folks behind these technologies have smartly treated dataset sharing as a feature that data-centric orgs need. Rather than relying on additional tools for sharing, you can enable it directly in the databases themselves. (A technical enabler has been the separation of compute and storage, which is now firmly established in these database technologies.)
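As one example of in-database sharing, here is a sketch of granting another user read access to a BigQuery dataset via the google-cloud-bigquery Python client; the project, dataset, and email are hypothetical, and exact API details can vary by client library version:

# A sketch of in-database sharing with the google-cloud-bigquery client:
# grant a reader access to a dataset instead of exporting copies.
# Project, dataset, and email are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.shared_analytics")

# Append a read grant for a partner analyst to the dataset's ACL.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@partner.example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

The consumer queries the shared dataset in place: no extracts, no copies drifting out of date, and access can be revoked just as easily.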

Governance and security are as important as ever, but when haven’t they been? (See theme #5 above)

This trend will also take shape in external data exchange markets. While the enabling technologies could look and feel different, this ability for companies to share and exchange data is starting to bubble up in interest. Imagine an ecosystem where companies who share customers (or industry data) could openly buy and sell relevant data (with appropriate agreement from those customers). This is both exciting and a bit scary at the same time given the criticality of security in ecosystems and exchanges like this.

10. AI/ML is Still Untapped: See the Recent News about Generating Celebrity Faces as a “unique” Application of Emerging Tech

Sigh… well, we’re still in the early days of wider adoption of AI & ML. 2018 is going to see more and more applications of these emerging technologies. Will we reach an even higher hype pitch for AI/ML in 2018? We’ll see, but there will certainly be more business applications of these technologies, many of which will never reach the public eye. Alongside the pop culture applications, I’m hopeful some of these help the greater good. For every “Celebrity Face Swapper,” it would be great to see an AI application in an area of public need.

I’ve believed this for years now, but today more than ever, Data & Analytics is one of the most promising and fulfilling fields to work in within Technology.

I hope you enjoyed 2017 and this look at where our industry is heading. Please add your thoughts in the comments; I’m eager to learn what you observed this year.


Technology leader at Slalom. Ex-Winemaker. Enthusiast. These thoughts are my own.