Big data: 8 ideas to watch
A look at the major forces shaping the data world.
By Ben Lorica
This story was originally published on O’Reilly Radar as part of our exploration of big data’s big ideas.
Looking back at the evolution of our Strata events, and the data space in general, we marvel at the impressive data applications and tools now being employed by companies in many industries. Data is having an impact on business models and profitability. It’s hard to find a non-trivial application that doesn’t use data in a significant manner. Companies that use data and analytics to drive decision-making continue to outperform their peers.
Up until recently, access to big data tools and techniques required significant expertise. But tools have improved and communities have formed to share best practices. We’re particularly excited about solutions that target new data sets and data types. In an era when the requisite data skill sets cut across traditional disciplines, companies have also started to emphasize the importance of processes, culture, and people.
As we look into the future, here are the main topics that guide our current thinking about the data landscape.
Note: This document represents our thinking as of Fall 2014. You can keep up with the latest analysis and developments in the data space through the O’Reilly Data newsletter.
The combination of big data, algorithms, and efficient user interfaces can be seen in consumer applications such as Waze or Google Now. Our interest in this topic stems from the many tools that democratize analytics and, in the process, empower domain experts and business analysts. In particular, novel visual interfaces are opening up new data sources and data types.
Examples and related resources:
- Narrative Science adds descriptive summaries to the output generated by business intelligence tools (dashboards, charts, and tables).
- Palantir and Quid combine visualization, search, and analytics to help domain experts discover patterns hidden in large data sets.
- StitchFix provides product recommendations by combining proprietary algorithms and expert stylists.
- “Moving dots” (e.g., tracking data from athletics) are being analyzed by companies that specialize in spatio-temporal pattern recognition. Startup Second Spectrum provides analytics to coaches and front offices at many professional basketball teams. In the near future, its technology and recommendations will be available in real time to coaching staffs during in-game situations.
Intelligence matters: Artificial intelligence and algorithms
Bring up the topic of algorithms, and a discussion on recent developments in artificial intelligence (AI) is sure to follow. AI is the subject of an ongoing series of posts on O’Reilly Radar. The “unreasonable effectiveness of data” notwithstanding, algorithms remain an important area of innovation. We’re excited about the broadening adoption of algorithms like deep learning, and topics like feature engineering, gradient boosting, and active learning. As intelligent systems become common, security and privacy become critical. We’re interested in efforts to make machine learning secure in adversarial environments.
Examples and related resources:
- The “Intelligence Matters” series on O’Reilly Radar covers recent developments in artificial intelligence.
- Streamlining Feature Engineering: O’Reilly Radar post on new tools that enable feature discovery.
- Hardcore Data Science day at Strata + Hadoop World 2014 features deep learning and other algorithms, analytic techniques, and a fascinating machine-learning pipeline toolkit from UC Berkeley’s AMPLab.
The convergence of cheap sensors, fast networks, and distributed computation
The Internet of Things (IoT) will require systems that can process and unlock massive amounts of event data. These systems will draw from analytic platforms developed for monitoring IT operations. Beyond data management, we’re following recent developments in streaming analytics and the analysis of large numbers of time series.
Examples and related resources:
- I ❤ Logs: Event Data, Stream Processing, and Data Integration: This is a new book from the co-creator of Apache Kafka.
- Surfacing anomalies and patterns in Machine Data: O’Reilly Radar post on large-scale event data platforms that originate from the world of IT operations.
- How Twitter monitors millions of time series: O’Reilly Radar post on a distributed, near-real-time system that simplifies the collection, storage, and mining of massive amounts of event data.
- Data Analysis on Streams: A recent webcast on popular techniques in real-time analytics.
Data (science) pipelines
Analytic projects involve a series of steps that often require different tools. A growing number of companies and open source projects integrate a variety of analytic tools into coherent user interfaces and packages. Many of these integrated tools enable replication, collaboration, and deployment. This remains an active area, as specialized tools rush to broaden their coverage of analytic pipelines.
Examples and related resources:
- Reproducing Data Projects: O’Reilly Radar post on popular approaches for reproducing, managing, and deploying complex data projects.
- Project Jupyter: A new initiative from the creators of IPython.
- Databricks Workspace: An impressive notebook interface that pulls together components of the Spark ecosystem.
- Data Wrangling gets a fresh look: O’Reilly Radar post on new tools for data preparation.
- Data Analysis is just one component of the Data Science workflow: An overview of modern data pipelines.
Evolving, maturing marketplace of big data components
Many popular components in the big data ecosystem are open source. As such, many companies build their data infrastructure and products by assembling components like Spark, Kafka, Cassandra, and ElasticSearch, among others. Contrast that with a few years ago, when many of these components weren’t ready (or didn’t exist) and companies built similar technologies from scratch. But companies are interested in applications and analytic platforms, not individual components. To that end, demand is high for data engineers and architects who are skilled in assembling these components and in maintaining robust data flows and data storage.
Examples and related resources:
- Some popular Apache projects: Hadoop, Spark, Cassandra, Kafka, Mesos, ZooKeeper.
- Big Data systems are making a difference in the fight against cancer: O’Reilly Radar post provides an example of how open source distributed computing tools can make a profound impact in the health care domain.
- Verticalized big data solutions: O’Reilly Radar post on domain-specific big data applications.
- Hadoop Application Architectures: A book on best practices for building data management solutions.
- Designing Data-intensive Applications: A book that looks at how to build applications using some popular big data components.
Data scientists, design, and social science
To be clear, data analysts have always drawn from social science (e.g., surveys, psychometrics) and design. We are, however, noticing that many more data scientists are expanding their collaborations with product designers and social scientists.
Examples and related resources:
- IDEO’s Hybrid Insights group integrates quantitative techniques with the qualitative methods popular among product designers.
- Datascope Analytics: A Chicago-based data science consulting group that incorporates techniques from product design.
- Ideation (idea generation) workshops are beginning to be used by some data scientists.
- Thinking with Data: This book by Max Shron provides an overview of ideas and techniques from the social sciences.
Building a data culture
“Data-driven” organizations excel at using data to improve decision-making. It all starts with instrumentation. “If you can’t measure it, you can’t fix it,” says DJ Patil, VP of product at RelateIQ. In addition, developments in distributed computing over the past decade have given rise to a group of (mostly technology) companies that excel in building data products. In many instances, data products evolve in stages (starting with a “minimum viable product”) and are built by cross-functional teams that embrace alternative analysis techniques.
Examples and related resources:
- Building Data Science Teams: Data scientists are at the forefront of innovation in many data-driven organizations. This report offers practical advice for constructing teams that can drive that innovation.
- Just Enough Math is a video series that introduces mathematical concepts using business cases.
- Lean Analytics: Acquire a data-driven mindset through 30 case studies.
- Data Jujitsu: A primer on organizing teams and building data products.
Perils of big data
Every few months, there seems to be an article criticizing the hype surrounding big data. Dig deeper and you find that many of the criticisms point to poor analysis and highlight issues known to experienced data analysts. Our perspective is that issues such as privacy and the cultural impact of models are much more significant.
Examples and related resources:
- On Being a Data Skeptic: A nuanced view of big data and data science.
- Organizations like Code for America, Bayes Impact, DataKind, and Data & Society broaden the discussion of what data scientists can be working on and thinking about.
- NIPS 2014 Workshop: Fairness, Accountability, and Transparency in Machine Learning: Researchers address “… growing anxieties about the role that machine learning plays in consequential decision-making in such areas as commerce, employment, health care, education, and policing.”
- No silver bullet: De-identification still doesn’t work: Princeton security and privacy researchers survey anonymization strategies for a variety of data types.
We’ll also explore each of these topics through our publishing program, events, webcasts, and online coverage. These explorations work best when they’re two-way roads, so please share your feedback through Twitter (@bigdata).