Data Dialogs 2017 Recap
Two days ago I had the pleasure of attending the Data Dialogs Conference at UC Berkeley, where we discussed what data scientists can do to improve their craft. The key takeaway from the discussion is that it is much easier to work with data than with people. Even though we can extract insight from data, the real value still lies in keeping the human in the loop. For more details, let’s run through the main points from each speaker.
Project Jupyter by Fernando Perez
If you haven’t yet used Jupyter Notebook, jump on the bandwagon. As stated on its website, “Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.” The goal of Jupyter Notebook is to ensure that research is reproducible. It would be a pity to accumulate, say, over 40 years of research that future scientists are unable to reproduce.
But reproducibility can only go so far if the scientific communities cannot read all the publications released in their respective fields. For this reason, a shift in mindset has to take place. Remember, the goal of scientific literature is to communicate ideas for others to build upon. Unfortunately, it has become a way for scientists to get professional credit for what they do. The first thing we need to do, then, is to separate professional credit from the communication of ideas. We need to publish less so we can read more. The other shift that has to take place is to make the scientific literature easy for machines to read, since we have yet to find a superhuman who can read all the publications released yearly. Lastly, data has to be made available to others for reproducibility to be achieved.
The goal of Jupyter Notebook is to ensure that research is reproducible, but we need to join that effort to make the dream of reproducible research a reality.
Vernacular Data by T. Christian Miller
T. Christian Miller introduced the concept of vernacular data, which allows anyone with access to data to answer for themselves the question of what is happening. He defines vernacular data as
- data created by users via sensors, mobile devices and personal data mining
- data used in creative ways to unearth previously unknown information or trends
- democratized data
We should strive to make vernacular data available, since data that provides insight into what is happening is currently difficult to access.
Continuing Evidence of Discrimination Among Local Elected Officials by D. Alex Hughes
For those who want to learn how to take a research question and develop a design process that leads to a conclusion, I recommend listening to D. Alex Hughes’ talk. I will not do him justice by summarizing his already condensed approach, so the details of his presentation (as well as the others’ from the conference) are included in a PDF file.
From the research questions he posed, he was able to determine whether street-level bureaucrats respond differently to emails based on the name of the sender alone. He categorized the names by race: white, African-American, Latino, and Arab. His research determined that Latino and Arab names receive lower response rates, while African-American and white names receive equal response rates. Therefore, when street-level bureaucrats open emails from the general public, it would be best if the sender’s name were anonymized to ensure fair treatment.
Analytics for the 4th Industrial Revolution by Chetan Gupta
We have come a long way since the days when data lived mostly in databases. Now data is everywhere, and the algorithms we have discovered bring us one step closer to cyber-physical systems, the potential 4th Industrial Revolution following the computer automation of our time. Two common categories of how we can apply algorithms to data right now are
- descriptive or recognition (e.g. fraud detection, clinical decision support)
- prescriptive or recommendation (e.g. movie recommendations, self-driving cars)
with descriptive or recognition being the most valued among customers (in comparison to automation).
It is more important for data scientists to know how to operationalize data than to chase model accuracy. Industrial analytics, for example, is one way data scientists can operationalize data. Although I will leave the details of how this can be achieved to the PDF file, the key takeaways are:
- Remember to think about the cost tradeoffs.
- Different settings have different needs; therefore, you cannot rely solely on algorithms to make predictions. You will also need input from experts in the field.
- Perform A/B testing to determine whether a change actually produces an improvement.
- Before solving problems at a higher level, think about them at the plant level; solving a problem at a higher level is exponentially more complicated, since it requires solving multiple problems at once.
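One of the takeaways above calls for A/B testing. As a minimal sketch of what that check can look like, here is a two-proportion z-test written with only the Python standard library; the sample counts are made-up numbers, not figures from the talk:

```python
from math import erf, sqrt

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Compare conversion rates of variants A and B.

    Returns the z statistic and a two-sided p-value based on
    the normal approximation.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis (no difference).
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical example: 200/1000 successes before the change,
# 260/1000 after. z lands near 3.2, well past the usual 1.96 cutoff.
z, p = two_proportion_ztest(200, 1000, 260, 1000)
```

A large |z| (and a small p-value) suggests the change made a real difference rather than being noise.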
The Challenge of Predicting Churn in the Enterprise World by Sharon Lin
In business, retaining customers is as important as growing your customer base, so creating a machine learning model to predict the likelihood of a customer leaving based on their usage patterns would benefit the enterprise overall. The key takeaway she found during her three months building the model is that good features lead to good predictions. This insight did not materialize without setbacks. These included extracting user attributes from different silos; cleaning data, especially naming conventions, since a single company can be identified by fifteen different names; and relying on metrics other than accuracy, because enterprise churn rates are typically less than 25%.
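Her point about accuracy is easy to demonstrate: with churn rates well below 25%, a model that predicts "no churn" for everyone looks accurate while catching no churners at all. A small sketch with made-up labels (not Lin's data or model):

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# 5 churners out of 100 customers; the "model" predicts nobody churns.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
acc, prec, rec = evaluate(y_true, y_pred)  # acc = 0.95, rec = 0.0
```

95% accuracy, zero churners caught; precision and recall on the churn class tell the real story.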
Visual Trumpery by Alberto Cairo
Mere exposure to data does not have a persuasive effect — maybe at least partially due to the increased sense of objectivity evidence supported by numbers carries.
~ The Persuasive Power of Data Visualization
Data scientists are data people. They understand data. But most people are not data literate, and they often misread data, or use data and data visualizations to push agendas. It is for these people that data scientists need to engage in a data dialog. It starts with a one-on-one conversation built around the following key points:
- Read beyond the title.
- Pay attention. When a visualization is given, it isn’t meant to be seen; it’s meant to be read.
- Notice what the visualization shows — and think about what it may be leaving out.
- A visualization shows only what it shows — and nothing else.
Data scientists should also engage in data dialogs with other data people, including journalists, to ensure that data is presented clearly so as not to deceive the audience. This can be achieved by
- Disclosing data and methodology in plain English
- Representing the right data, and doing it accurately
- Showing an appropriate amount of data
- Revealing uncertainty when it is relevant
Advanced Analytics in Wells Fargo by Menglin Cao and Mauro Cardenas
It was nice to see a change of pace, with the speakers interviewing each other Oprah Winfrey style. The main takeaway from the discussion is that data scientists like talking to data more than they like talking to people, which succinctly summarizes the overall theme of the conference. But to become better data scientists, they need to understand their business partners and what keeps them up at night. To do this, data scientists need to:
- Be really good at what they do.
- Be able to tie what they are doing with business priorities.
- Get buy-in from their peers.
One way to understand which priorities business partners have is to anticipate their questions. If data scientists can address these questions, it will go a long way toward achieving their long-term goals. To make this process easier, give credit to the decision makers, because people tend to implement an idea if they feel it was their own.
“Full Stack” Data Science at Stitch Fix by Hoda Eydgahi
Stitch Fix provides personal styling as a service that incorporates both algorithms and human recommendations. In this case, the human in the loop is the personal stylist. As a company, they have found that combining machine learning algorithms and human recommendations gets the best of both worlds. Machine learning has yet to decipher the subtle intricacies of style; left alone, the algorithms would recommend, over and over, pairs of jeans similar to the ones a user already loves. On the other hand, a stylist does not have time to peruse thousands of inventory items. For this reason, machine learning takes out the grunt work by narrowing the sheer volume of inventory down to items the user is likely to love, and the stylist selects distinct, novel items that will hopefully tickle the user’s fancy.
Stitch Fix sets up both its business model and its development model differently to maintain a competitive edge. Unlike the discovery e-commerce that permeates the market today, they do not let users browse their inventory. They do this partly because they know shopping is exhausting for their customers, but mostly because it creates an advantage: starting at the sign-up stage, they collect high-quality, differentiating data that competitors do not have access to. In addition, Stitch Fix hires full stack data scientists who can carry out data science, software engineering, and project management tasks. They find that this lets the development team iterate quickly, because it is not held back by the competing priorities and loss of context common when data scientists and software engineers are separated into their respective groups.
Using R to Build a Data Discovery Tool for Domain Experts by Tarak Shah
When constructing a software application, such as a data discovery tool, for people with different skillsets, it is important to start with simple definitions and build more complex ones on top of them. In essence, simple language makes it easier to communicate and collaborate with people of different skillsets. This can be achieved by
- Minimizing syntactic noise, as explained in Martin Fowler’s book Domain-Specific Languages
- Maintaining flexibility without complexity
- Defining clear and understandable error messages
- Providing documentation and tutorials, and seeking collaboration
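As an illustration of those points (sketched in Python rather than the R used in the talk, and with an invented API, not the presenter's actual tool): a tiny query layer that keeps syntactic noise low, stays flexible without growing complex, and fails with error messages a domain expert can act on.

```python
class DiscoveryError(ValueError):
    """Raised with a message a domain expert can act on."""

class Dataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def fields(self):
        return sorted(self.rows[0]) if self.rows else []

    def _check(self, names):
        # A clear error beats a KeyError deep inside the tool.
        for name in names:
            if name not in self.fields():
                raise DiscoveryError(
                    f"Unknown field '{name}'. "
                    f"Available fields: {', '.join(self.fields())}"
                )

    def where(self, **conditions):
        """Keep rows matching every condition, e.g. where(city='Berkeley')."""
        self._check(conditions)
        return Dataset(
            r for r in self.rows
            if all(r[f] == v for f, v in conditions.items())
        )

    def select(self, *names):
        """Keep only the named columns."""
        self._check(names)
        return Dataset({f: r[f] for f in names} for r in self.rows)
```

`Dataset(rows).where(city="Berkeley").select("name")` reads close to plain English, and a typo like `where(town="Berkeley")` fails by naming the fields that do exist.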
Getting Data Together by Katherine Ahern
Companies are having a difficult time moving data from databases to the cloud. This is because data tends to carry system-specific encodings, and systems tend to be locked down and proprietary. To make the transition smoother, the presenter suggested looking into:
- Postman, which can assist in building a fast and smooth workflow for API development
- Google BigQuery, which provides access to Google’s Dremel via a REST API. For those who are not familiar with it, Dremel is an interactive ad hoc query system that lets users run queries on large, structured datasets in near real time.
- Apache Drill, an open source project inspired by Google’s Dremel. On a side note not discussed by the presenter, Apache Impala is also inspired by Dremel, but one of the main differences between the two is that Impala is tied to Hadoop while Drill can connect to custom data sources.
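For a sense of what "Dremel via a REST API" means in practice, here is a sketch of the request BigQuery's `jobs.query` REST method expects. The project ID is a placeholder, and authentication (an OAuth bearer token) is omitted, so this only builds the request rather than sending it:

```python
def build_query_request(project_id, sql):
    """Assemble the URL and JSON body for BigQuery's jobs.query method."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/queries"
    )
    body = {
        "query": sql,
        "useLegacySql": False,  # run as standard SQL
    }
    return url, body

# Placeholder project; in real use this body is POSTed to the URL
# along with an Authorization header.
url, body = build_query_request("my-project", "SELECT COUNT(*) FROM ds.tbl")
```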
Once you have had "fun with data", apply A/B testing to determine whether the changes applied to the dataset have made an impact.
The key takeaway from the conference bears repeating: it is much easier to work with data than with people. Even though we can extract insight from data, the real value still lies in keeping the human in the loop.
As a data scientist, what do you think you can do differently to improve your craft? Leave your thoughts, and we can start another data dialog in the comments section below!
Details about each of the speakers can be found in this linked PDF file.
If you can make it, I hope to see you at the next Data Dialogs Conference in May of next year.