First, a tale from history:
In November 1854, Florence Nightingale arrived in Scutari to serve as a nurse during the Crimean War. She had also shown a gift for mathematics from an early age. At the war hospital, she was faced with suffering, pain, and chaos.
“Wounded soldiers often arrived with diseases like typhus, cholera and dysentery. More men died from these diseases than from their injuries.” (1)
During the day she served as a nurse, and at night she had another mission, for which she became known as the ‘Lady with the Lamp’: she collected data and analyzed the causes of soldiers’ mortality. When she returned to London, she visualized the data and campaigned to raise awareness. She had shown, based on data, that soldiers’ injuries were not the main cause of death. She went on to establish the Army Medical College in Chatham in 1859 and continued to contribute to her cause. Her personal pain and suffering can only be imagined, yet she prevailed and served humanity by making use of data.
Second, a story of today:
The story of today’s Data Scientist somewhat resembles Nightingale’s work, especially in lacking data while desperately needing to understand the situation at hand. Today, not all Data Scientists save lives, but many actually do!
Many of today’s Data Scientists would most probably welcome more data, data streams, data sources, data infrastructure, and data engineering support. Yet, they are asked to produce better insights, models, inferences, and visualizations, and to provide consulting on decisions.
Very often, Data Scientists are frustrated not because a model fails in production but because data sources are not available. In the age of GDPR, this will get worse.
There is a common denominator for this situation: a lack of understanding of Data Science. Decision makers do not understand that Data Science is not only software programming (yet!) and that models are not only source code pipelined into production.
Many companies want to become data-driven, modern, and agile. So, they hire a Data Scientist and, perhaps, a Data Engineer. This is done before collecting data, preparing data infrastructure, enabling data engineering, and knowing what data-driven business questions and strategy lie ahead. The spiral of stress ends up on the Data Scientist, whose hands are tied while expectations continue to grow. A simple Google search for “why data scientists are” will suggest completing the sentence with the word “leaving”. There is hope of seeing the day when this suggestion changes to something positive.
For aspiring data-driven companies, suggested steps for a successful data-driven transformation are:
- Mature, or embrace, DevOps transformation (DevOps: To do or not to do? Focus on culture first!) — this is due to culture, pipelines and data management!
- Roll out a GDPR-compliant data lake, for storage and experimentation!
- Integrate data engineering capabilities — hire a Data Engineer or DataOps!
- Collect data with compliance and GDPR in mind: this is obvious, but new too!
- Hire a Data Scientist!
Third, a guess about the future:
With computing power and services becoming a commodity through cloud providers, it is reasonable to think that Data Science will be commoditized, too. The million-dollar question is: under what legally balanced data constraints will Data Science become DSaaS (Data Science as a Service)?
Such balance will need to consider these factors:
- Data collection GDPR implications (currently easier to handle in the US, harder in the EU).
- Data anonymization sweet spots (ultimately, Data Science does not work on fully anonymized data, although it works for some use cases, e.g. insights based on demographics of families while anonymizing People UID, names, and surnames).
- Lawful updates of Machine Learning models (and even statistical modelling) to follow up on user GDPR consent. This refers to a user opting out later, after initially giving consent. Not every opt-out will affect your model. However, there is a point where too many opt-outs will affect the model, which will then need to be updated at a frequency that reflects legal requirements.
- Data ownership and trust (What happens to your raw data and the applied data science in the cloud? Will the service provider infer and learn from your data for its own benefit? Will your data be part of wider data science activities on the provider’s side? etc.)
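To make two of the factors above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a compliance recipe: the field names (`uid`, `name`, `surname`), the `SECRET_KEY` value, and the 5% opt-out threshold are all hypothetical. Note also that replacing an identifier with a keyed hash is pseudonymization rather than full anonymization, which is roughly the "sweet spot" idea: demographic fields stay usable while direct identifiers are removed.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-me"


def pseudonymize(record, key=SECRET_KEY):
    """Replace the person UID with a keyed hash and drop name fields.

    Other fields (e.g. household demographics) are kept unchanged.
    """
    token = hmac.new(key, record["uid"].encode("utf-8"), hashlib.sha256).hexdigest()
    kept = {k: v for k, v in record.items() if k not in {"uid", "name", "surname"}}
    kept["person_token"] = token
    return kept


def needs_retraining(trained_on, opted_out, max_opt_out_share=0.05):
    """Return True when the share of training subjects who later opted out
    exceeds the (assumed) threshold, signalling that the model should be rebuilt.
    """
    subjects = set(trained_on)
    if not subjects:
        return False
    withdrawn = len(subjects & set(opted_out))
    return withdrawn / len(subjects) > max_opt_out_share


if __name__ == "__main__":
    row = {"uid": "u1", "name": "Ada", "surname": "Smith", "household_size": 4}
    print(pseudonymize(row))
    print(needs_retraining({"a", "b", "c", "d"}, {"a"}, max_opt_out_share=0.2))
```

Where exactly the threshold sits, and how quickly a retrained model must replace the old one, is a legal question as much as a technical one; the code only shows where such a rule would plug in.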
While many organizations have Data Scientists in-house today, this might not be the case in the future. Possibly, the title will change to Machine Learning Engineer (or similar!): someone who knows how to use DSaaS and has just enough understanding of algorithms to know what to input and how to interpret the outputs. The hardcore Data Scientists will work at DSaaS companies, where mass Data Science advancement happens.
The good news is that Data Scientists will be in a situation where a lot of data is made available and cloud infrastructure will be mature. Their main job will be scaling out algorithms and challenging the borders of what is technically possible.
There will always be the mad Data Scientist working in the basement server room, perhaps using weird datasets.