Do Data Scientists Really Ask For Physical Data Lakes?

This article was written by Rick van der Lans

Few dispute the business value of a data lake for data scientists. Everyone understands that bringing all the data together in one place makes access to it easy and quick. Studies have shown that data scientists spend 80% of their time on data preparation, and a large part of that time is wasted on gathering the data they need for analytics. A data lake reduces this waste and lets data scientists start sooner on their real work: data analysis.

But must a data lake be a physical data lake? According to the original definition of a data lake, the answer is yes. Data needed by the data scientists is copied from its original data source to the physical data lake. This is reflected clearly in how James Serra defines a data lake: “A data lake is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.”

Copying and moving all the data physically to one centralized environment can lead to a wide range of problems and challenges, some of them insurmountable:

  • Big data can be too big to move and too costly to store twice
  • Company politics can prohibit copying of data owned by divisions or departments to a centralized environment
  • Data privacy and protection regulations can prohibit storage of specific types of data together
  • Data in a data lake is stored outside its original security realm
  • Metadata describing the data is commonly not copied along with the data and therefore not available to the data scientists
  • Some data sources, such as old mainframe databases, can be hard to copy and to keep in their original format
  • Technical and organizational management of a data lake is required

Data scientists need quick and easy access to all the data they need, but must the solution be based on one centralized physical environment?
