Is there a perfect dataset?

In this post, I will shed some light on the following questions:

  • What is a perfect dataset?
  • What are the characteristics of a perfect dataset?
  • Does a perfect dataset exist in the real world?

Most datasets are produced as a result of a business process, or because a business wants to measure something. A perfect dataset should help a business either manage its operations or improve them. For example, a patient administration system collects information about patients and helps hospitals manage them. Such datasets help the hospital bill its clients, manage medical records, and so on.

Thus a perfect dataset should support every function of the business seamlessly. It should score well on data quality metrics such as completeness and timeliness and, more importantly, inform both strategic and tactical business decisions. These characteristics set a high benchmark. A perfect dataset exists only in a utopian world.
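To make the data quality metrics mentioned above a little more concrete, here is a minimal Python sketch of how completeness and timeliness could be measured. The records, the field names (`sex`, `dob`, `updated`) and the 30-day freshness threshold are all made up for illustration; a real system would define these metrics according to its own business rules.

```python
from datetime import date

# Hypothetical patient records; None marks a missing value.
records = [
    {"patient_id": 1, "sex": "M", "dob": date(1980, 5, 1), "updated": date(2024, 1, 10)},
    {"patient_id": 2, "sex": None, "dob": date(1975, 3, 2), "updated": date(2023, 6, 1)},
    {"patient_id": 3, "sex": "F", "dob": None, "updated": date(2024, 1, 5)},
    {"patient_id": 4, "sex": "F", "dob": date(1990, 7, 9), "updated": date(2024, 1, 12)},
]


def completeness(rows, field):
    """Fraction of rows in which `field` is present (not None)."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def timeliness(rows, as_of, max_age_days):
    """Fraction of rows updated within `max_age_days` before `as_of`."""
    if not rows:
        return 0.0
    fresh = sum((as_of - r["updated"]).days <= max_age_days for r in rows)
    return fresh / len(rows)


if __name__ == "__main__":
    print("sex completeness:", completeness(records, "sex"))
    print("dob completeness:", completeness(records, "dob"))
    print("timeliness:", timeliness(records, date(2024, 1, 15), max_age_days=30))
```

Tracking numbers like these over time is one simple way to see whether the iterative data quality work is actually paying off.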

Most businesses around the world want a perfect dataset, and they try to improve their data quality through iterative processes. Even though they collect data to align with business functions, the resulting datasets are often biased. For example, if you want to use a dataset for analytics, the data might not be evenly distributed: there might be more male patients than female patients instead of an exact 50% split.
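A quick way to spot the kind of imbalance described above is to compare the observed share of each group with the share you expected. The sketch below assumes a hypothetical list of patient sexes and an expected 50/50 split; both are illustrative, and the expected shares for a real population may well differ.

```python
from collections import Counter


def group_shares(values):
    """Return the observed share of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def imbalance(shares, expected):
    """Largest absolute gap between observed and expected share."""
    return max(abs(shares.get(k, 0.0) - p) for k, p in expected.items())


# Hypothetical sample: 65 male and 35 female patients.
sexes = ["M"] * 65 + ["F"] * 35
shares = group_shares(sexes)
gap = imbalance(shares, {"M": 0.5, "F": 0.5})

if __name__ == "__main__":
    print(shares)
    print("largest gap from expected share:", round(gap, 2))
```

A large gap does not by itself prove the dataset is unusable, but it is a signal that any analysis should account for the skew.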

A perfect dataset should ideally be free from selection bias, publication bias, and so on. Data scientists and analysts play an important role here: by understanding how a dataset was collected, they can apply statistical techniques that allow an imperfect dataset to still support sound business decisions.
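As one example of such a statistical technique, here is a small sketch of post-stratification weighting: each record is weighted by the ratio of its group's target share to its observed share, so that over-represented groups count for less. This is only one of many possible approaches, chosen here for illustration; the function names, the sample values and the 50/50 target shares are assumptions, and the method presumes the true population shares are known.

```python
def poststrat_weights(groups, target_shares):
    """Weight each record by target share / observed share of its group."""
    n = len(groups)
    observed = {g: groups.count(g) / n for g in set(groups)}
    return [target_shares[g] / observed[g] for g in groups]


def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: three male patients and one female patient,
# with some numeric outcome recorded for each.
groups = ["M", "M", "M", "F"]
outcome = [10, 10, 10, 20]

weights = poststrat_weights(groups, {"M": 0.5, "F": 0.5})
raw_mean = sum(outcome) / len(outcome)
adjusted_mean = weighted_mean(outcome, weights)

if __name__ == "__main__":
    print("unweighted mean:", raw_mean)
    print("reweighted mean:", adjusted_mean)
```

In this toy sample the unweighted mean is pulled toward the over-represented group, while the reweighted mean treats both groups as equally sized.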

Businesses must strive to create a perfect dataset, so that they can trust their data to make accurate business decisions.