Questions for Data Preparation

Parvez Kose
Published in Data-vized · 3 min read · Aug 23, 2020

Data preparation is perhaps the most critical step in data science research, exploratory analysis, or data visualization work. It refers to collecting, cleaning, and transforming raw data before any processing or further study. It can also mean aggregating data from multiple sources with disparate structures and formats.

Data preparation allows for efficient analysis: it improves consistency, reduces errors, identifies redundancies, and eliminates discrepancies introduced during data collection. Most importantly, the process helps standardize the desired format and estimate the effort required to transform the unprocessed data into a usable form. Together, these benefits are what make data preparation so valuable.
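To make this concrete, here is a minimal sketch of what that kind of cleaning and standardization can look like in pandas. The file name and the reported_date column are hypothetical placeholders for illustration, not part of the project described below.

```python
import pandas as pd

# Hypothetical raw export; the file name and column names are placeholders.
df = pd.read_csv("raw_export.csv")

# Standardize column names into a consistent, predictable format.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Remove exact duplicate rows introduced during collection or export.
df = df.drop_duplicates().copy()

# Coerce a date column into one format; unparseable values become NaT.
df["reported_date"] = pd.to_datetime(df["reported_date"], errors="coerce")

# Report missing values per column so gaps can be raised with the data provider.
print(df.isna().sum())
```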

While working on a data visualization project for a not-for-profit news agency, the first step was to collect and analyze the raw data that would be used to prototype the visualization. An external community partner was tasked with sourcing and provisioning the data required for the visualization output. It was critical to gather ample information about the data provided by the community partner before performing the initial requirement analysis.

From the start, it was essential to identify what questions we needed to ask our community partner about the data in order to make the discussion productive. It made sense to prepare an array of questions to guide them and help drive the conversation.

The questions covered the data collection process, sourcing methods, unfinished processing, volume, storage, and blockers such as difficulties faced while organizing the data. Using that information, I prepared a checklist of questions that would be useful when embarking on a data preparation process. I hope it helps anyone working on data pipelines and serves as a guideline for data preparation, especially for visualization projects.

  1. Who is the target audience for the visualization output?
  2. What do you expect the end-user to deduce from the final output? In other words, what does your business/mission expect to see in the output?
  3. What are the key challenges and pain points that the visualization is solving for?
  4. What is the opportunity? What is the missed opportunity if we don’t do this?
  5. What are the benefits? What are the costs? What is the return on investment?
  6. What is the product? What is the product made of (features, capabilities, touchpoints, experience)?
  7. Are there ethical considerations? Are there legal parameters surrounding any of the data?
  8. What collection methods were used to source the data?
  9. What legwork was involved in cleaning the original data?
  10. What tools or techniques were used to sanitize the raw dataset?
  11. What information was discarded during preprocessing or cleaning raw data?
  12. What steps were taken to fix the inconsistencies and remove duplicate information?
  13. Does the current form support scalability to incorporate more data down the road?
  14. Do parts of data need to be transformed manually before use?
  15. What preconditions should be considered when selecting a storage system?
  16. What tools or services were used to export the raw data?
  17. What tools or services were used to organize the refined data?
  18. How will these results be used by the organization in the future?

After working through the dataset with these questions, you will have uncovered enough information about the data to distinguish what is most essential from what can be dispensed with. Armed with this knowledge, you will be better equipped to embark on the next steps of data engineering with confidence, be it data processing, database design, or further analysis.
