Running a Data Project — Lessons Learned

Mike K
Published in Version 1
Mar 24, 2022

So, I’ve been in the technology business a while now, but consultancy and leading a technical data stream on a big project are new to me. I’m going to try to capture learnings and thoughts here.

Lesson 1 — get the data!

This seems obvious for a data project, but until you can interact with and understand the data and the systems involved, you’re just a professional guesser.

Remember, the client may not fully understand their own data, and non-technical professionals sometimes don’t fully understand how precise and accurate data needs to be to get the results they expect from a data solution.

Don’t make any promises or commitments until you have worked with the data. You can burn months trying to work with poor-quality data that is handed over piecemeal. Identify these problems early!

Lesson 2 — build data validation checks early

Carrying on from lesson 1, make sure the data actually makes sense and dig beneath the surface. Just because the first 20 rows you have eyeballed look great, it doesn’t mean the data is good.

  • Join separate data sets together — how much of the data doesn’t join?
  • Check for duplication across key business columns, not just rows that are duplicated in every column.
  • Check for null distributions — do you have huge chunks of missing data?
  • Check date ranges. I once saw data with a Ford Focus built around the time of the Roman Empire.
  • Check that date formats are consistent throughout: dd/mm/yyyy or mm/dd/yyyy?
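The checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the sample rows, the column names (vin, model, built), the expected dd/mm/yyyy format and the 1900 cut-off are all assumptions made up for the example.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical sample data; column names and values are illustrative only.
vehicles = [
    {"vin": "A1", "model": "Focus", "built": "14/03/2012"},
    {"vin": "A2", "model": "Focus", "built": "01/01/0117"},   # implausible year
    {"vin": "A2", "model": "Fiesta", "built": "03/25/2015"},  # duplicate key, mm/dd/yyyy
    {"vin": "A4", "model": None, "built": "09/11/2018"},
]
owners = [{"vin": "A1", "owner": "Smith"}, {"vin": "A9", "owner": "Jones"}]


def unmatched_keys(left, right, key):
    """Key values in `left` that have no matching row in `right`."""
    right_keys = {row[key] for row in right}
    return sorted({row[key] for row in left} - right_keys)


def duplicate_keys(rows, key):
    """Business-key values that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def null_rates(rows):
    """Fraction of rows where each column is None or an empty string."""
    columns = {col for row in rows for col in row}
    return {
        col: sum(row.get(col) in (None, "") for row in rows) / len(rows)
        for col in sorted(columns)
    }


def date_problems(rows, column, fmt="%d/%m/%Y", earliest=date(1900, 1, 1)):
    """Values that fail to parse as `fmt` or fall outside a plausible range."""
    problems = []
    for row in rows:
        value = row[column]
        try:
            parsed = datetime.strptime(value, fmt).date()
        except ValueError:
            problems.append((value, "does not match expected format"))
            continue
        if not earliest <= parsed <= date.today():
            problems.append((value, "outside plausible range"))
    return problems


print(unmatched_keys(vehicles, owners, "vin"))
print(duplicate_keys(vehicles, "vin"))
print(null_rates(vehicles))
print(date_problems(vehicles, "built"))
```

One caveat: parsing only catches dates that are impossible in the expected format. A value such as 09/11/2018 is valid as both dd/mm/yyyy and mm/dd/yyyy, so mixed formats usually also need a question to whoever owns the source system.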

Lesson 3 — get the business process nailed down

To be useful and add value, data has to be used in some kind of business process. Understand what this process is: what is the data actually used for?

This will help you understand the data’s context, which you can then apply to your data validation checks from lesson 2. For example:

  • Is there a concept of a “closed” record? Is it applied consistently?
  • Are there simple business rules you can check?
    - Can a record only be closed X days after being created?
    - An application record must have a name?
    - “Application decision” must have a value before the record can be closed?
    - etc…
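Business rules like these translate naturally into small, testable checks. The sketch below is one possible shape, under stated assumptions: the field names (name, status, decision, created_on, closed_on) are hypothetical, the value of 5 days stands in for “X”, and the rule is read as “no sooner than X days after creation”. The real rules and values have to come from the business.

```python
from datetime import date, timedelta

# Stands in for "X"; an assumed value, the real one comes from the business.
MIN_DAYS_BEFORE_CLOSE = 5


def rule_violations(record, min_days=MIN_DAYS_BEFORE_CLOSE):
    """Return a list of human-readable rule violations for one application record."""
    violations = []
    # Rule: an application record must have a name.
    if not record.get("name"):
        violations.append("application has no name")
    if record.get("status") == "closed":
        # Rule: "application decision" must have a value before closing.
        if not record.get("decision"):
            violations.append("closed without an application decision")
        # Rule: a record can only be closed min_days after being created.
        closed_on = record.get("closed_on")
        if closed_on is None:
            violations.append("closed without a closed date")
        elif closed_on - record["created_on"] < timedelta(days=min_days):
            violations.append(f"closed less than {min_days} days after creation")
    return violations


# Hypothetical record that breaks all three rules.
record = {
    "name": "",
    "status": "closed",
    "decision": None,
    "created_on": date(2022, 3, 1),
    "closed_on": date(2022, 3, 2),
}
for problem in rule_violations(record):
    print(problem)
```

Returning a list of violations, rather than stopping at the first failure, makes it easy to count how often each rule is broken across the whole data set.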

Lesson 4 — write down decisions

This should probably be lesson 1, it’s so important. Many, many decisions need to be made about data by the business that owns it, and we need to write down:

  • Who made the decision
  • When they made it
  • What the logic behind the decision was
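A wiki page or spreadsheet is enough for this, but if you would rather keep the log next to the code, a structured record is one option. The sketch below is only an illustration: the DataDecision class, its field names and the append-only JSON Lines file are assumptions, not an established convention.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class DataDecision:
    """One recorded decision: who made it, when, and why (hypothetical schema)."""
    decided_by: str
    decided_on: date
    decision: str
    rationale: str

    def to_json(self):
        """Serialise to a single JSON line, with the date in ISO format."""
        record = asdict(self)
        record["decided_on"] = self.decided_on.isoformat()
        return json.dumps(record)


def append_decision(path, decision):
    """Append a decision to a log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(decision.to_json() + "\n")


# Hypothetical example entry.
example = DataDecision(
    decided_by="Jane Doe (data owner)",
    decided_on=date(2022, 3, 24),
    decision="Yesterday's data is good enough for analysis",
    rationale="Reports are only reviewed weekly",
)
print(example.to_json())
```

Appending rather than editing keeps the history intact, so a later reversal shows up as a new entry instead of overwriting the original decision.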

Data is a big deal: there are laws around data and, in extreme cases, potentially criminal punishments. Protect yourself and protect your client by recording key decisions.

Examples include:

  • Data classification decisions and how to protect sensitive/personal data
  • How up to date does data for analysis need to be?
    - An hour old?
    - Is yesterday's data good enough?
  • Do we need to obfuscate or mask some data? Which data?
  • Are we adopting a standardised schema? What schema? Where is it documented?
  • Which security mechanisms will be used to protect data?
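To make the masking question concrete, here is a minimal sketch of one common approach: replacing sensitive values with a keyed hash, so records can still be joined on the masked value without exposing the original. Treat it as an illustration under assumptions: the function names, the field names and the choice of HMAC-SHA256 are mine, and which technique is acceptable is exactly the kind of decision the client needs to make and record.

```python
import hashlib
import hmac


def pseudonymise(value, secret_key):
    """Replace a string with a keyed hash: stable for joins, not human-readable."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def mask_record(record, sensitive_fields, secret_key):
    """Return a copy of `record` with the named fields pseudonymised."""
    return {
        field: (
            pseudonymise(str(value), secret_key)
            if field in sensitive_fields and value is not None
            else value
        )
        for field, value in record.items()
    }


# Hypothetical record and key; a real key belongs in a secret store, not in code.
masked = mask_record(
    {"name": "Alice Example", "model": "Focus"},
    sensitive_fields={"name"},
    secret_key=b"example-key-only",
)
print(masked)
```

The same input and key always give the same output, which is what keeps joins working. It also means anyone holding the key can test guesses against the masked values, so the key needs the same protection as the data itself.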

Conclusion

To be honest, this is really just a personal record to remind myself of project fundamentals and to stay out of trouble!

As I learn more I’ll try to record the inevitable mistakes & lessons as well as the amazing technical triumphs… :-)

About the Author:
Mike Knee is an Azure Data Developer here at Version 1.

I’m a computer nerd moving into the autumn of my career & keen to share the learnings, mistakes & triumphs of over 25 years in the technology industry.