Good Data Engineer, Bad Data Engineer

A few weeks ago, I went to the New York office of Insight Data Science to chat with the new batch of data engineering fellows. I was impressed by their backgrounds and curiosity, and I can’t wait to see what they will do after the program.

One of the topics was what it means (to me) to be a good data engineer. In a totally inauthentic format, I made this handout for everyone.


A good data engineer knows about ALL the strategies to store data.

A bad data engineer only knows HDFS (or Redshift, or whatever) and sticks everything in there.


A good data engineer understands when and how to tune performance.

A bad data engineer thinks the whole job is getting data into a warehouse.


A good data engineer is dangerous enough at analytics and data science that she understands the pain points of the analysts and data scientists who depend on her.

A bad data engineer doesn’t care to learn a lick of scikit-learn or R.


A good data engineer knows her way around the Linux CLI and can quickly diagnose a thread dump.

A bad data engineer pings DevOps every time a process dies.


A good data engineer knows the JVM and Python ecosystems well and continues to learn more.

A bad data engineer thinks the Spark tutorial is all she needs to finish a project.


A good data engineer understands that “small data” is just as important as “big data”.

A bad data engineer judges how interesting a project is only by its GB and TB.


A good data engineer understands the complexity of maintaining software.

A bad data engineer thinks code complete means job well done.


A good data engineer thinks seriously about security, and knows when to get help from experts.

A bad data engineer thinks IT will take care of security.


A good data engineer starts by thinking about failure cases and corner cases.

A bad data engineer learns about failure cases in the middle of the night through PagerDuty.


A good data engineer knows how the company’s data needs progress over time.

A bad data engineer thinks about data problems purely from a technical standpoint.


A good data engineer knows when to use a SaaS tool and when to roll her own solution.

A bad data engineer is hardcore about running everything in-house.


A good data engineer knows a prototype is the best way to get buy-in from decision makers.

A bad data engineer complains about “management” whenever her ideas get shut down in a meeting.


When a tool isn’t available, a good data engineer thinks seriously about creating and maintaining one (and open sources it if she decides to build it).

When a tool isn’t available, a bad data engineer deems the project impossible.


A good data engineer welcomes other data engineers to add to his list :)