How to technically distinguish among data projects?

The idea of technical superiority in data projects

Vimarsh Karbhari
Acing AI
Jun 4, 2020


Technical excellence is certainly required at the engineering level for data projects. Each project should be treated as an iterative work in progress: easy to improve, and easy to change and update across the model, the dataset and the code. Additionally, the standard software engineering principles of scalability, monitoring and performance apply to data projects as well.


Scalability and Performance

As data projects scale, the team should be able to leverage different building blocks or servers and scale their usage horizontally and/or vertically. It is important to understand the bottlenecks here. The concepts of scalability and performance apply to web servers, databases and data models alike.
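As a minimal illustration (my own sketch, not from a specific stack), the snippet below splits a scoring job across worker processes. The same partition-and-distribute idea extends to multiple servers, where the bottleneck often shifts to a shared database or feature store.

```python
# Minimal sketch of scaling a scoring job horizontally across workers.
# The "score" function is a stand-in for a real model's predict call.
from multiprocessing import Pool

def score(record):
    # Placeholder scoring logic; a real job would call model.predict here.
    return 0.1 * record["feature_a"] + 0.9 * record["feature_b"]

def score_partition(partition):
    return [score(r) for r in partition]

if __name__ == "__main__":
    records = [{"feature_a": i, "feature_b": i % 7} for i in range(100_000)]

    # Split the workload into partitions and fan them out to worker processes.
    n_workers = 4
    partitions = [records[i::n_workers] for i in range(n_workers)]

    with Pool(processes=n_workers) as pool:
        results = pool.map(score_partition, partitions)

    scores = [s for part in results for s in part]
    print(f"scored {len(scores)} records across {n_workers} workers")
```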

Performance vs Scalability

If you have a performance problem, your system is slow for a single user.

If you have a scalability problem, your system is fast for a single user but slow under heavy load.
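To make the distinction concrete, here is a small, self-contained sketch (toy code, not from any real service) that times a pretend request handler at different concurrency levels. A performance problem shows up as high latency even at concurrency 1; a scalability problem shows up as latency that grows with concurrency.

```python
# Toy measurement showing how per-request latency behaves as concurrency grows.
# "handle_request" stands in for any service or model endpoint; the shared lock
# plays the role of a contended resource such as a single database connection.
import time
import threading
import statistics
from concurrent.futures import ThreadPoolExecutor

shared_resource = threading.Lock()

def handle_request(_):
    start = time.perf_counter()
    time.sleep(0.005)            # work that parallelizes well
    with shared_resource:        # work serialized behind a shared resource
        time.sleep(0.005)
    return time.perf_counter() - start

def median_latency(concurrency, total_requests=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return statistics.median(pool.map(handle_request, range(total_requests)))

if __name__ == "__main__":
    # Latency is fine at concurrency 1 (no performance problem) but grows
    # under load because of the contended resource (a scalability problem).
    for c in (1, 10, 50):
        print(f"concurrency={c:>2}  median latency={median_latency(c):.3f}s")
```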

Practices used to ensure that production code is in good condition (such as code reviews, unit testing and integration testing) must also be applied to models to ensure their scalability and performance.
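As one hypothetical example of what "unit testing a model" can look like, the pytest-style sketch below checks that predictions are valid probabilities and clear a minimum accuracy bar. The model, data and thresholds are placeholders, not a prescribed standard.

```python
# Hypothetical sanity tests for a trained model, in the spirit of unit tests
# for production code. The model, data, and the 0.8 threshold are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_toy_model(seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return LogisticRegression().fit(X, y), X, y

def test_predictions_are_valid_probabilities():
    model, X, _ = train_toy_model()
    proba = model.predict_proba(X)
    assert proba.shape == (len(X), 2)
    assert np.all((proba >= 0) & (proba <= 1))

def test_accuracy_above_minimum_bar():
    model, X, y = train_toy_model()
    # The 0.8 bar is arbitrary here; real projects set it against a baseline.
    assert model.score(X, y) >= 0.8
```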

Monitoring

Monitoring delivers the data and metrics used for planning and decision-making on data projects. Bugs in metrics code lead to incorrect statistics being reported and, in turn, incorrect decisions being made. Every part of a data project, from front-end services to databases, caches and models, should have monitoring components. These can be split into audit logs, service logs and metrics-related logs. Each service should have an agent that writes its logs to the logging framework, providing separation of concerns and scalability.
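One possible way to keep those streams separated, sketched below with Python's standard logging module rather than any particular monitoring stack, is to use distinct named loggers so each stream can be routed to its own sink and scaled independently. The event names and fields are illustrative.

```python
# Separate named loggers for audit, service, and metrics events, so each
# stream can be routed to its own handler/sink and scaled independently.
import json
import logging

def build_logger(name, filename):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(filename)   # could be a log-shipping agent instead
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger

audit_log = build_logger("audit", "audit.log")
service_log = build_logger("service", "service.log")
metrics_log = build_logger("metrics", "metrics.log")

# Example events from a model-serving request (names are illustrative).
audit_log.info(json.dumps({"user": "u123", "action": "predict", "model": "v7"}))
service_log.info("prediction served in 42ms")
metrics_log.info(json.dumps({"metric": "prediction_latency_ms", "value": 42}))
```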

Agile

The data science field is developing rapidly, and agile software practices enable technical excellence on data projects.

Architecture

At a larger scale, no architecture can possibly remain optimal for more than a few years, due to the rapid developments in infrastructure platforms as well as advances in algorithms. Therefore, it is important to be able to re-architect with reasonable frequency.

Models

Do models really change that often? Because deploying a model takes so long, data teams sometimes assume that models do not need to change frequently. In reality, model changes are frequent by nature.

A simple example is the matching model that Uber uses to pair riders with drivers. If Uber, say, made a mistake and published a highly inaccurate model that matched riders with drivers in a different city, customers would not want to wait a couple of weeks for an update that fixes it. In such a scenario, even an hour of the issue can cost millions and prove very expensive.
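One common way to make such fixes fast is to treat models as versioned artifacts that can be promoted or rolled back through configuration rather than a full redeployment. The sketch below is a hypothetical illustration with invented paths and version names, not a description of any company's actual system.

```python
# Hypothetical model registry with instant rollback via a version pointer.
# Paths and version names are invented for illustration.
import pickle
from pathlib import Path

REGISTRY = Path("model_registry")

def save_model(model, version):
    REGISTRY.mkdir(exist_ok=True)
    with open(REGISTRY / f"{version}.pkl", "wb") as f:
        pickle.dump(model, f)

def set_active_version(version):
    # Deploying or rolling back is just rewriting a small pointer file.
    (REGISTRY / "ACTIVE").write_text(version)

def load_active_model():
    version = (REGISTRY / "ACTIVE").read_text().strip()
    with open(REGISTRY / f"{version}.pkl", "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    save_model({"name": "matching-model", "params": "v1"}, "v1")
    save_model({"name": "matching-model", "params": "v2"}, "v2")
    set_active_version("v2")   # ship the new model
    set_active_version("v1")   # bad model? roll back in seconds, not weeks
    print(load_active_model())
```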

Other points

In addition to this, it is important that teams have a good level of data literacy so that the experiments used to motivate priorities and investments are meaningful. Superiority in data projects also covers a commitment to excelling at communicating and storytelling with data. Great data engineers and data teams are astute enough about their area to learn how to communicate well with it. These practices include, but are not limited to: ensuring that all graphs are properly labeled, choosing the right type of graph for the data being portrayed, understanding the difference between a trend (behavior over time) and individual dips and rises, and understanding whether a difference in a metric is meaningful relative to the error of the metric itself. Additionally, teams need to plot and present all of this information with context so it is consumable by the business teams.
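As a small illustration of those plotting practices, the sketch below uses made-up data: the axes and title are labeled, the full trend is shown rather than a single dip, and an error band gives the reader context for judging whether a week-to-week difference is meaningful.

```python
# Illustrative plot with made-up data: labeled axes, a visible trend, and an
# error band that gives context for whether week-to-week changes are meaningful.
import numpy as np
import matplotlib.pyplot as plt

weeks = np.arange(1, 13)
conversion = 0.20 + 0.004 * weeks + np.random.default_rng(0).normal(0, 0.005, 12)
stderr = 0.006  # assumed standard error of the weekly metric

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(weeks, conversion, marker="o", label="Weekly conversion rate")
ax.fill_between(weeks, conversion - stderr, conversion + stderr,
                alpha=0.2, label="±1 standard error")
ax.set_xlabel("Week")
ax.set_ylabel("Conversion rate")
ax.set_title("Conversion rate trend, last 12 weeks (simulated data)")
ax.legend()
plt.tight_layout()
plt.show()
```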

Subscribe to our Acing AI newsletter. I promise not to spam, and it's FREE!

Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It's great cardio for your fingers AND will help other people see the story.
