Data Engineering, the unsung hero

Tarmo Tali
Pipedrive R&D Blog
May 3, 2022 · 10 min read

How did product P do in market M last month compared to the same month a year before? Who are your top customers, and how do they use your product? What are the most significant drop-off points in your customer onboarding? If someone in your organization had a similar analytical question that could be answered with the help of data, could you answer it?

Can you pull up a dashboard that covers the topic, or check the data model documentation and run a query on the central data warehouse?
Or do you need to start a whole new journey: checking different tools and databases and gluing the pieces of information together with custom logic while validating each and every step as you move ahead?

If you’ve answered “yes” to the former, you probably have a team of diligent data engineers who have your back. If you haven’t even heard of them, it probably means they’re doing a great job.

Here’s a sneak peek into how we do it at Pipedrive.

Data is hard to get right

Though it’s not rocket science, it’s easy to get wrong, both technically and organizationally. I know this firsthand, having gotten it wrong enough times to understand where my past mistakes were. Not to toot my own horn, but by building on that experience, we managed to get it right at Pipedrive.

I will use data engineering in the broad sense of the term, covering all the technical work required to build a data platform. In this context, a data platform is the set of tools needed to acquire, process and present data.

Every company is unique. There is no ready-made blueprint or single best practice that works fantastically for everyone. The approaches described below may not be fit for your company. Still, our approach is based on more than half a century of combined experience building scalable data platforms in fast-growing companies.

The Data Team should be under Data

  • Business units tend to monopolize resources under their command
  • Shared resources need visibility and clear prioritization
  • Falling too much behind breaks everything
  • A chain is only as strong as its weakest link

Organizational structure matters. It shapes the collaboration with other teams and tends to dictate priorities. There aren’t many companies where the data function reports directly to the CEO through a chief data officer, for example. Perhaps that’s because a data organization could fit into many departments, like finance, marketing, engineering or product. However, fitting the data organization into a business unit creates a conflict: the business unit that “owns” the team will often monopolize it. If, for example, data were part of the finance team, finance would prioritize its own requests over other teams’.

At Pipedrive, data is part of engineering, and we have been granted the independence to run our own prioritization. The name of the department does not really matter. What matters is our ability to work with our internal customers across the company without interference from a chain of command. This brings us to the next crucial topic: prioritization.

For a resource that is free to internal customers, demand will always outstrip supply. As a result, some requests will never make it to the top of the backlog, leaving the requesters unhappy. While we could grow the data team, we could never meet the demand for a free service. We could implement cross-charging, but that would take us deep into bureaucracy with little room left for agility. So instead, we chose to implement a very open and transparent prioritization process.

The data team is responsible for spending our budget in the company’s best interest. Since we don’t know whose request is objectively more pressing, we address the potential conflict in advance by inviting people with conflicting requests into the same (virtual) room and asking them to agree on the order of our backlog. While it may not scale for larger companies, it has worked surprisingly well for us. This mediation session helps us focus on the most impactful work while letting everyone see where we spend our time.

The team should work on the highest priorities, thereby focusing on the sweetest part of the ROI. But what about the requests that are never picked up or take longer than expected? How do we know when to reject additional requests and when to grow our team to meet the demand instead? The answer is more art than science: a successful data product manager should develop a good gut feeling for company-level business progress by relying on the same global prioritization process. Such instincts are crucial, since failing to deliver critically important data features on time will leave business departments hanging and force them to come up with their own, often lacking, solutions.

A chain is only as strong as its weakest link. If the data team wants to be useful, it needs to take responsibility for the entire value chain: acquire source data effectively, process it in a cost-effective manner, build usable models and document them, ensure great performance, train end users, listen to feedback and keep improving. If one link is missing from the chain, the result will fall short.


The technology matters

  • A scalable, cost-effective data platform is not a commodity
  • Storage and compute separation enables cost-effective growth
  • On-demand compute solves a significant problem

A word of warning: our technical decisions go back to 2015. If we were starting today, we’d probably make different choices. We started with AWS Redshift, which helped us get off the ground quickly. However, it became apparent that large amounts of rarely used data and sophisticated transforms are not Redshift’s strong suits. Spark was all the rage, so we went with it. AWS EMR was a joke back then, so we rolled our own cluster management. After a quick detour through large HDFS volumes, we ended up with a setup that has been serving us well for years.

Let’s look at the principles behind our setup. For actual implementation, there may be better tools available today as we made our decisions at the time when Snowflake was in its infancy and AWS Spectrum didn’t exist.

At Pipedrive, we separate storage and compute and use S3 as our storage layer. This lets us store large amounts of data cost-effectively, gives us the flexibility to choose our data processing tools and opens the data up to many other tools.

What does it mean? We can store hundreds of terabytes of data without breaking the bank. We can use PySpark to write our transforms while running queries in Spark SQL, AWS Spectrum or AWS Athena. We can also access data from Sagemaker or any other tool that can read parquet files. This separation also enables us to scale compute independently, and we rely heavily on on-demand compute.
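To make that concrete, here is a minimal sketch of what such a PySpark transform over S3-backed parquet might look like. The bucket paths, table and column names are invented for the example, not our actual setup:

```python
from pyspark.sql import SparkSession

# Hypothetical S3 paths, for illustration only.
RAW_PATH = "s3a://example-data-lake/raw/deals/"
CURATED_PATH = "s3a://example-data-lake/curated/deals_daily/"

spark = SparkSession.builder.appName("deals-daily-transform").getOrCreate()

# Compute reads straight from object storage; no data lives on the cluster.
deals = spark.read.parquet(RAW_PATH)

# A simple daily aggregation as an example transform.
daily = (
    deals.groupBy("owner_id", "created_date")
    .count()
    .withColumnRenamed("count", "deals_created")
)

# Write the result back to S3 as parquet, partitioned so that other engines
# (Spark SQL, Athena, Spectrum, SageMaker) can read it independently.
daily.write.mode("overwrite").partitionBy("created_date").parquet(CURATED_PATH)
```

The point of the sketch is the separation itself: the transform runs on whatever compute happens to be available, while the data stays in S3 where any parquet-capable tool can reach it.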

On-demand compute is the key to keeping your CFO happy. The ability to get as many CPU cores as you need, exactly when you need them, provides massive scalability with tightly controlled cost. We need thousands of cores for the few hours when we run our overnight data transforms and can fall back to a fraction of that for the rest of the day.

It gets even better with on-demand query engines like AWS Spectrum or AWS Athena. While there may be a slight delay at the beginning of a query, on-demand query engines can handle very different workloads without overpaying or long queues. So, in addition to the CFO, we can also keep our users happy. How cool is that?
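As an illustration of how little machinery this takes, here is a hedged sketch of running an ad hoc query through Athena with boto3. The database, table and result bucket are made up to match the earlier example, not our real names:

```python
import time

import boto3

# Hypothetical database, table and result bucket; adjust to your own setup.
athena = boto3.client("athena")

query = """
    SELECT owner_id, SUM(deals_created) AS deals
    FROM analytics.deals_daily
    WHERE created_date >= DATE '2022-04-01'
    GROUP BY owner_id
    ORDER BY deals DESC
    LIMIT 20
"""

# No cluster is provisioned for this; Athena bills per data scanned.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes; the short startup delay is the price
# of not keeping a warm cluster around.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```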

Doing right things

  • It’s straightforward to do wrong things right
  • Loading and transforming data is engineering, not tools usage
  • Knowledge about data belongs to data owners
  • Data Mesh is not going to help you, yet

While it’s easy to start with a point-and-click ETL tool over an ODBC/JDBC connection to any available database to cover some of the reporting needs, it doesn’t work in the long run. It’s also quick to hack together a giant OLAP cube with some simple end-user tools and hope you can get all the answers from there. We have seen such attempts, and they just don’t work.

Keep your near-future scale in mind when you lay down the architecture vision, and plan to resolve the complex and challenging scenarios first. Always question the appeal of quick solutions that only cover a few simple use cases without addressing the more complex ones.

Building data pipelines and loading and transforming data from sources to destinations is a combined challenge of business knowledge (i.e., understanding the data semantics) and engineering skills. Today, there are no tools that can manage the entire process in an easy, scalable and maintainable way without strong engineering capabilities at hand. Loading and transforming data requires coding skills: each deliverable is a piece of program code to be carefully written, tested, validated, reviewed and version controlled, like in any other sustainable software development process. Therefore, the development team must apply engineering practices and share the same disciplines, mindset and culture. Automation, CI/CD, pipeline tools and monitoring: it’s all engineering. Nevertheless, business domain knowledge is just as crucial, because the technical data transformations encode the internal logic of the data and must, therefore, take into account the purpose of the analytics that will be built on top of the data models.
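One way to read “loading and transforming data is engineering” in practice is to keep each transform as a plain, reviewable function with a test shipped next to it. A minimal sketch, with the column names and business rule invented for the example:

```python
from pyspark.sql import DataFrame, Row, SparkSession
from pyspark.sql import functions as F


def open_deals_per_owner(deals: DataFrame) -> DataFrame:
    """Pure transform: a DataFrame in, a DataFrame out, no side effects."""
    return (
        deals.where(F.col("status") != "deleted")
        .groupBy("owner_id")
        .agg(F.count("*").alias("open_deals"))
    )


def test_open_deals_per_owner():
    # A small, local SparkSession so the business rule is testable in CI.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    deals = spark.createDataFrame([
        Row(owner_id=1, status="open"),
        Row(owner_id=1, status="deleted"),
        Row(owner_id=2, status="open"),
    ])
    result = {r["owner_id"]: r["open_deals"]
              for r in open_deals_per_owner(deals).collect()}
    assert result == {1: 1, 2: 1}
```

The business rule (here, what counts as an “open” deal) stays visible in one small function that the domain owner can read and review, while the surrounding plumbing remains ordinary, version-controlled code.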

The crucial challenge here is to reach clear agreements between stakeholders owning the data and data engineering teams building the data pipelines. The ownership of the business logic should remain on the business side, even though the code that implements the logic is owned by the engineering team. Without such agreements and visibility, the data team ends up “owning everything,” leaving the business stakeholders unaware of their own data domains, which, in our experience, won’t scale for larger companies.

According to the relatively recent Data Mesh concept, a single central data team developing the logic for all the different business domains is a dead end that can’t scale. The proposed solution is to break the central data engineering competence into units located in business functions, with each smaller team owning a single business domain’s data challenge end to end. While this approach makes sense, there are two things to keep in mind. Firstly, as mentioned earlier, there are no easy-to-use business tools for handling everything in the data pipeline. Secondly, the organization has to be huge to justify distributing, and partially duplicating, a single central data engineering competence across a dozen smaller teams that are still capable of operating self-sufficiently.

From the Team Topologies point of view, contemporary data engineering tends to fall under the complicated-subsystem category (https://teamtopologies.com/key-concepts). The tools are just not mature enough to help almost anyone build and maintain cost-effective, high-performance data pipelines. We may get there in the next 5–10 years, but today data engineering requires skills and experience that are neither widely available nor easily trained.


Value is in the eye of the beholder

  • You are only as good as your end-user experience
  • Self-service needs care and feeding
  • Analysts should analyze
  • A single source of truth requires common best practices

A scalable and cost-effective data platform is an accomplishment you should be proud of. Oddly enough, your end users won’t be as impressed unless you manage to solve the last mile of a data service: the analytics tools for users and self-service capabilities. Though serving the long tail of analytical and data requests for the organization may be seen as out of scope for the data warehouse solution itself, if this last mile of the process fails, end users won’t feel its value.

Building self-service analytics doesn’t happen by itself. Building five hundred Tableau reports based on quick one-off requests and keeping them all available to users results in way too many reports that are of little use and only pile up over time. Shaping everyday analytics requests into more general workbooks and dashboards that can answer more than one question needs care and attention, not to mention a clear acknowledgment that this investment is worth the additional upfront cost.

Ideally, the analyst role should mainly be about forming and shaping the business question and iterating on it, backed by data. In practice, however, without a strong self-service reporting layer, the main focus of analysts will be on extracting the right data from the warehouse. This is a serious bottleneck for business scalability and, therefore, must be addressed as part of the data platform concept.

The highly desirable single source of truth can only be achieved if there is a very clear common agreement on how reporting outputs are built and maintained, which tools are used and who owns what. It makes sense to think of reports and dashboards as engineering artifacts, modules of programmed code to be carefully reviewed, versioned and maintained.

Conclusion

Many clickbait headlines call big data a failed initiative. We dare to disagree. The big data ecosystem has given us power that, just a decade ago, was only available to large corporations with deep pockets. While we all love tools, they do not solve any problem on their own. Data engineering is a sophisticated discipline that requires a careful mix of technical and business skills. We need a long chain of coherent steps to get from the mountains of raw data to value. The individual steps are simple enough, but the whole machine must work in concert.

One more thing: the chain cannot be pushed. If top management does not understand the value of data and does not back data engineering with adequate investment, you are better off handing in your resignation letter and starting to look for a new job. I happen to know a company that pays data engineering its well-deserved respect (wink-wink).

This is the first article in a series about data engineering at Pipedrive. Stay tuned for more stories!

Interested in working at Pipedrive?

We’re currently hiring for several different positions in several different countries/cities.

Take a look and see if something suits you

Positions include:

  • Junior Data Engineer
  • Junior Data Platform Developer
  • Software Engineer in DevOps Tooling
  • Backend, Full-Stack, iOS Engineers
  • Infrastructure Engineer
  • And several more…
