Are You Combining Software Engineering and Data Engineering for Maximum Efficiency?

Nadeem Khan(NK)
Published in LearnWithNK
6 min read · Sep 7, 2024


I’m a software engineer who loves coding, exploring new technologies, and really getting into the details to maximize results. I’m always looking for ways to optimize code — whether it’s mine or someone else’s — while improving processes and cutting costs. One of my favourite things to do is create reusable frameworks that can easily be adapted for future projects. When it comes to tools and technologies, I believe in picking the ones that not only simplify tasks but also save time and money.

As software engineers, we’re naturally adaptable. We can dive into areas like data engineering, machine learning, AI, and more. In fact, you can’t really separate software engineering from data engineering — data engineering is just a specialized branch of software engineering. Without solid software engineering principles, projects can quickly spiral out of control as they grow. I’m sure we’ve all seen that happen at some point in our careers!

A couple of years ago, I had the opportunity to be part of a major shift, moving away from traditional data warehousing to a more modern, cloud-native approach. That experience showed me firsthand how software engineering plays a key role in making data products scalable, maintainable, secure, and cost-efficient. In this blog, I want to share how I applied my software engineering skills to a data engineering project and got the most out of it.

Use Python-wrapped SQL

With Python-wrapped SQL (or whichever language you prefer), you really get the best of both worlds, and here’s why that’s such a win. Using Python, you can turn your transformation logic into reusable classes, which makes it super easy to extend or tweak later. You can also set up test cases, clean up redundant SQL code, and do plenty of other things that are hard to pull off with SQL alone.

Why?

It makes everything way more flexible and maintainable. Instead of writing the same SQL over and over or dealing with huge, complex queries, you can handle all that logic in Python, which simplifies things big time. Plus, you get the added bonus of being able to integrate advanced features or customize your transformations however you want — stuff that would be way trickier to do with plain SQL.
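Here’s a rough sketch of what I mean. The class names, tables, and columns below are just illustrative, not from any real project:

```python
# A minimal sketch of Python-wrapped SQL: transformation logic lives in
# reusable classes, and the rendered SQL is easy to unit test as a string.

class Transformation:
    """Base class: each transformation renders its SQL from parameters."""

    def __init__(self, source_table: str, target_table: str):
        self.source_table = source_table
        self.target_table = target_table

    def render_sql(self) -> str:
        raise NotImplementedError


class DailyAggregate(Transformation):
    """Reusable aggregation that can be extended or tweaked per project."""

    def __init__(self, source_table, target_table, group_by_column, amount_column):
        super().__init__(source_table, target_table)
        self.group_by_column = group_by_column
        self.amount_column = amount_column

    def render_sql(self) -> str:
        return (
            f"INSERT INTO {self.target_table} "
            f"SELECT {self.group_by_column}, SUM({self.amount_column}) AS total "
            f"FROM {self.source_table} "
            f"GROUP BY {self.group_by_column}"
        )


# Usage: the same class produces SQL for different tables.
sql = DailyAggregate("raw.orders", "mart.daily_orders", "order_date", "amount").render_sql()
print(sql)
```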

Externalize the configuration

One thing I’ve learned over time is that it’s super important to externalize any properties that change with every deployment. I’m talking about things like server names, database names, authentication schemes, authentication parameters, feature flags, data source names, table prefixes or suffixes, and performance-tuning settings. All of these should live outside the code.

Why?

Because hardcoding this stuff just makes life harder down the line. By keeping these settings separate, you make your application a lot more flexible. It makes deploying to different environments — like dev, test, or production — way smoother. Plus, it’s a quick win for improving scalability and maintainability.
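Here’s a small sketch of what that can look like in Python. The file layout, key names, and environment variables are placeholders I made up for illustration:

```python
# A minimal sketch of externalized configuration: deployment-specific
# settings live in a file outside the code, secrets come from the environment.
import os
import yaml  # pip install pyyaml

def load_config(environment: str) -> dict:
    """Read deployment-specific settings from a file that lives outside the code."""
    with open(f"config/{environment}.yaml") as f:
        config = yaml.safe_load(f)
    # Secrets never go in the file; they come from the environment at deploy time.
    config["db_password"] = os.environ["DB_PASSWORD"]
    return config

# config/dev.yaml might look like:
#   server_name: dev-sql-server
#   database_name: sales_dev
#   table_prefix: dev_
#   enable_new_pipeline: false
#
# config = load_config(os.environ.get("APP_ENV", "dev"))
```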

Parameterize the SQL query

Anything repetitive can be parameterized, right? That’s where Python-wrapped SQL comes in handy. You can treat SQL code as the output of a function or method, allowing you to pass parameters directly into your SQL queries. This means you can easily swap out values without having to rewrite the same SQL again and again.

Why?

It saves you time, keeps your code clean, and makes everything way easier to maintain. Instead of hardcoding values, you can just pass in a parameter and adjust things on the fly. That flexibility is a game-changer, especially when you’re dealing with complex queries or switching between different environments.
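A quick sketch, using sqlite3 just to keep it self-contained; the table and columns are made up for illustration:

```python
# A minimal sketch of parameterized SQL: the same query serves any
# region or threshold via bind parameters instead of hardcoded values.
import sqlite3

def load_large_orders(conn, region: str, min_amount: float):
    """Pass values as parameters; never rewrite or string-build the SQL."""
    query = (
        "SELECT order_id, region, amount "
        "FROM orders "
        "WHERE region = ? AND amount >= ?"
    )
    return conn.execute(query, (region, min_amount)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'EU', 120.0), (2, 'US', 40.0)")

# Swap values without touching the SQL itself.
print(load_large_orders(conn, "EU", 100.0))
print(load_large_orders(conn, "US", 10.0))
```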

Keep static data that changes from client to client as external configuration

This is the same idea as externalizing configuration, but here we’re externalizing things like lookup tables or metadata tables. Instead of hardcoding values or repeating SQL, you can pass parameters or reference external tables. It’s all about flexibility — making it easier to update or switch things out without touching your main code.

Why?

It keeps your code clean, reduces redundancy, and saves you from having to go in and manually update things. Plus, it’s a huge help when your data or environment changes, because you don’t have to dig through your codebase to make adjustments. Everything stays organized and easy to manage.
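Here’s a tiny sketch of the idea. The lookup file and status codes are invented purely for illustration:

```python
# A minimal sketch: client-specific lookup values live in an external
# file instead of being hardcoded in the SQL.
import json

def build_status_case_expression(lookup_path: str) -> str:
    """Turn an external status-code lookup into a SQL CASE expression."""
    with open(lookup_path) as f:
        status_lookup = json.load(f)  # e.g. {"1": "OPEN", "2": "CLOSED"}
    branches = " ".join(
        f"WHEN status_code = {code} THEN '{label}'"
        for code, label in status_lookup.items()
    )
    return f"CASE {branches} ELSE 'UNKNOWN' END AS status_label"

# Switching clients means swapping the JSON file, not editing the code.
# print(build_status_case_expression("config/client_a_status_codes.json"))
```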

Focus on Decoupling

Decoupling is probably the most important principle I’ve learned in my software engineering career. I’ve seen both sides of it, so I know exactly what’s at stake if your code isn’t decoupled. Trust me, it can get messy fast.

You’ve got to decouple your base and custom code, and your technical and business layers need to be separated too. Even when you externalize configurations or data, that’s a form of decoupling. It’s all about making sure different parts of your system aren’t too tightly connected, so you can make changes without everything falling apart. The more decoupled your code, the easier it is to maintain, extend, and scale.

Why?

Decoupling makes your code easier to maintain, extend, and scale. When things are loosely connected, you can tweak or update one part without affecting everything else. It keeps your system flexible, saves time during updates, and prevents major headaches when scaling up or adding new features.
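Here’s a bare-bones sketch of what that separation can look like; the class names are just illustrative:

```python
# A minimal sketch of decoupling: the technical layer (how SQL runs)
# knows nothing about the business layer (what the SQL computes).

class SqlRunner:
    """Technical layer: owns connections, retries, and logging."""

    def __init__(self, connection):
        self.connection = connection

    def run(self, sql: str):
        return self.connection.execute(sql)


class RevenueReport:
    """Business layer: owns the transformation logic only."""

    def render_sql(self) -> str:
        return "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"


# The two sides meet at a narrow interface, so either can change
# (swap databases, adjust business rules) without touching the other.
def build_report(runner: SqlRunner, report: RevenueReport):
    return runner.run(report.render_sql())
```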

Negative Programming

We often assume that if a function is called, it’ll get the right input and return the expected output. Negative programming — thinking about what could go wrong — tends to take a back seat, especially if there aren’t any network calls involved. But the moment you introduce network calls, you can’t just assume everything will work, even if you’ve run comprehensive test cases. Things will break, even if your code is perfect.

In every data engineering project, there’s a database, and we’ve all seen those annoying intermittent issues. To handle this, you need retries. And if the code still fails after retries, you should have proper fallback mechanisms in place. Plus, any downstream code that depends on the failed function should be able to handle the situation gracefully with proper assertions.

Why?

In the data-driven world, we all need faster results. If a critical transformation logic fails, and you only find out after everything’s processed, it becomes a real issue. This is why a “fail fast” approach, using negative programming, is crucial — it allows you to detect failures early, take preventive actions, and fix issues before they snowball into bigger problems.
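Here’s a minimal sketch of that retry-plus-fail-fast pattern; the function names and retry settings are assumptions for illustration:

```python
# A minimal sketch of negative programming: retry transient failures,
# surface the error if retries are exhausted, and fail fast downstream.
import time

def run_with_retries(run_query, retries: int = 3, delay_seconds: float = 2.0):
    """Retry an intermittent failure a few times before giving up."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return run_query()
        except ConnectionError as error:
            last_error = error
            print(f"Attempt {attempt} failed: {error}; retrying...")
            time.sleep(delay_seconds)
    # Fallback: raise a clear error instead of silently continuing.
    raise RuntimeError("Query failed after retries") from last_error

def transform(rows):
    # Fail fast: downstream code asserts its inputs instead of assuming them.
    assert rows is not None and len(rows) > 0, "No rows returned by upstream query"
    return [row for row in rows if row.get("amount", 0) > 0]
```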

Think Optimally

As we enter the new data era, where every decision a company makes is based on how fast they can get insights from their data, efficiency is key. To achieve this, you’ve got two choices: write non-optimal code and throw more resources at the problem, or write optimal code and keep costs low. These days, we often don’t have full control over the underlying resources since we’re using managed services, which makes it even more important to write code that performs well across different systems.

Writing optimal code doesn’t just mean writing better SQL — it sometimes means thinking beyond SQL entirely to find the best way to generate and analyze data. You have to choose what works best for your specific situation, because, like they say, one size doesn’t fit all.

Why?

In today’s data-driven world, speed and cost are critical. Writing optimal code helps you get insights faster without unnecessarily driving up costs. Plus, since we can’t always control the resources, writing efficient code ensures your solution performs well no matter what system it’s running on. It’s about balancing speed, cost, and performance in the best way possible.

Summary

To sum it up, applying core software engineering principles like decoupling and externalizing configurations can really enhance the scalability and maintainability of your data engineering projects. Python-wrapped SQL gives you flexible, reusable code, while parameterization simplifies complex queries. When you optimize for code efficiency, you not only cut costs but also improve performance across different systems. And with negative programming, you’re making sure your system handles errors gracefully, boosting reliability. All in all, these practices help you scale your data engineering efforts with more efficiency and adaptability.

Have you applied any of these software engineering principles in your own data engineering projects? I’d love to hear about your experiences! Share your thoughts or tips in the comments below so we can learn from each other.

If you found this post helpful, don’t forget to leave a clap 👏 and share it with others. You can also follow me on LinkedIn, GitHub, and Medium for more insights on coding, optimization, and the latest in tech.

Thanks for reading. Happy Learning.


Nadeem Khan(NK)
LearnWithNK

Lead Technical Architect specializing in Data Lakehouse Solutions with Azure Synapse, Python, and Azure tools. Passionate about data optimization and mentoring.