Know the process for choosing “the best tool ever”… given the known facts and biases

There are so many data analytics platforms that choosing the right one can be a long process. But how do you actually start?

Homer: Too many options, what do I choose?
Source: memegenerator.net

The first step in choosing a data analytics platform is to define your selection criteria, so that you can actually compare the candidates.

This is a very important step, since these criteria will determine the final choice. Here is what you definitely need to take into account when building your own list of criteria:


Frustration, tension, and bitterness all result from the absence of common ground for cooperation. Finding a collaboration process that values each data role is fundamental.

As a Data Engineer, I had the opportunity to experience one Data Engineer/Data Scientist cooperation process and quickly saw its drawbacks. I therefore became very interested in how this collaboration can be improved and started researching other processes that could improve the teamwork. Consequently, this article presents multiple ways of organizing the work between the Data Engineer and the Data Scientist, each with its benefits and drawbacks.

The Data Engineer industrializes the Data Science project

When the Data Scientist and the Data Engineer are not using the same coding language and the Data Engineer is the one responsible for production code, everything has to be re-coded.

Data Scientist/Data Engineer organization: the Data Engineer industrializes the Data Science project

Of course, after re-coding the data science project, the Data Engineer has some additional tasks, like scheduling, building the deployment, and ensuring that the monitoring features are in place. …


Source: Red Building Of A School Campus by Matthis (Pexels)

I’ve been a Data Engineer for just over six years, and I’ve mentored several junior engineers. Every single time, it reminded me of my early years: we all make the same mistakes, because there are some core lessons that we don’t learn at university.

Production-purpose pipelines

At university, we usually have “projects” that last for a limited amount of time. Therefore, we never get to learn how to put our university projects into production, how to monitor them, or what a hotfix is and how to do one.

We learn all of this in the field, during internships or our first job. To speed up this learning, an experienced developer can teach the junior one the best practices that are helpful in production-purpose pipelines.

Testing

We learn about testing, the different kinds of tests, and code design principles at university. However, in the beginning it is somehow hard to find small, testable units of code, and we tend to test on bigger datasets, or on bigger portions of code, than needed. …
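As a hypothetical illustration of what a small testable unit can look like (the function and data here are invented for the example), the idea is to extract a pure transformation function and test it on a tiny, hand-built dataset instead of a production-sized one:

```python
def normalize_emails(rows):
    """Lower-case and strip e-mail addresses; drop rows without one.

    A pure function over plain dicts: no I/O, no cluster, trivially testable.
    """
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")
    ]

# A tiny, hand-crafted dataset is enough to cover the interesting cases:
sample = [
    {"id": 1, "email": "  Alice@Example.COM "},
    {"id": 2, "email": None},  # should be dropped
]
cleaned = normalize_emails(sample)
assert cleaned == [{"id": 1, "email": "alice@example.com"}]
```

Because the unit has no dependency on a database or a cluster, the test runs in milliseconds and can cover edge cases exhaustively.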


Why Data Engineers are more than just experts on one specific technology

Data Engineers are often presented as the most technology-focused of all data roles. However, if this is true, which technology should a Data Engineer focus on? And how does this role survive the evolution of technology?

Well, I would argue that it’s not so simple: this role is not just “technology-focused”, and some core aspects of it have remained the same over time even though the technologies have not.

Different technologies

Data Engineering can involve different technologies depending on the company or team you work for, and the spectrum can be quite large: Spark, Apache Beam, Step Functions, Airflow, Google Dataprep, Kafka, Hive, Python, Java, Scala, Oozie, HBase, Cassandra, Spring…


java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

Cause of the error:

This error happens when your RDD/DataFrame has partitions larger than 2 GB.

Spark cannot have shuffle blocks larger than 2 GB, because it stores each shuffle block as a ByteBuffer, which is limited to Integer.MAX_VALUE bytes (about 2 GB).

Solutions:

If you see the above error in Spark 1.6, you can try the following:

Increase spark.sql.shuffle.partitions or spark.default.parallelism so the data is split into more, smaller partitions.

Repartition your DataFrame explicitly: df.repartition(numberOfPartitions)
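As a rough sketch of how you might pick numberOfPartitions (the helper name and the 128 MB target are my own assumptions, not from Spark), you can derive a partition count that keeps every partition well under the 2 GB shuffle-block limit, given an estimate of your shuffled data size:

```python
import math

# ByteBuffer limit on shuffle blocks: Integer.MAX_VALUE bytes, i.e. just under 2 GB.
MAX_PARTITION_BYTES = 2 * 1024**3

def safe_partition_count(total_bytes, target_partition_bytes=128 * 1024**2):
    """Return a partition count keeping each partition near a target size.

    target_partition_bytes defaults to 128 MB, a common rule of thumb that
    stays far below the 2 GB limit. Assumes partitions are roughly even,
    i.e. no heavy key skew.
    """
    if target_partition_bytes >= MAX_PARTITION_BYTES:
        raise ValueError("target must stay below the 2 GB shuffle-block limit")
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# e.g. a ~500 GB shuffle with 128 MB target partitions:
num = safe_partition_count(500 * 1024**3)
# then: df.repartition(num)
# or:   spark.conf.set("spark.sql.shuffle.partitions", num)
```

Note that this only helps if the data is reasonably evenly distributed: with heavy key skew, one partition can still blow past 2 GB no matter how many partitions you ask for.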


I wanted an eBike because I live about 14 km from work, which means 15–20 minutes by car and 60 minutes by bus. Since the last part of the route has a 172 m elevation gain, I never did the whole ride by bike until I had the eBike.

Elevation Gain

This year, for my birthday, I received a DIY eBike conversion kit. So, together with my husband, I started converting my bike into an eBike.

In this article, we want to show you the steps we took and give you an idea of what you will need to do if you want to convert your own bike into an eBike.


I’ve been a data engineer for almost 5 years. I wanted to write this article because we see a lot of articles about data trends or data science trends, but not so many focused on my role: the data engineer.

For me,

the Data Engineer is a software engineer specialized in data:

data modeling, data plumbing, data transforming, data integration, data storage, data platform maintenance …

However, being a software engineer implies following software development trends, and being specialized in data implies keeping up with data architecture paradigms and the data landscape. …

About

Alina GHERMAN

Data Architect | Data Enthusiast
