There are so many data analytics platforms that choosing the right one can be a long process. So how do you actually get started?
The first step in choosing a data analytics platform is to define the selection criteria that will let you compare the candidates.
This is a very important step, since the criteria will determine the final choice. Here is what you definitely need to take into account when defining your own selection criteria:
As a Data Engineer, I had the opportunity to experience one Data Engineer/Data Scientist collaboration process and quickly saw its drawbacks. I therefore became very interested in how we can improve this collaboration and started reading about other processes that can improve the teamwork. Consequently, this article presents several ways of organizing the work between the Data Engineer and the Data Scientist, each with its benefits and drawbacks.
When the Data Scientist and the Data Engineer are not using the same programming language and the Data Engineer is the one responsible for production code, everything has to be re-coded.
Of course, after re-coding the data science project, the Data Engineer still has some additional tasks, such as scheduling, building the deployment part, and making sure that the monitoring features are in place. …
At university, we usually work on “projects” that last for a limited amount of time, so we don’t have time to learn how to put them into production, how to monitor them, or what a hotfix is and how to do one.
We learn all of this in the field, in internships or during our first job. To speed up the learning process, an experienced developer can teach a junior one the best practices that help with production pipelines.
We learn about testing, the different kinds of tests, and code design principles at university. However, in the beginning it is hard to identify small testable units of code, and we tend to test on bigger datasets or bigger portions of code than needed. …
Data Engineers are often presented as the “technology-focused” or “technology-expert” role among all data roles. However, if this is true, which technology should a Data Engineer focus on? And how does this role survive the evolution of technology?
Well, I would argue that it’s not so simple: this role is not just “technology-focused”, and some core aspects of it have remained the same over time even though the technologies have changed.
Data Engineering can involve different technologies depending on the company or team you work in, and the spectrum can be quite large: Spark, Apache Beam, Step Functions, Airflow, Google Dataprep, Kafka, Hive, Python, Java, Scala, Oozie, HBase, Cassandra, Spring…
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
Reason for the error:
This error happens when your RDD/DataFrame has partitions larger than 2GB.
This is because Spark can’t have shuffle blocks larger than 2GB: it stores shuffle blocks as ByteBuffers, which are limited to Integer.MAX_VALUE bytes (2GB).
Solutions:
If you see the above error in Spark 1.6, you can try the following:
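Increase the number of partitions, so that each partition is smaller:
raise spark.sql.shuffle.partitions (Spark SQL)
or spark.default.parallelism (RDDs)
or call df.repartition(numberOfPartitions)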
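One way to choose a value for numberOfPartitions is to divide the estimated data size by a target partition size well below the 2GB limit. The sketch below is only an illustration: the function name partitions_needed and the 128 MiB default target are assumptions, not part of Spark's API, and the estimate covers the average partition only, so heavily skewed data can still produce an oversized partition.

```python
# Upper bound on a shuffle block: Integer.MAX_VALUE bytes (just under 2 GiB).
MAX_BLOCK_BYTES = 2**31 - 1


def partitions_needed(total_bytes: int,
                      target_partition_bytes: int = 128 * 1024**2) -> int:
    """Return the smallest partition count that keeps the *average*
    partition at or below target_partition_bytes.

    Hypothetical helper: both the name and the 128 MiB default are
    illustrative assumptions, not Spark settings.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if not 0 < target_partition_bytes <= MAX_BLOCK_BYTES:
        raise ValueError("target must be positive and below the 2GB block limit")
    # Integer ceiling division avoids float rounding on very large sizes.
    return max(1, (total_bytes + target_partition_bytes - 1) // target_partition_bytes)


# Example: roughly 500 GiB of data to shuffle.
number_of_partitions = partitions_needed(500 * 1024**3)
print(number_of_partitions)
```

The resulting number could then be passed to df.repartition(numberOfPartitions) or used as the value of spark.sql.shuffle.partitions.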
…
I wanted an eBike because I live about 14 km from work, which means 15-20 min by car and 60 min by bus. Since the last part of the route has a 172 m elevation gain, I never did the whole ride by bike until I had the eBike.
This year, for my birthday, I received an eBike DIY kit, so my husband and I started converting my bike into an eBike.
In this article, we want to show you the steps we took and give you an idea of what you will need to do if you want to convert your own bike into an eBike. …
I’ve been a data engineer for almost 5 years. I wanted to write this article because we see a lot of articles about data trends or data science trends, but not so many focused on my role: the data engineer.
For me,
the Data Engineer is a software engineer specialized in data:
data modeling, data plumbing, data transforming, data integration, data storage, data platform maintenance …
However, being a software engineer implies keeping up with software development trends, and being specialized in data means keeping up with data architecture paradigms and the data landscape. …