
Data engineering skills are a hot topic for anyone interested in becoming a data engineer. The rise of data platforms wouldn’t be possible without data engineers, who develop the infrastructure and tools. The need for specialists in this area is forecast to keep growing, so if you are considering becoming a Data Engineer, these are the 5 essential skills you have to possess. Enjoy!

Programming — Scala and/or Python



Recently I came across a very cool site with cooking recipes that had an extremely poor UI, especially on mobile. There was no official API, so I decided to build a web service that scrapes the content and publishes it through a RESTful API. In this post I will show you how to use Scala, Play Framework and Jsoup to build such a service. Enjoy!

You will need

  • Scala: I use version 2.13.1
  • sbt: install it from the official website
  • IDE of choice: IntelliJ or VS Code (I use the latter)
  • Around 15 minutes of your…



The Scala language, being a mix of object-oriented and functional programming paradigms, has a rather unique collection framework compared to other languages. Although it inherits a lot from Java and the JVM world, Scala has a much easier-to-use API and is overall more polished.

In this post, I will show you a range of Scala collections, how to use them with code samples, and where to use them. Enjoy!

Scala collection package(s)

First, let's have a look at the packages that contain all the Scala collections. The top-level package is called, as you might expect, collection. More interesting are the sub-packages included in it. …
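
As a small illustration of the package layout, here is a sketch that uses two sub-packages from the standard library, scala.collection.immutable and scala.collection.mutable. The object name and the values are made up for this example; they are not taken from the original post.

```scala
import scala.collection.{immutable, mutable}

// Hypothetical demo object, only for illustration.
object CollectionPackagesDemo {
  // Immutable collections return a new collection on every "modification".
  val base: immutable.List[Int] = immutable.List(1, 2, 3)
  val extended: immutable.List[Int] = base :+ 4 // base itself is unchanged

  // Mutable collections are updated in place.
  val buffer: mutable.ArrayBuffer[Int] = mutable.ArrayBuffer(1, 2, 3)
  buffer += 4

  // Both kinds can be passed where the general scala.collection.Seq is expected.
  def total(xs: scala.collection.Seq[Int]): Int = xs.sum
}
```

Calling CollectionPackagesDemo.total works with both base and buffer, since the immutable and mutable sequences share the common scala.collection.Seq supertype.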



Recently, I finally had a chance to sit the exam for the Google Cloud Certified Associate Cloud Engineer certification, and I passed on the first attempt! This was possible thanks to thorough preparation, during which I used a selection of resources. In this post I will share the 5 that I think contributed the most to the positive outcome of the GCP ACE exam. Enjoy!

1. Official GCP ACE Study Guide



One of the greatest features of Apache Spark is its ability to infer the schema on the fly. Reading the data and generating a schema as you go is easy to use, but it makes reading the data slower. However, there is a trick: generate the schema once, and then just load it from disk. Let’s dive in!

For the sake of this article, let’s assume that we are using data with a complex structure, where a case class or struct type of any kind would be a hundred lines long.

Saving the Schema

Let’s begin by reading some sample data and forcing schema inference on it. Using Apache Spark and the Scala language, this can be done like the…



Working with types in Scala can be challenging in many ways. The deeper you dig into the complexity of this subject, the more difficult it is to find the right path. In this short post, I will explain one of the most fundamental parts of the Scala type system: type bounds. Enjoy!

What is a type bound?

A type bound is nothing more than a mechanism for adding restrictions to type parameters. When passing a type to, let’s say, a class, we can limit which types we want to allow. …
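
To make the idea concrete, here is a minimal sketch of an upper bound (<:) and a lower bound (>:). The Animal, Dog, Cat and Shelter names are hypothetical and chosen only for this illustration.

```scala
// Hypothetical demo object, only for illustration.
object TypeBoundsDemo {
  trait Animal { def name: String }
  case class Dog(name: String) extends Animal
  case class Cat(name: String) extends Animal

  // Upper bound: A must be Animal or one of its subtypes,
  // so the class body can safely call .name on every element.
  class Shelter[A <: Animal](val residents: List[A]) {
    def names: List[String] = residents.map(_.name)
  }

  // Lower bound: B must be Dog or one of its supertypes,
  // so a Dog can always be prepended to a List[B].
  def prependDog[B >: Dog](dog: Dog, others: List[B]): List[B] =
    dog :: others
}
```

With these definitions, new Shelter(List(Dog("Rex"))) compiles, while new Shelter(List("not an animal")) is rejected by the compiler because String does not satisfy the upper bound.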



When I decided to start writing my own programming blog, I wasn’t sure which platform would be the best to start on. I was convinced that, despite its flaws, WordPress would be my CMS of choice. But that wasn’t the only choice to make: a hosting platform had to be chosen as well. In this post, I will explain why I think AWS Lightsail is the optimal service for this.

What is AWS Lightsail?

Lightsail is an easy-to-use cloud platform that offers you everything needed to build an application or website, plus a cost-effective, monthly plan.

AWS Website

The Amazon Web Services (AWS) ecosystem consists of several hosting options, and Lightsail is one of them. As the quote above explains, it’s a managed hosting service that combines computing and storage with affordable pricing. The service can support a range of use cases, from test environments for hobby projects to fully-featured web applications. …



Writing functions and methods that process input data and produce some output is the very core of any type of programming. When diving deeper into the different types of functions in Scala, there is one that differs from the others: the partial function. As it is part of Scala’s standard library, it is worth knowing what it is and where it should be used. Let’s dive in!

What is a Partial Function?

A partial function of type PartialFunction[A, B] is a unary function where the domain does not necessarily include all values of type A.

Scala Standard Library

The definition above, taken straight from the Scala standard library docs, may seem quite obscure. Fear not though, as the concept of a partial function is very easy to understand. …
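
A short sketch may help here. It defines a PartialFunction[Int, Double] that is only defined for non-zero inputs, then uses the standard isDefinedAt, collect and lift methods. The reciprocal example itself is an assumption made for illustration, not code from the original post.

```scala
// Hypothetical demo object, only for illustration.
object PartialFunctionDemo {
  // Defined only for non-zero integers; zero is outside the domain.
  val reciprocal: PartialFunction[Int, Double] = {
    case x if x != 0 => 1.0 / x
  }

  // isDefinedAt checks whether a value belongs to the domain.
  val definedAtZero: Boolean = reciprocal.isDefinedAt(0)

  // collect applies the function only to the elements it is defined for.
  val results: List[Double] = List(2, 0, 4).collect(reciprocal)

  // lift turns the partial function into a total one returning Option.
  val safe: Int => Option[Double] = reciprocal.lift
}
```

Here results skips the zero instead of failing, and safe(0) returns None rather than throwing a MatchError, which is what calling reciprocal(0) directly would do.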



Scala is one of the most popular languages in the JVM ecosystem. A recent post by Li Haoyi explains why Scala is entering the ‘slope of enlightenment’, with an ever-growing community and a plethora of mature, production-ready tools. It is a great time to start learning Scala, and this post lists what are, in my opinion, the best places to start your journey with it.

1. Scala Book



Unit testing Apache Spark Structured Streaming jobs using MemoryStream is a non-trivial task. Sadly, the official Spark documentation still lacks a section on testing. In this post, therefore, I will show you how to start writing unit tests for Spark Structured Streaming.

What is MemoryStream?

A Source that produces value stored in memory as they are added by the user. This Source is intended for use in unit tests as it can only replay data when the object is still available.

Spark SQL Docs

MemoryStream is one of the streaming sources available in Apache Spark. This source allows us to add and store data in memory, which is very convenient for unit testing. The official docs emphasize this, along with a warning that data can be replayed only when the object is still available. …

About

Bartosz Gajda

Software Engineer, Computer Science student and Blogger. Cycling, cooking and enjoying craft beer in free time.
