Source: DataBricks

Spark 3.0 is out, and there are a ton of improvements! But there is one nice improvement that is not yet highlighted in the announcement post: filter pushdown for CSV files.

Prior to Spark 3.0, when you loaded a CSV file, the file was read into memory and the filter was applied afterwards, which wastes CPU cycles and bandwidth. Now the data can be filtered as the files are read. This is similar to filter pushdown in Parquet, but now for CSV files.
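To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) of the two strategies: materialise every row and filter afterwards, versus test each row while it is being parsed. The function names and the sample data are made up for illustration, and Spark's own implementation is not shown here.

```python
import csv
import io

# Made-up sample data standing in for a CSV file on disk.
SAMPLE = "carrier,delay\nAA,5\nUA,42\nAA,17\n"


def load_then_filter(text, predicate):
    """Old style: materialise every row first, filter afterwards."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [row for row in rows if predicate(row)]


def filter_while_reading(text, predicate):
    """Pushdown style: test each row as it is parsed and keep only matches."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate(row):
            yield row


def is_aa(row):
    """Hypothetical filter: keep rows for carrier AA."""
    return row["carrier"] == "AA"


if __name__ == "__main__":
    print(load_then_filter(SAMPLE, is_aa))
    print(list(filter_while_reading(SAMPLE, is_aa)))
```

Both functions return the same rows; the second one never holds the rejected rows in a list, which is the saving the paragraph above describes.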

Here is a quick example: I load a CSV file (flights dataset from Kaggle), then…

Docker allows you to set limits on memory, CPU, and, more recently, GPU for your container. It is not as simple as it sounds.

Let’s go ahead and try this:

docker run --rm --memory 50mb busybox free -m

The above command creates a container with 50mb of memory and runs free to report the available memory. If you run this on a Mac, you should see output similar to the screenshot below.

Why doesn’t it show 50mb, as set in the memory parameter? Why does it show 2gb of memory, and where does the 2gb come from?

This is the first…
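One way to see the limit that Docker actually applied is to read it from the container's cgroup files instead of relying on free, which reads /proc/meminfo and is not scoped to the container's cgroup. The following is a minimal sketch, assuming the usual cgroup mount point /sys/fs/cgroup and the standard file names for cgroup v2 (memory.max) and cgroup v1 (memory/memory.limit_in_bytes). The function name is hypothetical.

```python
from pathlib import Path

# Memory limit files relative to the cgroup root: cgroup v2 first, then v1.
CANDIDATES = ("memory.max", "memory/memory.limit_in_bytes")


def cgroup_memory_limit(root="/sys/fs/cgroup"):
    """Return the cgroup memory limit in bytes, or None if unlimited or unknown."""
    for rel in CANDIDATES:
        path = Path(root) / rel
        if path.is_file():
            raw = path.read_text().strip()
            if raw == "max":  # cgroup v2 spelling of "no limit"
                return None
            return int(raw)
    return None


if __name__ == "__main__":
    print(cgroup_memory_limit())
```

On cgroup v1 an unlimited container reports a very large number rather than "max", so a caller may want to treat implausibly large values as "no limit" too.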

This is my collection of notes and opinions on Software Architecture. It helps guide me through software architecture and design. I am publishing it in the hope that it will be helpful for others, and also to receive feedback 🙂

Architecture is about identifying the components necessary to support the business requirements, their characteristics, their roles, and how they interact with each other.

Software design is the realization of the architecture. There may be multiple designs that support the architecture. One can consider the architecture to be the most abstract design of the system.

Architecture is about things that are not likely…

I have been using Go for a while, but mainly for tools. So I decided to invest some time to learn more about the language, and also more about systems programming and distributed programming.

The chat server was just a random idea. It is simple, but also complicated enough for a sandbox project. I would try to do everything from scratch.

This post is more of a summary of my experience during the exercise. If you want to look at the source code, it is in this GitHub repository.

So let’s start!

The requirements

I will start with very basic features:

  • There is a single…

So the other day I had a challenge: restore an EC2 instance from EBS snapshots. I have worked with AMIs and EBS for many years, but I had never tried this before.

I was given some information about the environment, such as the host OS and the SSH listening port, and that was it; I would have to figure out the rest. In fact, there were two EBS snapshots: one for the root and one for data. And I would need to bring up the instance and the data in the middle of the night…

The first…

AWS just announced a new service, AWS Secrets Manager, at the SF Dev Summit (I was there at the announcement 😇). It is a cool service that helps you manage and rotate your secrets securely.

But actually, this is not something new. There is also a lesser-known service, AWS Systems Manager (SSM, originally Simple Systems Manager), that provides a similar feature to Secrets Manager. Today I would like to write a post to show you this service and how you can easily use it from Python.

AWS Secrets Manager and AWS SSM Parameter Store

AWS Secrets Manager helps you to store, distribute, and rotate credentials securely. You can use it…

Timezones are a hard problem. DST is an even harder problem. I found myself running into problem after problem when I started using datetime in Python properly. So I decided to write a blog post to share my experience.

“Naive” and “Aware”

The first thing to know is that in Python there are two types of datetime: offset-naive and offset-aware. Offset-naive means that the datetime has no timezone information. This can be very error-prone if you are new to Python. If you mix a naive datetime and an aware datetime, for example by subtracting or ordering them, you will get a TypeError. …
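The distinction can be checked with the standard library alone. This is a minimal sketch; the helper names is_aware and try_subtract are made up for illustration, and the date is arbitrary.

```python
from datetime import datetime, timezone

naive = datetime(2020, 3, 8, 12, 0)                       # no tzinfo attached
aware = datetime(2020, 3, 8, 12, 0, tzinfo=timezone.utc)  # tzinfo attached


def is_aware(dt):
    """A datetime is aware when its tzinfo yields a UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


def try_subtract(a, b):
    """Return a - b, or the TypeError raised when naive and aware are mixed."""
    try:
        return a - b
    except TypeError as exc:
        return exc


if __name__ == "__main__":
    print(is_aware(naive), is_aware(aware))
    print(repr(try_subtract(naive, aware)))
```

Note that an equality test between a naive and an aware datetime does not raise; it simply evaluates to False, which can hide the mistake. Attaching an explicit tzinfo as early as possible avoids both cases.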

AWS has a very flexible permission system based on IAM policies. But sometimes it is complicated to get the access control right. Yesterday, I worked on an IAM policy to allow one instance to start / stop another instance.

It should be straightforward, and this is my initial policy.

"Effect": "Allow",
"Action": [
"Resource": "*"

It worked well. But I quickly found out that it didn’t work for all instances. It didn’t provide any useful information: the instance quickly entered the pending state, then stopped immediately with the message Server.InternalError: Internal error on launch. …
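A policy of this shape can be assembled and inspected in Python before it is attached. The sketch below assumes the actions are ec2:StartInstances and ec2:StopInstances, since the excerpt does not list them; the helper name is hypothetical, and this is not a fix for the launch error described above.

```python
import json


def start_stop_policy(resource="*"):
    """Build an IAM policy document allowing start/stop on `resource`.

    The action list is an assumption; the original policy's actions are
    not shown in the excerpt.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(start_stop_policy(), indent=2))
```

Passing a specific instance ARN instead of "*" as `resource` narrows the statement to that instance.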

Copying AWS AMIs across accounts requires a lot of manual steps. If you do it from the UI, you need to share the images, then share the relevant snapshots, then go to the new account's console to copy the AMI to your account.

If you have to do it for 100 AMIs, doing it manually will not scale. So I wrote a small script to do the job for me.

Below is my script that allows you to mass copy AMIs from one account to another based on a pattern. You will need to edit it to fill in the…

One of the features that I am dealing with is GPU resource management. The requirement is that multiple jobs are scheduled; each job is a Python script and may require a number of GPUs to work, and the scheduler needs to distribute the GPUs evenly among these scripts. Kubernetes GPU scheduling would eventually be the solution, but right now the platform that I am working on is not ready to adopt it. So I came up with a very simple solution that is usable in the short term.

The idea is to use CUDA_VISIBLE_DEVICES to control GPU allocation. When a script…
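One possible shape for such a scheduler is sketched below: it gives each job the least-loaded GPUs and builds an environment in which only those GPUs are visible. This is an illustration under assumptions, not the original implementation; the function names, the job names, and the least-loaded strategy are all made up here.

```python
import os
from collections import Counter


def allocate_gpus(gpu_ids, load, count):
    """Pick `count` GPU ids that have the fewest jobs assigned so far."""
    if count > len(gpu_ids):
        raise ValueError("job asks for more GPUs than exist")
    # sorted() is stable, so ties are broken by the order of gpu_ids.
    chosen = sorted(gpu_ids, key=lambda g: load[g])[:count]
    for g in chosen:
        load[g] += 1
    return chosen


def job_env(chosen, base_env=None):
    """Return an environment in which only the `chosen` GPUs are visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in chosen)
    return env


if __name__ == "__main__":
    gpus = [0, 1, 2, 3]
    load = Counter()  # number of jobs currently assigned to each GPU
    for name, need in [("train_a.py", 2), ("train_b.py", 1), ("train_c.py", 2)]:
        picked = allocate_gpus(gpus, load, need)
        print(name, job_env(picked, {})["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument of subprocess.Popen when launching each script. Inside the process, CUDA renumbers the visible devices from 0, so the script itself does not need to know which physical GPUs it was given.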

Bao Nguyen

I write, so I learn.
