Spark 3.0 is out, and there are a ton of improvements! But there is one nice improvement that is not yet highlighted in the announcement post: filter pushdown for CSV files.
Prior to Spark 3.0, when you loaded a CSV file, the whole file was read into memory and the filter was applied afterwards, which wasted CPU cycles and bandwidth. Now the data can be filtered as the files are read. This is similar to filter pushdown in Parquet, but now for CSV files.
Here is a quick example: I load a CSV file (flights dataset from Kaggle), then filter by ORIGIN_AIRPORT, then print out the execution plan. …
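To make the idea concrete outside Spark, here is a minimal pure-Python sketch of the difference between filtering after loading and filtering while reading. It is not Spark's implementation. The sample rows, the function names, and the `LAX` value are made up for illustration; only the `ORIGIN_AIRPORT` column name comes from the example above.

```python
import csv
import io

# Made-up sample data; only the ORIGIN_AIRPORT column name comes from the post.
SAMPLE_CSV = """ORIGIN_AIRPORT,DESTINATION_AIRPORT,DISTANCE
LAX,JFK,2475
SEA,SFO,679
LAX,SEA,954
"""


def load_then_filter(text, origin):
    """Materialize every row in memory first, then filter (the old behaviour, conceptually)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return [row for row in rows if row["ORIGIN_AIRPORT"] == origin]


def filter_while_reading(text, origin):
    """Apply the predicate as each row is parsed, so rejected rows are never kept."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        if row["ORIGIN_AIRPORT"] == origin:
            yield row


matches = list(filter_while_reading(SAMPLE_CSV, "LAX"))
print([row["DESTINATION_AIRPORT"] for row in matches])
```

Both functions return the same rows. The difference is that the second one never holds the rejected rows, which is the saving that pushdown aims for at a much larger scale.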
Docker allows you to limit the memory, CPU, and more recently GPU resources of your container. It is not as simple as it sounds.
Let’s go ahead and try this:
docker run --rm --memory 50mb busybox free -m
The above command creates a container with a 50 MB memory limit and runs free to report the available memory. If you run this on a Mac, you should see output similar to the screenshot below.
Why doesn’t it show 50 MB, as set in the memory parameter? Why does it show 2 GB of memory, and where does the 2 GB come from?
This is the first catch of container memory limits. The --memory parameter limits the container’s memory usage, and Docker will kill the container if it tries to use more than that limit. But inside the container, you still see the memory of the whole system.
free reports the available memory on the system, not the memory the container is allowed to use. The same goes for os.totalmem (Node.js) or psutil.virtual_memory (Python). …
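If a program inside the container needs the enforced limit rather than the system total, one option is to read it from the cgroup filesystem. The sketch below is an illustration under assumptions: it checks the usual cgroup v2 path (`/sys/fs/cgroup/memory.max`) and the usual cgroup v1 path (`/sys/fs/cgroup/memory/memory.limit_in_bytes`), and the helper names are hypothetical. The actual paths can differ depending on how the container runtime mounts cgroups.

```python
from pathlib import Path

# Usual locations of the memory limit inside a container (assumed; the
# actual mount layout depends on the host and the container runtime).
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw):
    """Return the limit in bytes, or None when cgroup v2 reports "max" (no limit)."""
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def container_memory_limit():
    """Return the cgroup memory limit in bytes, or None if none can be found."""
    for path in (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT):
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable: try the next location
    return None


limit = container_memory_limit()
print("no cgroup limit found" if limit is None else f"limit: {limit / 1024 ** 2:.0f} MiB")
```

Note that cgroup v1 reports a very large number instead of "max" when no limit is set, so a caller may want to compare the result against the system total before trusting it.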
This is my collection of notes and opinions on Software Architecture. They help guide me through software architecture and design. I publish them in the hope that they will be helpful to others, and also to receive feedback 🙂
Architecture is about identifying the components necessary to support the business requirements, along with their characteristics, their roles, and how they interact with each other.
Software design is the realization of the architecture. There may be multiple designs that support the same architecture. One can consider the architecture to be the most abstract design of the system.
Architecture is about things that are not likely to change throughout the lifecycle of the system. It’s like building a house: the architecture tells you how many stories there are, where the doors are, and where the rooms are. These elements are fixed, at least for a very long time. The furniture may change, the paint may change, the people in the rooms may change, but it’s not likely that you will change the location of a door. …
I have been using Go for a while, but mainly for tools. So I decided to invest some time to learn more about the language, and also more about systems programming and distributed programming.
The chat server was just a random idea. It is simple, yet complicated enough for a sandbox project. I would try to do everything from scratch.
This post is more of a summary of my experience during the exercise. If you want to look at the source code, it is in this GitHub repository.
So let’s start!
I will start with very basic features:
The other day I had a challenge: restoring an EC2 instance from EBS snapshots. I have worked with AMIs and EBS for many years, but I had never tried this before.
I was given some information about the environment, such as the host OS and the port SSH was listening on, and that was it. I would have to figure out the rest. In fact, there were two EBS snapshots: one for the root volume and one for data. And I would need to bring up the instance and the data in the middle of the night…
The first thing I tried was to create an image from the snapshot so that I could launch the instance from the image. However, I stumbled into a very weird issue where it said the snapshot did not belong to my account. It seemed that you would also need to grant the CreateVolume permission to the external account. The only way I was able to make this work was to copy the snapshot to my account, and then create the image from my own snapshot. Luckily I got the right virtualization type for the image (HVM is a good choice, PV is only for very old instances) and the other default parameters worked. Viewing the instance system log is a good way to confirm whether the instance is booting normally. …
AWS just announced a new service, AWS Secrets Manager, at the SF Dev Summit (I was there at the announcement 😇). It is a cool service that helps you manage and rotate your secrets securely.
But actually, this is not something new. There is also a less well-known service, AWS Systems Manager (SSM, originally Simple Systems Manager), that provides a similar feature to Secrets Manager. Today I would like to write a post to show you this service and how you can use it easily in Python.
AWS Secrets Manager helps you store, distribute, and rotate credentials securely. You can use it to store credentials for RDS, other databases, or any type of secret (tokens, API secrets, etc.). It also provides secret rotation, which allows you to change the secret, and an audit trail of when the secret is rotated. …
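As a rough illustration of reading a secret from SSM in Python, the sketch below wraps boto3's `get_parameter` call on the SSM client, which I am assuming is the Parameter Store feature the post refers to. The helper name `get_secret` and the parameter name `/myapp/db_password` are hypothetical, and the stub client exists only so that the example runs without AWS credentials.

```python
def get_secret(ssm_client, name):
    """Fetch one parameter from SSM Parameter Store, decrypting SecureString values."""
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


class StubSSMClient:
    """Stand-in for boto3.client("ssm") that mimics the get_parameter response shape."""

    def __init__(self, values):
        self._values = values

    def get_parameter(self, Name, WithDecryption=False):
        return {"Parameter": {"Name": Name, "Value": self._values[Name]}}


# With real AWS access you would pass boto3.client("ssm") instead of the stub.
stub = StubSSMClient({"/myapp/db_password": "not-a-real-secret"})
print(get_secret(stub, "/myapp/db_password"))
```

Passing the client in as an argument keeps the helper easy to test and leaves region and credential setup to the caller.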
Timezones are a hard problem. DST is an even harder problem. I found myself walking into problem after problem when I started using datetime in Python properly. So I decided to write a blog post to share my experience.
The first thing to know is that in Python there are two types of datetime: offset-naive and offset-aware. Offset-naive means that the datetime has no timezone information. This can be very error-prone if you are new to Python. If you mix a naive datetime and an aware datetime, for example by subtracting one from the other or by checking which one is earlier, you will get a TypeError. …
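A minimal sketch of the naive versus aware distinction, using only the standard library. The dates are arbitrary, and treating the naive value as UTC is an assumption made purely for the example.

```python
from datetime import datetime, timezone

naive = datetime(2020, 3, 8, 12, 0)                       # no tzinfo: offset-naive
aware = datetime(2020, 3, 8, 12, 0, tzinfo=timezone.utc)  # tzinfo set: offset-aware

try:
    aware - naive  # subtracting across the two kinds is not allowed
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print(f"TypeError: {exc}")

# One way out: state explicitly which timezone the naive value was meant to be in.
# Treating it as UTC here is an assumption made for this example.
fixed = naive.replace(tzinfo=timezone.utc)
print(aware - fixed)
```

Note that `replace(tzinfo=...)` only attaches a label and does not shift the clock time, so it is only correct when you already know which timezone the naive value was recorded in.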
AWS has a very flexible permission system using IAM policies. But sometimes it is also complicated to get the access control right. Yesterday, I worked on an IAM policy to allow an instance to start and stop another instance.
It should have been straightforward, and this was my initial policy.
It worked well. But I quickly found out that it didn’t work for all instances. There was no useful information: the instance quickly entered the pending state, then stopped immediately with the message
Server.InternalError: Internal error on launch. …
Copying AWS AMIs across accounts requires a lot of manual steps. If you do it from the UI, you will need to share the images, then share the relevant snapshots, then go to the new account’s console to copy the AMIs to your account.
If you have to do it for 100 AMIs, doing it manually will surely not scale. So I wrote a small script to do the job for me.
Below is my script that allows you to mass copy AMIs from one account to another based on a pattern. You will need to edit copy-snapshot.sh to fill in the necessary values: your source account, your target account, and the AWS profiles for the source account and the target account.
One of the features that I am dealing with is GPU resource management. The requirement is that we have multiple scheduled jobs, each job is a Python script that may require a number of GPUs to work, and the scheduler needs to distribute the GPUs evenly among these scripts. Kubernetes GPU scheduling would eventually be the solution, but right now the platform that I am working on is not ready to adopt it yet. So I came up with a very simple solution that is usable in the short term.
The idea is to use CUDA_VISIBLE_DEVICES to control GPU allocation. When a script starts, it queries for available GPUs and then sets the environment variable to acquire them. If there are not enough GPUs, the script will just fail and the scheduler will schedule it again later, hopefully with enough GPUs available next time. …
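The acquisition step could look roughly like the sketch below. How free GPUs are discovered is not covered in the excerpt, so the list of free GPU ids is passed in by the caller; `acquire_gpus` and its arguments are hypothetical names. The one firm requirement is that `CUDA_VISIBLE_DEVICES` is set before the CUDA runtime initializes, which in practice means before importing the deep learning framework.

```python
import os


def acquire_gpus(free_gpu_ids, required):
    """Claim `required` GPUs by setting CUDA_VISIBLE_DEVICES, or fail if too few are free.

    `free_gpu_ids` is whatever the availability query returned (for example
    from a wrapper around nvidia-smi); how it is produced is left open here.
    """
    if len(free_gpu_ids) < required:
        # Failing fast lets the scheduler retry the job later.
        raise RuntimeError(f"need {required} GPUs, only {len(free_gpu_ids)} free")
    chosen = list(free_gpu_ids)[:required]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in chosen)
    return chosen


# Example: GPUs 1 and 3 are free and the job needs one of them.
print(acquire_gpus([1, 3], required=1), os.environ["CUDA_VISIBLE_DEVICES"])
```

This sketch does not guard against two scripts querying at the same moment and claiming the same GPU, so a real setup would need some form of locking around the query and the claim.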