EXPEDIA GROUP TECHNOLOGY — SOFTWARE
The Virtual Internship: My Experience
Team building by video and coding with Scala and Spark
This summer, I interned at one of the largest travel companies in the world, Expedia Group™️. Who knew that during a pandemic, with travel at an all-time low, Expedia would still have a great intern program planned for us!
Luckily, my previous internship was remote too, so I was somewhat used to working from home. In this internship, I mostly worked on cloud and data engineering.
The induction week
Expedia organized an induction week for us, with workshops on various topics: how to become a better leader, what a designer does, and even how Expedia operates. Workshops continued throughout the internship, and there was a lot to learn from each session.
My setup
My Team
I was a part of the Stay Experience team, which looks after the post-booking experience for a traveller. The team was divided into pods, and I worked on the .Net Stack Modernisation pod. My work involved making a microservice and figuring out how to deploy it to the cloud.
I worked with a manager and a buddy. We synced up every week and discussed work updates every day, which motivated me to stay on track. I also synced up with our project sponsor a few times. I enjoyed these one-on-one sessions; they helped me connect with the team better.
All the Vrbo interns had a weekly sync with the Vrbo managers, where we socialized, played some games, and got help with any issues. It was nice to have someone for support throughout the program, and these calls also helped me get to know the other interns better.
The project: Create a Spark job to sync data from MongoDB to S3
Motivation
We have a database that stores the notifications for guests of guests in a homestay. We wanted a way to store a copy of this data in S3 and keep it up to date.
Why are you storing the data in two different places?
This data is not only used to send notifications but is also queried by analysts and data scientists. The MongoDB database also backs the guests-of-guests API, and if everyone ran their workloads directly against MongoDB, it could severely affect the API.
We finally came up with the idea to use a Spark job. This job would take the data from MongoDB and store it in an S3 bucket.
Why a Spark job?
Spark and Scala are widely used in industry for data processing, and a Spark job can handle huge amounts of data without any hiccups.
A quick revision
This blog is going to get more technical now, so here’s a quick reference if you need a reminder.
What’s MongoDB? It’s a NoSQL database.
What’s Scala? It’s a programming language widely used for data management. I did all my coding in the Scala language.
What’s Apache Spark? It’s a fast and general-purpose cluster computing system for large scale data processing. Some important terms to know about from Spark are:
- Dataframe: This is data in the form of a table, just like a relational database table.
- Dataset: An extension of DataFrames that adds type safety and supports an object-oriented programming interface.
- SparkSession vs SparkContext: SparkSession is the unified entry point of a Spark application, starting with Spark 2.0. It provides a way to interact with various Spark functionality and encapsulates the Spark context, Hive context, and SQL context all together (see the short sketch after this list).
- Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features; it is now encapsulated within the SparkSession.
What’s a Spark job? In a Spark application, a job is created whenever you invoke an action on a dataset; it is the unit of work that gets submitted to Spark for execution.
What’s a vault? Hashicorp Vault is a tool for secrets management, encryption as a service, and privileged access management.
What’s an S3 bucket? A simple distributed file storage service; think of it as the hard disk on your computer, hosted in the cloud.
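To make these terms concrete, here is a tiny, hypothetical Scala example (the `Book` case class and the values are made up purely for illustration):

```scala
import org.apache.spark.sql.SparkSession

// A case class lets Spark give us a type-safe Dataset instead of an untyped DataFrame.
case class Book(id: Int, author: String, title: String)

object QuickRevision {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point since Spark 2.0.
    val spark = SparkSession.builder()
      .appName("quick-revision")
      .master("local[*]") // run locally for this example
      .getOrCreate()
    import spark.implicits._

    // A Dataset: typed rows backed by the Book case class.
    val books = Seq(Book(1, "A1", "B1"), Book(2, "A2", "B2")).toDS()

    // A DataFrame is just an untyped, table-like view of the same data.
    val df = books.toDF()

    // Invoking an action (count) is what actually submits a Spark job.
    println(s"Number of books: ${df.count()}")

    spark.stop()
  }
}
```

Everything before the action just builds up a lazy plan; calling `count()` is what triggers a job.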
Approach
The Spark job does the following (a rough Scala sketch follows the list):
- Authenticates against the vault and gets the secrets
- Reads the updated data from MongoDB
- Reads the data stored in the S3 bucket (it’s in Apache Parquet format)
- Does a left anti join on the S3 and Mongo data, yielding the data that did not change
- Merges the data that didn’t change with the new data
- Writes the data in Parquet format to the S3 bucket
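Here is a rough sketch of that pipeline in Scala. The connector options, bucket path, column names, and the way the last sync time is passed in are placeholders rather than the actual project code, and the Vault step is left out (a separate sketch of it follows the authentication steps below):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// A minimal sketch of the sync job, assuming the MongoDB Spark connector 2.x
// ("com.mongodb.spark.sql.DefaultSource") and placeholder URIs/paths.
object MongoToS3Sync {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-to-s3-sync").getOrCreate()

    val mongoUri    = sys.env("MONGO_URI")             // hypothetical: fetched via Vault in practice
    val parquetPath = "s3a://my-bucket/notifications"  // hypothetical bucket/prefix
    val lastSync    = args(0)                          // e.g. the timestamp of the previous run

    // 1. Read only the documents updated since the last sync from MongoDB.
    val updated = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", mongoUri)
      .load()
      .filter(col("updateTime") > lastSync)

    // 2. Read the snapshot currently stored in S3 (Parquet).
    val existing = spark.read.parquet(parquetPath)

    // 3. Left anti join: keep only the S3 rows whose id was NOT updated in Mongo.
    val unchanged = existing.join(updated, Seq("id"), "left_anti")

    // 4. Merge the unchanged rows with the freshly fetched ones
    //    (assuming both sides share the same schema).
    val merged = unchanged.unionByName(updated)

    // 5. Write the merged snapshot back to S3. Writing to a staging path first
    //    avoids overwriting the data we are still lazily reading from.
    merged.write.mode("overwrite").parquet(parquetPath + "_staging")

    spark.stop()
  }
}
```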
The authentication process is more complicated than just sending a saved token (a sketch follows these steps):
- Grab the EC2 instance identity document and a nonce from the instance metadata
- Send the metadata and nonce to Vault
- Vault checks that the EC2 instance is allowed to authenticate and that the nonce is correct
- If the checks pass, Vault returns a token
- Use the token to get secrets like the AWS access key, AWS secret key, and MongoDB password
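Here is a rough sketch of that exchange against Vault's aws-ec2 auth backend over plain HTTP. The role name and the secret path are made up, and a production job would use a proper Vault client and JSON parser rather than the quick regex below:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// A sketch of EC2-based Vault login, assuming the aws-ec2 auth method is enabled
// and a role (the made-up "spark-sync-role") is configured for this instance.
object VaultEc2Auth {
  private val http      = HttpClient.newHttpClient()
  private val vaultAddr = sys.env("VAULT_ADDR") // e.g. https://vault.example.com:8200

  private def get(url: String, headers: (String, String)*): String = {
    val builder = HttpRequest.newBuilder(URI.create(url))
    headers.foreach { case (k, v) => builder.header(k, v) }
    http.send(builder.GET().build(), HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Grab the signed instance identity document (PKCS7) from EC2 metadata.
    val pkcs7 = get("http://169.254.169.254/latest/dynamic/instance-identity/pkcs7")
      .replace("\n", "") // Vault expects it on a single line

    // 2. Send it, along with our nonce, to Vault's aws-ec2 login endpoint.
    val loginBody =
      s"""{"role": "spark-sync-role", "pkcs7": "$pkcs7", "nonce": "${args(0)}"}"""
    val loginResp = http.send(
      HttpRequest.newBuilder(URI.create(s"$vaultAddr/v1/auth/aws-ec2/login"))
        .POST(HttpRequest.BodyPublishers.ofString(loginBody))
        .build(),
      HttpResponse.BodyHandlers.ofString()
    ).body()

    // 3. Vault verifies the instance and returns auth.client_token in the JSON
    //    response; use a real JSON library in production instead of a regex.
    val token = """"client_token":"([^"]+)"""".r
      .findFirstMatchIn(loginResp).map(_.group(1))
      .getOrElse(sys.error("Vault login failed"))

    // 4. Use the token to read secrets (the path here is hypothetical).
    val secrets = get(s"$vaultAddr/v1/secret/data/mongo-sync", "X-Vault-Token" -> token)
    println(secrets)
  }
}
```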
Here’s a simplified example explaining the method:
Mongo data:
id author book updateTime
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T3
4 A4 B4 T4

S3 data:
id author book updateTime
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T2.1

Step 1:
Get all docs from Mongo with updateTime between updateTime1 and updateTime2, where T2 < updateTime1 < updateTime2.
Fetched records:
3 A3 B3 T3
4 A4 B4 T4

Step 2:
Read all records from S3.
Fetched records:
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T2.1

Step 3:
Left anti join on id between the S3 and Mongo data, so that we have the data that did not change:
1 A1 B1 T1
2 A2 B2 T2

Step 4:
Merge the results of Step 3 with the data fetched in Step 1.
Resulting data set:
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T3
4 A4 B4 T4
Save this back to S3. You can see how this data is in sync with Mongo data.
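The same walkthrough expressed as a few lines of Spark, with the Mongo and S3 reads replaced by hard-coded toy rows:

```scala
import org.apache.spark.sql.SparkSession

// The toy walkthrough above, using in-memory rows in place of the real reads.
object ToyMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("toy-merge").master("local[*]").getOrCreate()
    import spark.implicits._

    // Records fetched from Mongo in Step 1 (updated since the last sync).
    val mongo = Seq((3, "A3", "B3", "T3"), (4, "A4", "B4", "T4"))
      .toDF("id", "author", "book", "updateTime")

    // Current S3 snapshot read in Step 2.
    val s3 = Seq((1, "A1", "B1", "T1"), (2, "A2", "B2", "T2"), (3, "A3", "B3", "T2.1"))
      .toDF("id", "author", "book", "updateTime")

    // Step 3: the left anti join keeps the S3 rows whose id was not updated (1 and 2).
    val unchanged = s3.join(mongo, Seq("id"), "left_anti")

    // Step 4: merge with the updated rows; prints ids 1-4, with record 3 now at T3.
    unchanged.unionByName(mongo).orderBy("id").show()

    spark.stop()
  }
}
```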
Challenges (and how I tackled them)
VPN Issues: I needed to use a VPN to remotely access project documentation and tools, but the VPN I was using did not support Linux or WSL, so I had to work in Windows, which I’m not as familiar with. It was not fun to learn Windows versions of the one-line commands I know in Linux.
However, I found a hack for this. I created an EC2 instance on the Expedia network and shifted all my work to that, using Windows just for the VPN and tools like a terminal window. Because the instance is on the corporate network, I could access everything I needed from there, and as a bonus the scripts ran much faster on the instance than on my computer.
RAM Issues: I started off using IntelliJ since it was easy to set up a Scala project with it. But on my machine, IntelliJ was terribly slow and consumed a lot of RAM.
- I later switched to VS Code plus Metals. Although it was a little more time-consuming to set up, once I got the hang of it, all my RAM issues were solved. Also, using Bloop instead of sbt decreased compile times.
- Running a local MongoDB server for testing was also quite RAM-hungry; MongoDB Atlas helped me a lot here. I deployed a free cluster in the cloud, and my work was done :)
Spark and MongoDB
- Spark has some underlying concepts that were a little nontrivial to understand, at least for a first-timer like me. On top of that, the MongoDB Spark connector's documentation followed old Spark conventions in some places and new ones in others, which confused me a bit.
Understanding Vault EC2 authentication
- Learning how to do authentication took most of my time. As a college student, I had never had to worry about securing my applications, since we never really put anything into production.
- I learned about Hashicorp Vault (for storing secrets), Terraform (for building instances from code) and Consul (a backend for Vault).
- Rather than using a stored token to authenticate against Vault and fetch the secrets, a better method is to authenticate using the metadata of the EC2 instance in which the Spark job runs.
- I also practiced creating EC2 instances with Terraform and setting up Consul as the Vault backend.
Setting up EC2 instances
- Setting up an instance is easy, but determining the right VPC, subnet, security groups, etc. for production really is a security design issue, and one has to decide carefully.
- The majority of AWS services that get hacked are compromised because of misconfiguration, not because of flaws on Amazon's side.
Thanks
Big thanks to Himanshu Verma and Arushi Dhunna for guiding me and answering my endless doubts and queries.
I would also like to thank Rohini Gupta, Anu Chopra, and Bhupinder Guleria for the weekly calls and support throughout the internship.
Thanks to the Early Careers Team for organizing the amazing workshops from Expedians all around the world: Jackie Kam, Rachel Tan, Chetna Mahobia. Thanks to Jim McCoy for all the help throughout the workshops.
The folks on internal Slack channels #scala, #vault and #just-amazon-things also helped me with my doubts, and are very much appreciated.
Originally published at https://rohanrajpal.github.io on June 17, 2020.