Creating a Work Visa Search Engine
The Tech Stack Behind Visatopia
Visatopia is a simple search engine for exploring millions of past work visa records in the US.
Visatopia was made out of frustration with companies not sharing whether they could hire international candidates. Large companies usually can, but with small and mid-size companies it often isn't clear. I found a couple of websites that share this kind of data, but they would just list rows of candidates' salaries and application dates. Those hundreds of rows were just that, raw data, missing any useful insight.
Search for Data
Now, with some ideas in mind, the search was on. To create a useful website with meaningful information, I needed to find the same raw data the other websites used. It turns out this was the easiest part: USCIS publishes its work visa application history on its website!
However, I soon realized the data set was not uniform at all: the naming conventions differed depending on the year, and salaries were reported hourly, weekly, monthly, or yearly. A significant amount of time had to go into data cleaning alone. This took me almost 3 months in total.
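To give a flavor of that cleaning work, here is a minimal sketch of one normalization step: converting every reported salary to an annual figure. The unit labels and the 2,080-hours-per-year assumption are mine, not from the USCIS files.

```python
# Hypothetical pay-basis multipliers; real USCIS files use varying labels
# that would first need to be mapped onto these keys.
PERIODS_PER_YEAR = {
    "hour": 2080,   # 40 hours x 52 weeks
    "week": 52,
    "month": 12,
    "year": 1,
}

def to_annual_salary(amount: float, unit: str) -> float:
    """Normalize a reported salary to a yearly amount."""
    return amount * PERIODS_PER_YEAR[unit.lower()]
```

With every record on the same yearly basis, salaries across years and filing formats become directly comparable.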
NoSQL Data Storage with DynamoDB
When I was envisioning the use cases of Visatopia, I realized most calls to the database would be reads. There wouldn't be any writes unless I was updating the records myself.
Initially, I normalized the data into separate tables and removed as much duplicate information as possible. This is great for storing large datasets in a fraction of their original size. The problem was that because the data was spread out, any query for a company required joins across multiple tables every time, which can cause significant delays when a large number of clients query at once. One way to combat this is to add read replicas of the primary database: you can run as many read replicas as you want and load-balance across them, reducing stress on any single instance. I did not choose this option simply because maintaining multiple instances is too costly for my use case.
Then it became clear my goal was to reduce database joins, which led me to NoSQL. There are a couple of NoSQL choices, such as Firebase from Google or Atlas from MongoDB, but DynamoDB was the obvious pick since I was on the AWS stack anyway.
Note: I knew about some of DynamoDB's limitations going in, but one became a real blocker: the 400 KB limit on a single item. Some companies have thousands of records, far too many to fit in a single item, so I ended up breaking each company record into four smaller chunks.
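The post splits oversized companies into four chunks; a more general sketch is to pack records greedily until a chunk's serialized size approaches the limit. The byte budget below is deliberately under 400 KB to leave headroom for keys and attribute names, and the accounting slightly overestimates JSON size to stay on the safe side.

```python
import json

def chunk_records(records, max_bytes=350_000):
    """Greedily pack records into chunks whose JSON stays under max_bytes."""
    chunks, current, size = [], [], 2           # 2 bytes for the "[]" wrapper
    for rec in records:
        rec_size = len(json.dumps(rec).encode()) + 2   # +2 for ", " separator
        if current and size + rec_size > max_bytes:
            chunks.append(current)
            current, size = [], 2
        current.append(rec)
        size += rec_size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be written as its own DynamoDB item, e.g. keyed by company name plus a part number (the key scheme here is an assumption, not from the post).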
Search Auto Suggestions
Auto suggestion is a feature that is taken for granted these days. Google, Facebook, Amazon: everywhere you go, the feature is there. However, the complexity of implementing it yourself is sometimes not worth the effort.
At first, I tried to use an external service, Algolia (Search as a Service), to host the company names and let Algolia handle all the complex data storage and recommendation. However, the free tier only allows up to 10k records, far fewer than the 200k company names I needed to store, and the paid version, at $50 to $100 per month, was far too expensive.
Luckily, I found a simple approach using AWS ElastiCache. ElastiCache was built to hold short-term cached data so you avoid hitting the actual server repeatedly; it is not intended for search suggestions, but it worked quite well for me.
I created an ElastiCache node and stored all the company names in it, then wrote a Lambda to query the node and return the results. Note that if you are doing something similar, ElastiCache cannot be made public by nature, so the Lambda or EC2 instance connecting to it must be in the same VPC and subnet groups as the ElastiCache node.
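The lookup itself is just a lexicographic range scan over sorted names, which is the trick Redis exposes as ZRANGEBYLEX on a sorted set. Since a live ElastiCache node is needed for the real thing, here is a pure-Python sketch of the same idea, with the redis-py equivalent shown in a comment; the company names are illustrative.

```python
import bisect

# Sorted company names. In Redis this would be a sorted set with all
# scores set to 0, so members sort purely lexicographically.
COMPANIES = sorted(["Amazon", "Amdocs", "Apple", "Goldman Sachs", "Google"])

def suggest(prefix, limit=5):
    """Return up to `limit` names starting with `prefix` (case-sensitive)."""
    lo = bisect.bisect_left(COMPANIES, prefix)
    out = []
    for name in COMPANIES[lo:lo + limit]:
        if not name.startswith(prefix):
            break
        out.append(name)
    return out

# Against an ElastiCache Redis node with redis-py, the equivalent is roughly:
#   r.zadd("companies", {name: 0 for name in names})
#   r.zrangebylex("companies", f"[{prefix}", f"[{prefix}\xff", start=0, num=limit)
```

A production version would also normalize case (e.g. store lowercased names alongside display names) so queries are case-insensitive.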
AWS Chalice To Deploy in 20 Seconds
I am always talking about serverless, and AWS Lambda is a great example of why. The pricing on these is insane: the first 1 million calls are free, and the next 1 million cost $0.20. On top of that, a serverless framework will expedite your development time. Chalice is a framework by AWS built for Lambda; creating a function attached to an API Gateway takes about 20 seconds to code and deploy.
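For a sense of how little code that takes, here is a minimal Chalice app sketch. The app name and handler bodies are placeholders; the route paths mirror the ones used later in this post.

```python
# app.py: running `chalice deploy` on this provisions a Lambda plus an
# API Gateway endpoint. Handler bodies below are stubs, not the real logic.
from chalice import Chalice

app = Chalice(app_name="visatopia-api")

@app.route("/search")
def search():
    # Would query the ElastiCache node for name suggestions.
    params = app.current_request.query_params or {}
    return {"prefix": params.get("q", ""), "suggestions": []}

@app.route("/company")
def company():
    # Would fetch the company's record chunks from DynamoDB.
    return {"company": None}
```

After `pip install chalice`, a `chalice deploy` from the project directory is genuinely all it takes to get a public URL.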
Serve Static Files Over CloudFront CDN
The simplest way to serve static files is to host them in a public S3 bucket. However, public S3 objects are not secure and can cause issues if your webpage is served over HTTPS. CloudFront is a great solution to this. CloudFront is a Content Delivery Network (CDN): it caches your static files at edge locations all over the world, so customers everywhere load them much faster than they would from a single location. CloudFront also lets you attach a certificate to the endpoint so your static files are served over HTTPS.
Note: having files cached at edge locations comes at a cost. If you make a small update to your CSS file, users won't see the change, since CloudFront keeps returning the older cached version. You can invalidate a file and force CloudFront to fetch the updated version, but invalidations cost money. An alternative approach is to add versioning to the file name instead: index.css becomes index-1.0.0.css, and any future change is saved as index-1.0.1.css. The only catch is that the webpage must now keep track of which version to fetch.
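A common way to automate this versioning, as a variation on the manual 1.0.0-style scheme above, is to derive the suffix from the file's contents, so any change automatically produces a new name that CloudFront has never cached. This is my suggestion, not what the post describes:

```python
import hashlib

def versioned_name(filename: str, content: bytes) -> str:
    """Append a short content hash to a filename, e.g. index.css -> index-ab12cd34.css."""
    digest = hashlib.md5(content).hexdigest()[:8]
    stem, _, ext = filename.rpartition(".")
    return f"{stem}-{digest}.{ext}"
```

A build step would rename each asset this way and rewrite the HTML to reference the hashed names, removing the need to track versions by hand.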
Let’s Move onto the Cloud
Now let's connect all the different services. I uploaded all the company JSON data to DynamoDB with a Lambda and loaded an ElastiCache Redis node with all the company names. As noted, ElastiCache sits inside a VPC and subnet groups, so the Lambda also needs to be in the same VPC and subnet groups. All static files were hosted in S3 and served via CloudFront.
Generally speaking, you can't really go wrong with AWS. There are other options, such as Google Cloud Platform, but I have not had the chance to do any serious development on them just yet. With tools like Chalice, development becomes much simpler.
Route 53 For Custom Domain Names
Having multiple endpoints becomes a problem in the long run. CloudFront generates its own URL and each Lambda generates its own, and these URLs are lengthy and impossible to remember. Hardcoding them all over the place is a bad idea, since every future Lambda will add yet another URL.
All of these URLs can be masked using subdomain names. For Visatopia, I created a cdn.visatopia.fyi endpoint for CloudFront and api.visatopia.fyi for the Lambdas. This allows me to hardcode the endpoints without having to worry about any future changes. I also have multiple Lambdas behind the same api.visatopia.fyi endpoint, routed by path: /search calls the suggestion Lambda, while /company queries the company data.
Adding an Open API
After publishing the website, I started to wonder how I could make the data set more useful and accessible to more users. I decided to create an Open API for anyone who wants to query the data and build different websites on top of it. You can see more details on the API here.
I was able to create the API documentation page easily by exporting the API definition from the API Gateway console. For the Open API, I added authentication using API keys to keep unauthorized callers off the endpoint. API keys also allow throttling TPS and capping total calls per month per key.
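Calling an API-key-protected API Gateway endpoint then just means attaching the key as the x-api-key header. A standard-library sketch; the /company path and api.visatopia.fyi host appear earlier in this post, while the query parameter name and the key are placeholders:

```python
import urllib.parse
import urllib.request

def build_request(company: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated request for one company's visa records."""
    url = "https://api.visatopia.fyi/company?name=" + urllib.parse.quote(company)
    # API Gateway reads the key from the x-api-key header.
    return urllib.request.Request(url, headers={"x-api-key": api_key})

req = build_request("Goldman Sachs", "demo-key")
# urllib.request.urlopen(req) would perform the call (needs a real key).
```

Requests without a valid key get a 403 from API Gateway before they ever reach the Lambda, which is what makes the per-key throttling and quotas enforceable.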
In this post, I went over my development journey from an idea to a working website. I used AWS to host my data and serve webpages, and a serverless framework called Chalice to develop and deploy Lambda functions. After the website went public, I made an Open API so anyone can use the data I collected and cleaned.
The overall experience was really fun. Data cleaning took most of the time, but after that, I was up and running within a couple of days and could iterate continuously on all the feedback I received. With the USCIS data at the core of the project, decision making was quick and prioritizing my tasks was easy.