Creating a Work Visa Search Engine

The Tech Stack Behind Visatopia

Chris Oh
Feb 23 · 8 min read

Visatopia is a simple search engine for finding millions of past work visa records in the US.

Visatopia was made out of frustration with companies not sharing whether they could hire international candidates. Large companies usually can, but with small to mid-size companies it often isn’t clear. I found a couple of websites that share this kind of data, but they would just list rows of candidates’ salaries and application dates. The hundreds of rows were just that, raw data, missing any useful information.

Search for Data

Now with some ideas, the search was on. To create a useful website with meaningful information, I needed to find the same raw data those other websites used. It turns out this was the easiest part: USCIS publishes their work visa application history on their website!

H-1B History by year on USCIS

However, I soon realized that the data set was not uniform at all: the naming conventions differed depending on the year, and salaries were reported hourly, weekly, monthly, or yearly. Some time had to be spent just on data cleaning. This took me almost three months in total.
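
To give a feel for that cleaning work, here is a minimal sketch of the wage normalization step. The column values and unit names are assumptions for illustration; the real USCIS files use different headers depending on the year.

```python
# Hypothetical sketch of the wage normalization step.
# The unit strings below are placeholders; the real USCIS files
# spell these differently depending on the year.
HOURS_PER_YEAR = 40 * 52

UNIT_TO_YEARLY = {
    "hour": HOURS_PER_YEAR,
    "week": 52,
    "bi-weekly": 26,
    "month": 12,
    "year": 1,
}

def normalize_wage(rate: float, unit: str) -> float:
    """Convert a reported wage to a yearly figure."""
    multiplier = UNIT_TO_YEARLY.get(unit.strip().lower())
    if multiplier is None:
        raise ValueError(f"Unknown wage unit: {unit!r}")
    return rate * multiplier
```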

NoSQL Data Storage with DynamoDB

When I was envisioning the use cases of Visatopia, I realized most of the calls to the database would be read-intensive. There wouldn’t be any writing to the database unless I was updating the records myself.

Initially, I normalized the data into separate tables and removed as much duplicate information as possible. This is great for storing large datasets in a fraction of their original size. However, because the data was separated, any query for a company would require massive joins across multiple tables every time. That can cause major delays in read requests when a large number of clients are querying data. One way to combat this is to have read replicas of the master database: you can have as many read replicas as you want and load-balance across them, reducing stress on any particular database. I did not choose this option simply because maintaining multiple instances is too costly for my use case.

Then it became clear my goal was to reduce database joins, which led me to choose NoSQL. There are a couple of NoSQL database choices, such as Firebase from Google and Atlas from MongoDB, but it was obvious to use DynamoDB from Amazon since I was on the AWS stack anyway.

Note: I knew about some of the limitations of DynamoDB, but some became blockers. The main issue was the 400 KB limit on a single record. Some companies have thousands of data records, which would not fit in a single item. So I ended up breaking one company record into four smaller chunks.
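
Here is a rough sketch of that chunking idea, assuming a boto3 table with a company-name partition key and a numeric chunk sort key. The table name and key names are placeholders, not the actual Visatopia schema.

```python
import json
import boto3

# Sketch of splitting one oversized company record into smaller items.
# Table name, key names, and CHUNK_COUNT are assumptions.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("companies")  # hypothetical table name

CHUNK_COUNT = 4  # split one company record into four items

def put_company(name: str, records: list) -> None:
    """Write a company's visa records as several sub-400 KB items."""
    size = -(-len(records) // CHUNK_COUNT)  # ceiling division
    with table.batch_writer() as batch:
        for i in range(CHUNK_COUNT):
            chunk = records[i * size:(i + 1) * size]
            batch.put_item(Item={
                "company": name,   # partition key (assumed)
                "chunk": i,        # sort key (assumed)
                "records": json.dumps(chunk),
            })
```

Reading a company back is then a single-partition query that returns all four chunks at once, so there is still no join involved.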

Search Auto Suggestions

Auto suggestion is a feature that is taken for granted these days. Google, Facebook, Amazon: anywhere you go, the suggestion feature is there. However, the complexity of implementing it is sometimes not worth the effort.

At first, I tried to use an external service, Algolia (Search as a Service), to host the company names and let Algolia handle all the complex data storage and recommendation. However, the free tier only allows up to 10k records, far fewer than the 200k company names I needed to store. The paid version, at $50~$100 per month, was far too expensive.

Luckily, I found a simple approach on AWS using ElastiCache. ElastiCache is not intended for search suggestions, but it worked pretty well for me. ElastiCache was made to store short-term data for caching, to avoid hitting the actual server multiple times.

AWS ElastiCache

I created an ElastiCache node and stored all company names in it. Then I created a Lambda to query the ElastiCache node and return the results. Note: if you are doing something similar, ElastiCache cannot be made public by nature, so the Lambda or EC2 instance trying to connect must be in the same VPC and subnet groups as the ElastiCache node.
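
A minimal sketch of that Lambda’s lookup logic, assuming a Redis-backed ElastiCache node and the redis-py client; the sorted-set key and the environment variable are made up for the example. Storing every name with score 0 lets ZRANGEBYLEX do prefix matching.

```python
import os
import redis

# Prefix lookup against an ElastiCache Redis node.
# The key "companies" and REDIS_HOST are assumptions; the Lambda
# must run inside the same VPC/subnets as the node.
r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)

def load_names(names):
    """Store all company names with score 0 so ZRANGEBYLEX works."""
    r.zadd("companies", {name.lower(): 0 for name in names})

def suggest(prefix: str, limit: int = 10):
    """Return up to `limit` company names starting with `prefix`."""
    prefix = prefix.lower()
    return [
        name.decode()
        for name in r.zrangebylex(
            "companies", f"[{prefix}", f"[{prefix}\xff", 0, limit
        )
    ]
```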

AWS Chalice To Deploy in 20 Seconds

I am always talking about serverless, and AWS Lambdas are a great example of why. The pricing on these is insane: the first 1 million calls are free, and the next 1 million cost me $0.20. On top of that, a serverless framework will expedite your development time. Chalice is a framework by AWS built for Lambdas. Creating a function attached to an API Gateway takes about 20 seconds to code out and deploy.
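
As a taste of how little code that takes, here is roughly what a minimal Chalice app looks like; the app name and route are placeholders, not the real Visatopia handlers.

```python
from chalice import Chalice

app = Chalice(app_name="visatopia")  # hypothetical app name

@app.route("/company/{name}")
def get_company(name):
    # The real handler would query DynamoDB; stubbed out here.
    return {"company": name, "records": []}
```

Running `chalice deploy` then packages the function, creates the Lambda, and wires up an API Gateway endpoint in one step.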

I went the extra mile on this step: I decided to host all of my HTML on the Lambda, meaning the site is server-side rendered. The main reason I took this approach is Search Engine Optimization (SEO). Google’s indexing bots will not wait for JavaScript to load content on your page, meaning they would only index the empty page and move on. By serving server-side rendered webpages, indexing bots can store all the content they find.
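
In Chalice terms, serving rendered HTML is just a matter of returning a Response with the right content type; the page body below is a placeholder, not the real Visatopia template.

```python
from chalice import Chalice, Response

app = Chalice(app_name="visatopia")  # hypothetical app name

@app.route("/")
def index():
    # Render the full page on the server so crawlers see real content.
    html = "<html><body><h1>Visatopia</h1></body></html>"  # placeholder
    return Response(
        body=html,
        status_code=200,
        headers={"Content-Type": "text/html"},
    )
```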

Serve Static Files Over CloudFront CDN

However, I chose not to serve every single file from AWS Lambdas. If every page load requires 5~10 JavaScript and CSS files, I would consume the 1 million calls five times faster. There are also multiple files you need to create on top of the CSS and JS files, such as sitemap.xml and robots.txt.

The simplest way to serve static files is to host them in an S3 bucket. However, public S3 objects are not secure and may cause issues if your webpage is served over HTTPS. CloudFront is a great solution to this. CloudFront is a Content Delivery Network (CDN): it caches your static files at edge locations all over the world, which lets global customers load them much faster than a single location serving your files could. CloudFront also lets you attach a certificate to the endpoint, so your static files are served over HTTPS.

Note: having files cached at edge locations comes at a cost. If you make a small update to your CSS file, users won’t see the change, since CloudFront keeps returning the older version. You can invalidate a file and force CloudFront to fetch the updated version; however, this will cost you. An alternative approach is, instead of invalidating the outdated file, to add versioning to the file name. For example, a stylesheet could be named style.v1.css, with future changes saved to style.v2.css; the only problem is that the webpage must now keep track of which version to fetch.
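
For reference, an invalidation is a single boto3 call; the distribution ID and path below are placeholders.

```python
import time
import boto3

# Sketch of invalidating an updated file so CloudFront re-fetches it.
cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="E1234567890",  # hypothetical distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/css/*"]},
        "CallerReference": str(time.time()),  # must be unique per call
    },
)
```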

Let’s Move onto the Cloud

Now let’s connect all the different services. I uploaded all the company JSON data to DynamoDB with a Lambda. An ElastiCache Redis node holds all the company names; as noted, ElastiCache sits inside a VPC and subnet groups, so the Lambda also needs to be in the same VPC and subnet groups. All static files were hosted in S3 and served via CloudFront.

Generally speaking, you can’t really go wrong with AWS. There are other options such as Google Cloud Platform, but I have not had the chance to do any serious development on them just yet. With tools like Chalice, development becomes much simpler.

Route 53 For Custom Domain Names

Having multiple endpoints becomes a problem in the long run. CloudFront generates its own URL and each Lambda generates its own URL. These URLs are very lengthy and impossible to remember, and hardcoding them all over the place is a bad idea, since any future Lambda will have its own URL as well.

All these generated URLs can be masked using subdomain names. For Visatopia, I created one subdomain for CloudFront and another for the Lambdas. This lets me hardcode the endpoints without having to worry about any future changes.
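
If your DNS lives in Route 53, masking a generated URL is one record change. Here is a sketch using boto3; the hosted zone ID, subdomain, and CloudFront domain are all made up for the example.

```python
import boto3

# Point a subdomain at a generated CloudFront URL via Route 53.
route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1234567890",  # hypothetical hosted zone ID
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "cdn.example.com",  # hypothetical subdomain
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [
                    {"Value": "d111111abcdef8.cloudfront.net"},
                ],
            },
        }],
    },
)
```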

I also have multiple Lambdas behind the same endpoint, routed depending on the path: one path calls the suggestion Lambda, while another queries the company data.

Adding an Open API

After publishing the website, I started to wonder how I could make the data set more useful and accessible to more users. I decided to create an Open API for anyone who wants to query the data and build different websites on top of it. You can see more details on the API here.

I was able to create the API documentation page easily by exporting the API definition from the API Gateway page. For the Open API, I added authentication using API keys to prevent just anyone from calling the endpoint. API keys also allow throttling TPS and limiting total calls per month per key.
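
Setting that up with boto3 looks roughly like this; the plan name, limits, and API/stage IDs are placeholders.

```python
import boto3

# Wire an API key to a usage plan with throttling and a monthly quota.
apigw = boto3.client("apigateway")

plan = apigw.create_usage_plan(
    name="open-api-free",  # hypothetical plan name
    throttle={"rateLimit": 10.0, "burstLimit": 20},
    quota={"limit": 10000, "period": "MONTH"},
    apiStages=[{"apiId": "abc123", "stage": "api"}],  # placeholders
)

key = apigw.create_api_key(name="example-user", enabled=True)

apigw.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId=key["id"],
    keyType="API_KEY",
)
```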

Conclusion

In this post, I went over my development journey from an idea to a working website. I used AWS to host my data and serve webpages. I used a serverless framework called Chalice to develop and deploy Lambda functions. After the website went public, I made an Open API so anyone can use the data I collected and cleaned.

The overall experience was really fun. Data cleaning took most of the time, but after that, I was up and running within a couple of days, and I was able to iterate continuously on all the feedback I received. With the USCIS data at the core of the project, decision making was rather quick and I was able to prioritize my tasks.
