It can work (easily!): Serverless React and web crawlers

(Want to get straight to the point? Tired of reading a bunch of words before seeing a solution? I get it. Jump to the end: My Step-by-Step Guide to React and Crawlers.

If you have a little more time, though, read on.)

When I started building www.helloclue.com, I decided to use React. Compared with other JavaScript frontend frameworks I’d used before, it seemed like it might actually have some longevity, and lack of longevity is my biggest pain point with JS frameworks.

After an initial learning curve, I hit the ground running with React, quickly building UI components, our Contentful client, and the integrations between them. For the first version of our site, we are using a service, Contentful, as our CMS backend, which meant I could start building the site serverless.

It got me up and running so much faster. Perfect! Right?

I asked the question, so you know the answer is no.

When it came time to get the site up and deployed, it seemed incredibly easy, until it didn’t. I quickly ran into what seems like a super common problem site creators run into when using dynamic JS frameworks like React for Single Page Applications (SPAs). Crawlers just don’t know how to act, or at least they don’t act how you’d want them to when they get to your pages.

Pages are generated dynamically by injected JS. This works perfectly for users, but often crawlers (Twitter, Facebook, etc.) don’t execute JS — and even when they do (shoutout to Google), they often return long before the full page’s data has loaded, resulting in your blank page frame being scraped and stored.

Well, that’s far from what we want.

Why does this matter so much?

Issues with crawlers are already a fairly clear problem, but with the big vision we have for Clue and www.helloclue.com, it is even more problematic. One of our biggest goals and inspirations for this project is to spread the reach of our educational content much further throughout the world, to users and non-users alike. We believe we have valuable content to share with the world. Content to help people become more informed and empowered about their health.

People are using “Dr. Google” every day to get a first, second, third, etc. opinion on concerns around sexual, reproductive, and menstrual health. The internet is awash with answers, but what about good quality, research-backed answers? We have that content — we just need to get it to those searchers. We can’t do that, though, if we can’t be found, and we can’t be found if crawlers only find our blank frame or a subset of our content, and not the full page.

So, hopefully, you can see, we need Google/Twitter/Bing/Facebook/Baidu/YouNameIt to find us and bring us to the much wider world, which needs good quality information. The internet can help us spread high-quality health information, no matter how hard or taboo it may be to get such information in your own community.

So, back to the tech. Trial and Error Edition.

It took quite a bit of experimentation, but I eventually found a process that works for our needs, unblocked me/us, and got us showing up in search results and social sharing widgets like pros!

Here are some of the trials (and errors) I had while trying to get this to work.

Stage 1: SPA Upload with S3 Hosting

This was soooo simple and looked amazing at first. Just build the React project and push up the build/ directory to an S3 bucket. Host the site there. I built a deployment script to handle that and boom, we were up and running.
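The deployment script itself isn't shown here, so what follows is only a minimal sketch of what such a script might look like, assuming the AWS CLI is installed and configured with access to your bucket. The function name deploy_to_s3 and the bucket name you pass in are hypothetical, not taken from the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical deploy helper (not the author's actual script):
# build the React app and mirror build/ to an S3 bucket.
# Assumes the AWS CLI is installed and has credentials for the bucket.
deploy_to_s3() {
  local bucket="${1:-}"   # placeholder bucket name, e.g. my-example-bucket
  if [ -z "$bucket" ]; then
    echo "usage: deploy_to_s3 BUCKET_NAME" >&2
    return 1
  fi
  # Produce the production bundle in build/.
  npm run build || return 1
  # Upload new or changed files; --delete removes objects no longer in build/.
  aws s3 sync build/ "s3://${bucket}" --delete
}
```

Calling deploy_to_s3 my-example-bucket from the project root would rebuild and sync in one go. The --delete flag keeps stale bundles from piling up in the bucket, so leave it out if you host anything else there.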

Visitors could visit and navigate the site like a charm. Crawlers, not so much. Crawlers without JavaScript or without “patience” saw absolutely nothing, just the index.html markup.

Real People: 💯 Crawlers: ❌👀

Stage 2: Snapshot it all, then upload to S3

Snapshotting seemed like the next best option. I could build my SPA, then run my own snapshotting crawler through the site, generate snapshotted pages and serve these pages to users and crawlers alike when they visit. Sounded great. Users would get quick loading pages, and crawlers would see pages as expected since I would have pre-rendered/snapshotted pages before uploading to S3.

Spoiler alert, this didn’t work for me.

Of all the libraries I tested, react-snap was the most promising, so I’ll just talk about that one for now. It crawled my links like a pro and saved the pages in a convenient file structure. For a page like https://helloclue.com/articles/cycle-a-z/gynecological-pelvic-exams-and-pap-tests-101, react-snap would save the snapshot in build/articles/cycle-a-z/gynecological-pelvic-exams-and-pap-tests-101.

The main problem was that react-snap generates snapshots in one viewport, which you specify in the config parameters. We have a mobile-responsive site, and much of the mobile responsiveness comes from using react-responsive. react-responsive is a very convenient library, which allows you to wrap your React components directly in MediaQuery components, rather than only applying media queries through CSS.

This meant, however, that react-snap snapshots would not trigger any of the breakpoints added with MediaQuery. So, things would look great if you were using the site in whatever version (mobile/desktop) the react-snap snaps had triggered, but as soon as the screen became too small or too large (depending on the viewport config value) things got very wonky, very quickly. It was trying to serve the mobile header in desktop mode, applying desktop styling, and it didn’t work.

Real Users: ❌😕 Crawlers: 💯

Stage 3: Other options. So many other options.

I was really clinging as hard as I could to having the tiniest/most non-existent backend server. I didn’t want to have to set up a Node backend for React server-side rendering. All of this just to get SEO/crawlers to work? There must be something better!

Me: ❌😩

Stage 4: By George, I Think I’ve Got It! — EC2 and Prerender.io

After a long time seeking a simple and effective solution, I finally came across an article about static site generation for crawlers using prerender.io and AWS. I’d seen Prerender before during my searches, but there were so many free one-install libraries (like react-snap), I wanted to try them first. After exhausting those options and finally seeing a useful explanation of setting up Prerender, I decided to go for it. Surprise, it did the job.

Prerender.io is a paid service (though they also have an open-source server you can set up and use, yay) which serves cached, pre-rendered pages to crawlers. So when Google wants one of your pages, the crawler’s request gets routed to prerender.io, not directly to your minified, bundled content. If Prerender.io already has a cached version of the page, it sends that; otherwise, it loads your page with JS enabled, waits long enough for it to render fully, caches the result, and sends that newly rendered version to Google.

Real Users: 💯 Crawlers: 💯

The setup wasn’t awful, but it took a while to find a helpful article and to piece together a series of articles to get me through stumbling blocks. Hopefully with my guide, you’ll be able to leap past many (or all) of those blocks and get your site up and running much faster.

So, without further ado.

My Step-by-Step Guide to React and Crawlers

Set Up Prerender

1. Create a Prerender.io Account ¹

2. Get your Prerender install token.

We’ll be using the NGINX setup, so get that token/line from the Install Token page. It will look something like: proxy_set_header X-Prerender-Token YOUR_TOKEN_HERE;

Set Up EC2

3. Create an EC2 instance

Pick the appropriate AMI (Amazon Machine Image) and the smallest instance size that works for you. I used the 64-bit Amazon Linux AMI on a t2.micro to start with.

4. Launch your instance

Adjust your security group from the default to allow HTTP traffic on port 80 and HTTPS traffic (if you use it) on port 443, so that your site will be publicly accessible.

Ensure your instance is running and note its public DNS, e.g. http://ec2-11-222-3-444.eu-west-1.compute.amazonaws.com. This is where the site will be reachable once nginx is serving your build in the steps below.
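If you'd rather script the security group change than click through the console, here is a rough sketch using the AWS CLI. It assumes the CLI is configured for your account and that you know your security group's ID. The function name open_web_ports is made up for illustration.

```shell
# Hypothetical helper: allow inbound HTTP (80) and HTTPS (443) from anywhere
# on an existing security group. Assumes the AWS CLI is configured.
open_web_ports() {
  local group_id="${1:-}"   # your own security group ID
  local port
  if [ -z "$group_id" ]; then
    echo "usage: open_web_ports SECURITY_GROUP_ID" >&2
    return 1
  fi
  for port in 80 443; do
    # One ingress rule per port, open to all IPv4 addresses.
    aws ec2 authorize-security-group-ingress \
      --group-id "$group_id" \
      --protocol tcp \
      --port "$port" \
      --cidr 0.0.0.0/0 || return 1
  done
}
```

Drop 443 from the loop if you aren't serving HTTPS. Keeping the open ports to the minimum you need is the safer default.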

5. SSH into your instance

You’ll get a .pem file when setting up your instance. You’ll need this to SSH into your new EC2 instance.

ssh -i your_key.pem ec2-user@<YOUR_EC2_PUBLIC_DNS>

⚠️ Ensure that your .pem file is secure or you’ll get this warning: WARNING: UNPROTECTED PRIVATE KEY FILE! To address this, restrict the file so that only you can read it: chmod 400 your_key.pem

6. Set up necessary environment on your instance

For this guide, the setup is simple. You’ll just need nginx installed:

sudo yum install nginx

Upload Your Build Files

7. Create a /var/www/html directory in your instance. This is where we’ll put our build files.

Ensure your ec2-user has write access to this folder, so you can copy files from your local environment with that user, which you’ll do in the next step.

sudo mkdir -p /var/www/html
sudo chown -R ec2-user /var/www/html
chmod -R 755 /var/www/html

8. Upload your build/ files to your new EC2 instance

Run these from your project directory on your local machine (the target directory already exists from step 7):

npm run build
scp -rp -i your_key.pem build/* ec2-user@<YOUR_EC2_PUBLIC_DNS>:/var/www/html

Configure Your NGINX Server

9. Set up your NGINX server to route crawler requests to prerender.io and all other requests to your build at /var/www/html.

I made a gist with an nginx.conf file you should be able to use out of the box. Replace the YOUR_SITE_GOES_HERE.com and YOUR_TOKEN_GOES_HERE values in the file and you should be good to go. Replace what you have at /etc/nginx/nginx.conf with this file.
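If you prefer not to edit the file by hand, the two replacements can be scripted. This is just a sketch that assumes you've saved the gist locally and that the placeholders appear exactly as YOUR_SITE_GOES_HERE.com and YOUR_TOKEN_GOES_HERE. The function name fill_nginx_conf is hypothetical.

```shell
# Hypothetical helper: print the gist's nginx.conf with both placeholders filled in.
fill_nginx_conf() {
  local template="${1:-}" site="${2:-}" token="${3:-}"
  if [ ! -f "$template" ] || [ -z "$site" ] || [ -z "$token" ]; then
    echo "usage: fill_nginx_conf TEMPLATE_FILE SITE_DOMAIN PRERENDER_TOKEN" >&2
    return 1
  fi
  # '|' is the sed delimiter; this assumes neither value contains '|', '&' or a backslash.
  sed -e "s|YOUR_SITE_GOES_HERE\.com|${site}|g" \
      -e "s|YOUR_TOKEN_GOES_HERE|${token}|g" \
      "$template"
}
```

On the instance, you could pipe the output through sudo tee /etc/nginx/nginx.conf to put it in place. The function writes to standard output rather than overwriting the file, so you can eyeball the result first.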

Just FYI, it’s the following lines of the gist that do the bulk of the routing for prerender.io.

if ($http_user_agent ~* "googlebot|baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator") {
    set $prerender 1;
}
...
if ($prerender = 1) {
    #setting prerender as a variable forces DNS resolution since nginx caches IPs and doesnt play well with load balancing
    set $prerender "service.prerender.io";
    rewrite .* /$scheme://$host$request_uri? break;
    proxy_pass http://$prerender;
}
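To get a feel for which visitors that first condition catches, here is a rough shell equivalent of the user-agent match. It reuses the same pattern, case-insensitively, the way nginx's ~* operator does. It only mirrors the user-agent check shown above, not whatever else the full gist does in the elided part, and the names CRAWLER_PATTERN and is_crawler are invented for this sketch.

```shell
# Same alternation as the nginx snippet; matched case-insensitively below.
CRAWLER_PATTERN='googlebot|baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|vkShare|W3C_Validator'

# Returns 0 if the given User-Agent string matches the crawler pattern, 1 otherwise.
is_crawler() {
  printf '%s\n' "${1:-}" | grep -Eiq "$CRAWLER_PATTERN"
}
```

For example, is_crawler "Twitterbot/1.0" succeeds, while a regular Safari or Chrome user-agent string does not. If a crawler you care about is missing from the list, the pattern in nginx.conf is the place to add it.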

10. Restart/start your nginx server to use your new configurations.

sudo service nginx restart
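One optional safeguard: nginx can test a configuration file without applying it (nginx -t), so a typo in the new nginx.conf doesn't take the server down on restart. A small sketch follows. The wrapper name restart_nginx_safely is hypothetical.

```shell
# Hypothetical wrapper: only restart nginx if the new configuration parses cleanly.
restart_nginx_safely() {
  # nginx -t checks the configuration for syntax errors without applying it.
  if ! sudo nginx -t; then
    echo "nginx config test failed; not restarting" >&2
    return 1
  fi
  sudo service nginx restart
}
```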

Test Your Integration

11. Test that the integration with prerender.io works.

What Twitter will see:

curl -A Twitterbot http://<YOUR_EC2_PUBLIC_DNS>/<SPA_URL>

What your users will see:

curl http://<YOUR_EC2_PUBLIC_DNS>/<SPA_URL>

Confirm that both responses look like what you expect. If they do, your integration is done!!²
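If you'd like to automate that comparison, something along these lines could work. It fetches the page once as Twitterbot and once as a regular client, then looks for a piece of text you know should appear on the fully rendered page (an article title, say). The function name check_prerender and the choice of text to look for are assumptions for this sketch.

```shell
# Hypothetical smoke test: compare what a crawler and a regular client receive.
# Arguments: the page URL and some text expected on the fully rendered page.
check_prerender() {
  local url="${1:-}" expected="${2:-}"
  local crawler_html user_html
  if [ -z "$url" ] || [ -z "$expected" ]; then
    echo "usage: check_prerender URL EXPECTED_TEXT" >&2
    return 2
  fi
  crawler_html="$(curl -s -A Twitterbot "$url")" || return 2
  user_html="$(curl -s "$url")" || return 2

  # Regular clients get the unrendered SPA shell, so the text is usually absent here.
  if printf '%s' "$user_html" | grep -Fq -- "$expected"; then
    echo "user response: contains expected text"
  else
    echo "user response: SPA shell only (normal for a client-rendered app)"
  fi

  # Crawlers should receive the fully rendered page.
  if printf '%s' "$crawler_html" | grep -Fq -- "$expected"; then
    echo "crawler response: contains expected text"
  else
    echo "crawler response: expected text missing" >&2
    return 1
  fi
}
```

A non-zero exit status means the crawler response didn't contain the text, which is the case worth investigating. The first request for a page can be slow while it is rendered and cached, so a retry may be needed.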

Hopefully you can avoid the time I spent arriving at this solution. Feel free to send thoughts and feedback in the comments or to @omosolatweets. Enjoy!


¹ We don’t have any type of affiliate deal with Prerender.io. It just worked well for me and I’m sharing it with you. :)

² If you’re using a CDN, make sure to forward User-Agent headers to your EC2 instance and forward any query string that your app needs to use for rendering. This is necessary so your nginx server can differentiate requests based on the User-Agent, so it knows to send crawlers to the prerender.io server. If you forget this bit, your CDN will consider all traffic the same, regardless of the User-Agent, and send all the traffic the same cached response.