Empowering Your Scraping Tools: A Guide to Deploying Proxies with Terraform on DigitalOcean

Esteve Segura
5 min read · Oct 30, 2023

In the realm of web scraping, encountering IP blocks is a common and often frustrating hurdle. These blocks can severely impede the flow and efficiency of data extraction operations. Enter Terraform: a robust Infrastructure as Code (IaC) tool that streamlines the provisioning of infrastructure, effectively enabling users to “spin up” servers on-demand.

By leveraging this capability, scrapers can dynamically allocate new IP addresses for their tasks, ensuring uninterrupted data gathering. What’s more enticing is the cost-effectiveness of this solution; with platforms like DigitalOcean, one can deploy these servers, known as ‘Droplets’, for just a few dollars, and in some cases, even for less than a dollar.

Terraform can be thought of as “purchasing servers from our provider’s graphical interface, but through code.” It lets users define and provision infrastructure (VPS/Droplets) using a declarative configuration language, and it supports virtually every major infrastructure provider (AWS, GCP, Azure, DigitalOcean, and many more).

Flow that we are going to replicate in code

Rapid Proxy Deployment Using Terraform on DigitalOcean

In this section, we’ll delve into one of the most efficient methods for launching multiple proxy servers using Terraform. Here’s a brief walkthrough of the provided code:

# Specifies the required providers and their versions.
# In our case, it's DigitalOcean.
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# Defines a variable for the DigitalOcean API token.
variable "DIGITALOCEAN_TOKEN" {
  description = "DigitalOcean API Token"
}

# Configures the DigitalOcean provider with the provided API token.
provider "digitalocean" {
  token = var.DIGITALOCEAN_TOKEN
}

# Defines the resource for the DigitalOcean droplet named "proxy".
# Two droplets will be created based on the 'count' attribute.
resource "digitalocean_droplet" "proxy" {
  count  = 2
  image  = "ubuntu-20-04-x64"     # Specifies the image to be used.
  name   = "proxy-${count.index}" # Names the droplets dynamically.
  region = "ams3"                 # Designates the region of deployment.
  size   = "s-1vcpu-1gb"          # Determines the size of the droplet.

  # User data bootstraps the droplet: it installs Node.js and pm2,
  # then clones the proxy from the gist and keeps it running.
  user_data = <<-EOF
    #!/bin/bash
    # Refresh package lists so the installs below don't hit stale indexes.
    sudo apt update
    sudo apt install nodejs -y
    sudo apt install npm -y
    sudo npm cache clean -f
    sudo npm install -g n
    sudo n stable
    npm install -g pm2
    cd /home && mkdir proxy && cd proxy
    git clone https://gist.github.com/EsteveSegura/1bfdc6ea1762a00e3baa5b5487370466
    cd /home/proxy/1bfdc6ea1762a00e3baa5b5487370466
    pm2 start proxy.js
    pm2 startup systemd
    pm2 save
  EOF
}

# After the droplets are created,
# this resource writes their IP addresses
# to a file named 'ip_addresses.txt'.
resource "null_resource" "output_ips_to_file" {
  provisioner "local-exec" {
    command = "echo '${join("\n", digitalocean_droplet.proxy[*].ipv4_address)}' > ip_addresses.txt"
  }
}

Explanation

  1. Provider Configuration: The code starts by declaring the required provider, DigitalOcean, and its version.
  2. API Token: A variable DIGITALOCEAN_TOKEN is declared. This token is required to authenticate against the DigitalOcean API.
  3. Droplet Resource: The digitalocean_droplet resource describes the proxy servers (droplets) to be created. In this case, two droplets are created from the Ubuntu 20.04 image in the AMS3 region.
  4. User Data Scripting: A key piece of this configuration is the user_data section. It is a bootstrap script that runs when the droplet first boots: it installs Node.js, npm, and pm2, then clones and starts the proxy. A rough sketch of what such a proxy could look like follows this list.
  5. Output IPs to File: After creating the droplets, we need their IP addresses for the scraping tasks. The configuration writes them to a file named ‘ip_addresses.txt’.
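
The actual proxy.js comes from the gist cloned in the user_data script, so it isn’t reproduced here. To make the architecture concrete, here is a minimal sketch of the kind of forward proxy that script could be; the port 8080 and the lack of authentication are illustrative assumptions, not a description of the gist:

// proxy.js — a minimal sketch of a forward proxy, NOT the actual code from the gist.
// Assumes it listens on port 8080 (matching the URLs used later) and has no auth.
const http = require('http');
const net = require('net');

const PORT = 8080;

// Plain HTTP requests arrive with an absolute URL; forward them to the target host.
const server = http.createServer((clientReq, clientRes) => {
  const target = new URL(clientReq.url);
  const proxyReq = http.request(
    {
      host: target.hostname,
      port: target.port || 80,
      path: target.pathname + target.search,
      method: clientReq.method,
      headers: clientReq.headers,
    },
    (proxyRes) => {
      clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
      proxyRes.pipe(clientRes);
    }
  );
  proxyReq.on('error', () => clientRes.end());
  clientReq.pipe(proxyReq);
});

// HTTPS requests arrive as CONNECT; open a raw TCP tunnel to the target.
server.on('connect', (req, clientSocket, head) => {
  const [host, port] = req.url.split(':');
  const serverSocket = net.connect(Number(port) || 443, host, () => {
    clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
    serverSocket.write(head);
    serverSocket.pipe(clientSocket);
    clientSocket.pipe(serverSocket);
  });
  serverSocket.on('error', () => clientSocket.end());
  clientSocket.on('error', () => serverSocket.end());
});

server.listen(PORT);

In a real deployment you would also want to restrict access to the proxy (for example with a firewall rule or basic authentication), since an open proxy on a public IP tends to be found and abused quickly.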

To provision the proxies, initialize the working directory and then apply the configuration from your terminal (Terraform will prompt for the DIGITALOCEAN_TOKEN variable, since it has no default value):

terraform init
terraform apply

Implementing Scraping with Our Proxied IP Addresses

With our proxy servers set up and their IP addresses stored in ip_addresses.txt, we can seamlessly integrate these proxies into our scraping workflow. This approach ensures our scraping activities leverage different IP addresses, minimizing the risk of blocks or restrictions. Here's how to do this:

import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';

// Replace the placeholders with the actual
// proxy URLs derived from the `ip_addresses.txt`.
const proxies = [
  'http://PROXY_1_URL:8080',
  'http://PROXY_2_URL:8080',
];

// An asynchronous function to fetch HTML content.
async function fetchHtml({ url, proxyIndex }) {
  try {
    // Retrieves the proxy URL based on the provided index.
    const proxyUrl = proxies[proxyIndex];
    // Creates a new HTTPS proxy agent using the proxy URL.
    const agent = new HttpsProxyAgent(proxyUrl);
    // Makes a GET request using Axios with the proxy agent.
    const response = await axios.get(url, { httpsAgent: agent });

    // Outputs the fetched HTML data to the console.
    console.log(response.data);
  } catch (error) {
    // If an error occurs, it's printed to the console.
    console.error(error.message);
  }
}

// Fetches the HTML content of two different URLs,
// each using a different proxy.
fetchHtml({ url: 'https://ifconfig.co/', proxyIndex: 0 });
fetchHtml({ url: 'https://whatismyipaddress.com/', proxyIndex: 1 });

It’s important to note that the provided script is a succinct example meant to illustrate the core concept. The full implementation, including error handling, optimisation, and additional features, is left to the reader’s discretion. Moreover, while our demonstration uses JavaScript, the proxies themselves are language-agnostic. They can be effortlessly integrated into web scraping tasks written in any programming language of your choice.
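
As a small extension, the proxy list could also be built at runtime from the ip_addresses.txt file that Terraform writes, instead of being hard-coded, and rotated round-robin between requests. A minimal sketch, assuming the file sits next to the script, contains one IP per line, and the proxies listen on port 8080:

import { readFileSync } from 'node:fs';

// Build the proxy list from the file written by the 'local-exec' provisioner.
// Assumes ip_addresses.txt holds one droplet IP per line.
const proxies = readFileSync('ip_addresses.txt', 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean)
  .map((ip) => `http://${ip}:8080`);

// Hand out proxies in round-robin order so consecutive
// requests leave from different droplets.
let cursor = 0;
function nextProxy() {
  const proxy = proxies[cursor % proxies.length];
  cursor += 1;
  return proxy;
}

// Example: pass nextProxy() to HttpsProxyAgent instead of indexing proxies manually.
console.log(nextProxy());
console.log(nextProxy());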

⚠️ Warning: Cost Implications with Terraform ⚠️

When using Terraform, it’s crucial to be aware of the potential financial implications. Actions executed through Terraform directly impact the credit card linked to your cloud provider. Therefore, it’s essential to monitor your configurations closely.

Different providers have varying billing practices. Some charge based on the hours consumed, while others have fixed rates regardless of usage. Always be vigilant, understand your provider’s pricing model, and ensure you’re not inadvertently racking up unexpected charges.

When you are done using the proxies, remember to destroy the droplets to avoid incurring further costs. Run this in your terminal:

terraform destroy

Conclusion

With Terraform, web scraping becomes more reliable and scalable: it simplifies the management of proxies and keeps data harvesting cost-effective and efficient.


Written by Esteve Segura

Hi there! I'm Esteve Segura, Software Engineer based in Barcelona. I work at Voicemod as Tech Lead. I love swimming and cycling.
