AWS serverless scraping: randomising your IP

Pablo Voorvaart
Sep 1 · 3 min read

Scraping is all about tricking the website you want to scrape into thinking you are not a robot but a human: handling cookies, rotating your user agent periodically, adding random delays between requests… all fun stuff that is necessary if you want to scrape the web. But nowadays, most sites with data worth scraping run some kind of bot-protection tool that will spot that you are a robot within the first ten requests or so, so we have to resort to the holy grail of scraping tricks: IP switching.
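Before getting to the AWS part, here is a minimal sketch of two of those basics, random delays and user-agent rotation; the URL and the user-agent strings are placeholders, not taken from the original post:

import random
import time

import requests

# A small illustrative pool of user agents; in practice you would keep a
# longer, up-to-date list.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_get(url):
    # Wait a random amount of time so requests do not arrive at a fixed rhythm
    time.sleep(random.uniform(1.0, 5.0))
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = polite_get('https://example.com')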

I love AWS, especially serverless architectures, and I think serverless is a great way to scrape without incurring the cost and hassle of managing a server. If you don't know how to set up a scraper with AWS serverless, you can check out the following tutorial:

So now that you know how to scrape serverless, let's get started with IP switching.

The code:

import boto3
import os, json
import time

client = boto3.client('sqs')        # SQS client (not used in this function)
natClient = boto3.client('ec2')
lambdaClient = boto3.client('lambda')


def reset(event, context):
    # Fill these in with the IDs of your VPC, private subnet and route table
    vpcId = ""
    subnetId = ""
    routetableId = ""

    # Block the scraping Lambda functions while we swap the gateway
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperEvent--dev',
        ReservedConcurrentExecutions=0
    )
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperContest--dev',
        ReservedConcurrentExecutions=0
    )
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperResult--dev',
        ReservedConcurrentExecutions=0
    )

    # Delete the current NAT gateway and release its Elastic IP
    nat_gateways = natClient.describe_nat_gateways(
        Filters=[{
            'Name': 'vpc-id',
            'Values': [vpcId]
        }]
    )
    for gateway in nat_gateways['NatGateways']:
        try:
            natClient.delete_nat_gateway(
                NatGatewayId=gateway['NatGatewayId']
            )
            natClient.release_address(
                AllocationId=gateway['NatGatewayAddresses'][0]['AllocationId']
            )
        except Exception:
            # Gateways already being deleted will raise; ignore them
            pass

    # Allocate a fresh Elastic IP and create a new NAT gateway with it
    # (Elastic IPs for use in a VPC require Domain='vpc')
    eip = natClient.allocate_address(
        Domain='vpc'
    )
    natGateway = natClient.create_nat_gateway(
        AllocationId=eip['AllocationId'],
        SubnetId=subnetId
    )

    # Poll until the new gateway is available
    response = natClient.describe_nat_gateways(
        Filters=[{
            'Name': 'nat-gateway-id',
            'Values': [natGateway['NatGateway']['NatGatewayId']]
        }]
    )
    while response['NatGateways'][0]['State'] != 'available':
        time.sleep(10)
        response = natClient.describe_nat_gateways(
            Filters=[{
                'Name': 'nat-gateway-id',
                'Values': [natGateway['NatGateway']['NatGatewayId']]
            }]
        )

    # Point the route table's default route at the new gateway
    natClient.delete_route(
        DestinationCidrBlock='0.0.0.0/0',
        RouteTableId=routetableId
    )
    response = natClient.create_route(
        DestinationCidrBlock='0.0.0.0/0',
        NatGatewayId=natGateway['NatGateway']['NatGatewayId'],
        RouteTableId=routetableId
    )

    # Re-enable the scraping Lambda functions
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperEvent--dev',
        ReservedConcurrentExecutions=1
    )
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperContest--dev',
        ReservedConcurrentExecutions=1
    )
    response = lambdaClient.put_function_concurrency(
        FunctionName='feiScraperResult--dev',
        ReservedConcurrentExecutions=1
    )

The explanation:

The idea behind this is to use AWS Lambda's ability to run inside a VPC to our advantage. The functions reach the internet through a NAT gateway, and every time we detect that our scraper has been blocked, we delete the current NAT gateway and create a new one, which comes with a new Elastic IP. If you don't know how to use a NAT gateway with Lambda, read the following article:

As you can see in the code above, I'm using three different Lambda functions to scrape data. One of them controls the web requests and will invoke the IP-switching function once it detects that we are blocked.
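The post doesn't include the controller's detection code, but the idea could be sketched like this; the status-code check and the feiScraperReset--dev function name are assumptions for illustration, not from the original:

import boto3
import requests

lambdaClient = boto3.client('lambda')

def fetch(url):
    response = requests.get(url)
    # Treat 403/429 as a sign we've been blocked; the exact signal
    # (status code, captcha page, etc.) depends on the target site.
    if response.status_code in (403, 429):
        # Fire the IP-switching function asynchronously.
        # 'feiScraperReset--dev' is a hypothetical name for the
        # reset function shown above.
        lambdaClient.invoke(
            FunctionName='feiScraperReset--dev',
            InvocationType='Event',
        )
        return None
    return response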

So the first thing we do after this function is invoked is set the reserved concurrency of all scraping functions to zero, so that they can't be invoked while we are resetting the NAT gateway. Then we delete the current NAT gateway, release its assigned Elastic IP address, and create a new one. Finally, we poll via the natClient.describe_nat_gateways method to check whether the new gateway has become available, and once it has, we update the route table of the VPC we are using.
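As an aside, if you'd rather not write the polling loop by hand, boto3 ships a built-in waiter for NAT gateways that does the same describe-and-sleep dance; a minimal sketch, with a placeholder gateway ID:

import boto3

natClient = boto3.client('ec2')

# Blocks until the gateway reaches the 'available' state
# (by default it checks every 15 seconds, up to 40 attempts).
waiter = natClient.get_waiter('nat_gateway_available')
waiter.wait(NatGatewayIds=['nat-0123456789abcdef0'])  # placeholder ID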

Conclusion:

Even though this method is a bit rudimentary, it is an effective way to avoid getting constantly banned by bot scanners.

Also, don't forget to set the timeout of the reset function to more than one minute and thirty seconds, otherwise it will time out while waiting for the new NAT gateway to become available.
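If you manage the function with boto3 rather than through your deployment framework, the timeout can be bumped like this; feiScraperReset--dev is again a hypothetical name for the reset function:

import boto3

lambdaClient = boto3.client('lambda')

# Timeout is in seconds; Lambda allows up to 900.
# 180 seconds leaves headroom for the NAT gateway polling loop.
lambdaClient.update_function_configuration(
    FunctionName='feiScraperReset--dev',
    Timeout=180,
)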
