Tor Scrape: The story of how I made a basic downloader with custom configs capable of bypassing most Anti-Bot tech
Over the past couple of months I’ve been using a script called save.py, which does what it sounds like: it saves things, .html documents in most cases. But why? Well, I’m lazy, that’s why.
That was until I met a friend: OSINTGuardian. About a week ago I learned of a site we’ll call “COPP”, and after failing to obtain the username list in full from its memberlist.php file thanks to VPN and anti-bot issues, I’d finally had enough. I decided I’d find a way. I spent two days reading articles and the like and still couldn’t figure out how to get a scraper to simply do what I wanted: connect to Tor and bypass the security.
Selenium
This was one of the recommended ways, but boy oh boy is it maddening. Cloudflare: * infinite looping on the anti-DDoS bot challenge *
After 2 hours of going insane doing the same thing over and over again, I said fuck Selenium and moved on to trying user agents and referrers, similar to save.py.
Save, Time Stamp (TS), Memberlist and Time_Calc
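The gist of save.py is something like this. A rough sketch, not the exact code: the Tor SOCKS address, the headers, and the target URL are illustrative, and it assumes requests installed with SOCKS support (requests[socks]):

```python
# save.py (sketch) -- fetch a page over Tor and dump the raw HTML
import requests

# Tor's default SOCKS port; socks5h so DNS also resolves through Tor
PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

# Spoofed browser identity -- the user agent and referrer are illustrative
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Referer": "https://www.google.com/",
}

def save(url: str, out_path: str) -> None:
    resp = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=60)
    resp.raise_for_status()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(resp.text)  # just the HTML, no associated files

if __name__ == "__main__":
    save("http://example.com/memberlist.php", "1.html")  # illustrative target
```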
The snippet above is save.py in spirit; it’s used to save HTML documents (without their associated files). It worked heavily in unison with Memberlist.py, which generated links from memberlist.php on the target site. TS.py was then used to scrape the data out in a specific format so time_calc (formerly time.py) could run its calculations.
A snippet of TS.py can be viewed below
This is what I call a “Config”. Its intended goal is to grab specific data from the HTML documents in the directory the user specifies when prompted; in this case:
Name:
Size:
Duration:
Link:
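Roughly, a config boils down to something like this. It’s a sketch: the regex-over-page-text approach and the exact patterns are assumptions based on the field list above, not the real TS.py:

```python
# TS.py-style config (sketch) -- pull labeled fields out of saved HTML
import os
import re

FIELDS = ("Name", "Size", "Duration", "Link")

def extract(path: str) -> dict:
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    record = {}
    for field in FIELDS:
        # Illustrative pattern: "Name: value" somewhere in the page
        match = re.search(rf"{field}:\s*(.+)", text)
        record[field] = match.group(1).strip() if match else None
    return record

if __name__ == "__main__":
    directory = input("Directory of saved HTML documents: ")
    for name in sorted(os.listdir(directory)):
        if name.endswith(".html"):
            print(extract(os.path.join(directory, name)))
```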
HTTP Error: 403 Client Forbidden
The next hurdle with COPP was the damn 403, which returned “403 Client Forbidden” instead of just “403 Forbidden”. That pretty much told me Python requests themselves were forbidden, and no matter how many user agents or referrers I threw at it, it wasn’t working.
So we went back to the original concept, which is still downloading HTML files, but with a more discreet version of Selenium’s web browser :)
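One way to pull that off is sketched below, assuming the undetected-chromedriver package (which may not be exactly what we ended up with): launch a patched Chrome through Tor and save page_source instead of firing raw requests:

```python
# "More discreet" browser (sketch) -- undetected-chromedriver over Tor
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--proxy-server=socks5://127.0.0.1:9050")  # route via Tor

driver = uc.Chrome(options=options)
try:
    driver.get("https://example.com/memberlist.php")  # illustrative URL
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)  # save the rendered HTML
finally:
    driver.quit()
```

Because it’s a real Chrome doing real TLS and JavaScript, it doesn’t scream “python-requests” at the anti-bot layer the way the earlier approach did.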
COPP2.py
COPP2.py had significantly better functions than COPP.py and was more stable. COPP.py lacked two important features: logging and naming. It became apparent that COPP was bound to crash or fail to complete a save, and we’d often have zero clue where we’d left off, because it “saved what it wanted” and in most cases with no order: 5832.html would be downloaded and, next thing you know, 63812.html was downloaded right after. Without logging, COPP.py was a nightmare, resulting in hours lost and the need to burn all the downloads using Clorox.py (essentially a BleachBit-style script). COPP2.py addressed these issues by logging: it used the line number to determine the next file’s name and recorded visited/downloaded items in download_log.txt, creating a super effective system for picking up after a network failure, after going to bed to save power, and so on.
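The bookkeeping boils down to something like this (a sketch: the helper names and the URL-keyed log format are illustrative):

```python
# COPP2.py-style bookkeeping (sketch): sequential file names plus a
# download_log.txt so a crash or shutdown never loses our place
import os

LOG = "download_log.txt"

def load_done() -> set[str]:
    if not os.path.exists(LOG):
        return set()
    with open(LOG) as f:
        return {line.strip() for line in f}

def mark_done(url: str) -> None:
    with open(LOG, "a") as f:
        f.write(url + "\n")

def run(urls: list[str], fetch) -> None:
    done = load_done()
    for number, url in enumerate(urls, start=1):
        if url in done:
            continue  # already saved on a previous run
        html = fetch(url)  # any fetcher: requests-over-Tor, the browser, etc.
        with open(f"{number}.html", "w", encoding="utf-8") as f:
            f.write(html)
        mark_done(url)
```

Because the log is keyed by what’s already been saved, rerunning the script skims past finished work and carries on, which is exactly what makes it resumable.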
Tor_Scrape.py v1.0
Tor Scrape v1 is built off of COPP2.py, the difference being the directories. There’s not much else to it other than some stability upgrades and some basic UI overhauls.
Tor_Scrape.py v2
Version 2 of Tor Scrape fixes an annoyance: the need to launch a config manually post-completion. Upon completion, Tor Scrape now launches a user-defined config, which is asked for in a prompt just after initial execution, creating a much more automated exportation process.
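The chaining itself is simple; roughly (a sketch, assuming the config is a plain Python script and the prompt wording is illustrative):

```python
# v2-style chaining (sketch): ask for the config up front, fire it at the end
import subprocess
import sys

config_path = input("Config to run once saving completes: ")

# ... main Tor scrape loop runs here ...

# Run the user's config with the same interpreter once the saves finish
subprocess.run([sys.executable, config_path], check=True)
```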
Tor_Scrape.py v3 and v4
v3 added Clorox.py integration, which automatically fires once the config finishes. This cleans up the .html files by overwriting their data, since some of it is really bad data you don’t want lying around.
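The Clorox.py idea, sketched below (the pass count and the .html filter are illustrative; note that an in-place overwrite also isn’t a hard guarantee on SSDs or copy-on-write filesystems):

```python
# Clorox.py-style cleanup (sketch): overwrite each .html with random
# bytes before deleting it, so the plaintext isn't left behind on disk
import os

def bleach(directory: str, passes: int = 3) -> None:
    for name in os.listdir(directory):
        if not name.endswith(".html"):
            continue
        path = os.path.join(directory, name)
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))
                f.flush()
                os.fsync(f.fileno())  # push the overwrite to disk
        os.remove(path)
```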
v4 was the biggest update. It lets me “chunk the data”. The script looks for an IS_VM file hanging out on the desktop: if it’s present, the script won’t proceed to fire Clorox.py and will simply initiate shutdown. If it’s not present, the main script looks for the now-returned drives labeled E, V, I & L and moves the contents of “Extracted” to D:\Extracted, which already contains one fifth of the data. From there the config is fired, the data is exported, and Clorox.py essentially destroys the data and partitions the drive, and I re-encrypt it with VeraCrypt later on.
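The v4 flow, roughly (a sketch: the paths, the desktop flag location, and the shutdown call are illustrative assumptions, and it’s Windows-specific):

```python
# v4 flow (sketch): VM guard flag, then consolidate the chunked drives
import os
import shutil

IS_VM = os.path.expanduser(r"~\Desktop\IS_VM")
DRIVES = ["E", "V", "I", "L"]   # the returned chunk drives
DEST = r"D:\Extracted"          # already holds a fifth of the data

if os.path.exists(IS_VM):
    # Inside the VM: skip Clorox.py entirely and just power down
    os.system("shutdown /s /t 0")
else:
    for letter in DRIVES:
        src = rf"{letter}:\Extracted"
        if os.path.isdir(src):
            for item in os.listdir(src):
                shutil.move(os.path.join(src, item), DEST)
    # ...then fire the config, export the data, and run Clorox.py...
```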
Endgame, Cloudflare and DDoSGuard
Bypassing Cloudflare has been rather effective: the initial connection is about the only time a captcha appears, and it never shows up again throughout the process. DDoSGuard, as usual, is ass and never appears past the initial connection. And Endgame? Yeah, it comes back every so often depending on the site’s load and other things. It’s usually very easy to solve the captcha and continue on, and thank the lord for the download_log.txt file, because the script has a tendency to fire off another save request mid-captcha and save the wrong page.
Anyways, thanks for reading. I’m a little toasted right now so I really dunno if 99% of this made sense, but I’ll reread it tomorrow.