Twitter Client-Side DOM Scraping
I was working on a project involving the Twitter DOM recently and found that, while there are many plugins that help with DOM scraping, there was no easy guide to scraping the Twitter DOM yourself on the client using JavaScript.
I wanted to write up what I learnt about scraping the DOM and share some of the code I used.
Finding a Tweet
Let’s start by exploring the DOM structure of Twitter. Of course, it is heavily obfuscated, but it might still give us some clues. I highly recommend playing around in Inspect Element for yourself.
So we have this one unique attribute, data-testid="tweet", which indicates which elements are tweets. Now let’s try to find one in the JS console.
We can do this by reading the data-testid attribute of each div element and checking whether it is equal to "tweet".
let divs = document.querySelectorAll("div") // Load div elements
for (let div of divs) {
    // Find tweet elements by checking for the specific attribute
    let dataTestId = div.getAttribute("data-testid")
    if (dataTestId == "tweet") {
        console.log(div.innerHTML) // Print the HTML of a tweet
        break; // For now we just want to find one tweet
    }
}
Clearly, we have found the div element of the tweet, but there is still a lot of junk to clear up before the tweet is in an easier-to-use format.
Using the same for loop without the break, we can find all the tweets currently loaded into the page, so let’s store them in an array.
let divs = document.querySelectorAll("div") // Load div elements
let tweets = []
for (let div of divs) {
    // Find tweet elements by checking for the specific attribute
    let dataTestId = div.getAttribute("data-testid")
    if (dataTestId == "tweet") {
        tweets.push(div)
    }
}
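As a possible shortcut, the same filtering can be pushed into the selector itself, since querySelectorAll accepts CSS attribute selectors. The sketch below is a minimal version of that idea. findTweets is a hypothetical helper name, and it assumes tweets are still div elements marked with data-testid="tweet", which Twitter may change at any time.

```javascript
// Hypothetical helper: returns every tweet element under `root` as an array.
// Assumes tweets are <div> elements carrying data-testid="tweet".
function findTweets(root) {
  // The attribute selector does the same check as the manual loop.
  return Array.from(root.querySelectorAll('div[data-testid="tweet"]'));
}

// In the browser console you would call it like this:
// let tweets = findTweets(document)
```

Returning a real array rather than a NodeList means the result can be used with array methods such as map and filter later on.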
Finding a Tweet’s ID
Unfortunately, the tweet’s ID is obfuscated and can’t easily be retrieved from the element itself. We can still find it with a trick: the tweet element contains a link to the tweet’s own URL, and that URL contains the tweet’s ID.
These URLs are structured like
${USER_NAME}/status/${Tweet_ID}
So we can search for /status/ and capture the tweet ID that follows it.
for (let tweet of tweets) {
    // Gets all the <a> tags from the tweet element
    let aTags = tweet.getElementsByTagName("a")
    for (let aTag of aTags) {
        // Gets the value of the href attribute (null if the tag has none)
        let href = aTag.getAttribute("href")
        if (href && href.includes("/status/")) {
            // Splits the string into a list and takes the part after /status/
            let tweetId = href.split("/status/")[1]
            console.log(tweetId)
            break; // Just getting the first of these URLs
        }
    }
}
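One caveat with splitting on /status/ is that anything after the ID (for example a trailing /photo/1) ends up in the result. A regular expression is one way around that. The sketch below is not part of the original scraper: extractTweetId is a hypothetical helper, the IDs in the calls are made-up placeholders, and it assumes the ID is the run of digits directly after /status/.

```javascript
// Hypothetical helper: pulls the tweet ID out of a link's href.
// Assumes the ID is the run of digits right after "/status/".
function extractTweetId(href) {
  if (typeof href !== "string") {
    return null; // getAttribute returns null for <a> tags without an href
  }
  const match = href.match(/\/status\/(\d+)/);
  return match ? match[1] : null;
}

// Placeholder hrefs, for illustration only:
console.log(extractTweetId("/CryogenicPlanet/status/1234567890")); // 1234567890
console.log(extractTweetId("/CryogenicPlanet/status/1234567890/photo/1")); // 1234567890
console.log(extractTweetId("/explore")); // null
```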
Parsing the Tweet
First, let’s build a tweet object that holds the information we want from each tweet.
let tweet = {
    name: "",
    username: "",
    time: "",
    content: "",
    interaction: {
        reply: "",
        retweets: "",
        like: "",
    },
};
This is some basic information we can extract from a tweet.
So let’s build a parser to extract this information.
let tweetParser = async function(tweetDom) {
    // Will be all the text in a tweet
    let tweetTextContent = tweetDom.innerText

    // First let's get the time of the tweet
    // Again, noticing this DOM element takes messing around in the browser
    let timeElm = tweetDom.getElementsByTagName("time")[0]
    // This will just be the display text, such as 15m, 4hr or April 23rd
    let timeDisplay = timeElm.innerText;
    // Gets the value of the datetime attribute
    let dateTimeAtri = timeElm.getAttribute("datetime")
Before continuing, I want to give some motivation for why we are going to parse the rest of the tweet the way we do.
Let’s look at the text output from the tweet element:
Rahul Tarak
@CryogenicPlanet
·
Mar 14
Working on a new project, excited to be putting to use a Neo4j database
Also such a relief to be doing that feels productive and like it could help with #covind19 rather than dumb school work
Quote Tweet
Arnav Bansal
@itsarnavb
·
Mar 14
Building this with @CryogenicPlanet, in random hackathon mode
Here’s the why: https://notion.so/Vision-b37d3026c3384a01ac5a6f5e2695e0d5…
1
1
7
So, continuing inside tweetParser, we can parse the elements by splitting on newlines.
    // Makes the text into an Array split by newline
    let splitTweet = tweetTextContent.split(/\n/);
    let splitLength = splitTweet.length
    let breakpoint;
    // Subtracting four to account for the three interactions at the end
    let endContent = splitLength - 4
    for (let i = 0; i < splitLength; i++) {
        if (splitTweet[i] === timeDisplay) {
            // Find the first element which has the timeDisplay
            // This is the last element before the tweet content starts
            breakpoint = i;
            break; // Stop here so a quoted tweet's time is not matched instead
        }
    }
    tweet.name = splitTweet[0]; // Always the first element
    tweet.username = splitTweet[1]; // Always the second element
    tweet.time = dateTimeAtri // The date-time value of the tweet
    // All the parts of the tweet which are the content
    tweet.content = splitTweet.slice(breakpoint + 1, endContent + 1)
    tweet.content = tweet.content.join('\n') // Combined into one string
    tweet.interaction.reply = splitTweet[endContent + 1];
    tweet.interaction.retweets = splitTweet[endContent + 2];
    tweet.interaction.like = splitTweet[endContent + 3];
}
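To experiment with the splitting logic outside the browser, it can help to separate it from the DOM. The sketch below is one way to do that, not the original implementation: parseTweetText is a hypothetical function that takes the tweet’s innerText and the display time as plain strings and returns a new object on each call. It assumes the layout shown in the sample output, with exactly three interaction counts as the last three lines. That assumption may not hold for every tweet, for example if Twitter does not render one of the counts.

```javascript
// Hypothetical helper: parses a tweet's innerText into a plain object.
// Assumed layout: name, username, "·", time, content lines, then exactly
// three interaction counts (replies, retweets, likes) as the last lines.
function parseTweetText(text, timeDisplay, dateTime = "") {
  const lines = text.split("\n");
  // indexOf returns the first match, so a quoted tweet's time is ignored.
  const timeIndex = lines.indexOf(timeDisplay);
  if (timeIndex === -1 || lines.length < timeIndex + 4) {
    return null; // The text does not match the assumed layout
  }
  const firstCount = lines.length - 3; // Index of the reply count
  return {
    name: lines[0],
    username: lines[1],
    time: dateTime || timeDisplay, // Fall back to the display text
    content: lines.slice(timeIndex + 1, firstCount).join("\n"),
    interaction: {
      reply: lines[firstCount],
      retweets: lines[firstCount + 1],
      like: lines[firstCount + 2],
    },
  };
}

// A shortened version of the sample output above:
const sampleText = [
  "Rahul Tarak",
  "@CryogenicPlanet",
  "·",
  "Mar 14",
  "Working on a new project, excited to be putting to use a Neo4j database",
  "1",
  "1",
  "7",
].join("\n");

console.log(parseTweetText(sampleText, "Mar 14"));
```

Returning null for text that does not fit the assumed layout lets the caller skip such tweets instead of storing misaligned fields.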
Capturing Lazy Loaded Tweets
Twitter doesn’t load all the tweets on the timeline immediately, and if a tweet isn’t in the DOM, we can’t read it this way. So we have to set up a scroll event listener to capture new tweets as they load.
We can use the outline below to do this:
window.addEventListener('scroll', function() {
    // Get tweets
    // Check they are distinct
    // Store them
})
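One way to fill in the "check they are distinct" step is to key stored tweets by their ID. The sketch below is a minimal illustration under that assumption and is not taken from the full scraper: createTweetCollector is a hypothetical name, and the browser wiring at the bottom assumes the data-testid="tweet" attribute and the /status/ link structure described earlier.

```javascript
// Hypothetical collector: remembers tweets by ID so that repeated scroll
// events do not store the same tweet twice.
function createTweetCollector() {
  const seen = new Map(); // tweet ID -> stored tweet
  return {
    // Stores only tweets whose ID is new; returns how many were added.
    add(tweets) {
      let added = 0;
      for (const tweet of tweets) {
        if (tweet.id && !seen.has(tweet.id)) {
          seen.set(tweet.id, tweet);
          added += 1;
        }
      }
      return added;
    },
    // Returns every stored tweet in insertion order.
    all() {
      return Array.from(seen.values());
    },
  };
}

// Browser-only wiring; skipped when there is no DOM.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  const collector = createTweetCollector();
  window.addEventListener("scroll", () => {
    const found = [];
    for (const el of document.querySelectorAll('[data-testid="tweet"]')) {
      // Assumes the first link containing /status/ points at the tweet itself
      const link = el.querySelector('a[href*="/status/"]');
      const href = link ? link.getAttribute("href") : null;
      const id = href ? href.split("/status/")[1] : null;
      found.push({ id, text: el.innerText });
    }
    collector.add(found);
  });
}
```

Scroll events fire very often, so in practice it may be worth throttling the handler. That is left out here to keep the sketch short.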
The full code for the scraper can be found at https://gist.github.com/CryogenicPlanet/b2dd54a8c946999e9fe497b33ae2037a or https://repl.it/@CryogenicPlanet/Twitter-Scraper-Source-Code
Conclusion
Let me start by saying this is not even close to a complete guide to scraping Twitter, just what I learnt while experimenting with it.
In terms of application, this could be used inside an extension as a wrapper to manipulate the Twitter experience, or it could be run after downloading the Twitter page.
I was exploring Twitter scraping for another project, where we finally scrapped the idea of using a client-side scraper. I have to acknowledge my friends’ contributions to the above code, namely Baalateja Kataru, Gaurang Bharti, Arnav Bansal and Rithvik Mahindra.