Selenium tutorial with Twitter
What is Selenium?
Let’s start with a definition. Selenium is an open-source tool that is mostly used to automate interactions with web applications through a browser. For example, we can use Selenium to test a form on a website and check whether it gives the desired result. A quick Google search will turn up more background if you want it. We are going to use this capability of Selenium to scrape Twitter data such as tweets, Twitter handles, like counts, reply counts, and much more.
Why Selenium?
You might be thinking: we already have the Twitter API for getting Twitter data, so why use Selenium? Recently, I participated in a competition where we needed to scrape Twitter data and run sentiment analysis on it. At first, I relied on APIs alone, but neither of the ones I tried worked, and nothing I did to resolve the errors helped. That is when Selenium came to my rescue :). If you ever run into a similar issue, this article may be helpful for you.
Let’s get started!
For this tutorial, I will be using the Chrome browser, but you can try other browsers as well. To use Chrome, you will need to download the ChromeDriver build that matches your system; you can get it from this link. To find your Chrome version, follow the images below.
If you face any issues, this link will help you get the right ChromeDriver.
After downloading ChromeDriver, extract it to a convenient location and note its path, as you will need it later.
Now we will start writing our automation script in Python. First install the Selenium package, which you can do with pip install selenium. If you have not installed a Python package before, any pip guide will walk you through it. After installing, let’s import it.
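As a rough sketch, the snippet below checks that the package is actually importable before the script depends on it, and then performs the import. The helper name is_installed is my own (it is not part of Selenium); it only relies on the standard-library importlib module.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("selenium"):
    # The import used throughout the rest of this tutorial.
    from selenium import webdriver
else:
    print("Selenium is missing; install it with: pip install selenium")
```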
Now we will initialise the driver:
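A minimal sketch of this step, assuming Selenium 4, where the ChromeDriver path is wrapped in a Service object (older releases accepted the path directly as an executable_path argument to webdriver.Chrome). The function name build_chrome_driver and the example path are placeholders of mine, so replace the path with the location you extracted ChromeDriver to.

```python
def build_chrome_driver(chromedriver_path: str):
    """Start Chrome through the ChromeDriver binary at chromedriver_path."""
    # Imported inside the function so this file still loads
    # on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(executable_path=chromedriver_path)
    return webdriver.Chrome(service=service)


# Example call (placeholder path, not a real location):
# driver = build_chrome_driver("/path/to/chromedriver")
```

Calling the function opens a new Chrome window controlled by the script; the returned object is the driver variable used in the following steps.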
Now we will use the driver variable to access the website. Call the get() function with the URL of any website to open it. Since this is Twitter scraping, we will use the Twitter URL.
After executing this, you will see the Twitter page:
Next, we need to sign in or sign up on Twitter. Since I already have an account, I will sign in. To do this, we first need to find the code behind the sign-in button. Right-click on the page, then click Inspect. You will see the HTML behind the page. In it, you have to find the element that triggers the sign-in button.
When you get something like this, right-click on the Sign in button and the corresponding code will be highlighted in the panel on the right.
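Here is a small sketch of that call. The URL constant is my assumption for the Twitter landing page, and open_twitter is a hypothetical helper name; driver is the object created in the previous step.

```python
# Assumed landing page; change it if Twitter serves you a different address.
TWITTER_URL = "https://twitter.com"


def open_twitter(driver, url: str = TWITTER_URL) -> None:
    """Navigate the given WebDriver instance to the Twitter landing page."""
    driver.get(url)


# Example call, once a driver exists:
# open_twitter(driver)
```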
Now in this div section, keep going down the hierarchy by clicking the arrows until you find the “Sign in” text. There is a simpler way to find it using XPath. XPath (XML Path Language) is a query language for selecting nodes in an XML or HTML document. To find an element by its text using XPath, we use the following syntax:
//*[text()='text']. So to find Sign in, click anywhere in the code section, press Ctrl+F, and type //*[text()='Sign in'] (use straight quotes, not curly ones). You will see the match highlighted. Now we will find the element using our driver. Older Selenium releases provide the find_element_by_xpath() function for this; Selenium 4 removed it in favour of find_element(By.XPATH, ...). Either way, pass the XPath as the argument and then click the returned element using the click() function.
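The lookup and click can be sketched as follows, assuming the Selenium 4 find_element(by, value) call. The helper names xpath_for_text and click_by_text are mine, not Selenium functions. The string "xpath" is the value of Selenium's By.XPATH constant, so it is passed directly here to keep the snippet free of extra imports.

```python
def xpath_for_text(text: str) -> str:
    """Build an XPath matching any element whose text is exactly `text`.

    Note: this simple version breaks if `text` itself contains a single quote.
    """
    return f"//*[text()='{text}']"


def click_by_text(driver, text: str) -> None:
    """Find the first element with the given visible text and click it."""
    # "xpath" is the string behind selenium.webdriver.common.by.By.XPATH.
    element = driver.find_element("xpath", xpath_for_text(text))
    element.click()


# Example call, once the Twitter page is open:
# click_by_text(driver, "Sign in")
```

If the click fails because the page has not finished loading, giving the page a moment before the lookup is usually enough.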
You will move to the next section as shown.
There are many options to sign in, but I would suggest signing in with your username or phone number, as Google tends to block sign-ins from automated browsers. I will use my username to sign in. To do that, we again have to find the relevant section of code, click it, enter the value, and click Next.
To do this, right-click on the page and select “Inspect”. Here we have to find the code for the section below.
Since this is an input field, right-click on the field and select Inspect to find its input tag in the code section. In this div section, find the input tag as shown below:
Now we need to build the XPath. To find an element by tag, we use //tag_name[@attribute='value']. Here the tag name is input, and I will use the name attribute, whose value is “text”. A page can contain multiple input boxes, so we use attributes to identify each input tag uniquely. Attributes are the words written in yellow in the above image, such as autocapitalize, autocomplete, etc., and their values are written in double quotes. To uniquely identify this input tag, I will use the name attribute as follows: //input[@name='text']. Let’s incorporate it into our code to automate it.
Next, we need to enter our username. This can be done using the send_keys() function as follows:
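A sketch of building and using that XPath, again assuming Selenium 4's find_element(by, value). The names xpath_for_attribute, USERNAME_XPATH, and find_username_box are hypothetical helpers for illustration.

```python
def xpath_for_attribute(tag: str, attribute: str, value: str) -> str:
    """Build an XPath of the form //tag[@attribute='value']."""
    return f"//{tag}[@{attribute}='{value}']"


# The username box from the text above: an <input> whose name is "text".
USERNAME_XPATH = xpath_for_attribute("input", "name", "text")


def find_username_box(driver):
    """Return the username input element located through USERNAME_XPATH."""
    # "xpath" is the string behind selenium's By.XPATH constant.
    return driver.find_element("xpath", USERNAME_XPATH)
```

Here USERNAME_XPATH evaluates to //input[@name='text'], the same expression you can paste into the Ctrl+F search box of the inspector to double-check the match.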
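Roughly, that looks like the sketch below. The function name enter_text and the example username are placeholders of mine; the XPath is the one built above for the username box.

```python
def enter_text(driver, xpath: str, value: str) -> None:
    """Click the field located by `xpath` and type `value` into it."""
    # "xpath" is the string behind selenium's By.XPATH constant.
    field = driver.find_element("xpath", xpath)
    field.click()           # focus the box first, as a user would
    field.send_keys(value)  # then type the text


# Example call (placeholder username):
# enter_text(driver, "//input[@name='text']", "my_username")
```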
Next, we need to click Next. To do this, we can do the same as we did for Sign in.
We move to the next section:
On this page, we need to enter the password and click Log in.
A good way to read the password without echoing it on screen or hard-coding it in the script is the getpass library.
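A minimal sketch using the standard-library getpass module. The wrapper read_password and its ask parameter are my own additions; the parameter only exists so the prompt function can be swapped out, for example in a test.

```python
import getpass


def read_password(prompt: str = "Twitter password: ", ask=getpass.getpass) -> str:
    """Prompt for a password; getpass does not echo the typed characters."""
    return ask(prompt)


# Example call (run it in a terminal so the hidden prompt works):
# my_password = read_password()
```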
Now, again, find the input box for the password, pick an attribute that identifies it uniquely, and use it. Try it yourself first, and then refer to the images below.
That’s all for this part. In the next part, I will show how we can scrape the Twitter data. I hope you enjoyed the article; if not, please share your feedback and I will try to keep it in mind for the next part. Keep Learning!
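One possible solution is sketched below. It assumes the password box can be identified by type='password', which is my guess at a unique attribute rather than something confirmed on the page, so verify it in the inspector. The function name log_in is also hypothetical, and the call style assumes Selenium 4's find_element(by, value).

```python
# Assumption: the password box is the only <input> with type="password".
PASSWORD_XPATH = "//input[@type='password']"
# Same text-based lookup that was used for the Sign in button.
LOGIN_XPATH = "//*[text()='Log in']"


def log_in(driver, password: str) -> None:
    """Type the password into the password box, then click Log in."""
    # "xpath" is the string behind selenium's By.XPATH constant.
    driver.find_element("xpath", PASSWORD_XPATH).send_keys(password)
    driver.find_element("xpath", LOGIN_XPATH).click()


# Example call, with a password read through getpass:
# log_in(driver, my_password)
```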