Web Scraping and Login using Python Selenium
Need to scrape a website that requires you to log in first?
Selenium can help with that. It is primarily a browser automation tool used for testing web applications, but it also works well for scraping: it drives a real browser from a script, so it copes with JavaScript-rendered pages, dynamic DOM content and complex HTML.
For example, suppose we want to scrape news from websites that require a login, such as www.wsj.com or www.barrons.com.
The first step is to install the libraries we need, namely the selenium Python library and the webdriver manager library, and then import the required Selenium functions in your file.
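As a rough sketch of this setup step, the snippet below shows the install command as a comment and the imports a login script typically needs. It assumes Chrome as the browser. The `missing_packages` helper is a hypothetical convenience, not part of Selenium; it only reports which packages still need to be installed.

```python
# Install the libraries once from a shell (PyPI package names):
#   pip install selenium webdriver-manager

import importlib.util

# Import names of the two libraries (note the underscore in webdriver_manager).
REQUIRED_PACKAGES = ["selenium", "webdriver_manager"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Please install first:", ", ".join(missing))
else:
    # Selenium functions commonly needed for a login script (Chrome assumed).
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager
```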
Next, create a function or class for the login. The code should:
- open the login URL,
- set the web driver options (e.g. window size, headless mode, etc.), and
- log in with your username and password.
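The three steps above could look roughly like the sketch below. It assumes Chrome, and the element locators (`username`, `password`, the submit button) are hypothetical placeholders: every site uses its own, and some sites split the login over several pages, so inspect the real login form and adjust them. The function names `build_chrome_arguments` and `login` are also just illustrative.

```python
def build_chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the Chrome command-line arguments for the web driver options."""
    width, height = window_size
    arguments = [f"--window-size={width},{height}"]
    if headless:
        arguments.append("--headless")  # run without a visible browser window
    return arguments


def login(url, username, password, headless=True):
    """Open `url`, fill in the login form and return the logged-in driver."""
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    # Set the web driver options.
    options = Options()
    for argument in build_chrome_arguments(headless=headless):
        options.add_argument(argument)

    # webdriver manager downloads a matching chromedriver binary.
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)

    # Open the login page.
    driver.get(url)

    # Log in. The locators below are assumptions, not the real ones of any site.
    wait = WebDriverWait(driver, 20)
    wait.until(EC.presence_of_element_located((By.ID, "username"))).send_keys(username)
    driver.find_element(By.ID, "password").send_keys(password)
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    return driver
```

A call such as `driver = login(login_url, my_username, my_password)` would then return a driver that is ready for scraping; remember to call `driver.quit()` when you are done.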
After a successful login, we can continue the code to get the news. We can choose the information we need (e.g. title, article text, date, etc.) and store it in a CSV file.
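Here is a minimal sketch of that step, assuming `driver` is a logged-in Selenium driver that has already opened an article page. The CSS selectors in `extract_article` are hypothetical and need to be replaced with the ones used by the actual site; the column names are likewise only an example.

```python
import csv

# Columns of the output file (an assumed layout, adjust as needed).
FIELDNAMES = ["title", "date", "article"]


def extract_article(driver):
    """Read one article from the page currently open in `driver`."""
    # Imported here because it is only needed when a live browser is used.
    from selenium.webdriver.common.by import By

    # Hypothetical selectors: inspect the real page to find the right ones.
    paragraphs = driver.find_elements(By.CSS_SELECTOR, "article p")
    return {
        "title": driver.find_element(By.CSS_SELECTOR, "h1").text,
        "date": driver.find_element(By.CSS_SELECTOR, "time").text,
        "article": " ".join(p.text for p in paragraphs),
    }


def save_articles(rows, path):
    """Write a list of article dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

For example, after visiting a page with `driver.get(article_url)`, calling `save_articles([extract_article(driver)], "news.csv")` would write one row to `news.csv`.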
Sometimes we still can't get data from the website because of a CAPTCHA or some other bot detection. If that happens, we can reduce the chance of being blocked with methods such as setting a user agent or slowing down the script execution.
For the user agent, we can use the fake_useragent library and add a random user agent to the web driver options. To slow down the script execution, we can use time.sleep(seconds).
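Both ideas can be sketched as below. `UserAgent().random` comes from fake_useragent; the helper names and the default delay range of 2 to 5 seconds are assumptions chosen for illustration, and neither measure guarantees that a site will not block the script.

```python
import random
import time


def random_user_agent():
    """Return a random browser user agent string from fake_useragent."""
    # Imported here so the other helpers work without the library installed.
    from fake_useragent import UserAgent  # pip install fake-useragent

    return UserAgent().random


def user_agent_argument(user_agent):
    """Format a user agent string as a Chrome command-line argument."""
    return f"--user-agent={user_agent}"


def polite_pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random time in the given range and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Usage with Selenium (not executed here):
#   options = Options()
#   options.add_argument(user_agent_argument(random_user_agent()))
#   ...
#   polite_pause()  # call between page loads
```

A random delay is used instead of a fixed `time.sleep(5)` so that requests do not arrive at perfectly regular intervals.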
Web scraping with Selenium can still be tricky, but it is at least another option for getting data from a website, and it makes logging in to the website easy.
This code was adapted from here; for more information, please check here.