Mastering Python the Right Way.
Imagine you want to download images and reels/videos from a social media platform. You can easily do it: just tap an image or video and download it. Easy, right? But come to think of it, if you want to download many images or videos from the same platform, doing it by hand is totally tiresome. This is where programming, simply using code to instruct your computer to work on your behalf, comes into play. Python as a programming language does this for you: you can scrape websites of whatever kind, retrieve the data, and store it wherever you want it saved. How does this happen? Come along with me…
Web Scraping Using Python.
Web scraping refers to extracting large amounts of data from websites through automation. Much of this data is unstructured HTML that is converted into structured data (for example, spreadsheets) before it is used in various applications. How is this done? Web scraping can be approached in many ways:
You can use specific APIs or write code from scratch. Some big companies like Google and Facebook (Meta), among others, expose APIs that let you access their data in structured ways.
Where is Web Scraping used?
- Price Monitoring. Companies can scrape product data for their own products and for competing products to see how it impacts their pricing strategies.
- Market Research.
- News Monitoring.
- Sentiment Analysis.
- Email Marketing.
Python is a language that powers a lot of interesting projects, and web scraping is one of the tasks it handles well. There are many libraries out there to help you perform web scraping easily; you can choose Scrapy or Beautiful Soup to accomplish this.
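To get a first taste of what these libraries offer, here is a minimal sketch using Beautiful Soup (installed later in this tutorial); the HTML snippet and its URLs are made up for illustration. Beautiful Soup turns raw HTML into a tree you can query:

```python
from bs4 import BeautifulSoup

# A tiny made-up page standing in for something you scraped.
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

page = BeautifulSoup(html, "html.parser")
# Collect the href attribute of every <a> tag on the page.
links = [a["href"] for a in page.find_all("a")]
print(links)  # → ['https://example.com/a', 'https://example.com/b']
```

This query-the-tree style is exactly what we will use later to pull the title out of an Instagram page.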
1. Create a virtual environment.
- Using your terminal, use the command below to create a directory:
mkdir webScrap # creates the project directory
- Change directory by using this command:
cd webScrap # here you have switched to this directory
- Run the following command to create and activate the virtual environment:
pipenv shell # this command creates and activates the virtual environment
Hope that was not a lot; the commands were straightforward. Up to this point, you have a working virtual environment.
2. Create a Python file.
Now let's create a file and name it scrap.py. The extension .py simply means it's a Python module. You wonder how? Think of a picture: it can be a JPEG or a PNG. Similarly, the .py extension marks the file as a Python module.
Let's keep going. To open the project in the editor of your choice (VS Code here), run this command:
code . # don't leave out the trailing "." — it simply means the current directory
If you have correctly followed the procedure this far, VS Code should be up and running. Remember, you already activated your virtual environment. Some might ask: what the heck is a virtual environment? Very simple: it is basically the box/container where your project runs. This means that one project does not affect another; projects are decoupled.
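If you want to convince yourself the environment is really active, here is a quick standard-library sketch you can run inside it:

```python
import sys

# Inside an activated virtual environment, sys.prefix points at the
# environment's own directory, while sys.base_prefix still points at the
# system Python installation; outside any environment the two are equal.
in_venv = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_venv)
```

If it prints True, your project is isolated from the system Python.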
3. Import packages.
The following packages will help us achieve our end goal easily. They are existing libraries, so don't re-invent the wheel; someone already did it for you. Install them inside the activated environment:
pip install requests
pip install bs4
pip install termcolor
At the beginning of the file, inside scrap.py, import the following:
import re  # regular expressions
import requests  # remember to first install requests before importing it
from bs4 import BeautifulSoup
import urllib.request as DFU  # used later to download the media file
from termcolor import colored  # colored terminal output
4. Get your personal URL from Instagram.
Here we use our personal Instagram URL in order to fetch the data associated with it using the requests module.
To get this data, do as follows:
data = requests.get(url)
To check if this is working, print data.
Checking the video case.
Click on any video and copy its URL. For this case I am going to use my Instagram account, go to a video, and copy that URL. Check the procedure below.
url = "https://www.instagram.com/tv/Cap3EXMJsi0/utm_source=ig_web_copy_link"
data = requests.get(url)
print(data.status_code)  # 200 means the request was successful
If you search the fetched page for 'mp4', the link containing the mp4 is the main thing we need. Note that Instagram has terms and conditions about doing something of this kind. Use the regex below to pull out the video link:
match = re.findall(r'url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count', data.text)
When we run the line above, the regex searches the downloaded page for the video URL. To extract the video, I declare a variable that holds the file extension, as shown below.
extraction = ".mp4"
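To see what the pattern actually captures, here is a self-contained sketch run on a made-up snippet imitating the JSON embedded in the page (the real page source is much larger, and the URL here is invented):

```python
import re

# Synthetic stand-in for a fragment of the fetched page source.
sample = '"url":"https://example.com/some_clip","video_view_count":123'

# The tutorial's pattern: 'url' plus three non-word characters (the quote,
# colon, quote), then the captured link, then three more non-word
# characters before the 'video_view_count' key.
match = re.findall(r'url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count', sample)
print(match)  # → ['https://example.com/some_clip']
```

Each `\W` stands for one non-word character, which is why the pattern tolerates the quotes and punctuation around the JSON keys.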
So far so good. For the image case, I recommend you use your profile picture URL; the image you should get is your profile picture.
Use the below link for image.
match = re.findall(r'profile_pic_url\W\W\W([\W\w]+)\W\W\Wdisplay_resources', data.text)
Our variable for extraction is like below:
extraction = ".jpg"
After we are done with this long process, let's collect the actual post video or image URL out of the regex match (or an empty string if nothing matched):
result = match[0] if match else ""  # the captured URL, or "" when there is no match
5. Data extraction
To download the caption of the post, we will use Beautiful Soup to get the caption or title of the post. To manage this, we pass the fetched page text through BS4 and filter it.
page = BeautifulSoup(data.text, "html.parser")
title = page.find("title")
title = title.get_text()
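The same two lines work on any HTML. Here is a self-contained sketch with a made-up page, so you can see exactly what find("title") and get_text() return:

```python
from bs4 import BeautifulSoup

# A tiny made-up stand-in for the fetched page.
html = "<html><head><title>My Post on Instagram</title></head></html>"

page = BeautifulSoup(html, "html.parser")
title = page.find("title").get_text()  # the text inside the <title> tag
print(title)  # → My Post on Instagram
```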
Our code finds the title of the page and stores it in the title variable. Having gotten this, we now use a regex to make the name safe to save as a filename in a media folder.
title = re.sub(r"\W+", "_", title)
title = "download/web_scrap"+title+"web_scrap"
We use the download/ prefix because we want to store our downloaded files in a folder called download/ (make sure that folder exists before saving).
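As a quick self-contained check of the sanitizing step, here it is applied to a hypothetical page title:

```python
import re

# Hypothetical page title containing spaces and punctuation.
title = "My Post! (2022)"
title = re.sub(r"\W+", "_", title)  # each run of non-word chars becomes "_"
path = "download/web_scrap" + title + "web_scrap"
print(path)  # → download/web_scrapMy_Post_2022_web_scrap
```

Replacing whole runs of punctuation with a single underscore keeps the filename short and valid on every operating system.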
if result != "":
    download = input("Do you want to download (y/N): ")
    if download.lower() == 'y':
        fileName = title + extraction
        DFU.urlretrieve(result, fileName)  # fetch the media and save it to fileName
        print(colored("Download Successful", "green"))
    else:
        print("Sorry! Download Unsuccessful")
else:
    print("Did not find the media, or the post is from a private account")
The code above only attempts the download when the result string is not empty. It converts the user's input to lower case and compares it with 'y'; if the user agrees, the media is fetched with urllib and saved under the download/ folder. Otherwise, or if nothing was found, an appropriate message is printed.