Myntra: Web Scraping using Selenium

Kashish Oberoi
Jul 6, 2020


With the increasing demand for machine learning and AI models, there is a growing need for new and varied datasets. Many datasets are readily available, but if you wish to build one of your own, one option that comes to mind is web scraping.

A large amount of data across various domains is up for grabs on the internet; we just need to scrape it and use it to train our state-of-the-art machine learning models.

In this article, we will learn to scrape Myntra, an Indian fashion e-commerce company, using Selenium WebDriver.

Step I: Import Modules and Dependencies

from selenium import webdriver
import urllib.request
import os
import json
import time

In addition to the Python libraries above, you need ChromeDriver, which you can download from https://chromedriver.chromium.org/ for the Chrome version that you are currently using.

Step II: Open Myntra.com using selenium

driver = webdriver.Chrome('chromedriver')  # path to the chromedriver executable
driver.get('https://www.myntra.com/')
time.sleep(5)  # give the page time to load

The code above loads the URL, and a Chrome window opens showing the response page.
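A fixed time.sleep(5) either wastes time on a fast connection or is too short on a slow one. Selenium ships WebDriverWait for this purpose; the standard-library sketch below only illustrates the same polling idea. The helper name wait_until and its parameters are assumptions made for this example, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical usage with the driver from Step II:
# wait_until(lambda: driver.find_elements_by_class_name('desktop-searchBar'))
```

The call returns as soon as the search bar is found, so the script waits only as long as the page actually needs.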

Step III: Search

On the response page, there is a search bar where you can enter your query. In this step, we enter text in the search bar and click the submit button to retrieve the search results.

search_string = 'T-Shirt'  # example query
driver.find_element_by_class_name('desktop-searchBar').send_keys(search_string)
driver.find_element_by_class_name('desktop-submit').click()

Result for search string “T-Shirt”
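The find_element_by_* helpers used in this article were removed in Selenium 4.3, so on a recent Selenium release the search step needs the find_element(by, value) form. Here is a minimal sketch of that form, assuming the class names above still match the live site. The constant CLASS_NAME and the function search are names made up for this example; with Selenium installed you would normally import By from selenium.webdriver.common.by and pass By.CLASS_NAME.

```python
# Same string value as selenium.webdriver.common.by.By.CLASS_NAME
CLASS_NAME = "class name"


def search(driver, search_string):
    """Type a query into the search bar and submit it (Selenium 4 locator style)."""
    driver.find_element(CLASS_NAME, "desktop-searchBar").send_keys(search_string)
    driver.find_element(CLASS_NAME, "desktop-submit").click()


# Hypothetical usage with the driver from Step II:
# search(driver, "T-Shirt")
```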

Step IV: Scrape for object URLs

If the purpose of your dataset collection is just to retrieve the thumbnails, you can get them from this response page alone. But if you want the metadata along with the images, you can proceed as follows.

links = []
while True:
    time.sleep(5)  # wait for the results page to load
    for product_base in driver.find_elements_by_class_name('product-base'):
        links.append(product_base.find_element_by_xpath('./a').get_attribute("href"))
    try:
        driver.find_element_by_class_name('pagination-next').click()
    except Exception:
        # no NEXT button found: this was the last page
        driver.quit()
        break

The code above collects all the product links on the response page and looks for the NEXT button. If the button exists, the driver goes to the next page and repeats the procedure; otherwise, it closes the driver connection.
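The pagination logic itself does not depend on Selenium, so it can be sketched and checked separately. In the minimal version below, collect_links, get_page_links, and go_to_next_page are hypothetical names chosen for this example. It also skips missing and duplicate links, which the loop above does not do.

```python
def collect_links(get_page_links, go_to_next_page):
    """Gather links page by page until go_to_next_page() returns False."""
    links = []
    seen = set()
    while True:
        for href in get_page_links():
            if href and href not in seen:  # skip missing and duplicate links
                seen.add(href)
                links.append(href)
        if not go_to_next_page():
            return links


# Hypothetical wiring to the driver used in this article:
# def get_page_links():
#     return [p.find_element_by_xpath('./a').get_attribute('href')
#             for p in driver.find_elements_by_class_name('product-base')]
```

Keeping the browser calls in two small callables makes it easier to change the locators later without touching the loop.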

Step V: Retrieve Data

Product Page

driver = webdriver.Chrome('chromedriver')
driver.get(link)  # link is one of the product URLs collected in Step IV
metadata = {'specifications': {}}
metadata['title'] = driver.find_element_by_class_name('pdp-title').get_attribute("innerHTML")
metadata['name'] = driver.find_element_by_class_name('pdp-name').get_attribute("innerHTML")
metadata['price'] = driver.find_element_by_class_name('pdp-price').find_element_by_xpath('./strong').get_attribute("innerHTML")

Here, we get the title, the name, and the price of the product using the product link fetched in Step IV.
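Since innerHTML returns raw markup, the stored values may contain tags, HTML entities, or stray whitespace. Below is a minimal cleanup sketch; the function names and the sample strings in the usage comments are assumptions for illustration, not actual Myntra markup.

```python
import html
import re


def inner_html_to_text(fragment):
    """Replace tags with spaces, decode entities, and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(without_tags).split())


def parse_price(text):
    """Return the first integer amount found in a price string, or None."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


# Hypothetical usage:
# inner_html_to_text('<span>Rs. </span>1,299')
# parse_price('Rs. 1,299')
```

If only the visible text is needed, reading the element's text property instead of innerHTML is another option.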

metadata.setdefault('specifications', {})
try:
    driver.find_element_by_class_name('index-showMoreText').click()
    table = driver.find_element_by_class_name('index-tableContainer')
    for index_row in table.find_elements_by_class_name('index-row'):
        key = index_row.find_element_by_class_name('index-rowKey').get_attribute("innerHTML")
        metadata['specifications'][key] = index_row.find_element_by_class_name('index-rowValue').get_attribute("innerHTML")
except Exception:
    pass  # the original except clause is missing; skipping absent specifications is an assumption
metadata['productId'] = driver.find_element_by_class_name('supplier-styleId').get_attribute("innerHTML")

In this code snippet, we expand the full list of product specifications and then add each key-value pair to the metadata. We also store the product ID.

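# base is the dataset sub-folder name; it is not defined in the original snippet
image_dir = os.path.join("data", base, metadata['productId'], 'images')
os.makedirs(image_dir, exist_ok=True)  # urlretrieve does not create folders
itr = 1
for image_tag in driver.find_elements_by_class_name('image-grid-image'):
    # the image URL sits inside the inline style, in the form url("...")
    image_url = image_tag.get_attribute('style').split("url(\"")[1].split("\")")[0]
    urllib.request.urlretrieve(image_url, os.path.join(image_dir, str(itr) + ".jpg"))
    itr += 1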

State-of-the-art computer vision and deep learning models require image data of the product, which this code snippet downloads.
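The split-based extraction above assumes the style attribute always uses double quotes around the URL. A slightly more tolerant sketch with a regular expression is shown below. The function name background_image_url is hypothetical, and the usage comment uses a placeholder URL.

```python
import re

# Matches url("..."), url('...'), and url(...) inside an inline style string
_URL_IN_STYLE = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def background_image_url(style):
    """Return the first URL found in a CSS style string, or None if there is none."""
    match = _URL_IN_STYLE.search(style)
    return match.group(1) if match else None


# Hypothetical usage:
# background_image_url('background-image: url("https://example.com/1.jpg");')
```

Returning None instead of raising an IndexError lets the download loop skip image tags that have no background image.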

with open(os.path.join("data", base, metadata['productId'], 'metadata.json'), 'w') as fp:
    json.dump(metadata, fp)

And finally, the metadata is saved as a JSON file.
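To use the scraped data later, the per-product JSON files have to be read back. Here is a minimal sketch of saving and reloading, assuming the data/<base>/<productId>/metadata.json layout used above. The function names and the root parameter are hypothetical.

```python
import json
import os


def save_metadata(metadata, root, base):
    """Write metadata to <root>/<base>/<productId>/metadata.json and return the path."""
    product_dir = os.path.join(root, base, metadata["productId"])
    os.makedirs(product_dir, exist_ok=True)  # create missing folders first
    path = os.path.join(product_dir, "metadata.json")
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(metadata, fp, ensure_ascii=False, indent=2)
    return path


def load_dataset(root, base):
    """Read every metadata.json under <root>/<base> into a list of dicts."""
    records = []
    base_dir = os.path.join(root, base)
    for product_id in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, product_id, "metadata.json")
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as fp:
                records.append(json.load(fp))
    return records
```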

Final Notes

Web scraping is a great tool for exploring websites and extracting useful datasets. In this article, we scraped the Myntra website to extract product data. This data can be used to generate new clothing designs with GAN models, and it can serve as a dataset for machine learning tasks such as image classification.

The work shown here must be used for research purposes only. Web scraping should never be used for unethical practices.
