How Does Python Help to Scrape News Articles from CNN?

  • A quick overview of web pages and HTML
  • Python web scraping with BeautifulSoup
! pip install beautifulsoup4 requests html5lib
  • find_all(element tag, attribute): Lets us locate any HTML element on a page by specifying its tag and attributes. It returns every element that matches; to get only the first match, use find() instead.
  • get_text(): Retrieves the text contained in an element once it has been located.
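As a minimal sketch of how these functions behave, the snippet below parses a small made-up HTML string. The markup, the `headline` class and the variable names are illustrative assumptions, not taken from any real news site. It uses Python's built-in `html.parser`, so nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only
sample_html = """
<html><body>
  <h2 class="headline"><a href="/news/1">First story</a></h2>
  <h2 class="headline"><a href="/news/2">Second story</a></h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every element matching the tag and attribute
headlines = soup.find_all("h2", class_="headline")

# find() returns only the first match (or None if nothing matches)
first_headline = soup.find("h2", class_="headline")

# get_text() extracts the text inside an element
titles = [h.get_text() for h in headlines]
first_link = first_headline.find("a")["href"]

print(titles)
print(first_link)
```

Because find() returns None when nothing matches, it is worth checking its result before calling get_text() on it.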
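# Importing the necessary packages
import requests
from bs4 import BeautifulSoup

# Request the cover page; url must be set to the address of the page to scrape
r1 = requests.get(url)
coverpage = r1.content

# Parse the HTML (the 'html5lib' parser requires the html5lib package)
soup1 = BeautifulSoup(coverpage, 'html5lib')

# Collect every <h2> element with the class 'articulo-titulo'
coverpage_news = soup1.find_all('h2', class_='articulo-titulo')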
# Scraping the first 5 articles
number_of_articles = 5

# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []

for n in range(number_of_articles):

    # Only news articles (there are also albums and other things)
    if "inenglish" not in coverpage_news[n].find('a')['href']:
        continue

    # Getting the link of the article
    link = coverpage_news[n].find('a')['href']
    list_links.append(link)

    # Getting the title
    title = coverpage_news[n].find('a').get_text()
    list_titles.append(title)

    # Reading the content (it is divided in paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    body = soup_article.find_all('div', class_='articulo-cuerpo')
    x = body[0].find_all('p')

    # Unifying the paragraphs
    list_paragraphs = []
    for p in range(len(x)):
        paragraph = x[p].get_text()
        list_paragraphs.append(paragraph)
    final_article = " ".join(list_paragraphs)

    news_contents.append(final_article)
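The paragraph-joining step in the loop above assumes every article page contains the expected body element. A slightly more defensive variant might look like the sketch below. The function name `extract_article_text`, the `article-body` class and the sample markup are hypothetical and chosen only for illustration; the built-in `html.parser` is used so the example runs without extra packages.

```python
from bs4 import BeautifulSoup


def extract_article_text(html, body_class):
    """Join the text of all <p> tags inside the first <div> with the given class.

    Returns an empty string when no such <div> is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_=body_class)
    if body is None:
        return ""
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    return " ".join(paragraphs)


# Hypothetical article page, invented for this example only
sample_article = """
<div class="article-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div class="footer"><p>Unrelated text.</p></div>
"""

text = extract_article_text(sample_article, "article-body")
missing = extract_article_text(sample_article, "no-such-class")
print(text)
```

Returning an empty string for a missing body lets the scraping loop skip or log such pages instead of stopping with an IndexError.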



Scraping Intelligence

