I Mined Thousands Of News Articles To Find The Most Trending Topics For Tesla In The Past Year. Here Are The Results

Saurav Modak
Published in ScrapeHero
Sep 13, 2023

There is no shortage of gossip about Elon Musk. Over the past year, he has consistently found ways to stay in the news and set off trends. Perhaps that is exactly what you would expect from one of the wealthiest people on the planet.

But what were the most newsworthy events at his extraordinary companies during the past year? We can all hazard a guess, but I wanted to arrive at these conclusions from evidence, from data. And so I set out on a quest to gather the data first.

While the most straightforward way to get this data would be to scrape it from news websites, that approach comes with its own challenges. Scraping is complex, and more so across hundreds of news websites. Each site has its own structure, which changes from time to time, and maintaining all those scrapers would be a daunting task for a single person like me.

If only there were a way to get all this data in a structured format, just using API calls.

That’s where the good people at Intell came into the picture. They are a full-stack news service that gathers news from thousands of news websites and presents the data in a well-structured way. You can download the data from their website or use their friendly APIs to get it programmatically.

Since I will be doing the analysis in Python, I decided to take the second route: use their APIs to download the data and then process it in Jupyter notebooks.

Getting The API Key & Downloading The Data

Once you sign up for their service, you can get an API key from their portal: go to Settings, then Integration, and generate an API key there. It will be an alphanumeric string that looks like gibberish, but keep it secret.

In Python, the best practice is to create a .env file and add the API key to it.

APIKEY=<your api key here>

Then, we can use the python-dotenv package to read the API key from the file. This ensures the API key is never hard-coded in your source.

import os

from dotenv import load_dotenv

# Load the variables from the .env file into the environment
load_dotenv(override=True)
APIKEY = os.environ['APIKEY']

Next, to fetch the data, we can use the requests package. For this article, we will limit ourselves to 10,000 news items: large enough to give reasonable results, but not so large that it slows down our experimentation.

We will use the function below to fetch the data iteratively, since Intell sends 5,000 articles at a time. To get cleaner data, we also make sure the keyword we want appears in the title.

We store all the titles in a list, which we will cluster later.

import requests

def fetch_data(keyword=None, next_url=None):
    if next_url is None:
        # First page: fetch using the keyword and date range
        payload = {
            'token': APIKEY,
            'offset': 1000,
            'startDate': '2022-09-01',
            'endDate': '2023-08-31',
            'organizations': keyword
        }
        try:
            resp = requests.get('https://app.intell.me/api/v3/feed/', params=payload)
            return resp.json()['result']
        except Exception:
            return None
    else:
        # Subsequent pages: follow the pagination URL returned by the API
        try:
            resp = requests.get(next_url)
            return resp.json()['result']
        except Exception:
            return None

resp = fetch_data('tesla')

We can pass the keyword to the function. Here, we are only looking for articles that mention Tesla.

We can run the above function in a loop until we have collected the required number of articles (10,000).

title_list = []
has_results = True
keyword = 'tesla'
i = 0
next_url = None

while has_results:
    i = i + 1

    # First iteration uses the keyword; later ones follow the pagination URL
    if next_url is None:
        resp = fetch_data(keyword)
    else:
        resp = fetch_data(next_url=next_url)

    if resp is None:
        print('=== Exception From API ===')
        has_results = False
        break

    next_url = resp['nextUrl']
    data = resp['data']
    last_date = data[-1]['date']

    # Keep only the titles that actually mention the keyword
    for datum in data:
        title = datum['title'].lower()
        if keyword in title:
            title_list.append(title)

    if len(title_list) >= 10000:
        has_results = False

    print('=== i: {} ==='.format(str(i)))
    print('=== titles: {} ==='.format(len(title_list)))
    print('=== Done till: {}'.format(last_date))

This will take a while to finish. You can watch the print statements to monitor progress.

Once it’s done, you can view the list to see what’s inside.
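For instance, you can peek at the first few entries in a Jupyter cell:

# Inspect the first ten collected titles
title_list[:10]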

List of all article titles

This looks fine. Now, let’s go to the next step.

Finding The Most Trending Topics Of Tesla

Now comes the final step: finding Tesla’s most trending topics of the last year. We can use topic clustering for this and look for the largest clusters. However, instead of traditional methods such as TF-IDF or a count vectorizer, we can use sentence embeddings, whose vector representations also capture the context of the text.

We can use a package called BERTopic, which provides an easy interface to all of this: embedding the sentences and clustering them.

Create a topic model and fit it on the title list.

from bertopic import BERTopic

# Embed the titles and cluster them in one call
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(title_list)

After the model has been fit, we can use the get_topic_info function to list the clusters. Keep in mind that the largest chunk will be topic -1, which contains all the unclustered (outlier) articles.

topic_model.get_topic_info()

The biggest relevant topic is thus the one with ID 0.

We can get the relevant keywords of the topic with the get_topic function.
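For the biggest cluster, that is:

# Keywords and their scores for topic 0
topic_model.get_topic(0)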

Topics

Tesla’s most significant talking point over the last year was its stock. What was the next most significant trend?
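We can query it the same way:

# Keywords for the second-largest cluster
topic_model.get_topic(1)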

Topic at index 1

It was related to charging and chargers.

We can also get the individual documents belonging to each cluster.

# Map each title to its assigned topic, then filter for topic 0
df = topic_model.get_document_info(title_list)
df[df['Topic'] == 0][['Document']]
Documents of a topic

Getting Trends Of Other Companies

We can similarly get trends for other companies, like SpaceX, Neuralink, and so on. Substitute any company name into this code, and you will get that company’s trends as a result.
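As a rough sketch, here is how that might look; the collect_titles helper is a hypothetical wrapper around the fetch loop shown earlier, not something from the original code:

def collect_titles(keyword, limit=10000):
    # Hypothetical convenience wrapper around fetch_data: pages through
    # the API until `limit` matching titles have been collected
    titles, next_url = [], None
    while len(titles) < limit:
        resp = fetch_data(keyword) if next_url is None else fetch_data(next_url=next_url)
        if resp is None:
            break
        next_url = resp['nextUrl']
        titles += [d['title'].lower() for d in resp['data'] if keyword in d['title'].lower()]
    return titles

spacex_titles = collect_titles('spacex')
spacex_model = BERTopic()
spacex_topics, spacex_probs = spacex_model.fit_transform(spacex_titles)
spacex_model.get_topic_info()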

Trend analysis like this can give us key insights into what a company was doing during the past year and its most newsworthy contributions.

Improving The System

This system can be tuned in several ways to get even better results.

  1. Changing the embedding model: By default, BERTopic uses all-MiniLM-L6-v2, which is usually a good model for capturing the semantic similarity of documents. However, you can swap in another model if you like.
  2. Using dimensionality reduction techniques: Sentence embedding models map text into high-dimensional vectors, which are hard to work with and visualize. You can use dimensionality reduction to project the vectors into lower dimensions, though you will likely lose some information.
  3. Changing clustering parameters: By default, BERTopic uses HDBSCAN. You can replace this model entirely or tune its hyperparameters. A combined sketch of all three tweaks follows this list.
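Here is a rough sketch combining all three; the specific model names and parameter values are illustrative assumptions, not settings from the original experiment:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Swap the default embedding model for a larger sentence-transformers model
embedding_model = SentenceTransformer('all-mpnet-base-v2')

# Project the embeddings down to 5 dimensions before clustering
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Require bigger clusters so only strong trends surface
hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean', prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(title_list)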

If you’ve found this article helpful or intriguing, don’t hesitate to give it a clap! Your feedback helps me, as a writer, understand what resonates with my readers.

Follow ScrapeHero for more insightful content like this. Whether you’re a developer, an entrepreneur, or someone interested in web scraping, machine learning, AI, etc., ScrapeHero has compelling articles that will fascinate you.
