Photo credit: Joshua Hoehne on Unsplash

Find the Trending Topics in Data Science and Software Development using ML & Python!!

Chesta Dhingra
Data And Beyond


Harnessing Selenium, Python, and ML for Dynamic Insights into the Tech World’s Pulse.

There are three major reasons behind this research project:

  1. Identify community focus: analyzing trends from the past 30 days highlights the areas the data science and software development communities are focusing on.
  2. Enhance Python skills: the project serves as a practical platform for sharpening my Python coding abilities.
  3. Advance NLP competency: lastly, applying machine and deep learning principles to this endeavour helps elevate my skills in Natural Language Processing.

With this overview, I trust you have a solid understanding of the project’s objectives and its significance in the evolving world of Data Science. Now, let’s not delay any further.

It’s time to delve into the practical journey of this research: from the meticulous collection of data using Selenium to analyzing the patterns and deriving insights with machine learning algorithms.

Collecting Data

At the start of any such project, collecting and cleaning the data is always the hard part. Here I took advantage of one of the popular Python libraries, Selenium, to extract data from several prominent data science blogs. Importantly, this task was conducted with a steadfast commitment to the ethical guidelines governing web scraping. The three major elements required are each article’s title, date, and subtitle. We can also include the name of the site if we want to do site-wise analysis. First, we’ll define the sources from which we’ll be collecting the data:

towards_ds_url = "https://towardsdatascience.com/latest"

data_science_central_url = "https://www.datasciencecentral.com/articles/"

levelup_url = "https://levelup.gitconnected.com/latest"

Great!! We have defined our sources; now let’s start collecting the data by defining the elements relevant to the analysis, specifically for the latest 30 days.

The pivotal tool that acts as an intermediary between Python and the browser is the driver. The most common browser is Chrome, so I’ll be using the Chrome driver, which seamlessly connects the Python Selenium library to the browser so that the browser can automate the data collection process, as sketched below.
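Here is a minimal sketch of how the Chrome service and options passed into the setup function might be configured. The driver path and the headless flag are assumptions; adjust them for your own environment (newer Selenium versions can also resolve the driver automatically via Selenium Manager).

from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# assumed local path to the ChromeDriver binary -- adjust to your machine
chrome_service = Service("/path/to/chromedriver")

chrome_options = Options()
chrome_options.add_argument("--headless=new")  # optional: run without opening a browser window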

"""defining the functions to setup the driver, parse the dates
and scrape the sites to collect the data
"""

def setup_driver(chrome_service, chrome_options): #function to setup the driver based on the browser you are using
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
driver.maximize_window()
return driver


# each blog site has its own way of storing the dates based on that we have create two different
# parse date functions
def parse_towards_ds_date(date_element):
date_string = date_element.get_attribute('datetime')[:-1] # Adjust attribute as needed
return datetime.fromisoformat(date_string)

def parse_data_science_central_date(date_element):
date_string = date_element.get_attribute('content') # Adjust attribute as needed
return datetime.strptime(date_string, "%Y-%m-%d")

# finally we'll be creating the logic to collect the data from all three different resources.
def scrape_articles(driver, url, xpath_title, xpath_subtitle,xpath_date, parse_date, days_back=30, scroll_pause_time=30):
data = []
end_date = datetime.now() - timedelta(days=days_back)
try:
driver.get(url)
html = driver.find_element(By.TAG_NAME, "html")
while True:
titles = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, xpath_title)))
subtitles = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, xpath_subtitle)))
date_elements = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, xpath_date)))

for title, subtitle,date_element in zip(titles, subtitles,date_elements):
article_date = parse_date(date_element) # Pass the entire element to the parse_date function
if article_date < end_date:

raise StopIteration

if not any(d['title'] == title.text for d in data):
print(f"Title: {title.text}, Date: {article_date}")
data.append({"title": title.text, "date": article_date,'SubTitle': subtitle.text if subtitle.text else None})

html.send_keys(Keys.PAGE_DOWN)
time.sleep(scroll_pause_time)

except Exception as e:
logging.error(f"An error occurred: {e}")
finally:
driver.quit()
return data
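As a usage sketch, the scraper can then be wired together roughly as follows. The XPath expressions here are hypothetical placeholders, not the selectors actually used on each site; they depend on the page markup at the time of scraping.

# illustrative only -- the real XPath selectors depend on each site's current markup
towards_ds_data = scrape_articles(
    driver=setup_driver(chrome_service, chrome_options),
    url=towards_ds_url,
    xpath_title="//article//h2",      # hypothetical placeholder selector
    xpath_subtitle="//article//h3",   # hypothetical placeholder selector
    xpath_date="//article//time",     # hypothetical placeholder selector
    parse_date=parse_towards_ds_date,
)

Repeating the call with data_science_central_url and levelup_url (each with its own selectors and date parser) gives three result lists that can be concatenated into the single DataFrame referred to below as combine_data2.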

Data Cleaning

Once data is harvested from the different sources, our focus shifts to cleaning it. This is one of the most crucial steps when dealing with unstructured data, particularly in the context of Natural Language Processing.

The data we have collected is rich in information but also contains numeric and non-alphanumeric characters that are not relevant to this project. To refine our dataset we remove these characters, and we also eliminate punctuation and stopwords, those common words that offer little to no value for our trend analysis.

This cleansing step is not merely tidying up the data; it’s about sculpting it into a form that’s primed for insightful analysis.

"""
Defining the function that will clean the title and subtitle columns of the dataset.
First standardize the data via lowering all the uppercase character to lower.
Eleminating the non-alpha values.
Removing the trailing spaces.
and lastly tokenizing the words using split method.
"""

def simple_clean_text(x):
x = x.lower()
x = re.sub('[^a-z]'," ",x)
x = re.sub(' +',' ',x).strip()
words = x.split()
return " ".join(words)

# applying the function on the dataset after combining all the rows of the data
# that we have collected

combine_data2['title'] = combine_data2['title'].apply(lambda x: simple_clean_text(x))
combine_data2['SubTitle'] = combine_data2['SubTitle'].apply(lambda x: simple_clean_text(x))
combine_data2.head()
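Note that the regular-expression cleaning above handles casing, numbers, and punctuation; the stopwords themselves are dropped later by the vectorizer’s stop_words="english" argument. If you prefer to strip them during cleaning instead, a sketch using NLTK’s stopword list (an extra dependency, not part of the original pipeline) could look like this:

# optional alternative: remove stopwords during cleaning with NLTK
# (requires: pip install nltk, plus a one-time download of the stopword list)
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stop_words)

combine_data2['title'] = combine_data2['title'].apply(remove_stopwords)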

The data is collected and cleaned. Now it’s time for the really exciting exercise, where we leverage the power of machine learning algorithms to pinpoint the areas that are garnering significant attention in the data science and software development communities.

Trend Analysis

For the analysis we’ll take advantage of Python’s scikit-learn library. Since a computer understands only numbers, it is essential to convert the text of the documents into a matrix of token counts using CountVectorizer from scikit-learn.


from sklearn.feature_extraction.text import CountVectorizer

no_features = 1000  # maximum number of words to consider in the analysis
no_topics = 10      # number of topics to extract
no_top_words = 10   # number of top words to display for each topic

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words="english")

# after defining the vectorizer we fit it on the data that we have cleaned
tf = tf_vectorizer.fit_transform(combine_data2['title'])
tf_feature_names = tf_vectorizer.get_feature_names_out()

Once the text is converted into a matrix, we’ll implement Latent Dirichlet Allocation (LDA), a statistical model used to extract abstract topics from a collection of documents. The main algorithms used to infer the hidden structure in LDA are Variational Bayes and Gibbs sampling.

At its core, LDA is a Bayesian probabilistic model: it assumes that there are latent (hidden) variables that determine the observable data (the words in each document).

For example, suppose we have a large collection of news articles spanning sports, politics, and entertainment, and we need to organize them. LDA is a statistical method that helps us group them by topic.

First, it assumes that each document is a mixture of topics and each topic is a mixture of words. It then makes random guesses about the topic composition of each document and the word composition of each topic, and iteratively improves those guesses. The Bayesian approach comes in when it updates these beliefs based on new evidence: every time LDA sees a word in a document, it updates its belief about the composition of that document and the relevant topics.

With this basic understanding of LDA, we can apply it to our CountVectorizer-transformed data using the LatentDirichletAllocation() class from scikit-learn.


from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50, random_state=0).fit(tf)
"""
The model is fitted on the dataset and now we want to see the top 10 words of each
abstract topic that LDA has uncovered from the titles we have collected.
"""

# to extract the top n words for each topic, we define a function
# that takes the model, the feature names, and the number of top words as parameters
def display_topics(model, feature_names, no_top_words):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        topics[f"Topic {topic_idx}"] = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
    return topics

# this will create 10 topics and display the top 10 words for each of them
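For completeness, a quick call that prints the resulting topic-word summary might look like this, using the lda model, tf_feature_names, and no_top_words defined above:

topics = display_topics(lda, tf_feature_names, no_top_words)
for topic_label, top_words in topics.items():
    print(f"{topic_label}: {top_words}")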
Output of the above code, displaying 10 topics with the 10 most frequent words in each.

Now, we will generate labels based on the most relevant words for each topic to accurately represent their content.
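One simple way to do this (purely illustrative; the actual topic-to-label assignment depends on the words each topic surfaces in your run) is to map topic indices to hand-picked labels:

# hypothetical mapping -- assign labels only after inspecting each topic's top words
topic_labels = {
    "Topic 0": "AI & Models",
    "Topic 1": "Coding & Implementation",
    "Topic 2": "Data Science & Python",
    "Topic 3": "Machine Learning Tools",
    # ... remaining topics labelled after manual inspection
}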

From our analysis, we conclude that in the past 30 days, the data science and machine learning community has primarily focused on AI & Models, followed by Coding & Implementation. The third major area of interest lies at the intersection of Data Science & Python and Machine Learning Tools.

Moving forward, in my upcoming article, I will delve into Deep Learning methodologies aimed at text generation and title recommendations, utilizing this very dataset.

I hope you found this article insightful. You can find the full data-scraping code and the analysis code at the GitHub links provided. For more such content, follow me on Medium.

Additionally, you can access the curated dataset on Kaggle through the link provided below.

Your feedback and engagement are always appreciated.

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
