Fast Data Science - Medium

Effective project management in data science

Thomas Wood — Fri, 22 Dec 2023 10:33:39 GMT

Unlocking the Complexities of Managing Data Science Projects

In a world that is extensively digitised, data science has become a paramount tool in helping businesses maximise their performance, make strategic decisions, and outshine their competitors. The role of project management within this sphere is critical but often met with unique challenges. In this article, we explore an answer to a fundamental question — How does project management work in data science?

To comprehend the complexity of project management in data science, we need to understand that traditional management methodologies don’t often fit perfectly into the realm of data science. Data science projects usually entail a long exploration phase with numerous unknown factors, which is starkly different from traditional software development where deliverables and timelines can be predefined.

Traditional project management methodologies are often categorised into:

Waterfall — often represented by a Gantt chart illustrating tasks and dependencies.

Agile — tasks are divided into short development cycles called sprints.

Kanban — visualizes work progression from left to right, representing stages like to do, in progress, and done.
CRISP-DM — A data science-centric approach involving: business understanding, data understanding, data preparation, modelling, evaluation, and deployment.

The significant limitations of these traditional methods lie in their rigidity that doesn’t cater to the iterative and explorative nature of data science projects. Hence, the key to succeeding in project management within the data science sector is flexibility and adaptability.

Advice on applying project management to data science:

Avoid setting a rigid project structure right from the start. Allow an initial week or so to explore and understand the context of the project better.
Assess Business Needs. Understanding what the business needs is crucial. The requirements may vary: a predictive model, a standalone analysis, a full-scale website, an API, and so forth. Remember to leave room for adjustments, as requirements may change at any stage of the project.

Data science projects typically involve the following phases:

Understanding the context.
Understanding the available data.
Building a prototype.
Defining KPIs and requirements in collaboration with the stakeholder.
Refining the model and integration with real-time data.
Testing and Deployment.
Project completion and maintenance planning.

The need for a flexible approach means that regular meetings and open, detailed communication with all stakeholders are essential. You must ensure that both the business and the data scientists are kept updated, and that all required data, access, and cooperation are provided.

In conclusion, though traditional project management approaches such as Agile or Waterfall are beneficial, they must be adapted to suit the demands of data science projects. A mix of exploratory, iterative, and empirical methods is vital to make project management effective in data science.

For more insights on project management in data science, visit FastDataScience. You will also find resources such as an in-browser Gantt chart generator, a project kickoff checklist, a roadmap planner, and much more to assist your journey.

Effective project management in data science was originally published in Fast Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deep learning for bespoke recommender systems

Thomas Wood — Tue, 19 Dec 2023 10:33:39 GMT

Exploring the Power of Deep Learning in Optimising User Experience in Dating Apps

Matchmaking with deep learning: recommender systems for dating

Whether finding a product online or seeking a partner for dating, the role of recommender systems stretches over various domains, playing a critical role in our decision-making process. Let’s delve in and see how these systems revolutionise matchmaking in the dating industry.

What is a Recommender System?

Online retailers, such as Amazon, often suggest ‘similar products’ once you make a purchase. These suggestions arise from an area of machine learning known as recommender systems.

The standard approach for these recommender systems is to fill matrices with information about various products and analyse the relationships between them. You may have noticed that most products you buy come with similar products recommended alongside, usually those that often end up in the same basket.

Recommender Systems for Dating — the Challenge

Switch the focus to a dating website, and things get tricky. While it’s simple to suggest products based on previous purchases, as online retailers do, recommending a partner is not as straightforward. There are countless users, each unique and with different preferences.

We usually have data on:

User’s profile text
Profile photo
Contact requests (if any).

With this information, we use a deep learning technique called vector embeddings to make recommendations.

How it Works

The system converts each profile text into a ‘fingerprint’, that is, a vector in a 100-dimensional space.
While individually the 100-dimensional vector holds little meaning, similar tastes typically result in similar vectors.
To recommend potential partners to a new user, the system calculates their vector, determines the distances to other vectors, and finds the nearest neighbours!
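
The steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `embed` function below uses the hashing trick as a stand-in for a trained deep learning model, and the profile names and texts are made up. Only the 100-dimensional vector size and the nearest-neighbour lookup come from the description above.

```python
import hashlib
import math

DIM = 100  # vector size mentioned in the text


def embed(text, dim=DIM):
    """Toy stand-in for a learned embedding: hash each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product = cosine similarity


def cosine_distance(a, b):
    """Distance between two unit-length vectors (0 means identical direction)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def nearest_neighbours(query_text, profiles, k=2):
    """Return the names of the k profiles whose vectors are closest to the query."""
    query = embed(query_text)
    scored = [(cosine_distance(query, embed(text)), name) for name, text in profiles.items()]
    return [name for _, name in sorted(scored)[:k]]


# Hypothetical profile texts, invented for this example.
profiles = {
    "alex": "hiking climbing mountains camping outdoors",
    "sam": "opera theatre classical concerts museums",
    "jo": "mountains hiking skiing outdoors travel",
}
print(nearest_neighbours("outdoors hiking mountains", profiles, k=2))
```

In a real system the vectors would come from a trained model and the search would use an approximate nearest-neighbour index, but the shape of the computation is the same.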

Broadening the Horizon: Other Industry Applications

These text-based recommender systems are not only for dating apps. They are useful in other sectors like:

Recruitment websites: Applicants upload their CVs, and the platform suggests relevant jobs.
Real-estate platforms: Property descriptions and photos are used for recommendations.

But bear in mind that while there are off-the-shelf recommender systems for retail or movie suggestions, image-based or text-based recommendations require highly customised solutions, given the complexity and specificity of the data.

At Fast Data Science, we offer consulting services after decades of learning from experience and dealing with machine learning and natural language data. If you have abundant text or image data and seek an advanced recommender system, we’d love to hear from you. Learn more here or leave a comment below.


The similarities and differences between neural networks and the human brain

Thomas Wood — Thu, 07 Dec 2023 10:33:39 GMT

Unlocking the Mystery: Exploring the Parallels Between Neural Networks and the Human Brain

How similar are Neural Networks to our Brains?

Have you ever come across an unusual animal or plant and tried to figure out what it might be? You might not realise it, but the process that your mind goes through in trying to identify the organism follows the same principle that drives the operation of artificial neural networks. Your brain compares the image against thousands to millions of ‘reference images’ stored in it. It then runs quick checks against the animal species information it has accumulated over the years. This process uses your biological neural network to reprocess past experiences, making it possible to deal with the unfamiliar situation at hand.

The Evolution of Artificial Neural Networks

Early artificial intelligence research pondered the idea of creating machines capable of learning, adapting, and making decisions much like the human brain. Artificial neurons were therefore developed, inspired by the brain’s biological neurons. These artificial neurons could be interconnected in intricate ways to create artificial neural networks (ANNs) capable of producing more complex outputs.

The concept of artificial neural networks dates as far back as 1943, with major milestones like Frank Rosenblatt’s Perceptron in the 1950s. The Perceptron was an artificial neural network that could “learn” based on data examples, essentially emulating the functionality of biological neural networks.

However, the Perceptron showed weaknesses in dealing with certain problems, especially non-linear functions. Years later, deep learning with hidden layers of neurons was introduced to resolve these challenges. In fact, around 2006, the combination of powerful GPU processors, big data, and cloud computing renewed interest in the potential of artificial neural networks. Today, they are key components of voice assistants, image and facial recognition technology, online translation services, and search engines.
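
To make the Perceptron’s weakness concrete, here is a minimal sketch of a single neuron trained with the classic perceptron learning rule. The learning rate, epoch count and function names are my own choices for illustration. The neuron learns the AND function, which is linearly separable, but it can never get XOR fully right, because no single straight line separates the XOR classes.

```python
def predict(w, b, x1, x2):
    """Step activation: fire (1) only if the weighted sum is positive."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule for a single neuron with two inputs."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            # Nudge the weights towards the target whenever the neuron is wrong.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained neuron classifies correctly."""
    correct = sum(predict(w, b, x1, x2) == target for (x1, x2), target in samples)
    return correct / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_perceptron(data)
    print(name, accuracy(data, w, b))
```

Adding a hidden layer of neurons between input and output is what lets a network represent functions like XOR.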

Comparing Neural Networks and the Human Brain

The human brain is undoubtedly the most complex and potent information processor known to man. Similarly, artificial neural networks seek to replicate the efficacy of the brain in processing information. We can witness the success of this in AI systems that have matched, and even surpassed, the human brain in tasks like object recognition and language translation.

Neurons are common to both human brains and artificial neural networks. However, they function differently in the two cases. A biological neuron has dendrites for receiving information and an axon for outputting information, both extending from the cell body. In the artificial neuron, however, input and output are taken directly from the neuron.

Despite these differences, artificial neural networks still seek to replicate the brain’s capability to process large volumes of information in complex ways. Remember, the human brain has on the order of 100 billion neurons and about 100 trillion synapses, which are the junctions between neurons. It’s a level of complexity artificial neural networks aspire to achieve.

Want to know more about the similarities and differences between the human brain and neural networks? Check out this article on Fast Data Science!

Today, technology has advanced to the point where we have artificial neural networks that can perform functions with a reasonable semblance to the workings of a human brain. However, achieving the complexity and efficacy of the human brain remains a challenge at the forefront of AI and neural network research. As we continue to understand more about the functioning of our brains and deepen our understanding of AI, it’s only a matter of when, not if, we will achieve this, and who knows what other incredible milestones lie ahead!


Exploring the Power of Natural Language Processing in Business: 8 Practical Examples

Thomas Wood — Mon, 04 Dec 2023 10:33:40 GMT

Exploring the Power of Natural Language Processing in Business: 5 practical examples

How Natural Language Processing is used in business

What is Natural Language Processing (NLP)? — Real World Examples

Natural language processing has been gaining significant attention over the years, and it's all for the right reasons. To put it simply, NLP is a field that aims at getting machines to communicate or interact with humans in our language. You might already be familiar with applications like autocorrect, spell checks, and search engines. They make use of NLP.

To give you a better idea of how impactful NLP can be, here are five example applications that may pique your interest. And if your business deals with massive volumes of text data, hiring a capable NLP consultant like Fast Data Science would indeed be a wise move.

5 Insightful Applications of Natural Language Processing in Business

1. Text Information Extraction

Large volumes of textual data can be difficult to handle, even for businesses. For instance, pharmaceutical firms need to go through large piles of regulatory and clinical trial documents. Sifting through these to find relevant information is an uphill task.

Here’s where NLP helps. With natural language processing tech, these tedious tasks can be automated, saving you time and resources.

2. Spell Check in Forms

With almost everyone owning a smart gadget, typing errors have become very common. NLP rectifies these by using neural networks that correct homonyms and even adapt to languages with complex morphology.

3. Information Retrieval and Answering Queries

This powerful application of NLP retrieves information and answers queries in a context-aware manner. These capabilities stem from the use of transformer models and other sophisticated neural networks.

4. Converting Between UK and US English Spelling

NLP algorithms can readily convert or normalise a document’s spelling between UK and US English to keep it uniform and accurate. I encountered this problem so often that I made my own library for US-UK spelling conversions and put it on PyPI for others to use: https://pypi.org/project/localspelling/ (install with pip install localspelling).
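
To give a feel for the simplest form of this task, here is a dictionary-lookup sketch. It is not the localspelling API, and it makes no claim about how that library works internally. The function name `to_variant` and the four-word dictionary are hypothetical, chosen only to keep the example short.

```python
import re

# Tiny illustrative word list; a real converter needs a far larger dictionary.
UK_TO_US = {
    "colour": "color",
    "organise": "organize",
    "centre": "center",
    "analyse": "analyze",
}
US_TO_UK = {us: uk for uk, us in UK_TO_US.items()}


def to_variant(text, target):
    """Convert known words to the target variant ("us" or "uk"); leave the rest alone."""
    mapping = UK_TO_US if target == "us" else US_TO_UK

    def swap(match):
        word = match.group(0)
        replacement = mapping.get(word.lower())
        if replacement is None:
            return word
        # Preserve a leading capital letter.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


print(to_variant("Colour the centre", "us"))
print(to_variant("We analyze data", "uk"))
```

A plain lookup like this ignores inflected forms and words whose spelling depends on meaning, which is where a dedicated library earns its keep.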

5. Language Identification

NLP algorithms can swiftly identify the language of a given text. This is done either by identifying stopwords that are characteristic of each language or by recognising patterns in the text.
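
The stopword approach is simple enough to sketch directly. The stopword lists below are short samples I picked for illustration, not a complete resource, and the function name is my own. The idea is just to count how many words of the text appear in each language’s stopword list and pick the language with the most hits.

```python
import re

# Small sample stopword lists, assumed for illustration only.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "et", "est", "de", "les", "un", "une", "que"},
    "german": {"der", "und", "ist", "das", "nicht", "ein", "zu", "mit"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(identify_language("The cat is in the garden"))
print(identify_language("Le chat est dans le jardin"))
print(identify_language("Der Hund ist nicht im Garten"))
```

This works well on full sentences but struggles with very short texts, which is where pattern-based methods tend to do better.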

The above examples are just a glimpse of what NLP can achieve. Fast Data Science, a highly experienced NLP consulting firm, will help you leverage NLP’s power for your business.

To learn more about the real-world applications of natural language processing, visit this page. You may also like my blog post about Natural Language Processing Tools: The Latest Trends and Developments.

If you or your business require NLP consultation or services, feel free to get in touch with us. We are always ready to help.


Predicting clinical trial risk with AI: elevating efficiency in healthcare through natural language…

Thomas Wood — Fri, 01 Dec 2023 10:33:39 GMT

Predicting clinical trial risk with AI: elevating efficiency in healthcare through natural language processing

Exploring the Application of AI and NLP in Improving Clinical Trial Risk Assessment and Reducing Failure Rates.

The Role of AI & NLP in Clinical Trial Risk Assessment

Clinical trials pave the way to scientific breakthroughs, but approximately 90% end in failure. A subset of these failed trials can be categorised as ‘uninformative’. This occurs when a clinical trial is designed, implemented or reported in a way that prevents it from delivering scientifically valid information. As a result, valuable resources are lost, and the ethical implications of involving human subjects in futile trials become a critical concern.

To address this problem, Fast Data Science developed the Clinical Trial Risk Tool, an AI-powered tool that leverages NLP to detect potential risks of uninformative results in clinical trials. If your organisation needs a tailored AI strategy to leverage machine learning for healthcare projects, feel free to reach out to us.

The Aspects of an Informative Clinical Trial

In order to be scientifically informative, a clinical trial must meet the following conditions:

The hypothesis addresses an unresolved scientific question of significance
The study design allows for the gathering of significant evidence pertaining to the question
The study is feasible, i.e., the recruitment of necessary participants is attainable
The study is conducted with scientific rigour
The study’s results are reported accurately, completely, and promptly.

The Role of AI & NLP

Identifying potential problematic components of a clinical trial protocol — such as having an inadequate Statistical Analysis Plan (SAP), planning to recruit an insufficient number of participants, or being unsure about the expected effect size — requires a significant investment of expertise and resources.

This is where AI, specifically natural language processing (NLP), can assist. NLP is capable of identifying key points in the document, drawing the attention of human experts to those sections, and quickly triaging and flagging potential risks.

Clinical Trial Risk Tool Workflow

We built the Clinical Trial Risk Tool in collaboration with domain specialists. Factors that we considered included: pathology, presence of a SAP, whether the effect estimate has been stated, the number of subjects and arms, the countries of execution, and the use of simulations in determining the sample size.

The tool was designed to use an ensemble of rule-based, machine learning (random forest) and neural network models to compute a score between 0 and 100 based on these factors. Based on the score, it also categorises the trial as HIGH, MEDIUM, or LOW risk.
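
The final scoring step might look something like the sketch below. To be clear, this is not the tool’s actual code. Every weight and threshold here is hypothetical, I have assumed that a higher score means lower risk, and only the rule-based part is shown (the random forest and neural network components that extract the features are left out).

```python
from dataclasses import dataclass


@dataclass
class ProtocolFeatures:
    """Features assumed to have been extracted from a protocol already."""
    has_sap: bool
    effect_estimate_stated: bool
    used_simulation: bool
    num_subjects: int


# Hypothetical weights and thresholds, chosen only for illustration.
POINTS = {"has_sap": 30, "effect_estimate_stated": 25, "used_simulation": 15}
MAX_SUBJECT_POINTS = 30
SUBJECTS_FOR_FULL_POINTS = 1000


def score_protocol(f):
    """Combine the features into a score from 0 to 100 (assumed: higher = lower risk)."""
    score = 0.0
    score += POINTS["has_sap"] if f.has_sap else 0
    score += POINTS["effect_estimate_stated"] if f.effect_estimate_stated else 0
    score += POINTS["used_simulation"] if f.used_simulation else 0
    capped = min(max(f.num_subjects, 0), SUBJECTS_FOR_FULL_POINTS)
    score += MAX_SUBJECT_POINTS * capped / SUBJECTS_FOR_FULL_POINTS
    return max(0.0, min(100.0, score))


def risk_level(score):
    """Map the score to a risk category using hypothetical cut-offs."""
    if score >= 70:
        return "LOW"
    if score >= 40:
        return "MEDIUM"
    return "HIGH"


example = ProtocolFeatures(has_sap=True, effect_estimate_stated=True,
                           used_simulation=False, num_subjects=0)
print(score_protocol(example), risk_level(score_protocol(example)))
```

Keeping the final step as transparent point-scoring means a reviewer can see which factors pushed a protocol into a given category.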

The tool has been made accessible via a web interface, allowing users to upload a trial protocol in PDF format. It then presents the risk score and level to the user, simplifying the workflow of a clinical trial reviewer.

Open Source Contributions

To foster wider use of the tool, Fast Data Science has open-sourced the project on GitHub under the MIT licence. This allows other researchers and developers to modify, extend and improve the tool to meet their specific needs.

Future Directions

The team is exploring ways to extend the functionality of the Clinical Trial Risk Tool. Possibilities include adding more pathologies and locations, estimating trial complexity, cost, or other parameters besides risk, and continuously refining the AI model based on user feedback and updates in clinical trial methodologies.

For more information on how AI and NLP can facilitate clinical trial risk assessment, visit https://fastdatascience.com/how-can-we-assess-the-risk-of-a-clinical-trial-using-ai.


Challenges in the stages of a data science project: an insider’s guide to success

Thomas Wood — Tue, 07 Nov 2023 10:33:39 GMT

Unravelling the realities of data science project flow: from initial client meetings to final deployment

Most people envision a data science project as a neatly organised process comprising equal parts of data cleaning, data analysis, and deployment. However, reality seldom conforms to this utopian view.

Nevertheless, understanding the major stages of a data science project can help to manage expectations and allocate resources more effectively.

The Importance of Data Cleaning

Data science projects often commence with a substantial amount of data that needs cleaning and wrangling. This process can be time-consuming and will often overlap with the data analysis portion of the project.

Because cleaning and analysis absorb so much of the schedule, the tendency is to underestimate the time required for deployment.

The Challenge of Deployment

Also integral to every data science project is the deployment stage. It requires consistent dialogue between the technical and business teams — a process that can be fraught with organisational politics. As such, the deployment phase might extend far beyond its anticipated completion time.

The Critical First Step: Securing Data

The success of a data science project is predicated on the availability of data. The initial stages of the project often include:

Gaining the client’s trust
Signing a Non-Disclosure Agreement (NDA)
Accessing the data
Navigating the client’s systems
Identifying key stakeholders

These initial steps can often take about a month to complete. However, they are essential to ensure the project does not stall for lack of data.

Read the full article about “The Key stages in a data science project” here.

Overcoming Data Access Challenges

Data access can pose considerable challenges, particularly when data is protected by stringent regulations. Therefore, gaining access to a company’s internal data requires a substantial degree of trust between the client and the data science consultancy.

The Solution: Thorough Planning

To preempt potential issues that may delay the project, thorough planning is paramount:

Send a list of requirements to the client one month before the project start date
Arrange a kickoff meeting for a week after the email, ideally securing some — if not all — of the requirements
Continually correspond with stakeholders to ensure everything is in place for the project’s inception

By undertaking these steps, the project can progress without any data-related hindrances.

The Bottom Line

Countless obstacles can hinder a data science project, foremost among them a lack of data. Expenses incurred through undue delays can be mitigated with meticulous planning and proactive communication.

For more guidance on planning your data science project, visit the Fast Data Science website.


Face recognition technology

Thomas Wood — Tue, 31 Oct 2023 10:31:47 GMT

Exploring the Evolution of Face Recognition: From Eigenfaces to Deep Learning

Building a Face Recogniser: Traditional Methods vs Deep Learning

Over the past few decades, facial recognition technology has gone from being prohibitively inefficient to being an indispensable tool used by social media platforms, security departments, and smartphone companies around the world. The arrival of deep learning in this domain has sharply increased the efficiency and the range of use cases of face recogniser apps and tools.

Here’s a look at how the methodology of building a face recogniser has evolved over the years.

Traditional face recognition: Eigenfaces

In the 1980s and 90s, the inception of Eigenfaces marked the first significant step in face recognition technology. These are blurry, face-like images derived from a set of face photographs by combining them pixel by pixel (technically, through principal component analysis). The aim was to recognise an unknown face by expressing it as a combination of Eigenfaces and comparing the result with known faces.

However, this method wasn’t without its flaws. Shifting a face image by a few pixels could lead to misrecognitions, making the approach unreliable.
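
The alignment problem is easy to demonstrate. The toy sketch below leaves out the Eigenface projection entirely and just compares images pixel by pixel, which is the underlying weakness. The 6×6 ‘image’ with a single bright stripe is a made-up stand-in for a face, and the function names are mine.

```python
def pixel_distance(a, b):
    """Sum of squared pixel differences between two equally sized images."""
    return sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def shift_right(image, pixels):
    """Shift every row to the right, padding the left edge with zeros."""
    return [[0] * pixels + row[: len(row) - pixels] for row in image]


# 6x6 toy image: one bright vertical stripe in column 2.
face = [[1 if col == 2 else 0 for col in range(6)] for _ in range(6)]
shifted = shift_right(face, 1)          # same "face", moved one pixel
blank = [[0] * 6 for _ in range(6)]     # no face at all

print(pixel_distance(face, shifted))  # 12
print(pixel_distance(face, blank))    # 6
```

By this measure, the same face moved one pixel to the right is further from the original than an empty image is, which is why pixel-aligned methods need careful cropping and alignment first.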

The next generation: Facial feature points

To overcome the shortcomings of the Eigenface method, facial feature points were introduced. This approach identifies key points on a face, such as the corner of the mouth or an eyebrow, and compares their coordinates with those of other faces after correcting for slight misalignments.

This method, although better than Eigenfaces, didn’t utilise all available information such as hair colour, eye colour, and facial structures not captured by feature points.

You can find more details about the feature points method here.

The Deep Learning Revolution

The advent of deep learning and convolutional neural networks (CNNs) marked a paradigm shift in facial recognition technology. In a CNN, a stencil-like structure (a filter) repeatedly slides over an image, identifying subsections that match specific patterns.

Initially, the patterns identified are simple edges and corners. But, as the process is repeated, higher-level features like parts of an eye or an ear emerge, eventually leading to the recognition of a whole face.

The advantage of this approach is that the patterns are not predefined but derived from training the network with millions of face images.
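
The ‘stencil walking over an image’ can be written out by hand. In the sketch below the filter is a fixed vertical-edge detector that I chose for illustration, whereas in a trained CNN the filter values are learned from data. The tiny image and the function name are also assumptions for the example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before being applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 4x6 toy image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked 2x2 filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

response = convolve2d(image, kernel)
for row in response:
    print(row)  # [0, 0, 2, 0, 0]
```

The response is non-zero only where the stencil sits on the edge. Stacking many such filters, layer upon layer, is how a CNN builds up from edges to eyes and ears to whole faces.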

However, one of the challenges in developing a CNN-based face recogniser is the need for millions of images. A majority of developers rely on gathering images from the internet, but significantly more data can be collected when users willingly provide their personal photos. This explains why Facebook, Google, and Microsoft have impressively accurate face recognisers.

Road Ahead

Although deep learning has drastically improved face recognition technology, it has its limitations. Many companies utilise additional systems to correct for pose and lighting, often using a 3D mesh model of the face.

Machine learning-based facial recognition models are rapidly advancing, and we see impressive improvements year after year. For companies and brands who wish to incorporate the benefits of this blossoming technology, Fast Data Science, a leader in NLP, ML, and data science consultancy since 2016, offer world-class machine learning consultancy sessions.

Want to delve deeper into the world of face recognition? Explore here to learn more about building a face recogniser using traditional methods and deep learning.


Can forensic stylometry unmask an anonymous author?

Thomas Wood — Wed, 25 Oct 2023 09:32:18 GMT

Unmasking the hidden authors: How forensic stylometry revealed J.K. Rowling’s secret pen name and its implications for business and security.

J. K. Rowling, author of the Harry Potter series, was found to have written under a pseudonym thanks to forensic stylometry, a technique using natural language processing. (Image source: Wikipedia, licensed under Creative Commons.)

In 2013, J. K. Rowling released a new detective novel under the pseudonym Robert Galbraith, seeking a release from the Harry Potter frenzy. However, two professors of computational linguistics managed to prove that the mystery author was indeed Rowling herself, using a technique called forensic stylometry.

Here’s how they did it: by calculating a “fingerprint”.

By comparing the “linguistic fingerprint” of the Galbraith book to those of known authors, they found a likely match with Rowling. A linguistic fingerprint is a unique pattern of word use which all of us have, whether consciously or not.

[Figure: linguistic fingerprints obtained for three famous female authors who wrote under male pseudonyms]

This method of identifying text authors through their unique writing style is referred to as forensic stylometry.
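
One very simple way to build such a fingerprint is to measure how often an author uses common function words. The sketch below is a toy version under stated assumptions: the ten-word list, the sample texts and the author labels are all invented, and it does not reproduce the method used in the Rowling case.

```python
import math
import re
from collections import Counter

# A handful of common function words; real studies use far longer lists.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "but", "with"]


def fingerprint(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprints (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_likely_author(unknown_text, known_texts):
    """Return the author whose fingerprint is most similar to the unknown text's."""
    target = fingerprint(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine_similarity(target, fingerprint(known_texts[author])),
    )


# Made-up writing samples for two hypothetical authors.
known_texts = {
    "author_a": "the cat and the dog and the bird sat in the sun",
    "author_b": "it was a day that it rained but it was a joy to walk with a friend",
}
print(most_likely_author("the fox and the hare ran in the field", known_texts))
```

With real books the samples would run to tens of thousands of words, and the comparison would be made against many candidate authors rather than two.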

Modern technology such as ‘deep learning’ software, together with greater computational power, has refined and accelerated the process of forensic stylometry. Now all that’s needed is a large dataset, without the complexity of hand-crafting a recipe for the fingerprint.

I personally favour the Convolutional Neural Network, a deep learning technique originally designed for images but surprisingly effective for natural language processing!

Commercial applications of this technology are vast, from identifying the authors of threatening documents to parsing financial reports and even spam flagging.

If this subject piques your interest, or you need assistance developing this tech, feel free to reach out! You can do so via our contact form.

Mark your calendars! On 5th July 2018, we’re hosting a workshop on forensic stylometry at the Digital Humanities Summer School at Oxford University. Be sure to register here.

To learn more, head over here.


How AI and NLP are revolutionising healthcare services

Thomas Wood — Thu, 19 Oct 2023 09:33:38 GMT

Utilising AI in Healthcare: Enhancing Diagnostics and Patient Care with NLP and Document Similarity Algorithms

How AI in Healthcare is Transforming Lives of Doctors and Patients

AI in healthcare dramatically improves the efficiency and accuracy of routine tasks in the medical sector.

Particularly in disease diagnosis, AI is proving to be a pivotal development. A 2018 report highlighted an alarming 40% misdiagnosis rate among cancer patients. As such, improving diagnostic procedures has become a priority, and AI has shown promising results in facilitating these improvements. By reducing the risk of human error due to high caseloads and incomplete patient histories, AI is facilitating faster and more accurate diagnosis, thus helping healthcare professionals deliver better patient care.

For example, a recent study demonstrated an AI model’s prowess at diagnosing breast cancer more accurately than veteran pathologists.

Tracing AI’s Involvement in Healthcare

Artificial intelligence made its foray into the healthcare sector in the 1960s and 70s with the development of Dendral, the original expert system designed for organic chemistry applications. Based on Dendral, the MYCIN system, one of the earliest applications of AI in healthcare, was created.

With the ’80s and ’90s came the advent of microcomputers and increased network connectivity, adding new dimensions to the scope of AI in healthcare. This period fostered the realisation that AI systems could be designed to augment physicians’ expertise while accommodating imperfect data.

Today, advancements in technology and medicine have further ingrained machine learning into the healthcare sector. Improved computer power, growth of genome sequence databases, widespread implementation of electronic health records and developments in natural language processing have facilitated better data processing and collection, thereby bolstering the application of AI in healthcare.

AI in the Current Medical Landscape

AI is becoming more sophisticated with each passing year, opening up limitless possibilities for its application in healthcare, ranging from disease prevention to research, treatment, and diagnosis.

Particularly in medical research, AI is proving to be a game-changer. The journey from research to patient care is long and expensive, often costing more than £1.2 billion per medicine. Given these constraints, it’s easy to understand why AI is gaining importance in streamlining the drug discovery process, significantly reducing the time it takes for new drugs to enter the market.

In the realm of preventive healthcare, AI is empowering individuals to take control of their health by proactively managing their lifestyle. The Internet of Medical Things (IoMT) and AI-powered health apps encourage healthier living and significantly reduce the need for doctor visits.

To learn more about AI applications in healthcare, visit our website.

Pros and Cons of AI in Healthcare

While AI is widely appreciated for its numerous benefits, there are also potential downsides to consider.

AI in healthcare contributes to:

Improved disease diagnoses
Better healthcare reach for rural communities
Early prediction of potential health problems
Significant time and cost savings
Advanced surgical assistance
Augmented abilities and mental health support

However, it also has its cons:

Reduced personal interaction
Potential job loss in the healthcare sector
Risk of defective diagnoses
Possible social prejudice

For an in-depth discussion on the pros and cons of AI in healthcare, visit Fast Data Science.


NLP and Machine Learning for Document Similarity and Recommendations

Thomas Wood — Mon, 16 Oct 2023 09:33:38 GMT

How natural language processing can help find and recommend similar texts

Finding Similar Documents with Natural Language Processing

Have you ever wanted to find other documents in a database that are most similar to a given document? This is referred to as the document similarity problem, or the semantic similarity problem.

Imagine the following scenarios:

You’ve got a scientific paper (including the title, abstract, and full text), and you want to find which publications are most similar to it.
You’re operating a job search website and need an efficient way to compare job descriptions.
You want to identify job candidates that are similar to an existing candidate based on the text of their CV (résumé).
You have a set of questionnaires in a field such as psychology, pharmaceuticals, or market research, and need to find similar questionnaire items and match Likert scales (item harmonisation, or data harmonisation).
In a law firm, a lawyer needs to locate similar past legal cases to help with a current case.
A dating app’s algorithm needs to recommend similar matches based on a user’s “liked” profile.

These are just a few examples, but the possibilities are endless when you use natural language processing to solve your document similarity problem.

Before you dive into the technical aspect, it’s crucial to define the problem and identify what you need your document similarity model to achieve.

What Does Your Document Similarity Model Need to Achieve?

This is one question that needs to be answered before you can build an NLP model to calculate document similarity. Typically, there won’t be a pre-existing dataset showing which documents are similar to others.

Before carrying out any data science work, it would be prudent to generate some labelled data (a gold standard dataset) that you can use later to test and evaluate your model.

In some cases, it might be impossible to build a dataset that can evaluate your document similarity model. Subjectivity plays a part here, but it’s still possible to present some basic model recommendations to the stakeholders for evaluation.

Appraising a Document Similarity Model

There are several metrics you can use to judge a document similarity model. One well-known example is mean average precision (MAP). It is commonly used to evaluate the ranking quality of a search engine's recommendations, and it penalises models that rank relevant documents near the end of the list.

To get started, try using the mean average precision to evaluate your models on your gold standard dataset.
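As a rough illustration of how mean average precision rewards good rankings, here is a minimal sketch in plain Python. The document IDs and the two example rankings are hypothetical, and the function names are our own rather than any particular library's API.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for a single query.

    ranked_ids: document IDs in the order the model returned them.
    relevant_ids: set of IDs that the gold standard marks as relevant.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(results):
    """Mean of the average precision over (ranked_ids, relevant_ids) pairs."""
    if not results:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


# Hypothetical gold standard: documents "a" and "b" are the relevant ones.
good_ranking = (["a", "b", "c", "d"], {"a", "b"})  # relevant docs ranked first
poor_ranking = (["c", "d", "a", "b"], {"a", "b"})  # relevant docs ranked last

ap_good = average_precision(*good_ranking)
ap_poor = average_precision(*poor_ranking)
map_score = mean_average_precision([good_ranking, poor_ranking])
print(ap_good, round(ap_poor, 3), round(map_score, 3))
```

The ranking that places the relevant documents first scores higher than the one that places them last, which is the behaviour you want from an evaluation metric for recommendations.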

Bag of Words Approach to Document Similarity

The ‘bag of words’ model is perhaps the simplest way to compare two documents: just calculate the word overlap. The name ‘bag of words’ comes from the fact that the words are collected together in a ‘bag’, losing their sentence context.

One instance would be comparing these two sentences:

“India is one of the epicentres of the global diabetes mellitus pandemic.”

“Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed.”

Here, you can compute the Jaccard similarity index. First, remove stopwords such as ‘the’ and ‘and’. Then divide the number of distinct words that appear in both documents by the number of distinct words that appear in either document. This gives you the Jaccard index.

Despite their simplicity, bag-of-words measures such as the Jaccard index and the closely related cosine similarity are powerful because they are fast to compute and easy to implement.
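The Jaccard calculation on the two example sentences can be sketched in a few lines of Python. The stopword list here is a tiny illustrative assumption; a real project would normally use a fuller list from an NLP library.

```python
import re

# A tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "is", "of", "in", "one", "a", "an"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc1 = "India is one of the epicentres of the global diabetes mellitus pandemic."
doc2 = "Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed."

words1 = tokenize(doc1)
words2 = tokenize(doc2)
score = jaccard(words1, words2)

print(sorted(words1 & words2))  # the words shared by both sentences
print(round(score, 3))
```

With this stopword list, the two sentences share only ‘diabetes’ and ‘mellitus’ out of 12 distinct words, so the score is low even though both sentences are about the same condition.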

N-gram Document Similarity

The disadvantage of a bag-of-words approach is that it throws away contextual information and treats ‘diabetes’ and ‘mellitus’ as independent terms. We can address this with the N-gram approach. In this strategy, all two-word or three-word sequences (bigrams or trigrams) are indexed, and we calculate the Jaccard similarity index on word groups instead of individual words.
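As a minimal sketch of the N-gram idea, the snippet below builds bigrams for the two example sentences from the previous section and computes the Jaccard index over them. Stopwords are deliberately left in here so that word order is preserved; that is a design choice for this illustration rather than a rule.

```python
import re


def ngrams(text, n=2):
    """Return the set of n-word sequences (as tuples) found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc1 = "India is one of the epicentres of the global diabetes mellitus pandemic."
doc2 = "Diabetes mellitus occurs commonly in the older patient and is frequently undiagnosed."

bigrams1 = ngrams(doc1, 2)
bigrams2 = ngrams(doc2, 2)
shared = bigrams1 & bigrams2
bigram_score = jaccard(bigrams1, bigrams2)

print(shared)
print(bigram_score)
```

The only shared bigram is the pair ‘diabetes mellitus’, so the term is now matched as a single unit rather than as two unrelated words.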

Doc2Vec: Represent Documents as Vectors

Moving upwards in complexity and performance, we can use document vector embeddings. In this method, each document is represented as a vector. The distance between these vectors gives a measure of the similarity between the documents.

However, implementing this approach can be complex. It requires deep knowledge of Natural Language Processing and Machine Learning fundamentals.

The simplest way to use document vectors for similarity is to use an off-the-shelf language model such as SentenceBERT, available via HuggingFace, or OpenAI’s API. You can read more about how we achieved this in the Harmony project.
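Once each document has a vector, whichever embedding model produced it, comparing documents comes down to a simple calculation. The sketch below uses cosine similarity on small made-up vectors. The document names and the three-dimensional values are hypothetical stand-ins; real embeddings typically have hundreds of dimensions and would come from the model itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "diabetes_paper": [0.9, 0.1, 0.2],
    "insulin_paper": [0.8, 0.2, 0.3],
    "contract_law_case": [0.1, 0.9, 0.1],
}


def most_similar(query_id, embeddings):
    """Rank every other document by cosine similarity to the query document."""
    query = embeddings[query_id]
    scores = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("diabetes_paper", embeddings)
for doc_id, similarity in ranking:
    print(doc_id, round(similarity, 3))
```

With these made-up values, the insulin paper ranks above the contract law case for the diabetes paper, which is the kind of ordering a recommendation feature would present to the user.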

For more information about finding similar documents using Natural Language Processing and Machine Learning, visit the Fast Data Science website.

NLP and Machine Learning for Document Similarity and Recommendations was originally published in Fast Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.