We know that HIV/AIDS has been a common problem in India for a very long time. In this post I look at how the situation has changed since 2010.
The technologies I have used are Python, web scraping, natural language processing, unsupervised machine learning, artificial neural networks and convolutional neural networks.
Trust me, I will explain it all in the simplest way. Follow the steps:
i. Collect data from a reliable source; “Times of India” is a good choice. This is called “web scraping”.
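To make the scraping step concrete, here is a minimal sketch using only the Python standard library. It is not the code from my repository: the sample HTML is made up, the real “Times of India” markup will differ, and `fetch` is a hypothetical helper that you would point at whatever archive page you choose.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class HeadlineParser(HTMLParser):
    """Collects the text of every <a> tag that mentions one of the keywords."""

    def __init__(self, keywords):
        super().__init__()
        # Match whole words only, so "aids" does not match "raids".
        pattern = r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b"
        self.pattern = re.compile(pattern, re.IGNORECASE)
        self.headlines = []
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._buffer).split())
            if self.pattern.search(text):
                self.headlines.append(text)


def extract_headlines(html, keywords=("hiv", "aids")):
    parser = HeadlineParser(keywords)
    parser.feed(html)
    parser.close()
    return parser.headlines


def fetch(url):
    # Hypothetical helper: downloads one page. Not called in this sketch.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample HTML standing in for a downloaded page.
sample_html = """
<ul>
  <li><a href="/a1">HIV cases fall in state, says report</a></li>
  <li><a href="/a2">Monsoon arrives early this year</a></li>
  <li><a href="/a3">New AIDS treatment centre opens</a></li>
</ul>
"""
headlines = extract_headlines(sample_html)
print(headlines)
```

Only the two links that mention HIV or AIDS are kept; the weather headline is dropped.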
ii. After collecting the data we need to process it, because machine learning algorithms cannot work on raw lines of text. This is where natural language processing comes in. The step-by-step processing is given in the code.
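As a rough idea of what this cleaning looks like, here is a small sketch. The stop-word list is a tiny illustrative one that I made up for the example, and the exact steps in my code may differ (for instance, stemming or lemmatization could be added).

```python
import re

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "are", "and", "for", "on"}


def preprocess(text):
    """Lowercase the text, keep only alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


tokens = preprocess("HIV cases fall in the state, says report!")
print(tokens)
```

The punctuation, the capital letters and the filler words are gone, and what remains is a list of tokens that the next steps can count and embed.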
iii. Now the task is to find out how important a word is in a document. There are many ways to calculate this, but here I show only two methods. The first is tf-idf and the second is fastText.
In tf-idf we multiply how frequent a word is in a document by how unique that word is with respect to the entire corpus of documents. But calculating the weight of words is not enough, because it does not capture the semantic relation between words. For example, "king" and "man" are related to each other, which tf-idf cannot identify. To solve this, the Facebook AI team developed its own word embedding algorithm called fastText. We can also use Word2Vec through gensim. The difference is simple: suppose your vocabulary does not contain the word “XYZ”. Word2Vec has no vector for “XYZ”, so it cannot find words similar to it, but fastText can, because it builds word vectors from character n-grams.
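Both ideas can be sketched in a few lines of plain Python. The tf-idf below uses the textbook form (term frequency times the log of inverse document frequency); libraries such as scikit-learn add smoothing, so their numbers will differ. The `char_ngrams` function only shows how a word is broken into character n-grams, which is what lets fastText handle unseen words; real fastText uses a range of n-gram lengths and learns a vector for each one. The three documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking its boundaries."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Made-up, already tokenized documents.
docs = [
    ["hiv", "blood", "test"],
    ["hiv", "treatment", "centre"],
    ["hiv", "risk", "group"],
]
weights = tf_idf(docs)
print(weights[0])

# An unseen word still shares n-grams with a known one.
shared = set(char_ngrams("treatment")) & set(char_ngrams("treatments"))
print(sorted(shared))
```

Because "hiv" appears in every document, its weight is zero, while "blood" gets a positive weight in the first document. The shared n-grams show why fastText can still place "treatments" near "treatment" even if only one of them was seen in training.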
iv. Keep in mind that the data is not labeled. For unlabeled data, K-Means clustering is a suitable method to group the articles into clusters, which we can then label.
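To show what K-Means does, here is a bare-bones sketch in plain Python. The 2-D points are made up and stand in for the document vectors from the previous step, and the first k points are used as starting centroids just to keep the example deterministic. In practice a library implementation such as scikit-learn's `KMeans` is the more sensible choice.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up 2-D points standing in for document vectors.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

The points near the origin end up in one cluster and the points near (5, 5) in the other. The algorithm only gives cluster numbers; naming each cluster is done by reading what is inside it.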
After clustering, these are the relevant groups (cluster number, then topic):
0 Blood infections
2 Risk groups
4 Treatment centers/Facilities
The plots give the full report from 2010 to 2018 for every group that has been clustered.
v. Here come the most powerful algorithms: one is the Artificial Neural Network (ANN) and the other is the Convolutional Neural Network (CNN). We could use any ML model here, but neural networks are preferable for better accuracy. The ANN is a simple network of fully connected layers, whereas the CNN is a feed-forward network with one or more convolution layers that extract features from the text to get better accuracy. Note that RNNs are also commonly used for text classification, but they are a little more complex, and a CNN often gives good accuracy as well.
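To show what a convolution layer actually extracts from text, here is a toy sketch in plain Python. It is not my trained model: the embeddings and the kernel weights are made up, and a real CNN (for example in Keras) learns many such kernels. The point is that the same small set of weights slides over every window of neighbouring words, while a fully connected layer would need a separate weight for every position.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide `kernel` (a list of weight vectors, one per position in the window)
    over `sequence` (a list of token vectors) and apply ReLU to each output."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias
        for token_vec, kernel_vec in zip(window, kernel):
            total += sum(t * w for t, w in zip(token_vec, kernel_vec))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(values):
    """Keep only the strongest response, wherever it occurred in the sentence."""
    return max(values)


# Made-up 2-dimensional embeddings for a 4-token sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One made-up kernel of width 2 (it looks at two neighbouring tokens at a time).
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv1d(sentence, kernel)
feature = global_max_pool(feature_map)
print(feature_map, feature)
```

The kernel responds most strongly to the first pair of tokens, and max pooling keeps that single value as the feature. A classifier layer on top of many such features then predicts the group.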
Here we see the detailed year-by-year analysis of HIV/AIDS-related cases. We can say that HIV/AIDS cases have gone down at present compared to previous years, which I attribute to the tremendous awareness campaigns. Both the Artificial Neural Network (ANN) and the Convolutional Neural Network (CNN) have given good accuracy with minimum loss. The code is on my GitHub profile.