Many websites list large numbers of projects in any given domain, so how do you choose one that matches your area of interest? Our solution is a project recommendation system that helps you find projects you are interested in quickly and efficiently. We collected many projects across different domains from various websites.
“A recommendation system is one that foretells the preferences of users according to their inquiries.”
Like any recommendation system, a project recommendation system holds information about previously completed projects and recommends them according to the user’s interests. Ours covers projects in the Data Science field only. Anyone who wants to explore a particular area of interest can select from the top items that match their preferences and build on them with their own innovations.
- The dataset was generated by scraping the course web pages of different universities, such as Stanford, across many academic sessions.
- We scraped the project reports that students wrote for various subjects to fulfil their course project requirements.
- Only reports were chosen because they cover every aspect of a project, from problem definition to implementation.
- Scraping was mostly done with “Beautiful Soup”, as the pages were mostly HTML.
- Our dataset is a CSV file with the following columns:
- Document Id: A unique Id for every document.
- Project Name: Title of the project.
- Course Name: Name of the subject in which the project was made.
- University: University/college in which the project was made.
- Link: Hyperlink to the project report in PDF format.
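As a rough illustration of the collection step, the sketch below pulls PDF links out of a course page and writes them into a CSV with the columns listed above. It is only a sketch under stated assumptions: the HTML snippet, the course and university names, the base URL and the helper names (`PdfLinkParser`, `build_rows`) are all hypothetical, and Python's built-in `html.parser` stands in for Beautiful Soup so the example has no external dependencies.

```python
import csv
import io
from html.parser import HTMLParser

COLUMNS = ["Document Id", "Project Name", "Course Name", "University", "Link"]


class PdfLinkParser(HTMLParser):
    """Collects (title, href) pairs for anchors that point to PDF files."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a PDF anchor.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical course page; real pages differ from university to university.
SAMPLE_HTML = """
<ul>
  <li><a href="reports/tumor-detection.pdf">Tumor Detection with CNNs</a></li>
  <li><a href="syllabus.html">Syllabus</a></li>
  <li><a href="reports/stock-forecast.pdf">Stock Price Forecasting</a></li>
</ul>
"""


def build_rows(html, course, university, base_url):
    """Turn every PDF link on the page into one dataset row."""
    parser = PdfLinkParser()
    parser.feed(html)
    return [
        {
            "Document Id": doc_id,
            "Project Name": title,
            "Course Name": course,
            "University": university,
            "Link": base_url + href,
        }
        for doc_id, (title, href) in enumerate(parser.links, start=1)
    ]


rows = build_rows(
    SAMPLE_HTML, "Machine Learning", "Example University", "https://example.org/course/"
)

# Write the rows to an in-memory CSV; a real run would write to a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Non-PDF links such as the syllabus are skipped, which mirrors the choice to keep only project reports.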
Each PDF file is accessed through the link in the CSV file and converted to a text file so that we can extract all of its written text.
This involves the following steps:
i. Tokenize the text.
ii. Remove all irrelevant characters (numbers and punctuation).
iii. Convert all characters to lowercase.
iv. Remove stop-words.
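The four steps can be sketched in a few lines of plain Python. This is a minimal sketch, not the pipeline we actually ran: the `STOP_WORDS` set is a tiny hypothetical list (a real run would use a full stop-word list from an NLP library), and whitespace splitting stands in for a proper tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "in", "on", "to",
    "is", "are", "was", "were", "for", "with",
}


def preprocess(text):
    """Tokenize, strip digits/punctuation, lowercase, and drop stop-words."""
    tokens = text.split()                                  # i. tokenize
    tokens = [re.sub(r"[^A-Za-z]", "", t) for t in tokens]  # ii. remove numbers, punctuation
    tokens = [t.lower() for t in tokens]                   # iii. lowercase
    return [t for t in tokens if t and t not in STOP_WORDS]  # iv. remove stop-words


cleaned = preprocess("The 3 Models were trained, and evaluated in 2020!")
print(cleaned)
```

Tokens that become empty after cleaning (pure numbers, for instance) are dropped together with the stop-words.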
1. LDA (Latent Dirichlet Allocation)
“LDA approach to topic modelling considers each document as a collection of topics in a certain proportion. And each topic as a collection of keywords, again, in a certain proportion.”
Once we provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics until it obtains a good composition of the topic-keyword distribution.
For every word in every document, two probabilities, P1 and P2, are calculated for each topic. P1 = p(topic t | document d) is the proportion of words in document d that are currently assigned to topic t. P2 = p(word w | topic t) is the proportion of assignments to topic t, over all documents, that come from word w. The current word’s topic assignment is then replaced with a new topic chosen with probability proportional to the product of P1 and P2. In this step, the model assumes that all existing word-topic assignments except that of the current word are correct. The product is essentially the probability that topic t generated word w, so it makes sense to resample the current word’s topic using it.
After a number of iterations, a steady state is reached where the document-topic and topic-term distributions are fairly good. This is the convergence point of LDA.
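The update described above can be sketched as a tiny collapsed Gibbs sampler. This is an illustrative toy, not the library implementation we used: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count and the four-document corpus are all assumptions made for the example.

```python
import random


def gibbs_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # words in doc d assigned to topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                           # all words assigned to topic t
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Treat every assignment except the current word's as correct.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                weights = []
                for t in range(num_topics):
                    # P1: smoothed share of document d currently in topic t.
                    p1 = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    # P2: smoothed share of topic t's assignments that are word w.
                    p2 = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p1 * p2)

                # Resample the topic with probability proportional to P1 * P2.
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return doc_topic, topic_word, topic_total


toy_docs = [
    ["neural", "network", "image", "network"],
    ["image", "neural", "classification"],
    ["stock", "price", "forecast", "stock"],
    ["forecast", "market", "price"],
]
doc_topic, topic_word, topic_total = gibbs_lda(toy_docs, num_topics=2)
```

The smoothing terms keep every topic reachable even when its current count is zero; without them a topic that loses all its words could never be chosen again.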
2. HDP (Hierarchical Dirichlet Process)
HDP uses mixtures of words to assign topics to documents. It is an extension of LDA in which the number of topics in the corpus need not be known beforehand, so we can use the Hierarchical Dirichlet Process when the topic count is uncertain. HDP places a common base distribution over the possible topics for the corpus, and from this distribution a finite set of the required topics is sampled for each document.
The advantage of HDP is that it determines the number of topics from the data and is not bound to a number specified in advance.
3. Latent Semantic Analysis (LSA, LSI)
“LSA uses a matrix of documents and terms and will decompose it into a separate document topic matrix and a topic term matrix”. It attempts to leverage the context around words to capture hidden concepts, also known as topics. It uses singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among the columns. It is based on the distributional hypothesis, according to which semantically close words occur in the same kinds of text.
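To make the decomposition concrete, here is a small sketch using NumPy's SVD on a term-document count matrix. The four toy documents and the choice of `k = 2` topics are assumptions for illustration; our experiments used a topic modelling library rather than this hand-rolled version.

```python
import numpy as np

# Hypothetical toy corpus: two vision-style and two finance-style documents.
docs = [
    ["neural", "network", "image"],
    ["image", "neural", "classification"],
    ["stock", "price", "forecast"],
    ["forecast", "stock", "market"],
]
vocab = sorted({w for d in docs for w in d})

# Term-document matrix: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc:
        A[vocab.index(w), j] += 1

# SVD, truncated to the k strongest latent concepts (topics).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_topic = U[:, :k]                          # one row per term
doc_topic = (np.diag(s[:k]) @ Vt[:k, :]).T     # one row per document


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


same_theme = cosine(doc_topic[0], doc_topic[1])
different_theme = cosine(doc_topic[0], doc_topic[2])
```

In the reduced space, the two documents that share vocabulary end up pointing in the same direction, while documents with disjoint vocabulary stay orthogonal, which is the similarity structure the truncation is meant to preserve.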
Beyond the traditional bag-of-words model, you can also try other representations, such as bi-gram, tri-gram and TF-IDF models, with any of these topic modelling techniques. These may improve your results.
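One simple way to build bi-gram and tri-gram features is a sliding window over the tokens, as sketched below. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and this is only one possible approach: phrase-detection tools that keep just the frequently co-occurring word pairs are a common alternative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single feature."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=3):
    """Count unigrams, bi-grams and tri-grams together in one bag."""
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


tokens = ["deep", "neural", "network", "training"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
bag = bag_of_ngrams(tokens)
```

The resulting bag can be fed to a topic model in place of the plain bag-of-words counts, so multi-word terms such as `neural_network` become keywords in their own right.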
The main aim of document clustering is to organise a large number of text documents into a small number of meaningful clusters, where each cluster represents a specific set of topics.
Here we used the Doc2Vec technique, which creates an embedding of a document irrespective of its length. Doc2Vec computes a feature vector for every document in the corpus. The Doc2Vec model is based on Word2Vec and only adds another vector (the paragraph ID) to the input.
After clustering, we can apply the following techniques:
- Apply LDA to each cluster and extract the keywords of that cluster.
- Use the Tf-Idf score to extract the most relevant words of each cluster.
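The Tf-Idf option can be sketched by treating each cluster as one large document and ranking its words. The cluster contents below are made up, and the exact weighting (relative term frequency times the log of inverse cluster frequency) is an assumption, since several Tf-Idf variants exist.

```python
import math
from collections import Counter


def top_tfidf_words(clusters, top_n=3):
    """clusters maps a cluster id to a list of tokenized documents."""
    # Merge each cluster's documents into a single bag of words.
    counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    n_clusters = len(counts)

    # Number of clusters in which each word appears.
    cluster_freq = Counter()
    for bag in counts.values():
        cluster_freq.update(bag.keys())

    result = {}
    for cid, bag in counts.items():
        total = sum(bag.values())
        scores = {
            w: (c / total) * math.log(n_clusters / cluster_freq[w])
            for w, c in bag.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cid] = ranked[:top_n]
    return result


# Hypothetical clusters of already pre-processed documents.
clusters = {
    0: [["image", "tumor", "model"], ["tumor", "scan", "model"]],
    1: [["stock", "price", "model"], ["stock", "market", "model"]],
}
keywords = top_tfidf_words(clusters)
```

A word such as `model` that occurs in every cluster gets a score of zero, so it drops to the bottom of each ranking and does not end up describing any cluster.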
Now we manually assign a relevant name to each group of keywords with the help of topic visualisation.
Here the user gets a list of all the topics the recommendation system has and selects one of them to get a list of links to previous projects on that topic.
After the user enters ‘ML in biomedical application’, the system outputs the list of project links for that topic.
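At its simplest, this lookup is a mapping from the manually assigned topic names to project links. The sketch below assumes such a mapping already exists; the second topic name, all URLs and the function names are hypothetical placeholders, not entries from our dataset.

```python
# Hypothetical index from annotated topic names to project report links.
TOPIC_INDEX = {
    "ML in biomedical application": [
        "https://example.org/reports/tumor-detection.pdf",
        "https://example.org/reports/ecg-classification.pdf",
    ],
    "Time series forecasting": [
        "https://example.org/reports/stock-forecast.pdf",
    ],
}


def list_topics(topic_index):
    """Topics shown to the user, in alphabetical order."""
    return sorted(topic_index)


def recommend(topic_index, topic):
    """Return the project links for the chosen topic, ignoring case and padding."""
    lookup = {name.lower(): links for name, links in topic_index.items()}
    return lookup.get(topic.strip().lower(), [])


topics = list_topics(TOPIC_INDEX)
links = recommend(TOPIC_INDEX, "ML in biomedical application")
```

An unknown topic returns an empty list rather than raising an error, which keeps the interaction simple for the user.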
The coherence score is used to evaluate the models. It scores a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic, which helps distinguish topics that are semantically interpretable from topics that are artefacts of statistical inference.
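To give a feel for what a coherence measure computes, here is a sketch of the UMass variant, which is based on how often a topic's top words appear in the same documents. This is a swapped-in illustration: UMass is one of several coherence measures and is not necessarily the one behind the scores we report, and the toy corpus and function name are assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average UMass score over ordered pairs of a topic's top words."""
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for earlier, later in combinations(range(len(top_words)), 2):
        w_earlier, w_later = top_words[earlier], top_words[later]
        d_earlier = doc_freq(w_earlier)
        if d_earlier == 0:
            continue  # word never occurs in the corpus; skip the pair
        co_occurrences = doc_freq(w_later, w_earlier)
        scores.append(math.log((co_occurrences + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical pre-processed corpus.
corpus = [
    ["neural", "network", "image"],
    ["neural", "network", "training"],
    ["stock", "price", "market"],
    ["stock", "market", "forecast"],
]
coherent = umass_coherence(["neural", "network"], corpus)
mixed = umass_coherence(["neural", "stock"], corpus)
```

Words that tend to appear together score higher than words that never share a document, so a topic mixing unrelated words is penalised.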
Here LDA, HDP and LSI give coherence scores of 39%, 52.72% and 31.3% respectively, and clustering with LDA gives a score of 41%. Although HDP gave the highest coherence score, it produces overlapping clusters. So we fine-tuned the parameters of the LDA model, and it gave the best results with tri-grams at 20 topics. We also observed that clustering with LDA gives relevant clusters, so both can be used in the recommendation system.
1. Akanksha Dewangan, MT19049 (firstname.lastname@example.org)
Data collection (scraping from various websites), pre-processing, applying LSI, HDP and LDA (with bag-of-words), and visualisation for each topic modelling technique. Annotating each topic and accordingly assigning each document to the topic it best belongs to.
2. Priyanka Boral, MT19127 (email@example.com)
Data collection (scraping from various websites), pre-processing, applying LDA, HDP and LSI with all their variations (including bi-gram, tri-gram and Tf-Idf models).
3. Sudiksha Aapan, MT19018 (firstname.lastname@example.org)
Data collection (scraping from various websites), applying clustering techniques with a Doc2Vec implementation, then applying LDA and Tf-Idf models to the clusters.
This project was done for the course Information Retrieval 2020 at IIIT-Delhi, under the guidance of Dr. Tanmoy Chakraborty and the helpful TAs: Anubhav Shrimal, Vrutti Patel, Abhinav Gupta, Hirdoy Shankar Dutta and Jasmeet Kaur.