A Self-help guide to starting your own Machine Learning research project

Soumyadeep Roy
8 min read · Jun 30, 2022

During my research journey, I learned new things and faced a lot of challenges. I believe a structured orientation will be useful for students starting out as researchers, and it would definitely have helped my younger self.

As I talked to more and more grad students, I began to realize that the experiences are quite heterogeneous and vary significantly from one another.

Eureka! A self-help guide assimilating advice from past researchers and professors.

I believe the knowledge contained in this blog series will help make the research experiences more equitable and enjoyable.

This is a part of the “Research for All” initiative, which aims to promote research awareness and make machine learning research more accessible.

How to give technical talks + Life-cycle of a Research Project

First, to introduce myself: I am a 3rd year Ph.D. student at IIT Kharagpur, India, and have been a part of the research fraternity for close to 5 years, first as an intern, then as a Master’s student, and now as a Ph.D. student.

What will you get to learn by the end of this blog series?

First, I will present my own research project life-cycle, based on first-hand practical experience, along with a list of research practices I observed and picked up along the way. I had the honor of working with a talented team of researchers and seniors, and in two research labs, each with a unique work culture of its own.

Secondly, I have tried to collect articles by eminent researchers and provide them as a reading list associated with each subtopic.

I believe this will provide a balanced view based on my personal, first-hand experience and more experienced guidance by eminent researchers on the same topic.

Your feedback is crucial to help me further improve this article, so don’t hold back and let me know about your views in the comments or by simply mailing me.

Specifically, you will learn about …

Part 1: Motivation, research from a self-improvement perspective, life-cycle of a research project, problem ideation, reading list (current article)

Part 2: Problem verification, baseline setup, novelty (will be out soon!)

Part 3: Soft skills including delivering a technical presentation, collaboration, and work productivity (another blog article, so do give it a read. 8 min. reading time)

Related Works

Interestingly, and quite contrary to my starting hypothesis, I stumbled upon a treasure trove of learning resources, tutorials, and how-tos by eminent professors and researchers to help guide us at every step of the way.

https://mentorship.aclweb.org/ webpage, last seen on July 1, 2022

Along similar lines, to make the research experience more equitable across institutions for students pursuing a Ph.D. in natural language processing, Zhijing Jin (Ph.D. student in NLP at Max Planck Institute, co-organizer of the ACL Year-Round Mentorship Program) has made available an open-source GitHub repo containing reading material for research sub-topics like:

What Is Weekly Meeting with Advisors like?

Coming Up with Good Research Ideas

How to Read Papers.

I definitely recommend giving it a thorough read, as it provides a deep and comprehensive overview of the resources already available and will help researchers start off on the right foot.

How is this blog series any different?

If you go through the reading list provided at the end, you will find that hardly any material is written from the context of Indian academia, let alone from a student’s perspective.

I tried to put together whatever I learned from my first-hand experience and what worked for me.

So, you can expect this piece to be highly opinionated, focusing on “what worked” instead of “how it should have worked ideally”.

Now that the stage is set, let’s start!

Life-cycle of a Research Project

Prerequisites

  • Choice of broad domain — Text, Vision, Graphs, Speech
  • Basics of Machine Learning and Deep Learning
  • Python, PyTorch
  • Optional: R, other DL frameworks
  • Git commands

Proposed Workflow

Let’s talk about the first and perhaps the most challenging step — Problem Ideation.

Deep dive into Problem Ideation

We will break it down into steps, and I will present my take on tackling each one.

Selecting top conferences

  • Some well-known CS conferences:
      • General: WebConf, CIKM, AAAI, KDD, ICDM
      • Health: BioKDD, CHIL, ACL BioNLP workshop, ACM Transactions on Computing for Healthcare (a journal)
      • Natural Language Processing: ACL, EMNLP
      • Computer Vision: CVPR, ECCV, MICCAI
      • Information Retrieval: SIGIR, ECIR, WSDM
      • Recommender Systems: RecSys
      • Social: ICWSM, JCDL
      • Top-tier machine learning and data mining venues: NeurIPS, ICLR, AISTATS, KDD
  • Conference deadlines (https://twitter.com/_ConferenceList)

Accessing research papers

  • Papers from conferences such as ACL, EMNLP, and NAACL are freely available on the ACL Anthology
  • Non-commercial preprint servers: arXiv, bioRxiv
  • Rxivist combines biology preprints from bioRxiv and medRxiv with data from Twitter to help you find the papers being discussed in your field
  • Unpaywall
  • Search for the paper in Google Scholar, select the article, click on ‘All [number] versions,’ and check if any one of them has a PDF version available
  • Check whether the first or last author has a personal homepage; an author’s copy is usually available there.

Some heuristics to filter Machine Learning papers

  • Published within the last 3 years, since ML is a very fast-growing field
  • Reasonable citation count
  • Authors from a reputed institution (first or principal author)
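These three heuristics can be applied mechanically once you have a list of candidate papers. The sketch below is only an illustration: the `Paper` record, the `passes_heuristics` function, and the threshold of 10 citations are my assumptions, not fixed rules, and whether an institution counts as "reputed" is still a manual judgment. Very recent papers naturally have few citations, so you may want to lower the threshold for them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    """Hypothetical record for one candidate paper."""
    title: str
    year: int
    citations: int
    reputed_institution: bool  # judged manually from the first or principal author


def passes_heuristics(paper, current_year=None, max_age_years=3, min_citations=10):
    """Return True if the paper satisfies all three filtering heuristics.

    max_age_years and min_citations are assumed defaults; tune them for your field.
    """
    if current_year is None:
        current_year = date.today().year
    is_recent = current_year - paper.year <= max_age_years
    is_cited = paper.citations >= min_citations
    return is_recent and is_cited and paper.reputed_institution


# Made-up entries, only to show how the filter behaves.
candidates = [
    Paper("Recent, well-cited paper", 2021, 120, True),
    Paper("Older paper", 2015, 900, True),
    Paper("Recent, rarely cited paper", 2022, 2, True),
]
shortlist = [p.title for p in candidates if passes_heuristics(p, current_year=2022)]
print(shortlist)
```

With `current_year=2022`, only the first entry survives: the second is too old and the third falls below the assumed citation threshold.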

Finding Candidate Topics — Talks/Tutorials/others

Reading papers — exploration phase

This is perhaps the most time-consuming part, but it was also the most exciting part of the process for me. First, start by reading the abstract and introduction section of the paper. Then create a paper summary where you summarize the following aspects:

The problem statement, key takeaways, and limitations or scope.

Read figures and tables along with their captions.
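If you prefer to keep your paper summaries in a consistent, searchable form, a small structured record works well. The sketch below is one possible template, not a standard format: the `PaperSummary` class and its field names are my own, chosen to mirror the three aspects listed above, and the example entry is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PaperSummary:
    """Hypothetical template for one paper summary."""
    title: str
    problem_statement: str
    key_takeaways: list = field(default_factory=list)
    limitations: list = field(default_factory=list)

    def to_text(self):
        """Render the summary as plain text, e.g. for a notes file."""
        lines = [
            f"Title: {self.title}",
            f"Problem statement: {self.problem_statement}",
            "Key takeaways:",
        ]
        lines += [f"  - {item}" for item in self.key_takeaways]
        lines.append("Limitations or scope:")
        lines += [f"  - {item}" for item in self.limitations]
        return "\n".join(lines)


# Placeholder content, only to show the shape of a filled-in summary.
summary = PaperSummary(
    title="<paper title>",
    problem_statement="<one or two sentences from the abstract and introduction>",
    key_takeaways=["<main result>", "<what the figures and tables show>"],
    limitations=["<dataset or setting the claims are restricted to>"],
)
print(summary.to_text())
```

Keeping every summary in the same shape makes it easier to compare papers side by side when you later cluster problem statements.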

The ideal output of the problem ideation stage

  • Start with some keywords (NLP, medical, summarization)
  • Identify 10–15 papers — create paper summaries
  • Brainstorm with collaborators or colleagues or yourself
  • Cluster problem statements
  • List research challenges
  • Resources required: domain knowledge, labeled data availability, the skillset of authors, the time required, server requirements of deep learning experiments

Image source: https://i.ebayimg.com/images/g/YJMAAOSwEVhfx1jO/s-l400.jpg

A crucial resource that helped me at all stages of the research process

Twitter.

Yes, this may sound new to some, but believe me: once you start using Twitter as a learning resource, it is a gold mine.

It has been my go-to and is still one of the best platforms to stay up-to-date with the research in my domain.

I started following some well-known researchers and research labs, and that gave me access to:

  1. Details about their recently published research paper
  2. Their commentary and opinion of high-impact research articles. If you follow the discussion thread, you will get to learn a lot!
  3. Job notifications — internship, postdoc, and Ph.D. positions
  4. Work productivity and mental health issues and solutions

I started following well-known Twitter handles and hashtags like #AcademicTwitter, @PhDVoice, and @jenheemstra (now there are many more), where the issues faced by students are discussed and useful tips or perspectives are shared from time to time.

If you do not find an exact match to what you are looking for, please feel free to post your question. Someone will surely get back to you.

Parts to come under the “Research for All” initiative

Part 2 containing details regarding Baseline Setup and Novelty will be out soon.

Part 3 contains a guide to improving the soft skills and day-to-day skills of a researcher (separate article already published, 8 min. reading time).

Reading List

I am quite pleased to report that a lot of resources and advice from eminent researchers and institutions are freely available. As a believer in positive realism, I will say that you just need to read, read, and continue reading …

  1. The Missing Semester of your CS Education https://missing.csail.mit.edu/
  2. How to be successful as a Ph.D. student: This document also has its own reading list at the end
  3. Stanford CS Ph.D. Orientation 2021
  4. Stanford CS Red Book: Please read section 1.4 on “Advisors: Choosing advisors, Communicating with advisors”, and section 1.7 on “How to do research”
  5. Lessons from my Ph.D. — Austin Z. Henley
  6. Newsletters, WordPress, and Quora: DoctoralWriting SIG, The Hidden Rules of Academia, by Bianca Pereira (Medium), and Dr. Doctorate (Quora)
  7. Advice to pre-PhD self https://twitter.com/FromPhDtoLife/status/1514338255822639115
  8. ACL is a premier conference on Natural Language Processing, and it has organized research mentoring sessions on topics like how to choose your NLP project, building collaborations, and more. Please go through the recorded videos available on its YouTube channel
  9. Choosing between a Ph.D. and industry for new computer science graduates by Shreya Shankar (Blog)
  10. Balancing Teaching and Research by Emily M. Bender (Slides)
  11. Job positions: Twitter threads at @jobRxiv
  12. PostGradual: The Ph.D. Careers Blog
  13. The Ultimate UG Research Manual by Scholar’s Avenue, IIT Kharagpur, India
  14. Highlights of mentoring sessions of EMNLP 2020 (Blog)

Conclusion

I hope this article helps spread awareness about the free and open-source resources available for researchers and brings about a self-improvement perspective toward research in academia.

I wish you a unique, exciting but informed Ph.D. journey!

Disclaimer

The article is based on my opinion and experience alone and does not reflect the views of any researchers I have met or collaborated with. I am a Ph.D. student trying to navigate the long Ph.D. journey and am in no way an expert. This article aims to present what has worked for me till now (in specific domains of machine learning and natural language processing) and aggregate the public views and experiences of more experienced and eminent researchers on the same topic.

💚 30+ free articles already available at datanalytics101.com

💚 Your feedback is critical to improving the content, so please feel free to share your take on this topic

💚 Follow me on Twitter @roysoumya1 for getting updates on “AI in Healthcare”

💚 I plan to write one post a month on Medium. To get updates directly to your email, please subscribe at https://medium.com/subscribe/@soumyadeeproy


Soumyadeep Roy

datanalytics101.com Ph.D. student, CSE@IIT Kharagpur, India. Research Associate at Leibniz AI Lab, Germany. Love research and sharing knowledge.