Organizing and evaluating research ideas

Marco Tulio Ribeiro
Jul 13, 2022


Part 1 of this post was about coming up with research ideas. But once you come up with an idea, you still need to flesh it out (I’m calling that ‘organizing’), and figure out if you want to work on it. We’ll deal with both of those in this part, but let me start by sharing a couple of personal anecdotes.

The first involves the same Google internship mentioned in Part 1 (I learned a lot in this internship). After working on a project for about 6 weeks, I scheduled a conversation with a senior researcher, hoping for technical help in getting my solution to work. I left the meeting with my spirit crushed, because they convinced me the problem I was working on was a bad one in the first place (and thus the solution didn’t matter even if it did work). This was a really valuable lesson, which by itself would have made the internship worth it (it was great for other reasons too). After this experience, I resolved to always think very carefully about what problem I was trying to solve and why.

Over time, I evidently forgot this lesson. Soon after I finished my second paper on interpretability, I wasted a lot of time (months, if I remember) flailing about on a project that I struggle to even describe now. If I remember correctly, it was a mix of ‘unifying interpretability’, ‘a bunch of cool interpretability experiments’, and ‘new interpretability techniques’ — none of which were clear enough to evaluate, or conducive to any kind of meaningful progress.

Since then, I’ve been using and constantly refining structured processes to make sure I don’t forget to ask certain questions I’ve found useful in the past. I codify these processes into templates or checklists.

Organizing potential projects

Below is my current template for organizing ideas into potential projects. I use a snapshot of the template during the development of my CheckList paper as a running example.

The problem
A simple statement of the problem we’re trying to solve. Whenever I share this template with people, they invariably write ‘the solution’ here instead of the problem, e.g. ‘we use technique X to do Y’ rather than ‘we currently can’t do Y’ or ‘we want to do Y’.

CheckList example:
We want to know if our models behave according to business rules, prior knowledge and common sense. Right now, it is hard even for experts to write down what these are. Most people don’t even think about why they need to do this.

Why bother
A summary of why the problem is important, or why we care about solving it.

CheckList example:
Training/eval data is static and usually has biases that models can exploit.
Accuracy is usually not all we care about, and furthermore test sets are almost never large or varied enough to test for certain behaviors (e.g. how the model behaves in the presence of typos, how the model behaves for certain outliers, etc).

Current solutions and why they fail
A list of approaches that may be applied to ‘the problem’, even if they were not designed specifically for it. Note that ‘they fail’ is not saying that these are bad approaches, just that we expect them not to work for our problem, even if they are excellent otherwise.

CheckList example:
I listed a bunch of approaches (not relevant to this blog post), and said “These do not really test model behavior for human expectations, with minor exceptions. Also, some require access to model internals.”

If I had a solution, what would it look like?
The point of this is not to actually propose a solution, but to ‘map’ what kind of solution we expect, as illustrated by the example below. If you can come up with multiple kinds of potential solutions, even better.

CheckList example:
- A framework for testing, including categorization of different test types
- A way to elicit tests from users
- Software that makes the whole process easy

How do I know if I solved it?
Again, this is mostly a sketch of the evaluation (details are not necessary).

CheckList example:
- A lot of compelling examples of SOTA models failing tests, with gleaned insights
- A ‘user study’ with Microsoft engineers, where we (hopefully) show they find a bunch of (fixable) problems in their own models after writing tests.

What are uncertainties? What has to be true for this to work? What can’t be true for this solution to work?
This is about mapping out areas of high uncertainty, and making sure we try to rule out things that would invalidate either the project or the sketch of a solution as quickly as possible.

CheckList example:
- If I am the only user capable of coming up with tests, this will be a failure.
- If SOTA models only fail idiosyncratic tests, maybe accuracy is good enough and we don’t need this project.

Sketch of a plan
Sketch is the operative word here. We only need a reasonable level of detail for the first couple of steps, and the first step is often some form of preliminary literature review.

CheckList example:
- Lit review
- Write tests for a SOTA sentiment analysis model
- Write tests for a couple of other models (paraphrase, SQuAD)
- Try to organize these into general principles
- Develop whatever tools are needed
- Case study with Microsoft person; Case study with researcher
- User study
- Write paper

Comments on the template

The first five points (up to ‘How do I know if I solved it?’) give us a vision of what the project is, what will change in the world if it is successful, why we care about that change, and how we can show that some change actually occurred. The last two points are mostly for the purposes of evaluating the project (next section), and having a tentative ‘map’ of how to make progress (not being able to come up with even a tentative map is a bad sign).

In addition to organizing new ideas, this template is a subset of a larger template I use and constantly revise while I work on my projects (I plan on writing a future post about this). The first version of the template almost always ends up being wrong in some way, e.g. the problem definition changes mid-way, the solution we come up with looks nothing like our first guess, the evaluation is totally different, etc. However, I’ve found that having a tentative version at this stage is much better than not having a template at all, even if everything changes later.

One caveat is that this template is very ‘problem’ oriented, rather than ‘solution’ oriented. This just happens to be the kind of research that I like doing, and thus I naturally have more experience with it. Having said this, I think some ideas here can be applied (with some adaptation) to more exploratory ideas, where the problem definition is perhaps a little murky.

Heuristics for evaluating potential projects

The template above helps make ideas more concrete, and avoid ill-defined projects. However, we still have to decide whether or not to work on a specific project, or which project to pick amongst many. I’ve seen many grad students start working on the first ‘good enough’ project that presents itself to them. I think this is a bad approach, as most people find it hard to give up on projects, and the timescale to come up with alternative ideas is much shorter than that of wrapping up projects (weeks vs months).

Of course, the objective function here depends on your overall goals. The best project for an undergrad trying to build their resume is usually not the same as for a graduate student trying to establish a thesis direction, or for an industry researcher who wants their research to add value to existing products. Time constraints also matter — e.g. an internship project is typically constrained to 12–16 weeks. Finally, one has to consider what resources are available, either individual resources (e.g. an experienced researcher is usually in a better position to tackle a very open-ended problem when compared to a novice) or institutional ones (e.g. availability of compute, funds for crowdsourcing, engineering help, etc).

With all of these caveats, I think the following non-comprehensive heuristics are useful in avoiding common pitfalls I’ve fallen into.

Heuristic 1: Imagine different futures

Write down the possible outcomes of working on a project, with your estimate of their likelihoods (try to be really concrete). Some projects are more spiky at the extremes (success or fail), while others have more possibilities for partial success. Even a rough estimate forces you to consider how your current resources and constraints impact the probability of success, and what kind of success or failure you can expect, e.g. “What type of impact could this project have if I succeed?”, “Can I figure out if I’ll definitely fail in a few weeks or will I only find out after a few months?”. In turn, this helps you think about your risk/reward portfolio and how the profile of this project fits your goals, e.g. if you’re a terminal grad student hoping to graduate soon, you probably shouldn’t take a very risky project with few avenues for partial success.

One caveat: you should have very high uncertainty around these estimates, and be really careful about premature optimization. I think adding some randomness to your current best estimate (e.g. sometimes doing projects that don’t really ‘make sense’ based on your current beliefs and goals) is a good strategy to avoid missing great opportunities because you could not anticipate them. Having noted that most of us are bad at estimating outcomes and their probability, I still think it’s useful to make a crude estimate when evaluating projects, even if one doesn’t take these estimates too seriously.
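To make ‘write down outcomes and likelihoods’ concrete, here is a minimal sketch of how one might tabulate them. Everything in it is an assumption for illustration: the `Outcome` class, the `summarize` helper, the 0 to 10 reward scale, and all the numbers are hypothetical, and the probabilities are crude guesses rather than measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    description: str
    probability: float  # crude personal estimate, not a measurement
    reward: float       # subjective value on an arbitrary 0-10 scale


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Summarize a list of mutually exclusive outcomes for one project."""
    total = sum(o.probability for o in outcomes)
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return {
        # Probability-weighted average reward.
        "expected_reward": sum(o.probability * o.reward for o in outcomes),
        # The 'reward upper bound': the best outcome we listed.
        "best_case": max(o.reward for o in outcomes),
        # Chance of walking away with nothing.
        "p_no_reward": sum(o.probability for o in outcomes if o.reward == 0),
    }


# Hypothetical numbers for an imaginary project.
project = [
    Outcome("Great paper, tool adopted by others", 0.10, 10.0),
    Outcome("Solid paper, partial success", 0.40, 5.0),
    Outcome("Workshop paper only", 0.30, 2.0),
    Outcome("Nothing publishable", 0.20, 0.0),
]

summary = summarize(project)
print(summary)
```

The exact numbers matter much less than the exercise of listing the outcomes. Comparing `best_case` and `p_no_reward` across a few candidate projects is usually more informative than the single `expected_reward` figure, since the inputs are so uncertain.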

Heuristic 2: Make sure the ‘reward upper bound’ is high

This is just an application of Heuristic 1, but I think it’s important enough to stand on its own. Note that I’m not saying ‘high upper bound’ is a sufficient condition for a good project, but that ‘low upper bound’ is usually enough to reject a project, regardless of how you define ‘success’ or ‘rewards’. You may disagree (which is fine), but I just don’t think it’s worth spending a significant amount of time on almost any project if the best case scenario is an ‘ok’ paper (even if the risk is close to zero). My advisor used to say that 1 great paper is worth more than 3 good ones, and I think that mindset is really helpful. Of course you can’t guarantee greatness, but if you shoot for ‘good enough’, that is typically as far as you’ll go.

The one exception I see here (there are probably others) is students applying to PhD programs, where the number of top-tier papers seems to matter a lot (especially if that number is non-zero). In this case minimizing risk and effort required for publication seems like a good strategy, even if you end up with a string of ‘ok’ papers (I suggest switching strategies once you start your PhD).

Heuristic 3: Get more information

Instead of making a non-reversible decision to embrace or reject a project, it can be useful to gather additional information in order to refine the proposal and reduce uncertainty. One caveat: the more you invest in a project, the more the sunk cost fallacy kicks in. Thus, it may seem silly, but I think it’s important to think and talk about the project as a possibility (rather than a certainty) at this stage, e.g. avoid saying ‘this is the project I’m working on right now’.
Here are two ways of gathering additional information that are almost always worth trying.

Talk to people
Pitch the project to a few people, to see if they ‘buy’ your story, your estimate of its importance, the downsides of current solutions, etc. I think people who are skeptical and negative are great at this point (‘this is cool’ doesn’t give us much information), but paradoxically I don’t think you should listen to them too closely. Most original ideas don’t sound great from the get-go (e.g. search for ‘fragile’ in this transcript, or see this thread). If you don’t get discouraged, negative feedback is great fodder for the ‘uncertainty’ part of the template, as it gives you falsification hypotheses that you can go check and defend against. Further, this is a great way to do a cheap literature review, e.g. when people ask you ‘hasn’t this been done by X back in 1984?’

As a personal anecdote, I pitched this paper (which won a best paper award) to a few NLP folks who were much more senior and experienced than me, and they gave me extremely negative feedback with very high certainty. Such a reaction was really unexpected to me (I thought the project was pretty cool), but also really helpful in making me extra careful with the ‘How do I know if I solved it?’ part of the template.

Asking experts specific questions you are uncertain about can also be a cheap way to reduce uncertainty about the risk of failure. If this is the goal, I typically try asking the question I’m uncertain about abstractly (e.g. ‘how would you solve this subproblem?’) before giving the project context. Otherwise, the person can get more focused on the overall project than on reducing my uncertainty where I most want it. Of course there is a tradeoff here, as asking more specific questions reduces the probability you’ll get unexpected (and useful) advice.

Commit to a hacking session
Look back at the ‘uncertainty’ part of the template, and try to reduce uncertainty, especially around things that could kill the project. Here it is often possible to cheat, e.g. if the proposed solution is a pipeline A → B → C → D and you have uncertainty about D, just fake C rather than going from A → B → C.
It’s also often possible to reduce uncertainty with cheap lower bounds: use zero-shot GPT-3, use a pretrained model trained on a related task, write a crude rule-based model, etc. Once you have a cheap lower bound for some step, you can use it and see how far it gets you.
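As a rough illustration of both kinds of ‘cheating’, the sketch below fakes the output of stage C by hand and plugs in a crude rule-based model as a cheap lower bound for stage D. The task (sentiment), the word lists, the example sentences, and every name in the code are hypothetical choices made for this example, not part of any real project.

```python
from typing import List, Tuple

# Stage C is faked: instead of building A -> B -> C, we hand-write what C
# would plausibly output (here, cleaned sentences with a gold label).
FAKED_STAGE_C_OUTPUT: List[Tuple[str, str]] = [
    ("i love this phone", "positive"),
    ("the battery is terrible", "negative"),
    ("great screen and great camera", "positive"),
    ("i hate the new update", "negative"),
    ("not bad at all", "positive"),  # negation: keyword counting misses this
]

# Hypothetical keyword lists for the crude model.
POSITIVE_WORDS = {"love", "great", "good", "excellent"}
NEGATIVE_WORDS = {"hate", "terrible", "bad", "awful"}


def crude_stage_d(sentence: str) -> str:
    """Cheap lower bound for stage D: count sentiment keywords."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    return "positive" if score >= 0 else "negative"


def accuracy(examples: List[Tuple[str, str]]) -> float:
    """Fraction of faked stage-C examples that stage D labels correctly."""
    correct = sum(crude_stage_d(text) == label for text, label in examples)
    return correct / len(examples)


# The failures are often more informative than the accuracy itself.
failures = [
    (text, label)
    for text, label in FAKED_STAGE_C_OUTPUT
    if crude_stage_d(text) != label
]

print(f"accuracy: {accuracy(FAKED_STAGE_C_OUTPUT):.2f}")
print("failures:", failures)
```

A few minutes of this kind of hacking already tells you something about D (here, that negation breaks the crude model) without ever building A, B, or C.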

Hacking is also great to make sure you really set up the problem well. Even if you do all steps by hand (a great way of ‘cheating’), you are forced to check if the steps and ‘output’ of whatever you are proposing are realistic (the ‘output’ is particularly relevant to check if your ‘why bother’ part of the template is reasonable).

The result of a hacking session can be great fodder for talking to a new set of people, as you typically get concrete examples that make it easy for people to imagine the ‘end result’ of a project.


An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. (John Tukey)

For better or for worse, choosing a research project sets the trajectory of your work for a significant amount of time (3 to 12 months for most people). Further, as the quote above indicates, even mediocre progress on a good problem can be worth way more than excellent progress on a problem that is not as important.
I personally think spending some time deliberately thinking about which project to do is a very good investment of time, and the heuristics above tend to help me do this. Hopefully some of them help you too :)


Alex Cabrera, Gabriel Ilharco, Adarsh Jeewajee, Fereshte Khani, Scott Lundberg, Shikhar Murty, Sameer Singh, Tongshuang Wu, and Yilun Zhou read a previous (and much worse) version of this post, and contributed to it with helpful suggestions. Thanks!