How to ask for datasets
Having the right datasets at your disposal is key to the success of any research project. For some questions you may be able to produce the needed data yourself, but sooner or later you’ll encounter a situation in which you’d love to get input from fellow researchers or obtain access to their data. How you approach them about this can make or break your study.
I’m a systems researcher. I work with data, plenty of it. Over the past decade I have sent lots of data inquiries, and have received dozens. Judging by the latter it’s safe to say that people often go about this poorly, so I’d like to give a bit of advice regarding how to formulate inquiries to other researchers. But before we start, a few clarifications. This article is dataset-centric, but the concerns apply similarly to resources such as algorithms, methods, or code. Also, I assume you have done your background research and already know whom to ask. This is not a guide for finding useful stuff. Finally, the following is by no means a complete guide on how to collaborate with other researchers, but it might provide some tips regarding how to start such a collaboration.
Don’t be shy. Let’s get this out of the way first. If you go about it well, there’s absolutely no harm in asking. Most researchers are keen to share and discuss their work. Never be afraid to reach out to big names.
Make your purpose clear. Before you send your inquiry you should have a thorough understanding of your goal. You might be conducting a comparative evaluation of an algorithm you’ve developed, and need to understand an implementation detail. Perhaps you’re involved in an interdisciplinary effort combining several interesting datasets. Or you simply have an interesting question about someone’s latest results. Make sure you can explain this. If it’s clear that you intend to do something interesting with a dataset, this makes you interesting. Nobody wants to hand out a hard-earned dataset to a potential hoarder who’s unlikely to ever make good use of the data. Outline your intended project, describe how you’ll publish the results, or better yet, propose a collaboration. You might well have identified a win-win situation for everyone involved.
Make sure you’ve done your homework. This, too, is key. You need to demonstrate that you know your stuff and have good reasons for the inquiry. To give a negative example, I frequently receive requests for “botnet data”. That could mean anything. Are you interested in malware binaries, traffic captures, NetFlow data? Why, and why would you need mine? Understand the meaning and potential of the data you’re asking for, and be concrete. Understand the implications of obtaining certain datasets, such as privacy concerns, risks to others, or repeatability of the experiment. If you’re basing your inquiry on a specific piece of work such as a paper, blog post, or open-source project, again, be concrete. Don’t say things like “in your recent paper” but name the exact context. Understand that datasets can be large. Getting a copy of terabytes of data and processing it is a non-trivial undertaking.
Make your affiliation clear. This is especially important if your job affiliation relates to your cause. Don’t use a random Gmail or Hotmail address (good luck particularly with the latter) when you have one that shows you’re in the Computer Science department of a well-known university. Students, I’m looking at you! I’m amazed how often this happens. Identifying the exact group you’re working in, as well as who your adviser is, also helps.
Find the right point of contact. If you’re addressing the authors of a paper, start by contacting the first author. Some papers explicitly list the right contact for correspondence regarding the paper. Under no circumstances should you send an individual email to each listed contact in parallel: the authors will notice, it makes you look lazy and inconsiderate, and it wastes the recipients’ time as they figure out whether they’ve all been contacted by the same person. (This applies not just to dataset inquiries but to email etiquette in general, a subject for another article.) Realize that it’s your job to find the right point of contact. Sometimes people move on to new things and it’s not their job to find that contact for you.
Don’t be a jerk. When somebody shares their stuff with you, they are doing you a favor. So be grateful and don’t bite the hand that feeds you. Ideally this wouldn’t need saying, but unfortunately reality has proven otherwise. If you believe you’ve found flaws in somebody’s data or algorithm, show some consideration. Use extra care with de-anonymization: people likely had good reason to strip certain information from the data, so trying to find a way around it is unlikely to leave a good impression. Before you go out and blast someone who’s shared material with you, confirm your findings. Put yourself in the position of the donor and consider how you would feel about the new results.
Be respectful. If somebody decides not to share data with you, it might well be that they want to share but can’t, perhaps for legal or ethical reasons. For example, network traffic captures frequently contain highly private information that can prove extremely difficult to remove. The authors simply may not have the resources to anonymize the data to the extent required by their organization’s policies, legal frameworks, etc.
Be responsible. Understand the limits that apply if others ask you for the data you’ve received. Unless you have explicit permission to re-share the data, you almost certainly should direct requests back to the original authors.
Be grateful. This last part should be obvious: no matter how you publish your findings, make sure you properly acknowledge everyone who shared resources with you. Thank people directly. Send them a pointer to your work.
That’s it for now … best of luck with your work! I would love to hear your feedback and questions, so feel free to get in touch via @ckreibich or email@example.com. Many thanks to everyone who’s sent helpful comments.
Background photo credit: https://www.flickr.com/photos/ian-s/2152798588