DNA CARTOGRAPHERS, #6: Maya Anand
A new blog series shedding light on the behind-the-scenes aspects of DNA.Land
by Richard Aufrichtig

Since October, I have had the privilege of working for the Erlich lab as the DNA.Land User Engagement Coordinator and Technical Support leader. This has allowed me the exciting experience of communicating directly with users all over the world, fielding questions, and helping to calibrate the DNA.Land program for our users’ needs. Over the past nine months, I have received many questions about our team and what we do. Though we are an academic group from Columbia University and the New York Genome Center, many of our users are under the impression that we function similarly to large companies like 23andMe, Ancestry, and FamilyTreeDNA.
In an effort to help connect the scientists of DNA.Land more directly to our users, we have launched this blog series: DNA Cartographers. We hope that you enjoy learning more about our scientific team!
My sixth subject for the blog series is Maya Anand. An undergraduate research assistant for the Erlich lab since January 2016, she has been working in our office this summer as an intern. I interviewed Maya on Wednesday, August 3rd, 2016 on the New York Genome Center’s 7th Floor terrace.

Richard: Thank you for being the first intern to appear on DNA Cartographers!
Maya: I’m glad to be here!
Richard: On the beautiful roof.
Maya: Yes, on the windy day.

Richard: I’m wondering if you could describe, for our users, A) the journey that led to you becoming interested in an internship with the Erlich lab and B) the process you went through to be approved and start working here.
Maya: I’ve been interested in health sciences and computer science for quite some time, at least since I was in high school. We had a research program at my high school and I did a lot of epidemiology research, studying disease on the population level. When I came to college, I discovered computer science, started taking some intro classes, really loved it, and I happened upon Yaniv’s class that he taught at Columbia.
Richard: What was the name of that class?
Maya: It was called Ubiquitous Genomics. I remember getting an e-mail about this class being offered, and I was just so excited to see a class where computer science was combined with something health- and bio-related. In the class, we learned about the basics of DNA sequencing, and basic analytic pipelines. We got to work with the MinIONs, which Sophie does a lot of work with.
[Editor’s note: This is an article explaining the class in depth http://www.cs.columbia.edu/2016/dna-sequencing-in-classroom/]
Richard: And, this was in the fall of 2015?
Maya: Yeah. And, we actually came to the New York Genome Center a couple of times to do little projects, sequencing things with the MinION and analyzing the data. And then, based on the quality of my work, Yaniv asked me if I wanted to do actual research with the lab.

Richard: Wow! That’s incredible.
Maya: I was really excited. I started in the spring, coming in only once a week as my schedule allowed.
Richard: What kind of DNA.Land stuff were you working on then?
Maya: Well, I don’t know if this is a feature that will end up going live, but we were thinking about alternatives to the badge that we have for users. Right now we have two circles that show the number of bits of data you have contributed, along with the percentage of data you have contributed versus other users.
Richard: For some of our users this has been a confusing feature.
Maya: We were trying to think of a more intuitive way to show users how they’ve contributed. And, so we thought of different badges. For example, getting a badge for uploading your genome — or, getting a badge for linking to Geni. Or a special badge for being one of our first 10,000 users. I got to fool around in Illustrator and design some of those things.
Richard: Nice.
[Editor’s note: This feature is still being planned for launch! Please send any ideas you may have to info@dna.land]
Maya: I learned a lot through that process and ultimately decided that I’d like to stay on for the summer. I ended up applying for a grant offered by The Computing Research Association’s Committee on the Status of Women in Computing Research (CRA-W). They are actually funding my research for the summer, and through their research program I’m here. The program is called DREU (http://cra.org/cra-w/dreu/#overview) and is jointly sponsored by CRA-W and the CDC (Coalition to Diversify Computing).
Richard: Wow — that’s amazing!
Maya: It’s a ten week research program, but actually I ended up staying for 12 weeks this summer.
Richard: What did you know about bioinformatics or genomics before you took Yaniv’s class?
Maya: Before I took Yaniv’s class? Probably not that much! [Laughing]
Richard: [Laughing]
Maya: It’s hard to say now, after I’ve been here for 7 months. But, I probably assumed that it involved computers and data related to different bio-purposes. I’m sure that any of the details in that were fuzzy. So, it was really exciting to see what kind of tools bioinformaticians actually use and what kind of questions they’re trying to answer. It has been amazing to see how broad and new the field really is.
Richard: Now that you’ve been here for a while, I’m wondering if you might be able to describe for our users what the Erlich lab does, and how DNA.Land fits into that.
Maya: So, there are a lot of questions that we’re interested in related to genomics. But, one of the main questions that we’re looking at is how we can use mass media to collect and crowd-source genomic data. And, how we can then get meaningful information out of that data.
Richard: Right.
Maya: And, that’s a lot of thinking about: “well, what questions should we be asking?” But, also how should we analyze the data? And, what forms of technology or programs should we use?
Richard: Hmm.
Maya: And, for my project, I’m working with some different sources of data that are also publicly available. Like, OpenSNP is an example of another site where users can upload their genetic information. It’s interesting how we’ve had this platform with DNA.Land where there’s more protection in how your data is shared, and to see how that contributes to research.
Richard: Now that you’ve worked with large data sets — I don’t mean the ones in DNA.Land, but rather the files you’ve been working with from the 1000 Genomes Project and OpenSNP — I’m wondering what your thoughts on genomic privacy are at this time, and how they’ve evolved as you’ve done your research.
Maya: I think doing all of this research has made me realize how important it is that people share their genomes. It’s so complex, that the only way that we can develop some sort of understanding of it is if we have a huge amount of data and numbers to go off of. And, I think that’s really important. But, also, to see that there are platforms where we can have different ways of doing informed consent. For example, the DNA.Land informed consent is very clear. It’s not like the Apple Terms of Agreement, with all the fine print. It’s very upfront and clear about what you’re agreeing to.
Richard: How much data did you actually process?
Maya: I processed the genomes of roughly 3130 individuals from the 1000 Genomes Project. I’m not actually sure how much data it was when I downloaded it originally, and I can find that out for you. But by the end, after I had converted it into the different file formats that are used for different things, and run our Ancestry program on them and all of that, I ended up having about 9.3 terabytes’ worth of data.
Richard: How many computers did it take and how much time?
Maya: So, I had my laptop, and we were lucky enough to be able to work with the Amazon EC2 instances. We started with 2 machines of 40 CPUs each working around the clock. Later, we realized that we needed to scale horizontally instead. And, so, we switched to a different method: we used many more machines with fewer CPUs and less memory.
Richard: And it took a long time, right?
Maya: All in all, we used approximately 5,000 hours worth of Amazon EC2 instances, and most of the time those machines that were running had 8 or 16 CPUs each.
Richard: Wow.
Maya: The whole process of figuring out which commands to run, which scripts to run and debugging everything… I think just to get from the VCF files to the 23andMe formatting files that we wanted took about three weeks of running different programs, processing, and troubleshooting. So, I definitely developed a lot of patience over this project. It may sound trivial, but when you’re new in the field the devil is in the details.
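[Editor’s note: for readers curious what a step like “VCF to 23andMe format” can involve, here is a minimal Python sketch. It is not the script Maya ran. It assumes a single-sample VCF line with a GT field, and it assumes the 23andMe raw-data layout of four tab-separated columns (rsid, chromosome, position, genotype) with “--” for a missing call. The function name and the example variant are made up for illustration.]

```python
def vcf_line_to_23andme(line):
    """Convert one single-sample VCF data line to a 23andMe-style line.

    Assumed output layout: rsid, chromosome, position, genotype (tab-separated).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]

    # FORMAT (column 9) names the per-sample fields; column 10 holds the values.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gt = sample_values[format_keys.index("GT")]

    # GT holds allele indexes: 0 is REF, 1 is the first ALT allele, and so on.
    alleles = [ref] + alt.split(",")
    calls = gt.replace("|", "/").split("/")

    if any(call == "." for call in calls):
        genotype = "--"  # assumed no-call marker
    else:
        genotype = "".join(alleles[int(call)] for call in calls)

    return "\t".join([rsid, chrom, pos, genotype])


# A made-up variant, not real data.
example = "1\t12345\trs0000001\tA\tG\t.\tPASS\t.\tGT\t0|1"
converted = vcf_line_to_23andme(example)
print(converted)
```

A real conversion also has to deal with insertions and deletions, variants without an rsid, and thousands of samples per file, which is part of why the actual work took weeks rather than minutes.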
Richard: In relation to that, can you describe what some of the hurdles, obstacles, and challenges are that you encountered while trying to learn/process/analyze large data sets?
Maya: I think one of the first and foremost frustrations I came upon was that, while there are a lot of really great tools in the bioinformatics community that you can use, a lot of the time their documentation is a little bit lacking. You’ll go to the manual, and it’ll be extremely vague in what the different commands mean or what the different options are. So, I spent a lot of time running code with different options — and just looking, almost physically, manually through the outputs while trying to understand what the different options meant.
So, I think that’s one big area where, if documentation were improved in the bioinformatics field, it would be a lot easier to do things. And, I understand that it’s really hard. Even as I was writing my own scripts, Gordy [Assaf Gordon, DNA.Land’s Lead Developer] would tell me to be really mindful of documenting and putting comments in my code, and of writing good READMEs that explain it. Sometimes, it’s a little annoying, because I know what it does and I know it works. But, going back a month or two later to what I had done earlier in the summer, having the comments and READMEs was super helpful.
Richard: That makes sense.
Maya: I think the other challenge of just being new to the field is getting adjusted to how long it takes to run things, and how to think about that. For example, if it’s going to take a really long time to run a job, instead of running it on the whole genome, I could run it on one chromosome — preferably the smallest chromosome — and see if it works before I get too ambitious about trying the code on the whole dataset.
Richard: Right.
Maya: Breaking things down into manageable chunks.
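[Editor’s note: the “try one chromosome first” idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the lab. It assumes an uncompressed VCF whose first column holds the chromosome name, and it picks chromosome 21 (the smallest human autosome) as the test case. The function name and the sample lines are made up.]

```python
import io


def keep_chromosome(lines, chromosome):
    """Yield the header lines plus only the variant lines on one chromosome."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header so the subset is still a valid VCF
        elif line.split("\t", 1)[0] == chromosome:
            yield line


# A tiny made-up VCF standing in for a whole-genome file.
SAMPLE = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs0000001\tA\tG\t.\tPASS\t.\n"
    "21\t200\trs0000002\tC\tT\t.\tPASS\t.\n"
    "2\t300\trs0000003\tG\tA\t.\tPASS\t.\n"
    "21\t400\trs0000004\tT\tC\t.\tPASS\t.\n"
)

subset = list(keep_chromosome(io.StringIO(SAMPLE), "21"))
print("".join(subset), end="")
```

Some files name chromosomes “chr21” rather than “21”, so the value passed in has to match the file. Once a pipeline works on the small subset, the same commands can be pointed at the full dataset.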
Richard: There is a lot of huge data available. For example, DNA.Land users can download large VCF files of their own genomes. What advice would you give to someone who is not very good on the computer, but who wants to start exploring the data? The reason I ask, to be clear, is that a lot of our users who are not really computer savvy will download the VCF file and be like: “Okay, so what the heck is this?”
Maya: Well, yeah. That’s a hard question, I think, because the bioinformatics field is new and growing. A lot of times there aren’t cut-and-dried or clearly established methods of doing this or that. I spent a lot of time looking up or comparing methods for different things. First of all: don’t open your VCF in Microsoft Word or Excel. I would open it up in TextEdit and make the screen really wide, ’cause it has a lot of columns. That’s a good way to just first look at your file. And, you can look up online what the different fields of a VCF file stand for.
[Editor’s note: This is the technical explanation of the VCF: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/. This is DNA.Land’s far more user friendly explanation of the VCF file https://dna.land/vcf-info. Stay tuned for a new DNA.Land feature dealing with the VCF file in the near future!]
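[Editor’s note: for readers comfortable with a little Python, here is a minimal sketch of another way to take a first look at a VCF. It skips the “##” information lines, uses the “#CHROM” line as the column names, and prints a few columns per variant. The function name, the sample name SAMPLE1, and the variants shown are made up for illustration.]

```python
import io


def read_vcf_records(handle):
    """Yield one dict per variant line, keyed by the VCF column names."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines
        if line.startswith("#"):
            columns = line.lstrip("#").split("\t")  # the #CHROM header line
            continue
        if columns is None:
            raise ValueError("no #CHROM header line found before the data")
        yield dict(zip(columns, line.split("\t")))


# A tiny made-up VCF; a real file has the same shape but far more lines.
SAMPLE_VCF = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "1\t12345\trs0000001\tA\tG\t.\tPASS\t.\tGT\t0|1\n"
    "1\t67890\trs0000002\tC\tT\t.\tPASS\t.\tGT\t1|1\n"
)

records = list(read_vcf_records(io.StringIO(SAMPLE_VCF)))
for record in records:
    print(record["CHROM"], record["POS"], record["ID"],
          record["REF"], record["ALT"], record["SAMPLE1"])
```

To try this on a downloaded file, the `io.StringIO(SAMPLE_VCF)` part can be swapped for `open(...)` with the path to the file (or `gzip.open(..., "rt")` if the file ends in .gz). Because the function reads one line at a time, it does not need to load the whole file into memory.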
I think SNPedia is a really interesting website, because you can look up the SNP rsids of different variations. And it tries to list what papers have been done on that SNP, or what we think that SNP might be about. For silly ones, sometimes it’s like: this is a SNP for wet earwax or dry earwax. A lot of times they’re just really random things. And, sometimes it can be really fun to look at that — with the huge caveat that it’s just something that’s fun to look at.
Richard: Hmm. Do you mean that it shouldn’t replace clinical diagnosis?
Maya: Yeah, you really have to be careful about interpreting the data. Genomics is a really powerful tool, and it’s something that we can learn a lot from. But the field is still so young that you really have to be careful about how you’re interpreting things. As we know more and more about genetics, hopefully it can help us better understand and diagnose diseases.