DNA CARTOGRAPHERS, #7: James Touthang
A new blog series shedding light on the behind-the-scenes aspects of DNA.Land
by Richard Aufrichtig
Since October, I have had the privilege of working for the Erlich lab as the DNA.Land User Engagement Coordinator and Technical Support leader. This has allowed me the exciting experience of communicating directly with users all over the world, fielding questions, and helping to calibrate the DNA.Land program for our users’ needs. Over the past nine months, I have received many questions about our team and what we do. Though we are an academic group from Columbia University and the New York Genome Center, many of our users are under the impression that we function similarly to large companies like 23andMe, Ancestry, and FamilyTreeDNA.
In an effort to help connect the scientists of DNA.Land more directly to our users, we have launched this blog series: DNA Cartographers. We hope that you enjoy learning more about our scientific team!
My seventh subject for the blog series is James Touthang, an undergraduate summer intern for the Erlich lab. I interviewed James on Wednesday, August 3rd, 2016, on the New York Genome Center’s 7th Floor terrace.
Richard: First off, let me thank you for being the second intern on DNA Cartographers.
James: Thank you for having me!
Richard: To start off, I was hoping you could describe for me the journey that led to you becoming interested in this internship.
James: Well, it started out with an announcement that my Discrete Mathematics professor in Oklahoma gave during class, saying how there was an internship at a research institute in New York that was crowd-sourcing genetic data. I thought that was interesting, partially just because it was in New York. My major is computer engineering, but at the same time I have a background in biology and chemistry. So, I thought it might also be at an interesting cross-section of those fields.
Richard: Can you describe what you’ve been working on as part of your internship? If I’m correct in saying so, you didn’t work directly on the DNA.Land website, or with the DNA.Land users’ data.
James: That’s right — they shifted the interns towards working on big data sets instead. My particular internship was made possible by a generous gift from Andria and Paul Heafy.
Richard: Hmm, so what kind of data were you working with?
James: There’s a project called The Personal Genome Project — which allows anyone to donate their genomic data, and also offers surveys about their users’ personal health information. What I’ve done is take the genomic data from PGP users and run it through an imputation pipeline similar to the one that DNA.Land uses.
Richard: So, how much information does a process like that involve?
James: We focused on 600 samples from the PGP website. Each sample is a genotype file (such as the 23andMe file that DNA.Land users upload), and for some samples we also had phenotype data (e.g., height or weight). Our initial input data was about 25GB, and by the time we finished we had generated about 1.8TB of data.
[Ed. note: 1 TB (Terabyte) = 1,000 gigabytes = 1,000,000,000,000 bytes]
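Ed. note: For readers who want to check the scale of those numbers, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate figures James quotes above and the decimal units from the previous note. The per-sample averages assume the data is spread evenly across all 600 samples, which is our simplification and not something the project reported; the variable names are ours as well.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the interview.
# Decimal units are assumed (1 TB = 1,000 GB; 1 GB = 1,000 MB).
GB_PER_TB = 1_000
MB_PER_GB = 1_000

samples = 600      # PGP samples processed
input_gb = 25      # approximate size of the input data, in GB
output_tb = 1.8    # approximate size of the generated data, in TB

output_gb = output_tb * GB_PER_TB

# How many times larger the output is than the input.
growth_factor = output_gb / input_gb

# Averages per sample, ASSUMING an even split across samples (a simplification).
input_mb_per_sample = input_gb * MB_PER_GB / samples
output_gb_per_sample = output_gb / samples

print(f"Output is roughly {growth_factor:.0f} times the size of the input")
print(f"Average input per sample:  {input_mb_per_sample:.1f} MB")
print(f"Average output per sample: {output_gb_per_sample:.1f} GB")
```

In other words, the generated data is roughly 72 times the size of the input, which works out to an average of about 3 GB of output for each sample under the even-split assumption.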
Richard: Wow! How did you process all of that information?
James: We used Amazon Web Services, a platform for what is commonly known as “cloud computing” (or sometimes just “the cloud”). We had several powerful computing nodes running 24/7 for more than a week.
Richard: And what has the final result of your project been?
James: Well, to encourage data sharing in the scientific community, we decided to share all the results we generated — so that others can build upon what we’ve done. I built a website (http://repgp.teamerlich.org) over the summer that has the results I gathered from the Personal Genome Project in a nice organized format. This website includes data that the public can use and inspect for themselves. It also includes tutorials on how to use the files for those who are new to bioinformatics.
Richard: Now that you have dipped your toes into the bioinformatics field, what kind of advice or recommendations would you give to the scientists building its cutting-edge tools? After your project this summer, I’m sure you have a lot of things that you could say about this!
James: I think that the biggest problem for me with using bioinformatics tools has been that the documentation is often insufficient. If someone doesn’t know how to use the tools, it’s very likely that they could get the wrong information, which could have severe and unintended consequences. That’s why I made it very clear how to use the information on the website I built. Clear documentation is honestly one of the things I find most impressive about working closely with the DNA.Land team.
Richard: How do you feel your experience at the New York Genome Center will inform your future studies and career?
James: Well, it’s definitely gotten me to learn to code faster. There are definitely new skills that I’ve learned from being here. I could definitely use this experience to move in the direction of the bioinformatics field.
Richard: And you’d be interested in doing that, now?
James: Oh yeah, definitely!
Richard: Along that line, what do you think is the most rewarding aspect of engaging in genetic research? Or, what has piqued your interest the most?
James: It’s the idea that this field is so new that we don’t really have an exact answer to most questions. That’s what I’ve gathered from all of the people that I’ve been in discussion with. I find that really interesting, because the field is full of a lot of possibility right now.
Richard: It’s exciting!
James: It is! We have no idea what to expect, but it feels like we’re on the verge of learning more than we could ever imagine.