Google Summer of Code 2018 @ Julia Computing — Report

I would like to thank my mentor Lyndon White for constantly supporting me and reviewing my code during the GSoC project. It was a privilege for me to work with the Julia community.

Project Overview

GitHub Repository: https://github.com/oxinabox/DataDepsGenerators.jl/

Every data scientist dreams of getting data onto their plate without much hassle, whether it is data for a new set of experiments or data needed to reproduce an existing result. Vandewalle et al. (2009) distinguish six degrees of reproducibility for scientific code. Achieving either of the two highest levels requires that “The results can be easily reproduced by an independent researcher with at most 15 min of user effort”. In our experience, one can often spend much of that time just setting up the data: reading the instructions, locating the download link, transferring the file to the right location, extracting an archive, and working out how to tell the script where the data is located. These tasks can be automated and therefore should be, both to save the user’s time and to remove the opportunity for mistakes, in line with the key practice identified by Wilson et al. (2014): “let the computer do the work”.

DataDeps.jl is a library for the Julia programming language that addresses exactly this problem. It uses a registration block, a chunk of Julia code that describes where the data can be downloaded, who created it, what the terms and conditions for its use are, and so on. The URLs in these blocks are used to download the data required to run the experiment. Writing a registration block by hand can be tedious, but a support package, DataDepsGenerators.jl, generates them for the most popular data repositories.

As part of my Google Summer of Code work, I added support for more repositories to DataDepsGenerators.jl.

At present, DataDepsGenerators.jl supports the UCI ML, GitHub, DataOne (both versions), DataDryad, CKAN, DataCite and Figshare repositories. Additionally, it supports extraction from the JSONLD metadata present on some pages. The design is easy to extend to new data repositories. We have also added multi-source generation for when the user isn’t sure which generator to use, or wants to reap the maximum benefit from all of them. In that case there is no need to specify the repository from which the metadata should be extracted: the program tries every available generator and combines the best of their results according to a defined set of rules.
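The idea of querying every generator and merging whatever comes back can be sketched in a few lines of plain Julia. This is only an illustration, not the package's own code: the three generator functions, the `try_generator` and `generate_from_all` helpers, and the "first generator to supply a field wins" merge rule are all assumptions made up for this example.

```julia
# Hypothetical stand-ins for generators: each takes an identifier and either
# returns a Dict of metadata fields or throws if it cannot handle the input.
figshare_like(id) = error("not a Figshare identifier")
datacite_like(id) = Dict(:title => "Example dataset", :author => "A. Author")
jsonld_like(id)   = Dict(:title => "Example dataset",
                         :description => "A longer abstract.",
                         :urls => ["https://example.org/data.csv"])

# Run one generator, turning any failure into `nothing`.
function try_generator(gen, id)
    try
        gen(id)
    catch
        nothing
    end
end

# Start every generator as its own task, wait for all of them, and merge
# the successful results. Merge rule (an assumption for this sketch):
# the first generator that provides a field wins.
function generate_from_all(generators, id)
    tasks = [@async try_generator(gen, id) for gen in generators]
    results = filter(!isnothing, fetch.(tasks))
    isempty(results) && error("no generator could handle $id")
    merged = Dict{Symbol,Any}()
    for r in results, (k, v) in r
        get!(merged, k, v)  # keep the existing value if the field is already set
    end
    merged
end

meta = generate_from_all([figshare_like, datacite_like, jsonld_like],
                         "10.1234/example")
```

Here the failing generator is simply skipped, and the merged result carries the author from one source and the description and URLs from another.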

Project Description

From a user’s point of view, all of DataDepsGenerators.jl revolves around a single function, generate(). Several generators can be passed to it to get the desired registration block for further use by DataDeps.jl. These are:

  • JSONLD_DOI() : For finding JSONLD through content negotiation using DOIs.
  • JSONLD_Web() : For parsing websites and finding content from JSONLD.
  • Figshare() : For getting data from Figshare.
  • DataCite() : For getting data from DataCite.
  • CKAN() : For getting data from CKAN.
  • ArcticDataCenter(), KnowledgeNetworkforBiocomplexity() : Part of the KNB network. These use DataOne version 2.
  • TERN() : Uses DataOne version 2.
  • DataOneV1() : DataDryad uses DataOne version 1.
  • DataDryadWeb() : For parsing DataDryad contents from their websites.
  • GitHub() : For getting data from GitHub-based repositories like fivethirtyeight.
  • UCI() : For getting data from UCI ML repositories.

An example of how to use these generators:

using DataDepsGenerators
generate(DataCite(), "10.5061/dryad.74699") |> print

will produce the following output:

register(DataDep(
"Data from Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation",
"""
Dataset: Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation
Website: https://doi.org/10.5061/dryad.74699
Author: Eric J. B. Von Wettberg et al.
Date of Publication: 2018
License: https://creativecommons.org/publicdomain/zero/1.0/
Please cite this paper: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu, L. B., Moenga, S. M., … Cook, D. R. (2018). Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nature Communications, 9(1). doi:10.1038/s41467-018-02867-z
Please cite this dataset: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu,L., Moenga, S. M., … Cook, D. R. (2018). Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation [Data set]. Dryad Digital Repository. https://doi.org/10.5061/dryad.74699
""",
missing,
))

As you might expect, remembering all of these generators is tedious. No worries: there is no need to specify the generator. DataDepsGenerators.jl will take care of that, and may even give you a better result. Check for yourself:

generate("10.5061/dryad.74699") |> print

The result:

register(DataDep(
"Data from Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation",
"""
Dataset: Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation
Website: http://dx.doi.org/10.5061/dryad.74699
Author: Eric J. B. Von Wettberg et al.
Date of Publication: 2018-02-27T21:46:38
License: http://creativecommons.org/publicdomain/zero/1.0
Domesticated species are impacted in unintended ways during domestication and breeding. Changes in the nature and intensity of selection impart genetic drift, reduce diversity, and increase the frequency of deleterious alleles. Such outcomes constrain our ability to expand the cultivation of crops into environments that differ from those under which domestication occurred. We address this need in chickpea, an important pulse legume, by harnessing the diversity of wild crop relatives. We document an extreme domestication-related genetic bottleneck and decipher the genetic history of wild populations. We provide evidence of ancestral adaptations for seed coat color crypsis, estimate theimpact of environment on genetic structure and trait values, and demonstrate variation between wild and cultivated accessions for agronomic properties. A resource of genotyped, association mapping progeny functionally links the wild and cultivated gene pools and is an essential resource chickpea for improvement, while our methods inform collection of other wild crop progenitor species.
Please cite this paper: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu, L. B., Moenga, S. M., … Cook, D. R. (2018). Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nature Communications, 9(1). doi:10.1038/s41467-018-02867-z
Please cite this dataset: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu,L., Moenga, S. M., … Cook, D. R. (2018). Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation [Data set]. Dryad Digital Repository. https://doi.org/10.5061/dryad.74699
""",
Any["/bitstream/handle/10255/dryad.166590/draft_dryad_upload_dec2017.xls?sequence=1"],
[(md5, "3c837a0a41e01966a0f037beb7db43b8")]
))

This result clearly carries much more information than the one produced by a single generator: it adds an abstract, a download path and a checksum.
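Both blocks above follow the same layout: a name, a triple-quoted message built from the metadata, and then the remote path (or missing when none was found). A minimal sketch of that final rendering step is given below. The `Metadata` struct, its field names and the `render_block` function are hypothetical and chosen for illustration only; the real package collects more fields and may escape the text differently.

```julia
# Hypothetical metadata container; the field names are illustrative only.
struct Metadata
    shortname::String
    fulltitle::String
    website::String
    author::String
    urls::Vector{String}
end

# Render the metadata as the text of a registration block.
# Note: this sketch does not escape `$` or backslashes inside the message.
function render_block(m::Metadata)
    # `missing` stands in for the remote path when no URL was found.
    remote = isempty(m.urls) ? "missing" : repr(m.urls)
    lines = [
        "register(DataDep(",
        repr(m.shortname) * ",",   # repr adds the quotes and escapes the name
        "\"\"\"",
        "Dataset: " * m.fulltitle,
        "Website: " * m.website,
        "Author: " * m.author,
        "\"\"\",",
        remote * ",",
        "))",
    ]
    join(lines, "\n")
end

block = render_block(Metadata("Example", "Example dataset",
                              "https://example.org", "A. Author", String[]))
print(block)
```

Keeping the rendering separate from the metadata lookup means every generator only has to fill in the fields it knows about.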

Project Laundry List

This section goes into the technical side of the work I did over the summer. It is here to record what I have done so far, and may not be of interest to people who only want to know about DataDepsGenerators.jl.

Pull Requests related to the Project

The following is a reverse-chronological list of commits and pull requests, each with a short description of the functionality it added:

(Work in Progress) Pull Requests

  • [WIP] Bump to Julia 1.0 : Julia 1.0 was released during JuliaCon’18. As it is a major release, there are several breaking differences, and work is still needed to adapt to them.
  • Implement Dataverse API : Dataverse is an open source web application to share, preserve, cite, explore and analyze research data. This PR was set aside midway to focus on other, more important work.

(Merged) Commits

  • Multisource data acquisition : Supports data acquisition from multiple sources. Given a URL, it checks all the generators available in DataDepsGenerators.jl. This helps particularly when a user is not sure which generator to use or wants to reap the maximum benefit from all of them. The downside is that it is slower than running a single, specific generator. We are making efforts to add asynchronous support (backed out for now due to a Julia 0.6 bug).
  • Overhaul Metadata for individual components : Done in order to make DataDepsGenerators.jl more modular and hence easily scalable for addition of new generators.
  • Using JSONLD to retrieve information : Many websites and repositories that distribute data embed JSONLD. Given how widespread JSONLD is, a generator that supports it is very helpful.
  • Add Figshare API : Add support for Figshare’s API
  • Update Gumbo; Fix parsing issue : The latest Gumbo release was broken: it could not parse XML and raised errors. We fixed this by reverting to the previous stable version.
  • Fix GitHub broken tests : Over the course of this project we observed that GitHub changes its URL structure quite often, which broke our reference tests. We therefore stripped the URLs from the reference tests so that such changes no longer break them.
  • Add DataCite API : Add support for DataCite’s API
  • Reword the APIs : The addition of new APIs caused conflicts with the existing API names, which this change fixed.
  • Add CKAN API : Add support for CKAN’s API
  • Change Register Block structure : DataDeps.jl changed the structure of the register blocks it accepts in its latest release. We updated the generated blocks to match.
  • Add Integration Tests : To check the correctness of the generated register blocks, they need to be run through DataDeps.jl, the eventual consumer of the blocks created by DataDepsGenerators.jl. We added integration tests to detect structural and similar errors in the register blocks.
  • Implement DataOneV2 abstraction : Add support for repositories supporting DataOne version 2 like KNB, TERN.
  • Implement DataDryadAPI : Add support for DataDryad’s API
  • Provision to add checksum in Register Blocks : Checksums are an important component when downloading datasets. We add a provision to include checksums in the registration blocks for further usage.
  • Adding build status badges : Minor update to put build status badges on the README page.
  • Fix Github.jl dataset urls : GitHub URLs keep changing, and this fixed the breakage that existed at the time. In a later PR, listed above, we stripped the URLs from the reference tests entirely.
  • Update .travis.yml to allow failures on julia nightly : Tests were failing at this stage because .travis.yml was not configured to ignore failures on the buggy nightly version.
  • Update references for fivethirtyeight and DataDryad : Minor update to fix reference tests.
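To illustrate why the checksum provision in the list above matters, here is a minimal sketch of verifying downloaded bytes against an expected digest. It uses SHA-256 from Julia's SHA standard library rather than the md5 shown in the generated block earlier, since md5 is not in the standard library. The `verify_checksum` helper and the sample payload are hypothetical and are not part of DataDeps.jl or DataDepsGenerators.jl.

```julia
using SHA  # Julia standard library

# Return true when the SHA-256 digest of `data` matches `expected_hex`.
verify_checksum(data::Vector{UInt8}, expected_hex::AbstractString) =
    bytes2hex(sha256(data)) == lowercase(expected_hex)

# Stand-in for the bytes of a downloaded file.
payload = Vector{UInt8}("example,data\n1,2\n")

# The digest a registration block would record for this payload.
digest = bytes2hex(sha256(payload))

# An intact download matches; a single changed byte does not.
ok = verify_checksum(payload, digest)
tampered = copy(payload)
tampered[1] = UInt8('X')
bad = verify_checksum(tampered, digest)
```

Recording the digest next to the URL lets the consumer detect a corrupted or silently changed file before any experiment runs on it.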

Pre-GSoC Commits