Google Summer of Code 2018 @ Julia Computing — Report

I would like to thank my mentor Lyndon White for constantly supporting me and reviewing my code during the GSoC project. It was a privilege for me to work with the Julia community.

Project Overview

GitHub Repository: https://github.com/oxinabox/DataDepsGenerators.jl/

Project Description

From a user’s point of view, DataDepsGenerators.jl revolves around a single method, generate(). Several generators can be passed to it to produce the register block that DataDeps.jl then consumes. These are:

  • JSONLD_Web() : For parsing websites and finding content from JSONLD.
  • Figshare() : For getting data from Figshare.
  • DataCite() : For getting data from DataCite.
  • CKAN() : For getting data from CKAN.
  • ArcticDataCenter(), KnowledgeNetworkforBiocomplexity() : Part of the KNB network. These use DataOne version 2.
  • TERN() : Uses DataOne version 2.
  • DataOneV1() : DataDryad uses DataOne version 1.
  • DataDryadWeb() : For parsing DataDryad contents from their websites.
  • GitHub() : For getting data from GitHub based repositories like fivethirtyeight.
  • UCI() : For getting data from UCI ML repositories.

For example, with a specific generator:

generate(DataCite(), "10.5061/dryad.74699") |> print
register(DataDep(
"Data from Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation",
"""
Dataset: Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation
Website: https://doi.org/10.5061/dryad.74699
Author: Eric J. B. Von Wettberg et al.
Date of Publication: 2018
License: https://creativecommons.org/publicdomain/zero/1.0/
Please cite this paper: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu, L. B., Moenga, S. M., … Cook, D. R. (2018). Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nature Communications, 9(1). doi:10.1038/s41467-018-02867-zPlease cite this dataset: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu,L., Moenga, S. M., … Cook, D. R. (2018). Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation [Data set]. Dryad Digital Repository. https://doi.org/10.5061/dryad.74699
""",
missing,
))

Without specifying a generator, every available generator is tried:

generate("10.5061/dryad.74699") |> print
register(DataDep(
"Data from Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation",
"""
Dataset: Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation
Website: http://dx.doi.org/10.5061/dryad.74699
Author: Eric J. B. Von Wettberg et al.
Date of Publication: 2018-02-27T21:46:38
License: http://creativecommons.org/publicdomain/zero/1.0
Domesticated species are impacted in unintended ways during domestication and breeding. Changes in the nature and intensity of selection impart genetic drift, reduce diversity, and increase the frequency of deleterious alleles. Such outcomes constrain our ability to expand the cultivation of crops into environments that differ from those under which domestication occurred. We address this need in chickpea, an important pulse legume, by harnessing the diversity of wild crop relatives. We document an extreme domestication-related genetic bottleneck and decipher the genetic history of wild populations. We provide evidence of ancestral adaptations for seed coat color crypsis, estimate theimpact of environment on genetic structure and trait values, and demonstrate variation between wild and cultivated accessions for agronomic properties. A resource of genotyped, association mapping progeny functionally links the wild and cultivated gene pools and is an essential resource chickpea for improvement, while our methods inform collection of other wild crop progenitor species.Please cite this paper: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu, L. B., Moenga, S. M., … Cook, D. R. (2018). Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation. Nature Communications, 9(1). doi:10.1038/s41467-018-02867-zPlease cite this dataset: Von Wettberg, E. J. B., Chang, P. L., Başdemir, F., Carrasquila-Garcia, N., Korbu,L., Moenga, S. M., … Cook, D. R. (2018). Data from: Ecology and genomics of an important crop wild relative as a prelude to agricultural innovation [Data set]. Dryad Digital Repository. https://doi.org/10.5061/dryad.74699
""",
Any["/bitstream/handle/10255/dryad.166590/draft_dryad_upload_dec2017.xls?sequence=1"],
[(md5, "3c837a0a41e01966a0f037beb7db43b8")]
))
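
The last argument of the register block above pairs a hash function with the expected hex digest of the downloaded file. As a rough illustration of what such a check involves, here is a minimal sketch. It is not the DataDeps.jl implementation: the helper name matches_checksum is hypothetical, and it uses SHA-256 from Julia's standard library instead of MD5 so that it needs no extra packages.

```julia
using SHA  # standard library; provides sha256

# Hypothetical helper (not part of DataDeps.jl): hash the file at `path`
# with `hashfun` and compare the result with the expected hex digest.
function matches_checksum(path::AbstractString, hashfun, expected_hex::AbstractString)
    actual = open(path) do io
        bytes2hex(hashfun(io))
    end
    return lowercase(actual) == lowercase(expected_hex)
end

# Usage: write a small temporary file and check it against a known digest.
path, io = mktemp()
write(io, "hello")
close(io)

expected = bytes2hex(sha256(Vector{UInt8}("hello")))
println(matches_checksum(path, sha256, expected))  # digest matches
println(matches_checksum(path, sha256, "00"))      # digest does not match
```

Taking the hash function as an argument mirrors the (function, digest) shape of the register block, so the same helper would work with any function that accepts an IO stream and returns bytes.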

Project Laundry List

This section covers the technical side of the work I did over the summer. It is a record of what has been done so far, and may not be of interest to readers who only want to know about DataDepsGenerators.jl.

Pull Requests related to the Project

The following is the chronological list of commits and pull requests, each with a short description of the functionality it added:

(Work in Progress) Pull Requests

  • [WIP] Bump to Julia 1.0 : Julia 1.0 was released during JuliaCon’18. Being a major release, it differs from earlier versions in several ways, and the package still needs to be updated accordingly.
  • Implement Dataverse API : Dataverse is an open source web application to share, preserve, cite, explore and analyze research data. This PR was put on hold partway through so that I could focus on other, more important work.

(Merged) Commits

  • Multisource data acquisition : Supports data acquisition from multiple sources. Given a URL, it checks all the generators available in DataDepsGenerators.jl. This helps particularly when a user is not sure which generator to use, or wants to get the maximum benefit from all the available generators. The downside is that it is slower than using a specific generator. We are working on adding asynchronous support (an earlier attempt was backed out due to a Julia 0.6 bug).
  • Overhaul Metadata for individual components : Done in order to make DataDepsGenerators.jl more modular and hence easily scalable for addition of new generators.
  • Using JSONLD to retrieve information : Many websites and repositories that distribute data embed JSONLD. Given how widespread JSONLD is, a generator supporting it is very helpful.
  • Add Figshare API : Add support for Figshare’s API
  • Update Gumbo; Fix parsing issue : The latest release of Gumbo was broken: it could not parse XML and raised errors. We fixed this by reverting to the previous stable version.
  • Fix GitHub broken tests : Over the course of this project we observed that GitHub changes its URL structure quite often, which broke our reference tests. We therefore stripped the URLs from the tests so that they no longer break.
  • Add DataCite API : Add support for DataCite’s API
  • Reword the APIs : The addition of new APIs caused conflicts with the existing API names, which were fixed.
  • Add CKAN API : Add support for CKAN’s API
  • Change Register Block structure : DataDeps.jl changed the code structure of the input register blocks in its latest release. Updated the structure in order to match with the latest release.
  • Add Integration Tests : To check the correctness of the generated register blocks, they need to be run through DataDeps.jl, the eventual consumer of the register blocks created by DataDepsGenerators.jl. We added integration tests to detect structural and similar errors in the register blocks.
  • Implement DataOneV2 abstraction : Add support for repositories supporting DataOne version 2 like KNB, TERN.
  • Implement DataDryadAPI : Add support for DataDryad’s API
  • Provision to add checksum in Register Blocks : Checksums are an important component when downloading datasets. We added a provision to include checksums in the register blocks for further usage.
  • Adding build status badges : Minor update to put build status badges on the README page.
  • Fix Github.jl dataset urls : GitHub URLs keep changing, and this fixed the issue that existed at the time. However, in a later PR listed above, we stripped the URLs from the reference tests completely.
  • Update .travis.yml to allow failures on julia nightly : Tests were failing at this stage because .travis.yml was not configured to ignore failures on the buggy nightly version.
  • Update references for fivethirtyeight and DataDryad : Minor update to fix reference tests.
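
The multisource acquisition listed above (try every generator, keep whatever each one can provide) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in DataDepsGenerators.jl: the functions source_a, source_b and source_c are hypothetical stand-ins for generators, gather_metadata is a made-up name, and the identifier is a placeholder.

```julia
# Hypothetical stand-ins for generators: each takes an identifier and either
# returns a Dict of metadata fields or throws if it cannot handle it.
source_a(id) = Dict("title" => "Example dataset", "license" => missing)
source_b(id) = error("identifier not recognised")
source_c(id) = Dict("title" => "Example dataset (mirror)", "license" => "CC0")

# Try every source in order, skip the ones that fail, and keep the first
# non-missing value seen for each field.
function gather_metadata(sources, id)
    merged = Dict{String,Any}()
    for source in sources
        result = try
            source(id)
        catch
            continue  # this source could not handle the identifier
        end
        for (field, value) in result
            if ismissing(get(merged, field, missing))
                merged[field] = value
            end
        end
    end
    return merged
end

metadata = gather_metadata([source_a, source_b, source_c], "10.0000/example")
println(metadata["title"])
println(metadata["license"])
```

Because every source is queried one after another, the total time is the sum of all requests, which is the slowdown mentioned above and the reason asynchronous support is attractive.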

Pre-GSoC Commits

Google Summer of Code @mozilla | BITS Pilani
