Data Retriever: Final Report
Achievements, Lessons and Future Work.
Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores the data in a local database or as flat files. To be precise, it is a package manager for data. This allows data analysts to spend the majority of their time analyzing data rather than cleaning it up or managing it.
Although Data Retriever is written purely in Python, it lacked a Python interface. Since Python is one of the leading languages for data mining and wrangling, a Python interface makes the Data Retriever functionality easier to access and to scale. With the interface, all the functionality of the CLI is directly accessible from Python code, removing the hassle of any intermediary preprocessing.
Alongside the Python interface, a Julia interface was also considered. Julia is a new language sharpened for statistical and data analysis, and its growing user base was a good prospect for the community to reach out to. Providing a package for Julia users would add to the meager resources currently available to them.
The summer project aimed to create a Python interface for the existing CLI Data Retriever module, implementing all the important functions for the Retriever datasets, along with a wrapper class in Julia and integrated documentation for both.
The Python interface supports 12 important functions for listing, querying, downloading and installing datasets. Here is an example of installing the wine quality dataset by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. The installation is being done in
For complete usage details, see the Retriever documentation.
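To make the installation step more concrete, here is a minimal sketch of how such a call could be routed from Python code. The helper `install_dataset` is hypothetical and not part of Retriever. It assumes the interface exposes one function per output format, named `install_<engine>` (for example `install_csv`), and that the wine quality dataset is registered under the name `wine-quality`. Both assumptions should be checked against the documentation for your installed version.

```python
def install_dataset(backend, name, engine="csv", **options):
    """Hypothetical helper: call backend.install_<engine>(name, **options).

    `backend` is expected to be the imported `retriever` module, or any
    object exposing functions with the same naming scheme.
    """
    installer = getattr(backend, "install_" + engine, None)
    if installer is None:
        raise ValueError("unsupported engine: " + engine)
    return installer(name, **options)


# With the retriever package installed, the call would look like this
# (it downloads data, so it is left commented out here):
#
#   import retriever as rt
#   install_dataset(rt, "wine-quality", engine="csv")
```

Passing the module as an argument keeps the helper easy to exercise with a stand-in object, so the dispatch logic can be checked without network access.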
The Julia wrapper supports all the functionality offered by the Python interface. All function names remain the same, and the functionality mirrors that of the Python interface.
For complete usage details, see the Julia Retriever documentation.
Having been involved in packaging, testing, and creating the module, I see some key areas that would test the module further and require work:
- Adding more integration and regression tests for the module.
- Scalability improvements as the number of Retriever scripts increases.
- Cross-platform testing and multi-threaded support.
A Final Word
I’m absolutely thrilled to have been given the opportunity to be a part of this adventure! I learned a lot: about coding, open source, and myself. I’d like to thank Google for creating this wonderful program, and lastly I’d like to thank all the mentors in my organization for being helpful and patient with me: Henry Senyondo, Ethan White, Andrew Zhang and everyone from the community. Thank you all!