Google Summer of Code 2019 — Final Report

Deepak Kumar
Published in InterMine
Aug 23, 2019

Project: InterMine Schema Validator

Project Links

  1. Repository (dedicated to this project only): https://github.com/intermine/biovalidator
  2. Project’s User docs: https://github.com/intermine/biovalidator/wiki
  3. Project’s Javadoc: http://intermine.org/biovalidator/javadoc/

My Major Pull Requests and Commits:

  1. Pull Request #1 (API and FASTA Validator): https://github.com/intermine/biovalidator/pull/4
  2. Pull Request #2 (Biovalidator API Changes): https://github.com/intermine/biovalidator/pull/10
  3. Pull Request #3 (GFF3 and CSV validator): https://github.com/intermine/biovalidator/pull/26
  4. Pull Request #4 (VCF validator, not yet merged): https://github.com/intermine/biovalidator/pull/29
  5. All of my GitHub commits: https://github.com/intermine/biovalidator/commits?author=deepakkumar96

About the Project

InterMine works with many different biological data files, and it wants its users to be able to upload such files to InterMine directly from their browser. In this process, InterMine needs to make sure that each uploaded file is syntactically and semantically correct and follows its file format specification. The project’s main goal was therefore to design and implement a library that can validate different biological data files (such as FASTA and GFF), so that if a user submits a wrong file (e.g. a FASTA file that is supposed to contain DNA sequences but contains protein sequences instead), the library can detect the problem and report all errors and warnings found in the file.

Another task of this project was to create a command-line utility that validates biological files directly from the terminal, using the same Java library. The idea is that if a user gets an error while uploading files to InterMine, they can examine the file with the command-line utility and fix the errors and warnings it reports.

Project Requirements:

  • A standalone Java library for validating biological file formats
  • A command-line utility to validate file formats from terminal

Project Tasks:

  1. BioValidator API and FASTA validator, to validate FASTA files (DNA and protein sequences).
  2. GFF3 validator and CSV validator.
  3. VCF validator (bonus/extra task)
  4. Command-line utility for validating files from the terminal using the same library

1. BioValidator API implementation and FASTA Validator (see Pull Request #1 above)

1.1. API design and implementation

The first task was to design and implement a common validator API. This API defines how errors, warnings, and validation results are represented, and it abstracts out the common functionality of the project; each validator is an implementation of this API.

The API also defines how a user can customize validation: the user can enable or disable errors and warnings, choose strict or permissive mode, and specify other rules, such as whether to continue validation after an error has occurred.

API usage example:

[Embedded code snippet: Validator API usage]
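To make the description of the API more concrete, here is a minimal sketch of how a common validator API of this kind could be structured. It is not the actual biovalidator code: every class and method name (ValidationResult, ValidatorOptions, LineValidator, NonEmptyLineValidator) is hypothetical, and the meaning given to strict mode (warnings are reported as errors) is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// All names below are hypothetical; this is not the biovalidator API itself.

/** Collects the errors and warnings found during one validation run. */
final class ValidationResult {
    private final List<String> errors = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();

    void addError(String message) { errors.add(message); }
    void addWarning(String message) { warnings.add(message); }

    boolean isValid() { return errors.isEmpty(); }
    List<String> getErrors() { return Collections.unmodifiableList(errors); }
    List<String> getWarnings() { return Collections.unmodifiableList(warnings); }
}

/** User-adjustable rules. */
final class ValidatorOptions {
    boolean warningsEnabled = true;
    boolean strict = false;           // assumption: strict mode reports warnings as errors
    boolean stopAtFirstError = false; // whether to stop once an error has been found
}

/** Shared driver loop; each concrete validator only supplies checkLine. */
abstract class LineValidator {
    protected final ValidatorOptions options;

    LineValidator(ValidatorOptions options) { this.options = options; }

    protected abstract void checkLine(String line, int lineNo, ValidationResult result);

    /** Routes a warning according to the options. */
    protected void warn(ValidationResult result, String message) {
        if (options.strict) {
            result.addError(message);
        } else if (options.warningsEnabled) {
            result.addWarning(message);
        }
    }

    final ValidationResult validate(List<String> lines) {
        ValidationResult result = new ValidationResult();
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            checkLine(line, lineNo, result);
            if (options.stopAtFirstError && !result.isValid()) {
                break;
            }
        }
        return result;
    }
}

/** Toy validator: empty lines are errors, surrounding whitespace is a warning. */
final class NonEmptyLineValidator extends LineValidator {
    NonEmptyLineValidator(ValidatorOptions options) { super(options); }

    @Override
    protected void checkLine(String line, int lineNo, ValidationResult result) {
        if (line.isEmpty()) {
            result.addError("line " + lineNo + ": empty line");
        } else if (!line.equals(line.trim())) {
            warn(result, "line " + lineNo + ": leading or trailing whitespace");
        }
    }
}

final class ApiSketchDemo {
    public static void main(String[] args) {
        ValidatorOptions options = new ValidatorOptions();
        ValidationResult result =
            new NonEmptyLineValidator(options).validate(Arrays.asList("abc", "", "def "));
        System.out.println("valid: " + result.isValid());
        System.out.println("errors: " + result.getErrors());
        System.out.println("warnings: " + result.getWarnings());
    }
}
```

The point of the design is that the driver loop and the handling of options live in one place, so a format-specific validator only has to describe what is wrong with a given line.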

1.2. Fasta Validator

FASTA is a file format used to store either nucleotide sequences (i.e. DNA or RNA) or amino acid sequences (i.e. proteins). The FASTA format does not have a formal specification, but there are generally accepted rules that communities follow.

Types of FASTA validation mode

  • fasta: looks for formatting issues and checks for valid DNA or protein sequences
  • fasta-dna: validates whether a file has valid DNA sequences
  • fasta-protein: validates whether a file has valid protein sequences
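The difference between the three modes can be illustrated with a small sketch. It is not the biovalidator implementation: the class FastaSketch, its Mode enum, and the exact alphabets (IUPAC nucleotide codes for DNA, any letter plus '*' and '-' for protein) are assumptions chosen to keep the example short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and alphabets are assumptions, not the library's rules.
final class FastaSketch {
    enum Mode { FASTA, FASTA_DNA, FASTA_PROTEIN }

    // IUPAC nucleotide codes (upper or lower case) plus the gap character.
    private static final Pattern DNA =
        Pattern.compile("[ACGTUNRYKMSWBDHVacgtunrykmswbdhv-]+");
    // Any letter plus '*' (stop) and the gap character.
    private static final Pattern PROTEIN = Pattern.compile("[A-Za-z*-]+");

    /** Returns true if the lines form a FASTA record set that is valid for the given mode. */
    static boolean isValid(List<String> lines, Mode mode) {
        boolean seenHeader = false;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {      // header line of a record
                seenHeader = true;
                continue;
            }
            if (!seenHeader) {
                return false;                // sequence data before any header
            }
            boolean dna = DNA.matcher(line).matches();
            boolean protein = PROTEIN.matcher(line).matches();
            switch (mode) {
                case FASTA_DNA:
                    if (!dna) return false;
                    break;
                case FASTA_PROTEIN:
                    if (!protein) return false;
                    break;
                default:                     // generic mode accepts either alphabet
                    if (!dna && !protein) return false;
            }
        }
        return seenHeader;
    }

    public static void main(String[] args) {
        List<String> proteinFile = Arrays.asList(">p1", "MKVLQE");
        System.out.println(isValid(proteinFile, Mode.FASTA_DNA));
        System.out.println(isValid(proteinFile, Mode.FASTA_PROTEIN));
    }
}
```

Note that every DNA letter is also a legal amino-acid letter, so in this sketch a pure DNA file passes the protein check as well; a real validator would need extra heuristics to tell the two apart.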

Using the FASTA validator with the Java library:

[Embedded code snippet: FASTA validator usage example]

Using the FASTA validator with the command-line tool:

$ java -jar biovalidator-fat.1.2.jar -f=filepath -t fasta

2. GFF3 and CSV validator (see Pull Request #3 above)

2.1. GFF3 validator

The goal of this task was to create a validator for the GFF3 file format (not any previous version of GFF). Unlike FASTA, GFF3 has a formal specification, and the GFF3 validator follows the rules given by that specification. This task was somewhat more difficult than FASTA, as GFF3 has many rules that require a lot of string manipulation, which created performance and memory consumption issues.

GFF3 validation can be used the same way as FASTA validation, either from the Java library or from the terminal using the biovalidator jar file.
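To give a feel for the kind of rules involved, here is a minimal sketch that checks a single GFF3 feature line against a few rules from the GFF3 specification: nine tab-separated columns, positive start and end with start not greater than end, a numeric score or ".", a strand of "+", "-", "." or "?", and a phase of 0, 1, 2 or "." (required for CDS features). The class Gff3LineSketch and its method names are hypothetical, and this covers only a small subset of what a full validator has to check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch covering only a handful of GFF3 rules.
final class Gff3LineSketch {

    /** Returns a list of problems found in one line; empty if none were found. */
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        if (line.isEmpty() || line.startsWith("#")) {
            return problems;                       // comments and directives are skipped here
        }
        String[] cols = line.split("\t", -1);      // -1 keeps trailing empty columns
        if (cols.length != 9) {
            problems.add("expected 9 tab-separated columns, found " + cols.length);
            return problems;
        }
        long start = parsePositive(cols[3]);
        long end = parsePositive(cols[4]);
        if (start < 0) problems.add("start is not a positive integer: " + cols[3]);
        if (end < 0) problems.add("end is not a positive integer: " + cols[4]);
        if (start > 0 && end > 0 && start > end) {
            problems.add("start is greater than end");
        }
        if (!cols[5].equals(".") && !isNumber(cols[5])) {
            problems.add("score must be a number or '.': " + cols[5]);
        }
        if (!cols[6].matches("[+\\-.?]")) {
            problems.add("strand must be one of + - . ?: " + cols[6]);
        }
        if (!cols[7].matches("[012.]")) {
            problems.add("phase must be 0, 1, 2 or '.': " + cols[7]);
        } else if (cols[2].equals("CDS") && cols[7].equals(".")) {
            problems.add("CDS features require a phase");
        }
        return problems;
    }

    /** Parses a positive integer, returning -1 if the text is not one. */
    private static long parsePositive(String s) {
        try {
            long value = Long.parseLong(s);
            return value > 0 ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    private static boolean isNumber(String s) {
        try {
            Double.parseDouble(s);   // lenient: also accepts forms such as "1e-5"
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(check("chr1\tsrc\tgene\t300\t200\t.\t+\t.\tID=gene1"));
    }
}
```

Splitting every line into columns like this is exactly the kind of string manipulation that becomes expensive on large files, which is where the performance and memory concerns mentioned above come from.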

2.2. CSV/TSV Validator

The third task was to create a validator for CSV/TSV files. Biological data can also be represented as CSV/TSV, and because CSV files are plain text files that can contain any type of data, the goal of this task was to create a validator that checks the consistency of a CSV or TSV file.

The CSV validator does not validate against a schema; rather, it validates the consistency of the CSV data by checking whether each column is consistent across all rows.

CSV validator Features:

  • Performs consistency checks on CSV/TSV data
  • Detects the CSV delimiter automatically
  • Detects whether the CSV data has a header line (so a file with or without a header will work in most cases)
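As an illustration of automatic delimiter detection, here is one simple heuristic: pick the candidate character that occurs the same non-zero number of times on every line. This is only a sketch of one possible approach, not necessarily the method used by the CSV validator; the class DelimiterSketch and the candidate list are assumptions, and quoted fields are ignored for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical heuristic; the real validator may detect delimiters differently.
final class DelimiterSketch {
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /**
     * Returns the candidate that occurs the same non-zero number of times on
     * every line, preferring the highest count, or '\0' if there is none.
     */
    static char detect(List<String> lines) {
        char best = '\0';
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = true;
            for (String line : lines) {
                int n = count(line, candidate);
                if (expected < 0) {
                    expected = n;            // first line sets the expected count
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        char d = detect(Arrays.asList("id,name,length", "1,geneA,1200"));
        System.out.println("detected delimiter: '" + d + "'");
    }
}
```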

Two types of checks performed by the CSV/TSV validator:

  1. If a column holds one particular type of data (such as integer, boolean, etc.), the validator checks whether all rows have that same type of data.
  2. If a column holds mixed kinds of data, the validator checks whether all rows of that column follow one or more patterns.
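The two checks can be sketched roughly as follows. This is an illustration under stated assumptions rather than the validator's actual logic: the class ColumnTypeSketch, the set of recognized types, the use of a majority type to flag outliers, and the idea of reducing a value to a "shape" (digits become 9, letters become A) as a stand-in for pattern matching are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of column consistency checks.
final class ColumnTypeSketch {
    enum Type { INTEGER, DECIMAL, BOOLEAN, TEXT }

    /** Classifies a single cell value. */
    static Type typeOf(String value) {
        if (value.matches("[+-]?\\d+")) return Type.INTEGER;
        if (value.matches("[+-]?\\d*\\.\\d+")) return Type.DECIMAL;
        if (value.equalsIgnoreCase("true") || value.equalsIgnoreCase("false")) {
            return Type.BOOLEAN;
        }
        return Type.TEXT;
    }

    /** Check 1: returns the 1-based rows whose type differs from the column's majority type. */
    static List<Integer> inconsistentRows(List<String[]> rows, int column) {
        Map<Type, Integer> counts = new EnumMap<>(Type.class);
        for (String[] row : rows) {
            counts.merge(typeOf(row[column]), 1, Integer::sum);
        }
        Type majority = null;
        int best = 0;
        for (Map.Entry<Type, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) {
                majority = entry.getKey();
                best = entry.getValue();
            }
        }
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (typeOf(rows.get(i)[column]) != majority) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    /** Check 2 helper: reduces a mixed value to its shape, e.g. "AB-123" becomes "AA-999". */
    static String shapeOf(String value) {
        StringBuilder shape = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) {
                shape.append('9');
            } else if (Character.isLetter(c)) {
                shape.append('A');
            } else {
                shape.append(c);             // punctuation is kept as-is
            }
        }
        return shape.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "AB-123"},
            new String[] {"2", "CD-456"},
            new String[] {"abc", "EF-789"});
        System.out.println("inconsistent rows in column 0: " + inconsistentRows(rows, 0));
        System.out.println("shape of column 1 values: " + shapeOf(rows.get(0)[1]));
    }
}
```

Comparing shapes instead of raw values is one way to decide whether a mixed column still follows a small number of patterns, for example identifiers that always look like two letters, a dash, and three digits.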

CSV validator example (through the command-line utility):

[Figure: an example of running the CSV validator over a CSV file]

3. VCF validator (not yet merged; see Pull Request #4 above)

This was the last task; it is complete but not yet merged. The goal of this task was to create a validator for VCF files.

4. Command-line utility:

The main goal of this project was to create a Java library for validating various biological files, but a further requirement was a command-line utility for validating files from the terminal.

There are two main reasons for the command-line utility:

  1. If a user gets an error while uploading a file to InterMine, they can use the command-line tool to get all the detailed errors and warnings in their terminal.
  2. The command-line utility is useful for a biologist or any non-technical user who just wants to know whether a file is valid or not.

Command-line utility example run:

[Figure: command-line utility example run]

User Documentation:

https://github.com/intermine/biovalidator/wiki

The documentation is written from a user’s perspective and explains how to validate the supported file formats using both the Java library and the command-line utility.

It also lists all the validation rules that each validator follows when validating a biological file.

Project’s JavaDocs:

http://intermine.org/biovalidator/javadoc/

Future plans on the project:

  1. Adding more biological file formats
  2. Improving the text analysis done by the CSV validator (e.g. improving the CSV validator for common files, not just scientific data)
  3. Identifying the type of file from its content (e.g. whether the file has DNA or protein sequences)
  4. Improving the performance of the validators
