The Don Quixote of Data

Starting From the End
On reading back the first draft of this article, it did come over as a little negative, and perhaps a little jaded too, if I am being honest. I had followed on wholeheartedly from my OKHE1 post, but somehow I had become lost in the narrative: I had begun to slay dragons and fight giants.
Yes, I had become the Don Quixote of data.
With no Sancho Panza by my side I had fought every windmill and become lost doing so. You see, I was trying to write the piece from two different time frames and perspectives: that of an experimental officer (EO) working in a University department (the present), whilst drawing on my experiences as an undergraduate student in the early 1990s (the distant past). By doing this I was hoping to make a connection with a greater audience, but just like Don Quixote I had become embattled with imaginary enemies whilst en route to share my tale and link it all to the chivalrous goal of open knowledge in research.
I am a crystallographer, and therefore to me not all research data is created equal: the most coveted form is sourced from X-ray crystallography.
A Biased View of the Problem
Let's play a thought experiment. You have a £350K magic machine that can take a single crystal (SMX) and tell you its crystal structure: the positions of the atoms or ions that make up its very fabric. Or it can take a small amount of powder or solid and give you its phase ID (PXRD).

Now each measurement costs you time and resources, so you really only ever want to collect new things. But I have good news: each time you do a measurement you get a fingerprint which is unique, so you can search-match and know within 5 minutes whether you have something new and novel or old and boring. Cool, right?
Now scroll back 30 years: each SMX measurement takes a week or more, and probably 10 or so measurements would be sufficient to build an entire thesis. So the obvious thing to do would be to start a library of these fingerprints, a database of crystallographic information, and some very clever people did just that.

For single crystal data we can use the unit cell as a unique identifier, just like a DOI for an article or an ISBN for a book. The databases of choice were:
The Cambridge Structural Database (CSD) provided by the Cambridge Crystallographic Data Centre (CCDC) for organic and organometallics or,
Inorganic Crystal Structure Database (ICSD) from FIZ Karlsruhe
and as an undergraduate my access was via the Chemical Database Service (CDS) at Daresbury Laboratory.
The key point here is that access was via another service: very few crystallography departments could afford to purchase the subscription for the CSD or ICSD directly.
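As a rough illustration of what using the unit cell as a lookup key might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `UnitCell` class, the tolerances, and the tiny library with its made-up reference codes. A real search would first reduce each cell to a standard setting, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCell:
    a: float      # cell lengths in angstroms
    b: float
    c: float
    alpha: float  # cell angles in degrees
    beta: float
    gamma: float


def cells_match(x, y, length_tol=0.01, angle_tol=0.5):
    """True if lengths agree within a fractional tolerance and angles within degrees."""
    lengths_ok = all(
        abs(p - q) <= length_tol * q
        for p, q in zip((x.a, x.b, x.c), (y.a, y.b, y.c))
    )
    angles_ok = all(
        abs(p - q) <= angle_tol
        for p, q in zip((x.alpha, x.beta, x.gamma), (y.alpha, y.beta, y.gamma))
    )
    return lengths_ok and angles_ok


# Hypothetical library: the reference codes and cells are illustrative only.
LIBRARY = {
    "HYPO-0001": UnitCell(5.640, 5.640, 5.640, 90.0, 90.0, 90.0),
    "HYPO-0002": UnitCell(4.913, 4.913, 5.405, 90.0, 90.0, 120.0),
}


def search(cell, library=LIBRARY):
    """Return the reference codes of every library cell that matches the query."""
    return [code for code, ref in library.items() if cells_match(cell, ref)]
```

Calling `search(UnitCell(5.642, 5.638, 5.641, 90.0, 90.1, 89.9))` would return any library codes whose cell lies within the tolerances, and an empty list would suggest something new.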
For a powder pattern we can take the raw data: the 3 strongest lines and their positions. Again, the database of choice was the Powder Diffraction File (PDF), accessed via the JCPDS/ICDD and something you bought with your powder diffractometer if you had the funding. The alternative was a much more complicated route of converting the SMX data via the CDS into a powder pattern for matching.
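The three-strongest-lines idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the function names, the reference entries ("phase A", "phase B") and the 0.02 Å tolerance are all invented for illustration and do not come from the PDF or any real search-match software.

```python
def strongest_lines(pattern, n=3):
    """Return the d-spacings of the n most intense peaks, strongest first.

    pattern is a list of (d_spacing_in_angstroms, relative_intensity) pairs.
    """
    ranked = sorted(pattern, key=lambda peak: peak[1], reverse=True)
    return [d for d, _ in ranked[:n]]


def lines_match(measured, reference, tol=0.02):
    """Compare two sets of d-spacings, ignoring the order of intensities."""
    m, r = sorted(measured), sorted(reference)
    return len(m) == len(r) and all(abs(p - q) <= tol for p, q in zip(m, r))


# Hypothetical reference entries keyed by phase name (illustrative values only).
REFERENCE = {
    "phase A": [2.82, 1.99, 1.63],
    "phase B": [3.34, 4.26, 1.82],
}


def identify(pattern, reference=REFERENCE, tol=0.02):
    """Return the names of reference phases whose strongest lines match the pattern."""
    key = strongest_lines(pattern)
    return [name for name, lines in reference.items() if lines_match(key, lines, tol)]
```

Feeding `identify` a measured list of (d-spacing, intensity) pairs returns the candidate phases; an empty list means nothing in this toy library matches.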

What was true then, and still is today, is that to be effective you need easy and efficient access to those databases. However, the access is not actually as straightforward as you may have hoped. The knowledge is sadly not open.
Crystallography Suffers from Open Access Issues
Other authors on the OKHE have highlighted how the academic publishing environment is changing and how difficult it can be to be “open”.
For publishing this can be summarised:
Money from taxpayers funds research councils, who fund research by academics, who generate results and write papers. These are sent to publishing houses, who send them back to academics for review (for free), finally publish them in their journals and then charge, in one form or another, for everyone to gain access to those articles.
The irony is that the published articles are, or have been, used to rank Universities (REF) and as a measure to aid in securing more funding from those very research councils.
Show me the Windmills?
You're probably asking yourself at this point, “So what is the problem?” Unlike journal articles, there is no friendly University library funding (~£6m/year) to give access to crystallographic data; perhaps there should be? Although, reading Helen Dobson’s recent blog, that would probably place an even greater strain on the limited open access (OA) resources.
Would you pay to deposit, access or both?
In some cases, e.g., the UK, there is national access, such as EPSRC’s National Chemical Database Service (CDS) (established at Daresbury and now with the Royal Society of Chemistry), where the research council does provide access to some of the major chemical and crystallographic databases. Access is limited, just as with the journals: not all libraries subscribe to all possible journals, and in the case of the CDS not all crystallographic data is available, e.g., there is no direct access to the PDF.
For crystallographic data it has become about the databases rather than the journals. We need to have OA databases rather than OA publications. The database entries themselves are not a requirement for funding but are indirectly required in order to publish, and there's the rub, as they say. The funding body (a proxy for the taxpayer) has paid for the research; to publish that research, a structure must be deposited in the database as a prerequisite of the journal; but access to that database is licensed.
Access to the database is then paid for, in the UK, by the EPSRC for national coverage, by the University or in some cases by the researcher or group. Hang on, haven’t they paid twice for the same information? Possibly a third time for the journal article to be made open access as well? Plus the RSC hosts the current CDS, which means a publishing house is also being paid to host the national access to the databases?
Rules of Engagement
So in order to publish my gold OA journal article I must first deposit my structure into a closed, licensed database and receive a unique reference code. This code must be submitted with my manuscript in order for it to be accepted. The guides for authors for both the RSC and the American Chemical Society make this quite clear.
It is true that the structural information may appear in my supporting information and can be accessed from the database by knowing the reference code from the paper. However, what if you wanted to know about any derivatives of that work? Any substructures? Sadly you cannot find what you do not already know exists, and, well, that is just a bit rubbish.
So although it could be said there is an element of open access to that data, it is obfuscated from the end user.
To be truly open what you, what I, need is a portal where we can enter a set of unit cell parameters or draw the structural motif of interest and search for all the structures that contain that motif or cell. What we need is an OA version of CrystalWorks.
Now at this point something interesting happened which sort of disrupted my narrative a little: a post made by the CCDC seemed to be the answer to all, OK some, of my problems. The final line of the first paragraph made my heart race: “explore all chemical structures for free worldwide”. Wow, that is just what we need! So I followed the link, foolishly whilst I was on my University computer, and lo and behold I could now search both the ICSD and the CSD databases using text and structural models.
All Was Not as it First Appeared
We are surrounded by data. It pours out of every bit of electronic tat we carry around with us. We no longer need to retain important salient facts because we can quickly pull out our smart phone and search Google, Ask Jeeves (historic reference) or even Bing ourselves to an answer.
However, if you walk away from the University network and leave your Eduroam connection, you suddenly discover how little actual journal access you have. Now walk away further, leave articles behind and search for the raw data and, well, you have pretty much hit a brick wall, let alone a paywall.
To understand why this limitation matters I really need to describe a little bit more about the databases and their histories. Fortunately there is a rather nice OA article by Bruno et al. entitled Crystallography and Databases which does exactly that, so I can paraphrase and focus on the COD, CSD and ICSD.
What is Crystallographic Data & How is it Stored?
Storage
In the 1990s a universal data interchange format called the Crystallographic Information File (CIF) was created as part of the work of the IUCr Working Party on Crystallographic Information. The simple text file contains all the information in an open format, with a series of open dictionaries for syntax checking. This allowed all crystallographic, database and modelling software to read and write a common file.
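To give a feel for how simple the format is, here is a small Python sketch that pulls the unit cell out of a CIF-style text. The sample data block and the helper names (`read_simple_items`, `to_float`) are made up for illustration, and the reader is deliberately naive: it only handles single-line `_tag value` pairs and is no substitute for a proper CIF parser that understands loops, multi-line values and dictionary validation.

```python
import re

# A made-up data block using standard CIF cell tags; the values are illustrative.
SAMPLE_CIF = """\
data_example
_cell_length_a    5.6402(3)
_cell_length_b    5.6402(3)
_cell_length_c    5.6402(3)
_cell_angle_alpha 90
_cell_angle_beta  90
_cell_angle_gamma 90
_symmetry_space_group_name_H-M 'F m -3 m'
"""


def read_simple_items(text):
    """Collect single-line `_tag value` pairs; loops and multi-line values are skipped."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Remove the quotes that CIF uses around values containing spaces.
            items[parts[0]] = parts[1].strip("'\"")
    return items


def to_float(value):
    """Drop a trailing standard uncertainty such as '(3)' and convert to float."""
    return float(re.sub(r"\(\d+\)$", "", value))


items = read_simple_items(SAMPLE_CIF)
cell_lengths = [to_float(items[f"_cell_length_{axis}"]) for axis in "abc"]
```

Because every tag is plain text, the same few lines work regardless of which program or database wrote the file, which is the whole point of a common interchange format.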
Separation
The databases themselves segregated the data: the CSD holds organic and organometallic structures, while the ICSD was built from inorganic, ceramic and, more recently, metallic materials.
The CCDC
There is an interesting quotation which appears a lot when the CCDC is cited:
“a passionate belief that the collective use of data would lead to the discovery of new knowledge which transcends the results of individual experiments”
Olga Kennard & J. D. Bernal
With such a fantastic quotation, OA and distribution of the ~950,000 crystallographic structures is implied by the very people who laid the foundations the CCDC was built upon.
However, set this against the CCDC usage statement and data access policies, which seem somehow to go against a free and OA environment:
“…These services must not be used to systematically download or redistribute these structures, data or associated information.
Programmatic access to these services is not permitted.”
The ICSD
There appears to have been a more community-driven growth of the ICSD, with a lot of the original search software and web interfaces coming from third parties and being available as free downloads with a teaching database. Data curation and completeness were a driving force, and something that was balanced against the cost of distribution. Data acquisition is by direct submission but also by scanning existing publications and extracting crystallographic data and associated metadata, giving ~200,000 structures.
COD: the Crystallography Open Database
COD was founded in February 2003, inspired by an email sent by Michael Berndt to an open and now defunct Yahoo group. The ethos of the database was to have a single repository for CIFs from all streams of research and to be open and publicly accessible, downloadable and searchable; it currently holds ~400,000 entries. A truly noble goal.
In an article by Gražulis et al. published in 2009 it is clear that the changing nature of research, becoming cross-divisional so that one research group or even one publication could be based around organic, inorganic and ceramic materials, meant that the segregation of crystallographic data into lots of silos was no longer a logical or efficient model.

Back in 2008 I even wrote a small piece about COD and the CDS on my forum and a year later another author picked up on the importance of COD in his post.
So when you read the following:
“Recent advances in chemistry have meant that the distinctions between inorganic and organic structures have become blurred, for instance through research to design new batteries, gas storage systems, zeolites, catalysts, magnets, and fuel additives. This, coupled with the desire from researchers for more integrated databases, has been the driving force behind the development of these joint services.
As a result, researchers and educators worldwide, working across all fields of chemistry, are able to explore over one million crystallographic structures through a joint Access Structures service …..”
You would be surprised, firstly, that the text above comes from a CCDC post and, secondly, that over 10 years later those silo databases actually appear to be following COD’s model themselves. Perhaps this comes from pressure from research councils or the end user, or even from the existence of COD itself.
A Change of Heart
Clearly things have begun to change: the unified deposition access of the CSD and ICSD and the common search gateway allow for the first step towards a more open infrastructure. Perhaps Bruno et al.’s paper shows that the status quo is changing? A key statement made in the conclusion of that paper struck a strong chord with me:
“Data must be “Findable, Accessible, Interoperable, and Reusable” or FAIR for short.”
They went on to demonstrate such initiatives with Zenodo articles by Prof. John Helliwell, DOI.
There is always going to be a balance between cost of data collection, curation, storage and providing access. However, just like publishing a journal article I believe that there needs to be a change in how the journals accept the data and importantly how the databases provide access.
I see daily the need for students and researchers to access crystallographic data. My 300 facility users make do with only 4 PCs with ICDD access compared with 15 with COD, which shows the difference between licensed and open access.
To Conclude
In the end after fighting all those giants I discovered that in fact they were only windmills.
I think the future of crystallographic databases and open knowledge in research is one which will continue to prosper and grow because of disruptive players like COD whilst also being driven by research council funding requirements to be more open at all levels.
True, there are established players, and the publishers are still forcing authors to pre-submit data into those databases prior to publication. However, what matters is the way that the community responds and develops to meet open data challenges, just as the single email request made by Michael Berndt triggered the establishment of COD. Perhaps with enough pressure the ICDD, ICSD and CSD will donate their CIFs to COD.
Yes, there will be blind alleys, and yes, it will sometimes be frustrating trying to access a structure or a family of structures, but if we teach the next generation of researchers how to use the open databases on open platforms using open software then open will become the norm.
As long as we aspire to be FAIR with our research data, the community will grow and thrive. So if we must, and we should, submit our structural information to the historic databases, there is nothing as far as I can see stopping us from submitting the same data to COD as well; if anything, there is a moral obligation to do so.
