A Crypto Data Project

The first part in this series introduced the side project and the motivations behind it. Recap: I want to use Machine Learning to model the price variation of a cryptocurrency in relation to particular categories of events — like a conference about it, a milestone release and so on.

TL;DR: Each data project needs to take into account various quality attributes and evolve with those in mind.

Clearly, this project is based on the collection and analysis of data. One of the books I read first to get more acquainted with the topic is The Art of Data Usability, currently in MEAP phase at Manning.com.

One of the first things I learned from this book is the DIKW hierarchy, or data pyramid. It is a visual representation of the link between four levels on the path to wisdom: data, information, knowledge and wisdom. Raw data is the observation of facts and it sits at the bottom. By processing raw data we obtain information, which is data in a useful form (cleaned, formatted and so on). Information becomes knowledge when put into context. Finally, when we can apply knowledge to decision making, we gain wisdom. The hierarchy is often drawn as a pyramid because each layer builds on top of the one below. Contextualised to my project, each level gets a meaning.

Thinking in these layers helps to keep the end in mind. This is crucial for staying focused, especially on side projects: they tend to be overshadowed by the next externally driven priority.

Following the book, a data project can be thoroughly described through six different phases. During development, one often jumps back and forth between them as requirements change or adjustments become necessary.

  • DESIGN: What is the world in which this data project takes place?
  • COLLECTION: How am I going to collect data?
  • MANAGEMENT: How am I going to manage my data? Where will I store it? How will I cope with failures?
  • PROCESSING: What kind of processing needs to happen on raw data to climb up the pyramid?
  • DISSEMINATION: How will the results be made available?
  • CLOSING: What happens when I stop working on this project?
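To keep these questions at hand while jumping between phases, they can be written down as a small checklist. The following is only a sketch: the `Phase` enum reuses the questions above, while the `open_phases` helper and the shape of the `answers` dictionary are my own assumptions, not something the book prescribes.

```python
from enum import Enum


class Phase(Enum):
    """The six phases, each paired with its guiding question(s)."""
    DESIGN = "What is the world in which this data project takes place?"
    COLLECTION = "How am I going to collect data?"
    MANAGEMENT = ("How am I going to manage my data? Where will I store it? "
                  "How will I cope with failures?")
    PROCESSING = ("What kind of processing needs to happen on raw data "
                  "to climb up the pyramid?")
    DISSEMINATION = "How will the results be made available?"
    CLOSING = "What happens when I stop working on this project?"


def open_phases(answers):
    """Return the phases without a recorded answer, in definition order.

    `answers` is a hypothetical dict mapping Phase -> free-text answer.
    """
    return [phase for phase in Phase if not answers.get(phase, "").strip()]


if __name__ == "__main__":
    # Placeholder answer, only here to show the helper in action.
    answers = {Phase.COLLECTION: "placeholder answer"}
    for phase in open_phases(answers):
        print(f"{phase.name}: {phase.value}")
```

Since one often revisits earlier phases, the helper only reports what is still unanswered and does not enforce any order.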

Let’s begin to answer some questions. I need two different kinds of inputs: a source of crypto events and the variation of the related coin’s price.
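To make the two inputs more concrete, here is a minimal sketch of how they could be represented and combined. All names here (`CryptoEvent`, `PricePoint`, `price_variation`, the fields and the time window) are assumptions made for illustration; the actual model will come out of the later modelling work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CryptoEvent:
    coin: str              # hypothetical ticker, e.g. "BTC"
    category: str          # e.g. "conference" or "release"
    occurred_at: datetime


@dataclass(frozen=True)
class PricePoint:
    coin: str
    observed_at: datetime
    price: float


def price_variation(event, prices, window):
    """Relative price change of the event's coin around the event.

    Compares the last price observed at or before the event with the
    last price observed after the event and within `window`.
    Returns None when either observation is missing.
    """
    series = sorted(
        (p for p in prices if p.coin == event.coin),
        key=lambda p: p.observed_at,
    )
    deadline = event.occurred_at + window
    before = [p for p in series if p.observed_at <= event.occurred_at]
    after = [p for p in series if event.occurred_at < p.observed_at <= deadline]
    if not before or not after or before[-1].price == 0:
        return None
    start, end = before[-1].price, after[-1].price
    return (end - start) / start


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the function.
    event = CryptoEvent("BTC", "conference", datetime(2018, 1, 10))
    prices = [
        PricePoint("BTC", datetime(2018, 1, 9), 100.0),
        PricePoint("BTC", datetime(2018, 1, 12), 110.0),
    ]
    print(price_variation(event, prices, timedelta(days=3)))
```

The window length is a free parameter in this sketch; choosing it is part of the design questions still to be answered.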

“Data quality” is a variable criterion defined by expectations. The book’s overall suggestion is to consider every aspect of a data project with a view to the specific quality attributes that we want to achieve. The number of quality attributes to choose for each phase depends on the project. For instance, the attribute “Anonymity” is not relevant to my case, as no user data is involved. Just like Software Engineering, Data Engineering often involves compromises. The table below lists the quality attributes that I decided to take into account for each phase.

Such a list helps to flesh out requirements. At this stage of the project, I must confess, it is difficult to resist the temptation to get down and code away! A list of answers follows, to show how thinking in these terms helps with requirements.

The next chapter in this series will get closer to the real world, as we start with modelling and Event Storming.