Bluekiri learnings @ BigDataSpain18

Pablo Fleurquin
Published in bluekiri · Dec 11, 2018

co-author: Ignacio Duart

In this post we will summarize the most important learnings that could help medium-sized companies accelerate the path to a real data-driven transformation. We will start from a data engineering point of view: the deployment of predictive models, analytical tools and the heavy usage of data in traditional applications. We will then review some of the advances that help with the bookkeeping of the ML lifecycle, and end with something usually left out of the equation but of paramount importance: people.

Datalake architecture challenges

Without a doubt, the design and implementation of a proper datalake strategy that suits the enterprise goals is challenging from a conceptual standpoint, but also from the practical perspective of real deployment cases. While there is a clear trend towards the cloud as the main data store and towards container technologies to ease and scale deployments, there is still no easy way to decide which data to store, and when and how to store and read it.

At this stage, there are challenges from both the delivery and the technical points of view, as well as challenges that come from business necessities. In order to develop an effective strategy, we can follow the DataOps methodology, as described in several of the talks. DataOps strives to shorten the data cycle, from preparation and ETL operations to reporting. It is a combination of Agile-like methodologies, proper monitoring and process control applied to data operations. A good resource to explore real cases and applications is Creating a Data-Driven Enterprise with DataOps, published by O’Reilly.

One of the goals of a data-driven enterprise is self-service and the democratization of data. One way to achieve this, while remaining efficient and flexible with the pool of data we have, is to implement some sort of indexing strategy or virtualization layer on top of our stored data. The available solutions are both proprietary and open source, and they enable efficient access to data and the building of APIs on top of your datalake. Among these, two proprietary solutions drew our attention: Denodo and Kyligence Enterprise.

The Denodo platform offers a complete package that allows efficient access to, and aggregation of, multiple data sources at the same time. Another interesting solution is Kyligence Enterprise (powered by Apache Kylin), a full data analytics platform that delivers excellent low-latency performance and auto-modelling based on query history. Apache Kylin is an OLAP engine that allows fast SQL queries (e.g. through ODBC/JDBC) and aggregations on different data storages (Hadoop, Kafka, relational DBs, etc.) through an OLAP cube (stored in HBase in the case of Kylin), a multidimensional spreadsheet that supports analytical operations like rotating or summarising while holding one axis (dimension) fixed.
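
To give a flavour of what this looks like in practice, here is a minimal sketch of an OLAP-style aggregation sent to Kylin over ODBC with Python's pyodbc; the DSN, credentials and the sales_fact table/columns are hypothetical placeholders, not Kylin's sample schema.

```python
# Minimal sketch: run a cube-friendly aggregation against Kylin over ODBC.
# DSN, credentials and the sales_fact schema are illustrative assumptions.
import pyodbc

conn = pyodbc.connect("DSN=kylin;UID=ADMIN;PWD=KYLIN")
cursor = conn.cursor()

# A typical OLAP query: group by one dimension, aggregate a measure.
cursor.execute("""
    SELECT part_dt, SUM(total_amount) AS revenue
    FROM sales_fact
    GROUP BY part_dt
    ORDER BY part_dt
""")

for row in cursor.fetchall():
    print(row.part_dt, row.revenue)

cursor.close()
conn.close()
```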

Databases for enterprise needs

Key value stores

At Bluekiri we use Couchbase as a key-value database in some of our applications, but there are alternatives worth exploring. Some of the alternatives we found interesting at BDS18 were Apache Ignite, Redis and LeanXcale.

Redis is well known and we often use it for reliable in-memory caching, or even message queuing, but it is also usable as a fully ACID-compliant, persistent data store. It may be possible to reproduce Couchbase’s capabilities at a lower price with the same replication factor, so it is something worth exploring.
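
As a minimal sketch of this dual role, the snippet below uses the redis-py client for a TTL-based cache entry and a simple blocking queue; the host and the key names are placeholders for illustration.

```python
# Minimal sketch: Redis as a TTL cache and as a simple work queue.
# Host/port and key names are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a computed value with a one-hour expiry.
r.set("search:PMI-BCN:2018-12-24", '{"price": 120.5}', ex=3600)
print(r.get("search:PMI-BCN:2018-12-24"))

# Simple message queue: producers push, consumers block-pop.
r.lpush("bookings:queue", "booking-42")
_, message = r.brpop("bookings:queue", timeout=5)
print(message)
```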

Ignite is an open source Apache Foundation project: a distributed, memory-centric database with the option to persist data on disk. One of its advantages is the ability to run SQL queries, with efficient data partitioning out of the box. Another interesting feature is the possibility to choose between third-party engines and Ignite Persistence (their own solution) for data persistence on disk.
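
The sketch below, using the pyignite thin client, illustrates the two access paths against one cluster: key-value puts/gets and plain SQL. The host, port, cache name and the city table are illustrative assumptions.

```python
# Minimal sketch: Ignite via the pyignite thin client, showing key-value
# access and SQL access side by side. Host, cache and table are assumptions.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

# Key-value access against a cache.
cache = client.get_or_create_cache("sessions")
cache.put("user-1", "2018-12-11T10:00:00")
print(cache.get("user-1"))

# SQL access: create a table, insert a row, and query it back.
client.sql("CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR)")
client.sql("INSERT INTO city (id, name) VALUES (?, ?)", query_args=[1, "Palma"])
for row in client.sql("SELECT id, name FROM city"):
    print(row)

client.close()
```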

LeanXcale is also an interesting technology, as it offers both key-value and SQL access, is scalable and is well suited to OLAP operations, making it a potentially good option for both data storage and analytical purposes.

Graph databases

Graph databases are not talked about as often, but they are one of the fastest-growing segments in the database market, due to the growing need to model relationships between seemingly unrelated elements (e.g. on social networks) and to explore data efficiently (e.g. using graph traversal algorithms). When choosing a graph database, it is important to understand the differences between LPGs (labeled property graphs), RDF stores (resource description framework), and multimodel databases.

LPGs (like Neo4j) are flexible, scalable and allow for fast traversals; they are useful if you want efficient storage and fast querying across connected data. However, they often have poor schemas and lack standardized query languages. RDF stores (like GraphDB) have, on one side, good interoperability, flexible schemas and well-defined semantics, which allow for building things like knowledge graphs, semantic webs, etc. On the other side, they are usually less compact and less scalable, and have worse performance as well as higher complexity. Then there are multimodel databases (key-value, document, etc.) which are not graph databases at their core, and hence may not be as optimized for graph operations, but allow more flexibility (one example is CosmosDB).
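
To give a feel for the LPG side, here is a minimal sketch of a multi-hop traversal in Cypher using the official neo4j Python driver; the URI, credentials and the Person/KNOWS schema are hypothetical.

```python
# Minimal sketch: a friends-of-friends traversal, the kind of multi-hop
# query LPGs are fast at. Connection details and schema are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        """
        MATCH (p:Person {name: $name})-[:KNOWS*1..2]->(friend)
        RETURN DISTINCT friend.name AS name
        """,
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```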

Amazon presented their vendor solution, Amazon Neptune, available only on AWS. It is a managed graph database that offers APIs for both LPG and RDF models and allows querying the RDF graph through SPARQL, which may be a breakthrough in terms of flexibility.
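
As an illustration of the RDF/SPARQL access path, the sketch below sends a SPARQL query to a Neptune-style endpoint with SPARQLWrapper; the endpoint URL and graph contents are placeholders, and a real Neptune cluster additionally requires VPC/IAM setup.

```python
# Minimal sketch: querying an RDF store through its SPARQL endpoint.
# The endpoint URL and the triples it returns are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://my-neptune-cluster:8182/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```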

MLflow: ML on easy mode

Every Data Scientist who has developed production-ready ML algorithms knows how complex this can be; it is not the usual software development pipeline. In ML, complexity scales with the number of trial-and-error experiments. Keeping track of experiments, results and deployments is not a productive task in itself, but it is fundamental if the ML lifecycle is to become as robust, predictable and widespread as traditional software development. ML success lies between creative experimental “chaos” and ordered software development, and this is where MLflow comes into play, bridging the gap between these two worlds. Of course, MLflow is not the only platform on the market: Google’s TFX, Uber’s Michelangelo and Facebook’s FBLearner Flow are some of the alternatives. However, these come with some limitations regarding built-in models and are tied to their own infrastructure. Databricks’ open-source platform is a good way to overcome these limitations.
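
To make the idea concrete, here is a minimal sketch of MLflow tracking around a toy scikit-learn model: each run records parameters, metrics and the trained model so experiments remain comparable and reproducible. The dataset and model are purely illustrative.

```python
# Minimal sketch: track one trial-and-error experiment with MLflow.
# Dataset and model choice are illustrative assumptions.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    mse = mean_squared_error(y_test, model.predict(X_test))

    # Log what made this run what it is: parameters, metrics and the model.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```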

Data Governance: one ring to rule them all

Last but not least, we all agree that technology is an important enabler; however, industry adoption of Big Data and ML/AI must come together with a cultural shift. Organizations naively think that technological challenges can be solved purely with hardware and software, but when that is the only focus, failure is around the corner. As Andrés Garcia-Rodeja (DXC Technologies) puts it, “Data governance can be defined as the formal orchestration of people, processes and technology that enables an organization to leverage data as an enterprise asset by properly managing Data Entropy.” As defined, Data Governance is many things, complex and complicated at the same time, but one question rises above all: where to start? In short, start with people. Start engaging them in understanding the importance of taking care of data processes, and make them the owners of data. Importantly, involve all departments, especially the business ones and not just IT. At Bluekiri we believe this is the (only) path towards a real data-driven organization, and by doing so, Big Data and ML/AI success is (almost) guaranteed.

Thanks to @andreeamih for proofreading and making sense out of the myriad of info…
