Bluekiri learnings @ BigThings19

Diego Garcia Valverde
Published in bluekiri · Jan 21, 2020

My teammate Héctor Santos, our coworkers from the Data Science team at Bluekiri and I had the opportunity to attend the latest (8th) edition of the Big Things congress. This post gathers the lessons learnt, divided into three main sections: data culture, technical and product talks.

Previous edition notes are available at https://medium.com/bluekiri/bluekiri-learnings-bigdataspain18-395729be3c5c

For this 8th edition the congress had a new name: Big Things. Previous editions were called Big Data Spain. The main reason for this change, as explained by Óscar Méndez (CEO of Stratio, the main sponsor), is that the term “Big Things” has a wider scope and covers all the technologies and tools that will change the way we work and do things as a society.

The structure of this post is as follows. First, “Data culture talks” covers Data Governance and Data Engineering topics. Next, “Technical talks” covers NLP problems and probabilistic data structures. Finally, “Product talks” includes new developments in technologies such as AWS Alexa, Kibana, Neo4j, the Delta Lake storage layer and the Hopsworks data platform, plus some guidelines from Google on sharing Jupyter notebooks or putting them in production.

Data culture talks

The first talk after the opening one was “Staying Safe in the AI Future” by Cassie Kozyrkov (Chief Decision Scientist at Google). It was a fundamental guide for any company or person that is or will be involved (as consumer, developer, team lead…) in an AI service or in any data-driven project. In a nutshell, she concluded that it is people’s lack of rigor in model testing, or their wrong conclusions, that can mislead an ML model’s goal, and not the other way round, as is generally and mistakenly thought.

The topic of Data Governance was very present at the congress. Now that the Big Data hype of recent years has settled, Paco Nathan, in his talk “Overview of Data Governance” (slides available at https://derwen.ai/s/6fqt), briefly explained the history and evolution of data governance over these years, and how the complexity of data governance has increased in our companies, especially in companies that, unlike Netflix or Google, did not start with technological roots.

He showed several surveys on how companies are adopting these roles, practices and technologies, and, most importantly, pointed out that the data governance culture should be aligned at all levels of the organization for a data-driven project or decision to succeed; otherwise, he indicated, it will fail.

(Source: https://derwen.ai/s/6fqt#6)

In this regard, the talk “Creating a Data Engineering Culture”, given by Jesse Anderson from the Big Data Institute, explained how companies that are just starting out on their big data journey should understand, value and manage their data engineering and data science teams in order to succeed. He really insisted that the value and importance of data engineering should be recognized at all levels of the organization.

The talk also covered specific management details, such as which skills are required for a data engineer or data scientist, and even the ratio of data engineers (DE) to data scientists (DS) that should be kept in any company (in his opinion, at least 5 DE for every 2 DS).

Technical talks

The talk “Probability data structures”, given by Siemens, was one of the most interesting ones. It detailed their first attempts to compute real-time cybersecurity indicators over high data volumes using cloud-based solutions such as Google BigQuery and Amazon Athena, and how they reframed the problem with a combination of a count-min sketch and a Bloom filter running on AWS Lambda functions, obtaining a very good trade-off between accuracy and cost.

In a nutshell, a count-min sketch estimates the frequency of elements using hash-based data structures to reduce the required memory, but the caveat is that it can overestimate the counts of low-frequency elements. A Bloom filter, on the other hand, allows testing set membership. So, applying a Bloom filter first to test membership mitigates the count-min sketch’s overcounting of low-frequency elements.
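A minimal Python sketch of this combination is shown below. The exact design used in the talk is not public, so the sizes, hash functions and the way the two structures are chained are illustrative assumptions.

```python
# Illustrative combination of a Bloom filter and a count-min sketch,
# in plain Python with hashlib; sizes and hash counts are arbitrary.
import hashlib


def _hash(value: str, seed: int, size: int) -> int:
    """Deterministic hash of `value` for a given seed, bounded by `size`."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return int(digest, 16) % size


class BloomFilter:
    def __init__(self, size: int = 10_000, num_hashes: int = 4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def add(self, value: str) -> None:
        for seed in range(self.num_hashes):
            self.bits[_hash(value, seed, self.size)] = True

    def might_contain(self, value: str) -> bool:
        return all(self.bits[_hash(value, seed, self.size)]
                   for seed in range(self.num_hashes))


class CountMinSketch:
    def __init__(self, width: int = 2_000, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, value: str) -> None:
        for row in range(self.depth):
            self.table[row][_hash(value, row, self.width)] += 1

    def estimate(self, value: str) -> int:
        # Minimum over rows: the estimate never underestimates the true count.
        return min(self.table[row][_hash(value, row, self.width)]
                   for row in range(self.depth))


bloom, cms = BloomFilter(), CountMinSketch()


def observe(event: str) -> None:
    bloom.add(event)
    cms.add(event)


def frequency(event: str) -> int:
    # Only trust the sketch for events the Bloom filter has actually seen,
    # so never-observed events do not surface spurious counts.
    return cms.estimate(event) if bloom.might_contain(event) else 0
```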

The talk “Solving Natural Language problems with scarce data”, given by Álvaro Barbero Jiménez, was a well-structured explanation of the state-of-the-art approaches to solving NLP problems. He clearly explained how the BERT model, whose main advantage is taking into account the contextual relations between words, outperforms the others. The code and data presented are available at https://github.com/albarji/big-things-2019.

Moreover, he insisted on using existing pre-trained models such as fastText when facing any NLP problem, in order to achieve better and more generalizable solutions.
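To make the “contextual relation between words” point concrete, here is a small example (not taken from the talk’s repository) that queries a public pre-trained BERT checkpoint through the Hugging Face transformers library; the sentences and the word chosen are arbitrary.

```python
# Illustrative only: a pre-trained BERT model assigns different vectors to the
# same word depending on its context. Requires `pip install transformers torch`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]


bank_river = embedding_of("we sat on the bank of the river", "bank")
bank_money = embedding_of("she deposited money at the bank", "bank")
similarity = torch.cosine_similarity(bank_river, bank_money, dim=0).item()
print(f"cosine similarity between the two 'bank' embeddings: {similarity:.3f}")
```

The two embeddings differ because each one is computed from the surrounding sentence, which is exactly the property that static word vectors lack.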

The Airbus talk “Time-Efficient Aircraft Fault Isolation Procedures with NLP techniques” focused on how they are using ML to improve maintenance operations and reduce time and costs. Specifically, they use NLP techniques to extract knowledge from flight operation reports in order to reduce maintenance tasks by prioritizing the most important ones.

Product talks

At a congress where technology is at the core, product talks clearly could not be absent. The Amazon Alexa talk showed the extensive use of Long Short-Term Memory (LSTM) neural networks in the voice recognition module. They also showed the considerable effort put into making the voices sound more natural and less artificial. Finally, some new features and improvements were shown on video; in particular, a commercial was played to present the celebrity voice feature, in which Samuel L. Jackson appears recording himself in a studio. However, it was disappointing not to hear him say at least one quote from his films, which could clearly have been a significant added value.

Michael from Databricks presented “Delta Lake”, an open-source storage layer that brings reliability to data lakes. He exposed the intrinsic problems of traditional data lakes when gathering high volumes of varied data (a lot of it garbage), and how these problems lead us to develop complex architectures in order to extract value from the data.

The main idea of Delta Lake is to focus on the data flows. It brings ACID transactions to Apache Spark, guaranteeing data quality and allowing any type of change to the data. This allows, for instance, unifying the data generated by batch and streaming processes.
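As a rough sketch of what that unification looks like in practice (assuming pyspark and the delta-spark package are installed; the table path, schema and checkpoint location are made up for the example), a batch job and a streaming job can target the same Delta table:

```python
# Minimal Delta Lake sketch: a batch writer and a streaming writer sharing one
# ACID table. Paths and the example schema are placeholders.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "/tmp/events_delta"  # illustrative location

# Batch write: an atomic, versioned append to the Delta table.
batch_df = spark.createDataFrame([("booking", 1), ("search", 5)], ["event", "count"])
batch_df.write.format("delta").mode("append").save(table_path)

# Streaming write into the very same table; Delta's transaction log keeps
# concurrent batch and streaming writers consistent.
stream_df = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .selectExpr("'tick' as event", "cast(value as int) as count")
)
query = (
    stream_df.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/events_delta_ckpt")
    .outputMode("append")
    .start(table_path)
)

# Readers always see a consistent snapshot, whichever writer produced it.
spark.read.format("delta").load(table_path).show()
```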

The talk about Neo4j was centered on a use case about matching different information sources and removing duplicated data. The talk insisted on the simplicity and explainability of a graph-based solution to represent data and relations in a more natural way.

The “Data Driven Dashboards with Kibana Canvas” talk was centered on the new Canvas module, which allows the user to design shareable dashboards and create custom quantitative charts from scratch. The most useful feature presented was the entity-centric data type, which allows the user to aggregate data by applying “transforms” with a “group by” in just a few clicks. This new feature avoids developing ad hoc ETLs for aggregation-only purposes. These improvements are aligned with common use cases in most companies and are clearly intended to close some of the distance to its competitors.

The talk about the Hopsworks data platform, titled “End-to-end ML pipelines with Beam, Flink, TensorFlow and Hopsworks”, covered the solutions integrated into the platform and the features it includes. This open-source platform aims to reduce the complexity of maintaining and administering Big Data systems and solutions.

In particular, the talk was focused on the data processing programming model Apache Beam running on the stream-processing framework Apache Flink.

Apache Beam is a high-level programming model that allows the developer to abstract away from low-level implementation details when programming data pipelines (Google’s Dataflow service is based on Apache Beam). In particular, it unifies the batch and stream programming models, avoiding the development of different applications for the same purpose (similarly to the Apache Spark Structured Streaming API). This programming model can run on Apache Flink, as well as on Apache Spark and other execution engines. Apache Flink is a stream-processing framework that covers a wide range of data pipeline use cases.
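A minimal sketch of what that portability looks like (not taken from the talk; the input file, output prefix and runner choice are placeholders): the same word-count pipeline can run on the local runner for testing or be pointed at a Flink cluster just by changing the pipeline options.

```python
# Minimal Apache Beam word count; the same code can target the DirectRunner
# locally or the FlinkRunner on a cluster. Requires `pip install apache-beam`.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DirectRunner",  # swap for "FlinkRunner" (plus the Flink master address)
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("events.txt")        # placeholder input
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}\t{count}")
        | "Write" >> beam.io.WriteToText("word_counts")       # placeholder output
    )
```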

Finally, the “Jupyter Notebooks on GCP (Development Best Practices/Tooling)” talk given by Google provided a complete list of the difficulties and hacks required when sharing notebooks or trying to run them in a production environment. Their solution was based on embedding custom, non-standard metadata in the notebook schema. In this regard, we expected a more mature and standardized solution.

Conclusion

To conclude, we clearly think that Big Things is an interesting congress for anyone who wants to understand and keep up to date with the new developments and guidelines followed by the industry. Currently relevant topics such as Data Governance, Blockchain, NLP and Deep Neural Networks were covered in various talks.

Moreover, the diversity of industry sectors represented (banking, transport, energy, retail, health and telcos) allowed us, the attendees, to focus on the talks of most interest to us.

In addition, important industry players were present not only in the talks but also at the exhibition booths, answering any questions.

Last but not least, the good mood and the diversity of roles and profiles of the attendees made the congress an enlightening experience.
