Data ownership, or the core of a company

Edo Scalafiotti
Published in Techfare
Jul 22, 2018 · 9 min read

A core principle of the digital economy is to consider data as the fundamental unit of value, much like a traditional homo economicus considers money as the ultimate unit of value.

This new paradigm has evolved since the first web 2.0 tools such as social networks, and it is intensifying with the advent of Machine Learning-powered Digital Assistants; for these systems, data is the main multiplier and enabler of growth. Since data is not business-driven but domain-driven (a “customer” data point doesn’t belong to one business unit only), every organisation that plays in the digital economy is de facto a digital organisation first, and its value and competitive advantage derive from its data.

A digital organisation is defined by its data and its data becomes the virtual representation of its strategic intentions. Data ownership, when considered as the very first tactical enabler, gives an organisation the right and ability to play, compete and win in any subset of the digital economy. To achieve data ownership, a digital organisation must first approach data from four distinct perspectives: security, access, integrity and scalability.

A digital organisation is defined by its data and its data becomes the virtual representation of its strategic intentions

Security

Security is the protection of an organisation’s dataset from both external and internal threats. A common legacy assumption was to consider every agent outside the organisation a threat and every agent within it benign. This translated into specific VPC (Virtual Private Cloud), networking and authorisation rules for service-to-service communication. Digital leaders such as Netflix, Google, Spotify and Uber have abandoned this assumption and now treat internal threats much more seriously; this is a response to the growth of digital teams, which often include large numbers of contractors within their Squads.

Security is the protection of an organisation’s dataset from both external and internal threats

When engineering security, authorisation and authentication strategies, it is best practice to start from the very basics of CI/CD pipelines: developing services in a test environment and running tests in a staging environment prior to a fully automated deployment to production. Although this seems obvious, we have seen (very) large organisations whose developers transfer code via FTP onto a live server. Access to the production dataset must also be restricted through private keys held by a select few members of the senior leadership (usually the CTO and a trustee from the Board). In large organisations, it is common practice to have a VP (Vice President) responsible for a dataset, while only the President or CEO has access to the entire dataset.

However, these internal security measures should not allow human gatekeepers to withhold data; data is a valuable business asset and should be shared with services that can use it to generate further value. Governance must therefore balance security risk against the potential to generate value: services should be given broader access to data, while security should always prevail when a human requires access to a dataset. To empower humans to make data-informed decisions, user interfaces and reporting tools should be provisioned, since they fulfil business needs without compromising the security or integrity of the underlying dataset.

[…] services should be given broader access to data, while security should always prevail when a human requires access to a dataset.
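As a loose illustration of this split between service access and human access (the principal types, actions and names below are hypothetical, not drawn from any specific product), an authorisation layer could grant services access to raw data while limiting humans to reporting views:

```python
from dataclasses import dataclass

# Hypothetical principal types: services get raw data access,
# humans are limited to aggregated reporting views.
@dataclass
class Principal:
    name: str
    kind: str  # "service" or "human"

RAW_ACTIONS = {"read_raw", "write_raw"}
REPORTING_ACTIONS = {"read_report"}

def is_allowed(principal: Principal, action: str) -> bool:
    """Services may touch the raw dataset; humans only reporting views."""
    if principal.kind == "service":
        return action in RAW_ACTIONS | REPORTING_ACTIONS
    if principal.kind == "human":
        return action in REPORTING_ACTIONS
    return False

# Example: a pricing service may read raw data, an analyst may not.
assert is_allowed(Principal("pricing-service", "service"), "read_raw")
assert not is_allowed(Principal("analyst", "human"), "read_raw")
assert is_allowed(Principal("analyst", "human"), "read_report")
```

In practice this logic would sit inside a central Authentication and Authorisation service rather than being duplicated in every consumer.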

Access

Access is the ability to create, read, update and delete (CRUD) the correct dataset or data point. Access to a dataset must be easy and cost-effective. This principle usually builds on the notion of a unique source of truth for a specific domain. All datasets and storage systems must be accessible via APIs with an underlying fit-for-purpose storage strategy: for example, a microservice that stores product information may handle all CUD commands (create, update and delete) in a NoSQL, document-oriented database (such as MongoDB or CouchDB), which is highly efficient at sharding but comparatively inefficient at querying. All R commands (read) that need to be served synchronously (for example, to populate an autocomplete list) could be redirected to a highly efficient indexed search engine such as Elasticsearch or Apache Solr.
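A minimal sketch of that split, assuming recent versions of the official pymongo and elasticsearch Python clients and locally running instances of both stores (the index names and the synchronisation strategy are illustrative only):

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient("mongodb://localhost:27017")
products = mongo["catalog"]["products"]          # write-optimised store (CUD)
es = Elasticsearch("http://localhost:9200")      # read-optimised index (R)

def upsert_product(product_id: str, doc: dict) -> None:
    """Create and update commands go to the document store..."""
    products.replace_one({"_id": product_id}, doc, upsert=True)
    # ...and the search index is kept in sync for synchronous reads.
    es.index(index="products", id=product_id, document=doc)

def delete_product(product_id: str) -> None:
    products.delete_one({"_id": product_id})
    es.delete(index="products", id=product_id)

def autocomplete(prefix: str) -> list:
    """Synchronous reads (e.g. an autocomplete box) hit the search index only."""
    hits = es.search(
        index="products",
        query={"match_phrase_prefix": {"name": prefix}},
    )
    return [h["_source"]["name"] for h in hits["hits"]["hits"]]
```

Keeping the search index in sync inside the write path is the simplest option; an event-driven pipeline between the two stores would achieve the same result asynchronously.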

Integrity

Integrity is also known as the immutability principle. Data, at its core, must be immutable in order for an organisation to detect and prevent tampering, such as MITM (man-in-the-middle) attacks. For example, without immutability, someone could modify a University database to update a grade or even delete a degree. Blockchains are a modern response to the growing threat of integrity attacks, but other tools can be used as well; for example, event-driven data lakes, which store data as a collection of independent events across the lifetime of a dataset, from the initial “created” event, through every “update”, to the final “delete” event. This chronology of an organisation’s data events can be combined with monitoring and analytics to enforce data integrity.
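The sketch below shows one possible (and deliberately simplified) form of such an event-driven, append-only store: each event carries a hash of the previous one, so any silent modification breaks the chain. This is an assumption about how integrity could be enforced, not a description of a specific product:

```python
import hashlib, json, time

class EventLog:
    def __init__(self):
        self.events = []  # append-only: events are never updated in place

    def append(self, entity_id: str, event_type: str, payload: dict) -> dict:
        previous_hash = self.events[-1]["hash"] if self.events else "0" * 64
        body = {
            "entity_id": entity_id,
            "type": event_type,          # "created", "updated", "deleted"
            "payload": payload,
            "timestamp": time.time(),
            "previous_hash": previous_hash,
        }
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        self.events.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the hash chain; any tampered event breaks it."""
        previous_hash = "0" * 64
        for event in self.events:
            if event["previous_hash"] != previous_hash:
                return False
            expected = {k: v for k, v in event.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(expected, sort_keys=True).encode()
            ).hexdigest()
            if digest != event["hash"]:
                return False
            previous_hash = event["hash"]
        return True

log = EventLog()
log.append("grade-42", "created", {"student": "s1", "grade": "A"})
log.append("grade-42", "updated", {"grade": "B"})
assert log.verify()  # silently editing any past event makes verify() fail
```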

Scalability

Scalability is the ability of a digital organisation to grow its data while costs grow linearly rather than exponentially. There are two ways to scale: vertical scaling is achieved by increasing the power of a single machine, while horizontal scaling is achieved by increasing the size of the fleet of machines (sometimes also referred to as a swarm) that handles the data and the necessary computing power. A digital organisation should be biased towards horizontal scaling because there is a very hard limit to vertical scaling, imposed by the highest achievable power and memory of a single machine, in addition to its cost. In other words, a database sitting on a single machine that needs more power to grow its indexes will hit the hard limit of the machine itself: at some point, the database will have grown so large that there will not be enough RAM available to continue. This principle must be observed when choosing the right technology for a data service; relational databases like MySQL are comparatively hard to distribute across multiple machines, and sharding them often requires additional tooling or commercial editions. For core data systems we always recommend considering the ability of your data and your databases to scale horizontally.
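A toy sketch of what horizontal scaling looks like at the data layer (the node names are hypothetical): records are routed to one node of the fleet by hashing their key, so capacity grows by adding machines rather than by buying a bigger one:

```python
import hashlib

NODES = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical shard fleet

def shard_for(key: str, nodes=NODES) -> str:
    """Map a record key to exactly one node in the fleet."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Each record lands on exactly one node; adding nodes spreads the load.
# (A production system would use consistent hashing to limit re-balancing.)
print(shard_for("customer:1042"))
print(shard_for("customer:9001"))
```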

Data Platforms are banks

Not all data is money, of course. Like the international monetary system, there are currencies that can be converted to goods and services more easily than others (which is the real current limit to a wider adoption of Cryptocurrencies).

Data is the foundation: the prize is ML

Data usually becomes equivalent to money when it is collected in a way that enables Machine Learning to generate an accurate-enough model for predictions and recommendations. The advertising business model so popular among the tech giants is not the only one that can be unlocked with ML: predictive maintenance is a business model in the manufacturing industry that relies mostly on predicting when a part will break.
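As a purely illustrative sketch (the sensor data below is synthetic and the failure rule invented), a predictive-maintenance model can be as simple as a classifier trained on usage and sensor readings, assuming scikit-learn and NumPy are available:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic history: hours run and vibration level for 1,000 parts.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10_000, 1_000)
vibration = rng.normal(1.0, 0.3, 1_000) + hours / 20_000
X = np.column_stack([hours, vibration])
y = (hours > 7_000) & (vibration > 1.2)   # toy failure label, for illustration

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# The business question: should this part be replaced before it breaks?
part = np.array([[8_200, 1.4]])
print("failure risk:", model.predict_proba(part)[0][1])
```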

Connecting physical controllers to ML models is potentially the most disruptive business model of the entire IIoT (Industrial Internet of Things) field. Once the issue of trust is solved, controller-to-sensor and controller-to-controller communication can be established without the need for human intervention. Imagine, for example, a hydroponic greenhouse based in the UK (say, the Tomato Stall company) where physical controllers simply query an ML model owned by a biotech research lab in Switzerland to determine how much water a plant should receive given the current conditions. Not only does the entire process become automated but, most importantly, the learnings are instantly available to every other controller in the world.
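A hedged sketch of what the controller side of that interaction could look like; the endpoint URL and JSON schema below are invented purely for illustration:

```python
import requests

PREDICTION_URL = "https://models.example-lab.ch/v1/irrigation/predict"  # hypothetical

def irrigation_ml(sensor_readings: dict) -> float:
    """Ask the remote ML model how many millilitres of water to deliver."""
    response = requests.post(PREDICTION_URL, json=sensor_readings, timeout=5)
    response.raise_for_status()
    return response.json()["water_ml"]

readings = {"soil_moisture": 0.31, "air_temp_c": 24.5, "light_lux": 18000}
water_ml = irrigation_ml(readings)
print(f"Opening valve for {water_ml} ml")
```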

Connecting physical controllers to ML models is potentially the most disruptive business model of the entire IIoT field

The latter point is especially evident for self-driving cars: one car having an incident triggers an immediate update to the behaviour of all the cars connected to the network, so the same scenario is unlikely ever to cause an incident again.

However, if actionable data is money, then any data platform must view itself as a bank. The behaviour of data producers, whether conscious or not, confirms it. Much like no one will hand their savings to the first Joe claiming to have “a very safe place to store them”, what the industry is lacking at the moment are clear rules, standards and auditors for data platforms.

[…] what the industry is lacking at the moment are clear rules, standards and auditors for data platforms

GDPR mostly deals with user data and issues of privacy, but a GDPR-like charter or consortium should be established for industrial and commercial data as well.

Symptoms of pre-digital IT

Pre-digital-transformation organisations are typically modelled around a traditional IT system composed of many business-oriented software verticals that have been sourced or licensed on a per-need basis. This type of architecture has been effective and will continue to deliver in scenarios where an organisation does not play in the digital economy. However, it has become highly ineffective and expensive for any public or private organisation that does.

The software services of a traditional IT system are mostly off-the-shelf products acting in isolation from each other. Domain data is shared between the services through triggers and cron jobs that frequently operate on a chain of databases. The triggers and the scripts are tightly coupled to a proprietary database system (for example, Microsoft SQL Server or Oracle Database) on a specific machine. This is usually considered an anti-pattern for scalable digital systems development, since the web of relations between triggers is very hard to track and can lead to a series of overwrites over time. Debugging the system also becomes very expensive, and the fact that this regularly happens on a live, production system makes the practice insecure.

We also view a traditional IT architecture as unable to comply with GDPR: it would be technically challenging and very expensive to let a user download every data point that the organisation has collected on them. Moreover, with the large number of compensation tables and temporary databases, it would be even more difficult to ensure full and comprehensive deletion if a user requested removal.
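By contrast, in a domain-driven architecture a subject-access export can be a simple aggregation across domain services. The sketch below assumes each service exposes a hypothetical /users/&lt;id&gt;/export endpoint; the service names and URLs are invented for illustration:

```python
import json
import requests

DOMAIN_SERVICES = {            # hypothetical internal service URLs
    "profile": "http://profile.internal",
    "orders": "http://orders.internal",
    "support": "http://support.internal",
}

def export_user_data(user_id: str) -> str:
    """Collect every data point each domain service holds on this user."""
    export = {}
    for domain, base_url in DOMAIN_SERVICES.items():
        response = requests.get(f"{base_url}/users/{user_id}/export", timeout=10)
        response.raise_for_status()
        export[domain] = response.json()
    return json.dumps(export, indent=2)
```

In a trigger-and-cron architecture there is no equivalent single place to ask, which is precisely why compliance becomes so expensive.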

Understanding what is core

A digital organisation must have a very clear understanding of what its core mission is, and own the data that represents it. Adopting a domain-driven design approach for its data services is usually a good start; in practice, this means that data is decoupled from business-oriented silos and categorised in coherent, high-level domains.

[…] core data must be decoupled from business-oriented silos and categorised in coherent, high-level domains

A University could consider three domains as core: Users (students, researchers, academics, etc.), Storage (white papers, theses, books, lectures, etc.) and Courses (timetables, programmes, exams, etc.). The three domains are then managed by a swarm of microservices acting on different events in the lifecycle of the data produced and consumed. A microservice reacting to a user enrolment should contain specific logic to handle only that event. Specific attention should be given to the Data Ingestion System, which is meant to address all the challenges of the Storage domain; such a system could contain a swarm of microservices and serverless functions communicating through an event bus (like Kafka) and sharing data with a data lake (like Apache Hadoop).
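A minimal sketch of one such single-purpose microservice, assuming the kafka-python client, a hypothetical broker address and a hypothetical “user.enrolled” topic; the service reacts to that one event type and nothing else:

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user.enrolled",
    bootstrap_servers="kafka.internal:9092",        # hypothetical broker
    group_id="enrolment-service",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

def handle_enrolment(event: dict) -> None:
    """Logic specific to a single event: a user enrolling on a course."""
    print(f"Provisioning course access for {event['user_id']} on {event['course_id']}")

for message in consumer:
    handle_enrolment(message.value)
```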

We recommend establishing clear rules and practices on data security and access so that the process can be clearly understood and is transparent to everyone. We have witnessed, for example, several organisations in which the creation of API keys to access specific services was controlled by a single developer. To prevent this, an Authentication and Authorisation system must also include the management of API keys for humans, organisations and applications alike.
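An illustrative sketch of what centralised API-key management could look like (the names, scopes and storage choices here are assumptions): keys are issued per principal, stored only as hashes, and checked against explicit scopes rather than minted ad hoc by an individual developer:

```python
import hashlib
import secrets

KEY_STORE = {}  # in practice this would live in the Authorisation service's database

def issue_api_key(principal: str, principal_type: str, scopes: list) -> str:
    """Issue a key for a human, organisation or application and record who owns it."""
    api_key = secrets.token_urlsafe(32)
    KEY_STORE[hashlib.sha256(api_key.encode()).hexdigest()] = {
        "principal": principal,
        "type": principal_type,     # "human", "organisation" or "application"
        "scopes": scopes,
    }
    return api_key  # shown once to the caller; only the hash is kept

def authorise(api_key: str, scope: str) -> bool:
    record = KEY_STORE.get(hashlib.sha256(api_key.encode()).hexdigest())
    return bool(record) and scope in record["scopes"]

key = issue_api_key("ingestion-app", "application", ["storage:write"])
assert authorise(key, "storage:write")
assert not authorise(key, "users:read")
```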

On AGILE and governance

Establishing a data-driven architecture and the practices around data security will also give an organisation some governance for free, while the decoupling of systems will reduce development cycle times. The latter improvement in particular will enable an organisation to transition from waterfall to agile, allowing teams to deliver valuable, user-aligned services at speed. We strongly recommend adopting two agile approaches to managing digital services. Agile SCRUM is recommended for developing new services, where there is inherently a high degree of uncertainty around technical feasibility, demand and user needs; SCRUM entails iteratively building and delivering Minimum Viable Products (MVPs) in cycles of two weeks or less in order to test feasibility and hypotheses. Once the service has surpassed its particular definition of an MVP and development efforts are predominantly feature engineering, the process should be managed with Kanban, which focuses the development team on progressing tasks through a known workflow, e.g. from backlog through design, build, test and deployment.

Edo Scalafiotti

“Cooper, this is no time for caution!” I work for @AWSCloud & my opinions are my own