How to ease the strain as your data volumes rise
September 14, 2017 | Written by: Manish Bhide
Ever had to make a decision when you didn’t have the time, means or patience to look up all the data that could help you choose the best option? Yes, well, you’re not alone on that score. Usually, this doesn’t have significant or long-lasting consequences — does it really matter if you choose where to go for dinner because you like the look of a place, rather than combing through recent reviews?
But some decisions carry a lot more weight. For example, executives at Kodak decided not to pursue the digital camera technology that their employees invented, giving arch-rivals Fuji and Sony a golden opportunity to seize market share that they were never able to claw back.
For some time now, the party line has been that big data could have saved these organizations and countless others from bad decisions. But that isn’t the whole story. As my colleague Jay Limburn shared in a previous blog post, having lots of information — particularly when it is poorly organized, difficult to find or not fully trusted — can hold you back just as much as not having enough data.
Solving the scalability conundrum
We all know how important scalability is when building an infrastructure that can cope with big data — the clue is in the name ‘big data’! But how do you actually achieve scalability that delivers service continuity as your data grows? First, you need to take some key considerations into account.
Scalability isn’t just about coping with gigabytes of data that grows to terabytes, petabytes, exabytes, zettabytes, yottabytes and beyond. It’s also about dealing with increasing numbers of data sets, formats and types. For that, you’ll need to make sure that the tools that help your knowledge workers make sense of data and manage governance policies can scale up too, or you’ll soon be in trouble.
You can scale data infrastructure vertically, by adding resources to existing systems, or horizontally, by adding more systems and connecting them so you can load balance across them as a single logical unit. Vertical scaling is limited, because you will eventually reach the maximum capacity of your machine. In contrast, horizontal scaling may take more planning but presents far fewer restrictions.
So, what’s the answer?
The best approach is multi-faceted: give knowledge workers access to lots of data, along with the tools they need to quickly find the most relevant assets without violating governance policies along the way. Of course, this is easier said than done.
But with data management tools that include built-in cataloging — such as IBM’s new IBM Data Catalog solution — you will be able to quickly search for data both within and across extremely large sets. As an example, if one of your data scientists discovers a relevant data set when researching a topic, they will be able to add tags and descriptions to make it easier for other data workers to find it when working on similar problems or questions. As more people add to the metadata, it will become increasingly easy for data scientists to gather the information they need through keyword searches.
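To make the tagging idea concrete, here is a minimal sketch of how shared tags can power keyword search. It is not the IBM Data Catalog API: the `TinyCatalog` class, its methods and the data set names are all hypothetical, and a real catalog would add persistence, ranking and access control.

```python
from collections import defaultdict


class TinyCatalog:
    """Toy in-memory catalog: data set names mapped to free-text tags."""

    def __init__(self):
        # Inverted index: lower-cased tag -> set of data set names.
        self._index = defaultdict(set)

    def add_tags(self, dataset, tags):
        """Attach tags to a data set so later keyword searches can find it."""
        for tag in tags:
            self._index[tag.lower()].add(dataset)

    def search(self, *keywords):
        """Return the data sets that carry every given keyword, sorted by name."""
        if not keywords:
            return []
        matches = [self._index.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*matches))


catalog = TinyCatalog()
catalog.add_tags("sales_2016.csv", ["sales", "EMEA"])
catalog.add_tags("churn_survey.json", ["customers", "churn"])
# A second user enriches the metadata of an existing data set.
catalog.add_tags("sales_2016.csv", ["customers"])

print(catalog.search("customers"))
print(catalog.search("customers", "sales"))
```

Each extra tag widens the set of searches that reach a data set, which is why the metadata becomes more useful as more people contribute to it.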
In addition to its cataloging capabilities, IBM Data Catalog will also feature a business glossary, to help users tackle the challenges of continually evolving terminology. Different people refer to different things in different ways, which can prevent knowledge workers from finding relevant data sets, a problem that only gets worse as organizations and their data get larger. A business glossary will enable you to establish a consistent set of terms to describe your data, so that knowledge workers can quickly understand which assets are useful and which are irrelevant to their analyses.
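As a rough illustration of what a business glossary does, the sketch below maps the different words people use onto one agreed term. The glossary entries and the `normalize` function are invented for this example and say nothing about how IBM Data Catalog stores or applies its glossary.

```python
# Hypothetical glossary: each canonical term lists the variants people use.
GLOSSARY = {
    "customer": ["client", "account holder", "buyer"],
    "revenue": ["sales", "turnover", "income"],
}

# Reverse lookup built once: variant (or canonical term) -> canonical term.
CANONICAL = {
    variant.lower(): term
    for term, variants in GLOSSARY.items()
    for variant in [term, *variants]
}


def normalize(term):
    """Map a user's wording to the agreed term; unknown terms pass through."""
    cleaned = term.strip().lower()
    return CANONICAL.get(cleaned, cleaned)


print(normalize("Client"))
print(normalize(" Turnover "))
```

Normalizing both the tags on a data set and the words in a search query means that a search for "client" can still find assets described as "customer".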
Users will also be able to take advantage of an auto-discovery service. It will trawl through their systems to find available data sources, work out the types and formats of data in each, and present them to the data user, who can then choose which to publish in the catalog. It doesn’t stop there — through auto-profiling, the solution will be able to automatically classify data, figuring out whether it contains social security numbers, names, addresses, zip codes, or other common types of data.
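One simple way to picture auto-profiling is pattern matching over a sample of column values. The sketch below is an assumption about the general technique, not a description of how IBM Data Catalog classifies data: the two regular expressions cover only US-style formats, and the 80% threshold is an arbitrary choice.

```python
import re

# Illustrative patterns only; real profilers use far richer rules.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
}


def classify_column(values, threshold=0.8):
    """Return the first label whose pattern fully matches at least
    `threshold` of the non-empty values, or "unknown"."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.fullmatch(v))
        if hits / len(cleaned) >= threshold:
            return label
    return "unknown"


print(classify_column(["123-45-6789", "987-65-4321"]))
print(classify_column(["10001", "94105-1234", "60614"]))
print(classify_column(["alpha", "beta"]))
```

Using a threshold rather than demanding a perfect match lets the classifier tolerate a few dirty values in an otherwise consistent column.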
As discussed in more detail in another earlier blog post, IBM Data Catalog will also offer automated, real-time classification and enforcement of governance policies. This is currently a unique proposition, and it resolves one of the major obstacles to scaling up the size and use of data management systems. Automated governance will remove the need for the Chief Data Officer’s (CDO’s) team to enforce governance policies manually, avoiding scalability issues as the number of data assets grows.
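To show why automation helps here, consider a minimal sketch in which a policy is checked in code every time a record is read, rather than by a person. Everything in it is hypothetical, including the `POLICY` table, the role names and the masking rule; it illustrates the general idea of policy enforcement driven by classifications, not IBM's implementation.

```python
# Hypothetical policy: which roles may see each classification unmasked.
POLICY = {
    "ssn": {"compliance_officer"},
    "address": {"compliance_officer", "data_steward"},
}


def enforce(record, classifications, role):
    """Return a copy of `record` with restricted fields masked for `role`.

    `classifications` maps field name -> label; unlabeled fields are
    treated as unrestricted in this sketch.
    """
    visible = {}
    for field, value in record.items():
        label = classifications.get(field)
        allowed_roles = POLICY.get(label)
        if allowed_roles is not None and role not in allowed_roles:
            visible[field] = "***"
        else:
            visible[field] = value
    return visible


record = {"name": "Ada", "ssn": "123-45-6789"}
labels = {"ssn": "ssn"}
print(enforce(record, labels, "analyst"))
print(enforce(record, labels, "compliance_officer"))
```

Because the check runs on every access, adding a thousand new data assets adds no manual work as long as each asset has been classified.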
Moreover, the governance dashboard will offer CDOs an aggregated view of enforcement across an organization, including requests for access and usage of assets. The scale and complexity of governance efforts usually grow alongside companies and their data, so these tools will represent a real game-changer in the building and use of data management systems.
And what will happen behind the scenes?
Delivered via the cloud, IBM Data Catalog will free users from having to worry about scaling infrastructure. But let’s take a look behind the curtain to understand a few of the ways IBM will ensure seamless services, even when demand suddenly spikes.
The IBM cloud provides load-balancers that can automatically distribute workload between the available application nodes, avoiding bottlenecks when one node gets busy. The cloud platform can also automatically scale horizontally, spinning up new nodes to deal with more data when demand exceeds a certain threshold. The result is that the user can enjoy stable response times, with little to no degradation of performance even during busy periods.
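The two mechanisms described above, load balancing and threshold-based horizontal scaling, can be sketched in a few lines. This is a toy model under stated assumptions: the `Cluster` class is hypothetical, requests never finish, "least busy node" is just one of several possible balancing strategies, and the threshold of three requests per node is arbitrary. It does not reflect how the IBM cloud is actually implemented.

```python
class Cluster:
    """Toy cluster: least-loaded dispatch plus threshold-based scale-out."""

    def __init__(self, nodes=2, max_avg_load=3):
        self.loads = [0] * nodes          # in-flight requests per node
        self.max_avg_load = max_avg_load  # scale-out trigger (assumed value)

    def dispatch(self):
        """Send one request to the least-busy node and return its index."""
        # Scale out first if accepting the request would push the
        # average load per node above the threshold.
        if (sum(self.loads) + 1) / len(self.loads) > self.max_avg_load:
            self.loads.append(0)
        target = min(range(len(self.loads)), key=self.loads.__getitem__)
        self.loads[target] += 1
        return target


cluster = Cluster()
for _ in range(10):
    cluster.dispatch()
print(cluster.loads)
```

Starting with two nodes, ten requests leave the cluster with four nodes and no node holding more than three requests, which is the behavior that keeps response times stable during a spike.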
For added resilience, nodes can also be deployed across data centers in multiple availability zones, protecting service continuity in the event of an outage at one location.
The same scalability and resilience are provided at the storage layer, too. All Data Catalog metadata is stored in IBM Cloudant, where it is auto-replicated across nodes. This replication avoids the risk of having a single point of failure, helping to keep the Catalog available even in the event of a node failing.
And for customers who choose to use Data Catalog not only as a metadata store, but also as a repository for the data itself, the solution harnesses IBM Cloud Object Storage to provide massive scalability for any volume of data.
Finally, behind the scenes, IBM will have a team of specialists monitoring your infrastructure to catch and address any potential scalability issues. Using IBM’s own technology to track and analyze key performance indicators such as CPU and memory usage, they will be notified of emerging problems so they can take action before users feel the impact.
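A simple version of this kind of monitoring compares recent averages of each indicator against a limit. The sketch below is illustrative only: the threshold values, the three-sample window and the `emerging_problems` function are assumptions, and real monitoring tools apply much more sophisticated analysis.

```python
# Assumed alert thresholds, as percentages; real values would be tuned.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0}


def emerging_problems(samples, window=3):
    """Return the metrics whose average over the last `window` samples
    exceeds its threshold. `samples` is a list of dicts with "cpu" and
    "memory" keys, oldest first."""
    recent = samples[-window:]
    if not recent:
        return []
    alerts = []
    for metric, limit in THRESHOLDS.items():
        average = sum(s[metric] for s in recent) / len(recent)
        if average > limit:
            alerts.append(metric)
    return alerts


samples = [
    {"cpu": 50, "memory": 60},
    {"cpu": 88, "memory": 70},
    {"cpu": 90, "memory": 72},
    {"cpu": 95, "memory": 75},
]
print(emerging_problems(samples))
```

Averaging over a short window rather than alerting on a single reading avoids paging the team for a momentary blip while still catching a sustained climb.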
In summary, Data Catalog has been engineered from both a functional and non-functional perspective to solve the real problems posed by scaling big data architectures. Instead of focusing purely on the storage of the data itself, Data Catalog addresses practical issues such as findability, usability and governance — helping you not only preserve and organize your data, but also allow users and data stewards to work with it more effectively.
Originally published at www.ibm.com on September 14, 2017.