No Solution for Big Data

Junis Alico · Published in The Startup · Sep 19, 2019
[Image: Don Quixote, from Adventures of Don Quixote by Miguel de Cervantes Saavedra. Photo: rromer on Visual Hunt / CC BY-NC-SA]

When working with big data, it can sometimes feel like you’re Don Quixote tilting at windmills: “…thou art not versed in the business of adventures… get thee aside and pray, while I engage with [these giants] in fierce and unequal combat” (from Adventures of Don Quixote by Miguel de Cervantes Saavedra).

You’re not alone. Big data has stumped some of the best scientists and researchers, and companies have made fortunes managing large amounts of data and extracting meaning from it. In computer science, a plethora of new concepts has been introduced as a result of the struggle to deal with big data. Still, there is no single standardized way of managing large amounts of data, and no single algorithm that works for every use case. It has been, and will most likely remain, a pain point for data scientists.

Therefore, each company has to deal with big data differently. Sure, there are some commonalities in methodology, but nothing concrete. Each use case is unique and needs to be addressed accordingly. So what makes big data such a challenge? There are three main hurdles: storage, processing, and accessibility.

Storage

Storage, or where and how to physically store big data, may seem like the easiest challenge to tackle, but it isn’t that simple to solve. It’s true that provisioning more storage space is now easier and cheaper than ever: cloud providers simplify the process and help you build out your own data farm, and databases can be autoscaled up or down as needed, making it easier to index the data. But when building or refactoring a big data storage architecture, we must consider five factors: data compression, data search, short-term storage, long-term (cold) storage, and data movement.

Data compression: compress, compress, compress. Compressing the data before storing it is key. It saves a lot of space in the long run and takes the load off the hardware. There are numerous compression libraries available that will do the job, and you can even build your own if the data is unique enough to warrant it.

However, there are cases where the data should not be compressed further, because it will lose quality and deteriorate. This usually happens when the data consists mostly of images, videos, or audio: compressing these artifacts results in a loss of quality, so whether to do it should be decided on a case-by-case basis. For most other, text-based data, compressing before storing is a must.
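As a rough illustration, here is a minimal sketch of compressing text-based records before they hit disk, using Python’s standard zlib module; the record format and file paths are hypothetical rather than taken from any particular system.

```python
import json
import zlib

def store_record(record: dict, path: str) -> None:
    """Serialize a text-based record and compress it before writing to disk."""
    raw = json.dumps(record).encode("utf-8")
    compressed = zlib.compress(raw, level=9)  # highest compression ratio, slowest
    with open(path, "wb") as f:
        f.write(compressed)

def load_record(path: str) -> dict:
    """Read a compressed record back and decompress it."""
    with open(path, "rb") as f:
        compressed = f.read()
    return json.loads(zlib.decompress(compressed).decode("utf-8"))

if __name__ == "__main__":
    store_record({"id": 1, "text": "big data " * 1000}, "record_1.bin")
    print(load_record("record_1.bin")["id"])  # -> 1
```

The trade-off is CPU time on every write and read, which is usually cheap compared with the storage it saves on text-heavy data.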

Data search: with large amounts of data, the key to success is an overarching, consistent strategy for indexing it so that it is searchable. Here, it’s best to adopt a search-first methodology, and this is where databases become useful. Data should first be enriched with metadata and properly indexed before it is compressed and stored. A dedicated database cluster, built on NoSQL and/or a relational database, can hold that metadata.
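A minimal sketch of the search-first idea, assuming a hypothetical layout in which the metadata lives in a small relational index (SQLite here, standing in for whatever NoSQL or relational cluster you actually run) while the compressed payload goes to bulk storage:

```python
import json
import sqlite3
import zlib

# Hypothetical metadata index: one row per stored object, searchable by tags.
conn = sqlite3.connect("metadata.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS objects (
           id        INTEGER PRIMARY KEY,
           source    TEXT,
           created   TEXT,
           tags      TEXT,
           blob_path TEXT
       )"""
)

def ingest(record: dict, blob_path: str) -> None:
    """Enrich with metadata and index it first, then compress and store the payload."""
    conn.execute(
        "INSERT INTO objects (source, created, tags, blob_path) VALUES (?, ?, ?, ?)",
        (record["source"], record["created"], ",".join(record["tags"]), blob_path),
    )
    conn.commit()
    with open(blob_path, "wb") as f:
        f.write(zlib.compress(json.dumps(record).encode("utf-8")))

def search(tag: str) -> list:
    """Hit only the small metadata index; payloads are fetched and decompressed lazily."""
    return conn.execute(
        "SELECT id, blob_path FROM objects WHERE tags LIKE ?", (f"%{tag}%",)
    ).fetchall()
```

Because searches only touch the index, the bulk of the data can stay compressed, or even in cold storage, without slowing anyone down.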

Short-term storage: depending on how the data is accessed, it’s a good idea to have two storage bins, short- and long-term. The short-term bin is where data is stored temporarily. Once in that bin, the data can be aged out based on business needs, for example with a first-in-first-out (FIFO) queue: the queue holds the data for n days, and on day n+1 the oldest data is moved to long-term storage. If n is small enough, and storage space and/or processing speed are not an issue, the short-term bin can hold plain, uncompressed data, which allows for even better search capabilities.

The short-term storage tier provides better control over the data and lets newer entries be accessed much faster. In general, recent data is accessed more frequently, and as data ages, the probability of it being used again decreases. Of course, the process for moving data to long-term storage can get more intricate than a simple time-based FIFO queue. You can factor in the number of times a record has been accessed, search hits, and so on, so that a piece of data that is constantly being used, compared to all other data points, remains in the short-term queue longer.
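As an illustration only, the following sketch combines the age cutoff with an access-count bonus; the n-day threshold, the bonus weighting, and the move_to_cold_storage hook are all assumptions rather than a prescription.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Entry:
    key: str
    stored_at: datetime
    access_count: int = 0  # bumped every time the entry is read or returned by a search

def age_out(short_term: List[Entry], n_days: int = 30, hits_per_extra_day: int = 100) -> List[Entry]:
    """Return the entries that should be moved to long-term (cold) storage.

    Plain FIFO by time, except that frequently accessed entries earn extra days
    in the short-term bin (one extra day per `hits_per_extra_day` accesses).
    """
    now = datetime.now()
    to_cold = []
    for entry in short_term:
        bonus_days = entry.access_count // hits_per_extra_day
        if now - entry.stored_at > timedelta(days=n_days + bonus_days):
            to_cold.append(entry)
    return to_cold

# Hypothetical usage: everything returned here gets compressed, written to the
# cold tier, and deleted from the short-term bin.
# for entry in age_out(short_term_entries):
#     move_to_cold_storage(entry.key)
```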

Long-term (cold) storage: this is where data that does not need to be acted upon as often is stored. There must still be some way to perform all the same activities on it as in short-term storage; the difference is that any action on long-term storage will take much longer to complete.

Data movement: when terabytes, petabytes, or even larger amounts of data are created daily, moving data from cluster to cluster becomes a problem. A simple backup of both short- and long-term storage will inevitably take a long time, and data movement must be minimized at all costs to ease the load on the network. A central data repository can therefore be maintained, from which all other services (search, end-user display, sorting, indexing, etc.) operate. This strategy also helps eliminate data duplication and makes scheduled backups and disaster recovery (DR) easier to manage.
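As a toy illustration of the single-repository idea (all names here are hypothetical), the services below hold a reference to one shared store instead of each keeping its own copy, so the only bulk movement left is the scheduled backup:

```python
class CentralRepository:
    """The single authoritative copy of the data; services read it in place."""

    def __init__(self) -> None:
        self._objects = {}  # obj_id -> payload bytes; stand-in for the real storage layer

    def put(self, obj_id: str, payload: bytes) -> None:
        self._objects[obj_id] = payload

    def get(self, obj_id: str) -> bytes:
        return self._objects[obj_id]

class SearchService:
    def __init__(self, repo: CentralRepository) -> None:
        self.repo = repo  # a reference, not a copy of the data

class DisplayService:
    def __init__(self, repo: CentralRepository) -> None:
        self.repo = repo  # a reference, not a copy of the data

repo = CentralRepository()
search_service = SearchService(repo)
display_service = DisplayService(repo)
# Every service operates on the same physical data, so nothing is duplicated
# and a scheduled backup only has to snapshot the repository once.
```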

Processing

Processing concerns the algorithms that help you analyze the data, and this is where things get complicated. While storage architecture can be standardized to a degree using the five principles described above, data processing is not that straightforward. The algorithms that constantly work on the data must be fast: they operate on very large sets of data at once, so performance has to be taken into consideration.

Whether the computations the data goes through are for reporting, BI, research, graphs, or a slew of other use cases, there is no single algorithm that encompasses all the functionality a data set needs.

Each use case has to be considered separately, and the computations will be different for each data set. To improve computational performance, data scientists must understand and exploit what makes the data unique. They have to know what the data means, which data points or fields are used the most and which the least, how often data sets are accessed, and so on. Using this information can speed things up considerably when processing big data; a small sketch of the idea follows.
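To make that concrete, here is a small sketch under the assumption that you already track which fields are queried most often: it projects each record down to those hot fields before the (hypothetical) analysis step, so the heavy computation never touches the cold fields.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical usage statistics gathered from query logs: field name -> hit count.
FIELD_HITS: Dict[str, int] = {
    "user_id": 9_200_000,
    "timestamp": 8_700_000,
    "status": 5_100_000,
    "raw_payload": 12_000,
    "debug_trace": 800,
}

def hot_fields(hits: Dict[str, int], top_k: int = 3) -> List[str]:
    """Pick the k most frequently used fields."""
    return sorted(hits, key=hits.get, reverse=True)[:top_k]

def project(records: Iterable[dict], fields: List[str]) -> Iterator[dict]:
    """Stream records reduced to the hot fields, skipping everything else."""
    for record in records:
        yield {f: record.get(f) for f in fields}

def analyze(records: Iterable[dict]) -> int:
    """Stand-in for the real computation; here it just counts 'ok' statuses."""
    return sum(1 for r in records if r.get("status") == "ok")

if __name__ == "__main__":
    data = ({"user_id": i, "timestamp": i, "status": "ok", "raw_payload": "x" * 1000}
            for i in range(10_000))
    print(analyze(project(data, hot_fields(FIELD_HITS))))  # -> 10000
```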

There’s no universal library you can run against your data that will give you the results you need in a timely fashion. This is understandable: each data set is unique, with different fields, metadata, and hierarchy, so there is currently no good way to standardize the processing of such data.

Accessibility

Accessibility refers to accessing and displaying the data quickly, from anywhere. In the storage section, I described how a well-thought-out architecture can help with search, storage space, and data movement. But even after a good data architecture has been implemented, one problem remains: how do you display the data to the end user?

Searches usually return very large numbers of hits, and even the smallest result set can be too large to display all at once. If the end-user interface is a website, the browser will crash; if it is a desktop or mobile app, the machine will inevitably run out of RAM.

Just like in the processing stage, we need to look at and understand the data, using its uniqueness to show only what is needed and to minimize load times. On-demand loading should be used, and there are a number of UI/UX libraries that will let you do this; I won’t go into much detail on building such displays, since there are standardized approaches to presenting this kind of data. If feasible, a custom viewer specific to your data is best; otherwise, using an existing framework is always an option. A minimal sketch of on-demand loading follows.
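This sketch assumes the hypothetical metadata index from the storage section: the backend returns one small page of hits at a time, and the UI requests the next page only when the user scrolls or clicks. The page size and query shape are illustrative.

```python
import sqlite3

PAGE_SIZE = 50  # small enough for any browser or mobile client to render comfortably

def fetch_page(conn: sqlite3.Connection, tag: str, page: int) -> list:
    """Return one page of search hits from the metadata index, never the full set."""
    return conn.execute(
        "SELECT id, blob_path FROM objects WHERE tags LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (f"%{tag}%", PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

# Hypothetical flow: the UI calls fetch_page(conn, "sensor", 0) to render the first
# screen, then fetch_page(conn, "sensor", 1) when the user scrolls or pages forward.
```

For very deep result sets, keyset pagination (filtering on the last seen id instead of using OFFSET) scales better, but the principle is the same: never ship the full result set to the client.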

Conclusion

Many, many years ago, I interviewed at a big data company (I won’t disclose its name here, since I don’t have permission to do so). One of the questions I was asked was how I would deal with the big data problem: how would I set up an architecture to solve the issues surrounding big data? After giving the interviewers my spiel about the three main hurdles as I understood them at the time, I ended up divulging an opinion that, in retrospect, I maybe shouldn’t have. That opinion, which I still hold to this day, is that there is no solution to big data.

Big data is not something new. We’ve been working on this problem for quite some time now, at both the professional and research level. Some of the brightest computer scientists have worked on it, and yet no universal solution has been found. The reason is simple: there is no such solution, which is why we are still talking about it to this day. Sure, you can follow the concepts I outlined above to ease the pain, but the hurdles remain. Processing will be slow and will require periodic tweaking, data storage will require constant maintenance, users will complain about slow load times, and so on, and on, and on.

To clarify, there probably is a solution, and we may eventually find it, but there’s a good chance we won’t find it anytime soon. Storage is cheap but not fast where large amounts of data are concerned. We have blazingly fast computers, and they keep getting faster, but the amount of data we need to process keeps growing as well. Current technology simply will not work here; its limitations have clearly surfaced in this niche. We’ll need a big jump in processing power before we can clear these hurdles. Until then, we do the best we can with the tools we have, and the three big data hurdles will continue to be just that: hurdles.

Junis Alico is a tech exec and entrepreneur with experience in product development at Fortune & Global 500 companies, federal and local government organizations, and startups.