Published in Aljabr

14# From wireframe to fully rendered

Resampling for trial and error

The interplay between scale and reasoning contains many pitfalls for the unwary, as we discussed in the last posts. It is not something that users should need to confront on a technical level. Prototyping and testing code causes headaches for data scientists and may waste a considerable amount of computational resources. To avoid dealing with massive (non-portable) data sets, it’s common to generate fake data, much smaller in size, that can be tested quickly. In doing so, however, we might lose track of real causal effects: which combinations of changes were the ones that mattered? In this post we look at how a smart platform can shield users from these chores.

Wireframing

Most readers will have seen the movies Avatar and Jurassic Park, and will have been amazed at the realism of the computer-generated graphics in those movies. But the finished and polished version of a movie is very different from the models used during the planning and directing of the film. In the initial stages, skimpy wireframe skeletons move around the screen, with varying levels of filling over time (see the figure below). Only in the final cut are the details rendered fully at the maximum resolution of the final format.

The same principle applies when developing test data ahead of a later full processing run. The larger the input data, the more painful it becomes to test and debug processes using the real thing. For example, when I re-published my old novel Slogans on the Amazon platform, through an online book production service, the validation of the 500-page PDF document took several hours each time before crashing without a useful error message. When you try to fake data, on the other hand, you inevitably miss the real issues that come back to bite you.

While developing their story-lines, directors and CGI graphic designers work with simple real-time wireframe models. They run the same experiments over and over again, to get the details right. Later, when all the details are known, the entire experiment is run at full resolution as a batch job that might take significantly longer. The same is true of digital music rendering at different bit densities. All that happens at the flip of a switch. Data resolution affects the promise made by the final artifact. In a production environment, we want data to keep its best promises, but for wireframing a storyline strategy, we only need an outline of the data. The final rendering can be done later, in batch, in the cloud.

The ghosts in the shell

We can imagine a kind of “volume level” attached to data streams that governs their intensities, much like the volume controls on the different stages of your stereo system, from source to amplifier, etc. The lightest possible rendition of data analysis is to turn the volume to minimum, doing no sampling at all! Perhaps that sounds silly, but it could be used as a simulation of the pipeline, with no actual data, just empty messages to validate the connectivity and control flow. Tools like this have been common for modelling TCP/IP networks and Petri Nets for years.
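The zero-volume case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pipeline is described by a hypothetical dictionary mapping each stage name to its downstream stages, and `dry_run` and `unreachable` are invented names, not part of any real platform. No data is moved; an empty message is simply passed along the links to check connectivity.

```python
from collections import deque
from typing import Dict, List

# Hypothetical pipeline description: stage name -> names of downstream stages.
Topology = Dict[str, List[str]]


def dry_run(topology: Topology, source: str) -> List[str]:
    """Pass an empty message from `source` along every link.

    Returns the stages in the order they were first reached.
    No task code is executed and no data is read.
    """
    reached: List[str] = []
    pending = deque([source])
    while pending:
        stage = pending.popleft()
        if stage in reached:
            continue  # already visited via another link
        reached.append(stage)
        pending.extend(topology.get(stage, []))
    return reached


def unreachable(topology: Topology, source: str) -> List[str]:
    """List stages that the empty message never arrives at."""
    every_stage = set(topology)
    for downstream in topology.values():
        every_stage.update(downstream)
    return sorted(every_stage - set(dry_run(topology, source)))
```

A stage that shows up in `unreachable` is wired incorrectly, and that can be discovered before a single record has been sampled.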

Like the Internet, batching of dataflow is basically a traffic problem. There are stages (tasks) and there are links that guide the flow in a forward direction. Each stage needs to wait for enough data to arrive to make it worthwhile to process.
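As a rough sketch of that waiting behaviour, a link could hold records back until a minimum batch has accumulated. The function name `batched_link` and the parameter `min_batch` are assumptions made for this example only.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched_link(records: Iterable[T], min_batch: int) -> Iterator[List[T]]:
    """Release records to the next stage only in batches of `min_batch`.

    Whatever is left over when the stream ends is flushed as a final,
    smaller batch so that no record is lost.
    """
    if min_batch < 1:
        raise ValueError("min_batch must be at least 1")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) >= min_batch:
            yield batch
            batch = []
    if batch:
        yield batch
```

Because the link is a generator, the downstream stage pulls a batch only when it is ready for one, which keeps the flow moving in the forward direction.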

There are lots of ways to selectively sample data for testing, depending on file formats and batching strategies. For example:

  • First n records,
  • Every third record,
  • Skip a stage,
  • Dummy transformation.
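The four strategies in the list could look something like this in Python. The function names are made up for illustration, and "every k-th record" is interpreted here as keeping the first record and then every k-th one after it, which is one assumption among several reasonable ones.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, TypeVar

T = TypeVar("T")


def first_n(records: Iterable[T], n: int) -> Iterator[T]:
    """Keep only the first n records."""
    return islice(records, n)


def every_kth(records: Iterable[T], k: int) -> Iterator[T]:
    """Keep records at positions 0, k, 2k, ... (k=3 gives every third)."""
    return islice(records, 0, None, k)


def skip_stage(records: Iterable[T]) -> Iterator[T]:
    """Bypass a stage entirely: records pass through untouched."""
    return iter(records)


def dummy_transform(records: Iterable[Any], placeholder: Any = None) -> Iterator[Any]:
    """Stand in for an expensive stage: one cheap placeholder per record."""
    return (placeholder for _ in records)
```

All four are lazy, so they can be chained in front of a stage without reading the whole input first.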

Data scientists may only need to provide a notion of the minimum acceptable sample that can be used to generate an output. As steps are taken towards full resolution, described as policy for the pipeline, smart links could adapt automatically to build low-resolution artifacts. These temporary artifacts could be cached as just another version attribute too, allowing re-scalings to take place without mandatory re-computation of every stage.
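One way to picture resolution as "just another version attribute" is a cache keyed by stage name and resolution. The sketch below is hypothetical: `ArtifactCache` and its method are invented for this example and say nothing about how a real platform stores artifacts.

```python
from typing import Any, Callable, Dict, Tuple


class ArtifactCache:
    """Cache stage outputs, treating resolution as part of the version key."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, float], Any] = {}
        self.computations = 0  # counts how often a stage really ran

    def get_or_compute(
        self, stage: str, resolution: float, compute: Callable[[], Any]
    ) -> Any:
        """Return the cached artifact for (stage, resolution), computing it once."""
        key = (stage, resolution)
        if key not in self._store:
            self._store[key] = compute()
            self.computations += 1
        return self._store[key]
```

Switching back to a resolution that has already been rendered then costs a lookup instead of a re-run, while a new resolution triggers exactly one computation for that stage.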

Pump up the volume!

Suppose we imagine a data pipeline with several stages (in the figures below). At each smart link in the process (dashed boxes), one could imagine a slider or volume knob, something like what you find on an audio-visual mixing desk, which would cause the link to alter the throughput of data by sub-sampling transparently. The user’s task code need never know anything about the fullness of the sample: the user can keep the setting low while working from the hammock on the beach, and turn it to full once the code and data are both believed to be ready for the final cut.
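A minimal sketch of such a knob, assuming the volume is a fraction between 0 and 1 and that the link sub-samples at random. The class `SmartLink` and the example `task` are hypothetical names; the point is only that the task code is identical at every volume setting.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class SmartLink:
    """A link with a volume knob: 0.0 passes nothing, 1.0 passes everything."""

    def __init__(self, volume: float = 1.0, seed: int = 0) -> None:
        if not 0.0 <= volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")
        self.volume = volume
        # A fixed seed makes the low-resolution run repeatable.
        self._rng = random.Random(seed)

    def flow(self, records: Iterable[T]) -> Iterator[T]:
        """Pass each record on with probability equal to the volume."""
        for record in records:
            if self._rng.random() < self.volume:
                yield record


def task(records: Iterable[int]) -> int:
    """User task code: it never sees the volume setting."""
    return sum(records)
```

Calling `task(SmartLink(volume=0.1).flow(data))` on the beach and `task(SmartLink(volume=1.0).flow(data))` for the final cut changes only the link's setting, not the task.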

Turning up the volume on the lower link (between task 2 and task 3) might spawn a flood of data, which the link can then respond to by parallelizing its task. By turning the volume down, the user can save CPU cycles and time. If running the pipeline on a laptop from the beach, you don’t want to be pulling data over the WiFi. Later, when you run it in the cloud, congestion is less of an issue.
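How a link might scale its task with the volume can be sketched with Python's standard thread pool. The rule for choosing the number of workers (proportional to the volume, capped at `max_workers`) is purely an assumption for illustration, as are the function names. Threads suit I/O-bound tasks; a CPU-bound task would more likely need processes or separate machines.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def workers_for(volume: float, max_workers: int = 8) -> int:
    """Pick a worker count proportional to the volume, never less than one."""
    return max(1, math.ceil(volume * max_workers))


def run_stage(
    task: Callable[[T], R],
    records: Iterable[T],
    volume: float,
    max_workers: int = 8,
) -> List[R]:
    """Run `task` on every record, using more workers at higher volume.

    Results come back in the same order as the input records.
    """
    with ThreadPoolExecutor(max_workers=workers_for(volume, max_workers)) as pool:
        return list(pool.map(task, records))
```

At low volume the stage runs on a single worker, which is all a laptop needs; at full volume the same call spreads the work out.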

These are very simple helper functions for users, but they can make all the difference.

The beginning of a platform

Behaviours that make a real difference to user focus should only need the permission and guidance of users, not complex re-coding and associated administrative skills. They should especially not require any special incantations to summon forth changes to cloud infrastructure. Kubernetes is the beginning of a great platform, but it should never be mistaken for a cool toy to engage users. Its job is to disappear. One of Aljabr’s goals with the Koalja project is to enable platform transparency for users in all organizations. No one should need a team of SREs just to get simple work done flexibly and scalably.

Simple, Smart Data Pipelines

Mark Burgess

Technologist, Em. Professor
