A New Kind of Data

Bit width defender

Nicholas Teague
Automunge
Jul 20, 2021


Was just thinking that since it is becoming common practice in data science to normalize numeric data, it would be neat to have a new data type with limited integer registers but high capacity in the fractionals. Might be more bit width efficient for this use case.

Ok, just offered a suggestion to the IEEE workgroup behind standard 754 for floating point arithmetic, and thought I would formalize it with a demonstration to flesh out the details.

To offer a little background, IEEE-754 is the standard that defines in-memory representations of floating point numbers. For an arbitrary number (e.g. 1234.56) to be represented in a computer, it needs to be translated to a bitwise representation. Yeah so long story short, in the mainstream architectures of the day memory is composed of integrated circuits, which are like a whole bunch of transistors with on and off states, you know like 0 and 1. In order to encode a number as bits, we simply record a 0 or 1 state for each of a defined set of exponents of the number 2. For example, if we had a single bit to work with, we could represent two distinct numbers, as 0*2⁰ and 1*2⁰, which produces the two integers 0 and 1. Now let's say we had two bits to work with, so we can add a register for 2¹. We can now represent four distinct numbers, for the set {(0*2¹ + 0*2⁰), (0*2¹ + 1*2⁰), (1*2¹ + 0*2⁰), (1*2¹ + 1*2⁰)}, which is equivalent to the set {0, 1, 2, 3}. (Go ahead and do the math if you don't believe me.) Similarly, by increasing the bit width to 3 we get the capacity to represent {0, 1, 2, 3, 4, 5, 6, 7}, and so on.
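To make the register arithmetic concrete, here's a minimal python sketch (the helper name integer_register_values is just for illustration, not anything from the library) enumerating the values reachable with a given count of integer registers:

```python
# a minimal sketch enumerating the values reachable with a given count
# of positive exponent (integer) registers
from itertools import product

def integer_register_values(bits):
    # each register contributes bit * 2**position for positions bits-1 down to 0
    values = []
    for combo in product([0, 1], repeat=bits):
        values.append(sum(bit * 2**pos for pos, bit in zip(range(bits - 1, -1, -1), combo)))
    return sorted(values)

print(integer_register_values(2))  # [0, 1, 2, 3]
print(integer_register_values(3))  # [0, 1, 2, 3, 4, 5, 6, 7]
```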

This same concept can be extended to the points found after the decimal, by simply incorporating registers for negative exponents, for example a single bit could be {0*2^-1, 1*2^-1} = {0, 1/2}, or two bits could be {0, 0.25, 0.5, 0.75}. (There's a very real part of me right now feeling the urge to follow the two bit representation with the phrase "all for the gators stand up and holler", going to try and keep it together.) And of course we can combine the registers for both positive and negative exponents to represent a set of real numbers as opposed to just integers, for example if we had two positive exponent registers and one negative we can represent the set {0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5}.

There's actually a third type of register that may become useful, one to represent whether a number is positive or negative (that is, a sign register). And of course just like any other register, by incorporating it we double our representation capacity, e.g. adding a sign register to the prior example gives {3.5, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.0, -0.0, -0.5, -1.0, -1.5, -2.0, -2.5, -3.0, -3.5}. (The entry for a negative zero is a whole can of worms not going to get into at the moment.)
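Here's a similar sketch extending to sign and fractional registers (again with illustrative helper names), which reproduces the set above, albeit with the 0.0 / -0.0 pair collapsed to a single entry:

```python
# a quick sketch enumerating the values reachable with one sign register,
# two integer registers, and one fractional register
from itertools import product

def register_values(sign_bits=1, integer_bits=2, fractional_bits=1):
    exponents = list(range(integer_bits - 1, -fractional_bits - 1, -1))
    values = set()
    for combo in product([0, 1], repeat=sign_bits + len(exponents)):
        sign = -1 if sign_bits and combo[0] else 1
        magnitude = sum(bit * 2.0**exp for bit, exp in zip(combo[sign_bits:], exponents))
        values.add(sign * magnitude)  # note 0.0 and -0.0 collapse to one entry here
    return sorted(values)

print(register_values())
# [-3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```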

As we continue to increase our bit width, we can represent larger and larger numbers in the integers with positive exponent registers and finer and finer grained numbers in the fractional with negative exponent registers.

And then there are further extensions possible. We could represent numbers in base 10 instead of base 2 for instance; I'll leave that as a reader exercise to explore.

Yeah so basically part of what IEEE-754 is accomplishing is standardizing on the capacity and order of registers for several distinct data types that are common between computer languages. If you've studied textbooks you may recognize phrases like half precision, single precision, double precision, etc. Or for those in the real world working in python you're probably more familiar with designations like float16, float32, float64. The progression through these data types represents increasing bit width (i.e. 16 bits, 32 bits, 64 bits, etc) of register counts for accommodating increasing capacity in the integers and increasing precision in the fractionals. Actually there's a whole range of data types that are directed for different use cases. There are data types which are just for integer representations, and there are unsigned data types in which the sign register is omitted. For scientific applications where decimal representations require precision there are base ten alternatives to the binary representation.
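For a quick look at how the standard widths stack up, the characteristics of these float types can be inspected from python with numpy:

```python
# inspecting the standard IEEE-754 float types available in numpy
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bit width, approximate decimal digits of precision, largest representable magnitude
    print(dtype.__name__, info.bits, info.precision, info.max)
```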

So since this is an Automunge blog, seems reasonable to offer a few examples with the support of the library. So yeah, just to zoom out for a second for those new to these pages, Automunge is a python library platform for tabular data munging (where in this context "munge" refers to data transformations). The library is intended as a resource for preparing tabular data for machine learning, and through application univariate transformations are applied to each feature under automation to perform operations like normalizations of numeric sets and binarizations of categoric sets, or a user is also able to designate alternate types of transformations from an extensive built-in library documented in our readme. The library is also a resource to automate missing data infill, and with application autoML models may be trained specific to each feature to infer imputations based on other features.
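For orientation, a rough sketch of the workflow might look something like the following (based on the patterns shown in the readme; the exact call signature and returned sets may vary by version):

```python
# a rough sketch of the Automunge workflow, assuming the interface documented
# in the readme; the exact returned sets may differ by version
# !pip install Automunge
from Automunge import *
import pandas as pd

df_train = pd.DataFrame({'feature': [1.2, 3.4, 5.6, None]})

am = AutoMunge()

# under automation numeric features are z-score normalized and missing entries
# imputed; the returned tuple includes the encoded training data along with a
# postprocess_dict for consistently preparing additional data downstream
returned_sets = am.automunge(df_train, floatprecision=32)
```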

As a supplement to this essay, we'll offer a companion demonstration Colaboratory notebook (linked below), in which we'll turn to a particular option for encoding numeric sets available in the Automunge library as the 'qbt1' family of transforms. These transforms are of interest because they are a resource for translating a column of numeric integers or floats to a multicolumn binarization similar to the form discussed above, that is with boolean integer encodings for each register associated with a set of positive or negative exponents for the base number 2. The qbt1 transform was actually created to support a use case for quantum computing, where some algorithms take as input a set of qubits initialized to the binary representations of numbers (partly inspired by discussions in the book Programming Quantum Computers by Eric Johnston, Nic Harrigan, and Mercedes Gimeno-Segovia). The qbt1 transforms accept parameters to designate the quantity of integer, fractional, and sign registers, and return the resulting encoding as a set of columns with boolean integer entries and with suffix appenders to the column headers designating each associated register. In this context the library is especially helpful when you consider that it doesn't only offer encodings, it can also invert those encodings to recover the original form of numeric floats, such as may be desired after completion of a quantum circuit for instance. Thus the qbt1 transform can be helpful to illustrate conversions between displayed floats and the binarized representations stored in memory.
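To illustrate the spirit of the encoding (this is an independent toy reimplementation for the essay, not the library's qbt1 code), a float can be translated to sign/integer/fractional registers and inverted back along these lines:

```python
# an illustrative reimplementation of the register encoding concept: encode a
# float into sign / integer / fractional boolean registers and invert back
def binarize(value, integer_bits=3, fractional_bits=12):
    sign = 1 if value < 0 else 0
    # scale so the fractional registers become low-order integer bits
    scaled = round(abs(value) * 2**fractional_bits)
    scaled = min(scaled, 2**(integer_bits + fractional_bits) - 1)  # clip overflow
    bits = [(scaled >> i) & 1 for i in range(integer_bits + fractional_bits - 1, -1, -1)]
    return [sign] + bits

def invert(registers, fractional_bits=12):
    sign, bits = registers[0], registers[1:]
    scaled = sum(bit << i for i, bit in zip(range(len(bits) - 1, -1, -1), bits))
    return (-1 if sign else 1) * scaled / 2**fractional_bits

encoded = binarize(1.5)
print(encoded)          # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(invert(encoded))  # 1.5
```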

Yeah so before linking to the demonstrations, turning back to the whole point of the essay: the common forms of float data types were initially designed so that increased precision in the fractional registers corresponded to increased capacity in the integers. In other words, as we progress through the bit widths of 16/32/64, in each case IEEE-754 adds registers in both the integers and fractionals.

As the Automunge development journey has progressed, we've given increasing attention to data types to try and minimize bit width for returned representations. In general, boolean integer sets are returned as type int8, ordinal sets are returned with a conditional encoding between uint8/16/32 as a function of the size of the encoding space, and the default float type is 32 bit, configurable by the automunge(.) floatprecision parameter between 16/32/64. And yeah so as I was considering what data types to default to for numeric sets, I really struggled with what capacity float is ideal. You see, the numeric forms returned from Automunge have a unique characteristic in comparison to general numeric sets found in the wild, in that under automation we normalize everything by z-score normalization, which is another way of saying we center the data to a mean of zero and scale to a standard deviation of 1. This type of normalization is common practice in data science for neural networks, and is a benign operation towards the performance of decision tree based learning. For most thin tailed distributions, z-score will give us some amount of certainty for the expected range of most returned entries, as for example any resulting entries outside of the range +/-6 would represent entries falling >6 standard deviations from the mean, which for a gaussian (normal) distribution covers in excess of 99.99966% of the data.
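As a quick sanity check of that intuition, here's what a z-score normalization looks like applied to a synthetic feature:

```python
# z-score normalize a numeric feature and confirm the resulting entries
# fall within a narrow band around zero
import numpy as np

feature = np.random.normal(loc=50.0, scale=10.0, size=100_000)

# center to mean 0 and scale to standard deviation 1
normalized = (feature - feature.mean()) / feature.std()

print(normalized.mean(), normalized.std())  # ~0.0, ~1.0
print(normalized.min(), normalized.max())   # typically well within +/-6
```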

Of course not all data will be normally distributed. The presence of fat tails can be detected by measuring statistics like kurtosis and skew, and in these cases we suggest mitigation by applying other options available in the library to translate to a more tractable distribution prior to normalization, such as the 'bxcx' transforms built on top of the scipy.stats implementation of the Box-Cox power law transform, or as another option the 'qttf' transforms built on top of the sklearn.preprocessing QuantileTransformer. (This type of operation is available under automation by activating the automunge(.) powertransform parameter.)
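For reference, the underlying scipy and sklearn implementations can be applied directly along these lines (outside of the library's wrappers):

```python
# applying the underlying scipy / sklearn implementations directly to translate
# a skewed feature toward a more tractable distribution before normalization
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

skewed = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Box-Cox requires strictly positive inputs and returns the fitted lambda
boxcox_transformed, fitted_lambda = stats.boxcox(skewed)

# QuantileTransformer maps the feature onto an approximately gaussian output
qt = QuantileTransformer(output_distribution='normal')
quantile_transformed = qt.fit_transform(skewed.reshape(-1, 1))

print(stats.skew(skewed), stats.skew(boxcox_transformed))
```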

So yeah, getting back on the train of thought, as I was considering what kind of default should be applied to numeric sets, I realized that when data is normalized, you end up with a somewhat inefficient bit representation in any of the IEEE-754 default float types, since as the types increase expressivity in the fractionals we end up with more and more unused integer registers.

The solution is very simple. We need a new data type with limited integer capacity and high capacity in the fractionals. This would result in much improved bit width efficiency for our numeric representations in data science, and could accomplish benefits like speeding up model training and who knows perhaps even dampening carbon intensity of machine learning practice at scale.

What I would suggest is to create a set of normalized float types; we'll call them normal half precision, normal single precision, and normal double precision. And then when we go to represent them in python perhaps we can call them something like nfloat16 / nfloat32 / nfloat64. Yeah so basically as you may infer from these labels, I'm advocating that we create standard representations for normalized floats with register counts of 16/32/64, but in this case limited integer depth and high capacity fractionals. For our nfloat16, if we set 1 sign register, 3 integer registers, and 12 fractional registers, we can accommodate data approximately +/-8 standard deviations from the mean with around 4 significant figures in the decimals. For our nfloat32 we could do the same with 1 sign and 3 integer registers, but increase the fractional registers to 28, which would give us something like 8 significant figures in the decimals. Similarly our nfloat64 would realize around 18 significant figures in the decimals.
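Here's the back-of-envelope math for these (hypothetical, not yet standardized) formats, assuming 1 sign register, 3 integer registers, and the balance allocated to fractional registers:

```python
# back-of-envelope math for the proposed (hypothetical) nfloat formats, each
# with 1 sign register, 3 integer registers, and the remaining bits fractional
import math

for name, total_bits in (('nfloat16', 16), ('nfloat32', 32), ('nfloat64', 64)):
    fractional_bits = total_bits - 1 - 3
    max_value = 2**3 - 2**-fractional_bits    # largest representable magnitude
    resolution = 2**-fractional_bits          # smallest representable step
    decimal_digits = fractional_bits * math.log10(2)
    print(f'{name}: range +/-{max_value}, step {resolution}, ~{decimal_digits:.1f} decimal digits')
```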

Oh and one more thing, it would be great if we had an integer version of the boolean data type, so that we could represent the integers 0/1 with one bit instead of eight. Cheers.
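As a point of reference, numpy can already pack boolean entries at one bit apiece for storage purposes, there's just no native one-bit dtype for direct computation:

```python
# packing boolean entries at one bit per value for storage with numpy
import numpy as np

booleans = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.int8)
packed = np.packbits(booleans)   # 8 entries -> 1 byte
print(packed, packed.nbytes)     # [178] 1
print(np.unpackbits(packed))     # recovers the original 0/1 entries
```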

This essay has a companion Colaboratory notebook available here.


Books that were referenced here or otherwise inspired this post:

Programming Quantum Computers — Eric Johnston, Nic Harrigan, and Mercedes Gimeno-Segovia


As an Amazon Associate I earn from qualifying purchases.

Miles Davis’ Blue in Green — Nicholas Teague

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
