Approximate Computing

Chamod Samarajeewa
Jul 25, 2017 · 8 min read

What is Approximate Computing (AC) ?

A computation which provides an inaccurate result instead of exact answer within a certain accuracy range.

Inaccurate ?

  • Not wrong
  • Not random
  • Deviated from the exact answer with a known error margin
  • Good enough

Overall Impact of AC

Trade off between quality of output and performance of operation

  • Quality of the output will reduce due to the approximation
  • Performance will increase due to the bypassing of steps in calculation

Why AC is needed ?

  • Need to reduce energy consumption

Usage of all the data for calculations makes heavy use of hardware. It can be reduced by using an approximate set of data to calculate. Further, the AC algorithm can be more cost effective than normal algorithm

  • Due to the Increasing amount of data stored in databases

Today, the databases store huge amount of data of different types. Querying some data from a huge amount of data can be costly. In these situations, AC will be used to filter the data which are only going to be used for the output.

  • For real time computations

In stream processing, instead of saving all the data received via a stream, a representative value of the data can be saved (e.g. sum of data). Those representative values can be used to derive approximative statistic values regarding data (e.g. mean, median, variance).

  • Need of performance over accuracy of outputs

Situations will arise where how fast the output is obtained is more important than the accuracy of the output

E.g. — Knowing the number of customers visited the web site like “Ebay”. “Ebay” has billions of customers. Therefore, knowing the exact number of customers within a hour will be useless. Instead knowing an approximate answer within a second will be more useful.

How to achieve AC ?

Probabilistic data structures

When analyzing big data we may need to make checks on pieces of data like number of items in the dataset, number of unique items, and their occurrence frequency. Hash tables or Hash sets are usually used for this purpose. But when the dataset becomes very large that it cannot fit inside the memory all at once, we need to use special kinds of data structures known as Probabilistic Data Structures.

  • Bloom Filter
  • Cuckoo Filter
  • Count-Min Sketch
  • AMS Sketch
  • HyperLogLog

(1) Bloom Filter

A bit array of fixed size (m) is used to represent the values inserted to a set. Certain number (k ) of hash functions (h1,h2,…,hk) are defined to map the values to a hash value of 0,1,…,m-1.

insertion of value “foo”

Insertion of values

  • Consider m=10, k=3
  • Input value = “foo”
  • h1(“foo”) = 0, h2(“foo”) = 3, h3(“foo”) = 6
  • The bits in the positions obtained from hash functions are set to 1
querying a value “z”

Querying a value

  • Consider that the x, y, z are entered as values so that the relevant bits in the array are set to 1.
  • When checking whether value z is present in the set
  • h1(z) = 4, h2(z) = 13, h1(z) = 15
  • Since not all the three bits are set to 1, z not in the set
  • A value is said to be present in the set only if all the bits relevant to the hash values are set to 1.

Important

  • Can be false positive

The values which are not present in the set can also be identified as present in the set.

  • Not false negative

If a value is not in the set, output will definitely be no.

  • Minimizing the false positive rate

m = size of the bit array

n = estimated number of insertions

k = number of hash functions

False positive probability (p) = ( 1-e ^ (-kn/m) ) ^ k

p is minimized when,

  • m = -n.ln(p)/(ln(2)²)
  • k = m.ln(2)/n
  • e ~ 2.71828

Applications

  • To check and block the IP addresses which are not allowed for certain web sites
  • To filter Rows in large tables within databases when using join operations
  • Keep track of the pages that a given user has visited using a browser

(2) Cuckoo Filter

An alternative for the bloom filter with a contrasting feature of deleting elements. An array is used to store the fingerprints of the input values.

Insertion of values

  • Input value = “x”
  • h1(“x”) = 2, h2(“x”) = 6
  • If either one out of 2 or 6 is empty, “x” is placed in that cell. Since both cells are full, one item(a) is relocated to its alternate position. As that cell is also occupied by c, c is also relocated to its alternate location. This procedure happens finite amount of steps or until an empty position is found.
Cuckoo Filter

Querying a value

  • Querying is rather simple. If the hashed values of the value is present in one of its locations in the array, the value is said to be present.

Deleting a value

  • If the hashed values of the value is present in one of its locations in the array, delete the value from that cell.

(3) Count-Min Sketch

A 2D integer array used to keep count of the values entered to a certain set.

Width of the array = w

Height of the array = number of hash functions used = d

Selecting w and d

  • ε = error factor (accuracy of output)
  • δ = certainty with which the accuracy is reached
  • w = e/ε and d = ln(1/δ)
  • w and d is selected by considering the accuracy of the output which is expected

Inserting a value

  • Consider d number of hash functions (h1, h2,…, hd)
  • Each hash function maps the values to one of values in 0, 1,…, w-1
  • Consider the hash values obtained for a value is,
  • h1(value) = 5, h2(value) = 3,…, hd(value) = 8
  • Each value in the cells relevant for the hash functions is incremented by 1
Count-Min Sketch

Querying a value

  • Consider if you want to know the count of value x
  • Then all the hash values of x are calculated
  • h1(x) = 1, h1(x) = , …, hd(x) = 2
  • Then the minimum value in the cells of those values are taken as the answer
  • Count = min{cell_value(h1(x)), cell_value(h2(x)), …, cell_value(hd(x))}
  • This algorithm does not under count the value. But over counts can happen due to the mapping of different values to the same cell by the hash functions.

(4) AMS Sketch

An alternative for Count-Min Sketch. More generalized version of Count-Min Sketch. Uses extra set of hash functions (g1, g2, …, gd) to map hash values to one of +1 or -1.

Selecting w and d

  • ε = error factor (accuracy of output)
  • δ = certainty with which the accuracy is reached
  • w = 4/(ε²) and d = 8log(1/δ)

Inserting a value

  • Like in Count-Min Sketch, value is hashed by set of hash functions (h1, h2, …, hd).
  • Then the obtained values are again hashed with another set of hash functions (g1, g2, …, gd) to obtain +1 or -1
  • Unlike in Count-Min Sketch, without directly incrementing (+1) the cell values, this checks for the second hash values (g1, g2, …, gd) to select whether to increment or decrement (+1 or -1)
AMS Sketch

Querying a value

  • Consider if you want to know the count of value x. Then all the hash values of x are calculated
  • h1(x) = 1, h1(x) = , …, hd(x) = 2
  • Unlike in Count-Min Sketch, without getting the minimum of the values, the median of the values are taken.
  • Count = median{cell_value(h1(x)), cell_value(h2(x)), …, cell_value(hd(x))}
  • The reason for taking median is to be within the tail bounds of the approximation
tail bounds of the approximation

(5) HyperLogLog

An integer array of fixed size(m) used to keep the count of cardinality of a set. Use the concept of occurring of patterns.

  • If a bit string of 0010xxx… is seen, this string has a probability of 1/16 (=1/2x1/2x1/2x1/2) of occurring.
  • That means, to obtain that kind of strings, at least 16 (=2⁴) strings should have been occurred.
  • Therefore, by keeping the largest bit pattern of the inserted values, an approximation for the number of unique values can be obtained.
  • To improve the approximation, inputs are categorized into different buckets(array cells) by using a fixed size prefix of them.
  • Then the remaining part of the input is used to find the largest bit pattern (e.g. consecutive number of zeros).
HyperLogLog example

Insertion of values

  • Consider the 3 IP address values.
  • First, they are converted to bit strings via a hash function.
  • Here, the first 4 bits are considered as bucket ids.
  • Then the number of consecutive zeros in the remaining part of bit string is entered into the buckets unless the values already in the buckets are bigger than them.

Querying values

  • Dis-consider the largest 30% of the values in the buckets for the purpose of reducing errors.
  • Take the harmonic mean of the remaining values (70% buckets).
Harmonic mean formula
  • Harmonic mean is used to normalize the result.
  • m = array size
  • Cardinality =( 2 ^ (harmonic_mean) ) x m x Estimation_Factor
  • Average Error = 1.04/√m

Applications

  • Find the unique number of visitors to a web site.
  • For the aggregation functions in the databases.

The above mentioned data structures are few of the most important probabilistic data structures which can save a lot of time and improve the performance of applications dramatically.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade