What trends are driving us toward Big Data?

Notes from Couchbase CB020 Fundamentals of NoSQL

http://learn.couchbase.com/login#/course/609093980578837103

Velocity

Batch → Periodic → Near real time → Real time

Variety

Table → Database → Web Photo Album → Unstructured Social Media

Volume

Megabytes → Gigabytes → Terabytes → Petabytes

Why are social and mobile content so significant to this process?

  • 1.75 billion smartphone users
  • 35 billion hrs/month spent online
  • 3 billion global online population

US adults spend 7.5 hours per day online

Every single minute:

$272,000 in purchasing

$391,680,000/day

207 new mobile activations

298,080/day
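
The per-day figures follow from multiplying each per-minute rate by the 1,440 minutes in a day. A minimal sketch of that arithmetic (the function and variable names are illustrative, not from the course):

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def per_day(per_minute: int) -> int:
    """Scale a per-minute rate to a per-day total."""
    return per_minute * MINUTES_PER_DAY


purchases_per_day = per_day(272_000)  # dollars spent per day
activations_per_day = per_day(207)    # new mobile activations per day

print(f"${purchases_per_day:,}/day")   # $391,680,000/day
print(f"{activations_per_day:,}/day")  # 298,080/day
```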

How have cloud-based apps changed the game?

Before:

Client-Server apps on LAN/WAN

On-premise apps with thousands of users

High-end centralized servers

Gigabytes of well-known, structured data

Now:

Tablet, mobile, and browser apps in the cloud

Cloud-based apps with millions of users

Running on robust clusters of commodity servers

Terabytes+ of ever-changing, unstructured data

What is the internet of things?

Currently 14 billion things connected to the internet

By 2020? 32 billion

Many tracking unstructured data over time

20% of all data generated from embedded sensors

Why does personalization drive big data flows?

Over 200 million people shop online in the US alone

Shopping cart — item, item, item

Preferences — like, like, like, like

Reviews — comment, comment, comment

Prior Purchases — date, date

Travel Plans — document, document, document

Credit History — check, check

Purchase — approved

Triple this worldwide

Each transaction

  • closes the deal by integrating with multiple systems
  • adds to the customer’s ever-growing history

Various data structures from different platforms all drive what’s come to be known as big data

What is unstructured data?

Data where the number, type, and length of fields in each record may vary and are not precisely known until the record is stored

  • Sensor data
  • Customer profiles
  • Blog posts
  • Product catalogs
  • Online content
  • File uploads
  • Social media
  • System logs
  • Personalized news
  • Cloud API data feed
  • Much, much more
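
To make the definition concrete, here is a small sketch of three records whose fields differ in number, name, and type. The records and the `field_summary` helper are hypothetical examples, not data from the course:

```python
import json

# Hypothetical records: each one carries a different set of fields.
records = [
    {"type": "sensor", "device_id": "t-17", "temp_c": 21.4},
    {"type": "profile", "name": "Ana", "likes": ["hiking", "jazz"]},
    {"type": "blog_post", "title": "Hello", "body": "First post", "tags": []},
]


def field_summary(record: dict) -> dict:
    """Map each field name to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


# The shape of each record is only known once the record is in hand.
for record in records:
    print(json.dumps(record), "->", field_summary(record))
```

A fixed table schema would need a column for every field that might ever appear; here each record simply describes itself.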

What data structures are driving all this volume?

  • User activity
  • Customer activity
  • Machine activity
  • Transactions
  • Social comments
  • Media uploads
  • etc

The volume of unstructured data is exploding …

How do you put all this data to work?

Analytic Use

Using batched workloads, you can run vast data aggregations for retrospective analysis on focused data pools to improve future outcomes

Operational Use

Using real-time intelligence about your data flows and processes

With extremely fast (in-memory) reads and extremely fast (log-append) writes, you can improve the current outcome

How are analytics and operational uses different?

Batch-oriented analytical database systems, such as Hadoop: analytical and volume

Real-time operational database systems, such as NoSQL: operational and velocity
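
The contrast can be sketched in a few lines: an analytic job aggregates a whole batch after the fact, while an operational system updates its state on every event and can answer immediately. The event data and names below are hypothetical, and plain Python stands in for both kinds of database:

```python
from collections import defaultdict

# Hypothetical purchase events: (user, amount in dollars).
events = [("ana", 30), ("bo", 12), ("ana", 8), ("cy", 50)]


def batch_totals(history):
    """Analytic use: aggregate a complete batch retrospectively."""
    totals = defaultdict(int)
    for user, amount in history:
        totals[user] += amount
    return dict(totals)


class RunningTotals:
    """Operational use: update per event and answer right away."""

    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, user, amount):
        self.totals[user] += amount
        return self.totals[user]  # current total, available immediately


print(batch_totals(events))

live = RunningTotals()
for user, amount in events:
    print(user, live.record(user, amount))
```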

Big data is the result of increased velocity, variety and volume

Average adults are spending 7+ hours per day online

Cloud-based solutions invite 24x7 global scale

The “internet of things” is opening surging data flows

Most new data flows are unstructured

Analytical databases improve future outcomes

Operational databases improve the current outcome

How does Big Data challenge traditional RDBMS technology?

How are traditional RDBMS struggling?

Slow performance at cloud scale — millions of concurrent users

Inflexible for cloud data — continuously evolving data models and documents

Why does it matter how systems scale?

Application servers “share nothing” to scale horizontally/out

Need more power?

Add another commodity server

Add another

RDBMS servers generally “share everything” to scale vertically/up

Need more power?

Replace your server. Again and again

RDBMS Scale up

Big servers are expensive

And there’s a point at which the underlying software can no longer address all the power of your monolithic server

Upgrade software — Buy a bigger server

Upgrade software — Buy a bigger server

Cloud applications scale out

Just add another box to the rack

This way your system cost stays in line with your application performance

How do the architectures fundamentally differ?

RDBMS approach — Disk first

Write to disk → Log for availability → Cache index in memory

Virtually all reads/writes must seek on disk

It becomes a bottleneck at cloud scale

NoSQL approach — Memory First

Cache data in memory → Replicate for availability → Write to disk

All reads/writes can be served immediately from memory
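
A toy sketch of the memory-first write path, assuming a single replica and a queue that is flushed to disk later. The class and its dictionaries are hypothetical stand-ins and do not represent how Couchbase or any other product is implemented:

```python
from collections import deque


class MemoryFirstStore:
    """Toy model: acknowledge from memory, persist afterwards."""

    def __init__(self):
        self.cache = {}            # in-memory working set
        self.replica = {}          # stand-in for a replica node's memory
        self.disk = {}             # stand-in for durable storage
        self.disk_queue = deque()  # writes waiting to be persisted

    def write(self, key, value):
        self.cache[key] = value               # 1. cache in memory
        self.replica[key] = value             # 2. replicate for availability
        self.disk_queue.append((key, value))  # 3. queue for disk

    def read(self, key):
        return self.cache.get(key)  # served from memory, no disk seek

    def flush(self):
        """Drain queued writes to the disk stand-in."""
        while self.disk_queue:
            key, value = self.disk_queue.popleft()
            self.disk[key] = value


store = MemoryFirstStore()
store.write("user:1", {"name": "Ana"})
print(store.read("user:1"))  # readable before anything reaches disk
store.flush()
print(store.disk)
```

The trade-off this illustrates: reads and writes return without waiting for disk, and availability in the window before the flush rests on the replica copy.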

RDBMS technology uses fixed, pre-defined tables

Each new data type needs planning and integration

  • Normalization
  • New relations
  • Query drafting and optimization

But applications focus on objects: ever-evolving things like people, places, and products

Is there something wrong with rows?

Expensive disk seeks and table joins required just to assemble logically related data

Complex object-relational mapping (ORM) frameworks have evolved

But the impedance mismatch between rows and objects leads to complex application code

Code to handle the mismatch is complex and expensive to maintain
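
A small sketch of the mismatch, using Python's built-in `sqlite3` module. The tables, the `load_customer` helper, and the sample data are hypothetical; the point is that the relational side needs a join plus mapping code to produce what the application holds as one object:

```python
import sqlite3

# Relational side: one logical customer is spread across two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cart_items (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana');
    INSERT INTO cart_items VALUES (1, 'book'), (1, 'lamp');
""")


def load_customer(conn, customer_id):
    """Join the rows, then hand-map them back into one object."""
    rows = conn.execute(
        "SELECT c.name, i.item FROM customers c "
        "JOIN cart_items i ON i.customer_id = c.id "
        "WHERE c.id = ? ORDER BY i.rowid",
        (customer_id,),
    ).fetchall()
    return {"name": rows[0][0], "cart": [item for _, item in rows]}


# Document side: the same customer is already one self-contained object.
customer_doc = {"name": "Ana", "cart": ["book", "lamp"]}

print(load_customer(conn, 1) == customer_doc)
```

Every new nested field (reviews, preferences, travel plans) adds another table, another join, and more mapping code on the relational side.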

How have RDBMS tried to solve these problems?

Need performance?

Try adding a caching layer (memcached)

Need scalability?

Try manual table sharding across a cluster

Need flexibility?

Try storing documents and binaries in a common table as key-value pairs
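
The first two workarounds end up as application code. Below is a rough sketch, assuming a cache-aside read pattern and hash-based shard routing; the dictionaries stand in for a memcached-style cache and three database servers, and all names are hypothetical:

```python
import hashlib

SHARDS = [{}, {}, {}]  # stand-ins for three database servers
cache = {}             # stand-in for a memcached-style cache


def shard_for(key: str) -> int:
    """Pick a shard index from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % len(SHARDS)


def put(key, value):
    SHARDS[shard_for(key)][key] = value  # the application routes the write
    cache.pop(key, None)                 # invalidate any stale cached copy


def get(key):
    if key in cache:                     # cache hit: skip the database
        return cache[key]
    value = SHARDS[shard_for(key)].get(key)
    if value is not None:
        cache[key] = value               # populate the cache on a miss
    return value


put("user:1", "Ana")
print(get("user:1"), "on shard", shard_for("user:1"))
```

Routing, invalidation, and rebalancing when a shard is added all become the application's responsibility in this setup.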

How has NoSQL technology evolved to respond?

Need performance?

Caching built-in and optimized

Memory-first architecture

Need scalability?

Clustering built-in and optimized

Replication and failover fully managed

Need flexibility?

Key-Value/Document storage built-in and optimized

SQL for Documents, Global Secondary Indexing, and MapReduce Views

RDBMS systems scale up to bigger hardware, not out

Scale-out architecture has been proven in the application tier

NoSQL takes it to the persistence tier

RDBMS architecture is disk-first, for durability

NoSQL is memory-first for performance

RDBMS schema focuses on strict tables and rows to save disk space

NoSQL schemas are flexible, to make use and coding easier

RDBMS try to keep up with caching, table sharding, and BLOBs

NoSQL systems build these in, optimized for scalability and speed