What trends are driving us toward Big Data?
Notes from Couchbase CB020 Fundamentals of NoSQL
http://learn.couchbase.com/login#/course/609093980578837103
Velocity
Batch → Periodic → Near Real-time → Real-time
Variety
Table → Database → Web Photo Album → Unstructured Social Media
Volume
Megabyte → Gigabytes → Terabytes → Petabytes
Why are social and mobile content so significant to this process?
- 1.75 billion smartphone users
- 35 billion hrs/month spent online
- 3 billion global online population
US adults spend 7.5 hours per day online
Every single minute:
$272,000 in purchasing
$391,680,000/day
207 new mobile activations
298,080/day
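The per-day totals above follow from the per-minute figures by multiplying by the 1,440 minutes in a day; a quick check:

```python
# Scaling the per-minute figures from the notes to per-day totals.
MINUTES_PER_DAY = 24 * 60  # 1,440

purchases_per_minute = 272_000   # dollars spent online per minute
activations_per_minute = 207     # new mobile devices per minute

purchases_per_day = purchases_per_minute * MINUTES_PER_DAY
activations_per_day = activations_per_minute * MINUTES_PER_DAY

print(f"${purchases_per_day:,}/day")   # $391,680,000/day
print(f"{activations_per_day:,}/day")  # 298,080/day
```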
How have cloud-based apps changed the game?
Before:
Client-Server apps on LAN/WAN
On premise apps with thousands of users
High-end centralized servers
Gigabytes of well-known, structured data
Now:
Tablet/Mobile Browser apps in the cloud
Cloud based apps with millions of users
Running on robust clusters of commodity servers
Terabytes+ of ever-changing, unstructured data
What is the internet of things?
Currently 14 billion things connected to the internet
By 2020? 32 billion
Many tracking unstructured data over time
20% of all data generated from embedded sensors
Why does personalization drive big data flows?
Over 200 million people shop online in the US alone
Shopping cart — item, item, item
Preferences — like, like, like, like
Reviews — comment, comment, comment
Prior Purchases — date, date
Travel Plans — document, document, document
Credit History — check, check
Purchase — approved
Triple this worldwide
Each transaction
- closes the deal by integrating with multiple systems
- adds to the customer’s ever-growing history
Various data structures from different platforms all drive what’s come to be known as big data
What is unstructured data?
Data where the number, type, and length of fields in each record may vary and are not precisely known until stored
- Sensor data
- Customer profiles
- Blog posts
- Product catalogs
- Online content
- File uploads
- Social media
- System logs
- Personalized news
- Cloud API data feed
- Much, much more
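The definition above can be made concrete with two records from the same logical collection; the field names and values below are illustrative, not from the course:

```python
# Two "customer profile" records from the same collection.
# The number, type, and length of fields vary per record --
# nothing forces them to share a fixed schema.
profile_a = {
    "id": "u:1001",
    "name": "Ada",
    "interests": ["cycling", "databases"],
}
profile_b = {
    "id": "u:1002",
    "name": "Grace",
    "last_login": "2014-06-01T09:30:00Z",
    "address": {"city": "Arlington", "state": "VA"},
}

# A schema-flexible store accepts both as-is; a fixed RDBMS table would not.
collection = {doc["id"]: doc for doc in (profile_a, profile_b)}
assert set(profile_a) != set(profile_b)  # different fields, same collection
```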
What data structures are driving all this volume?
- User activity
- Customer activity
- Machine activity
- Transactions
- Social comments
- Media uploads
- etc
The volume of unstructured data is exploding …
How do you put all this data to work?
Analytic Use
Using batched workloads, you can run vast aggregations and retrospective analysis over focused data pools to improve future outcomes
Operational Use
Using real-time intelligence from your data flows and processes
Because with extremely fast (in-memory) reads and extremely fast (log-append) writes, you can improve the current outcome
How are analytics and operational uses different?
Batch-oriented analytical database systems such as Hadoop — analytical and volume
Real-time operational database systems — NoSQL — Operational and velocity
Big data is the result of increased velocity, variety and volume
Average adults are spending 7+ hours per day online
Cloud based solutions invite 24x7 global scale
The “internet of things” is opening surging data flows
Most new data flows are unstructured
Analytical databases improve future outcomes
Operational databases improve the current outcome
How does Big Data challenge traditional RDBMS technology?
How are traditional RDBMS struggling?
Slow performance at cloud scale — millions of concurrent users
Inflexible for cloud data — continuously evolving data models and documents
Why does it matter how systems scale?
Application servers “share nothing” to scale horizontally/out
Need more power?
Add another commodity server
Add another
RDBMS servers generally “share everything” to scale vertically/up
Need more power?
Replace your server. Again and again
RDBMS Scale up
Big servers are expensive
And there’s a point at which the underlying software can no longer address all the power of your monolithic server
Upgrade software — Buy a bigger server
Upgrade software — Buy a bigger server
Cloud applications scale out
Just add another box to the rack
This way your system cost stays in line with your application performance
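The scale-out idea can be sketched as key distribution across commodity nodes. The modulo placement below is deliberately naive and purely illustrative; production clusters (Couchbase included) use fixed hash partitions so that adding a node relocates only a fraction of the keys:

```python
import hashlib

def owner(key, nodes):
    """Pick the node responsible for a key by hashing it (naive modulo placement)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

cluster = ["node1", "node2", "node3"]
home = owner("user:42", cluster)  # deterministic: same key, same node

# Need more power? Just add another box to the rack.
cluster.append("node4")
```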
How do the architectures fundamentally differ?
RDBMS approach — Disk first
Write to disk → Log for availability → Cache index in memory
Virtually all reads/writes must seek on disk
It becomes a bottleneck at cloud scale
NoSQL approach — Memory First
Cache data in memory → Replicate for availability → Write to disk
All reads/writes can be served immediately from memory
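A minimal sketch of that memory-first path (class and method names are mine, not Couchbase's): the mutation is served from memory, replication is stubbed out, and durability comes from a sequential log append rather than a disk seek:

```python
import os
import tempfile

class MemoryFirstStore:
    """Toy memory-first store: all reads/writes hit an in-memory dict;
    durability comes from appending to a log (sequential I/O, no seeks)."""

    def __init__(self, log_path):
        self.cache = {}                        # all data lives in memory
        self.log = open(log_path, "a")         # append-only log file

    def write(self, key, value):
        self.cache[key] = value                # 1. memory first: write is live
        self.replicate(key, value)             # 2. replicate for availability
        self.log.write(f"{key}\t{value}\n")    # 3. persist with a log append

    def read(self, key):
        return self.cache[key]                 # served from memory, never disk

    def replicate(self, key, value):
        pass  # a real cluster would push the mutation to replica nodes here

store = MemoryFirstStore(os.path.join(tempfile.gettempdir(), "mfs.log"))
store.write("user:1", "Ada")
print(store.read("user:1"))  # Ada
```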
RDBMS technology uses fixed, pre-defined tables
Each new data type needs planning and integration
- Normalization
- New relations
- Query draftings and optimizations
But applications focus on objects: ever-evolving things like people, places, and products
Is there something wrong with rows?
Expensive disk seeks and table joins required just to assemble logically related data
Complex object-relational mapping (ORM) frameworks have evolved
But the impedance mismatch between rows and objects leads to complex application code
Code to handle the mismatch is complex and expensive to maintain
How have RDBMS tried to solve these problems?
Need performance?
Try adding a caching layer (memcached)
Need scalability?
Try manual table sharding across a cluster
Need flexibility?
Try storing documents and binaries in a common table as key-value pairs
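The caching-layer workaround above is the cache-aside pattern; in this sketch a plain dict stands in for memcached, and the database read is stubbed:

```python
cache = {}  # stand-in for a memcached tier bolted onto the RDBMS

def slow_db_read(key):
    """Stand-in for a SQL SELECT that must seek on disk."""
    return f"row-for-{key}"

def get(key):
    if key in cache:               # fast path: served from memory
        return cache[key]
    value = slow_db_read(key)      # slow path: hit the disk-bound database
    cache[key] = value             # populate cache for subsequent reads
    return value

first = get("user:42")   # misses the cache, reads the database
second = get("user:42")  # now served from the cache
```

The catch is that the application now owns cache invalidation and consistency with the database, which is exactly the kind of complexity the workaround was supposed to hide.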
How has NoSQL technology evolved to respond?
Need performance?
Caching built-in and optimized
Memory-first architecture
Need scalability?
Clustering built-in and optimized
Replication and failover fully managed
Need flexibility?
Key-Value/Document storage built-in and optimized
SQL for Documents, Global Secondary Indexing, and MapReduce Views
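A MapReduce view can be sketched in a few lines: a map function emits (key, value) pairs per document, and a reduce folds the values for each key. The comment-count example below is invented for illustration:

```python
docs = [
    {"type": "comment", "author": "ada", "text": "nice"},
    {"type": "comment", "author": "ada", "text": "+1"},
    {"type": "comment", "author": "grace", "text": "agreed"},
]

def map_fn(doc):
    """Emit one (key, value) pair per matching document."""
    if doc["type"] == "comment":
        yield doc["author"], 1

def reduce_fn(values):
    """Fold the emitted values for a single key."""
    return sum(values)

# Build the view index the way a view engine would: map, group by key, reduce.
index = {}
for doc in docs:
    for key, value in map_fn(doc):
        index.setdefault(key, []).append(value)
view = {key: reduce_fn(vals) for key, vals in index.items()}
print(view)  # {'ada': 2, 'grace': 1}
```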
RDBMS systems scale up to bigger hardware, not out
Scale-out architecture has been proven in the application tier
NoSQL takes it to the persistence tier
RDBMS architecture is disk-first, for durability
NoSQL is memory-first for performance
RDBMS schema focuses on strict tables and rows to save disk space
NoSQL schemas are flexible, making usage and coding flexible
RDBMS try to keep up with caching layers, table shards, and BLOBs
NoSQLs build these in, optimized for scalability and speed