<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Tino Tereshko on Medium]]></title>
        <description><![CDATA[Stories by Tino Tereshko on Medium]]></description>
        <link>https://medium.com/@thetinot?source=rss-a539d84012a4------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*c4ggED7ljoQt9MSjFTODfA.jpeg</url>
            <title>Stories by Tino Tereshko on Medium</title>
            <link>https://medium.com/@thetinot?source=rss-a539d84012a4------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 21 Jul 2017 22:58:22 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/@thetinot" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[The 12 Components of Google BigQuery]]></title>
            <link>https://medium.com/google-cloud/the-12-components-of-google-bigquery-c2b49829a7c7?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/c2b49829a7c7</guid>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[bigquery]]></category>
            <category><![CDATA[big-data]]></category>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Tue, 11 Jul 2017 00:28:37 GMT</pubDate>
            <atom:updated>2017-07-11T18:01:33.167Z</atom:updated>
            <content:encoded><![CDATA[<p>Folks have been discussing BigQuery quite a bit these days, which is fantastic. But there’s a lot of STUFF to BigQuery — it’s a sophisticated, mature service with many moving pieces, and it’s easy to get lost!</p><p>In order to aid in understanding what exactly IS the BigQuery service, here is a quick rundown of what I’d consider the major user-facing components:</p><ul><li>Serverless Service Model</li><li>Opinionated Storage Engine</li><li>Dremel Execution Engine &amp; Standard SQL</li><li>Separation of Storage and Compute through Jupiter Network</li><li>Enterprise-grade Data Sharing</li><li>Public Datasets, Commercial Datasets, Marketing Datasets, and the Free Pricing Tier</li><li>Streaming Ingest</li><li>Batch Ingest</li><li>Federated Query Engine</li><li>UX, CLI, SDK, ODBC/JDBC, API</li><li>Pay-Per-Query AND Flat Rate Pricing</li><li>IAM, Authentication &amp; Audit Logs</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JHmYWdJNM9D-2Fb3hs1nXw.png" /><figcaption>BigQuery through the lens of a practitioner</figcaption></figure><p>There is, of course, much more to BigQuery than this, and we’re taking a customer-centric viewpoint here. I’m ignoring some of the more mundane bits. That said, we previously wrote about <a href="https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood">what BigQuery looks like under the hood</a>, and the <a href="https://medium.com/google-cloud/15-awesome-things-you-probably-didnt-know-about-google-bigquery-6654841fa2dc">15 things you probably didn’t know about BigQuery</a>.</p><h3><strong>Serverless Service Model</strong></h3><p>Probably the most important aspect of BigQuery is its serverless model (excuse the buzzword). 
I can say this with a straight face — BigQuery carries some of the highest levels of abstraction, <a href="https://cloud.google.com/blog/big-data/2016/08/google-bigquery-continues-to-define-what-it-means-to-be-fully-managed">manageability</a>, and automation in the industry, freeing you from the tyranny of VMs and CPU/RAM sizing. BigQuery’s compute is incredibly elastic, <a href="https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query">capable of scaling</a> to tens of thousands of cores for just a few seconds, while letting you pay only for what you consume; BigQuery is highly available, durable, and secure out of the box. There, checks all the boxes of being “serverless”, you see…</p><h3>Opinionated Managed Storage Engine</h3><p>BigQuery has an amazing storage engine, continuously evolving and optimizing your storage on your behalf — for free and without disruptions.</p><p><a href="https://www.systutorials.com/3202/colossus-successor-to-google-file-system-gfs/">Colossus</a> is Google’s successor to GFS. Colossus is great — it’s durable, incredibly performant, and super-scalable. BigQuery stores its data in Colossus in the opinionated <a href="https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format">Capacitor</a> format. BigQuery’s Capacitor does lots of optimizations under the hood, burning a good amount of CPU/RAM in the process (without affecting your query performance.. or your bill!).</p><p>One great example of what BigQuery does with storage is automatically re-materializing your data when your tables are backed by too many small files. The “many small files” problem is the bane of a whole generation of DBAs.</p><h3>Dremel Execution Engine &amp; Standard SQL</h3><p>Everyone knows that BigQuery runs on top of <a href="https://research.google.com/pubs/pub36632.html">Dremel</a>.
That said, Dremel itself has evolved, and it’s a bit of a different beast these days than what’s described in the paper:</p><ul><li>As of Summer of 2015, 100% of BigQuery users run on a new version of Dremel.</li><li>BigQuery <a href="https://cloud.google.com/blog/big-data/2016/08/in-memory-query-execution-in-google-bigquery">executes</a> its shuffle in-memory in a separate sub-service</li><li>Dremel does <a href="https://www.youtube.com/watch?v=UueWySREWvk">stuff</a> like pipelined execution and smart scheduling</li></ul><p>Dremel itself is a vast multi-tenant compute cluster. Your queries are just short-term tenants in Dremel, and the Dremel scheduler performs Cirque Du Soleil-like gymnastics to keep all queries running in top shape. The nature of Dremel also makes you immune to any individual node going down — Yay!</p><p>Dremel these days supports its legacy SQL-ish dialect, as well as the 2011 ANSI Standard SQL dialect.</p><h3>The Jupiter Network and Separation of Storage &amp; Compute</h3><p>We’ve <a href="https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood">ridden</a> this hobby horse to its last legs. <a href="https://cloudplatform.googleblog.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html">Jupiter</a> is Google’s data center network, capable of a Petabit of bisection bandwidth, and allowing BigQuery to sling data from storage to compute without a hiccup. It’s the glue.</p><h3>Enterprise-Grade Data Sharing</h3><p>BigQuery’s pure separation of storage and compute, coupled with the awesomeness of Colossus, allows folks to share Exabyte-scale datasets with each other, much like Google Docs and Sheets are shared today.</p><p>In some architectures it’s not even considered an anti-pattern to copy data into disparate clusters in order to share it, creating data silos.
I’ll argue that data silos are the worst — you’re playing a game of telephone with data, you’re increasing your complexity of operations, your infrastructure is very inefficient, and in the end your bill is through the roof!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/646/1*45pSqcaKcAGUJuGyOQ_IgA.png" /><figcaption>Just say no to data silos</figcaption></figure><p>BigQuery doesn’t leverage VMs for its storage layer (not even as accelerators), so you’re immune to any data storage-related shenanigans like race conditions, locking, and hot spotting, and you can throw away your knowledge of the CLONE and SWAP commands. Did I mention that Colossus is great?</p><p>Finally, the serverless nature of BigQuery allows you to share data with other organizations, without ever forcing them to create their own clusters. You pay for storage, they pay per-query, and it’s all entirely transparent. Who likes paying for idle clusters? That’s right, nobody!</p><h3>Public Datasets, Commercial Datasets, Marketing Datasets, and FREE TIER!!</h3><p>Some good folks are leveraging BigQuery’s powerful data sharing features to do some really cool stuff.</p><p>The <a href="https://cloud.google.com/bigquery/public-data/">Public Datasets Program</a> has the momentum of a downhill runaway freight train. Datasets are added almost weekly, with NOAA data being the latest entrant.</p><p>You can either monetize or procure commercial-grade datasets through BigQuery’s <a href="https://cloud.google.com/commercial-datasets/">Commercial Datasets</a>.
Likewise, if you’re using Google Analytics, AdWords, DoubleClick, or YouTube, your data can end up in BigQuery in one click with <a href="https://cloud.google.com/blog/big-data/2017/05/introducing-ads-data-hub-next-generation-insights-and-reporting">Marketing Datasets</a> (<a href="https://www.youtube.com/watch?v=2vMfseR8EhA">as folks from Mercedes-Benz, Hearst, and the New York Times found out</a>).</p><p>Finally, BigQuery has a perpetual <a href="https://cloud.google.com/free/">free tier</a>, allowing you to store 10GB and query up to 1TB per month.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zSsvBjb81CcpYEvE-g8g5g.png" /><figcaption>SapientRazorFish’s story</figcaption></figure><h3>Streaming and Batch Ingest</h3><p>BigQuery’s <a href="https://cloud.google.com/bigquery/streaming-data-into-bigquery">Streaming API</a> is a rather unique feature. You’re <a href="https://cloud.google.com/blog/big-data/2017/06/life-of-a-bigquery-streaming-insert">able to stream data into BigQuery</a> to the tune of millions of rows per second, and data is available for analysis almost immediately. This is actually a pretty hard technical problem for analytic databases to solve, so kudos to the team here.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/623/1*dmk4UUFpokIrgVDWlx3k3g.png" /><figcaption>Streaming Ingest — more than meets the eye</figcaption></figure><p>BigQuery’s batch ingest is no slouch either, and there’s a whole separate <a href="https://medium.com/google-cloud/paying-it-forward-how-bigquerys-data-ingest-breaks-tech-norms-8bfe2341f5eb">blog post</a> on it (tl;dr: it doesn’t eat at your query capacity, nor does it cost anything!)</p><h3>Federated Query Engine</h3><p>If your data resides in Bigtable, GCS, or Google Drive, you’re able to query that data directly from BigQuery without any data movement.
This is what we refer to as “<a href="https://cloud.google.com/bigquery/external-data-sources">federated query</a>”.</p><h3>UX, CLI, SDK, ODBC/JDBC, API</h3><p>Typical access patterns, all wrapped around the REST API. One point worth mentioning is how nice it is to work with BQ’s semantics:</p><ul><li>Jobs that commit storage (query, load, copy) commit all-or-nothing. There is no need to clean up failed or half-completed jobs.</li><li>Queries see storage at a snapshot in time. They are thus immune to race conditions, table/row/cell locks, halfway states, whatever.</li></ul><h3>Pay-Per-Query AND Flat Rate Pricing</h3><p>BigQuery has two pricing models — the ultra-efficient cloud-native pay-per-query model, and the predictable Enterprise-grade Flat Rate model.</p><p>Folks decry the pay-per-query model as being too expensive. I am empathetic to folks having a hard time grokking this model and predicting costs under it, but expensive it is not. You ONLY pay for what you consume, and not a penny more, and in analytic workloads (which tend to be volatile) that <a href="https://cloud.google.com/blog/big-data/2016/02/visualizing-the-mechanics-of-on-demand-pricing-in-big-data-technologies">saves a lot of dough</a>.</p><p>If you have a rather large use case and cherish price predictability over efficiency, BigQuery does offer the Flat Rate Pricing model. You pay one flat fee, and all queries are free!</p><p>Here’s the cherry on top — you get perfect visibility into either model.
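To make the trade-off between the two models concrete, here is a back-of-the-envelope sketch. The $5/TB on-demand rate and the flat monthly fee below are placeholder assumptions for illustration, not official pricing:

```python
# Back-of-the-envelope comparison of the two pricing models.
# ASSUMED numbers for illustration: $5 per TB scanned on-demand,
# and a hypothetical $40,000/month flat-rate fee.
ON_DEMAND_PER_TB = 5.0        # assumed on-demand price, $/TB scanned
FLAT_RATE_MONTHLY = 40_000.0  # hypothetical flat-rate fee, $/month

def on_demand_cost(tb_scanned: float) -> float:
    """Pay-per-query: you pay only for the bytes your queries scan."""
    return tb_scanned * ON_DEMAND_PER_TB

def cheaper_model(tb_scanned: float) -> str:
    """Pick the cheaper model for a given monthly scan volume."""
    return "on-demand" if on_demand_cost(tb_scanned) < FLAT_RATE_MONTHLY else "flat-rate"

# Flat rate starts paying off past this monthly scan volume:
break_even_tb = FLAT_RATE_MONTHLY / ON_DEMAND_PER_TB

print(cheaper_model(500))     # volatile, lighter workload → on-demand
print(cheaper_model(20_000))  # sustained, heavy workload → flat-rate
print(break_even_tb)          # → 8000.0
```

The point of the sketch: for volatile analytic workloads, the pay-per-query model charges nothing in idle months, while the flat-rate model buys predictability once sustained volume clears the break-even point.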
You can choose to jump from one model to the next, as it fits your budget needs.</p><h3>IAM, Authentication and Audit Logs</h3><p>BigQuery is compliant with Google Cloud’s <a href="https://cloud.google.com/iam/">IAM</a> policies, which allow organizations to carve out high-granularity roles and controls for their users.</p><p>BigQuery supports two general modes of authentication:</p><ul><li>OAuth (the 3-legged user-involved auth approach)</li><li>Service Accounts (headless through a secrets file)</li></ul><p>There are valid use cases for both. OAuth is great if you’re already integrating with Google’s authentication, and Service Accounts work if you’re federating access controls on your side.</p><p>Finally, BigQuery’s Audit Logs are a paper trail of all things that happen in BigQuery. A large number of users export Audit Logs back to BigQuery and <a href="https://medium.com/google-cloud/visualize-gcp-billing-using-bigquery-and-data-studio-d3e695f90c08">visualize BigQuery usage in Data Studio</a> in real time!</p><p>So there you have it. You’ve made it to the bottom of my tirade. Hopefully I’ve done an okay job detailing the breadth and power behind BigQuery, Google’s analytics workhorse since 2012. Please do leave comments or questions!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c2b49829a7c7" width="1" height="1"><hr><p><a href="https://medium.com/google-cloud/the-12-components-of-google-bigquery-c2b49829a7c7">The 12 Components of Google BigQuery</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud Platform — Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Yea there’s definitely a use case, and our products folks are constantly looking at what’s best for…]]></title>
            <link>https://medium.com/@thetinot/yea-theres-definitely-a-use-case-and-our-products-folks-are-constantly-looking-at-what-s-best-for-c744ee54c0b4?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/c744ee54c0b4</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Wed, 28 Jun 2017 17:51:47 GMT</pubDate>
            <atom:updated>2017-06-28T17:51:47.007Z</atom:updated>
            <content:encoded><![CDATA[<p>Yea there’s definitely a use case, and our products folks are constantly looking at what’s best for the market. Since Google is the contra here, however, this would likely take shape of “surge pricing” or something similar.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c744ee54c0b4" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hello Saif,]]></title>
            <link>https://medium.com/@thetinot/hello-saif-84fd7069ea4c?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/84fd7069ea4c</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Wed, 28 Jun 2017 17:26:49 GMT</pubDate>
            <atom:updated>2017-06-28T17:26:49.255Z</atom:updated>
            <content:encoded><![CDATA[<p>Hello Saif,</p><p>I agree with the premise that in a fully-transparent and frictionless market, forces of supply and demand meet at the most efficient point, maximizing value for both parties (or at least for the system in aggregate).</p><p>However, there are a number of friction points in play here with Spot Instances:</p><ul><li>Per-hour billing granularity</li><li>The high number of instance types prevents commoditization of the asset down to its raw qualities of CPU, RAM, network, etc.</li><li>Ephemeral non-guaranteed nature of assets</li><li>Volatility of supply and demand</li><li>Volatility of pricing.. averages look great, but highs can be very high.</li></ul><p>It’s telling that there is a thriving ecosystem trying to smooth out Spot Instance markets.</p><p>You are also operating under the assumption that Preemptible VMs suffer from low availability and have large contention on the demand side.</p><p>So yes, while Preemptible VMs aren’t hyper-efficient in terms of distributing value, they’re simple and predictable. You know what you’re getting, and our clients tell us that’s pretty important.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=84fd7069ea4c" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Google Data WareCity]]></title>
            <link>https://medium.com/@thetinot/the-google-data-wareuniverse-cccc7768000d?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/cccc7768000d</guid>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[bigquery]]></category>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Fri, 23 Jun 2017 23:15:58 GMT</pubDate>
            <atom:updated>2017-06-23T23:27:57.510Z</atom:updated>
            <content:encoded><![CDATA[<p>The data analytics space is fascinating, with healthy competition and an increasing pace of innovation. One example is the novel idea of cross-organizational data sharing — leveraging pure separation of storage and compute to keep storage in one place and give organizations the ability to share vast amounts of data with each other without data movement (and without data silos, which turn into data swamps).</p><p>This is a fantastic feature of BigQuery, one that BigQuery has had since its General Availability in 2012.</p><p>There are some very interesting and unique aspects of BigQuery’s data sharing capability worth pointing out:</p><ul><li><a href="https://cloud.google.com/bigquery/public-data/">BigQuery Public Datasets</a> is a freight train of innovation and community-building, with GitHub, Hacker News, NOAA, and many other datasets hosted and updated, and we invite you to partake as well!</li><li>BigQuery is a purely serverless offering. This means that clients do not need to go the extra step of deploying and sizing their clusters. Reliability is also a nice bonus here, since we abstract away hardware instability and maintenance.</li><li><a href="https://cloud.google.com/commercial-datasets/">BigQuery Commercial Datasets.</a></li><li>BigQuery has a perpetual <a href="https://cloud.google.com/free/">free tier</a> — store 10GB of data for free and query 1TB of data for free per month.</li><li>One-click integration with Google Analytics, Google AdWords, Google DoubleClick, YouTube, and so on.</li><li>Federated Query and Federated Storage — you can use BigQuery Storage from Hadoop/Spark/Flink/Dataflow, and you can run BigQuery SQL on GCS/Bigtable.</li></ul><p>With that in mind, I am here to introduce a new buzzword — the Google Data WareCity — it’s much more than a house, you see!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cccc7768000d" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Good point — custom discounts are frequent in the cloud industry.]]></title>
            <link>https://medium.com/@thetinot/good-point-custom-discounts-are-frequent-in-the-cloud-industry-23afd7cedc7a?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/23afd7cedc7a</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Tue, 21 Mar 2017 09:57:42 GMT</pubDate>
            <atom:updated>2017-03-21T09:57:42.348Z</atom:updated>
            <content:encoded><![CDATA[<p>Good point — custom discounts are frequent in the cloud industry.</p><p>However, do not conflate “custom discounts” with Google’s Committed Use Discount — two very different concepts.</p><p>As long as we’re on this topic, I do encourage you to watch the interview with Pivotal at Re:Invent :</p><p><a href="https://youtu.be/BUN0wx2e6Dc?t=4m54s">https://youtu.be/BUN0wx2e6Dc?t=4m54s</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=23afd7cedc7a" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Great post!]]></title>
            <link>https://medium.com/@thetinot/great-post-c93f2c26f424?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/c93f2c26f424</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Wed, 15 Mar 2017 23:58:15 GMT</pubDate>
            <atom:updated>2017-03-15T23:58:15.621Z</atom:updated>
            <content:encoded><![CDATA[<p>Great post! It’s also worth mentioning that Google Cloud has Spanner, Bigtable, and Datastore.. and AWS has DynamoDB</p><p>(work on GCP as well)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c93f2c26f424" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[I know this is a blog about AWS and for/by AWS customers, but the topic is highly pertinent, so I…]]></title>
            <link>https://medium.com/@thetinot/i-know-this-is-a-blog-about-aws-and-for-by-aws-customers-but-the-topic-is-highly-pertinent-so-i-905f84397a8c?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/905f84397a8c</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Wed, 15 Mar 2017 23:42:46 GMT</pubDate>
            <atom:updated>2017-03-15T23:42:46.037Z</atom:updated>
            <content:encoded><![CDATA[<p>I know this is a blog about AWS and for/by AWS customers, but the topic is highly pertinent, so I feel the conversation would benefit from author and readers knowing that Google Cloud offers Multi-Regional storage, at a very competitive price:</p><p><a href="https://cloud.google.com/storage/docs/storage-classes#multi-regional">https://cloud.google.com/storage/docs/storage-classes#multi-regional</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=905f84397a8c" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Great response, and I’m glad you brought this up.]]></title>
            <link>https://medium.com/@thetinot/great-response-and-im-glad-you-brought-this-up-5bea7bb2bf0f?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/5bea7bb2bf0f</guid>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Sun, 12 Mar 2017 22:48:01 GMT</pubDate>
            <atom:updated>2017-03-12T22:48:01.649Z</atom:updated>
            <content:encoded><![CDATA[<p>Great response, and I’m glad you brought this up.</p><p>The argument in my blog is confined to a specific scope — how and why Google’s Compute is more flexible than Amazon’s, especially when it comes to RIs. I am certainly not intending for that argument to be taken as a superset of all arguments around cloud.</p><p>Your response does expand the scope of this conversation, and into a topic that Google shines in (and far from accidentally): powerful higher-level services. So I’d love to hear your thoughts on just some of these points:</p><ul><li>Unique to Google — Google has <a href="https://cloud.google.com/storage/docs/storage-classes#multi-regional">Multi-region Object Storage</a>, in case of those pesky regional storage outages.</li><li>Unique to Google — Bigtable, a high-level serverless NoSQL database, <a href="https://cloudplatform.googleblog.com/2016/03/financial-services-firm-processes-25-billion-stock-market-events-per-hour-with-Google-Cloud-Bigtable.html">demonstrated</a> by Sungard FIS last year to serve 53 million qps.</li><li>Unique to Google — BigQuery <a href="https://cloud.google.com/blog/big-data/2016/08/google-bigquery-continues-to-define-what-it-means-to-be-fully-managed">provides</a> the highest level of manageability for an analytics database yet, at Petabyte-scale.
Last week Yahoo and NYT detailed their migrations from, well, a different cloud vendor :)</li><li>Unique to Google — <a href="https://cloudplatform.googleblog.com/2017/02/introducing-Cloud-Spanner-a-global-database-service-for-mission-critical-applications.html">Spanner</a> is the only serverless, Petabyte-scale, strongly consistent, regionally replicated, horizontally scalable, and highly available RDBMS.</li><li>Unique to Google — Cloud Dataflow is a serverless batch and stream processing engine.</li><li>Unique to Google — Dataproc offers job-scoped Hadoop and Spark clusters, which is very compelling to folks coming from EMR; I wrote on that topic <a href="https://medium.com/@thetinot/why-dataproc-googles-managed-hadoop-and-spark-offering-is-a-game-changer-9f0ed183fda3">here</a>.</li></ul><p>So while I attempted to clarify one bit of debate around just one piece of the puzzle (compute), your chief argument is much broader, and it holds up well on Google Cloud. In fact, one may very easily argue that Google is a market leader in offering higher-level services, and I gave just a couple of points of evidence for you to ponder.</p><p>I look forward to hearing your thoughts!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5bea7bb2bf0f" width="1" height="1">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Google’s Answer to AWS Reserved Instances is a Big Deal]]></title>
            <link>https://hackernoon.com/why-googles-answer-to-aws-reseved-instances-is-a-big-deal-4b9b36d8e631?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/4b9b36d8e631</guid>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[azure]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Fri, 10 Mar 2017 17:56:58 GMT</pubDate>
            <atom:updated>2017-07-17T16:17:51.011Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Why Google’s Answer to AWS Reserved Instances is a Big Deal</strong></p><p><strong><em>Update March 13</em></strong><em>: Two days later AWS </em><a href="https://aws.amazon.com/blogs/aws/new-instance-size-flexibility-for-ec2-reserved-instances/"><em>responded</em></a><em> to this move by relaxing rules about switching between instance types. However, this does nothing to alleviate restrictions around instance families and generations. You are still limited by network/disk/GPU/CPU characteristics of the instance family.</em></p><p>I’ve been watching discussions around Google’s freshly minted Committed Use Discounts program, and how it compares with Amazon’s Reserved Instances (RIs). The verdict is in — third parties like <a href="http://www.rightscale.com/blog/cloud-cost-analysis/aws-reserved-instances-vs-google-committed-use-discounts">Rightscale have already done the math</a> and shown Google to be at least 35% cheaper.</p><p>It’s easy to overlook the larger impact, and some press has already <a href="http://www.zdnet.com/article/google-cloud-platform-committed-use-discounts-trump-aws-reserved-instances-for-now/">concluded</a> that it’s an easy glitch to fix (just drop that 35% gap to 0%, you see…). Not so fast. Google Committed Use Discounts are much more than just a “pricing schema”. There are some serious practical benefits that have nothing to do with cost, and the other cloud vendors aren’t in a position to compete here technologically any time soon. In the end, all customers win!</p><p>Let’s discuss the what, the how, and the why…</p><p><strong>Google Committed Use Discounts vs Amazon RIs</strong></p><p>In short, Google now <a href="https://cloud.google.com/compute/pricing#committed_use">lets</a> users pre-purchase chunks of CPU and RAM on 1- and 3-year commitments in return for substantial discounts, up to 57%.
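As a quick illustration of what that discount means over a commitment term, here is a sketch in Python. Only the “up to 57%” figure comes from this post; the per-vCPU-hour list price is a made-up placeholder, not a real GCE rate:

```python
# What "up to 57% off" means over a 3-year commitment.
# The per-vCPU-hour list price here is a made-up placeholder.
LIST_PRICE = 0.04          # hypothetical $/vCPU-hour list price
COMMITTED_DISCOUNT = 0.57  # "up to 57%" for committed use

def committed_price(list_price: float, discount: float) -> float:
    """Effective hourly price after a committed-use discount."""
    return list_price * (1.0 - discount)

hourly = committed_price(LIST_PRICE, COMMITTED_DISCOUNT)
# Cost of 16 committed vCPUs over 3 years, paid as you go, no upfront:
three_year = hourly * 16 * 24 * 365 * 3
print(round(hourly, 4))   # → 0.0172
print(round(three_year, 2))
```

Because the commitment is on raw CPU and RAM rather than on an instance type, that discounted capacity can be re-carved into whatever VM shapes you need during the term.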
With Google you can create <a href="https://cloud.google.com/compute/pricing#custommachinetypepricing">Custom VMs</a>, picking your own CPU and RAM configuration. All instances get fantastic networking, and all instances can get top-notch disk and GPUs. So you are truly buying CPU and RAM, while retaining architectural flexibility. Users get to turn the IOPS/disk/network/GPUs knobs whenever they want, invariant of “instance family” or some other arbitrary (to us) limitations.</p><p>This is in sharp contrast to what Amazon offers with RIs — pre-purchasing instance types, which have specific characteristics like “nice network instances” or “GPU instances” or “great storage instances”. Thus with Amazon you’re pre-purchasing a pre-set configuration of CPU/RAM/IOPS/Network/GPU/Disk characteristics with only minimal flexibility (mostly around EBS and instance sizes), and your mobility to other pre-set configurations is severely limited. So you better be damn sure you made the right choice, because you’re living with it for 1–3 years.</p><p>Here’s another way to look at AWS instances. A “compute optimized” instance is just another name for an instance with inferior disk, network, and a low amount of RAM. Why not have great everything!!</p><p><strong>How how how is this even possible?</strong></p><p>Google is able to offer this due to the unique nature of Google Cloud. Google Compute Engine under the hood is NOT a service that sells a bunch of VMs running on specific hardware. Compute Engine is an opinionated, living and breathing supercomputer, continuously carving out resources for its clients in the most optimal fashion (compare this to Microsoft’s perplexing <a href="http://www.zdnet.com/article/azure-is-becoming-the-first-ai-supercomputer-says-microsoft/">claims</a> in this space).
Complexity is abstracted away, and users are exposed to familiar IaaS primitives — VMs, networking, disk, etc.</p><p>Since 2013 Google’s been heavily leveraging <a href="https://cloudplatform.googleblog.com/2015/03/Google-Compute-Engine-uses-Live-Migration-technology-to-service-infrastructure-without-application-downtime.html">Live Migration</a> to help make these primitives as customer-friendly as possible (not to mention to patch critical hypervisor flaws, perform maintenance, or remove noisy-neighbor problems). Goodbye maintenance windows!</p><p>Live Migration also lets us truly maximize performance, and make that performance stable and predictable. Test us, I dare you! As far as I know, no other cloud has Live Migration, certainly not to the same degree.</p><p>Here are some more critical technicals of Google Cloud:</p><ul><li>Google runs homogeneous <a href="https://cloud.google.com/security/whitepaper">hardware footprints</a>, manufacturing and designing components that go into its data centers.</li><li>Google’s <a href="https://cloudplatform.googleblog.com/2015/06/A-Look-Inside-Googles-Data-Center-Networks.html">Jupiter</a> network offers a Petabit of bisection bandwidth within each data center cell. Bandwidth does grow on trees, and it’s a major reason why Google’s able to offer juicy services like BigQuery and Spanner.</li><li>Borg is Google’s orchestrator (and a predecessor to <a href="https://kubernetes.io/">Kubernetes</a>), spinning up workloads on-demand, bin-packing resources, performing rolling upgrades, and making sure performance is top-notch all around.</li></ul><p>Google’s the only cloud provider running this way, and has been doing so for the past 3.5+ years.
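To picture the kind of resource bin-packing an orchestrator like Borg performs, here is a toy first-fit-decreasing sketch (purely illustrative; Borg's real scheduler is vastly more sophisticated, and nothing here comes from Google):

```python
# Toy first-fit-decreasing bin packing: place jobs (by vCPU demand)
# onto identical machines, opening a new machine only when needed.
# Purely illustrative; not how Borg actually schedules.
def pack(cpu_demands, machine_cpus):
    machines = []   # remaining free vCPUs per machine
    placement = {}  # job index (after sorting) -> machine index
    for job, demand in enumerate(sorted(cpu_demands, reverse=True)):
        for i, free in enumerate(machines):
            if demand <= free:          # first machine with room wins
                machines[i] -= demand
                placement[job] = i
                break
        else:                           # no room anywhere: open a machine
            machines.append(machine_cpus - demand)
            placement[job] = len(machines) - 1
    return len(machines), placement

n, _ = pack([8, 2, 4, 4, 6, 8], machine_cpus=16)
print(n)  # → 2: all six jobs bin-pack onto two 16-vCPU machines
```

The payoff of doing this well at data-center scale is utilization: tightly packed machines mean spare capacity that can be resold, which is part of what makes flexible discounting economically possible.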
The fact that no other competitor has emerged there over such a long timeframe is indicative of just how damn hard the problem is, and how many technical barriers there are to creating this offering.</p><p><strong>Benefits of Committed Use Discounts</strong></p><p>Let’s quickly run through some of the benefits of Google Committed Use Discounts:</p><ul><li>They’re quite inexpensive. I suspect this matters for clients, but don’t hold me to it.</li><li>You aren’t required to pay upfront to get the inexpensive price.</li><li>On Google, you aren’t stuck on an “instance family”, forced to creatively lobby your sales rep to get moved up to the newest generation. Sadly, your sales rep’s incentives are in direct conflict with yours here, as even <a href="https://youtu.be/BUN0wx2e6Dc?t=4m54s">Pivotal</a> found out (go watch that video!).</li><li>You aren’t stuck on a CPU/RAM combination that may be inefficient or wasteful for you. Re-carve it using Custom VMs.</li><li>Your network/disk/IOPS/GPU knobs aren’t soldered shut. You retain 100% of the flexibility here. This is a big deal!</li><li>As Urs mentioned in his talk yesterday, you can retire your “Ministry of RI Optimizations”. Stop playing RI Tetris, seriously — the sadomasochism is entirely optional.</li></ul><p><strong>How do you compete with this?</strong></p><p>You can answer Google’s Committed Use Discounts by lowering prices or making RIs more user-friendly and less restrictive, but in order to offer something comparable, you need to do some serious engineering homework:</p><ul><li>Get yourself a custom-manufactured data center stack with very few vendor dependencies</li><li>Make sure this stack is as homogeneous as possible</li><li>Build a Jupiter-like network that lets every instance in a data center talk to every other instance at 10G... 
All at the same time...</li><li>Acquire Borg or a similarly-minded orchestrator, and heavily invest in this orchestrator for 10+ years</li><li>Create VMs that give you best-in-class disk AND network AND GPUs, invariant of “instance family” or type.</li><li>Productize Live Migration and give it 3+ years to mature.</li><li>Productize Custom Machine Types</li></ul><p>Do not discount (no pun intended) the technical complexity here. These problems are very, very hard. As Eric Schmidt has said, Google’s poured $30 billion over the past three years on this bonfire, and it shows. In the end, users win!</p><p>So think of Google next time you’re trying to make sense of your cloud bill, or next time your sales rep calls you to re-up your commitment, or next time you’re trying to get a discount from your cloud vendor. You deserve the best!</p><blockquote><a href="http://bit.ly/Hackernoon">Hacker Noon</a> is how hackers start their afternoons. We’re a part of the <a href="http://bit.ly/atAMIatAMI">@AMI</a> family. We are now <a href="http://bit.ly/hackernoonsubmission">accepting submissions</a> and happy to <a href="mailto:partners@amipublications.com">discuss advertising &amp; sponsorship</a> opportunities.</blockquote><blockquote>To learn more, <a href="https://goo.gl/4ofytp">read our about page</a>, <a href="http://bit.ly/HackernoonFB">like/message us on Facebook</a>, or simply, <a href="https://goo.gl/k7XYbx">tweet/DM @HackerNoon.</a></blockquote><blockquote>If you enjoyed this story, we recommend reading our <a href="http://bit.ly/hackernoonlatestt">latest tech stories</a> and <a href="https://hackernoon.com/trending">trending tech stories</a>. 
Until next time, don’t take the realities of the world for granted!</blockquote><hr><p><a href="https://hackernoon.com/why-googles-answer-to-aws-reseved-instances-is-a-big-deal-4b9b36d8e631">Why Google’s Answer to AWS Reserved Instances is a Big Deal</a> was originally published in <a href="https://hackernoon.com">Hacker Noon</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Dataproc — Google’s managed Hadoop and Spark offering is a game changer.]]></title>
            <link>https://hackernoon.com/why-dataproc-googles-managed-hadoop-and-spark-offering-is-a-game-changer-9f0ed183fda3?source=rss-a539d84012a4------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f0ed183fda3</guid>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[hadoop]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <dc:creator><![CDATA[Tino Tereshko]]></dc:creator>
            <pubDate>Thu, 05 Jan 2017 16:29:53 GMT</pubDate>
            <atom:updated>2017-07-14T20:33:39.097Z</atom:updated>
<content:encoded><![CDATA[<p>So far I’ve written articles on Google BigQuery (<a href="https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query">1</a>,<a href="https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood">2</a>,<a href="https://cloud.google.com/blog/big-data/2016/08/google-bigquery-continues-to-define-what-it-means-to-be-fully-managed">3</a>,<a href="https://medium.com/p/6654841fa2dc">4</a>,<a href="https://medium.com/google-cloud/paying-it-forward-how-bigquerys-data-ingest-breaks-tech-norms-8bfe2341f5eb">5</a>), on cloud-native economics (<a href="https://cloud.google.com/blog/big-data/2016/02/visualizing-the-mechanics-of-on-demand-pricing-in-big-data-technologies">1</a>,<a href="https://cloud.google.com/blog/big-data/2016/02/understanding-bigquerys-rapid-scaling-and-simple-pricing">2</a>), and even on ephemeral VMs (<a href="https://medium.com/@thetinot/google-clouds-spot-instances-win-big-and-you-should-too-5b244ca3facf#.soec2n94i">1</a>). One product that really excites me is Google Cloud <a href="https://cloud.google.com/dataproc/">Dataproc</a> — Google’s managed Hadoop, Spark, and Flink offering. In what at first glance seems to be a fully commoditized market, Dataproc manages to create significant differentiated value that promises to transform how folks think about their Hadoop workloads.</p><p><strong>Jobs-first Hadoop+Spark, not Clusters-first</strong></p><p>The typical mode of operating Hadoop — on-premises or in the cloud — requires you to deploy a cluster, and then proceed to fill up said cluster with jobs, be they MapReduce jobs, Hive queries, SparkSQL, etc. 
Pretty straightforward stuff.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/1*lAnz_EsKDRwbpOcj3AjaRA.png" /><figcaption>The standard way of running Hadoop and Spark.</figcaption></figure><p>Services like Amazon EMR go a step further and let you run ephemeral clusters, enabled by separation of storage and compute through <a href="http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html">EMRFS</a> and S3. This means that you can discard your cluster while keeping state on S3 after the workload is completed.</p><p>Google Cloud Platform has two critical differentiating characteristics:</p><ul><li>Per-minute billing (Azure has this as well)</li><li>Very fast VM boot up times</li></ul><p>When your clusters start in well under 90 seconds (under 60 seconds is not unusual), and when you do not have to worry about wasting that hard-earned cash on your cloud provider’s pricing inefficiencies, you can flip this cluster-&gt;jobs equation on its head.<strong> You start with a job, and you acquire a cluster as a step in job execution.</strong></p><p>If you have a MapReduce job, as long as you’re okay with paying the 60-second initial boot-up tax, rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on-demand. 
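</p><p>The billing-granularity point can be sketched in a few lines. The per-minute rate below is an illustrative assumption, but the rounding mechanics (Dataproc bills per minute, with the 10-minute minimum discussed later in this post, versus rounding up to whole hours) follow what’s described here:</p>

```python
import math

# ILLUSTRATIVE rate, not a real price: $/cluster-minute for some fixed shape.
RATE_PER_MINUTE = 0.01

def per_minute_bill(job_minutes, minimum_minutes=10):
    """Dataproc-style billing: per-minute, with a 10-minute minimum."""
    return max(math.ceil(job_minutes), minimum_minutes) * RATE_PER_MINUTE

def per_hour_bill(job_minutes):
    """Hourly billing (EMR-style at the time): round up to the next full hour."""
    return math.ceil(job_minutes / 60) * 60 * RATE_PER_MINUTE

for minutes in (5, 35, 90):
    print(f"{minutes:3d}-minute job: per-minute ${per_minute_bill(minutes):.2f},"
          f" per-hour ${per_hour_bill(minutes):.2f}")
```

<p>A 90-minute job bills for 90 minutes instead of two full hours; only jobs shorter than 10 minutes pay any rounding tax at all.</p><p>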
A cluster is now a means to an end for job execution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wL9UfMcHQS_DEIvDslxorw.png" /><figcaption>Demonstration of my exquisite art skills, plus illustration of the jobs-before-clusters concept realized with Dataproc.</figcaption></figure><p>Again, this is possible only with Google Dataproc, because of:</p><ul><li>high granularity of billing (per-minute)</li><li>very low tax on initial boot-up times</li><li>separation of storage and compute (and ditching HDFS as the primary store).</li></ul><p>Operational and economic benefits are obvious and easily realized:</p><ul><li>Resource isolation through tenant segregation avoids non-obvious bottlenecks and resource contention between jobs.</li><li>Simplicity of management — no need to actually manage the cluster or resource allocation and priorities through things like the YARN resource manager. Your dev/stage/prod workloads are now intrinsically separate — and what a pain that is to resolve and manage elsewhere!</li><li>Simplicity of pricing — no need to worry about rounding up to the nearest hour.</li><li>Simplicity of cluster sizing — to get the job done faster, simply ask Dataproc to deploy more resources for the job. When you pay per-minute, you can start thinking in terms of VM-minutes.</li><li>Simplicity of troubleshooting — resources are isolated, so you can’t blame your problems on other tenants.</li></ul><p>I’m sure I’m forgetting others. Feel free to leave a comment here to add color. Best response gets a collectors’ edition Google Cloud Android figurine!</p><p>Dataproc is as close as you can get to serverless, cloud-native pay-per-job with VM-based architectures — across the entire cloud space. There’s nothing even close to it in that regard.</p><p>Dataproc does have a 10-minute minimum for pricing. Add the sub-90-second cluster creation time, and you rule out many relatively lightweight ad-hoc workloads. 
In other words, this works for big, serious batch jobs, not ad-hoc SQL queries that you want to run in under 10 seconds. I write on this topic <a href="https://cloud.google.com/blog/big-data/2016/02/understanding-bigquerys-rapid-scaling-and-simple-pricing">here</a>. (Do let us know if you have a compelling use case that leaves you asking for less than a 10-minute minimum.)</p><p><strong>The rest of the Dataproc goodies</strong></p><p>Google Cloud doesn’t stop there. There are a few other benefits of Dataproc that truly make your life easier and your pockets fuller:</p><ul><li><a href="https://cloud.google.com/custom-machine-types/"><strong>Custom VMs</strong></a> — if you know the typical resource utilization profile of your job in terms of CPU/RAM, you can tailor-make your own instances with that CPU/RAM profile. This is really, really cool, you guys.</li><li><strong>Preemptible VMs</strong> — I <a href="https://medium.com/@thetinot/google-clouds-spot-instances-win-big-and-you-should-too-5b244ca3facf#.4yknmlbzx">wrote</a> on this topic recently. Google’s alternative to Spot instances is just great. Flat 80% off, and Dataproc is smart enough to repair your jobs in case instances go away. I <a href="https://medium.com/@thetinot/google-clouds-spot-instances-win-big-and-you-should-too-5b244ca3facf#.1i91128u4">beat this topic to death in the blog post</a>, and in my biased opinion it’s worth a read on its own.</li><li><strong>Best pricing in town.</strong> Google Compute Engine is <a href="http://fortune.com/2016/01/08/google-amazon-cloud-price-war/">the industry price leader</a> for comparably-sized VMs. 
In some cases, up to 40% less than EC2.</li><li><strong>Gobs of ephemeral capacity</strong> — Yes, you can run your Spark jobs on thousands of Preemptible VMs, and we won’t make you sign a big commitment, as <a href="https://news.ycombinator.com/item?id=13259575">this</a> gentleman found out (TL;DR: running 25,000 Preemptible VMs).</li><li><strong>GCS is fast fast fast</strong> — When ditching HDFS in favor of object stores, what matters is the overall pipe between storage and instances. Mr. Bjornson details the performance characteristics of GCS and comparable offerings <a href="http://blog.zachbjornson.com/2015/12/29/cloud-storage-performance.html">here</a>.</li></ul><p><strong>Dataproc for stateful clusters</strong></p><p>Now if you are running a stateful cluster with, say, Impala and HBase on HDFS, Dataproc is a nice offering here too, if for some reason you don’t want to run Bigtable + BigQuery.</p><p>If you are after the biggest, baddest disk performance on the market, why not go with something that resembles RAM more than SSD in terms of performance — Google’s Local SSD? Mr. Dinesh does a great job comparing Amazon’s and Google’s offerings <a href="https://medium.com/google-cloud/new-google-cloud-ssds-have-amazing-price-to-performance-2a58e7d9b433#.xzje06qss">here</a>. Cliff notes — Local SSD is really, really, really good — really.</p><p>Finally, Google’s Sustained Use Discounts automatically reward folks who run their VMs for longer periods of time, up to 30% off. No contracts and no commitments. And, thank goodness, no managing your Reserved Instance bills.</p><p>You win when you use Google’s VMs for short bursts, and you win when you use them for longer periods.</p><p><strong>Economics of Dataproc</strong></p><p>We discussed how Google’s VMs are typically much cheaper through Preemptible VMs, Custom VMs, Sustained Use Discounts, and even lower list pricing. 
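</p><p>For the curious, Sustained Use Discounts accrue incrementally: each successive quarter of the month you keep a VM running is billed at a steeper discount, which is where the "up to 30% off" for a full month comes from. A minimal sketch, using the quartile multipliers from Google's then-published schedule and an assumed $1/hour list price:</p>

```python
# Sustained Use Discount sketch. The quartile multipliers reflect the
# 2017-era published schedule; the base rate is an ILLUSTRATIVE assumption.
BASE_RATE = 1.00          # assumed $/hour list price
HOURS_IN_MONTH = 730
QUARTILE_MULTIPLIERS = [1.0, 0.8, 0.6, 0.4]  # 0-25%, 25-50%, 50-75%, 75-100%

def sustained_use_cost(fraction_of_month):
    """Cost of running one VM continuously for a fraction of the month."""
    cost = 0.0
    for i, mult in enumerate(QUARTILE_MULTIPLIERS):
        lo, hi = i * 0.25, (i + 1) * 0.25
        used = min(max(fraction_of_month - lo, 0.0), hi - lo)
        cost += used * HOURS_IN_MONTH * BASE_RATE * mult
    return cost

full_month = sustained_use_cost(1.0)  # averages to 70% of list: "up to 30% off"
half_month = sustained_use_cost(0.5)  # averages to 90% of list
```

<p>No contract is involved; the discount simply shows up on the bill as usage accumulates.</p><p>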
<a href="https://thehftguy.com/2016/11/18/google-cloud-is-50-cheaper-than-aws/">Some folks find Google to be 50% cheaper</a>!</p><p>Two things that studying Economics taught me (put down your pitchforks, I also did Math) — the difference between soft and hard sciences, and the ability to tell a story with two-dimensional charts.</p><p>Let’s assume a worst-case scenario, in which EMR and Dataproc VM prices are equal. We get this chart, which hopefully requires no explanation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*95RNaWBdG6uj1X8DzCAp_Q.png" /><figcaption>Which line would you rather be on?</figcaption></figure><p>If you believe our good friend thehftguy’s <a href="https://thehftguy.com/2016/11/18/google-cloud-is-50-cheaper-than-aws/">claims</a> that Google is 50% cheaper (after things like Preemptible VMs, Custom VMs, Sustained Use Discounts, etc.), you get this compelling chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HPhL1tMd4AGUUrpzpionfQ.png" /><figcaption>Same chart, but with some more aggressive assumptions.</figcaption></figure><p>When you’re dishing out your precious shekels to your cloud provider, think of all that extra blue area you’re volunteering to pay for — it’s entirely avoidable. This is why many of Dataproc’s customers don’t mind paying egress from their non-Google cloud vendors to GCS!</p><p><strong>Summary</strong></p><p>Google Cloud has the advantage of the second mover. Things are simpler, cheaper, and faster. Lower-level services like instances (GCE) and storage (GCS) are more powerful and easier to use. 
This, in turn, lets higher-level services like Dataproc be more effective:</p><ul><li><strong>Cheaper</strong> — per-minute billing, Custom VMs, Preemptible VMs, Sustained Use Discounts, and cheaper VM list prices.</li><li><strong>Faster</strong> — rapid cluster boot-up times, best-in-class object storage, best-in-class networking, and RAM-like performance characteristics of Local SSDs.</li><li><strong>Easier</strong> — lots of capacity, less fragmented instance type offerings, VPC-by-default, and images that closely follow Apache releases.</li></ul><p>Fundamentally, Dataproc lets you think in terms of jobs, not clusters. You start with a job, and you get a cluster as just another step in job execution. This is a very different mode of thinking, and we feel that you’ll find it compelling.</p><p>You don’t have to take my word for it — the good folks at O’Reilly had <a href="https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp">this</a> to say about Dataproc and EMR.</p><p>Find me on Twitter at @thetinot. Happy to chat further!</p><figure><a href="http://bit.ly/HackernoonFB"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0hqOaABQ7XGPT-OYNgiUBg.png" /></a></figure><figure><a href="https://goo.gl/k7XYbx"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Vgw1jkA6hgnvwzTsfMlnpg.png" /></a></figure><figure><a href="https://goo.gl/4ofytp"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gKBpq1ruUi0FVK2UM_I4tQ.png" /></a></figure><blockquote><a href="http://bit.ly/Hackernoon">Hacker Noon</a> is how hackers start their afternoons. We’re a part of the <a href="http://bit.ly/atAMIatAMI">@AMI</a> family. 
We are now <a href="http://bit.ly/hackernoonsubmission">accepting submissions</a> and happy to <a href="mailto:partners@amipublications.com">discuss advertising &amp; sponsorship</a> opportunities.</blockquote><blockquote>If you enjoyed this story, we recommend reading our <a href="http://bit.ly/hackernoonlatestt">latest tech stories</a> and <a href="https://hackernoon.com/trending">trending tech stories</a>. Until next time, don’t take the realities of the world for granted!</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*35tCjoPcvq6LbB3I6Wegqw.jpeg" /></figure><hr><p><a href="https://hackernoon.com/why-dataproc-googles-managed-hadoop-and-spark-offering-is-a-game-changer-9f0ed183fda3">Why Dataproc — Google’s managed Hadoop and Spark offering is a game changer.</a> was originally published in <a href="https://hackernoon.com">Hacker Noon</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>