Amazon’s most profitable division — AWS
Headquarters Seattle, Washington
Industry Internet & Digital Media
Valuation $469 billion as of May 2017
Last year, Amazon was ranked the smartest company on the planet by MIT Technology Review.
Call out a request and AI-powered Alexa will play your favorite song or order you a pizza. And Amazon Web Services just keeps growing.
Amazon Web Services
The company’s cloud-computing operation also deserves notice as the market leader, and as Amazon’s fastest-growing and most profitable division. The company launched its first cloud offering, Simple Storage Service (S3), in March 2006, and added Elastic Compute Cloud (EC2) later the same year.
If you are a Dropbox user, have you ever wondered where your files reside? Does Dropbox run on a public or a private cloud? Here’s a hint: it started on AWS. What happened next is an interesting story of how Dropbox evolved from a simple file-sharing service built on AWS S3.
Last year, Netflix completed its 8-year cloud migration by scrapping the last bits of its data center. The Netflix Technology Blog showcases the technological innovations they built on the cloud platform to be completely fault tolerant (a special shout-out to their Simian Army). Recently, they migrated their billing system to AWS; its transactional data, previously hosted on an on-premise Oracle database, required ACID compliance, so an RDBMS was the apt choice. See their database migration blog post for the details.
Business Intelligence & Big Data
The value of data increases exponentially when it is quickly analyzed to derive meaningful insights. The four phases of a typical BI & Big Data infrastructure are listed below, with the AWS services for each explained succinctly.
1. Data Orchestration
With the massive proliferation of data, automating workflows can help ensure that necessary activities take place when required and where required to drive the analytic processes.
AWS provides three services which can be used to build analytic solutions that are automated, repeatable, scalable, and reliable. They are Amazon Simple Workflow Service (Amazon SWF), AWS Data Pipeline, and AWS Lambda. All three are designed for highly reliable execution of tasks, which can be event-driven, on-demand, or scheduled.
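As a minimal sketch of the event-driven case (the bucket layout and handler body are hypothetical, not taken from any real pipeline), an orchestration step can be written as an AWS Lambda handler that fires when a new object lands in S3 and hands the object reference to the next stage:

```python
import json
import urllib.parse

def handler(event, context):
    """Hypothetical Lambda entry point: triggered by an S3 PutObject
    event, it extracts bucket/key pairs so a downstream analytic task
    (an EMR job, an SWF activity, ...) could be kicked off for each."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded, so decode before use.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append({"bucket": bucket, "key": key})
    return {"statusCode": 200, "body": json.dumps(results)}
```

Scheduled and on-demand triggers follow the same shape; only the `event` payload differs.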
2. Data Collection
Siphoning or migrating data from multiple sources is one of the prime concerns in a data warehousing environment. The need for reliable ways to move ever-larger volumes of data at the highest velocity possible keeps growing.
AWS provides various tools to collect data and store it in its own native storage services.
- Kinesis Firehose — a reliable option to gather streaming data
- Direct Connect — provides a dedicated connection between your on-premise environment and AWS’ data centers, bypassing the internet for faster data transfer
- Storage Gateway — a gateway that links your environment with AWS, letting you build a hybrid cloud architecture with on-premises storage and a seamless, elastic cloud back end
- Database Migration Service — to move your existing databases onto Redshift and other database offerings
- Snowball — a petabyte-scale data transfer appliance
- Snowmobile — moving a data center has never been so easy!
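Firehose in particular works by buffering incoming records and flushing them to the destination (S3, Redshift, etc.) when either a size or a time threshold is reached. A toy illustration of that buffering contract — the class, thresholds, and sink are my own stand-ins, not an AWS API:

```python
import time

class TinyBuffer:
    """Toy stand-in for Firehose-style delivery buffering: flush the
    batch when it reaches max_records, or when the oldest record has
    waited max_age_s seconds, whichever comes first."""
    def __init__(self, sink, max_records=500, max_age_s=60.0):
        self.sink = sink                  # callable receiving a list of records
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.batch, self.first_at = [], None

    def put(self, record, now=None):
        now = time.monotonic() if now is None else now
        if not self.batch:
            self.first_at = now           # start the age clock on first record
        self.batch.append(record)
        if len(self.batch) >= self.max_records or now - self.first_at >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.batch:
            self.sink(self.batch)
            self.batch, self.first_at = [], None
```

The real service layers on compression, retries, and format conversion, but the size-or-time flush rule is the core idea.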
3. Data Preparation
Data processing to derive insights can be broadly categorized into batch and stream processing.
- Batch processing involves running EMR (Elastic MapReduce) jobs on data with Hadoop-ecosystem tools like Spark, Hive, Pig, and other data processing frameworks, normalizing the data so queries can run across varying datasets for analytic workloads. For relational storage, AWS offers RDS (Relational Database Service), a managed service for SQL Server, MySQL, MariaDB, Oracle, and PostgreSQL, along with AWS’ own MySQL-compatible Aurora, which is low-cost and highly reliable. Amazon DynamoDB is a managed NoSQL database service with push-button scaling that can instantly increase capacity.
- Stream processing can be easily set up on AWS using Kinesis or Kinesis Firehose. Data generated from various sources arrives in small packets over time and can be processed sequentially and incrementally, record by record, or over sliding time windows. The results can be stored on Amazon S3 or, quite often, Amazon Redshift; once on Redshift, traditional BI tools can analyze the data while leveraging MPP (massively parallel processing).
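The windowed aggregation mentioned above can be sketched in plain Python; here a tumbling one-minute window counts events per key as a Kinesis consumer might do incrementally (the window size and record shape are illustrative):

```python
from collections import defaultdict

def tumbling_counts(records, window_s=60):
    """Group timestamped records into fixed (tumbling) windows and count
    occurrences of each key — the incremental, record-by-record style of
    aggregation a stream consumer performs."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in records:                        # each record: (timestamp, key)
        bucket = int(ts // window_s) * window_s    # start time of its window
        windows[bucket][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}
```

A true sliding window would let adjacent windows overlap, but the bookkeeping is the same: bucket each record by time, aggregate per bucket.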
With the process orchestrated correctly, live streaming data can even be analyzed together with semi-structured or structured data from databases that are updated at low latency.
AWS also supports third-party ETL tools such as Attunity CloudBeam, Informatica, and Matillion for data collection, processing, and storage. AWS Glue, a fully managed ETL service that is yet to go live, simplifies the process into three steps: build a data catalog; generate and edit transformations (Python-based); and, lastly, schedule and run jobs.
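Those three ETL steps can be mimicked in miniature with plain Python. The function names and record shape below are my own, not the Glue API — just a sketch of the catalog/extract, transform, and load stages:

```python
import csv
import io
import json

def extract(raw_csv):
    """Extract: parse raw CSV into dict records (the 'catalog' knows the schema)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: normalize types and drop incomplete rows."""
    out = []
    for r in rows:
        if r.get("amount"):                        # skip rows missing a value
            out.append({"user": r["user"].strip().lower(),
                        "amount": float(r["amount"])})
    return out

def load(rows):
    """Load: serialize to newline-delimited JSON, as you might write to S3."""
    return "\n".join(json.dumps(r) for r in rows)
```

In Glue the transform step is similarly Python code you can generate and then edit, while the catalog and job scheduler replace the manual wiring.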
4. Data Visualization
Now that the data has been normalized and aggregated through the ETL process into a final repository, it’s time to represent it in pictorial and graphical reports that the human mind can better comprehend.
- Tableau server works best with AWS Redshift to build interactive dashboards
- SAP BusinessObjects is a full suite of BI on AWS
- TIBCO Jaspersoft embedded BI works natively on AWS RDS & Redshift
- Looker Analytics platform works brilliantly by querying the Redshift natively without the need for an intermediate server
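Under the hood, the dashboards these tools build boil down to aggregate SQL pushed down to Redshift. A sketch of the kind of query involved, run here against the stdlib `sqlite3` module as a stand-in for a Redshift connection (the table and column names are made up):

```python
import sqlite3

# In production this connection would point at Redshift (e.g. via a
# PostgreSQL-compatible driver); sqlite3 stands in so the query shape
# is runnable anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# A typical dashboard aggregate: revenue per region, largest first.
rows = conn.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall()
```

On Redshift, the same `GROUP BY` fans out across the cluster’s slices, which is why these tools lean on its MPP architecture rather than pulling raw rows to an intermediate server.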
“How do I SPICE up my data?” you ask.
QuickSight — It’s a fast, easy-to-use, cloud-powered business intelligence tool. Log in, point to the data source, and visualize. Create dashboards in minutes. Simple.
QuickSight automatically discovers all customer data available in AWS data stores — S3, RDS, Redshift, EMR, DynamoDB, etc. — and then “infers data types and relationships, issues optimal queries to extract relevant data [and] aggregates the results,” according to Amazon. This data is then automatically formatted, “without complicated ETL,” and made accessible for analysis through a parallel, columnar, in-memory calculation engine called “SPICE”. SPICE serves up analysis “within seconds… without writing code or using SQL.” Data can also be consumed through a “SQL-like” API by partner tools running on Amazon, including Domo, Qlik, Tableau, and TIBCO Jaspersoft.

A tool called “Autograph” automatically suggests best-fit visualisations, with options including pie charts, histograms, scatter plots, bar graphs, line charts, and storyboards. You can also build live dashboards that change in real time along with the data. QuickSight will also be supported with iOS and Android mobile apps.

Sharing options will include “single-click” capture and bundling of data and visualisations (not just screenshots) so collaborators can interact with, drill down on, and change visualisations. These can be shared on intranets, embedded in applications, or embedded on public-facing websites.
— AWS re:Invent 2015
With data of ever-greater “Volume, Velocity & Variety” being generated, the ability to process it all is constrained by the infrastructure’s capacity to set up environments; to scale, deploy, and run existing solutions; and to stay agile enough to try new things. The power of AWS lies in building production-grade systems within a few days, scaling them in minutes, and terminating them when no longer required. The compute capacity on offer is practically infinite!