Let’s review them one by one, with real examples using the Python client library.
This is the principal API for core interaction. Through it you can work with core resources such as datasets, views, jobs, and routines. To date there are seven client libraries: C#, Go, Java, Node.js, PHP, Python, and Ruby.
For this example, I will use the Python client library for the BigQuery API on my personal computer. Note that you need to have Python already installed.
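To make this concrete, here is a minimal sketch using the official `google-cloud-bigquery` library. The table queried is a real BigQuery public dataset, but the helper function names are my own illustration, not part of the library. Running it against a real project requires the library installed (`pip install google-cloud-bigquery`) and `GOOGLE_APPLICATION_CREDENTIALS` pointing at a service-account key:

```python
# Minimal sketch: listing datasets and querying a public table with the
# BigQuery Python client. The helpers only depend on the client's
# interface, so they work with any `bigquery.Client` you pass in.

def list_dataset_ids(client):
    """Return the IDs of all datasets in the client's default project."""
    return [ds.dataset_id for ds in client.list_datasets()]

def top_texas_names(client, limit=5):
    """Return the most common names in Texas from a BigQuery public dataset."""
    query = f"""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'TX'
        GROUP BY name
        ORDER BY total DESC
        LIMIT {int(limit)}
    """
    return [(row.name, row.total) for row in client.query(query).result()]

# Usage (needs google-cloud-bigquery installed plus application credentials):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   print(list_dataset_ids(client))
#   print(top_texas_names(client))
```

Keeping the query logic in plain functions like this also makes it easy to unit-test them with a stub client before touching a real project.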
Many technologies exist for data enrichment, but few let you work with a simple language like SQL while supporting both batch and streaming processing. One of them is Dataflow on Google Cloud.
Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends, including Apache Flink, Apache Spark, Google Cloud Dataflow and Hazelcast Jet. [GitHub, Apache Beam]
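As a quick sketch of what that unified model looks like, here is a tiny hypothetical enrichment pipeline using the Beam Python SDK (the records and lookup table are made up for illustration). It assumes `apache-beam` is installed; the same code runs locally on the DirectRunner or on Dataflow by switching the runner option:

```python
def enrich(record, lookup):
    """Pure enrichment step: attach a country name from a lookup table."""
    out = dict(record)
    out["country_name"] = lookup.get(record["country_code"], "unknown")
    return out

def run():
    # Requires: pip install apache-beam
    import apache_beam as beam

    lookup = {"PE": "Peru", "US": "United States"}
    # DirectRunner by default; pass DataflowRunner options to run on GCP.
    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create([{"id": 1, "country_code": "PE"}])
         | "Enrich" >> beam.Map(enrich, lookup)
         | "Print" >> beam.Map(print))

# run()  # uncomment to execute locally with the DirectRunner
```

Keeping `enrich` as a pure function, separate from the pipeline wiring, is what makes the same logic usable in both batch and streaming pipelines.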
I’ve used BigQuery every day, querying tables, views, and materialized views over both small and large datasets. Along the way I’ve learned some things I wish I had known from the beginning. The goal of this article is to give you some tips and recommendations for optimizing your costs and performance.
BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility [Google Cloud doc].
GCP Project: Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for Google Cloud resources…
Learn how to start handling security, scalability, access, and documentation in a modern Data API.
Imagine you’ve developed a simple Python data API by following various tutorials, and now some questions come to mind.
This post aims to answer these questions. Let’s start with the architecture we will develop. …
In this article, I’ll show you a simple way to build, in minutes, a few Data APIs for serving data from a BigQuery dataset. These APIs will be deployed as Docker containers using a GCP serverless service called Cloud Run.
The idea behind this is to work with serverless components. First, let’s understand these services and their purpose in the architecture.
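To give a feel for how small such a Data API can be, here is a stdlib-only sketch of a read-only JSON endpoint like the ones deployed on Cloud Run (in practice you would likely use Flask or FastAPI, and the in-memory rows here are a stand-in for BigQuery results; all names are illustrative):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for rows that would come back from a BigQuery query.
ROWS = [{"country": "PE", "sales": 120}, {"country": "US", "sales": 340}]

def query_rows(country=None):
    """Filter the stand-in dataset the way a parameterized query would."""
    return [r for r in ROWS if country is None or r["country"] == country]

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(query_rows()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To run locally (Cloud Run expects the container to listen on port 8080):
#   HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Packaging a handler like this in a Dockerfile and pointing Cloud Run at the image is essentially all the deployment work the serverless approach requires.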
Imagine you’ve developed a transformation process in local Spark and want to schedule it; a simple cron job would be sufficient. Now suppose that after that process you need to start many others, such as a Python transformation or an HTTP request, and that this is your production environment, so you need to monitor each step.
Does that sound difficult? With only Spark and cron, yes; thankfully, we have Apache Airflow.
Airflow is a platform to programmatically author, schedule and monitor workflows [Airflow docs].
In our case, we need to make a workflow that runs a…
Imagine you want to start building data pipelines in Spark or implement a model with Spark ML. The first step, before anything else, is to deploy a Spark cluster; to make that easy, you can set up a Dataproc cluster in minutes. It’s a fully managed cloud service that includes Spark, Hadoop, and Hive. Now imagine doing this many times, reproducing it in other projects, or your organization wanting to make your Dataproc configurations a standard.
After following the 12 weeks of preparation recommended by Google, I passed the Associate Cloud Engineer exam. Here is what I learned that could help you.
This story began three months ago when, checking my LinkedIn feed as I do every day, I saw a post from Google Cloud about the Certification Challenge. When I first read it, I was considering getting a cloud specialization and wondering which of the three main competitors I should choose.
First, at that time the decision wasn’t technical, since I didn’t have deep experience in Azure, AWS, or GCP, just basic projects…
Making it easy to analyze billions of rows
In order to have a clear understanding of Apache Druid, I’m going to quote what the official documentation says:
Apache Druid (incubating) is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets. Druid is most often used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important.
Learn by doing and never fear failure.
This phrase could sum up my whole experience in the MIT Deep Technology Bootcamp, and in addition, “intense” describes each day in the classroom well. Although the intention of this article is not simply to describe that experience, I want to offer some thoughts while explaining definitions, topics, and trends. In the end, the idea of sharing this is to provide some inspiration by painting a big picture of deep technologies and where it’s possible to go further, especially if you are starting out in the Data Science or AI world.