So, you are a data scientist: you work with data and need to explore it and run some analytics before jumping into extensive machine learning algorithms.
Let’s start by examining what Serverless is.
According to Wikipedia, serverless computing is a cloud computing execution model in which the cloud provider manages the servers and dynamically allocates the resources needed to complete the task.
That means that, as users, we are in charge of the logic only. We don’t need to take care of servers, capacity planning, or maintenance operations at scale. That doesn’t mean these things aren’t happening; they…
Today, I’m excited to share a project I’ve worked on for the last couple of months: an interview-based video series, “Tech Exceptions”, highlighting startups’ technical stories.
In the first episode, we interview Alex Kendall, CEO of Wayve, a startup that develops artificial intelligence for autonomous vehicles capable of driving in any urban environment, anywhere in the world. Alex shares his personal journey and what led him to start Wayve, how Wayve is building the largest autonomous driving academy in the world, their challenges with processing huge amounts of data, and how it all connects to video and VR games.
To become a data engineer, there are various skills and tools to learn, but the basics stay the same!
Today, there are multiple data engineering certificates and courses that you can take.
Here are the free ones; they cover the basics, which are the most important part to grasp, and later introduce specific Azure technologies. Once you have the fundamentals down, it will be easier to jump into any technology related to data engineering.
If there is a framework that super excites me, it’s Apache Spark.
If there is a conference that excites me, it’s the Spark & AI summit.
This year, with the COVID-19 pandemic, the North America edition of the Spark & AI Summit is online and free. No need to travel, buy an expensive flight ticket, or pay for accommodation and conference fees. It’s all free and online.
One caveat: it runs in Pacific Time (PDT)-friendly hours. I kind of wish the organizers had taken a more global approach.
Having said that, the agenda and content look promising!
The 2019 edition of ITNEXT Summit took place on Oct 30, 2019, in Amsterdam, NL.
The event drew 450+ passionate attendees who came to learn from great experts and get inspired.
There were 3 tracks:
Each track was led and curated by an MC:
First things first: what is TensorFrames?
TensorFrames is an open-source library created by Apache Spark contributors. Its functions and parameters are named the same as in the TensorFlow framework. Under the hood, it is an Apache Spark DSL (domain-specific language) wrapper for Apache Spark DataFrames. It allows us to manipulate DataFrames with TensorFlow functionality. And no, it is not a pandas DataFrame; it is based on the Apache Spark DataFrame.
..but wait, what is TensorFlow (TF)?
TensorFlow is an open-source software library for dataflow and differentiable programming across a range of tasks. …
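To make the "dataflow programming" idea concrete without pulling in TensorFlow itself, here is a toy sketch in plain Python: the computation is first described as a graph of operation nodes, and only evaluated afterwards. This is an illustration of the concept, not the TensorFlow API; all names here are made up.

```python
# Toy illustration of dataflow programming: build a graph of operations
# first, evaluate it later. Plain Python, not the TensorFlow API.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable computing this node's value
        self.inputs = inputs  # upstream nodes this one depends on

    def eval(self):
        # Evaluate dependencies first, then apply this node's operation.
        return self.op(*(n.eval() for n in self.inputs))

def constant(value):
    return Node(lambda: value)

def add(a, b):
    return Node(lambda x, y: x + y, a, b)

def mul(a, b):
    return Node(lambda x, y: x * y, a, b)

# Describe (x + y) * 2 as a graph, then run it.
x, y, two = constant(3), constant(4), constant(2)
graph = mul(add(x, y), two)
print(graph.eval())  # 14
```

The point of deferring evaluation is that a framework can inspect and optimize the whole graph, and distribute its execution, before any value is computed.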
This is a step-by-step tutorial on how to get the new Spark TensorFrames library running on Azure Databricks.
Big Data is a huge topic that spans many domains and specialties, all the way from DevOps and Data Engineers to Data Scientists, AI, Machine Learning, algorithm developers, and many more. We all struggle with massive amounts of data, and handling it takes the best minds and tools. This is where the magical combination of Apache Spark and TensorFlow comes into play, and we call it TensorFrames.
Apache Spark took over the Big Data…
Apache Kafka’s real-world adoption is exploding, and it claims to dominate the world of stream data. It has a huge developer community all over the world that keeps on growing. But it can be painful too. So, just before jumping in head first and fully integrating with Apache Kafka, let’s test the waters and plan ahead for a painless integration.
Apache Kafka is an open-source framework for asynchronous messaging and a distributed streaming platform. It is TCP-based. Messages are persisted in topics. Message producers are called publishers, and message consumers are called subscribers.
Consumers can subscribe to one…
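To give a feel for what a publisher's side of this looks like, here is a minimal, hedged producer configuration sketch. The property names are standard Kafka producer configs; the broker addresses and values are placeholders, not a recommendation for any particular deployment.

```properties
# Hypothetical minimal Kafka producer configuration (placeholder values).
# Brokers the producer first contacts to discover the cluster.
bootstrap.servers=broker1:9092,broker2:9092
# How message keys and values are turned into bytes on the wire.
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# How many replicas must acknowledge a write before it counts as successful.
acks=all
# Retry transient failures instead of surfacing them to the application.
retries=3
```

Planning these settings up front (especially acknowledgment and retry behavior) is exactly the kind of "check before you integrate" step the paragraph above argues for.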
Apache Spark is quickly gaining steam both in the headlines and in real-world adoption. Top use cases are streaming data, machine learning, interactive analysis, and more. Many well-known companies use it, such as Uber and Pinterest. So, after working with Spark for more than 3 years in production, I’m happy to share my tips and tricks for better performance.
Let’s start :)
UDF (user-defined function):
Column-based functions that extend the vocabulary of Spark SQL’s DSL.
From the Apache Spark docs:

“Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF…
Even if you spend a lot of time working with data at scale, you might not be aware of Predicate Pushdown and its importance when building products. Wondering what it is good for, and why? Read this.
Predicate Pushdown gets its name from the fact that portions of SQL statements, ones that filter data, are referred to as predicates. They earn that name because predicates in mathematical logic and clauses in SQL are the same kind of thing — statements that, upon evaluation, can be TRUE or FALSE for different values of variables or data.
It can improve query performance…
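The performance benefit is easiest to see with a toy sketch, independent of any particular engine: a "data source" that can apply a predicate while scanning ships far fewer rows to the query engine than one that ships everything and lets the engine filter afterwards. Everything below is made up for illustration.

```python
# Toy illustration of predicate pushdown: filtering at the source vs.
# filtering after the scan. "Cost" here is rows shipped from storage.

rows = [{"id": i, "country": "NL" if i % 2 else "US"} for i in range(10)]

def scan(predicate=None):
    """Toy table scan; returns rows plus how many left the source."""
    out = [r for r in rows if predicate is None or predicate(r)]
    return out, len(out)

# Without pushdown: the source ships everything; the engine filters later.
shipped_all, cost_no_pushdown = scan()
filtered_late = [r for r in shipped_all if r["country"] == "NL"]

# With pushdown: the predicate runs inside the scan; fewer rows shipped.
filtered_early, cost_pushdown = scan(lambda r: r["country"] == "NL")

print(cost_no_pushdown, cost_pushdown)  # 10 5
```

Real engines do the same thing with WHERE clauses pushed into columnar file formats or remote databases: identical results, less data read and moved.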
👩💻 Software Developer 📚 Blogger 🗣️ Speaker 💫 1 of 25 influential women in Software Development