We often think that we need data points to do Data Storytelling. In some cases, we have to present facts to an audience, and if we can't build charts out of that information, we usually resort to the same old bullet-point slides. We might try improving them by adding images and icons to illustrate the points on each slide. But we often forget to tell the story behind that list of facts. In this post I want to:
Okay, let's start with the truth about my background: I'm not a designer, a data journalist, or a data scientist. I'm a data engineer, and my daily activities rarely include plotting charts or conducting any kind of analysis.
But even without much experience, I managed to win data visualisation competitions promoted by Kaggle three years in a row: I finished third in 2018 and 2019 and took first prize in 2020, out of more than 300 competitors. …
We used to receive files from the software that controls Gousto's factory once a day via an SFTP server: several CSV files containing atomic data for each box that went through the production line on the previous day, such as the timestamps of when a box enters and exits the line. Gousto's operations teams used this data to measure our Supply Chain performance and detect issues on production lines.
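To make the idea concrete, here is a minimal sketch of the kind of per-box metric those CSVs enable. The column names (`box_id`, `line_entry_ts`, `line_exit_ts`) are assumptions for illustration, not the real schema:

```python
import csv
import io
from datetime import datetime

def line_durations(csv_text):
    """Compute how long each box spent on the production line.

    Expects rows with hypothetical columns box_id, line_entry_ts and
    line_exit_ts (ISO 8601 timestamps). Returns {box_id: seconds}.
    """
    durations = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = datetime.fromisoformat(row["line_entry_ts"])
        end = datetime.fromisoformat(row["line_exit_ts"])
        durations[row["box_id"]] = (end - start).total_seconds()
    return durations

sample = (
    "box_id,line_entry_ts,line_exit_ts\n"
    "B001,2021-05-01T08:00:00,2021-05-01T08:04:30\n"
    "B002,2021-05-01T08:01:00,2021-05-01T08:07:00\n"
)
print(line_durations(sample))  # {'B001': 270.0, 'B002': 360.0}
```

From durations like these, the operations teams could aggregate throughput and spot boxes that spent unusually long on the line.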
We had an ingestion pipeline composed of a Lambda Function that moved files from the SFTP server to our Data Lake in S3, plus a job triggered by Airflow…
I created a repo to deploy Airflow on AWS following software engineering best practices. You can go straight there if you don't feel like reading this post, but I do describe some things here that you might find useful.
Run a docker-compose command and voilà, you have Airflow running in your local environment, ready for you to develop some DAGs. After some time, you have your DAGs (and Airflow) prepared for deployment to a production environment. Then you start searching for instructions on how to deploy Airflow on AWS. Here's what you'll probably find:
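For reference, the local setup I'm alluding to is roughly the quickstart from the official Airflow docs; the exact URL is version-pinned in the docs, so adjust it to the release you use:

```shell
# Fetch the official docker-compose file (pin the Airflow version you need)
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
mkdir -p ./dags ./logs ./plugins
echo "AIRFLOW_UID=$(id -u)" > .env

docker compose up airflow-init   # initialise the metadata database
docker compose up                # start the webserver, scheduler and workers
```

This is great for local development, but none of it answers the production question above.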
You want to implement a new streaming pipeline at your workplace and need to show your managers a proof of concept (POC). This POC should let you demonstrate some of the functionality, in this case, generating real-time metrics. However, there is a limitation: you can't use production data in a POC. How do you solve that?
If your answer was to generate fake events, then you are right: that is probably the best solution. So you create some random events and push them into the pipeline. You'll soon realize that random is not the same as fake. …
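Here is a minimal sketch of what "fake, not random" can mean. The event schema below (`event_id`, `user_id`, `action`, `timestamp`) is invented for illustration; the point is that fields follow a consistent schema, timestamps increase monotonically, and values are drawn from a plausible domain with realistic frequencies, so downstream metrics behave like they would on production data:

```python
import json
import random
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical domains for the POC; not taken from any real pipeline.
USERS = [f"user_{i}" for i in range(1, 6)]
ACTIONS = ["page_view", "add_to_cart", "checkout"]

def fake_event(base_time, offset_seconds):
    """Build one realistic-looking fake event."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": random.choice(USERS),
        # Weighted choice: page views are far more common than checkouts
        "action": random.choices(ACTIONS, weights=[8, 3, 1])[0],
        "timestamp": (base_time + timedelta(seconds=offset_seconds)).isoformat(),
    }

random.seed(42)  # reproducible stream for demos
base = datetime(2021, 1, 1, tzinfo=timezone.utc)
events = [fake_event(base, i * 5) for i in range(10)]
print(json.dumps(events[0], indent=2))
```

Purely random payloads would break schema expectations and produce nonsense metrics; events shaped like these keep the demo believable.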
First, you finish a few online courses. Then you study the Iris dataset and solve the Titanic challenge. Finally, you schedule some scripts with the Windows Task Scheduler. Anyone who works with data has somehow been through those situations. Once you get a job in data, you soon realize that you can't run a business-critical script on your laptop, the one that stays on 24/7 in the office. So we do some benchmarking, watch a talk or read a post, and hear something like:
We use Airflow to orchestrate ETL and machine learning at our company.
So we start studying Airflow to learn…
Finish a few online courses, study the Iris dataset, solve the Titanic challenge, schedule scripts with the Windows Task Scheduler. Anyone who works with data has somehow been through these situations. If you got a job in the field, you'll realize over time that you can't run a business-critical script on your own laptop, the one that stays on 24/7 in the office. So we do a benchmark, watch a talk, read a post, and hear things like:
At our company we use Airflow to orchestrate ETL and machine learning tasks.
So we, aspiring…
A few days ago I wrote about how I got a data engineer job in London and how I was "found" on LinkedIn through a search for Data Engineer Brazil. That was actually the second time I landed a job through LinkedIn. For my previous job, I was invited into the selection process after a search for Kaggle.
Having a complete LinkedIn profile was certainly one of the differentiators that helped me land those opportunities. So I decided to share some ideas about how I structured my LinkedIn profile and how I believe it helped me. …
A year ago I had just started my journey as a data engineer at EmCasa, a real-estate startup that buys and sells properties in Rio de Janeiro and São Paulo. That was actually my first experience as a data engineer, which I describe in a bit more detail in this post. I was hired in February and worked remotely from my 30 m² studio in Curitiba. I was very happy with the company, and looking for another job hadn't even crossed my mind.
Earlier this week I won the third prize (US$6,000) in Kaggle's Machine Learning and Data Science Survey Competition with my notebook Is there any job out there? Kaggle vs Glassdoor.
While the competition evaluated several aspects of my submission, I think the way I displayed information was central to winning the award. Here I want to share a little of the thought process behind building the data visualizations, with step-by-step instructions.
My idea was to measure how many people work in different positions (job titles), comparing across countries.
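The aggregation behind that idea can be sketched in a few lines. The rows below are toy stand-ins for the survey answers (the real Kaggle survey has dedicated columns for country and job title; these field names are my own):

```python
from collections import Counter, defaultdict

# Toy rows standing in for survey responses; values are illustrative.
responses = [
    {"country": "Brazil", "title": "Data Scientist"},
    {"country": "Brazil", "title": "Data Engineer"},
    {"country": "Brazil", "title": "Data Scientist"},
    {"country": "USA", "title": "Data Scientist"},
    {"country": "USA", "title": "Software Engineer"},
]

def titles_by_country(rows):
    """Count respondents per job title within each country."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["country"]][row["title"]] += 1
    return counts

counts = titles_by_country(responses)
print(counts["Brazil"].most_common())
# [('Data Scientist', 2), ('Data Engineer', 1)]
```

Once you have counts per country, the visualisation question becomes how to compare those distributions side by side.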