Query Dataset Using DuckDB

Working with Dataset — Part 7: Query Dataset Using DuckDB

Published in

Geek Culture

6 min read · Dec 5, 2022

It feels like everyone has been talking about big data in the cloud for years, myself included when I speak to recruiters and hiring managers about my daily work. Big data is hard, and the difficulties can start with simply loading data into your data lake. For instance, inferring a schema for supposedly periodic data from a single source system is difficult when the schema seems to change randomly from day to day.

Even after successfully loading data into your data lake, you must contend with the idiosyncrasies of big data processing technologies like Spark to process and analyze large datasets in parallel. Despite these efforts, you may still run into problems with the veracity of the data, such as noise, inconsistencies, and incompleteness, which make it hard to trust the accuracy and reliability of the results.

As I work with big data, I often yearn for the days when I worked with more manageable datasets to solve business problems, instead of navigating a sea of conflicting data. Going back to that is becoming harder as organizations transition to cloud-only analytics platforms like Databricks, where even basic data processing tasks using SQL require spinning up cluster VMs and consuming DBUs (Databricks Units).

There is an alternative. Did you know that your laptop is a powerful machine that can handle most of your data analytics tasks?

Written by Sung Kim