In a world of big data, a reliable streaming platform is a must. That’s where Kafka comes in. You already have it installed and configured with Docker. If that’s not the case, read this article or watch this video before continuing.
Today you’ll learn all about Kafka Topics, console Producers, and Consumers. You’ll master the Kafka shell, and by the end of the article you’ll be ready for more advanced examples, such as working with Kafka in Python.
Today’s article covers the following:
In a world of big data, a reliable streaming platform is a must. That’s where Kafka comes in. And today, you’ll learn how to install it on your machine and create your first Kafka topic.
Today’s article covers the following topics:
You can install Kafka on any OS, like Windows, Mac, or Linux…
Everyone and their grandmother know what a CSV file is. But is it the optimal way to store data? Heck no. It’s probably the worst storage format if you don’t plan to view or edit data on the fly.
If you’re storing large volumes of data, opting for CSVs will cost you both time and money.
Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.
Let’s set up the environment before going over the file formats.
You’ll need a couple of libraries to…
Storing data in the cloud can cost you a pretty penny. Naturally, you’ll want to stay away from the most widely known data storage format — CSV — and pick something a little lighter. That is, if you don’t care about viewing and editing data files on the fly.
Today you’ll learn about one of the simplest ways to store almost anything in Python — Pickle. Pickling isn’t limited to datasets only, as you’ll see shortly, but every example in the article is based on datasets.
In Python, you can use the pickle module to serialize objects and save them…
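As a rough illustration of the idea, here is a minimal sketch of saving and restoring an object with the standard library's pickle module. The sample records, the file name, and the helper names save_pickle and load_pickle are hypothetical placeholders, not taken from the article.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical sample data standing in for a small dataset
records = [
    {"id": 1, "name": "Alice", "score": 91.5},
    {"id": 2, "name": "Bob", "score": 78.0},
]


def save_pickle(obj, path):
    """Serialize obj to a binary file at path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "records.pkl"
    save_pickle(records, target)
    restored = load_pickle(target)

print(restored == records)  # True: the round trip preserves the data
```

Pickle files must be opened in binary mode ("wb" and "rb"). Only unpickle files you trust, since loading a pickle can execute arbitrary code.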
CSV isn’t the only available data storage format. In fact, it’s likely the last one you should choose if you don’t plan to view and edit the data on the fly. Going with CSV would be a long and expensive mistake if you plan to dump large datasets and use automation for processing.
Picture this — you collect large volumes of data and store them in the cloud. You didn’t do much research on file formats, so you opt for CSVs. Your expenses are through the roof! A simple tweak can reduce them by half, if not more. …
File and folder management with Python’s os module is a nightmare. Yet, it’s an essential part of every data science workflow. Saving reports, reading configuration files, you name it: there’s no way around it.
Picture this: you spend weeks building an API around your model, and it works flawlessly, at least on your machine. Once deployed, it’s a whole different story. Your API fails in unexpected places or won’t even run, because the absolute paths you’ve hardcoded simply don’t exist on the server.
There’s a no-brainer solution. The pathlib library comes by default with Python 3.4 and above. It’s by far the…
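To make the hardcoded-path problem concrete, here is a minimal sketch that builds paths from a base directory with pathlib instead of typing out an absolute path. The reports folder, the file name, and the report_path helper are hypothetical, and a temporary directory stands in for the project root.

```python
import tempfile
from pathlib import Path


def report_path(base_dir, name):
    """Build the path to a report file under base_dir (hypothetical layout)."""
    return Path(base_dir) / "reports" / f"{name}.txt"


# A temporary directory stands in for the project root in this sketch
with tempfile.TemporaryDirectory() as tmp_dir:
    path = report_path(tmp_dir, "summary")

    # Create the parent folder if it is missing, then write and read the file
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("model accuracy: n/a\n", encoding="utf-8")
    contents = path.read_text(encoding="utf-8")
    exists_after_write = path.exists()

print(path.name)         # summary.txt
print(path.parent.name)  # reports
print(exists_after_write)
```

In a real project, one common choice for the base directory is Path(__file__).resolve().parent, so the same code resolves correctly on any machine it is deployed to.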
Deep learning has come a long way in recent years. Practitioners are now well beyond simple image classification tasks. It’s becoming easier to detect or even segment objects of interest, both in images and video. Yet some things in computer vision haven’t changed in years, and visualization is one of them.
Pharos is a free library for visualizing advanced computer vision datasets, such as those used for object detection. It builds a Flask web app around your dataset, making exploration effortless.
By using Pharos, you can easily explore how the dataset was labeled and decide if there…
CSV is not the only data storage format out there. In fact, it’s likely the last one you should consider. If you don’t plan to edit the saved data manually, you’re wasting both time and money by sticking to it.
Picture this — you collect large volumes of data and store them in the cloud. You didn’t do much research on file formats, so you opt for CSVs. Your expenses are through the roof! A simple tweak can reduce them by half, if not more. That tweak is — you’ve guessed it — choosing a different file format.
The world of data science IDEs can be overwhelming. You can go from plain text editors for ultimate simplicity to IDEs so feature-rich they will make your head spin. Analysis paralysis gets even worse if you’re willing to pay for a piece of coding software.
You want something simple, yet capable. You want something professional and feature-rich, yet not overwhelming. Does it ring a bell? It likely won’t be a one-time decision.
The short answer is that there’s no one-size-fits-all solution. It’s a personal preference. I have mine, but I’ll try to stay as unbiased as possible while comparing these four.
Data science is hard. You’ll have to learn a handful of libraries as a beginner, even to solve the most fundamental tasks. Adding insult to injury, the libraries change and get updated constantly, and there’s almost always a better tool for the job.
The problem of not knowing which tool to use is simple to understand — it results in failing completely or not doing a task optimally. What’s also dangerous is not knowing libraries well enough. You end up implementing algorithms from scratch, completely unaware there’s already a function for that. Both cost you time, nerves, and potentially money.