Building machine learning models isn’t easy: heavy datasets, tricky data formats, and a ton of hyperparameters, models, and optimization algorithms. Not to mention generic programming adventures like debugging, handling exceptions, and logging.
This is especially true for R&D (Research and Development) work, where models, approaches, and sometimes even the data itself can change very quickly. You don’t want to invest a large amount of time and effort in building something that could soon become irrelevant. But you also don’t want to turn your project into a pile of Jupyter notebooks and ad hoc scripts, with a ton…
The exception mechanism is widely adopted in modern programming languages, and Python is not an exception. (Pun intended!) The topic may seem obvious (wrap your code blocks in try-except clauses, and that’s all), but there are some minor yet important details, and taking them into account should make your code a bit cleaner.
In this post, I go through some guidelines on how to structure error handling in Python, derived from my personal experience.
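As a small illustration of the kind of detail that matters, here is a minimal sketch of catching only the expected exception and re-raising a domain-specific one with the original cause attached. The `parse_port` helper and the `ConfigError` class are hypothetical names invented for this example; they are not taken from the post.

```python
class ConfigError(Exception):
    """Hypothetical domain-specific error used only in this sketch."""


def parse_port(raw):
    """Convert a raw string into a valid TCP port number."""
    try:
        port = int(raw)
    except ValueError as exc:
        # Catch only the error we expect and chain the original cause.
        raise ConfigError(f"port is not an integer: {raw!r}") from exc
    if not 0 < port < 65536:
        raise ConfigError(f"port out of range: {port}")
    return port


# Usage: the caller handles one well-defined error type.
for raw in ("8080", "http", "70000"):
    try:
        print(raw, "->", parse_port(raw))
    except ConfigError as err:
        print(raw, "-> invalid:", err)
```

Keeping the `try` body to a single statement makes it clear which call is expected to fail, and `raise ... from exc` preserves the original traceback for debugging.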
It is very tempting to write the following code, and I’ve seen it several times in projects where I was a collaborator.
When working on machine learning projects, you need to write lots of ad hoc code snippets: downloading and reading or unpacking the data, converting it into a convenient format, building training pipelines, and many more. You will probably reuse some of these scripts, but the rest you’ll throw away after a few iterations, or when switching to another task.
You need to parameterize your scripts somehow, i.e., set up input/output folders, tune the training parameters, or replace one architecture with another. For this purpose, you should write a proper argument parser that converts your CLI parameters into script variables or function parameters…
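A minimal sketch of such a parser, using the standard library’s `argparse`. The flag names (`--input-dir`, `--output-dir`, `--arch`, `--lr`, `--epochs`) and the architecture choices are assumptions made up for illustration, not the options used in the post.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a CLI parser for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative sketch).")
    parser.add_argument("--input-dir", type=Path, required=True,
                        help="folder with the training data")
    parser.add_argument("--output-dir", type=Path, default=Path("output"),
                        help="folder for trained artifacts")
    parser.add_argument("--arch", choices=["resnet", "vgg"], default="resnet",
                        help="model architecture (hypothetical choices)")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--epochs", type=int, default=10, help="number of epochs")
    return parser


# Passing an explicit list instead of reading sys.argv keeps the example testable.
args = build_parser().parse_args(["--input-dir", "data", "--lr", "0.01"])
print(args.input_dir, args.output_dir, args.arch, args.lr, args.epochs)
```

The parsed values arrive already converted (`Path`, `float`, `int`), so they can be passed straight into a training function as keyword arguments.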
Some time ago, I had a coding interview for a Data Scientist position at a start-up. I felt well-prepared and confident, having practiced lots of programming puzzles, coded various machine learning techniques from scratch, and with several years of programming experience under my belt. What could go wrong?
Unfortunately, I failed at something that has nothing to do with gradient descent methods or time complexity analysis. No, the failure was related to something very different and much more complicated. It was a Tic-Tac-Toe game!
I bet at this moment, some of you will just close this story…
When working on data analytics projects, I usually use Jupyter notebooks and the great pandas library to process and move my data around. It is a very straightforward process for moderate-sized datasets, which you can store as plain-text files without too much overhead.
However, when the number of observations in your dataset is high, saving the data and loading it back into memory becomes slower, and each kernel restart steals your time and forces you to wait until the data reloads. So eventually, CSV files and other plain-text formats lose their attractiveness.
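One reason plain text becomes costly is that every value has to be parsed back from a string on each load, while a binary format stores values with their types. The sketch below illustrates only that difference, using the standard library’s `csv` and `pickle` modules as stand-ins rather than pandas; it does not measure speed, and `pickle` is just one binary option, not necessarily the format the post settles on. The `rows` data is made up.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Made-up sample data for the illustration.
rows = [{"user_id": 1, "rating": 4.5}, {"user_id": 2, "rating": 3.0}]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "ratings.csv"
    pkl_path = Path(tmp) / "ratings.pkl"

    # Plain text: every value is written out as a string.
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "rating"])
        writer.writeheader()
        writer.writerows(rows)

    # Binary: Python objects are serialized together with their types.
    with pkl_path.open("wb") as f:
        pickle.dump(rows, f)

    with csv_path.open(newline="") as f:
        from_csv = list(csv.DictReader(f))
    with pkl_path.open("rb") as f:
        from_pickle = pickle.load(f)

print(from_csv[0])     # values come back as strings and need converting
print(from_pickle[0])  # values come back with their original types
```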
Several months ago, I started exploring PyTorch, a fantastic and easy-to-use deep learning framework. In the previous post, I described how to implement a simple recommendation system using the MovieLens dataset. This time I would like to focus on a topic essential to any machine learning pipeline: the training loop.
The PyTorch framework provides you with all the fundamental tools to build a machine learning model. It gives you CUDA-driven tensor computations, optimizers, neural network layers, and so on. However, to train a model, you need to assemble all these things into a data processing pipeline.
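To show the parts such a pipeline has to tie together, here is a framework-free stand-in written in plain Python. It is not PyTorch code and not the loop from the post; it only sketches the usual skeleton (epochs, shuffled mini-batches, forward pass, gradients, parameter update) on a made-up linear regression problem. The function names and hyperparameters are assumptions chosen for the example.

```python
import random


def batches(data, batch_size):
    """Yield consecutive mini-batches from a list of (x, y) pairs."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def train(data, epochs=300, batch_size=4, lr=0.05, seed=0):
    """Fit y = w * x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order every epoch
        for batch in batches(data, batch_size):
            # Forward pass and gradients of the batch's mean squared error.
            grad_w = grad_b = 0.0
            for x, y in batch:
                error = (w * x + b) - y
                grad_w += 2 * error * x / len(batch)
                grad_b += 2 * error / len(batch)
            # Parameter update (the optimizer's job in a framework).
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w, b = train(data)
print(round(w, 2), round(b, 2))
```

In a framework, the inner gradient computation would be replaced by automatic differentiation and the update step by an optimizer object, but the loop structure stays the same.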
Recently, I started watching the fast.ai lectures, a great online course on deep learning and its applications. In one of the lectures, the author discusses building a simple neural-network-based recommendation system and applies it to the MovieLens dataset. While the lecture is an excellent source of information on this topic, it mostly relies on the library developed by the authors to run the training process. The library is quite flexible and provides several levels of abstraction.
However, I really wanted to learn more about the PyTorch framework that sits under the hood of the authors’ code. In this…
The “classical” machine learning algorithms usually expect the training dataset in the form of two arrays: a matrix with samples and an array with targets. But how do you deal with datasets where observations don’t have a fixed length?
Consider the following case. You have a dataset of files where each file contains a single observation, and you don’t know the length of each file in advance. It is not possible to “flatten” each file into a vector of numbers and feed these vectors into a training algorithm because their lengths don’t match. …
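One common workaround for mismatched lengths, shown here only as an illustration and not necessarily the approach the post goes on to describe, is to right-pad every observation to the length of the longest one and keep a mask of which values are real. The `pad_sequences` helper and the sample data are hypothetical.

```python
def pad_sequences(sequences, pad_value=0.0):
    """Right-pad variable-length sequences to the length of the longest one.

    Returns the padded rows plus a 0/1 mask marking the real (non-padded) values.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded, mask = [], []
    for seq in sequences:
        n_missing = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_missing)
        mask.append([1] * len(seq) + [0] * n_missing)
    return padded, mask


# Three made-up observations of different lengths.
observations = [[0.1, 0.2, 0.3], [0.5], [0.7, 0.9]]
padded, mask = pad_sequences(observations)
for row, row_mask in zip(padded, mask):
    print(row, row_mask)
```

After padding, all rows have equal length and can be stacked into a single matrix; the mask lets a model or a loss function ignore the filler values.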
Modern deep learning architectures show quite good results in various fields of artificial intelligence. One of them is image classification. In this post, I am going to see if one can achieve accurate image classification by applying out-of-the-box ImageNet pre-trained deep models from the Keras Python package.
Note: The full notebook with the dataset analysis and model training scripts can be found here.
The analyzed dataset comes from the Dog Breed Identification competition hosted on Kaggle. It contains approximately 10,000 labeled samples belonging to 120 classes, composed of pictures from the ImageNet dataset, and the same amount of test data.
Software Developer & AI Enthusiast. Working with Machine Learning, Data Science, and Data Analytics. Writing posts every once in a while.