Time series modeling is the statistical study of sequential data (finite or infinite) that depends on time. Although we say 'time', it may be only a logical identifier; a time series need not carry any physical time information at all. In this article, we will discuss how to model a stock price change forecasting problem as a time series, and cover some of the underlying concepts at a high level.
We will use the Dow Jones Index dataset from the UCI Machine Learning Repository. It contains stock price information over two quarters. Let's explore the dataset first:
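A minimal sketch of that first look with pandas, assuming the UCI archive has been downloaded and extracted locally as dow_jones_index.data (the file name follows the usual UCI zip; adjust the path if yours differs):

```python
import pandas as pd

# Assumed local file name from the UCI archive; adjust if needed.
df = pd.read_csv("dow_jones_index.data")

print(df.shape)    # number of weekly records and columns
print(df.head())   # first few rows to inspect the columns
print(df.dtypes)   # data types -- price columns may load as strings if they carry a '$' prefix
```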
We often see many techniques discussed here and there about solving problems with ML. But when it comes to putting them into production, we don't see much traction, and people still have to rely on public cloud providers or open source for that. In this article, we will discuss ML models to be used in production and the system architectures for supporting them. We will see how we can do that without any public cloud provider.
Almost all ML models are either mathematical expressions/equations or data structures (trees or graphs). Mathematical expressions have coefficients, variables, constants, and parameters of probability distributions (distribution-specific parameters such as means or standard deviations). …
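To illustrate that idea (this is not taken from the article itself, just a minimal scikit-learn sketch): a fitted linear model really is just a small set of coefficients plus an intercept, and persisting it for production amounts to serializing those numbers.

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Toy data: y = 3*x1 + 5*x2 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# The "model" is essentially these numbers -- a mathematical expression.
print(model.coef_, model.intercept_)

# Persisting the model for production means serializing that expression.
joblib.dump(model, "linear_model.joblib")
restored = joblib.load("linear_model.joblib")
print(restored.predict([[1.0, 2.0]]))
```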
Classifying image data is one of the most popular applications of Deep Learning techniques. In this article, we will discuss the identification of flower images using a deep convolutional neural network.
For this, we will be using the PyTorch, TorchVision & PIL libraries in Python.
The required dataset for this problem can be found on Kaggle. It contains a folder structure with flower images inside it. There are 5 different types of flowers. The folder structure looks like below:
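Since the exact folder listing is not reproduced here, the sketch below only assumes the usual class-per-subfolder layout (the folder names in the comments are placeholders, not taken from the article) and shows how such a tree can be loaded with torchvision's ImageFolder:

```python
import torch
from torchvision import datasets, transforms

# Assumed layout (actual folder names may differ):
#   flowers/
#     <flower_type_1>/ img001.jpg, img002.jpg, ...
#     <flower_type_2>/ ...
#     ... (5 class folders in total)

# Basic preprocessing: resize/crop to a fixed size and convert to tensors.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder infers the class labels from the sub-folder names.
dataset = datasets.ImageFolder(root="./flowers", transform=transform)
print(dataset.classes)   # the 5 flower types, taken from the folder names
print(len(dataset))      # total number of images

loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)   # e.g. torch.Size([32, 3, 224, 224])
```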
In this article, we will discuss different text classification techniques to solve the BBC news article categorization problem. We will also discuss different vector space models for representing text data.
We will be using Python with the scikit-learn, Gensim and XGBoost libraries to solve this problem.
Data for this problem can be found on Kaggle. The dataset contains BBC news text and its category in a two-column CSV format. Let's see what's there:
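A minimal sketch of that inspection with pandas, assuming the file was downloaded as bbc-text.csv with 'category' and 'text' columns (the file and column names are assumptions about the Kaggle download; adjust them to match yours):

```python
import pandas as pd

# Assumed file/column names -- adjust to match the downloaded Kaggle CSV.
df = pd.read_csv("bbc-text.csv")

print(df.shape)
print(df.head())
print(df["category"].value_counts())   # how many articles fall in each news category
```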
Every day, users of stackoverflow.com post many technical questions, and all of them get tagged with different topics. In this article, we will discuss a classification model that can automatically tell which tags can be attached to an unanswered question.
Obviously, multiple tags can be associated with a question, so ultimately the problem becomes 'classifying a question and attaching multiple class labels to it'. In Machine Learning terms, it is a 'multi-label classification' problem.
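To make that concrete, here is a small hypothetical sketch (not the article's actual pipeline) of how a multi-label target can be represented and fit with scikit-learn's One-vs-Rest wrapper; the questions and tags are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical questions and their tag sets (one question -> many tags).
questions = [
    "How do I merge two dicts in Python?",
    "Segmentation fault when freeing a pointer in C",
    "Join two tables and filter rows in SQL",
]
tags = [["python", "dictionary"], ["c", "pointers"], ["sql", "join"]]

# Each tag becomes one binary column: the multi-label target matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

X = TfidfVectorizer().fit_transform(questions)

# One binary classifier per tag under the hood.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(mlb.classes_)
print(clf.predict(X))
```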
We already discussed the different theoretical techniques and accuracy metrics required for multi-label models in the article below.
That article is a prerequisite for the current discussion, and readers are requested to go through it before this one. …
Classification is probably the most fundamental technique in Machine Learning; the majority of online ML/AI courses and curricula start with it.
In normal classification, we have a model that classifies or tags a data instance with only one class label. The class set can (and usually will) contain multiple class labels, but the classifier chooses only the single best one among them.
Now, the question is: can a data instance be classified/tagged with multiple possible class labels from the set? How should such a model be designed, and how can we calculate accuracy for it? …
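As a preview of that accuracy question, a small example of how scikit-learn's multi-label metrics behave on hypothetical true vs. predicted label matrices (the numbers are illustrative only):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

# Hypothetical multi-label ground truth and predictions
# (rows = instances, columns = class labels, 1 = label assigned).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# Exact-match ratio: an instance counts only if ALL its labels are correct.
print("exact match:", accuracy_score(y_true, y_pred))
# Hamming loss: fraction of individual label assignments that are wrong.
print("hamming loss:", hamming_loss(y_true, y_pred))
# Jaccard score: overlap between true and predicted label sets, averaged per sample.
print("jaccard:", jaccard_score(y_true, y_pred, average="samples"))
```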
Regression is one of the most fundamental techniques in Machine Learning. In simple terms, it means predicting a continuous variable from other independent categorical/continuous variables. The challenge comes when we have high dimensionality, i.e., too many independent variables. In this article, we will discuss a technique for regression modeling with high-dimensional data using Principal Components and ElasticNet. We will also see how to save that model for future use.
We will use Python 3.x as the programming language, with scikit-learn and seaborn as libraries, for this article.
The data used here can be found at the UCI Machine Learning Repository, under the name "Relative location of CT slices on axial axis Data Set". It contains features extracted from medical CT scan images of various patients (male & female); the features are numerical in nature. As per UCI, the goal is 'predicting the relative location of a CT slice on the axial axis of the human body'. …
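To illustrate the overall modeling idea (with synthetic data standing in for the actual CT features, so none of the column names or numbers below come from the real dataset), a minimal PCA-plus-ElasticNet pipeline that is then saved for future use:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional regression data standing in for the CT features.
X, y = make_regression(n_samples=500, n_features=300, n_informative=40,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> reduce dimensions with PCA -> regularized linear regression.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),      # keep components explaining ~95% of variance
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5)),
])
pipe.fit(X_train, y_train)
print("R^2 on test set:", pipe.score(X_test, y_test))
print("components kept:", pipe.named_steps["pca"].n_components_)

# Persist the whole pipeline so it can be reused later without retraining.
joblib.dump(pipe, "pca_elasticnet_pipeline.joblib")
```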
Apache Spark is quite popular nowadays for scaling up data processing applications. For Machine Learning, too, it provides a library called 'MLlib', which takes a distributed programming approach to solving ML problems. In this article, we will see how to use MLlib from PySpark, and techniques for using Doc2Vec with PySpark to solve text classification problems.
Before going ahead, we need to know what 'Doc2Vec' is. It is an NLP model for describing a text or document: it converts the text into a vector of numerical features that can be used in any ML algorithm. Basically, it is a feature engineering technique. It tries to understand the context of documents by random sampling of words and trains a neural network with those. The hidden-layer vectors of that neural network become the document vectors, a.k.a. 'Doc2Vec'. There is another technique called 'Word2Vec' which works on similar principles, but instead of documents/texts, it works on a word corpus and provides vectors for words. …
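A minimal, standalone Gensim sketch of what Doc2Vec produces (a fixed-length vector per document), separate from the PySpark integration discussed in the article; the tiny corpus and the hyperparameters are illustrative only:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus: each document gets a tag (its index here).
corpus = [
    "spark makes distributed data processing simple",
    "doc2vec turns documents into dense numeric vectors",
    "text classification needs numeric features",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Small vector size / epoch count, just for the demo.
model = Doc2Vec(vector_size=20, min_count=1, epochs=50)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new, unseen piece of text -- this is the feature
# vector that would be fed to a downstream classifier.
vec = model.infer_vector("distributed text processing with vectors".split())
print(len(vec), vec[:5])
```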
In Machine Learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this 'curse of dimensionality'.
In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don’t have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well.
For this article, we will use the "EEG Brainwave Dataset" from Kaggle. This dataset contains brainwave signals captured from an EEG headset, in temporal format. …
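To give a flavor of the kind of comparison the article makes (using scikit-learn's synthetic data rather than the actual EEG features, so the numbers and model choices are illustrative only), here is a small sketch of how feature engineering such as scaling plus PCA can change a simple classifier's accuracy on high-dimensional data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data standing in for the EEG features.
X, y = make_classification(n_samples=1000, n_features=300, n_informative=30,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logreg (raw features)": LogisticRegression(max_iter=2000),
    "logreg (scaled + PCA)": make_pipeline(StandardScaler(),
                                           PCA(n_components=30),
                                           LogisticRegression(max_iter=2000)),
    "random forest": RandomForestClassifier(random_state=0),
}

# Fit each model and compare test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```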