Data at scale. What does this term mean? “Data at Scale” refers to moving from a small repository to a large one: the volume of data grows, while the tools and usage patterns stay the same.
Data are facts collected for analysis or reference.
Data Mining is the process of discovering patterns in large data sets using methods from statistics, Machine Learning, and related fields.
Decision Support is a system that supports business or organizational decision-making activities, often using complex algorithms.
Machine Learning is the study of computer algorithms that improve automatically through experience.
Using data processed with algorithms for aggregating data for ingestion to…
In my previous article, I highlighted four algorithms to start off with in Machine Learning: Linear Regression, Logistic Regression, Decision Trees, and Random Forest. Now I am turning that list into a series.
The simplest form of the regression equation, with one dependent and one independent variable, is y = mx + c,
where y is the estimated dependent variable, c is the constant (intercept), m is the regression coefficient (slope), and x is the independent variable.
Let's understand this with an example:
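Here is a quick sketch of that equation in code, using small made-up numbers (hours studied versus exam score) and NumPy's least-squares fit:

```python
import numpy as np

# Made-up data for illustration: hours studied (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 55, 61, 64, 68], dtype=float)

# Fit y = mx + c by ordinary least squares (a degree-1 polynomial fit)
m, c = np.polyfit(x, y, 1)
print(m, c)  # slope m = 4.1, intercept c = 47.7 for this data

# Estimate y for a new x, e.g. 6 hours of study
y_hat = m * 6 + c
print(y_hat)  # → 72.3 (approximately)
```

The fitted m and c are exactly the values that minimize the squared vertical distances between the line and the points.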
For Data Scientists nowadays, there are many options to choose from when producing statistical visualisations. Python is the most popular one: with it we can do pretty much anything, from Machine Learning classification and regression to Deep Learning for computer vision, NLP, and even audio analysis. Aside from Python, we can run Machine Learning algorithms in many other languages, such as Java, Scala, Lisp, C++, or C#. However, it is undeniable that R is the second most sought-after skill for Data Scientists, at least up to 2021 according to LinkedIn, as mentioned in the link below:
By Zachary Galante — Senior Data Science Student at Bryant University
A very popular algorithm in Machine Learning is the Decision Tree Classifier. In this article, the Banknote dataset will be used to illustrate the capabilities of this model.
A decision tree is a basic machine learning algorithm that can be used for classification problems. At a high level, a decision tree starts with a basic statement at the top of the tree, and then, based on whether that statement is True or False, it moves down a different path to the next condition. …
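As an illustration only, here is what that True/False branching looks like in plain Python. The feature names echo the Banknote dataset's wavelet features, but the thresholds and branches are invented for this sketch, not taken from a trained model:

```python
# Hand-written toy decision tree; thresholds are hypothetical, for illustration.
def classify(banknote):
    # Root condition: is the wavelet variance below a (made-up) threshold?
    if banknote["variance"] < 0.32:
        # True branch: move down to a second condition
        if banknote["skewness"] < 5.0:
            return "forged"
        return "genuine"
    # False branch: classify high-variance notes as genuine in this toy tree
    return "genuine"

print(classify({"variance": 0.1, "skewness": 2.0}))  # → forged
print(classify({"variance": 1.0, "skewness": 0.0}))  # → genuine
```

A trained Decision Tree Classifier learns these conditions and thresholds from the data rather than having them hard-coded.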
If you were to gather a group of scientists from 1962 and ask them about their outlooks on the future and potential of artificial intelligence in solving computationally hard problems, the consensus would be generally positive.
If you were to ask the same group of scientists a decade later in 1972, the consensus would appear quite different, and contrary to the nature of scientific progress, it would be a lot more pessimistic.
We can attribute this change in attitude to the rise and fall of a single algorithm: the perceptron.
The perceptron algorithm, first proposed in 1958 by Frank Rosenblatt…
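To make the idea concrete, here is a minimal sketch of the perceptron's update rule, learning the logical AND function on a tiny made-up training set:

```python
# Minimal perceptron sketch: learn logical AND.
# Training data: inputs and targets (1 only when both inputs are 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

w = [0, 0]  # weights
b = 0       # bias

for _ in range(20):  # a few passes suffice for this separable problem
    for (x1, x2), target in zip(X, y):
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        # Update rule: nudge weights toward each misclassified example
        w[0] += (target - pred) * x1
        w[1] += (target - pred) * x2
        b += (target - pred)

print([1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in X])
# → [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron is guaranteed to converge here; its famous limitation, which fed the pessimism of the 1970s, is that no such line exists for problems like XOR.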
One of the main applications of unsupervised learning is market segmentation. Labelled data is not always available, but it is still important to segment the market so that businesses can target individual groups. This is very useful in advertising, inventory management, distribution strategies, and mass media. Let's go ahead and apply unsupervised learning to one such use case to see how it can be useful.
We will be dealing with a wholesale vendor and his customers. We will be using the data available at https://archive.ics.uci.edu/ml/datasets/Wholesale+customers. …
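Before loading that dataset, the clustering idea itself can be sketched on made-up spending data with a hand-rolled k-means (Lloyd's algorithm). This is an illustrative toy, not the analysis of the wholesale data:

```python
import numpy as np

# Two made-up customer groups in a 2-D "spending" space
rng = np.random.default_rng(0)
low_spenders = rng.normal(loc=[2, 3], scale=0.5, size=(20, 2))
high_spenders = rng.normal(loc=[10, 12], scale=0.5, size=(20, 2))
X = np.vstack([low_spenders, high_spenders])

k = 2
centroids = X[[0, -1]].copy()  # simple deterministic init for the sketch
for _ in range(10):
    # Assign each point to its nearest centroid, then recompute centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Centroids should land near the group centers (2, 3) and (10, 12)
print(centroids[np.argsort(centroids[:, 0])])
```

The recovered centroids are the "typical customer" of each segment; with the real wholesale data the same loop runs over annual spending per product category.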
When we discussed the k-means algorithm, we saw that we had to give the number of clusters as one of the input parameters. In the real world, we won't have this information available. We can definitely sweep the parameter space to find the optimal number of clusters using the silhouette coefficient score, but this will be an expensive process! A method that returns the number of clusters in our data would be an excellent solution to the problem. DBSCAN does just that for us.
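To illustrate, here is a simplified, hand-rolled version of the DBSCAN idea on two made-up blobs (in practice we would use a library implementation); note that we never pass in the number of clusters, only a radius `eps` and a density threshold `min_pts`:

```python
import numpy as np

# Two made-up, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(30, 2)),
    rng.normal([5, 5], 0.3, size=(30, 2)),
])

eps, min_pts = 0.8, 4
n = len(X)
dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
core = [len(nb) >= min_pts for nb in neighbors]  # dense "core" points

labels = [-1] * n  # -1 marks unvisited / noise
cluster = 0
for i in range(n):
    if labels[i] != -1 or not core[i]:
        continue
    # Grow a new cluster from this core point by flood fill over eps-neighbors
    labels[i] = cluster
    stack = [i]
    while stack:
        j = stack.pop()
        for nb in neighbors[j]:
            if labels[nb] == -1:
                labels[nb] = cluster
                if core[nb]:
                    stack.append(nb)
    cluster += 1

print("clusters found:", cluster)  # the number of clusters falls out on its own
```

The loop simply keeps starting new clusters until no unvisited core point remains, so the cluster count is an output, not an input.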
We have built different clustering algorithms, but haven't measured their performance.
A good way to evaluate a clustering algorithm is to check how well its clusters are separated: are the clusters well separated, and are the data points within each cluster packed tightly enough?
We need a metric that can quantify this behaviour. We will use a metric called the silhouette…
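The silhouette calculation can be sketched by hand on a tiny one-dimensional example; the data and labels below are made up:

```python
import numpy as np

# Tiny made-up 1-D example with two labelled clusters.
# For each point: a = mean distance to its own cluster (excluding itself),
# b = mean distance to the other cluster; s = (b - a) / max(a, b) in [-1, 1].
X = np.array([1.0, 1.1, 1.2, 8.0, 8.1, 8.2])
labels = np.array([0, 0, 0, 1, 1, 1])

scores = []
for i, x in enumerate(X):
    same = X[labels == labels[i]]
    other = X[labels != labels[i]]
    a = np.abs(same - x).sum() / (len(same) - 1)  # self-distance is 0
    b = np.abs(other - x).mean()
    scores.append((b - a) / max(a, b))

print(round(float(np.mean(scores)), 3))  # → 0.981
```

A mean silhouette near 1 means tight, well-separated clusters; values near 0 mean overlapping clusters, and negative values suggest points sitting in the wrong cluster.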
In part I, we discussed how to evaluate binary-class classification models using Recall, Precision, Accuracy, and F1-Score. Here, we will see how we can apply those metrics to a multi-class classification model.
As seen in part I, we can build the confusion matrix for a multi-class model just as we did for the binary-class model. But because it becomes harder to read when there are many classes, we can break it into one confusion matrix per class to make the calculations and visualizations easier.
Note: Of course we will not do that manually for each classification problem we work with, but this…
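As a sketch of those per-class calculations, here is how precision and recall fall out of a single three-class confusion matrix; the counts are made up for illustration:

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = true class, columns = predicted
cm = np.array([
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 44],
])

per_class = []
for k in range(3):
    tp = cm[k, k]              # correctly predicted as class k
    fp = cm[:, k].sum() - tp   # predicted k, but truly another class
    fn = cm[k, :].sum() - tp   # truly k, but predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    per_class.append((precision, recall))
    print(f"class {k}: precision={precision:.2f}, recall={recall:.2f}")
```

Each class is treated as "positive" in turn (one-vs-rest), which is exactly the binary-class view from part I applied column by column and row by row.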