Understand the characteristics of the XGBoost Algorithm

https://unsplash.com/photos/1vxPDvnjt48

Since Tianqi Chen published the XGBoost paper in 2016, XGBoost has become more and more popular in the data science world. To understand the math behind it, you can find my detailed explanation here.

In this article, we will focus on the special characteristics of XGBoost and explain why it works so efficiently.

I. Approximate Greedy Algorithm

As we all know, one of the advantages of XGBoost is that it can effectively deal with large datasets.

Think about the process: when we use XGBoost for classification or regression, we always start from an initial guess and calculate the similarity score and gain to…
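As a rough illustration of the similarity score and gain mentioned above, here is a minimal Python sketch for the regression case. It uses the simplified squared-error form often shown in tutorials; the formula in the XGBoost paper additionally carries a factor of 1/2 and a complexity penalty gamma. The function names, the toy data, and the choice of the mean as the initial guess are assumptions made for this example, not part of the original article.

```python
def similarity_score(residuals, lam=1.0):
    """Simplified regression similarity: (sum of residuals)^2 / (count + lambda)."""
    total = sum(residuals)
    return total * total / (len(residuals) + lam)


def split_gain(left, right, lam=1.0):
    """Gain of a split: left score + right score - score of the unsplit leaf."""
    root = similarity_score(left + right, lam)
    return similarity_score(left, lam) + similarity_score(right, lam) - root


# Toy data (hypothetical). The initial guess is taken to be the mean here,
# which is an assumption for illustration only.
observations = [10.0, 12.0, 20.0, 22.0]
initial_guess = sum(observations) / len(observations)
residuals = [y - initial_guess for y in observations]

# Evaluate the candidate split between the second and third observation.
gain = split_gain(residuals[:2], residuals[2:], lam=0.0)
print(gain)
```

A larger `lam` shrinks every similarity score, so a split needs more evidence before its gain looks attractive.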


Understand the formulas in XGBoost

https://unsplash.com/photos/XdYXIfOEv5I

These days, XGBoost is becoming more and more popular and is widely used in data science, especially in competitions like those on Kaggle. This extreme implementation of gradient boosting, created by Tianqi Chen, was published in 2016. (Find the article here.)

However, many people find the equations in XGBoost too complicated to understand. This article explains the math behind them in a simple way to help you understand the algorithm.

Brief Review of XGBoost

Before we start to talk about the math, I would like to give a brief review of XGBoost regression.

At first, we put all residuals into one leaf and…


Simple tutorial to explain in SAS studio

Photo by Andrés Dallimonti on Unsplash

It is very common to receive several data sets and have to use them to create a comprehensive report that helps companies make data-driven decisions. In SAS programming, combining observations from two or more data sets into a new data set is also very common.

In this article, we will use examples to explain different ways to combine data in SAS.

One to One Merging

The most basic combination method is one to one merging, so we start there.

What is one to one merging? Let’s say we have to combine data A and data B and we need the output data…


Multiple Intermediate Visualizations of Superstore Data

Photo from: https://images.app.goo.gl/8MAK99PXRY5cRaL39

Nowadays, Tableau is becoming more and more popular. More than half of data-related job postings require candidates to have Tableau experience, and some of them even require advanced Tableau skills.

This article continues to explore the Superstore data from my last Medium post (here) and creates more visual charts to tell the business story.

Data Description

This data set contains trade data for different regions in the USA from 2013 to 2018. It includes columns such as sales volume, profit, discount, product name, and other order-related information.

There are 4 different region sheets which…


Text Classification With Python

https://images.app.goo.gl/YEYTJoty2EH8S46d7

People receive tons of emails every day, and many of them could be spam. So how can we detect spam emails and reduce the time spent checking them one by one?

This article is about using the Naive Bayes algorithm to build a machine learning classification model that detects spam emails.

Data Description

The data is from here: Kaggle website.

There are 2500 ham and 500 spam emails in the dataset. You may also notice that all the numbers and URLs were converted to the strings Number and URL, respectively. This is a simplified spam and ham dataset.
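The dataset already ships with numbers and URLs replaced, but new emails would need the same treatment before being passed to a classifier. The sketch below shows one possible way to do that in Python. The regular expressions and the function name `normalize` are assumptions for illustration; the exact rules used to build the Kaggle dataset are not stated here.

```python
import re

# Hypothetical patterns; the dataset's real preprocessing rules are unknown.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def normalize(text):
    """Replace URLs with the token URL, then numbers with the token Number."""
    # URLs go first, because a URL may itself contain digits.
    text = URL_PATTERN.sub("URL", text)
    return NUMBER_PATTERN.sub("Number", text)


print(normalize("Win 1000 dollars at http://spam.example/win2"))
# prints "Win Number dollars at URL"
```

Collapsing every number and URL into a single token keeps the vocabulary small, so a word-count based model such as Naive Bayes sees "contains a URL" as one feature instead of thousands of unique strings.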


Dimension Reduction of Big data file in R

https://images.app.goo.gl/Z4ftdZwRLbHP8XmL6

It is common to get a big data file with a large number of columns when dealing with problems like regression or classification. In this situation, do we really need so many dimensions in our analysis?

For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables.

Factor analysis is able to summarize the information contained in a larger number of…


Understand how fully connected feed-forward ANN works

https://images.app.goo.gl/woWsQbN6RbHJznH77

The U.S. recently became the country with the highest number of deaths from Covid-19. Judging from most death cases, patients who are admitted to the ICU are at high risk of dying.

But in general, not every patient admitted to the ICU faces that high a risk. ICU mortality rates differ widely depending on the underlying disease process, with death rates as low as 1 in 20 for patients admitted following elective surgery, and as high as 1 in 4 for patients with respiratory diseases. …


Simple CNN Classification Project by Using Tensorflow

https://images.app.goo.gl/fuAmirtbXnsG9vrD8

Background Overview

Recently, a new and terrible virus called Covid-19 has had a huge effect around the whole world. Meanwhile, more and more people have begun to pay attention to strengthening their own immunity.

Do you know which cells play an important role in the human immune system? The answer is white blood cells.

Generally speaking, human blood is made up of red blood cells, white blood cells, platelets, and plasma. White blood cells account for only about 1% of a person's blood, but their impact is important. White blood cells are also called leukocytes. In a sense, they are…


DOE project of Designing a race car by using Lego blocks

https://images.app.goo.gl/ZsWofNvFBXANV65G8

During the self-quarantine period, many board games became popular among families. Sales of puzzles, Ludo, Monopoly, and Lego blocks have increased a lot since March. As one of the most popular toys in the USA, Lego offers so many differently themed sets that it can suit almost all ages.

Think about a simple game: use Lego blocks to build a race car, put it on a ramp, and let it run to the ground. …


Simple case study for e-commerce marketing analysis

https://images.app.goo.gl/WFD43zY9FiFchRrq9

In recent years, more and more companies have started e-commerce businesses on the Internet. To monitor marketing and revenue performance, companies collect data from their websites and use data tools such as SQL to analyze and improve their e-commerce strategy.

Which page is the landing page for most customers? How many customers finally place orders on the website? How could the company improve their website to increase the conversion rate? There are many questions for analysts to explore. For this article, I found a virtual case data file for this topic.

Case Description

This Maven Fuzzy Factory…

Sydney Chen

Machine Learning Learner | Data Analyst | Data Science Interest | LinkedIn: linkedin.com/in/sydneychen-/ | Engineering Background | github.com/SydneyChen2
