# Why XGBoost Is So Effective?

## Understand the characteristics of the XGBoost Algorithm

Since Tianqi Chen published the XGBoost paper in 2016, the algorithm has become more and more popular in the data science world. To understand the math behind it, you can find my detailed explanation here.

In this article, we will focus on the special characteristics of XGBoost and explain why it works so efficiently.

# I. Approximate Greedy Algorithm

As we all know, one of the advantages of XGBoost is that it can effectively deal with large datasets.

Think about the process: when we use XGBoost for classification or regression, we always start from an initial guess and calculate the similarity score and gain to…

## Understand the formulas in XGBoost

These days, XGBoost is increasingly popular and widely used in data science, especially in competitions like those on Kaggle. This extreme implementation of gradient boosting, created by Tianqi Chen, was published in 2016. (Find the article here.)

However, many people find the equations in XGBoost too complicated to understand. This article explains the math behind them in a simple way to help you understand the algorithm.

# Brief Review of XGBoost

Before we start to talk about the math, I would like to give a brief review of XGBoost regression.

At first, we put all residuals into one leaf and…

# Several Different Ways to Combine Datasets in SAS

## Simple tutorial to explain in SAS studio

It is very common to receive several data sets and have to use them to create a comprehensive report that helps a company make data-driven decisions. In SAS programming, combining observations from two or more data sets into a new data set is also very common.

In this article, we will use examples to explain different ways to combine data in SAS.

# One to One Merging

To understand the basic combination methods, we start with one-to-one merging.

What is one to one merging? Let’s say we have to combine data A and data B and we need the output data…

# Multiple Intermediate Visualizations of Superstore Data

Nowadays, Tableau is becoming more and more popular. More than half of data-related job postings require candidates to have Tableau experience, and some of them even require advanced Tableau skills.

This article continues to explore the Superstore data from my last Medium post (here) and creates more charts to tell the business story.

# Data Description

This dataset contains trade data for different regions of the USA from 2013 to 2018. It includes columns such as sales volume, profit, discount, product name, and other order-related information.

There are 4 different region sheets which…

# Text Mining by Using Naive Bayes — Spam Email Classification

## Text Classification With Python

People receive tons of emails every day, and many of them could be spam. How can we detect the spam emails without spending time checking them one by one?

# Data Description

The data is from here: Kaggle website.

There are 2500 ham and 500 spam emails in the dataset. You may also notice that all numbers and URLs have been converted to the strings Number and URL, respectively. This is a simplified spam and ham dataset.
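The kind of normalization described above can be sketched in a few lines of Python. This is only an illustration: how the dataset was actually preprocessed is not stated here, so the regular expressions and the function name `normalize_email_text` are assumptions made for this example.

```python
import re

# Assumed patterns, for illustration only: the dataset's real rules are unknown.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_email_text(text: str) -> str:
    """Replace URLs with the token 'URL' and numbers with the token 'Number'."""
    # URLs go first so that digits inside a URL are not replaced separately.
    text = URL_PATTERN.sub("URL", text)
    text = NUMBER_PATTERN.sub("Number", text)
    return text


sample = "Win 1000 dollars now at http://example.com/prize before 12.05"
print(normalize_email_text(sample))
# prints: Win Number dollars now at URL before Number
```

Collapsing every number and URL into a single token keeps the vocabulary small, which is convenient for a word-count model such as Naive Bayes.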

# Factor Analysis in R — Airport Quarterly Passenger Survey

## Dimension Reduction of a Big Data File in R

It is common to get a big data file with a large number of columns when dealing with problems like regression or classification. In this situation, do we really need so many dimensions in our analysis?

For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables.

Factor analysis is able to summarize the information contained in a larger number of…

# Introduction to ANN Algorithms — ICU Prediction Model in Python

## Understand how fully connected feed-forward ANN works

The U.S. recently became the country with the highest number of deaths from Covid-19. Judging from most fatal cases, patients who are admitted to the ICU are at high risk of dying.

But in general, not every ICU patient faces such a high risk. ICU mortality rates differ widely depending on the underlying disease process, with death rates as low as 1 in 20 for patients admitted following elective surgery, and as high as 1 in 4 for patients with respiratory diseases. …

# Understanding CNN in Python — Blood Cell Classification

## Simple CNN Classification Project Using TensorFlow

# Background Overview

Recently, Covid-19 has had a huge impact around the whole world. Meanwhile, more and more people have begun to pay attention to strengthening their own immunity.

Do you know which cell plays an important role in the human immune system? The answer is the white blood cell.

Generally speaking, human blood is made up of red blood cells, white blood cells, platelets, and plasma. White blood cells account for only about 1% of a person's blood, but their impact is important. White blood cells are also called leukocytes. In a sense, they are…

# How to Design a Race Lego Car? — Designed Experiment and Supporting Analysis in Minitab

## DOE Project of Designing a Race Car Using Lego Blocks

During self-quarantine, many board games and toys have become popular among families. Sales of puzzles, Ludo, Monopoly, and Lego blocks have increased a lot since March. As one of the most popular toys in the USA, Lego offers many differently themed sets that suit almost all ages.

Think about a simple game: use Lego blocks to build a race car, put it on a ramp, and let it roll to the ground. …

# E-commerce Traffic and Website Analysis by Using MySQL — (1)

## Simple Case Study for E-commerce Marketing Analysis

In recent years, more and more companies have started e-commerce businesses on the Internet. To monitor marketing and revenue performance, companies collect data from their websites and use tools such as SQL to analyze and improve their e-commerce strategy.

Which page is the landing page for most customers? How many customers finally place orders on the website? How could the company improve their website to increase the conversion rate? There are many questions for analysts to explore. For this article, I found a virtual case data file for this topic.

# Case Description

This Maven Fuzzy Factory…

## Sydney Chen

Machine Learning Learner | Data Analyst | Data Science Interest | LinkedIn: linkedin.com/in/sydneychen-/ | Engineering Background | github.com/SydneyChen2
