Data Analyst job market in China

Ben Niu
5 min read · Feb 25, 2019

A horror story about uncleaned data

Part 1 Data cleaning

Messy, uncleaned data is everywhere, and it gets messy for many reasons. Different devices, languages, uses, and industries produce data in different forms, and the way data is collected and stored, whether professional or not, adds to the mess.

The following dataset was scraped from a recruitment website. Because it is in Chinese, the initial dataset in CSV form looks like this.

Encoding and system language settings made the data unreadable

The dataset is readable when opened in Notepad with UTF-8 encoding. I then imported the dataset into a Jupyter notebook.
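As a minimal sketch of why the encoding matters, the snippet below reads a CSV with an explicit encoding in pandas. The two rows and the column names are made up for illustration and are not taken from the real file; an in-memory buffer stands in for the file on disk.

```python
import io
import pandas as pd

# Made-up two-row sample standing in for the scraped CSV (not the real data).
raw_bytes = "city,positionName\n北京,数据分析师\n上海,数据分析\n".encode("utf-8")

# Decoding with the right encoding gives readable text. For a file on disk
# the call would look like pd.read_csv("some_file.csv", encoding="utf-8"),
# where the file name is hypothetical.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
print(df)

# The same bytes decoded with the wrong encoding turn into unreadable text.
garbled = raw_bytes.decode("latin-1")
print(garbled.splitlines()[1])
```

If the text still looks garbled after reading, the file was probably saved in a different encoding than the one passed to read_csv.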

Here is a list of the fields in the dataset. There are 6876 rows in total, and companyLabelList, businessZones, secondType, and positionLables have some empty values. The company ID and position ID are numbers; the other fields are strings.

Because the dataset is fairly large, we can use the head function to browse just the first rows. It shows 5 rows by default, and the number can be set freely through its parameter. The tail function shows the last rows of the dataset in the same way.
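A short sketch of these calls, using a made-up 20-row frame in place of the real dataset (the values are illustrative only):

```python
import pandas as pd

# Made-up sample; the real dataset has 6876 rows.
df = pd.DataFrame({
    "positionId": range(100, 120),
    "city": ["Beijing", "Shanghai"] * 10,
})

df.info()                  # column names, non-null counts, and dtypes

first_rows = df.head()     # first 5 rows by default
first_three = df.head(3)   # first 3 rows
last_rows = df.tail()      # last 5 rows by default
print(first_three)
```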

The dataset is relatively tidy; the main dirty part is Salary, which we will split into two columns later. First, we check whether the data contains duplicates.

The unique() function returns the distinct values in the positionId column, and len() counts them. It shows 5031 unique values in positionId against 6876 rows in total, so we have to drop the duplicated rows with drop_duplicates.
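The duplicate check and removal can be sketched as follows. The six rows are made up, and keeping the first row per ID is an assumption rather than something the article states.

```python
import pandas as pd

# Made-up sample with repeated position IDs.
df = pd.DataFrame({
    "positionId": [1, 2, 2, 3, 3, 3],
    "city": ["Beijing", "Shanghai", "Shanghai",
             "Shenzhen", "Shenzhen", "Shenzhen"],
})

n_rows = len(df)
n_unique = len(df["positionId"].unique())  # same as df["positionId"].nunique()
print(n_rows, n_unique)

# Keep the first row for each positionId and drop the later repeats
# (keep="first" is an assumption about which copy to retain).
df_dedup = df.drop_duplicates(subset="positionId", keep="first")
print(len(df_dedup))
```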

I tried to use the Google Translate API and the Youdao API to translate the dataset, but both APIs have daily limits on the amount of translation. Instead, I used the replace() function to translate some key values from Simplified Chinese to English.
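A dictionary passed to replace() is one way to do this kind of lookup translation. The mappings below are illustrative assumptions, not the author's actual list, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "city": ["北京", "上海", "深圳", "北京"],
    "education": ["本科", "硕士", "博士", "本科"],
})

# Hypothetical translation tables for a few key values.
city_map = {"北京": "Beijing", "上海": "Shanghai", "深圳": "Shenzhen"}
education_map = {"本科": "Bachelor", "硕士": "Master", "博士": "Doctor"}

# Values found in the dictionary are replaced; anything else is left as-is.
df["city"] = df["city"].replace(city_map)
df["education"] = df["education"].replace(education_map)
print(df)
```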

Next, we process the Salary column. The goal is to extract the lower and upper salary limits.

The salary content follows no single pattern: there are lowercase k, uppercase K, and the phrase “above k”. For “above k” entries, the upper and lower limits can only be set to the same value.

Here we have to write a function and apply it to the data.
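One possible shape for such a function is sketched below. It is not the author's original code: the function name parse_salary and the exact input formats ("10k-20k", "8K-15K", "15k以上" for "above 15k") are assumptions based on the description above.

```python
import re

def parse_salary(text):
    """Return (lower, upper) in thousands from strings such as '10k-20K'.

    Entries with a single figure, such as '15k以上' ("above 15k"),
    get the same value for both limits.
    """
    # Collect every number that is followed by a lowercase or uppercase k.
    numbers = [int(n) for n in re.findall(r"(\d+)\s*[kK]", text)]
    if not numbers:
        raise ValueError(f"no salary figure found in {text!r}")
    return numbers[0], numbers[-1]

for sample in ["10k-20k", "8K-15K", "15k以上"]:
    print(sample, parse_salary(sample))
```

On a DataFrame, a function like this could be applied to the salary column with apply() and the two results stored in separate columns.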

Once we have both the top and bottom salary limits, we create a new column, ‘avgSalary’.
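As a sketch, the new column is the midpoint of the two limits. The column names bottomSalary and topSalary are hypothetical, and the rows are made up.

```python
import pandas as pd

# Made-up limits in thousands; bottomSalary and topSalary are assumed names.
df = pd.DataFrame({
    "bottomSalary": [10, 8, 15],
    "topSalary": [20, 15, 15],
})

# Midpoint of the posted range.
df["avgSalary"] = (df["bottomSalary"] + df["topSalary"]) / 2
print(df)
```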

At this point, the data cleaning part is complete. We keep only the columns we need for the subsequent analysis.

Part 2 Data analysis

Let us start with descriptive statistics.

The average salary of a data analyst is 17k and the median is 15k, so the difference between the two is small. The maximum salary is 75k, which is probably at the level of a data scientist or data analysis director. The standard deviation is 8.99k, which indicates a fair amount of spread. Most analysts are paid within about one standard deviation of the mean, that is 17k ± 9k.

Categorical data is generally summarized with value_counts(), and numerical data with describe(); these are the two most commonly used statistical functions.
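A minimal sketch of both functions on a made-up five-row sample (the figures are illustrative and are not the article's results):

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "avgSalary": [15.0, 20.0, 12.5, 30.0, 10.0],
})

# Categorical column: how many rows fall in each category.
city_counts = df["city"].value_counts()
print(city_counts)

# Numerical column: count, mean, std, min, quartiles, and max.
salary_stats = df["avgSalary"].describe()
print(salary_stats)
```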

Words alone are not intuitive enough, so let the charts speak.

Because the original data comes from a recruitment website, salaries tend to concentrate in certain intervals and do not reflect real salaries. (With the calculation used in this article, a 10–20k range falls crudely at 15k rather than being spread evenly across the range.)

Now let us observe the impact of city and education on salary. A box plot is the best way to observe this.
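The numbers a box plot draws (quartiles, median, extremes) can also be computed per group. The sketch below uses made-up rows and prints the summary instead of drawing a chart.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "city": ["Beijing"] * 4 + ["Shanghai"] * 4,
    "avgSalary": [10.0, 15.0, 20.0, 30.0, 8.0, 12.0, 15.0, 25.0],
})

# Per-city count, mean, std, min, 25%, 50%, 75%, and max.
summary = df.groupby("city")["avgSalary"].describe()
print(summary)

# With matplotlib installed, the chart itself could be drawn with
# df.boxplot(column="avgSalary", by="city").
```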

From the chart, we see that data analysts in Beijing are paid more than those in other cities, especially at the median. Shanghai and Shenzhen are slightly behind, and Guangzhou does not even match Hangzhou.

By academic background, doctoral salaries are far ahead, although at the top end they do not reach as high as those for bachelor’s and master’s degrees; this needs follow-up analysis.

Looking at years of work experience, the wage gap widens further: fresh graduates and people with many years of experience are not on the same tier. Although there is no comparison data from other industries, it is fair to say that the data analyst’s career path is still quite bright.

Average salary calculated by city and education

Across cities, the highest average salary for a doctoral degree is in Shenzhen, and the highest for a master’s degree is in Hangzhou. Beijing has the best salaries overall.
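A table of average salary by city and education can be sketched with groupby and unstack. The rows below are made up and do not reproduce the figures discussed above.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Beijing",
             "Shenzhen", "Shenzhen", "Hangzhou"],
    "education": ["Bachelor", "Bachelor", "Master",
                  "Doctor", "Bachelor", "Master"],
    "avgSalary": [15.0, 25.0, 30.0, 40.0, 12.0, 35.0],
})

# Mean salary per (city, education) pair, with education levels as columns.
# Combinations that have no rows show up as NaN.
pivot = df.groupby(["city", "education"])["avgSalary"].mean().unstack()
print(pivot)
```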

Ben Niu

DiDi Global|Information Science at CU-Boulder | Analytics at UChicago | Get in Touch https://www.linkedin.com/in/ben-niu-5314b2107/