Data format and sample data sets for MATLAB, R, and Python data analysis code

DataAnalysis For Beginner
Aug 14, 2016

I publish MATLAB, R, and Python code for regression, classification, variable selection, visualization, clustering, data domain estimation, and more.

A data set must be loaded to run the code, and it must be prepared as a csv file.

In this article, I summarize the data format of the csv files. The format is the same for MATLAB, R, and Python.

Regression and variable selection

The csv files required for regression are as follows:
■csv file of training data set [data.csv],
■csv file of test data set 1 [data_prediction1.csv],
■csv file of test data set 2 [data_prediction2.csv].

The csv file required for variable selection is as follows:
■csv file of training data set [data.csv].

The data format of each csv file is explained next. The sample data set can be downloaded here:

http://univprofblog.html.xdomain.jp/code/SampleDataForRegression.zip

I recommend running the code on the sample data set first.

■csv file of training data set [data.csv]

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

The first variable is the objective variable Y, followed by the explanatory variables X. You can choose sample names and variable names freely, but each name must be unique: no two samples and no two variables may share a name.
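The layout described above can be read with a few lines of Python. This is only a minimal sketch using the standard library: the in-memory text stands in for data.csv, its values are made up for illustration, and the function name `load_training_data` is hypothetical, not part of the released code.

```python
import csv
import io

# Hypothetical stand-in for data.csv; the values are made up for illustration.
SAMPLE_CSV = """,Y,X1,X2
sample1,1.5,0.1,10
sample2,2.5,0.2,20
sample3,3.5,0.3,30
"""

def load_training_data(file_obj):
    """Split a csv in the described layout into names, Y, and X."""
    rows = list(csv.reader(file_obj))
    header, body = rows[0], rows[1:]
    variable_names = header[1:]              # skip the empty top-left cell
    sample_names = [row[0] for row in body]  # leftmost column: sample names
    y = [float(row[1]) for row in body]      # first variable: objective variable Y
    x = [[float(v) for v in row[2:]] for row in body]  # remaining variables: X
    return sample_names, variable_names, y, x

sample_names, variable_names, y, x = load_training_data(io.StringIO(SAMPLE_CSV))
print(variable_names)  # ['Y', 'X1', 'X2']
print(y)               # [1.5, 2.5, 3.5]
```

To read a real file, pass an open file object instead, for example `open("data.csv", newline="")`.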

■csv file of test data set 1 [data_prediction1.csv]

Test data set 1 includes Y as well as X. The X variables must match those of data.csv in type and number. If you cannot prepare this data set, copy data.csv and rename the copy data_prediction1.csv.

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

You can choose sample names and variable names freely, but each name must be unique.
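One way to confirm that test data set 1 has the same X variables as the training data is to compare the header rows of the two files. The sketch below assumes that matching variable names in the same order is what is wanted; the file contents and the function names are hypothetical.

```python
import csv
import io

# Hypothetical contents standing in for data.csv and data_prediction1.csv.
TRAIN_CSV = """,Y,X1,X2
a,1.0,0.1,10
b,2.0,0.2,20
"""
TEST1_CSV = """,Y,X1,X2
c,3.0,0.3,30
"""

def x_variable_names(file_obj):
    """Return the X variable names: the header without the sample-name cell and Y."""
    header = next(csv.reader(file_obj))
    return header[2:]

def same_x_variables(train_file, test_file):
    """True if both files list the same X variables in the same order."""
    return x_variable_names(train_file) == x_variable_names(test_file)

ok = same_x_variables(io.StringIO(TRAIN_CSV), io.StringIO(TEST1_CSV))
print(ok)
```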

■csv file of test data set 2 [data_prediction2.csv]

Test data set 2 does not include Y. The X variables must match those of data.csv in type and number. If you cannot prepare this data set, copy only the X part of data.csv (the sample names and the X columns, without Y) and save it as data_prediction2.csv.
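Copying only X from data.csv amounts to dropping the Y column while keeping the sample names. A minimal sketch with the standard library follows; the input text is a made-up stand-in for data.csv, and `drop_y_column` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical stand-in for data.csv; the values are made up for illustration.
TRAIN_CSV = """,Y,X1,X2
a,1.0,0.1,10
b,2.0,0.2,20
"""

def drop_y_column(in_file, out_file):
    """Copy a csv in the described layout, leaving out the Y column."""
    writer = csv.writer(out_file, lineterminator="\n")
    for row in csv.reader(in_file):
        # Column 0 holds sample names, column 1 holds Y; keep everything else.
        writer.writerow(row[:1] + row[2:])

out = io.StringIO()
drop_y_column(io.StringIO(TRAIN_CSV), out)
prediction2_text = out.getvalue()
print(prediction2_text)
```

With real files, the same function can be called on `open("data.csv", newline="")` and `open("data_prediction2.csv", "w", newline="")`.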

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

You can choose sample names and variable names freely, but each name must be unique.

The estimated values of Y for test data set 2 are saved as “PredictedY2.csv”.

Classification

The csv files required for classification are as follows:
■csv file of training data set [data.csv],
■csv file of test data set 1 [data_prediction1.csv],
■csv file of test data set 2 [data_prediction2.csv].

The data format of each csv file is explained next. The sample data set can be downloaded here:

http://univprofblog.html.xdomain.jp/code/SampleDataForClassification.zip

This is Edgar Anderson's famous iris data set. I recommend running the code on the sample data set first.

■csv file of training data set [data.csv]

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

The first variable is the objective variable Y, followed by the explanatory variables X. You can choose sample names and variable names freely, but each name must be unique. Y may be text (class labels given as character strings).
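Because Y may be text in classification, it should be kept as strings when the file is read, while X is converted to numbers. The following is a minimal sketch; the rows and the column names are illustrative iris-like values rather than the contents of the sample file, and `load_classification_data` is a hypothetical name.

```python
import csv
import io
from collections import Counter

# Hypothetical iris-like stand-in for data.csv; names and rows are illustrative.
CLASSIFICATION_CSV = """,Y,X1,X2,X3,X4
s1,setosa,5.1,3.5,1.4,0.2
s2,versicolor,7.0,3.2,4.7,1.4
s3,virginica,6.3,3.3,6.0,2.5
s4,setosa,4.9,3.0,1.4,0.2
"""

def load_classification_data(file_obj):
    """Return class labels (kept as text) and X (converted to floats)."""
    rows = list(csv.reader(file_obj))
    body = rows[1:]                      # skip the row of variable names
    labels = [row[1] for row in body]    # Y stays as a string
    x = [[float(v) for v in row[2:]] for row in body]
    return labels, x

labels, x = load_classification_data(io.StringIO(CLASSIFICATION_CSV))
class_counts = Counter(labels)           # number of samples per class
print(class_counts)
```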

■csv file of test data set 1 [data_prediction1.csv]

Test data set 1 includes Y as well as X. The X variables must match those of data.csv in type and number. If you cannot prepare this data set, copy data.csv and rename the copy data_prediction1.csv.

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

You can choose sample names and variable names freely, but each name must be unique.

■csv file of test data set 2 [data_prediction2.csv]

Test data set 2 does not include Y. The X variables must match those of data.csv in type and number. If you cannot prepare this data set, copy only the X part of data.csv (the sample names and the X columns, without Y) and save it as data_prediction2.csv.

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

You can choose sample names and variable names freely, but each name must be unique.

The estimated values of Y for test data set 2 are saved as “PredictedY2.csv”.

Visualization, clustering and data domain estimation

The csv file required for visualization, clustering and data domain estimation is as follows:
■csv file of training data set [data.csv].

The data format of the csv file is explained next. The sample data set can be downloaded here:

http://univprofblog.html.xdomain.jp/code/SampleDataForUnsupervisedMethod.zip

This is a data set of savings by country. I recommend running the code on the sample data set first.

■csv file of training data set [data.csv]

The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.

You can choose sample names and variable names freely, but each name must be unique.
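The naming rule can be checked before running any analysis. The sketch below reads the requirement as "no two samples share a name and no two variables share a name", which is my interpretation; the file contents and the function names are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for data.csv: one valid, one with repeated names.
GOOD_CSV = """,X1,X2
A,1,2
B,3,4
"""
BAD_CSV = """,X1,X1
A,1,2
A,3,4
"""

def find_duplicates(names):
    """Return each name that occurs more than once, in order of first repeat."""
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates

def check_names(file_obj):
    """Return (duplicate sample names, duplicate variable names)."""
    rows = list(csv.reader(file_obj))
    variable_names = rows[0][1:]                 # top row, minus the corner cell
    sample_names = [row[0] for row in rows[1:]]  # leftmost column
    return find_duplicates(sample_names), find_duplicates(variable_names)

print(check_names(io.StringIO(GOOD_CSV)))  # ([], [])
print(check_names(io.StringIO(BAD_CSV)))   # (['A'], ['X1'])
```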


I am a data scientist in Japan, and I share my knowledge of data analysis along with the accompanying code. The code is written in R, Python and MATLAB.