Data format for MATLAB, R and Python codes of data analysis, and sample data set
I have released MATLAB, R and Python codes for regression, classification, variable selection, visualization, clustering, data domain estimation and so on.
To run the codes, a data set must be loaded, and it must be prepared as a csv file.
In this article, I summarize the data format of the csv file. The data format is the same for MATLAB, R and Python.
Regression and variable selection
The csv files required for regression are as follows:
■csv file of training data set [data.csv],
■csv file of test data set 1 [data_prediction1.csv],
■csv file of test data set 2 [data_prediction2.csv].
The csv file required for variable selection is as follows:
■csv file of training data set [data.csv].
The data format of each csv file is explained below. The sample data set can be downloaded here.
http://univprofblog.html.xdomain.jp/code/SampleDataForRegression.zip
I recommend checking the codes with the sample data set first.
■csv file of training data set [data.csv]
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
The first variable after the sample names is the objective variable Y, and the explanatory variables X follow. You can choose sample names and variable names freely, but each name must be unique.
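As a rough illustration of this layout, the following Python sketch splits a table into sample names, Y and X. It uses only the standard library and is not the loader used in the released codes. The function name load_training_csv, the variable names Y, X1, X2 and the three samples are hypothetical, and the empty top-left cell is an assumption.

```python
import csv
import io


def load_training_csv(file_obj):
    """Split a data.csv-style table into sample names, Y and X."""
    rows = list(csv.reader(file_obj))
    header, body = rows[0], rows[1:]
    y_name = header[1]    # first variable after the sample-name column
    x_names = header[2:]  # all remaining variables are X
    sample_names = [row[0] for row in body]
    y = [float(row[1]) for row in body]
    x = [[float(value) for value in row[2:]] for row in body]
    return sample_names, y_name, y, x_names, x


# Hypothetical three-sample table in the layout described above
# (the empty top-left cell is an assumption).
example = io.StringIO(
    ",Y,X1,X2\n"
    "sample1,1.5,0.1,10\n"
    "sample2,2.5,0.2,20\n"
    "sample3,3.5,0.3,30\n"
)
sample_names, y_name, y, x_names, x = load_training_csv(example)
```

For a real file, the same function can be called as load_training_csv(open("data.csv", newline="")).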
■csv file of test data set 1 [data_prediction1.csv]
Test data set 1 includes Y as well as X. The types and the number of X variables must be the same as those in data.csv. If you cannot prepare this data set, please copy data.csv and rename the copy data_prediction1.csv.
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
You can choose sample names and variable names freely, but each name must be unique.
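Whether a test file has the same X variables as data.csv can be checked before running the codes. The Python sketch below compares only the header rows. The function name x_names_match and the example headers are hypothetical, and treating "the same" as identical names in identical order is my assumption.

```python
import csv
import io


def read_header(file_obj):
    """Return the first row (variable names) of a csv file."""
    return next(csv.reader(file_obj))


def x_names_match(train_header, test_header, test_has_y=True):
    """Compare the X variable names of a test file with those of data.csv.

    Column 0 holds the sample names. In data.csv, column 1 is Y.
    A test file without Y has its X variables starting at column 1.
    """
    train_x = train_header[2:]
    test_x = test_header[2:] if test_has_y else test_header[1:]
    return train_x == test_x


# Hypothetical header rows.
train_header = read_header(io.StringIO(",Y,X1,X2\n"))
test1_header = read_header(io.StringIO(",Y,X1,X2\n"))
bad_header = read_header(io.StringIO(",Y,X1\n"))

same_as_training = x_names_match(train_header, test1_header)
missing_variable = x_names_match(train_header, bad_header)
```

Passing test_has_y=False applies the same check to a file that contains only X.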
■csv file of test data set 2 [data_prediction2.csv]
Test data set 2 does not include Y. The types and the number of X variables must be the same as those in data.csv. If you cannot prepare this data set, please copy only the X part of data.csv (with the sample names) and save it as data_prediction2.csv.
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
You can choose sample names and variable names freely, but each name must be unique.
The estimated values of Y for test data set 2 are saved as “PredictedY2.csv”.
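If you build data_prediction2.csv from data.csv, removing the Y column can be automated. The Python sketch below is one possible way, assuming Y is the column right after the sample names as described above. The function name drop_y_column and the example values are hypothetical.

```python
import csv
import io


def drop_y_column(src, dst):
    """Copy a data.csv-style table to dst without the Y column (column 1)."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        # Keep the sample name (column 0) and every X column (2 onward).
        writer.writerow(row[:1] + row[2:])


# Hypothetical training table.
training = io.StringIO(
    ",Y,X1,X2\n"
    "sample1,1.5,0.1,10\n"
    "sample2,2.5,0.2,20\n"
)
prediction2 = io.StringIO()
drop_y_column(training, prediction2)
result = prediction2.getvalue()
```

With real files, src and dst would be open("data.csv", newline="") and open("data_prediction2.csv", "w", newline="").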
Classification
The csv files required for classification are as follows:
■csv file of training data set [data.csv],
■csv file of test data set 1 [data_prediction1.csv],
■csv file of test data set 2 [data_prediction2.csv].
The data format of each csv file is explained below. The sample data set can be downloaded here.
http://univprofblog.html.xdomain.jp/code/SampleDataForClassification.zip
This is Edgar Anderson's famous iris data set. I recommend checking the codes with the sample data set first.
■csv file of training data set [data.csv]
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
The first variable after the sample names is the objective variable Y, and the explanatory variables X follow. You can choose sample names and variable names freely, but each name must be unique. Y can be given as character strings (class names).
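Because Y holds class names here, it should be read as text rather than converted to numbers. The Python sketch below reads the Y column as strings and counts the samples in each class. The function name load_class_labels, the variable names and the three rows are hypothetical and only mimic the iris layout. They are not taken from the sample file.

```python
import csv
import io
from collections import Counter


def load_class_labels(file_obj):
    """Return the Y column (column 1) as a list of strings."""
    reader = csv.reader(file_obj)
    next(reader)  # skip the row of variable names
    return [row[1] for row in reader]


# Hypothetical iris-like table: Y is a character label.
example = io.StringIO(
    ",Species,Sepal.Length,Sepal.Width\n"
    "s1,setosa,5.1,3.5\n"
    "s2,versicolor,7.0,3.2\n"
    "s3,setosa,4.9,3.0\n"
)
labels = load_class_labels(example)
class_counts = Counter(labels)  # number of samples per class
```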
■csv file of test data set 1 [data_prediction1.csv]
Test data set 1 includes Y as well as X. The types and the number of X variables must be the same as those in data.csv. If you cannot prepare this data set, please copy data.csv and rename the copy data_prediction1.csv.
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
You can choose sample names and variable names freely, but each name must be unique.
■csv file of test data set 2 [data_prediction2.csv]
Test data set 2 does not include Y. The types and the number of X variables must be the same as those in data.csv. If you cannot prepare this data set, please copy only the X part of data.csv (with the sample names) and save it as data_prediction2.csv.
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below.
You can choose sample names and variable names freely, but each name must be unique.
The estimated values of Y for test data set 2 are saved as “PredictedY2.csv”.
Visualization, clustering and data domain estimation
The csv file required for visualization, clustering and data domain estimation is as follows:
■csv file of training data set [data.csv].
The data format of the csv file is explained below. The sample data set can be downloaded here.
http://univprofblog.html.xdomain.jp/code/SampleDataForUnsupervisedMethod.zip
This is a data set of savings for each country. I recommend checking the codes with the sample data set first.
■csv file of training data set [data.csv]
The top row contains the variable names and the leftmost column contains the sample names, as shown in the figure below. There is no objective variable Y in this file.
You can choose sample names and variable names freely, but each name must be unique.
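Duplicate names are easy to overlook in a large file. The Python sketch below lists any sample name or variable name that appears more than once, reading the naming rule above as meaning that no sample name and no variable name may be repeated. The function name find_duplicate_names and the example table are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_names(file_obj):
    """Return (duplicated sample names, duplicated variable names)."""
    rows = list(csv.reader(file_obj))
    variable_names = rows[0][1:]                 # top row, without the corner cell
    sample_names = [row[0] for row in rows[1:]]  # leftmost column
    dup_samples = [n for n, c in Counter(sample_names).items() if c > 1]
    dup_variables = [n for n, c in Counter(variable_names).items() if c > 1]
    return dup_samples, dup_variables


# Hypothetical table with one repeated sample name and one repeated variable name.
example = io.StringIO(
    ",X1,X2,X1\n"
    "countryA,1,2,3\n"
    "countryB,4,5,6\n"
    "countryA,7,8,9\n"
)
dup_samples, dup_variables = find_duplicate_names(example)
```

Both lists are empty when the file follows the naming rule.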