Stanford Research Series: Structure Type Classification Based on Tax Assessor Data


Author: Yue (Major) Zeng, majorzeng@stanford.edu

1. Background and previous work

In regional earthquake simulation, damage occurs when the ground motion demand exceeds the strength capacity of a building. It is often difficult to assess the strength of millions of buildings in an area. The common practice is to sort buildings into groups with the same load-bearing system, such as wood frame or concrete shear wall systems, since the load-bearing system is the key feature that defines the strength of a building. These groups are called structural types. Buildings in each group have similar structural behavior and can be modeled with a single prototype building [1]. However, the structural type is hidden behind the walls, deep in the design drawings, and not openly available most of the time. The most accurate information about a building's structural type comes from in situ assessment by a professional engineer, but the sheer number of buildings in any urban area makes such a method infeasible.

The purpose of this project is to explore machine learning tools that would automate the process of assigning the structural type given certain information about a building.

Past studies have attempted the same task using various remote sensing data. The data used in those studies are basic physical features (height, size, texture) extracted from Lidar and high-resolution satellite images, and the classification model implements feature selection, oversampling and multi-class classification in sequence [2]. This project follows a similar procedure and explores various weight-balancing, feature selection and classification schemes on a different dataset, based on a field survey that includes a broader range of building information.

2. Data processing

The data used in this study is the tax assessor file of San Mateo County for the year 2016. It includes 128137 rows and 149 columns: 71 with numerical values and 78 with strings. Among all columns, 39 have less than 1% of examples filled and are dropped due to extremely low completion. Another 15 string columns contain more than 128 unique values (0.1% of the number of examples); these are determined to hold specific information rather than classifications and are also dropped. The resulting data frame has 51 numerical columns and 44 string columns.
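As an illustration, here is a minimal pandas sketch of the cleaning, encoding and splitting steps described in this and the following paragraphs. The file path, and all column names other than "CONSTRUCTION TYPE", are hypothetical, and min-max scaling is assumed for the normalization step, which the text does not specify:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the assessor file (path is hypothetical).
df = pd.read_csv("san_mateo_2016_assessor.csv", low_memory=False)

# Drop columns with less than 1% completion.
df = df.loc[:, df.notna().mean() >= 0.01]

# Drop string columns with more than 128 unique values (0.1% of examples):
# these hold specific information (names, addresses), not classifications.
str_cols = df.select_dtypes(include="object").columns
too_specific = [c for c in str_cols
                if c != "CONSTRUCTION TYPE" and df[c].nunique() > 128]
df = df.drop(columns=too_specific)

# Keep only labeled rows; one-hot encode the remaining string columns.
labeled = df[df["CONSTRUCTION TYPE"].notna()]
y = labeled["CONSTRUCTION TYPE"]
X = pd.get_dummies(labeled.drop(columns=["CONSTRUCTION TYPE"])).fillna(0)

# Normalize all features to [0, 1], then split 70/15/15.
X = MinMaxScaler().fit_transform(X)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0)
```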

The column "CONSTRUCTION TYPE" directly indicates the structural type, which is the information this study tries to predict. Thus, the column "CONSTRUCTION TYPE" is treated as the label.

Among the 128137 examples, 23143 are labeled and 104994 are not; the labeled subset is used for this study. All columns that contain strings are treated as categorical information and converted to numerical values by one-hot encoding. The resulting data includes 23143 examples, each with one of 16 labels and 531 features. All features are normalized. The labeled data set is randomly shuffled and partitioned into a 70% training set, a 15% validation set and a 15% test set.

3. Attempts to handle the imbalanced data set

For the training data set of 16201 examples, the 16 construction type labels, their definitions and the number of examples in each category are shown in Table 1 below:

It is observed that the majority of the buildings are either wood or frame. All other structural types each account for less than 5% of the examples and could be considered rare events. This is because San Mateo County is by majority a residential area, and 99% of American single-family homes are built with wood structures. Such a dataset is very imbalanced, and it is expected that any model would have low accuracy when predicting the "rare event" classes. However, even if the percentage of buildings that are concrete, steel or masonry is small, it is very important to capture them correctly, since they have very different structural behaviors than wooden buildings, which matters in damage simulation.

Three methods are considered in this study to overcome the issue of the imbalanced data set: 1) random over-sampling, 2) random under-sampling and 3) training with weighted samples [1]. A sketch of two of these schemes follows below.
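The sketch below, continuing the preprocessing snippet, shows random over-sampling implemented directly with numpy and weighted samples via sklearn's sample weights; under-sampling is analogous, randomly dropping majority-class rows instead. The helper name, the `rare_classes` labels and the 10% target are illustrative, not the study's exact code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

def oversample_rare(X, y, rare_classes, target_frac=0.10):
    """Duplicate rare-class rows at random until each rare class has at
    least target_frac times as many examples as the largest class."""
    counts = y.value_counts()
    target = int(target_frac * counts.max())
    idx = np.arange(len(y))
    parts = [idx]
    for c in rare_classes:
        c_idx = idx[(y == c).to_numpy()]
        if 0 < len(c_idx) < target:
            parts.append(rng.choice(c_idx, size=target - len(c_idx),
                                    replace=True))
    keep = np.concatenate(parts)
    return X[keep], y.iloc[keep].reset_index(drop=True)

# 1) random over-sampling (label names here are hypothetical)
X_bal, y_bal = oversample_rare(X_train, y_train,
                               rare_classes=["CONCRETE", "STEEL", "MASONRY"])

# 3) weighted samples: weight each example inversely to its class frequency
w = compute_sample_weight(class_weight="balanced", y=y_train)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=w)
```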

For the random sampling methods, the sampling goal is to make the number of examples in the concrete, masonry and steel classes at least 10% of that in the wood classes. A simple binary relevance classifier with logistic regression is used to make predictions and score accuracy [2]. The model builds 16 logistic regression classifiers and computes the probability of an example belonging to each category: fᵢ(x), i ∈ {1, 2, …, 16}. It then predicts that the example belongs to the category with the highest probability: ŷ = argmaxᵢ fᵢ(x).
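A minimal version of this baseline, assuming sklearn's OneVsRestClassifier as the binary relevance implementation and the X/y splits from the earlier preprocessing sketch:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One logistic regression per construction type; predict_proba gives
# f_i(x) for each of the 16 classes, and predict takes the argmax.
baseline = OneVsRestClassifier(LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("validation accuracy:", baseline.score(X_val, y_val))
```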

In total, there are 69 examples in the concrete, masonry or steel classes in the validation set. All 531 features are used in training. The evaluation criteria are test accuracy and the number of samples misclassified in the concrete, steel and masonry classes (both false positives and false negatives).

All three methods are implemented alongside a baseline case where no manipulation of examples is done. The results are summarized in Table 2 below. There is no improvement in either test accuracy or the number of misclassified examples in rare categories from any of the methods. Rather, when the weight of rare events in the data is raised, all three methods mislead the model into classifying many more wood structure examples into the rare categories. As a result, the false positives in the rare categories overwhelm the true positives and make the classification less reliable.

4. Feature Selection Attempts

The number of features for each example in this dataset is not trivial (531). Even though it is not large enough to be an issue for the computing power available, it is hard to collect this many features in the field. The dataset used here comes from a US tax assessor file, which is a very intensive survey, and it would be very hard to collect such detailed data on other occasions. On the other hand, because this data was collected for a different purpose than classifying structural types, many features are repetitive or not very relevant. In this section, the study explores two different ways to rank the importance of features and runs a series of experiments to find the number of top-ranked features needed to produce a relatively accurate classification.

4.1 Feature Ranking

In practice, filter feature selection methods are the easiest to implement, since they do not depend on the type of classifier used later in the process. Two filter methods are considered in this study to rank the importance of features: the chi-square method and the one-off method. The chi-square test is aimed at testing the independence of two events, in this case one specific feature and one label. The chi-square score is calculated as χ² = Σ (observed − expected)² / expected, summed over the combinations of feature presence/absence and class membership.
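A short sketch of the chi-square ranking using sklearn, continuing the earlier snippets. The k=50 cutoff is illustrative; note that chi2 requires non-negative features, which holds here after one-hot encoding and min-max normalization:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Score every feature against the construction-type label.
selector = SelectKBest(chi2, k=50).fit(X_train, y_train)
ranking = np.argsort(selector.scores_)[::-1]   # feature indices, best first
X_train_top = selector.transform(X_train)      # keep only the top 50 features
```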

The scores and ranks for the two methods are plotted in Figure 1 and Figure 2 below. For the chi-square test, the score decays exponentially as the rank increases; features ranked beyond 150 are likely to have little relevance. For the one-off score, features ranked from 25 to 450 have essentially the same score; only the top 25 ranked features demonstrate more importance than the rest of the set.

To understand the results of these ranking experiments, all features are sorted into 10 groups according to the type of information they provide. These categories, the total number of features in each category, and the number of features ranked in the top 50 by the two ranking methods are summarized in Table 3 and Figure 3 below.

The chi-square test ranked many features regarding building location, architectural layout (number of bedrooms, basement and garage) or structural components (frame, wall and foundation) in the top 50, indicating that those are relevant information for predicting structural type. A large percentage of features describing non-structural components (air conditioning, fireplace and sewer) or the size of the building are also selected, indicating that those categories are important as well. The one-off test ranked many features about owner status or building location in the top 50, indicating that those are important information categories. The rankings from these two methods disagree on the importance of many feature categories. However, both ranked only one feature about the value of the building in the top 50, indicating that the value of a building is not closely related to its structural type and thus to the strength of the building.

4.2 Comparison of Ranking Methods

Three different feature selection methods are tested in the following experiments: 1) chi-square scoring, 2) one-off testing and 3) manual selection based on domain knowledge. The same number of top-ranked features is drawn by each method and used as input to the same baseline binary relevance classifier with logistic regression. The resulting accuracy and number of misclassified rare events are summarized in Table 4, Figure 4 and Figure 5 below.
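A sketch of this experiment, continuing the earlier snippets; the feature counts and the `rare_classes` label names are illustrative, and swapping in a different ranking order reproduces the comparison for the other two methods:

```python
import pandas as pd

rare_classes = ["CONCRETE", "STEEL", "MASONRY"]   # hypothetical label names

results = {}
for k in (10, 25, 50, 100, 150):                  # illustrative feature counts
    cols = ranking[:k]                            # chi-square order; swap in
                                                  # another ranking to compare
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train[:, cols], y_train)
    pred = pd.Series(clf.predict(X_val[:, cols]), index=y_val.index)
    acc = (pred == y_val).mean()
    # rare-event errors: false negatives and false positives in rare classes
    rare_err = ((pred != y_val) &
                (y_val.isin(rare_classes) | pred.isin(rare_classes))).sum()
    results[k] = (acc, rare_err)
```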

The results show that all three methods have the same high accuracy when using the top 150 features. However, as the number of features decreases, the accuracy of the one-off-selected model decreases significantly, while the chi-square and manual selection models remain at high performance. The number of misclassified rare events increases significantly as the number of features decreases for the manual selection and one-off test models, while the performance of the chi-square model remains strong. Overall, chi-square is the strongest feature selection method among the three and is thus carried into the next section.

5. Prediction Methods

This section explores the performance of 5 different multi-class classification algorithms on this dataset: 1) binary relevance classifier with logistic regression, 2) k-nearest neighbor classifier, 3) decision tree classifier, 4) random forest classifier and 5) a 2-layer neural network.

K-nearest neighbor classifier: for each test example x with p features, the algorithm searches for its k nearest neighbors (x_j1, x_j2, …, x_jk) from the training set, with the (Euclidean) distance defined by d(x, x_j) = √( ∑_{l=1}^{p} (x_l − x_{j,l})² ).

The model predicts the label of test example x as the most frequent label among its k nearest training examples [6].

It is implemented with the sklearn KNeighborsClassifier function [7]. Through iteration, the best hyperparameter k is determined to be 5.

Decision tree classifier: the algorithm repeatedly searches for the attribute with the highest information gain, based on the entropy H = −∑ᵢ pᵢ log pᵢ, to partition the data into subsets [8], as demonstrated in Figure 6 below [9]. The algorithm is implemented with the sklearn DecisionTreeClassifier function [8].

Random forest classifier: a meta-estimator that fits a number of decision tree classifiers on various subsamples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The algorithm is implemented with the sklearn RandomForestClassifier function [9].

Multilayer perceptron neural network: the model is a single 2-layer neural network with a sigmoid activation function in the hidden layer and a softmax output. The loss function in optimization is the sum of the cross-entropy loss over all labels [10]. The algorithm is implemented with the sklearn MLPClassifier function. Through iteration, the number of neurons in the hidden layer is chosen to be 40. This method takes noticeably longer than the previous ones.
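For concreteness, the five classifiers could be instantiated in sklearn roughly as follows, using the hyperparameters reported above (the random_state values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "binary relevance + logistic regression":
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "k-nearest neighbor (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # one hidden layer of 40 sigmoid units; MLPClassifier applies a softmax
    # output automatically for multi-class problems
    "2-layer neural network":
        MLPClassifier(hidden_layer_sizes=(40,), activation="logistic",
                      max_iter=500, random_state=0),
}
```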

Each classifier model is run with the top 10, 20, …, 70 features by relevance according to the chi-square ranking. Test accuracy and the number of misclassified rare events are summarized in Table 5, Figure 8 and Figure 9 below.
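Continuing the previous sketch, the experiment loop might look like:

```python
results = {}
for n_feats in range(10, 80, 10):        # top 10, 20, ..., 70 features
    cols = ranking[:n_feats]             # chi-square order from Section 4
    for name, model in models.items():
        model.fit(X_train[:, cols], y_train)
        results[(n_feats, name)] = model.score(X_test[:, cols], y_test)
```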

The results show that the neural network, k-nearest neighbor and random forest methods perform consistently better than the baseline binary relevance classification. The decision tree method performs well with the minimum number of features; however, its performance deteriorates as the number of features increases. Random forest performs better than the baseline but slightly worse than k-nearest neighbor or the neural network, and it is also less stable.

Overall, k-nearest neighbor and the neural network are the best methods among those tested, although the neural network requires more computing power.

6. Prediction for the full dataset

The k-nearest neighbor model and the neural network model are retrained with all 23143 labeled examples and used to predict the structural types of the 104994 unlabeled examples.
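A sketch of this final step, reusing the model definitions above; `X_all_labeled`, `y_all_labeled` and `X_unlabeled` are hypothetical arrays built with the same columns and scaling as before:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Retrain the two best models on all labeled rows, then predict the
# structural type of the unlabeled building stock.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_all_labeled, y_all_labeled)
mlp = MLPClassifier(hidden_layer_sizes=(40,), activation="logistic",
                    max_iter=500, random_state=0).fit(X_all_labeled,
                                                      y_all_labeled)

# Predicted class shares on the unlabeled stock
print(pd.Series(knn.predict(X_unlabeled)).value_counts(normalize=True))
print(pd.Series(mlp.predict(X_unlabeled)).value_counts(normalize=True))
```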

The predictions from both methods are highly skewed toward the common class "frame": k-nearest neighbor assigns 98.2% of buildings to that category and the neural network 95.7%. The neural network predicts a slightly higher number of buildings in the concrete and masonry structural types than k-nearest neighbor. Neither model seems able to classify rare events well. However, since no labels are available, it is not possible to evaluate the accuracy of these predictions. Even though the predicted class distribution differs from that of the labeled set, and such a skewed class composition does not inspire much confidence, the prediction is still plausible: since the labels are manually logged and sparse, buildings of certain rare structural types may be more likely to receive a label than commonly seen ones.

7. Conclusion and future work

In conclusion, this study explored various methods of feature selection, sample balancing and multi-class classification, aiming to predict structural type from tax assessor building information. It shows that none of the weight-rebalancing schemes improves prediction accuracy, that chi-square ranking is the best feature selection method among those tested, and that k-nearest neighbor and the neural network are the better classification algorithms. However, none of the models truly solves the problem of imbalanced classes or gives satisfactory performance in predicting the rare classes. In the future, a search for statistical information on the structural type distribution of the region would serve as validation for the prediction over the full building stock. More data about concrete, steel and masonry buildings in residentially dominated areas could be collected to improve classification of those structural types. More advanced wrapper-type feature selection could be integrated into the classification model, and more advanced weight penalty functions could be embedded in the optimization process to yield better-performing classifiers.

Project Code

https://github.com/MajorZengatStanfordedu/CS229-Project

Works Cited

[1] C. A. Kircher, R. V. Whitman and W. T. Holmes, "HAZUS earthquake loss estimation methods," Natural Hazards Review, vol. 7, no. 2, pp. 45–59, 2006.

[2] C. Geiß et al., "Estimation of seismic building structural types using multi-sensor remote sensing and machine learning techniques," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 104, pp. 175–188, 2015.

[3] V. Kumar, "Use sample weight in multi-label classification," [Online]. Available: https://stackoverflow.com/questions/49534490/use-sample-weight-in-multi-label-classification.

[4] scikit-learn, "sklearn.multiclass.OneVsRestClassifier," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html. [Accessed 1 June 2019].

[5] Stanford NLP, "Chi2 feature selection," [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html.

[6] scikit-learn, "sklearn.feature_selection.chi2," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2.

[7] Stanford NLP, "Mutual information," [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel.

[8] L. E. Peterson, "K-nearest neighbor," Scholarpedia, vol. 4, no. 2, 2009.

[9] scikit-learn, "sklearn.neighbors.KNeighborsClassifier," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.

[10] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, no. 3, pp. 660–674, 1991.

[11] A. Navlani, "Decision Tree in Python," [Online]. Available: https://www.datacamp.com/community/tutorials/decision-tree-classification-python.

[12] scikit-learn, "sklearn.tree.DecisionTreeClassifier," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.

[13] scikit-learn, "sklearn.ensemble.RandomForestClassifier," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.

[14] scikit-learn, "sklearn.neural_network.MLPClassifier," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.

[15] scikit-learn, "sklearn.feature_selection.mutual_info_classif," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif.
