Cleaning the data

Becky Zhu
Published in unpack, Apr 27, 2021

Dirty data contains mistakes such as spelling or punctuation errors, incorrect values in a field, incomplete or outdated entries, or records that have been duplicated in the database. Cleaning such data is often said to take up as much as 90% of a data scientist's time, so it is an important part of any project.

What is Data Cleaning?

Data cleaning is the process of modifying data to ensure that it is free of irrelevant and incorrect information. Also known as data cleansing, it entails identifying the incorrect, irrelevant, incomplete, or otherwise “dirty” parts of a dataset and then replacing or correcting them.

Why Clean Data?

Improving data quality through data cleaning can eliminate problems like expensive processing errors, manual troubleshooting, and incorrect invoices.

By cleansing data and managing its quality, businesses can achieve a wide range of benefits, including lower operational costs and higher profits.

Although sometimes thought of as boring, data cleansing is very valuable because it improves data quality and, with it, the reliability of any analysis built on the data. The process may involve removing typographical errors, validating data, and enhancing data, and it continues until the data meets the quality criteria.

What are the criteria for quality data?

There are five criteria for quality data.

1. Validity

The degree to which the measures conform to defined business rules or constraints. Validity is fairly easy to ensure with modern database technology. Invalid data arises mainly in two situations: where constraints were not implemented, or where inappropriate data-capture technology was used.

Data constraints fall into the following categories.

a. Data-type constraints: Values in a particular column must be of a particular data type.

b. Range constraints: Typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.

c. Mandatory constraints: Certain columns cannot be empty.

d. Unique constraints: A field, or a combination of fields, must be unique across a dataset.

e. Set-membership constraints: The values for a column come from a set of discrete values or codes.

f. Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values.

g. Regular expression patterns: Occasionally, text fields will have to be validated this way.

h. Cross-field validation: Certain conditions that utilize multiple fields must hold.
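To make these categories concrete, here is a minimal sketch that checks a few records against one rule of each kind. The records, field names, age limits, country codes, and the simplified email pattern are all assumptions made up for illustration. A foreign-key check would look like the set-membership check, with the allowed set read from another table.

```python
import re
from datetime import date

# Hypothetical customer records; every field name here is an assumption.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34, "country": "US",
     "signup": date(2020, 1, 5), "last_order": date(2021, 3, 1)},
    {"id": 2, "email": "bob[at]example.com", "age": 212, "country": "XX",
     "signup": date(2021, 2, 1), "last_order": date(2020, 12, 1)},
    {"id": 2, "email": None, "age": "41", "country": "DE",
     "signup": date(2019, 7, 9), "last_order": date(2019, 8, 1)},
]

ALLOWED_COUNTRIES = {"US", "DE", "CN"}                    # set-membership rule
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")   # simplified pattern


def validate(rows):
    """Return a list of (row_index, violated_rule) pairs."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Mandatory: email must be present.
        if row["email"] is None:
            problems.append((i, "mandatory:email"))
        # Regular expression: email must look like an address.
        elif not EMAIL_PATTERN.fullmatch(row["email"]):
            problems.append((i, "regex:email"))
        # Data type: age must be an integer.
        if not isinstance(row["age"], int):
            problems.append((i, "type:age"))
        # Range: age must fall between 0 and 120.
        elif not 0 <= row["age"] <= 120:
            problems.append((i, "range:age"))
        # Unique: id must not repeat.
        if row["id"] in seen_ids:
            problems.append((i, "unique:id"))
        seen_ids.add(row["id"])
        # Set membership: country must be a known code.
        if row["country"] not in ALLOWED_COUNTRIES:
            problems.append((i, "set:country"))
        # Cross-field: an order cannot precede the signup date.
        if row["last_order"] < row["signup"]:
            problems.append((i, "cross-field:last_order<signup"))
    return problems


problems = validate(records)
for index, rule in problems:
    print(f"row {index}: violates {rule}")
```

Collecting the violations instead of stopping at the first one lets us see every problem in a row before deciding whether to fix or drop it.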

2. Accuracy:

The degree to which the data is right, that is, the conformity of a measure to a standard or a true value. Accuracy is very hard to achieve through data cleaning in the general case, because it requires access to an external source of data that contains the true value, and such “gold standard” data is often unavailable.

3. Completeness:

The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleaning methodology alone, since facts that were never recorded cannot be inferred.

4. Consistency:

The degree to which a set of measures is equivalent across systems. Inconsistency occurs when two data items in the dataset contradict each other.

5. Uniformity:

The degree to which a set of data measures is specified using the same units of measure in all systems. It is often quite easy to ensure this through data cleaning early in the process, but as the process moves along, and data is transformed and changed, it becomes far more difficult.
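As a small illustration of uniformity, the sketch below converts weights recorded in different units to kilograms. The unit labels and the sample measurements are assumptions for the example, not part of any particular dataset.

```python
# Conversion factors to kilograms; the unit labels are assumptions.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kilograms(value, unit):
    """Convert a weight to kilograms, failing loudly on unknown units."""
    try:
        return value * TO_KG[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None


# Hypothetical measurements recorded by different systems.
measurements = [(2.5, "kg"), (750, "g"), (10, "LB")]
weights_kg = [round(to_kilograms(v, u), 3) for v, u in measurements]
print(weights_kg)
```

Raising an error on an unknown unit is a deliberate choice here: silently passing the value through would hide exactly the kind of inconsistency we are trying to remove.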

How to clean the data?

A simple procedure to clean the data:

1. Remove unwanted observations from your dataset, including duplicates and observations that are irrelevant to the problem.

2. Fix structural errors.

3. Filter unwanted outliers.

4. Handle missing data.

5. Validate and QA.
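The five steps above can be sketched on a tiny dataset using only the Python standard library. The city and price data are made up, and the specific choices (the 1.5 × IQR rule for outliers, median imputation for missing values) are common options assumed for illustration, not the only way to do each step.

```python
from statistics import median, quantiles

# Hypothetical raw observations: (city, price). None marks a missing price.
raw = [
    ("Boston", 310.0), ("Boston", 310.0), ("boston ", 295.0),
    ("Chicago", 280.0), ("CHICAGO", None), ("Chicago", 305.0),
    ("Denver", 290.0), ("Denver", 9999.0), ("Denver", 300.0),
]

# 1. Remove duplicate observations (exact repeats), keeping the first copy.
rows = list(dict.fromkeys(raw))

# 2. Fix structural errors: stray whitespace and inconsistent capitalization.
rows = [(city.strip().title(), price) for city, price in rows]

# 3. Filter unwanted outliers with the 1.5 * IQR rule.
known = [price for _, price in rows if price is not None]
q1, _, q3 = quantiles(known, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
rows = [(c, p) for c, p in rows if p is None or low <= p <= high]

# 4. Handle missing data: impute the median of the remaining prices.
fill = median(p for _, p in rows if p is not None)
rows = [(c, fill if p is None else p) for c, p in rows]

# 5. Validate and QA: confirm the cleaned data meets our expectations.
assert all(p is not None and low <= p <= high for _, p in rows)
assert {c for c, _ in rows} == {"Boston", "Chicago", "Denver"}

for row in rows:
    print(row)
```

Missing values are left in place during outlier filtering and only filled afterwards, so the extreme price does not distort the median used for imputation.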

Tips for Data Cleaning in Classification Problems

In a classification problem, we can create a confusion matrix to see whether and where classification errors occur.
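To show what a confusion matrix contains, here is a minimal sketch that builds one by hand from two lists of labels. The class names and labels are invented for the example. In practice a library would compute and plot this for us.

```python
from collections import Counter

# Hypothetical labels for a three-class bear classifier.
actual = ["black", "grizzly", "teddy", "grizzly", "black", "teddy", "grizzly"]
predicted = ["black", "black", "teddy", "grizzly", "black", "teddy", "black"]

classes = sorted(set(actual) | set(predicted))
counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
matrix = [[counts[(a, p)] for p in classes] for a in classes]

print("rows = actual, columns = predicted:", *classes)
for name, row in zip(classes, matrix):
    print(f"{name:>8}", *row)

# Off-diagonal cells show where the classification errors occur.
confused = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
print(confused)
```

Every count off the diagonal is a mistake, so the largest off-diagonal cell points at the pair of classes most worth inspecting first.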

We usually train a quick and simple model first and then use it to help us with our data cleaning.

Classification errors come from two sources:

1. Dataset problems: in a bear classifier, for example, images that aren’t bears at all, or that are labeled incorrectly.

2. Model problems: the model isn’t handling images taken with unusual lighting, from a different angle, and so on.

plot_top_losses shows us the images with the highest loss in our dataset. Sorting the images by their loss helps us find the problems.

Each image is labeled with four things: prediction, actual (target label), loss, and probability. The probability here is the confidence level, from zero to one, that the model has assigned to its prediction.

The loss is a number that is higher if the model is incorrect (especially if it’s also confident of its incorrect answer), or if it’s correct but not confident of its correct answer.
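A quick calculation shows this behavior. The sketch assumes cross-entropy loss, a common choice for classification, where the loss for one image is the negative log of the probability the model gave to the correct label. The three probabilities are made-up examples.

```python
import math


def cross_entropy(prob_of_true_class):
    """Negative log-likelihood of the correct class for one image."""
    return -math.log(prob_of_true_class)


# Probability the model assigned to the *correct* label in three cases.
confident_and_correct = cross_entropy(0.95)
correct_but_unsure = cross_entropy(0.55)
confident_and_wrong = cross_entropy(0.05)  # 95% went to a wrong class

for name, loss in [
    ("confident and correct", confident_and_correct),
    ("correct but unsure", correct_but_unsure),
    ("confident and wrong", confident_and_wrong),
]:
    print(f"{name}: loss = {loss:.2f}")
```

The loss grows slowly while the model is right and sharply once it becomes confidently wrong, which is why sorting by loss pushes mislabeled images to the top.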

How to change the class and delete data

fastai provides a handy GUI for data cleaning called ImageClassifierCleaner. It lets us choose a category and the training or validation set, and view the highest-loss images in order, along with menus that allow each image to be selected for removal or relabeling.

1. To change the class

To move images for which we’ve selected a different category, we would run:

import shutil
for idx, cat in cleaner.change():
    shutil.move(str(cleaner.fns[idx]), path/cat)

2. To delete the data

To delete an image, choose <delete> in the menu under it. ImageClassifierCleaner doesn’t actually do the deleting or changing of labels for you; it just returns the indices of items to change. So, for instance, to delete (unlink) all images selected for deletion, we would run:

for idx in cleaner.delete():
    cleaner.fns[idx].unlink()

ImageClassifierCleaner in fastai is helpful for solving these problems.
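To try the file operations without a trained model, the sketch below runs the same two loops on a throwaway folder. The lists to_delete and to_change are hypothetical stand-ins for what cleaner.delete() and cleaner.change() return, and the folder and file names are made up.

```python
import shutil
import tempfile
from pathlib import Path

# Build a throwaway dataset: one folder per category (names are assumptions).
path = Path(tempfile.mkdtemp())
for cat in ("grizzly", "teddy"):
    (path / cat).mkdir()
fns = [
    path / "grizzly" / "a.jpg",
    path / "grizzly" / "b.jpg",
    path / "teddy" / "c.jpg",
]
for fn in fns:
    fn.touch()

# Stand-ins for what cleaner.delete() and cleaner.change() return.
to_delete = [1]              # indices into fns
to_change = [(0, "teddy")]   # (index, new category) pairs

for idx in to_delete:
    fns[idx].unlink()        # the parentheses matter: unlink must be called
for idx, cat in to_change:
    shutil.move(str(fns[idx]), str(path / cat))

remaining = sorted(p.relative_to(path).as_posix() for p in path.rglob("*.jpg"))
print(remaining)

shutil.rmtree(path)          # clean up the temporary folder
```

Running the deletions before the moves means an image marked for deletion is removed from its original location, before any relabeling changes where files live.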

Conclusion

As digitalization advances, data is becoming an ever more important company asset. Each day we obtain all kinds of data from different sources, and most of the time it contains incorrect or irrelevant information. We therefore need to clean it before we use it, which helps us reduce costs and improve efficiency and accuracy. Data cleaning should become a habit.
