Data Preparation for Machine Learning

A Value-Added Engineering Perspective

Jack Copper
The Startup
May 11, 2020


The Data Preparation Maze

Preparing data is a fundamental activity in any machine learning project. Without adequately prepared data, machine learning algorithms can’t be trained, evaluated, or deployed. Data preparation often consumes well over half the budget of a machine learning project; consequently, it is important to gain as much value as possible from the data preparation process.

Note that throughout this document we use ‘model’ interchangeably with ‘machine learning algorithm’.

We present here a standardized approach to preparing data for training models. Our approach decouples preparing data from training models — making the approach equally applicable to preparing data for processing by an already-trained model.

Information generated during data preparation provides details of the characteristics of data — which in turn can inform decisions related to more effective collection of future raw data, and can offer insights when performance of an algorithm trained with the data does not meet expectations.

This document contains the following sections:

The Preliminaries — Introduces and explains fundamental concepts and requirements for preparing data.

Minimal Data Preparation — Outlines minimal capabilities and expected outcomes for a data preparation process.

Robust Data Preparation — Identifies additional data preparation processing that yields significantly more useful information about both the original raw data file as well as the clean prepared data file.

Best Practices — Summarizes the key concepts and processing approaches presented and suggests best practices.

Resources — Provides a brief introduction to the neuralstudio.ai automated machine learning platform — which generated the results used in explanations of concepts and strategies.

In the interest of clarity, and to avoid the clutter of tangential explanations related to accessing any particular database or other type of persistent storage, we base this discussion on data in the form of a ‘flat’ raw text file — a format that can be easily derived from any storage type or technology, and that is ubiquitous in machine learning projects. Subsequently, we will refer to an original baseline collection of data as a ‘raw data file’.

Furthermore, while our focus is preparing data, it bears mention that partitioning data (for example, into model training and model validation sets) is also an important activity in the full model development life cycle. Usually, partitioning involves simply holding out a portion of prepared data for model validation. More complex partitioning is beyond the current scope; examples include time series data, in which early portions of the series are used for training and the latest portion is used for validation, and further sampling of prepared data to reflect desired distributions of values identified during the data preparation process.

The Preliminaries

The primary motivation for preparing data is obvious: the core mathematics which underlie any machine learning algorithm require clean data. Proper data preparation ensures that data used in computations is of the correct type and is appropriately formatted so that training the model (as well as running the model when it is deployed) can proceed without sustaining exceptions.

Note that we draw an important distinction between ‘clean’ data, and ‘validated’ data.

A ‘clean’ data value (a) is not null (that is, the value is not empty), and (b) is correctly formatted for the particular field/source that produced the data.

In other words, if a field/data source is expected to contain/generate numeric values then a clean value is a number (meaning the value can be converted internally by a programming language to an integer or a floating-point primitive without causing an exception). If an exception occurs, that particular value representing the field/data source is treated as an arbitrary symbol (an alphanumeric character string). The final determination of whether a field is considered numeric or alphanumeric depends on the relative counts of occurrences of numbers and alphanumeric strings for each field in the raw data file.
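The determination described above can be sketched in Python (a minimal illustration; the function names are ours, not the article’s):

```python
def classify_value(value: str) -> str:
    """Classify one raw field value: 'numeric' if it converts to a
    number without raising an exception, otherwise 'string'."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "string"


def classify_field(values: list[str]) -> str:
    """Decide a field's overall type from the relative counts of
    numeric values and alphanumeric strings observed for the field."""
    numeric = sum(1 for v in values if classify_value(v) == "numeric")
    return "numeric" if numeric > len(values) - numeric else "string"
```

Note that `float()` also accepts scientific notation such as `1e5`, matching the discussion of numeric representation later in this section.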

A ‘validated’ data value has been further processed to confirm that the actual value (either as a number, or as a specific character string) lies within acceptable bounds, defined in the context of domain knowledge for a particular problem/environment.

Our focus here is the creation of clean data files from original raw (and potentially ‘messy’) data files. We acknowledge that validated data files are also important; however, producing a validated data file from a clean data file is relatively straightforward as long as the process which generates the clean data file produces information about the ranges of data values for every field in both raw and clean files.

Data Organization

As the first step in preparing data for use with machine learning algorithms, it is important to identify/confirm fundamental attributes of the data.

In a raw data file, data values are organized as lines (also called records), with individual values (either alphanumeric strings or numbers) contained in fields within records.

Special Characters

Fundamental attributes of a raw data file are defined by a small set of Special Characters.

The separation of data values into lines is based on a ‘Line Termination Style’. The separation of data lines into fields depends on a specific ‘Field Delimiter’ character that is unique within a file.

The Line Termination Style normally reflects the operating system in use when the file is created — either a version of Microsoft Windows using the Windows style of line termination; or MacOS/Linux/Unix, all of which use the same (but different from Windows) style of line termination.¹ However, the creator of a raw data file may explicitly set the Line Termination Style, and the creator of the raw data file always specifies the Field Delimiter.
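A data preparation tool might confirm the Line Termination Style by inspecting a sample of raw bytes, along these lines (a sketch; the function name and labels are ours):

```python
def detect_line_termination(sample: bytes) -> str:
    """Detect the Line Termination Style from a sample of raw bytes.
    CRLF must be checked first, since a CRLF file also contains LF bytes."""
    if b"\r\n" in sample:
        return "CRLF (Windows)"
    if b"\n" in sample:
        return "LF (MacOS/Linux/Unix)"
    if b"\r" in sample:
        return "CR (classic MacOS)"
    return "unknown"
```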

As another Special Character, the creator of a raw data file may define a particular ‘Comment Character’, such as ! or #, to identify comment lines that should be ignored when the file is processed (that is, the lines should never be presented to a model). Comment lines permit including explanatory information in a raw data file. If a ‘Comment Character’ is defined, the data preparation process must ensure that comment lines are removed (not presented to a model), both during training and when a model is run.

Similarly, the creator of a raw data file may define a sequence of characters (for example, ‘-999’ or ‘N/A’) as an ‘Unknown Value Marker’ to indicate that a particular data value is not known.

An Unknown Value Marker permits distinguishing between a truly missing value (for example, a respondent left a field blank on a form), and a value that is unknown (the respondent provided an illegible response in a field on a form). In general, to a machine learning algorithm the implication is the same — a field whose ‘value’ is missing or is the Unknown Value Marker, should not be used in training or when a model is run.

If an Unknown Value Marker is defined, all occurrences of the character sequence must be identified and processed appropriately.

In particular, if the Unknown Value Marker is a number such as -999, and records containing the Unknown Value Marker are not ignored (removed), statistics generated on a per-field basis will not be accurate because the numeric value will incorrectly influence calculations. As a consequence, performance of models trained with the data will not reflect real causal relationships. Appropriate handling of Unknown Value Markers as well as missing values entails pre-processing the original raw data file, and removing (ignoring) all records with fields whose values are either missing or are the Unknown Value Marker.
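Filtering out records with missing or unknown values can be sketched as follows (illustrative only; the example records are invented, and ‘-999’ is the article’s example marker):

```python
UNKNOWN_MARKER = "-999"  # the marker is always defined by the file's creator


def is_complete(record: list[str], marker: str = UNKNOWN_MARKER) -> bool:
    """A record is usable only when no field is empty and no field
    equals the Unknown Value Marker."""
    return all(f.strip() not in ("", marker) for f in record)


records = [["25", "blue"], ["-999", "red"], ["30", ""]]
clean = [r for r in records if is_complete(r)]  # keeps only ["25", "blue"]
```

Removing such records before computing statistics prevents a numeric marker like -999 from skewing per-field means and standard deviations, as described above.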

The final important issue related to Special Characters is how numeric values are represented. Integer representation is generally straightforward — integers have an optional sign (+ or -) and a value based on only the digit characters 0–9.² The representation of real numbers, on the other hand, requires a ‘Decimal Separator’ character (either ‘.’ or ‘,’). Usually the Decimal Separator is determined by the Locale or Region setting of the operating system; however, depending on how the raw data file was created, the Decimal Separator of the file may not match the Decimal Separator of the operating system, which can lead to unexpected errors during processing.

Additionally, both integers and real numbers can be represented in scientific notation (which means a value could include the non-digit ‘e’ or ‘E’ character), and integers and real numbers can each have an associated ‘Currency Symbol’ and/or a ‘Thousands Separator’ character.

The Thousands Separator must be consistent with the Decimal Separator, and this is always the case when numeric values are generated by a programming language or saved from a spreadsheet. However, problems can arise when numeric data in a spreadsheet using a US Locale contains numbers which have the US Thousands Separator (comma), and the spreadsheet is saved as a CSV (comma separated value) file. Microsoft Excel and other spreadsheets will surround such values with double-quote marks, which, while nominally ensuring the value is correct, can potentially cause problems when the file is read by other software that automatically treats values within quotation marks as alphanumeric strings.
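One way to normalize such values before conversion is sketched below (the function name and defaults are ours; real data preparation software would take the separators from the file’s declared attributes):

```python
def parse_number(text: str, decimal_sep: str = ".",
                 thousands_sep: str = ",") -> float:
    """Strip surrounding double-quote marks, remove the Thousands
    Separator, and map the locale's Decimal Separator to '.' so that
    float() can parse the value."""
    cleaned = text.strip().strip('"').replace(thousands_sep, "")
    if decimal_sep != ".":
        cleaned = cleaned.replace(decimal_sep, ".")
    return float(cleaned)


parse_number('"1,234.56"')                                    # US style
parse_number("1.234,56", decimal_sep=",", thousands_sep=".")  # European style
```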

The following image summarizes information that describes basic characteristics of a raw data file.

There are several points to note with respect to this information.

First, there is no reference to a Thousands Separator character. Normally, any numeric values containing Thousands Separator characters in the raw data file would be ‘silently’ converted to numbers (integers or real numbers with the correct magnitude) when processed and placed in the final prepared data file, regardless of the language used to implement the data preparation software. If for some reason numeric values containing Thousands Separator characters are not handled correctly, the data preparation software in use has a serious deficiency.

Second, if a Currency Symbol is associated with a data value in the raw data file, it is quite possible that the value would be treated as a character string, rather than a number. While this is likely not what is expected or intended, it highlights another important aspect of data preparation. A report produced by the data preparation process should contain a section that identifies the Field Type (numeric or string) of every field, for review by the model builder. As a general rule, machine learning algorithms expect numbers or strings — algorithms do not inherently ‘know’ about currency, so values which represent monetary values and contain a Currency Symbol should actually be represented simply as integer or real numbers of the proper magnitude, without any associated Currency Symbol.

While these fundamental attributes of data files are often taken for granted, every data preparation process, whether internally developed or implemented as a component of a machine learning framework, should routinely produce a report that explicitly identifies the special characters which effectively define the organization of data in the final prepared data file. Having this information at hand can help more quickly resolve issues when unexpected data preparation problems arise due to low-level data format issues.

Fields and Records

For ease of human identification and communication, fields are naturally referred to by their ordinal number, consecutively from 1 to the total number of fields in a record. Normally, field numbers do not explicitly appear in a raw data file unless they happen to have been chosen as field names. The creator of a raw data file specifies the names, if any, for fields. Field names must be separated by the same Field Delimiter character that also separates data values, and by extension, cannot contain the Field Delimiter character.

The following image illustrates these concepts, using an excerpt from an Excel spreadsheet (the data is artificial data that does not represent real persons).

The un-numbered top row in the image, containing 1, 2, 3 … is an Excel artifact which indicates how fields are separated, and the corresponding field numbers (although the Field Delimiter character employed is not visible). This row would not be included when the spreadsheet is saved.

The second row in the image, identified by the number 1 (1 is an Excel artifact, not data), contains names for each field. When the spreadsheet is saved, this row would comprise the first line in the saved file (the Excel artifact 1 would not appear in the file).

The other rows in the image, identified by the Excel artifact numbers 2, 3, and 4, contain values for each field in sequential data records.

The creator of a raw data file specifies the Field Delimiter character, either through software, or if the file is created from a spreadsheet, when the spreadsheet is saved.

To avoid issues when raw data files are transferred between computer systems and potentially used in different ‘Locales’, we suggest always saving data as ‘Tab’ delimited text files with a .txt file extension. When the raw data file is created the operating system will usually determine the Line Termination Style, unless the file is created by custom software that explicitly sets the Line Termination Style. Also, be aware of the Decimal Separator character used when a raw data file is created if the file may be transferred to a computer system which uses a different Locale. In particular, transferring data files between European and US computer systems can cause aggravating and sometimes hard-to-identify problems due to the use of different Decimal Separators.
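Writing Tab-delimited text with an explicit line terminator, as suggested above, might look like this (illustrative; the sample rows are invented, in keeping with the article’s artificial data):

```python
import csv
import io

# Write to an in-memory buffer for illustration; a real tool would open
# a .txt file instead.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows([["Name", "ZIP"], ["Mary Smith", "02134"]])
tab_text = buf.getvalue()  # "Name\tZIP\nMary Smith\t02134\n"
```

A side benefit of the Tab delimiter: commas inside data values need no quoting, which avoids the quoted-CSV ambiguity described earlier.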

Date and Time Fields

Raw data files often have fields which contain date and/or time values, either as Strings (meaning that the values contain non-numeric separators such as ‘-’ , ‘/’, or ‘:’) or numbers (for example, an integer value representing a time increment such as milliseconds or seconds from a fixed point in time). Either format will almost certainly not be interpretable by a machine learning algorithm unless some transformation is applied.

While identifying specific transformation methodologies is beyond our scope, in part because the transformations typically result in adding fields to records in the original raw data file, we will note that it is important to first identify what underlying information is expected to have utility from date/time values.

Eliciting cyclical or seasonal information generally requires transforming date/time values to yield continuous values which periodically repeat, corresponding to the ‘natural’ cycle of the other data values.

Eliciting information that reflects particular points in time (for example, week-day versus week-end behaviors or activities) generally requires a transformation that yields categories corresponding to days of the week.

Eliciting information that reflects sporadic activity (for example, the influence of holidays or special events) is often best captured by a Boolean transformation (a field such as ‘isHoliday’, with a value of ‘true’ or ‘false’).

Eliciting information that reflects the duration of activities requires first converting date/time fields to ‘pure’ numeric values, and then computing the difference between consecutive values, in whatever units (minutes, hours, days, etc.) are appropriate for the particular domain.
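The four transformation goals above can each be sketched with the standard library (illustrative only; the 365-day cycle, the units, and the holiday set are assumptions a real project would tailor to its domain):

```python
import math
from datetime import datetime


def cyclical_day_of_year(dt: datetime) -> tuple[float, float]:
    """Sin/cos encoding of day-of-year so that values repeat
    periodically and Dec 31 is numerically close to Jan 1."""
    angle = 2 * math.pi * (dt.timetuple().tm_yday - 1) / 365.0
    return math.sin(angle), math.cos(angle)


def day_of_week_category(dt: datetime) -> str:
    """Categorical transformation for week-day versus week-end effects."""
    return dt.strftime("%A")  # e.g. 'Monday'


def is_holiday(dt: datetime, holidays: set) -> bool:
    """Boolean transformation for sporadic events; the set of
    (month, day) holidays comes from domain knowledge."""
    return (dt.month, dt.day) in holidays


def duration_minutes(start: datetime, end: datetime) -> float:
    """Duration between consecutive timestamps, in minutes."""
    return (end - start).total_seconds() / 60.0
```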

Ignored Fields

Based on human domain knowledge, in many situations the original raw data file includes fields which would not be useful for a machine learning algorithm.

For example, in the spreadsheet referenced previously, experience shows that the values of fields which contain names or addresses will have no causal relationship to the target value in a training data file — to the algorithm the values would be interpreted as many random, almost always unique, strings.

In the simplest case, ‘ignoring’ such fields means removing them from the raw data file; the resulting initial portion of a data record presented to an algorithm during training would then be as illustrated below.

Note that now ZIP is field 1, and that in general, when fields are ignored the relative positions of non-ignored fields would remain the same as in the original raw data file. More complex cases (ignoring non-contiguous fields in the interior of records) follow the same basic principle but require additional care to ensure that the order in which fields are presented to the model during training is maintained when the trained model is subsequently run.

The key point with respect to ignored fields is that during data preparation a model builder may and often does decide to ignore certain fields in a raw data file; if so, ignored fields must then be accounted for in subsequent modeling steps.

When fields in a raw data file are ignored (removed) and the file is used in training, it is also critical that the organization of training data records be maintained in downstream processing. In particular, when a model trained using revised records (with ignored fields removed) is deployed and subsequently processes new data, if the source of new data by default produces records in the form of the original raw data file, those records must be modified (fields ignored in training data must be removed from new data) before the model can correctly process new data in a record.
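The requirement that training-time and run-time records be shaped identically can be captured by a single shared transformation (a sketch; the field positions here are hypothetical, standing in for the Name and Address fields of the spreadsheet example):

```python
IGNORED = {0, 1}  # hypothetical: 0-based positions of Name and Address fields


def drop_ignored(record: list, ignored=IGNORED) -> list:
    """Remove ignored fields while preserving the relative order of the
    remaining fields. The identical transformation must be applied to
    new records when the trained model is deployed."""
    return [v for i, v in enumerate(record) if i not in ignored]


drop_ignored(["Mary Smith", "12 Main St", "02134", "41.2"])  # -> ["02134", "41.2"]
```

Keeping the ignored-field set in one place (rather than duplicating it in training and deployment code) is the simplest way to avoid the mismatch described above.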

In addition, as discussed further below, the data preparation process itself may identify other fields that would not be useful in training a machine learning algorithm; whether to present such fields to the algorithm during training is a separate decision.

Minimal Data Preparation

To recap progress to this point, the fundamental attributes that define fields and records are assumed to have been confirmed, and any fields that the model builder decided to ignore are assumed to have been identified.

To create a prepared data file, at a bare minimum a data preparation process then should iteratively:

  • Read a record from the raw data file (a record is identified when a Line Termination Style character sequence is detected);
  • Ignore but count empty records (records which consist only of the Line Termination Style character sequence);
  • Ignore but count records that start with a Comment Character (if specified);
  • Confirm that the record has the expected number of fields (count records which do not);
  • Check each non-ignored field to confirm that it contains a value (count occurrences of missing values);
  • Check each non-ignored field value to confirm that expected numeric values are numbers (count occurrences when values expected to be numbers are not);³
  • If all field count and field value confirmations succeed, write the revised record to the clean prepared file (excluding any ignored fields when writing the record), and count the good record.
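The steps above can be sketched as a single pass in Python (a simplified illustration; the function name, counter names, and defaults are our own):

```python
def prepare(lines, n_fields, delimiter="\t", comment_char="#",
            numeric_fields=(), ignored=()):
    """One pass over raw lines: collect clean records, count everything
    else. numeric_fields and ignored are 0-based field positions."""
    counts = {"empty": 0, "comment": 0, "bad_field_count": 0,
              "missing": 0, "not_numeric": 0, "clean": 0}
    clean_records = []
    for line in lines:
        line = line.rstrip("\r\n")          # strip the Line Termination Style
        if line == "":
            counts["empty"] += 1            # empty record
            continue
        if comment_char and line.startswith(comment_char):
            counts["comment"] += 1          # comment line, never shown to a model
            continue
        fields = line.split(delimiter)
        if len(fields) != n_fields:
            counts["bad_field_count"] += 1  # unexpected number of fields
            continue
        kept = [v for i, v in enumerate(fields) if i not in ignored]
        if any(v.strip() == "" for v in kept):
            counts["missing"] += 1          # a non-ignored field has no value
            continue
        try:
            for i in numeric_fields:
                float(fields[i])            # expected numeric value must be a number
        except ValueError:
            counts["not_numeric"] += 1
            continue
        clean_records.append(kept)          # write revised record to the clean file
        counts["clean"] += 1
    return clean_records, counts
```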

This minimal processing would yield a clean prepared data file that may be adequate if the original raw data file was relatively clean. However, should errors occur when the prepared data file is used for training models, there would be little information to help resolve issues, nor would there be summary information, including statistics, about values in the clean file that could help identify reasons for sub-optimal algorithm performance after training concludes.

These shortcomings will be addressed in the next section.

Robust Data Preparation

In addition to the minimal results identified in the preceding section, robust data preparation includes:

  1. generating detailed statistics for all non-empty, clean values for each fundamental Field Type for all fields (minimum, mean, maximum, and standard deviation for continuous-value numeric values; discrete class counts for discrete numeric or string values);
  2. providing automated analysis of clean field values in order to identify fields (in addition to those identified by the modeler) that would not be useful for training a machine learning algorithm; and
  3. attempting to automatically create an auxiliary clean data file in the event that the original raw data file contains too few complete clean data records to adequately train a machine learning algorithm (as discussed further below).

While producing this additional information requires multiple passes through a raw data file, unless the original file is known with certainty in advance to be clean, the benefits quickly become apparent and outweigh the minor increased processing costs when the subject raw data file is ‘messy’. Statistics provide insights into the volatility of field values; identifying min-max ranges can guide developing strategies for dealing with outliers.

In addition, generating detailed information about field values highlights yet another issue — related to the representation of numeric values — which should be addressed through Robust Data Preparation. Often integer codes are used to differentiate between elements of classes or categories (for example, 1 means single, 2 means married, etc.). While this approach is straightforward for humans to interpret, integer code assignments can cause issues when interpreted by machine learning algorithms. Rather than ‘recognizing’ that a set of integers should be treated as discrete classes or categories, an algorithm may incorrectly treat the set as continuous numbers. This can dramatically affect whether or not the algorithm can discover whatever analytic utility the subject field might have.

This issue is addressed in neuralstudio.ai® (refer to the Resources section) by automatically applying the same heuristics during data preparation that are used by the core neuralstudio.ai machine learning engine to structure data before using it to train neural networks. Put simply, the heuristics generate histograms of values, and if a field has fewer unique integer values than some threshold⁴, the field values are treated as discrete categories rather than (relatively) continuous numbers.
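A simplified version of such a heuristic, combined with the per-field statistics a robust process generates, might look as follows (the threshold of 10 unique values is our assumption; the actual neuralstudio.ai heuristics are not specified here):

```python
import statistics


def summarize_field(values: list[str], discrete_threshold: int = 10) -> dict:
    """Treat a field as discrete when it is non-numeric or has fewer
    unique values than a threshold; otherwise report continuous stats
    (minimum, mean, maximum, and standard deviation)."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        numbers = None  # at least one non-numeric value: a string field
    if numbers is None or len(set(numbers)) < discrete_threshold:
        counts: dict = {}
        for v in values:                    # histogram of discrete class labels
            counts[v] = counts.get(v, 0) + 1
        return {"type": "discrete", "classes": counts}
    return {"type": "continuous",
            "min": min(numbers), "mean": statistics.mean(numbers),
            "max": max(numbers), "stdev": statistics.stdev(numbers)}
```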

We now review two possible ‘messy’ data scenarios — either the original raw data file yields some minimum⁵ number of clean complete records, such that the resulting prepared data file can be directly employed to train models, or the original raw data file does not and some additional principled heuristics must be applied in order to obtain the required minimum number of training records.

Scenario 1 — Data Preparation Yields Sufficient Clean Records

In this scenario, the report produced by a robust data preparation process provides information that both confirms (presumably) the model builder’s general expectations about data values and provides a rationale for undertaking future data collection efforts should model performance not be satisfactory.

Summary Information

The following table indicates basic summary counts that offer an ‘at-a-glance’ overview of the results of processing and errors detected. Particularly when a model builder is working with data supplied by another party, these summary counts comprise ‘sanity checks’ — if the number of total data records or the number of clean data records differ significantly from what were expected, the discrepancies should be resolved before continuing with training models.

Basic Data Preparation Summary Counts

In the next table, summary counts for IGNORED fields are provided. Note that while this table reflects the analysis of field values that neuralstudio.ai® (refer to the Resources section) data preparation performs (one additional field was flagged ‘IGNORE’ as a result of too many unique string values), in general any robust data preparation process should report similar information.

Summary IGNORED Field Information

Field Detail Information

Summary field information should be supported by detailed information on a per-field basis, as illustrated below. This information allows the model builder to confirm that the Field Type identified by the data preparation process for each field is consistent with what is expected; in addition, it explicitly identifies fields that are not useful for modeling. Any anomalies or questions that result can be addressed before models are trained (or run) using the data.

Detailed Field Status Information

Field detail information should also include statistics for continuous-value and discrete-value fields, as illustrated in the following tables.

Continuous-Value Field Statistics

Statistics for continuous-value fields will identify potentially volatile fields that may adversely impact model stability. For example, if two different models demonstrate comparable performance on training data, and Sensitivity Analysis (beyond our scope here) suggests that one model is influenced more by a field with a larger standard deviation than another model, other factors being equal the alternate model is the better choice for deployment.

Discrete-Value Field Statistics

Statistics for discrete-value fields will identify fields in which the discrete-value counts are skewed. This may be remedied by over-sampling/under-sampling as appropriate for skewed class counts, or it may suggest effort is needed to acquire additional training data containing more of the under-represented classes. The issue to resolve is that, depending on the size of the prepared data file relative to individual discrete value counts, under-represented classes may in effect be ignored when a machine learning algorithm is trained, which usually is not intended or desired.

Scenario 2 — Data Preparation Yields Insufficient Clean Records

In the event there are insufficient complete clean records in the original raw data file, some of the summary information described previously will not be available. However, a robust data preparation process should provide as much information as possible so that the model builder has guidance for obtaining additional data.

The following table illustrates important error summary information (note that some columns are not shown due to space constraints).

Summary Error Information

The Raw Lines Processed value (83003) indicates the total number of lines in the raw data file. Generally, the number of potential data records is 1 less (83002 in this file), since the first line in a file usually contains field names.

The Total Fields value (272) indicates the total number of fields in records in the raw data file.

The Field Errors per Record line indicates unique counts of errors in records, in ascending order. The Records line indicates the number of records which have the corresponding number of errors. In this example,

  • 204 field errors occurred in 3135 records;
  • 205 field errors occurred in 8343 records;
  • . . .
  • 222 field errors occurred in 1045 records; and so forth (error counts between 207 and 222, and after 224, are not shown).

This information suggests the level of effort that would be required to make the raw data file usable. Every record had at least 204 errors; if those records could be corrected (and if the errors overlapped errors in other records, which is often not the case) the resulting file would contain 3135 records and could possibly be used, at least for proof-of-concept models.

The Field Number and Field Name lines indicate, respectively, the number and name of each field in the data file; the Missing Values line indicates the number of records in which the corresponding field value was missing (in other words, there was no value for the field). Fields 20 and 21 in the above image (and additional fields not shown) were missing data in every record. Clearly for this particular file, missing data is the problem.

Whenever the number of missing values for a field approaches the number of data records in a file, unless there is a clearly defined way to obtain missing values, we recommend completely eliminating (removing) such fields from the raw data file before restarting the data preparation process. If for some reason the fields cannot be removed, they should be designated as Ignored by the model builder when the data preparation process is re-run.

Information about Ignored fields, as illustrated in the following table, is also important when the data preparation process does not yield a sufficient number of clean records — particularly if the process performs additional analysis of field values (as is the case with neuralstudio.ai).

IGNORED Field Information (All Fields not Shown)

The above image further substantiates that missing data is the primary problem in this file. Out of 272 total fields, 221 had too many missing values.

In general, fields ignored due to almost constant values (Code 4) should be reviewed. It is possible that many missing values in such fields masked the fact that multiple valid values occurred in the field, but the distribution was so skewed that initial data preparation processing ignored the values that occurred infrequently. Fixing missing values in these fields might result in some of the fields in fact being usable and relevant.

Fields with too many unique strings (Code 9) are most likely ignored appropriately, but domain knowledge would confirm that.

Fields ignored by user (Code 13) are not a concern — they were explicitly ignored by the model builder.

Model builder domain knowledge would determine whether to attempt to fix problems with the data in the fields ignored due to Analysis, or to accept the neuralstudio.ai assessment and ignore the fields in follow-up data preparation processing (the information here would indicate to the model builder which fields to designate ‘Ignore’ in follow-up processing).

Obtaining Sufficient Clean Data from a Messy Raw Data File

Fixing bad data values in a raw data file can be a time-consuming and laborious task; an automated approach can be extremely helpful. The following paragraphs describe the methodology used by neuralstudio.ai.

If, after the initial pass through a raw data file, neuralstudio.ai determines that there are insufficient clean data records, it will automatically attempt to create a new data file derived from the original raw data file. Replacement values will be generated for all missing values in approximately half of the fields in the original raw data file that were not flagged to ignore (either by the model builder, or during initial neuralstudio.ai processing).

The replacement strategy is a principled but simple automated process which transforms into a usable form a file that otherwise would not be suitable for use with machine learning algorithms.

For fields that contain continuous numeric values, the replacement for any missing value for a particular field is the average of all non-empty values for the field in the original raw data file.

For fields that contain discrete (numeric or alphanumeric) values, the replacement value for a particular field is the mode of all non-empty discrete values in the original raw data file.
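The two replacement rules can be mirrored in a few lines (a sketch of the general idea, not neuralstudio.ai’s implementation; the column representation is our own):

```python
from statistics import mean, mode


def impute_column(values: list[str], continuous: bool) -> list[str]:
    """Replace empty values in one column with the mean (continuous
    fields) or the mode (discrete fields) of the non-empty values."""
    present = [v for v in values if v.strip() != ""]
    if continuous:
        fill = str(mean(float(v) for v in present))
    else:
        fill = mode(present)  # most frequent class label
    return [v if v.strip() != "" else fill for v in values]


impute_column(["1", "", "3"], continuous=True)       # -> ["1", "2.0", "3"]
impute_column(["a", "", "a", "b"], continuous=False)  # -> ["a", "a", "a", "b"]
```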

To assist in assessing the quality of the original raw data file, neuralstudio.ai always generates statistics based on non-empty values of non-ignored fields. In addition, if neuralstudio.ai determines that a new file can be created, it generates statistics for non-ignored fields in the new file.

These statistics offer quick ‘sanity checks’ for raw data file quality, and if a new file is created, also allow confirming that distributions of values in the new file are not radically different from the original raw data file. The tables below contain statistics for a very messy original raw data file, and the corresponding newly created file. The right-most column in each table indicates the number of field values in original raw file records that were changed as corresponding records were added to the new file.

Continuous-Value Field Statistics and Replacements (All Fields not Shown)

Continuous field statistics comprise the minimum, mean, maximum, and standard deviation for all non-empty continuous-value fields. In the image above, no Field_19 values were changed (all records in the original raw data file had valid values for Field_19). For Field_38, 69 values were changed in records in the new file, and 6895 Field_172 values were changed. A ‘changed’ value means that an empty (missing) value for a field in the raw file was replaced with the average of all non-empty values for the particular field. For example, 69 empty Field_38 values were replaced by ‘0.02’, the mean Field_38 value, to make 69 complete records which were placed in the new file.

Discrete-Value Class Counts and Replacements (All Fields not Shown)

Discrete field class counts consist of counts of the occurrence of each non-empty discrete (numeric or alphanumeric) value in a field (a ‘value’ is a discrete class label). In the image above, Field_132 contained 10 unique class labels (which happened to be numbers); one value was changed in the new file. Field_135 comprised 17 unique class labels (which also happened to be numbers); 35 values were changed in records in the new file. A ‘changed’ value means that an empty (missing) value for a field in the raw file was replaced by the class label representing the mode of all non-empty values for the particular field. For example, one empty Field_132 value was replaced with the label ‘0’ to make a complete record which was placed in the new file.
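Statistics of this kind are straightforward to reproduce for any delimited file. A minimal sketch, assuming values arrive as strings (with the empty string marking a missing value) and that each field has already been classified as continuous or discrete:

```python
import statistics
from collections import Counter

def continuous_stats(values):
    """Min, mean, max, and standard deviation of the non-empty continuous values."""
    nums = [float(v) for v in values if v != ""]
    return {
        "min": min(nums),
        "mean": statistics.fmean(nums),
        "max": max(nums),
        "stdev": statistics.stdev(nums) if len(nums) > 1 else 0.0,
    }

def class_counts(values):
    """Occurrence count of each non-empty discrete class label."""
    return Counter(v for v in values if v != "")
```

Running both functions over the original raw file and the newly created file, and comparing the results field by field, is exactly the sanity check described next.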

Unless there are large discrepancies between the field statistics of the original raw data file and those of the new file, or between the class label counts of the two files, the new file can be used to train machine learning algorithms, while recognizing that the training data is not entirely representative of real-world conditions.

Best Practices

Preparing a clean data file suitable for training machine learning algorithms requires paying attention to the details of how data is stored and how data values are represented. This task is separate from identifying the relevance of data to a particular problem domain (that is, identifying what data actually influences the training target of a machine learning algorithm).

For all but the simplest of machine learning projects, model development is an iterative process. As an important step in model development, data preparation can also require iteration. Proper execution of a robust data preparation strategy can eliminate unnecessary data preparation iterations. Without question, process automation is important; at the same time, human domain knowledge is required to resolve ambiguities in the data as well as to ultimately decide what data is important and what data can be ignored.

Remember: data preparation begins at the source!

If data is acquired directly from sensors, ensure that hardware is properly maintained — sensors, communication equipment, connectors and connections. Eliminating spurious values at the source is often the most cost-effective strategy.

If data originates from forms or documents completed by humans (which are then either transcribed by humans or optical character recognition software, or which serve directly or indirectly as a front-end for a database), build sanity checks into the software that collects and aggregates the data. Fully test one-off custom software with the same rigor that is applied to full applications (yes, this can be onerous — another reason for using standardized processes). Do not simply assume custom software is performing as expected. If at all possible, do not permit empty values in data; if empty values can occur, define a consistent method for identifying and handling them before they make it into back-end storage.

Standardize, preferably at the organization level but definitely at the machine learning project level, the structure of raw data files. Software that generates training data files from database queries should use the same fundamental attributes (Field Delimiter, Decimal Separator, Line Termination Style) as comparable training data files saved from a spreadsheet for the same project.

If training data is aggregated by extracting data from one or more databases into an intermediate file before presentation to a machine learning algorithm, and the intermediate file is very large, during the extraction process duplicate the first several thousand records (for example, the first 1000 to 10000) and place them in a separate tab-delimited file. Then as a sanity check, open the small separate file in a spreadsheet and confirm that the basic line and field organization is correct.

While opening the separate file in a spreadsheet is a quick ‘user friendly’ way to identify obvious data issues, spreadsheets can also hide structural anomalies that may cause issues when the file is used by a machine learning algorithm. As a second sanity check, open the separate file in a plain text editor capable of visually rendering non-printing characters like ‘Tab’ and ‘Linefeed’, and confirm that non-printing characters are not inappropriately duplicated — particularly at the ends of lines (each pair of consecutive Field Delimiter characters indicates a missing value between them).
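Both sanity checks can also be automated. The sketch below, assuming the tab-delimited sample file suggested above, reports the set of field counts seen per line (a well-formed file yields exactly one count) and the number of lines containing consecutive, leading, or trailing delimiters, each of which signals a missing value:

```python
def sanity_check(path, delimiter="\t", sample=10000):
    """Report per-line field counts and lines with empty (missing) values."""
    field_counts = set()
    lines_with_empties = 0
    with open(path, newline="") as fh:
        for i, line in enumerate(fh):
            if i >= sample:
                break
            line = line.rstrip("\r\n")   # tolerate Windows or Unix line endings
            field_counts.add(line.count(delimiter) + 1)
            # Consecutive, leading, or trailing delimiters all imply an empty field
            if (delimiter * 2 in line
                    or line.startswith(delimiter)
                    or line.endswith(delimiter)):
                lines_with_empties += 1
    return field_counts, lines_with_empties
```

If `field_counts` contains more than one value, the basic line and field organization is broken and should be investigated before any further processing.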

Do not use numbers (very large, negative, zero, etc.) to flag missing or bad data, unless it is certain that data preparation processing will guarantee that the machine learning algorithm will ignore (that is, never ‘see’) the chosen flag value. Otherwise, if the algorithm treats the missing data flag as a number, whatever value was chosen will adversely impact any statistics calculations related to the corresponding numeric field.
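The damage such a flag can do is easy to demonstrate. In this illustrative example (the sensor values and the -9999 sentinel are hypothetical), a single flag value drags the field mean far from anything representative:

```python
import statistics

readings = [20.1, 19.8, 20.4, 20.0]     # genuine sensor values
flagged = readings + [-9999.0]          # -9999 used to flag one missing reading

true_mean = statistics.fmean(readings)    # about 20.1
skewed_mean = statistics.fmean(flagged)   # about -1983.7
```

Any statistic computed over the flagged field — and any model trained on it — inherits this distortion, which is why flag values must be removed or replaced before the algorithm ever sees them.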

Consider how the machine learning algorithm will be deployed when preparing data for training it. If the algorithm will be deployed in a batch mode (processing new data in a file structured exactly like the training data file), the data preparation process used for training data applies without changes. If the algorithm will be deployed in an online (real or near-real time) environment, it may be necessary not only to prepare data, but also to invoke additional validation mechanisms (see the introduction) before presenting data to the algorithm. In turn, this may entail developing custom software to implement validation strategies.

Finally, as we have attempted to make abundantly clear, every detail about the structure of data, and every step in the data preparation process, should be logged or included in a report. When data is clean, such detailed information may seem unnecessary or superfluous. But if the information is produced routinely in a standardized process, when the process is applied to data that is not clean, the details will make resolving issues much easier.

Of course, that assumes the information that is generated is actually read!

Resources

The information supplied in tables which highlight or explain the data preparation concepts we have introduced was automatically generated by the Basic Data Preparation component of neuralstudio.ai. neuralstudio.ai is a neural network-based automated machine learning platform that opens the development and deployment of optimized neural network solutions to individuals and organizations that may not have extensive knowledge of or experience with neural networks and machine learning technologies. neuralstudio.ai is available around the world through the Microsoft Azure cloud.

neuralstudio.ai is designed from the ground up to support the entire life cycle of neural network development — training, evaluation, deployment, and performance monitoring. In addition to Basic Data Preparation, neuralstudio.ai offers

  • Enhanced Data Preparation, which extends the concepts introduced in Robust Data Preparation by producing individual neural network models to generate replacement values for every faulty field in a data record;
  • One-off neural network training for relatively quick proof-of-principle models;
  • Optimized ensembles of neural networks, where individual network hyper-parameters are first optimized by a genetic algorithm, and ensembles of those networks are then further optimized by a genetic algorithm to identify and rank the best-performing ensembles;
  • Comprehensive deployment facilities ranging from execution of models in the cloud, to on-premises execution in Excel spreadsheets or custom enterprise applications, to execution in embedded systems using the NeuralWorks® runtime engine.

To learn more, we invite you to visit neuralstudio.ai and sign up for a free Guest account. If you have questions about data preparation in general, or the capabilities of neuralstudio.ai, you can reach us at help@neuralstudio.ai.

We look forward to helping you along your machine learning path.

Footnotes

[1] At a character level, Windows uses a <carriage-return><linefeed> character sequence to terminate lines, while MacOS/Linux/Unix use only a <linefeed> character.

[2] Acknowledging that numbering systems other than the decimal system, such as the hexadecimal system, use additional characters.

[3] Two passes through a raw data file are required in order to determine which fields should be considered numeric fields.

[4] neuralstudio.ai implements a default threshold of 64 unique values.

[5] What constitutes a ‘minimum number’ is highly dependent on the specific problem domain; however, fewer than 50 records would arguably be insufficient in any real-world problem domain and is the minimum required by neuralstudio.ai.

Acknowledgements

The Data Preparation Maze image was created by Gerd Altmann.

We thank Alain Fuser, CEO of NEHOOV, Eleanor Hanna, Data Scientist at Valassis, and Jamal Clarke, Computer Scientist at NeuralStudio, for instructive comments.

I am the founder and Managing Director of NeuralStudio SEZC, a provider of AI technology, products, and services based in the Cayman Islands.