# Data Preprocessing Techniques for R Script Modeling and Design

Excerpt from the book *SAP HANA Advanced Data Modeling* by Anil Babu Ankisettipalli, Hansen Chen, and Pranav Wankawala. Used with permission of SAP PRESS. All rights reserved.

*This blog discusses data preprocessing techniques for R script modeling and design within SAP HANA.*

In the figure above, you’ll notice that SAP HANA and the R server run on two different machines, so it is important to understand the data transfer between them. The general rule is that algorithms should run in the database, close to the data, to avoid data movement and reduce overhead. Following this principle, leveraging the native algorithms available in the Predictive Analysis Library (PAL) avoids the need to transfer data between the servers.

However, not all algorithms and statistical techniques are natively available in SAP HANA. It is therefore imperative to weigh the cost of transferring data between the servers before deciding to execute a process in R. Unsupervised learning techniques are the ones that most commonly call for large data transfers between the servers.

We should consider some data preprocessing techniques, such as random sampling with replacement, random sampling without replacement, and stratified sampling, to reduce the amount of data to transfer in these circumstances. In this section, we will focus primarily on data transfer. Of the sampling techniques, stratified sampling introduces the least sampling error, because it maintains the proportions of the data distribution and ensures that each subgroup is represented in the sampled data. Let’s look into stratified sampling further in the next section.

**1 Stratified Random Sampling**

*Stratification* is the process of identifying and dividing a population into subgroups based on the values of a chosen categorical column. Each of these subgroups is referred to as a *stratum* (plural *strata*). Within each stratum, systematic or random sampling is performed to select a portion of the data population.

In SAP HANA, there are two methods for stratification in PAL. The first is through a sampling function, and the second is through a partition function that divides the dataset into training, testing, and validation datasets. To reduce the dataset, consider the sampling function. To start, generate a wrapper function **SAMPLING_TEST_PROC** with the **PAL SAMPLING** function using the Application Function Library (AFL) framework.

The required parameters for stratification are shown below. The **PERCENTAGE** parameter sets the sample size as a fraction of the population; an input value of 0.5 is expected to return half of the population as the sampling output. **STRATA_NUM** is the number of subgroups to be considered by the algorithm, and **COLUMN_CHOOSE** is the categorical column to be used for subgrouping.

The next table shows the distribution of the distinct values in the column chosen for stratification, X6 (an arbitrary data column). The column has two distinct values, and the table gives the population size of each.

The distribution of the X6 column after stratified sampling is shown below. Because we chose to sample 50% of the values, and stratification preserves the proportions of the population, the output in the table below shows half the original population for each value of X6.
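To make the mechanics concrete, here is a minimal Python sketch of stratified sampling without replacement. It is not the PAL implementation and does not produce the book's output; the function name `stratified_sample`, the row layout, and the 1,000/600 split of X6 values are assumptions chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, column, percentage, seed=None):
    """Draw the same fraction from every stratum of `column`, without replacement."""
    rng = random.Random(seed)

    # Group the rows into strata keyed by the value of the chosen column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)

    # Sample each stratum separately so its share of the data is preserved.
    sample = []
    for members in strata.values():
        k = round(len(members) * percentage)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 1,600 rows, two distinct X6 values (1,000 'A' and 600 'B').
rows = [{"ID": i, "X6": "A" if i < 1000 else "B"} for i in range(1600)]

sample = stratified_sample(rows, "X6", 0.5, seed=42)
counts = Counter(row["X6"] for row in sample)
print(counts["A"], counts["B"])  # 500 300
```

Because each stratum is sampled on its own, the 50% sample contains exactly half of each X6 group regardless of the random seed.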

**2 Random Sampling with Replacement**

In this technique, each selected item is returned to the population before the next draw, so the probability of selection remains the same for every draw, and the same item can appear in the sample more than once.
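As an illustration of the idea (a Python sketch, not the PAL function), the following uses the standard library's `random.choices`, which draws with replacement. The helper name `sample_with_replacement` and the integer population are hypothetical.

```python
import random


def sample_with_replacement(rows, size, seed=None):
    """Draw `size` rows; every draw picks from the full, unchanged population."""
    rng = random.Random(seed)
    return rng.choices(rows, k=size)


# Hypothetical population of 1,600 row IDs, sampled at 50%.
rows = list(range(1600))
sample = sample_with_replacement(rows, 800, seed=1)

# Because items are put back, a sample can even exceed the population size,
# in which case repeated items are unavoidable.
small = sample_with_replacement(["a", "b", "c"], 6, seed=1)
print(len(sample), len(small))  # 800 6
```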

To run this technique, use the same wrapper you generated previously with the same inputs, changing only **SAMPLING_METHOD**. Let’s set this value to 4 and rerun the algorithm. Here is the selected sample distribution:

**3 Random Sampling without Replacement**

In this technique, a selected item is not returned to the population, so it cannot be selected again, and the pool of remaining candidates shrinks by one with every draw. Update the **SAMPLING_METHOD** to 5 and run the algorithm. The distribution of X6 after sampling without replacement is depicted below.
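The mechanics of drawing without replacement can be sketched in Python with the standard library's `random.sample`. This is an illustration only, not PAL output; the helper name `sample_without_replacement` and the integer population are assumptions.

```python
import random


def sample_without_replacement(rows, size, seed=None):
    """Draw `size` distinct rows; a selected row is never put back."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


# Hypothetical population of 1,600 row IDs, sampled at 50%.
rows = list(range(1600))
sample = sample_without_replacement(rows, 800, seed=1)

# The candidate pool shrinks by one per draw: the chance that a given
# remaining row is picked is 1/N on the first draw and 1/(N-1) on the second.
first_draw = 1 / len(rows)
second_draw = 1 / (len(rows) - 1)
print(len(sample), len(set(sample)))  # 800 800
```

Unlike sampling with replacement, the sample can never be larger than the population: `random.sample` raises `ValueError` if `size` exceeds `len(rows)`.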

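**4 Systematic Sampling**

In this technique, the algorithm selects elements at a fixed interval through the dataset, typically starting from a randomly chosen position, rather than drawing every element independently. Set the **SAMPLING_METHOD** to 6 and rerun the algorithm.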
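A minimal Python sketch of the textbook form of systematic sampling follows: pick a random start within the first interval, then take every k-th row. How PAL chooses the interval and start is not described in the text, so the integer step and the helper name `systematic_sample` are assumptions.

```python
import random


def systematic_sample(rows, size, seed=None):
    """Take every k-th row after a random start, where k = len(rows) // size."""
    if not 0 < size <= len(rows):
        raise ValueError("size must be between 1 and len(rows)")
    rng = random.Random(seed)
    step = len(rows) // size       # fixed sampling interval (simplified to an integer)
    start = rng.randrange(step)    # random offset within the first interval
    return rows[start::step][:size]


# Hypothetical population of 1,600 row IDs, sampled at 50% (interval of 2).
rows = list(range(1600))
sample = systematic_sample(rows, 800, seed=7)
gaps = {b - a for a, b in zip(sample, sample[1:])}
print(len(sample), gaps)  # 800 {2}
```

Only the starting position is random; once it is fixed, the rest of the sample is fully determined by the interval.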

With nearly 1,600 rows, each technique produces a similar output, with a deviation of around 10%. However, stratified sampling is able to sample precisely, preserving the distribution of the stratification column. In addition to these sampling techniques, SAP HANA also supports sampling based on row offsets (First N, Last N, Middle N, and Every Nth).
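The row-offset methods are deterministic selections by position. The Python sketch below shows one plausible reading of each; the exact boundaries PAL uses (for example, which rows count as the "middle", or whether Every Nth starts at the first or the N-th row) are assumptions, as are the function names.

```python
def first_n(rows, n):
    """First N: the first n rows."""
    return rows[:n]


def last_n(rows, n):
    """Last N: the final n rows."""
    return rows[-n:] if n > 0 else []


def middle_n(rows, n):
    """Middle N: n rows centered in the dataset (assumed centering rule)."""
    start = max((len(rows) - n) // 2, 0)
    return rows[start:start + n]


def every_nth(rows, n):
    """Every Nth: rows n, 2n, 3n, ... counting from 1 (assumed starting row)."""
    return rows[n - 1::n]


rows = list(range(1, 11))  # hypothetical row numbers 1..10
print(first_n(rows, 3))    # [1, 2, 3]
print(last_n(rows, 3))     # [8, 9, 10]
print(middle_n(rows, 4))   # [4, 5, 6, 7]
print(every_nth(rows, 3))  # [3, 6, 9]
```

Because these methods depend only on row position, their output reflects whatever ordering the data happens to have, which is why they do not guarantee a representative distribution.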

Each of the techniques discussed enables you to reduce the amount of data transferred while maintaining the homogeneity of the data. The recommended approach is to run these techniques in SAP HANA using logical models such as attribute views, analytic views, calculation views, and/or predictive analytics preprocessing algorithms.
