Cleaning Data in Stata

An introduction to essential Stata commands for generating, recoding, and rescaling variables.

Published in

The Stata Gallery

10 min readDec 23, 2023

The data we get are rarely ready for the analyses we want to conduct. Oftentimes we need to first “clean” (aka, “pre-process”) the data so that the analyses we conduct are statistically appropriate.

For example, we might need to recode certain values of a variable, create a new variable, or rescale a variable so that the results that we get are as interpretable and informative as possible. This often involves working with both numeric variables as well as “string” variables (that is, variables that feature any non-numeric characters, such as country names).

This can be a time-consuming process — sometimes far more time-consuming than doing the actual analysis. Unfortunately, it is difficult to find Stata guides that succinctly and clearly show how to do these fundamental tasks as efficiently and carefully as possible. This guide covers these various techniques so that you can begin analyzing your data.

A quick disclaimer: data cleaning is simultaneously the most boring and most important part of data analysis. If you make a mistake in the cleaning process — which is extremely easy to do — your results will no longer mean what you think they mean, and that is really, really bad. Therefore: Always triple-check your cleaning against the original data.

OK, let’s get started!

First, of course, we will upload the exhilarating automobile data set from good ol’ 1978:

sysuse auto.dta, clear

set obs 74 // this is optional, but ensures you generate the correct number of values

GENERATING VARIABLES & REPLACING/RECODING VALUES

Here are some of the most essential techniques for generating and recoding variables.

Use the -gen- command to generate a new numeric variable that is blank (that is, missing all values)

gen numeric_var=.

Generate a new string (that is, non-numeric) variable that is blank

gen string_var=" "

Important to remember: values of string variables must always appear within quotes, if even the value happens to be numeric. For example, you’d refer to “Seven” or “7”, rather than Seven or 7.

Remember, if you ever make a mistake, simply drop the new variable using -drop- and start over:

drop string_var

Quickly generate a single “dummy”variable based upon another variable (categorical or continuous)

Note: dummy variables are binary variables that are typically coded as 0 and 1 to distinguish one subset of observations from another subset of observations.

gen highprice=price>=5000 
/*Note:  This assigns a 1 to any observation for which price>=5000, and 0 otherwise.

*Important: This will assign 0s to observations that are missing a value for "price".  

Thus, a safer strategy is to specify "if var is not blank"  */
gen highprice2=price>=5000 if price!=.

Quickly generate a series of dummy variables based on a categorical variable using -tab- with the “gen( )” option

tab foreign, gen(foreign_dummy) 
/* Notice: the new variables' labels tell you which value equals 1

Duplicate a variable using -clonevar-

Note: this is a really good habit to develop as it is much safer than changing the original variable (having the original is how you can check that you did your cleaning correctly!). Plus, this command very conveniently copies the variable and value labels!

Numeric variable example:

* Syntax is:  clonevar newvar=oldvar

clonevar foreign2=foreign // notice: copies variable AND value labels as well

String variable example:

clonevar make2=make

Use “replace” and “if” to replace values of an existing variable

Example 1: Fill in values based on values of another variable (and then rename the variable to give it a more informative name):

replace numeric_var=0 if mpg<21.2973 // make numeric_var=0 if mpg is less than mean (21.2973)
replace numeric_var=1 if mpg>=21.2973 // make numeric_var=1 if mpg is greater than or equal mean

rename numeric_var OverUnderAvgMPG // renaming syntax:  rename oldname newname

Example 2: Generate a variable, based on another variable, for a particular subset of the data using the “if” subcommand:

gen numeric_var_foreign=.
replace numeric_var_foreign=numeric_var if foreign==1
/* Here, we're making a new variable (numeric_var_foreign). It will be
equal to "numeric_var" but ONLY if an observation satisfies the condition
that "foreign" is equal to 1.  */

Using multiple conditional arguments (be extra careful with these — always check against original data afterward!):

gen numeric_var2a=turn if OverUnderAvgMPG==1 & foreign==1 // must satisfy BOTH conditions

gen numeric_var2o=turn if OverUnderAvgMPG==0 | foreign==0 // must satisfy AT LEAST ONE condition

Example 3: Make an observation blank (“missing”) based on some identifying variable (note, for Stata, “.” means “missing”):

replace numeric_var_foreign=. if make=="Toyota Celica"

Example 4: Quickly make a categorical variable from different ranges of a continuous variable using the -inrange(x, min, max)- suboption:

gen numeric_var3=.
replace numeric_var3=1 if inrange(trunk, 1, 10) // syntax: inrange(var, min, max)
replace numeric_var3=2 if inrange(trunk, 11, 20)  
replace numeric_var3=3 if inrange(trunk, 21, 25)

Example 5: Same procedure, but using -inlist(x, “name”, “name”)- which calls values from a string variable and tells Stata to use only values that are listed within “insist(make, )”

gen numeric_var4=1 if inlist(make, "AMC Concord", "AMC Pacer", "AMC Spirit")
replace numeric_var4=2 if inlist(make, "Toyota Celica", "Toyota Corolla", "Toyota Corona")

Example 6: Replace values of a string variable:

Here we will first use -input- to create a new string variable and input string values, one of which (“Three”) will be incorrect:

input str10(string_var2) // "string_var2" will be the name of the new string variable
"1" 
"2" 
"Three" 
"4" 
"5"
end

Now we’ll replace the incorrect value:

replace string_var2 = "3" if string_var2=="Three"

/* Notice: The 3 is in quotes. We can't do "replace string_var2==3" because 
string_var2 is a string. We'll change it into a numeric variable below. */

Make a numeric variable from an existing string variable by using the wonderful -encode- command

Here, we’ll first split up the “make2” values that we created above to end up with new variables. Each new variable will contain just one word from the original make2 variable’s values. For example, if a vehicle name had three words, then make21 will have the first word, make22 will have the second word, and make23 will have the third word:

split make2, parse(, " ") // spacing is very important here!
tab make21 // notice:  Stata calls it "make21" b/c it uses the first word of "make2"

Example of what -split , parse- accomplished.

Now we will make a numeric version of our new “make21” variable using the -encode- command:

encode make21, gen(Make_numeric) // generate a numeric version of "make21"

If we then use the -tab- option, we can see the labels of our new numeric variable:

And if we add the “, nolabel” option, we can see the numeric values that are “underneath” the labels (tip: to make this even easier, see the discussion of the -fre- command below!):

Make a string variable from an existing numeric variable using -decode- or -tostring-

Example using -decode- :

decode Make_numeric, gen(Make_String) // note this will be identical to "make21"

Example using -tostring- :

tostring Make_numeric, gen(Make_String2) // uses original varname as variable label

Make a string variable numeric but without generating a new variable (exercise caution!)

*First create a string variable
input str10(string_var3) // "string_var3" will be the name of the new string variable
"1" 
"2" 
"3" 
"4" 
"5"
end

destring string_var3, replace  

* This will work fine because all values are actually numeric


* But what if all values aren't numeric?


input str10(string_var4) // "string_var4" will be the name of the new string variable
"1" 
"2" 
"???" 
"4" 
"5"
end

* If they are not all numeric, you can add the "force" option but this
will DELETE any non-numeric values/cells (be careful!)

destring string_var4, replace force

Using -recode- (often more efficient than using gen/replace, especially with categorical variables that have only a few values)

Example without generating a new variable (always use extreme caution changing original variables!):

recode foreign (0=1 "Domestic Cars") (1=0 "Foreign Cars") (9=.)
* Notice: Syntax is (old value = new value "label you want on new value")
* Note: Included "(9=.)" to show how to omit useless values from the original variable

Example with “generate” option (doesn’t change original variable; instead generates new variable with recoded values):

recode foreign (0=1 "Domestic Cars") (1=0 "Foreign Cars") (9=.), gen(Domestic)
* Notice: Syntax is (old value = new value "label you want on new value")
* Note: Included "(9=.)" to show how to omit useless values from the original variable

We can then check our coding by executing -codebook- for this variable. Notice, foreign cars now = 0, and domestic cars now =1. They also have the new value labels we created.

RESCALING VARIABLES

Once our our variables’ values are fine, we might want to rescale the variable in some way. Here are very common forms of rescaling and how to do them as efficiently as possible.

Rescaling a continuous variable to range from 0 to 1 using “returned” (“r( )”) results

This is particularly helpful for regression analysis. Specifically, if we do this for our independent variables, the interpretation of a “one-unit increase in X” will now be equivalent to “going from the lowest value to highest value observed in the sample”, which is often much easier to comprehend and can make graphs (like “coefplots”, which I cover in detail here) look much neater as all the independent variables will be on 0-to-1 scales.

The code below will create a new variable that is a rescaled (0 to 1) version of an existing variable:

sum price // obtains min and max values to be used in next line of code

gen price_01= (price-r(min)) / (r(max)-r(min)) // note: requires previous command
/*Notice:  Minimum value is being subtracted from price, resulting in lowest value 
being equal to 0. Then, this new scale is divided by the difference between the max
and min values, which will be equal to the highest value.  This produces a continuous
scale ranging from exactly 0 to exactly 1.  */

Reversing a variable’s values (e.g., instead of from ranging from low to high, it is reversed to go from high to low)

Example using returned results to generate a new variable that is a reversed version of another variable:

sum weight // obtains the min and max values needed for next line
gen lightness= -weight + r(min) + r(max) // make var negative then add original min and max
/*Very Important: Notice the minus sign in front of "weight" */

Example of generating a new variable that is a reversed version of another variable using user-written “revrs” command. Caution: the minimum value of the new variable will always be equal to 1.

ssc install revrs, replace // install revrs package

revrs foreign // Note: will keep value labels but makes lowest value = 1
tab rr_foreign // Notice:  has value labels

Example of generating a new variable that is a reversed version of another variable using user-written “omscore” command. Note: the new variable will not have any value labels.

net install dm7, from("http://www.stata.com/stb/stb7") // installs "omscore"

omscore foreign // Note: will begin at lowest value of foreign
tab revforeign // Notice:  does not have value labels

Mean-centering a variable using returned results

Here we are generating a new variable that is a mean-centered (i.e., the mean is subtracted of every value, making the new mean equal to 0) version of another variable.

sum headroom // obtains mean needed for next line

gen headroom_meancentered=headroom-(r(mean)) // subtracts mean from every value of headroom

Rescaling a variable in standard-deviation units

Example of generating a new variable that is a standard-deviation-unit version of another variable using returned results:

sum length // obtains mean and sd needed for next line

gen length_SDUs=(length-r(mean))/r(sd) // subtracts mean from each value, then divides by the SD

Example of generating a new variable that is a standard-deviation-unit version of another variable using -egen- command:

egen length_SDUs2=std(length)

Generating a natural log of a variable

Example of generating a new variable that is a (natural) logged version of another variable:

gen price_logged=ln(price)

Generating a variable raised to a particular power

Example of generating new variables that are squared and cubed versions of another variable, respectively:

gen mpg_squared=mpg^2      

gen mpg_cubed=mpg^3

BONUS: USEFUL LABEL-RELATED COMMANDS YOU MIGHT NOT KNOW

-codebook-, -labelbook-, and -varmanage- are really useful commands for accessing variables’ label information:

codebook, compact // see all variables in data set, their means/min/max & variable labels

labelbook Make_numeric // see ALL value labels contained within a particular set of value labels 
labelbook // see ALL value labels for EVERY set of labels in data set (plus other info)
*Note: I used to use "codebook" for this, but when there are many values, it doesn't show all the labels

varmanage // access all variables to manually change variable labels and/or value labels

The user-written -fre- package is excellent for seeing all value labels alongside their numeric values (in addition to their frequencies, percentages, etc.). Essentially it is as if you could combine the output of -tab varname- AND -tab varname, nolabel- at the same time. (Contrast this with -codebook-, which will only show a handful of value labels when there are many values.)

ssc install fre, replace  

fre Make_numeric

This gives the following output. Notice that we get to see the numeric values just to the left of the value labels, which can help us avoid silly mistakes when cleaning!

Hope this was helpful for you, and thanks for reading!