Data Envelopment Analysis (DEA)Visualization using Stata

Fahad Mirza
The Stata Gallery
Published in
10 min readApr 3, 2024

Note: This walkthrough will not go through basics of Stata and will assume that the reader is somewhat familiar in the use of matrices, loops, and macros.

Data Envelopment Analysis is a nonparametric method that quantifies the relative efficiency of decision making units (DMUs) or multiple entities , which form a “best-practice frontier”. Each DMU is responsible for the conversion of an input to an output.

Unlike parametric methods that require a predefined production or cost function, DEA is a non-parametric approach that relies on observed data to compare input and output combinations. Its advantages include minimal assumptions, the ability to handle multiple inputs and outputs, and computational simplicity due to its formulation as a linear program.

DEA Plot using Stata — Colored by groups

The relative efficiency score of the conversion is then calculated based on observed data and basic assumptions for the resolution of an optimization model.

Each DMU is assigned an efficiency score. Score of less than 1 is considered to be relatively inefficient while all DMU/s with the value of 1 form the efficiency frontier.

There are multiple models when it comes to analysis using DEA:

  1. Input-Oriented
    - Single Input / Single Output
    - Single Input / Multiple Output
    - Multiple Input / Single Output
    - Multiple Input / Multiple Output
  2. Output-Oriented
    - Single Input / Single Output
    - Single Input / Multiple Output
    - Multiple Input / Single Output
    - Multiple Input / Multiple Output
  3. Non-Oriented

This Stata guide uses an example of the Input-Oriented Method (Single Input / Single Output Case) for data visualization.

We start by ensuring that DEA is installed in Stata. This command is written by Yong-Bae Ji and Choonjoo Lee from “Korea National Defense University” (Republic of Korea)

* Run all 3 commands in one go
net sj 10-2 st0193
net install st0193
net get st0193

* To know more about DEA, type the following after installation completes:
help dea

* We start by loading out dummy dataset for Single Input / Output Plot.
* The decision making units here are the countries.
* These DMU can be anything other than countries depending on your requirement.

* Copy and paste the following code in your Do-File then execute to load data.
clear
input str7 groups str5 dmu double(input output)
"Group A" "AFG" 87.26849 4.22
"Group B" "ALB" 118.11832 7.58
"Group C" "DZA" 76.00941 48.96
"Group B" "ASM" 66.616165 22.03
"Group B" "AND" 64.18191 26.6
"Group A" "AGO" 60.44937 49.65
"Group D" "AIA" 156.67883 72.22
"Group A" "ATA" 85.44653 28.57
"Group B" "ATG" 98.68203 15.9
"Group A" "ARG" 54.8189 33.33
"Group D" "ARM" 74.772865 86.96
"Group B" "ABW" 96.94016 9.84
"Group A" "AUS" 60.90819 14.87
"Group B" "AUT" 46.20413 16.67
"Group A" "AZE" 87.98172 6.41
"Group A" "BHS" 64.04943 29.66
"Group C" "BHR" 105.42506 47.55
"Group D" "BGD" 186.1537 68.18
"Group B" "BRB" 56.88528 23.23
"Group B" "BLR" 43.24974 7.06
"Group C" "BEL" 72.01537 48.59
"Group B" "BLZ" 49.50985 2.96
"Group B" "BEN" 40.3767 2.4
"Group D" "BMU" 45.39432 24.32
"Group B" "BTN" 88.99753 1.96
"Group A" "BOL" 47.25929 27.47
"Group A" "BES" 74.20091 1.96
"Group D" "BIH" 115.73698 35.42
"Group C" "BWA" 188.0303 30.48
"Group C" "BVT" 157.8457 52.81
"Group D" "BRA" 89.52934 47.34
"Group C" "IOT" 122.30964 17.14
"Group D" "BRN" 113.85688 34.83
"Group B" "BGR" 217.32883 10.25
"Group B" "BFA" 35.89531 0
"Group D" "BDI" 132.8121 56.76
"Group D" "CPV" 162.208 48.09
"Group C" "KHM" 87.63231 53.75
"Group C" "CMR" 220.98065 16.83
"Group A" "CAN" 57.21156 15.69
"Group C" "CYM" 88.98106 45.5
"Group B" "CAF" 39.21181 9.09
"Group D" "TCD" 84.36317 17.24
"Group C" "CHL" 85.77451 34.95
"Group B" "CHN" 80.08671 10.25
"Group D" "CXR" 111.36538 41.51
"Group B" "CCK" 89.9909 5.41
"Group D" "COL" 89.88329 70.04
"Group D" "COM" 99.03506 82.99
"Group D" "COD" 119.52612 59.38
"Group A" "COG" 45.7907 11.59
"Group A" "COK" 88.1432 52.09
"Group D" "CRI" 178.4655 60.53
"Group B" "HRV" 280.29263 5.45
"Group A" "CUB" 50.6408 17.65
"Group C" "CUW" 98.11392 37.84
"Group D" "CYP" 80.50137 45.89
"Group B" "CZE" 94.16136 13.66
"Group B" "CIV" 94.63183 22.08
"Group D" "DNK" 263.59613 59.76
"Group A" "DJI" 47.06966 15.29
"Group A" "DMA" 45.76461 8.77
"Group B" "DOM" 56.64094 4.17
"Group A" "ECU" 61.08718 6.54
"Group A" "EGY" 42.89222 8.73
"Group D" "SLV" 101.47802 55.06
"Group C" "GNQ" 110.22458 42.16
"Group B" "ERI" 76.02513 10
"Group C" "EST" 109.55764 39.87
"Group B" "SWZ" 45.35803 17.96
"Group D" "ETH" 139.70619 100
"Group B" "FLK" 50.17251 16.86
"Group B" "FRO" 95.53053 9.4
"Group B" "FJI" 55.02819 2.49
"Group D" "FIN" 72.223854 59.6
"Group C" "FRA" 89.78141 42.16
"Group A" "GUF" 46.61728 10.05
"Group D" "PYF" 84.1601 59.52
"Group B" "ATF" 31.897926 7.66
"Group B" "GAB" 67.84356 14.74
"Group D" "GMB" 132.42421 32.56
"Group A" "GEO" 47.21345 16.46
"Group D" "DEU" 137.07115 50.29
"Group D" "GHA" 111.32855 53.49
"Group B" "GIB" 52.27685 29.67
"Group B" "GRC" 84.34722 26.79
"Group B" "GRL" 66.34722 9.45
"Group B" "GRD" 60.80746 15.53
"Group C" "GLP" 100.96398 46.2
"Group B" "GUM" 65.245605 9.16
"Group C" "GTM" 85.55104 38.14
"Group A" "GGY" 47.93858 23.76
"Group B" "GIN" 52.44144 4.21
"Group D" "GNB" 120.0591 67.57
"Group C" "GUY" 115.57858 48.8
"Group B" "HTI" 49.65616 4.49
"Group D" "HMD" 116.94597 56.9
"Group D" "VAT" 89.44913 35.56
"Group C" "HND" 57.05596 73.58
"Group C" "HKG" 63.07741 40.53
"Group A" "HUN" 46.61403 14.85
"Group A" "ISL" 48.89816 20.46
"Group C" "IND" 70.88245 43
"Group C" "IDN" 94.14224 58.62
"Group A" "IRN" 50.46558 39.58
"Group C" "IRQ" 85.0009 58.29
"Group C" "IRL" 220.47366 56.25
"Group B" "IMN" 58.88473 14.12
"Group A" "ISR" 47.89618 41.03
"Group C" "ITA" 85.50446 42.37
"Group A" "JAM" 40.43304 34.32
"Group B" "JPN" 85.6799 16.73
"Group D" "JEY" 93.74564 71.15
"Group D" "JOR" 93.07504 59.05
"Group D" "KAZ" 132.68983 58.11
end

Executing the code above leads to the following view in our data browser window:

This confirms that our data is loaded correctly and is ready for use.

When setting up data for analysis, it is important to note that the “DMU” variable should be explicitly named in the data otherwise the code will not execute. The names for input and output variables on the other hand are your choice.

The DEA command has a number of options and can be viewed in detail using the following line of code: help dea

We have a number of options that can be specified and include:

  1. Returns to Scale (rts): Constant returns to scale (CRS), Variable Returns to Scale (VRS), Decreasing Returns to Scale (drs), or Non-Increasing Returns to Scale(nirs) can be specified.
  2. Orientation: Can specify either input or output oriented DEA based on the model that you wish to use.
  3. Stage: By default, two-stage DEA is set and this can be changed to single-stage DEA
  4. Trace: When executing the dea command, a log file is automatically generated and with the option trace, you can save entire sequence in it for an in-depth overview.
  5. Saving: This option means you can then save the results with a name that you can specify.

For this blog, I will be using the following example to generate plot:

  1. Variable Returns to Scale
  2. Input Oriented
  3. Single-Stage DEA

The code for this example is as follows:

dea input = output , rts(vrs) ort(in) stage(1) saving(dummy_data)

As this is a single-input and single-output model, we will specify the variable name followed by equal sign and then the output variable. Options are then specified after the comma.

Once executed, a matrix of results is formed by the name of r(dearslt) and can be viewed using the code return list right after the code is executed.

In the matrix we have 115 rows and 119 columns that contains all the results from the DEA.

The next step is to save this matrix of results into a self specified matrix using the code:

matrix results = r(dearslt)

which saves the output matrix from the code into the matrix by the name results .

From this matrix, we will extract the values of Theta that essentially rank DMU by efficiency. For that to happen, we need to extract a subset of the matrix results and to do that we have the code:

matrix theta = results[1..., 2]

* The Syntax matrixname[1..., 2] is interpreted as all rows from columns
* 1 and 2. The first two dots are interpreted as "till" and the last dot
* is interpreted as "last". This is documented in -help matrix extraction-.

A preview of the matrix allows us to see the name of the DMU along with the efficiency score. All values of 1 mean that the DMU is efficient and lies on the frontier: matrix list theta

In order to make use of this information, we then save this theta matrix as a variable using the command svmat theta which creates a variable theta1 within our dataset.

With our main variable now in place within the dataset, we now create a rank based on the theta1 variable using the following command:

svmat theta 
egen rank = rank(theta1) , field

As there can be multiple DMUs that take the most efficient value of 1, the field option helps us to keep these DMUs unique. The field option calculates the field rank of theta1 where the highest value is ranked 1, and there is no correction for ties.

This now brings us closer to plotting the graph however, to make the plot readable and easy to understand, we need to generate labels for the most efficient DMUs. To do so we simply use the following code:

 generate label_most_eff = dmu if theta1 == 1

The code below is a block that needs to be executed in one go and should be selected entirely if we want to look at the plot.

 quietly summarize input

local minx = `r(min)'
local maxx = `r(max)'


quietly summarize output

local miny = `r(min)'
local maxy = `r(max)'

preserve

keep if theta1 == 1

gsort -output

forvalues i = 1/`=_N' {

local area "`area' `=output[`i']' `=input[`i']' "

if `i' == 1 {

local line "`=output[`i']' `=input[`i']' `maxy' `maxx'"

}

}

restore


twoway ///
(scatteri 0 `minx' 0 `maxx' `maxy' `maxx' `area', recast(area) lwidth(0) color("255 127 14%15")) ///
(scatteri 0 `minx' 8.344371 `minx', recast(line) lcolor("255 127 14")) ///
(scatteri `line', recast(line) lcolor("255 127 14")) ///
(scatter output input if theta1 == 1, c(l) sort lcolor("255 127 14") ms(i) mlabel(label_most_eff) mlabpos(10) mlabsize(1.75)) ///
(scatter output input, msize(0.75) mcolor("31 119 180%60")) ///
, ///
scheme(white_tableau) ///
legend(off) ///
ylabel(0(10)100, labsize(2)) ///
yscale(range(0 100)) ///
xlabel(0(50)250 285, labsize(2)) ///
xscale(range(-30 100)) ///
xtitle("Input", size(2)) ///
ytitle("Output", size(2) orient(horizontal)) ///
title("{bf}DEA Visualization Example using Stata", size(2.5) pos(11)) ///
subtitle("Single-Input Efficiency Curve (DEA Input Oriented)", size(2) pos(11) margin(b=5)) ///
note("Data: Example Data" " ", size(1.5))

The code executes and visualizes the following plot:

DEA Plot: Single Input & Single Output

To summarize the code above, we start off by summarizing both the output and input variables and storing their minimum and maximum values in a local. Once done, we then preserve the data and keep only the most efficient values because they will form the frontier. The output values are sorted in descending order. Then, for the all the sorted efficient observations, we create a local that contains Y and X values to create an area plot by creating a boundary line. Within the loop, we also want to keep the first Y and X value to use in a line plot.

Once done, we then plot the graph using twoway by creating an area plot first, followed by a line plot that connects the lowest point on the frontier with x-axis. This process is then repeated for the highest point on the frontier where we connect that point and extend it to the most extreme observation horizontally.

This is then followed by plotting markers on the efficiency frontier and connecting them with a line. Once plotted we plot the other marker points which are not on the frontier to complete the graph.

This plot can further be created by plotting markers by group.

quietly summarize input

local minx = `r(min)'
local maxx = `r(max)'


quietly summarize output

local miny = `r(min)'
local maxy = `r(max)'

preserve

keep if theta1 == 1

gsort -output

forvalues i = 1/`=_N' {

local area "`area' `=output[`i']' `=input[`i']' "

if `i' == 1 {

local line "`=output[`i']' `=input[`i']' `maxy' `maxx'"

}

}

restore


* Making a version that colors scatters for each province.
twoway ///
(scatteri 0 `minx' 0 `maxx' `maxy' `maxx' `area', recast(area) lwidth(0) color("255 127 14%15")) ///
(scatteri 0 `minx' 8.344371 `minx', recast(line) lcolor("255 127 14")) ///
(scatteri `line', recast(line) lcolor("255 127 14")) ///
(scatter output input if theta1 == 1, c(l) sort lcolor("255 127 14") ms(i) mlabel(label_most_eff) mlabpos(10) mlabsize(1.75)) ///
(scatter output input if groups == "Group A", msize(1) mfcolor("255 255 255%70") mlcolor(black) mlwidth(*0.5)) ///
(scatter output input if groups == "Group B", msize(1) mfcolor("33 150 243%70") mlcolor(black) mlwidth(*0.5)) ///
(scatter output input if groups == "Group C", msize(1) mfcolor("96 125 139%70") mlcolor(black) mlwidth(*0.5)) ///
(scatter output input if groups == "Group D", msize(1) mfcolor("233 30 99%70") mlcolor(black) mlwidth(*0.5)) ///
, ///
scheme(white_tableau) ///
ylabel(0(10)100, labsize(2)) ///
yscale(range(0 100)) ///
xlabel(0(50)250 285, labsize(2)) ///
xscale(range(-30 100)) ///
xtitle("Input", size(2)) ///
ytitle("Output", size(2) orient(horizontal)) ///
title("{bf}DEA Visualization Example using Stata", size(2.5) pos(11)) ///
subtitle("Single-Input Efficiency Curve (DEA Input Oriented)", size(2) pos(11) margin(b=5)) ///
note("Data: Example Data" " ", size(1.5)) ///
legend(order(5 "Group A" 6 "Group B" 7 "Group C" 8 "Group D") size(2) row(1) pos(11) bmargin(t = -4 b = 3 l = -2))
DEA Plot: Single Input & Single Output — Colored by Groups

Conclusion

This concludes our guide on creating a DEA Plot using Stata with relative ease! I hope you enjoyed this guide. Please do share your versions.

Also, please do point out if there are any bugs with the code.

Happy Visualizing!

About the author

I am an Economist/Data Analyst by profession and a Stata addict quite frankly. Currently based in Islamabad, Pakistan and working as a Consultant at The World Bank.

Graduate from Lahore University of Management Sciences (LUMS) & National University of Computer & Emerging Sciences.

You can connect with me via GitHub, Medium, Twitter, LinkedIn or simply via email: 17160013@lums.edu.pk

--

--

Fahad Mirza
The Stata Gallery

Economist, Data Analyst, Data Visualizer & Stata Enthusiast. F1 Fanatic! My Twitter: https://twitter.com/theFstat