Variable Selection in Data Envelopment Analysis with the ADEA Method
Fernando Fernandez-Palacin1
Manuel Munoz-Marquez2
2024-11-12
Introduction
Variable selection in Data Envelopment Analysis (DEA) is a crucial consideration that requires careful attention before the results of an analysis can be applied in a real-world context. This is because the outcomes can vary significantly depending on the variables included in the model. Therefore, variable selection is a fundamental step in every DEA application.
ADEA provides a measure known as “load” to assess the contribution of a variable to a DEA model. In an ideal scenario where all variables contribute equally, all loads would be equal to 1. For instance, if the load of an output variable is 0.75, it signifies that its contribution is 75% of the average value for all outputs. A load value below 0.6 indicates that the variable’s contribution to the DEA model is negligible.
For more information about loads, see the package help or Fernandez-Palacin, Lopez-Sanchez, and Munoz-Marquez (2018) and Villanueva-Cantillo and Munoz-Marquez (2021).
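To make the interpretation of a load concrete, here is a minimal sketch in base R. The contribution values and variable names are hypothetical and chosen only for illustration; the adea package derives the actual contributions from the DEA model. The sketch assumes, as described above, that a load expresses a variable's contribution relative to the average contribution of the variables of the same type.

```r
# Hypothetical contributions of three output variables (illustrative values only,
# not produced by the adea package).
contribution <- c(out_a = 0.30, out_b = 0.70, out_c = 0.20)

# Load: contribution relative to the average contribution, so loads average to 1.
load <- contribution / mean(contribution)
round(load, 2)

# Flag variables whose load falls below the 0.6 threshold mentioned above.
negligible <- load < 0.6
names(load)[negligible]
```

In this made-up case, out_a contributes 75% of the average, while out_c falls below the threshold and would be a candidate for removal.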
Let’s load and inspect the “tokyo_libraries” dataset using the following code:
library(adea)
data(tokyo_libraries)
head(tokyo_libraries)
#>   Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1   2.249  163.523       26         49.196     5.561   105.321
#> 2   4.617  338.671       30         78.599    18.106   314.682
#> 3   3.873  281.655       51        176.381    16.498   542.349
#> 4   5.541  400.993       78        189.397    30.810   847.872
#> 5  11.381  363.116       69        192.235    57.279   758.704
#> 6  10.086  541.658      114        194.091    66.137  1438.746
Stepwise variable selection
Two stepwise variable selection functions are provided. The first one eliminates variables one by one, creating a set of nested models. The following code sets up input and output variables and performs the call:
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)
#>       Loads nEfficients nVariables nInputs nOutputs
#> 6 0.4554670           6          6       4        2
#> 5 0.9901640           6          5       3        2
#> 4 0.8533008           3          4       2        2
#> 3 0.6574467           2          3       1        2
#> 2 1.0000000           1          2       1        1
#>                                         Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 4                          Books.I2, Staff.I3 Regist.O1, Borrow.O2
#> 3                                    Books.I2 Regist.O1, Borrow.O2
#> 2                                    Books.I2            Borrow.O2
The load of the first model (0.455) falls below the minimum significance level of 0.6, indicating that Area.I1 can be removed from the model.
When a variable is removed, one would expect the load of all remaining variables to increase. However, this does not happen after the second model: the load drops from 0.990 with five variables to 0.853 with four. The third model is therefore inferior to the second, and there is no statistical rationale for selecting it.
To avoid selecting such models, a second stepwise variable selection function, adea_parametric, is provided. It is called as follows:
adea_parametric(input, output)
#>      Loads nEfficients nVariables nInputs nOutputs
#> 6 0.455467           6          6       4        2
#> 5 0.990164           6          5       3        2
#> 2 1.000000           1          2       1        1
#>                                         Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5          Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 2                                    Books.I2            Borrow.O2
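The relationship between the two outputs can be illustrated with a short base R sketch. It copies the loads printed by adea_hierarchical above and keeps only the models whose load is not below that of any larger model. This filter is an assumption made for illustration, not a description of how adea_parametric works internally, although on this dataset it selects the same three models.

```r
# Model loads printed by adea_hierarchical above, named by number of variables.
loads <- c("6" = 0.4554670, "5" = 0.9901640, "4" = 0.8533008,
           "3" = 0.6574467, "2" = 1.0000000)

# Change in load at each removal step; negative values mark steps
# where removing a variable lowered the load instead of raising it.
steps <- diff(loads)
names(steps)[steps < 0]

# Keep models whose load is at least the running maximum over larger models.
kept <- loads[loads == cummax(loads)]
names(kept)
```

The models with four and three variables are the ones dropped, which matches the rows missing from the adea_parametric output.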
In both cases, all variables are considered for removal, but the load.orientation parameter allows for selecting which variables to include in the load analysis. You can choose 'input' for only input variables, 'output' for only output variables, or 'inoutput', the default value, for all variables. The following call only considers output variables as candidate variables for removal:
adea_parametric(input, output, load.orientation = 'output')
#>   Loads nEfficients nVariables nInputs nOutputs
#> 6     1           6          6       4        2
#>                                         Inputs              Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
Both adea_hierarchical and adea_parametric return an object containing a list called models, which holds all computed models and can be accessed using the following call:
m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4
#> [1] 0.2260062 0.6377375 0.5400548 0.5930209 0.9112849 0.7449643 0.6496709
#> [8] 0.5391304 0.8966427 0.7051438 0.5387076 0.7191553 0.6381740 0.7152620
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.6806063 1.0000000
Here the number in double square brackets, 4 in m$models[[4]], is the total number of variables in the model.
By default, when the print function is used with an ADEA model, it displays only the efficiencies. The summary function provides a more comprehensive output:
summary(m4)
#>
#> Model name
#> Orientation input
#> Load orientation inoutput
#> Model load 0.853300754553448
#> Input load.Books.I2 1.14669924544655
#> Input load.Staff.I3 0.853300754553448
#> Output load.Regist.O1 0.853300754553448
#> Output load.Borrow.O2 1.14669924544655
#> Inputs Books.I2 Staff.I3
#> Outputs Regist.O1 Borrow.O2
#> nInputs 2
#> nOutputs 2
#> nVariables 4
#> nEfficients 3
#> Eff. Mean 0.721061510571196
#> Eff. sd 0.18362174896548
#> Eff. Min. 0.226006153331096
#> Eff. 1st Qu. 0.615379209850103
#> Eff. Median 0.715262000660375
#> Eff. 3rd Qu. 0.846322575997714
#> Eff. Max. 1
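The figures in this summary can be cross-checked with a few lines of base R. The sketch below copies the loads printed above by hand and checks two regularities visible in this output: the input loads and the output loads each average 1, and the model load coincides with the smallest variable load. These are observations about this particular output, not a statement of how the package defines the quantities.

```r
# Loads copied by hand from the summary output above.
input_loads  <- c(Books.I2 = 1.14669924544655, Staff.I3 = 0.853300754553448)
output_loads <- c(Regist.O1 = 0.853300754553448, Borrow.O2 = 1.14669924544655)
model_load   <- 0.853300754553448

# Loads of each type average 1 (up to rounding in the printed values).
avg_ok <- isTRUE(all.equal(mean(input_loads), 1)) &&
  isTRUE(all.equal(mean(output_loads), 1))

# The model load matches the smallest load among all variables.
min_ok <- isTRUE(all.equal(min(c(input_loads, output_loads)), model_load))

# No variable falls below the 0.6 threshold in this four-variable model.
all_relevant <- all(c(input_loads, output_loads) >= 0.6)

c(avg_ok = avg_ok, min_ok = min_ok, all_relevant = all_relevant)
```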
References
1. Universidad de Cádiz, fernando.fernandez@uca.es
2. Universidad de Cádiz, manuel.munoz@uca.es