Feature or Variable Selection in Data Envelopment Analysis with Linear Programming
Fernando Fernandez-Palacin¹
Manuel Munoz-Marquez²
2024-11-12
Introduction
Variable selection in Data Envelopment Analysis (DEA) is a crucial consideration that requires careful attention before the results of an analysis can be applied in a real-world context. This is because the outcomes can vary significantly depending on the variables included in the model. Therefore, variable selection is a fundamental step in every DEA application.
One method proposed for this purpose is feature selection. It constructs a linear programming problem that maximizes an objective function related to the efficiencies of the decision-making units (DMUs).
The fsdea function implements the feature selection method described in the article (Benítez-Peña, Bogetoft, and Romero Morales 2020).
Let’s load and inspect the “tokyo_libraries” dataset using the following code:
data(tokyo_libraries)
head(tokyo_libraries)
#> Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1 2.249 163.523 26 49.196 5.561 105.321
#> 2 4.617 338.671 30 78.599 18.106 314.682
#> 3 3.873 281.655 51 176.381 16.498 542.349
#> 4 5.541 400.993 78 189.397 30.810 847.872
#> 5 11.381 363.116 69 192.235 57.279 758.704
#> 6 10.086 541.658 114 194.091 66.137 1438.746
Let’s take the variables Area.I1, Books.I2, Staff.I3 and Populations.I4 as inputs, and Regist.O1 and Borrow.O2 as outputs.
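The split into inputs and outputs can be sketched as follows. This is a minimal illustration, not the package's own code: the data frame libs is a hypothetical stand-in built only from the six rows printed by head() above, and the names input_names and output_names are introduced here for clarity. With the full dataset loaded, the same column selection would apply to tokyo_libraries directly.

```r
# Stand-in for the dataset: only the six rows shown by head() above.
# (Assumption: with the full data, replace `libs` by `tokyo_libraries`.)
libs <- data.frame(
  Area.I1        = c(2.249, 4.617, 3.873, 5.541, 11.381, 10.086),
  Books.I2       = c(163.523, 338.671, 281.655, 400.993, 363.116, 541.658),
  Staff.I3       = c(26, 30, 51, 78, 69, 114),
  Populations.I4 = c(49.196, 78.599, 176.381, 189.397, 192.235, 194.091),
  Regist.O1      = c(5.561, 18.106, 16.498, 30.810, 57.279, 66.137),
  Borrow.O2      = c(105.321, 314.682, 542.349, 847.872, 758.704, 1438.746)
)

# Column names of the inputs and outputs, as listed in the text.
input_names  <- c("Area.I1", "Books.I2", "Staff.I3", "Populations.I4")
output_names <- c("Regist.O1", "Borrow.O2")

# Split the data frame into the two blocks used by the later calls.
input  <- libs[, input_names]
output <- libs[, output_names]

# Quick check of the resulting structure.
str(input)
str(output)
```

Selecting columns by name rather than by position keeps the split correct even if the column order of the data frame changes.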
First, let’s do a standard DEA analysis with:
dea <- dea(input, output)
dea
#> [1] 0.3500108 0.7918292 0.5733000 0.7186833 1.0000000 1.0000000 0.6967419
#> [8] 0.5803315 1.0000000 0.7051438 0.5689146 0.7583527 0.7474946 0.7215430
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.7849437 1.0000000
Summarizing the DEA model gives, among other statistics, the average efficiency:
summary(dea)
#>
#> Model name
#> Orientation input
#> Inputs Area.I1 Books.I2 Staff.I3 Populations.I4
#> Outputs Regist.O1 Borrow.O2
#> nInputs 4
#> nOutputs 2
#> nVariables 6
#> nEfficients 6
#> Eff. Mean 0.775919227646031
#> Eff. sd 0.174702408743164
#> Eff. Min. 0.350010840234134
#> Eff. 1st Qu. 0.700942885344481
#> Eff. Median 0.784943740381793
#> Eff. 3rd Qu. 0.924285790399849
#> Eff. Max. 1
Reducing the number of variables
Suppose that we want a model with only 5 variables. The following call does the job:
dea5v <- fsdea(input, output, nvariables = 5)
dea5v
#> [1] 0.3428216 0.7918292 0.5733000 0.7164871 1.0000000 1.0000000 0.6967419
#> [8] 0.5803315 1.0000000 0.7051438 0.5689146 0.7583527 0.7219103 0.7215430
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.7849437 1.0000000
#> Selected inputs : Books.I2, Staff.I3, Populations.I4
#> Selected outputs: Regist.O1, Borrow.O2
We can calculate the average efficiency for the new model by summarizing it:
summary(dea5v)
#>
#> Model name
#> Orientation input
#> nInputs 4
#> nOutputs 2
#> nVariables 6
#> nEfficients 6
#> Eff. Mean 0.77439880335151
#> Eff. sd 0.175803112249109
#> Eff. Min. 0.34282160923055
#> Eff. 1st Qu. 0.700942885344481
#> Eff. Median 0.784943740381793
#> Eff. 3rd Qu. 0.924285790399849
#> Eff. Max. 1
#> iSelected Books.I2, Staff.I3, Populations.I4
#> oSelected Regist.O1, Borrow.O2
Observe that the average efficiency has decreased slightly, from 0.7759192 to 0.7743988. This could indicate that the new model is not an improvement over the previous one.
To delve deeper into the results, we can plot the efficiencies:
The graph reveals that most efficiencies are very similar, a fact confirmed by the correlation coefficient of 0.9995329, which is very close to 1.
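The comparison between the two models can be reproduced in base R. The sketch below is an illustration under stated assumptions: the vectors eff_full and eff_5v are hypothetical names, filled with the efficiency scores copied from the printed output above (rounded to seven decimals), so the results should match the Eff. Mean values in the two summaries and the reported correlation only up to rounding.

```r
# Efficiency scores copied from the printed output of the full model.
eff_full <- c(0.3500108, 0.7918292, 0.5733000, 0.7186833, 1.0000000,
              1.0000000, 0.6967419, 0.5803315, 1.0000000, 0.7051438,
              0.5689146, 0.7583527, 0.7474946, 0.7215430, 0.8440736,
              0.5822710, 1.0000000, 0.7867065, 1.0000000, 0.8485716,
              0.7872304, 0.7849437, 1.0000000)

# Efficiency scores copied from the printed output of the 5-variable model.
eff_5v <- c(0.3428216, 0.7918292, 0.5733000, 0.7164871, 1.0000000,
            1.0000000, 0.6967419, 0.5803315, 1.0000000, 0.7051438,
            0.5689146, 0.7583527, 0.7219103, 0.7215430, 0.8440736,
            0.5822710, 1.0000000, 0.7867065, 1.0000000, 0.8485716,
            0.7872304, 0.7849437, 1.0000000)

# Mean efficiency of each model.
mean_full <- mean(eff_full)
mean_5v   <- mean(eff_5v)
print(c(full = mean_full, five_vars = mean_5v))

# Pearson correlation between the two sets of scores.
r_5v <- cor(eff_full, eff_5v)
print(r_5v)

# Scatter plot; points on the dashed diagonal have identical scores.
plot(eff_full, eff_5v,
     xlab = "Efficiency, full model",
     ylab = "Efficiency, 5-variable model",
     xlim = c(0, 1), ylim = c(0, 1))
abline(a = 0, b = 1, lty = 2)
```

Points below the diagonal correspond to DMUs whose efficiency dropped after the variable was removed.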
Reducing the number of outputs
In the previous case, reducing the number of variables removed one input. The same result is obtained with the call fsdea(input, output, ninputs = 3). However, perhaps our goal is to remove an output. Let’s achieve this with the following call:
dea1o <- fsdea(input, output, noutputs = 1)
dea1o
#> [1] 0.3026132 0.6425505 0.5733000 0.7186833 0.6733832 1.0000000 0.6967419
#> [8] 0.4476942 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430
#> [15] 0.7832606 0.5822710 0.8451129 0.7867065 1.0000000 0.8485716 0.7387768
#> [22] 0.7849437 1.0000000
#> Selected inputs : Area.I1, Books.I2, Staff.I3, Populations.I4
#> Selected outputs: Borrow.O2
Observe that, again, the average efficiency has decreased, this time from 0.7759192 to 0.7276021.
To delve deeper into the results, we can plot the new efficiencies:
In this case, the differences are larger than those between the previous two models. This is confirmed by a lower correlation coefficient of 0.8910575.
All of this could indicate that, to enhance the model, it may be necessary to remove multiple variables simultaneously.
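The same check can be sketched for the one-output model. As before, this is only an illustration: eff_full and eff_1o are hypothetical names holding the scores copied from the printed output above, so the mean and the correlation should agree with the reported figures up to rounding.

```r
# Efficiency scores copied from the printed output of the full model.
eff_full <- c(0.3500108, 0.7918292, 0.5733000, 0.7186833, 1.0000000,
              1.0000000, 0.6967419, 0.5803315, 1.0000000, 0.7051438,
              0.5689146, 0.7583527, 0.7474946, 0.7215430, 0.8440736,
              0.5822710, 1.0000000, 0.7867065, 1.0000000, 0.8485716,
              0.7872304, 0.7849437, 1.0000000)

# Efficiency scores copied from the printed output of the one-output model.
eff_1o <- c(0.3026132, 0.6425505, 0.5733000, 0.7186833, 0.6733832,
            1.0000000, 0.6967419, 0.4476942, 1.0000000, 0.7051438,
            0.5336592, 0.7583527, 0.5915395, 0.7215430, 0.7832606,
            0.5822710, 0.8451129, 0.7867065, 1.0000000, 0.8485716,
            0.7387768, 0.7849437, 1.0000000)

# Mean efficiency of the reduced model and correlation with the full model.
mean_1o <- mean(eff_1o)
r_1o    <- cor(eff_full, eff_1o)
print(c(mean = mean_1o, correlation = r_1o))

# Indices of the DMUs whose score changed when the output was dropped.
changed <- which(abs(eff_full - eff_1o) > 1e-6)
print(changed)

# Scatter plot; points below the dashed diagonal lost efficiency.
plot(eff_full, eff_1o,
     xlab = "Efficiency, full model",
     ylab = "Efficiency, one-output model",
     xlim = c(0, 1), ylim = c(0, 1))
abline(a = 0, b = 1, lty = 2)
```

Listing the changed DMUs makes it easier to see which units depended on the removed output for their efficiency score.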
References
¹ Universidad de Cádiz, fernando.fernandez@uca.es
² Universidad de Cádiz, manuel.munoz@uca.es