Feature or Variable Selection in Data Envelopment Analysis with Linear Programming

Fernando Fernandez-Palacin [1]

Manuel Munoz-Marquez [2]

2024-11-12

Introduction

Variable selection in Data Envelopment Analysis (DEA) is a crucial consideration that requires careful attention before the results of an analysis can be applied in a real-world context. This is because the outcomes can vary significantly depending on the variables included in the model. Therefore, variable selection is a fundamental step in every DEA application.

One of the methods proposed for this purpose is feature selection. This method builds a linear programming problem that maximizes an objective function related to the efficiencies of the decision-making units (DMUs).

The fsdea function implements the feature selection method described in Benítez-Peña, Bogetoft, and Romero Morales (2020).
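For orientation, the textbook input-oriented DEA model in multiplier form, under constant returns to scale, is sketched below for one evaluated DMU (index 0). This is the standard model and not the formulation from the cited article; the symbols (inputs x, outputs y, weights v and u, with m inputs, s outputs, and n DMUs) and the scale assumption are notation introduced here for illustration.

```latex
\begin{aligned}
\max_{u,\,v}\quad & \sum_{r=1}^{s} u_r\, y_{r0} \\
\text{s.t.}\quad & \sum_{i=1}^{m} v_i\, x_{i0} = 1, \\
& \sum_{r=1}^{s} u_r\, y_{rj} - \sum_{i=1}^{m} v_i\, x_{ij} \le 0, \qquad j = 1, \dots, n, \\
& u_r \ge 0,\quad v_i \ge 0 .
\end{aligned}
```

The optimal objective value is the efficiency score of the evaluated DMU. A feature selection approach additionally decides which inputs and outputs are allowed to enter such a model; the exact formulation is given in the cited article.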

Let’s load and inspect the “tokyo_libraries” dataset using the following code:

# Load the package that provides dea(), fsdea() and the tokyo_libraries dataset
library(adea)
data(tokyo_libraries)
head(tokyo_libraries)
#>   Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1   2.249  163.523       26         49.196     5.561   105.321
#> 2   4.617  338.671       30         78.599    18.106   314.682
#> 3   3.873  281.655       51        176.381    16.498   542.349
#> 4   5.541  400.993       78        189.397    30.810   847.872
#> 5  11.381  363.116       69        192.235    57.279   758.704
#> 6  10.086  541.658      114        194.091    66.137  1438.746

Let’s take Area.I1, Books.I2, Staff.I3, and Populations.I4 as inputs, and Regist.O1 and Borrow.O2 as outputs:

input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
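Selecting columns by position works here, but it silently breaks if the column order changes. As a minimal sketch, the same split can be derived from the column names, assuming (as in the printed header above) that inputs end in ".I<k>" and outputs in ".O<k>". The vector cols is typed in by hand so that the snippet runs without the package.

```r
# Column names as printed by head(tokyo_libraries) above.
cols <- c("Area.I1", "Books.I2", "Staff.I3", "Populations.I4",
          "Regist.O1", "Borrow.O2")

# Assumption: input names end in ".I<k>" and output names in ".O<k>".
input_cols  <- grep("\\.I[0-9]+$", cols, value = TRUE)
output_cols <- grep("\\.O[0-9]+$", cols, value = TRUE)

print(input_cols)
print(output_cols)

# With the dataset loaded, the equivalent selection would be:
# input  <- tokyo_libraries[, input_cols]
# output <- tokyo_libraries[, output_cols]
```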

First, let’s run a standard DEA analysis:

dea <- dea(input, output)
dea
#>  [1] 0.3500108 0.7918292 0.5733000 0.7186833 1.0000000 1.0000000 0.6967419
#>  [8] 0.5803315 1.0000000 0.7051438 0.5689146 0.7583527 0.7474946 0.7215430
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.7849437 1.0000000

Summarizing the DEA model gives, among other statistics, the average efficiency:

summary(dea)
#>                                                      
#> Model name                                           
#> Orientation                                     input
#> Inputs       Area.I1 Books.I2 Staff.I3 Populations.I4
#> Outputs                           Regist.O1 Borrow.O2
#> nInputs                                             4
#> nOutputs                                            2
#> nVariables                                          6
#> nEfficients                                         6
#> Eff. Mean                           0.775919227646031
#> Eff. sd                             0.174702408743164
#> Eff. Min.                           0.350010840234134
#> Eff. 1st Qu.                        0.700942885344481
#> Eff. Median                         0.784943740381793
#> Eff. 3rd Qu.                        0.924285790399849
#> Eff. Max.                                           1
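The main rows of this summary can be cross-checked in base R. The sketch below copies the scores from the printed output of the standard DEA model (so they are rounded to 7 digits) and recomputes the mean, standard deviation, median, and number of efficient DMUs; the tolerance used to decide that a score equals 1 is an assumption made here.

```r
# Efficiency scores copied from the printed output of the standard DEA model.
eff <- c(0.3500108, 0.7918292, 0.5733000, 0.7186833, 1.0000000, 1.0000000,
         0.6967419, 0.5803315, 1.0000000, 0.7051438, 0.5689146, 0.7583527,
         0.7474946, 0.7215430, 0.8440736, 0.5822710, 1.0000000, 0.7867065,
         1.0000000, 0.8485716, 0.7872304, 0.7849437, 1.0000000)

eff_mean   <- mean(eff)    # compare with the "Eff. Mean" row
eff_sd     <- sd(eff)      # compare with the "Eff. sd" row
eff_median <- median(eff)  # compare with the "Eff. Median" row

# A DMU is counted as efficient when its score equals 1 up to a small tolerance.
n_efficient <- sum(abs(eff - 1) < 1e-6)

print(c(mean = eff_mean, sd = eff_sd, median = eff_median))
print(n_efficient)
```

Up to rounding of the copied scores, these values should agree with the corresponding rows of the summary.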

Reducing the number of variables

Suppose that we want a model with only 5 variables. The following call does the job:

dea5v <- fsdea(input, output, nvariables = 5)
dea5v
#>  [1] 0.3428216 0.7918292 0.5733000 0.7164871 1.0000000 1.0000000 0.6967419
#>  [8] 0.5803315 1.0000000 0.7051438 0.5689146 0.7583527 0.7219103 0.7215430
#> [15] 0.8440736 0.5822710 1.0000000 0.7867065 1.0000000 0.8485716 0.7872304
#> [22] 0.7849437 1.0000000
#> Selected inputs : Books.I2, Staff.I3, Populations.I4
#> Selected outputs: Regist.O1, Borrow.O2

We can calculate the average efficiency for the new model by summarizing it:

summary(dea5v)
#>                                                
#> Model name                                     
#> Orientation                               input
#> nInputs                                       4
#> nOutputs                                      2
#> nVariables                                    6
#> nEfficients                                   6
#> Eff. Mean                      0.77439880335151
#> Eff. sd                       0.175803112249109
#> Eff. Min.                      0.34282160923055
#> Eff. 1st Qu.                  0.700942885344481
#> Eff. Median                   0.784943740381793
#> Eff. 3rd Qu.                  0.924285790399849
#> Eff. Max.                                     1
#> iSelected    Books.I2, Staff.I3, Populations.I4
#> oSelected                  Regist.O1, Borrow.O2

Observe that the average efficiency has decreased slightly, from 0.7759192 to 0.7743988. This could indicate that the new model is not an improvement over the previous one.

To delve deeper into the results, we can plot the efficiencies:

The graph reveals that most efficiencies are very similar, which is confirmed by the correlation coefficient of 0.9995329, very close to 1.
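A comparison of this kind can be reproduced with base R alone. The following sketch uses the scores copied from the two printed outputs (rounded to 7 digits), plots one model against the other, and computes their correlation. The variable names and the plotting choices are illustrative and not taken from the package.

```r
# Scores copied from the printed outputs (full model and 5-variable model).
eff_full <- c(0.3500108, 0.7918292, 0.5733000, 0.7186833, 1.0000000, 1.0000000,
              0.6967419, 0.5803315, 1.0000000, 0.7051438, 0.5689146, 0.7583527,
              0.7474946, 0.7215430, 0.8440736, 0.5822710, 1.0000000, 0.7867065,
              1.0000000, 0.8485716, 0.7872304, 0.7849437, 1.0000000)
eff_5v   <- c(0.3428216, 0.7918292, 0.5733000, 0.7164871, 1.0000000, 1.0000000,
              0.6967419, 0.5803315, 1.0000000, 0.7051438, 0.5689146, 0.7583527,
              0.7219103, 0.7215430, 0.8440736, 0.5822710, 1.0000000, 0.7867065,
              1.0000000, 0.8485716, 0.7872304, 0.7849437, 1.0000000)

# Pearson correlation between the two sets of scores.
r_5v <- cor(eff_full, eff_5v)

# Indices of the DMUs whose score differs between the two models.
changed_5v <- which(eff_full != eff_5v)

# When run as a script, draw on a null device so that no file is written.
if (!interactive()) pdf(NULL)
plot(eff_full, eff_5v,
     xlab = "Efficiency, 6 variables", ylab = "Efficiency, 5 variables")
abline(0, 1, lty = 2)  # points on this line have identical scores
if (!interactive()) dev.off()

print(r_5v)
print(changed_5v)
```

Only the DMUs listed in changed_5v lie off the dashed diagonal, which is why the correlation stays so close to 1.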

Reducing the number of outputs

In the previous case, reducing the number of variables removed one input. The call fsdea(input, output, ninputs = 3) would give the same result. However, perhaps our goal is to drop an output instead. Let’s achieve this with the following call:

dea1o <- fsdea(input, output, noutputs = 1)
dea1o
#>  [1] 0.3026132 0.6425505 0.5733000 0.7186833 0.6733832 1.0000000 0.6967419
#>  [8] 0.4476942 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430
#> [15] 0.7832606 0.5822710 0.8451129 0.7867065 1.0000000 0.8485716 0.7387768
#> [22] 0.7849437 1.0000000
#> Selected inputs : Area.I1, Books.I2, Staff.I3, Populations.I4
#> Selected outputs: Borrow.O2

Observe that, again, the average efficiency has decreased, this time from 0.7759192 to 0.7276021.

To delve deeper into the results, we can plot the new efficiencies:

In this case, the differences are larger than those between the previous two models. This is confirmed by a lower correlation coefficient, 0.8910575.

Taken together, these results could indicate that improving the model may require removing several variables simultaneously.
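The size of the change can be quantified in the same way as before. This sketch copies the scores of the full model and of the one-output model from the printed outputs (rounded to 7 digits), then reports the mean of the reduced model, the correlation, and how many DMUs changed. The variable names are illustrative.

```r
# Scores copied from the printed outputs (full model and one-output model).
eff_full <- c(0.3500108, 0.7918292, 0.5733000, 0.7186833, 1.0000000, 1.0000000,
              0.6967419, 0.5803315, 1.0000000, 0.7051438, 0.5689146, 0.7583527,
              0.7474946, 0.7215430, 0.8440736, 0.5822710, 1.0000000, 0.7867065,
              1.0000000, 0.8485716, 0.7872304, 0.7849437, 1.0000000)
eff_1o   <- c(0.3026132, 0.6425505, 0.5733000, 0.7186833, 0.6733832, 1.0000000,
              0.6967419, 0.4476942, 1.0000000, 0.7051438, 0.5336592, 0.7583527,
              0.5915395, 0.7215430, 0.7832606, 0.5822710, 0.8451129, 0.7867065,
              1.0000000, 0.8485716, 0.7387768, 0.7849437, 1.0000000)

mean_1o <- mean(eff_1o)            # average efficiency of the reduced model
r_1o    <- cor(eff_full, eff_1o)   # Pearson correlation with the full model

# Number of DMUs whose score changed, and whether any score went up.
n_changed     <- sum(eff_full != eff_1o)
any_increased <- any(eff_1o > eff_full)

print(c(mean = mean_1o, cor = r_1o))
print(n_changed)
print(any_increased)
```

In these printed scores no DMU improves when the output is removed. That is what one would expect from DEA in general, since dropping a variable restricts how favorably each unit can be evaluated.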

References

Benítez-Peña, Sandra, Peter Bogetoft, and Dolores Romero Morales. 2020. “Feature Selection in Data Envelopment Analysis: A Mathematical Optimization Approach.” Omega 96: 102068. http://dx.doi.org/10.1016/j.omega.2019.05.004.

  1. Universidad de Cádiz

  2. Universidad de Cádiz

Last modified: Wednesday, 13 November 2024, 09:52