Introduction to the Subset function

Although I’m a big fan of the tidyverse philosophy of handling and wrangling data, one must admit that there are quit powerful functions in base R. One of theses functions is subset() which returns a dataframe according to some defined subsetting properties. Let’s dive into one example using the simple mtcars data:

head(mtcars)  # A quick look at the mtcars data
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Suppose, we want to extract all the vehicules that have an mpg greater than 20:

subset(mtcars, mpg > 20)  # the first argument is the dataframe
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

You can observe that the subset function is flexible. We don’t have to specify the column name with the dollar sign (mtcars$mpg).

Let’s take another more complex example. We will extract all vehicules that have an mpg superior to 30 and a cyl equal to 4:

subset(mtcars, mpg > 30 & cyl == 4) # & <=> AND
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

As you can see, the subset function works smoothly with the R logical expressions.

Finally, there is another important argument that we must see. Suppose, we want to extract a specified column, let’s say that we’re solely interested in the wt variable. We can extract this column depending using the select argument:

subset(mtcars, mpg > 30 & cyl == 4, select = wt)
##                   wt
## Fiat 128       2.200
## Honda Civic    1.615
## Toyota Corolla 1.835
## Lotus Europa   1.513

In the same way, we can extract several columns :

subset(mtcars, mpg > 30 & cyl == 4, select = c(wt, disp, am))
##                   wt disp am
## Fiat 128       2.200 78.7  1
## Honda Civic    1.615 75.7  1
## Toyota Corolla 1.835 71.1  1
## Lotus Europa   1.513 95.1  1
Avatar
Mohamed El Fodil Ihaddaden
Ph.D candidate in Economics.

My research interests include Performance Management, Efficiency Analysis and Experimental Economics.

Related