Clean your Data with the janitor package

The janitor package in an excellent tool for cleaning a messy database and summarize information. We’ve all heard that data analysts allocate about 70% of their time to data cleaning and wrangling. With janitor, this is straightforward. In this tutorial, we’ll present some (not all !!!) janitor cool functions. Of course, we use the mtcars R base dataset to provide for reproducible examples. Let’s have a look to the variables.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, just for the fun, we rename our columns in a messy way :

library(tidyverse)


mtcars_messy <- mtcars %>% rename(
  "MILES PER! Galon" = mpg, 
  "NUMBER OF cylinders" = cyl, 
  "WEIGHT !!!!???" = wt
)

names(mtcars_messy)
##  [1] "MILES PER! Galon"    "NUMBER OF cylinders" "disp"               
##  [4] "hp"                  "drat"                "WEIGHT !!!!???"     
##  [7] "qsec"                "vs"                  "am"                 
## [10] "gear"                "carb"

One must admit that we usually have to deal with this kind of variables’ names. Tired of renaming them one by one ? one answer : janitor::clean_names()

library(janitor)

mtcars_cleaned <- mtcars_messy %>% clean_names()

names(mtcars_cleaned)
##  [1] "miles_per_galon"     "number_of_cylinders" "disp"               
##  [4] "hp"                  "drat"                "weight"             
##  [7] "qsec"                "vs"                  "am"                 
## [10] "gear"                "carb"

Pretty cool isn’t ? There is another interesting tool in janitor. The get_dupes function allows us to detect duplicates within a dataframe. Just to check how it works, we create duplicated rows:

mtcars <- rownames_to_column(mtcars, var = "car_name") # Transforming our rows' names to a column entitled "car_name"

mtcars_dupli <- mtcars %>% rbind(mtcars[5:7, ])    # We're  duplicating the 5, 6 and 7th rows

nrow(mtcars_dupli) # We have indeed 35 rows (3 more than the classic mtcars)
## [1] 35

How can we find duplicates ? easy !

mtcars_dupli %>% get_dupes(car_name) 
## # A tibble: 6 x 13
##   car_name dupe_count   mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##   <chr>         <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Duster ~          2  14.3     8   360   245  3.21  3.57  15.8     0     0
## 2 Duster ~          2  14.3     8   360   245  3.21  3.57  15.8     0     0
## 3 Hornet ~          2  18.7     8   360   175  3.15  3.44  17.0     0     0
## 4 Hornet ~          2  18.7     8   360   175  3.15  3.44  17.0     0     0
## 5 Valiant           2  18.1     6   225   105  2.76  3.46  20.2     1     0
## 6 Valiant           2  18.1     6   225   105  2.76  3.46  20.2     1     0
## # ... with 2 more variables: gear <dbl>, carb <dbl>

As you see, the function even indicates the duplication counts (in our case 2 as we’ve added one duplicate to each observation)

I adorn_() tabyl()

With janitor, one can also get valuable information in a tabular form. The tabyl() function works approximately like the table base R but offers a much better aesthetic. Let’s check the differences:

mtcars_dupli$car_name %>% table()
## .
##         AMC Javelin  Cadillac Fleetwood          Camaro Z28 
##                   1                   1                   1 
##   Chrysler Imperial          Datsun 710    Dodge Challenger 
##                   1                   1                   1 
##          Duster 360        Ferrari Dino            Fiat 128 
##                   2                   1                   1 
##           Fiat X1-9      Ford Pantera L         Honda Civic 
##                   1                   1                   1 
##      Hornet 4 Drive   Hornet Sportabout Lincoln Continental 
##                   1                   2                   1 
##        Lotus Europa       Maserati Bora           Mazda RX4 
##                   1                   1                   1 
##       Mazda RX4 Wag            Merc 230           Merc 240D 
##                   1                   1                   1 
##            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1 
##          Merc 450SL         Merc 450SLC    Pontiac Firebird 
##                   1                   1                   1 
##       Porsche 914-2      Toyota Corolla       Toyota Corona 
##                   1                   1                   1 
##             Valiant          Volvo 142E 
##                   2                   1

not so georgous, I agree. Now, let’s see what tabyl() function does :

mtcars_dupli$car_name %>% tabyl()
##                    . n    percent
##          AMC Javelin 1 0.02857143
##   Cadillac Fleetwood 1 0.02857143
##           Camaro Z28 1 0.02857143
##    Chrysler Imperial 1 0.02857143
##           Datsun 710 1 0.02857143
##     Dodge Challenger 1 0.02857143
##           Duster 360 2 0.05714286
##         Ferrari Dino 1 0.02857143
##             Fiat 128 1 0.02857143
##            Fiat X1-9 1 0.02857143
##       Ford Pantera L 1 0.02857143
##          Honda Civic 1 0.02857143
##       Hornet 4 Drive 1 0.02857143
##    Hornet Sportabout 2 0.05714286
##  Lincoln Continental 1 0.02857143
##         Lotus Europa 1 0.02857143
##        Maserati Bora 1 0.02857143
##            Mazda RX4 1 0.02857143
##        Mazda RX4 Wag 1 0.02857143
##             Merc 230 1 0.02857143
##            Merc 240D 1 0.02857143
##             Merc 280 1 0.02857143
##            Merc 280C 1 0.02857143
##           Merc 450SE 1 0.02857143
##           Merc 450SL 1 0.02857143
##          Merc 450SLC 1 0.02857143
##     Pontiac Firebird 1 0.02857143
##        Porsche 914-2 1 0.02857143
##       Toyota Corolla 1 0.02857143
##        Toyota Corona 1 0.02857143
##              Valiant 2 0.05714286
##           Volvo 142E 1 0.02857143

We’re not done yet ! This function can also be applied to generate contingency tables. Suppose, we want to get the number of cylinders according to the type of transmission (0 = automatic, 1 = manuel) :

mtcars_dupli %>% tabyl(am, cyl)
##  am 4 6  8
##   0 3 5 14
##   1 8 3  2

We get the information but it’s not very pretty right ? don’t panic, the janitor package provides the adorn_ functions which generate a variety of summary information :

mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col"))  # get the rows and columns sum
##     am  4 6  8 Total
##      0  3 5 14    22
##      1  8 3  2    13
##  Total 11 8 16    35
mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages("all") %>%            # getting the proportions
  adorn_pct_formatting() %>%              # formatting values in %
  adorn_ns(position = "rear") %>%         # diplaying counts "behind" the previously calculated percentages
  adorn_title(placement = "combined")     # display a "combined" title (am/cyl) (checkout "top" for another view)
##  am/cyl          4         6          8       Total
##       0  8.6%  (3) 14.3% (5) 40.0% (14)  62.9% (22)
##       1 22.9%  (8)  8.6% (3)  5.7%  (2)  37.1% (13)
##   Total 31.4% (11) 22.9% (8) 45.7% (16) 100.0% (35)

Finally, it’s possible to convert our results into great looking HTML tables using the kable function from the knitr package :

library(knitr)
mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages("all") %>%            
  adorn_pct_formatting() %>%              
  adorn_ns(position = "rear") %>%         
  adorn_title(placement = "combined") %>% 
  kable() 
am/cyl 4 6 8 Total
0 8.6% (3) 14.3% (5) 40.0% (14) 62.9% (22)
1 22.9% (8) 8.6% (3) 5.7% (2) 37.1% (13)
Total 31.4% (11) 22.9% (8) 45.7% (16) 100.0% (35)

That’s it for this tutorial nevertheless I highly recommend taking a look to the official package’s vignette.

Avatar
Mohamed El Fodil Ihaddaden
Ph.D candidate in Economics.

My research interests include Performance Management, Data Envelopment Analysis and Artificial Intelligence applied to Economics.