Clean your Data with the janitor package

Last updated on Sep 15, 2019 6 min read

The janitor package in an excellent tool for cleaning a messy database and summarize information. We’ve all heard that data analysts allocate about 70% of their time to data cleaning and wrangling. With janitor, this is straightforward. In this tutorial, we’ll present some (not all !!!) janitor cool functions. Of course, we use the mtcars R base dataset to provide for reproducible examples. Let’s have a look to the variables.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now, just for the fun, we rename our columns in a messy way :

library(tidyverse)


mtcars_messy <- mtcars %>% rename(
  "MILES PER! Galon" = mpg, 
  "NUMBER OF cylinders" = cyl, 
  "WEIGHT !!!!???" = wt
)

names(mtcars_messy)

##  [1] "MILES PER! Galon"    "NUMBER OF cylinders" "disp"               
##  [4] "hp"                  "drat"                "WEIGHT !!!!???"     
##  [7] "qsec"                "vs"                  "am"                 
## [10] "gear"                "carb"

One must admit that we usually have to deal with this kind of variables’ names. Tired of renaming them one by one ? one answer : janitor::clean_names()

library(janitor)

mtcars_cleaned <- mtcars_messy %>% clean_names()

names(mtcars_cleaned)

##  [1] "miles_per_galon"     "number_of_cylinders" "disp"               
##  [4] "hp"                  "drat"                "weight"             
##  [7] "qsec"                "vs"                  "am"                 
## [10] "gear"                "carb"

Pretty cool isn’t ? There is another interesting tool in janitor. The get_dupes function allows us to detect duplicates within a dataframe. Just to check how it works, we create duplicated rows:

mtcars <- rownames_to_column(mtcars, var = "car_name") # Transforming our rows' names to a column entitled "car_name"

mtcars_dupli <- mtcars %>% rbind(mtcars[5:7, ])    # We're  duplicating the 5, 6 and 7th rows

nrow(mtcars_dupli) # We have indeed 35 rows (3 more than the classic mtcars)

## [1] 35

How can we find duplicates ? easy !

mtcars_dupli %>% get_dupes(car_name)

##            car_name dupe_count  mpg cyl disp  hp drat   wt  qsec vs am gear
## 1        Duster 360          2 14.3   8  360 245 3.21 3.57 15.84  0  0    3
## 2        Duster 360          2 14.3   8  360 245 3.21 3.57 15.84  0  0    3
## 3 Hornet Sportabout          2 18.7   8  360 175 3.15 3.44 17.02  0  0    3
## 4 Hornet Sportabout          2 18.7   8  360 175 3.15 3.44 17.02  0  0    3
## 5           Valiant          2 18.1   6  225 105 2.76 3.46 20.22  1  0    3
## 6           Valiant          2 18.1   6  225 105 2.76 3.46 20.22  1  0    3
##   carb
## 1    4
## 2    4
## 3    2
## 4    2
## 5    1
## 6    1

As you see, the function even indicates the duplication counts (in our case 2 as we’ve added one duplicate to each observation)

I adorn_() tabyl()

With janitor, one can also get valuable information in a tabular form. The tabyl() function works approximately like the table base R but offers a much better aesthetic. Let’s check the differences:

mtcars_dupli$car_name %>% table()

## .
##         AMC Javelin  Cadillac Fleetwood          Camaro Z28   Chrysler Imperial 
##                   1                   1                   1                   1 
##          Datsun 710    Dodge Challenger          Duster 360        Ferrari Dino 
##                   1                   1                   2                   1 
##            Fiat 128           Fiat X1-9      Ford Pantera L         Honda Civic 
##                   1                   1                   1                   1 
##      Hornet 4 Drive   Hornet Sportabout Lincoln Continental        Lotus Europa 
##                   1                   2                   1                   1 
##       Maserati Bora           Mazda RX4       Mazda RX4 Wag            Merc 230 
##                   1                   1                   1                   1 
##           Merc 240D            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   1 
##          Merc 450SL         Merc 450SLC    Pontiac Firebird       Porsche 914-2 
##                   1                   1                   1                   1 
##      Toyota Corolla       Toyota Corona             Valiant          Volvo 142E 
##                   1                   1                   2                   1

not so georgous, I agree. Now, let’s see what tabyl() function does :

mtcars_dupli$car_name %>% tabyl()

##                    . n    percent
##          AMC Javelin 1 0.02857143
##   Cadillac Fleetwood 1 0.02857143
##           Camaro Z28 1 0.02857143
##    Chrysler Imperial 1 0.02857143
##           Datsun 710 1 0.02857143
##     Dodge Challenger 1 0.02857143
##           Duster 360 2 0.05714286
##         Ferrari Dino 1 0.02857143
##             Fiat 128 1 0.02857143
##            Fiat X1-9 1 0.02857143
##       Ford Pantera L 1 0.02857143
##          Honda Civic 1 0.02857143
##       Hornet 4 Drive 1 0.02857143
##    Hornet Sportabout 2 0.05714286
##  Lincoln Continental 1 0.02857143
##         Lotus Europa 1 0.02857143
##        Maserati Bora 1 0.02857143
##            Mazda RX4 1 0.02857143
##        Mazda RX4 Wag 1 0.02857143
##             Merc 230 1 0.02857143
##            Merc 240D 1 0.02857143
##             Merc 280 1 0.02857143
##            Merc 280C 1 0.02857143
##           Merc 450SE 1 0.02857143
##           Merc 450SL 1 0.02857143
##          Merc 450SLC 1 0.02857143
##     Pontiac Firebird 1 0.02857143
##        Porsche 914-2 1 0.02857143
##       Toyota Corolla 1 0.02857143
##        Toyota Corona 1 0.02857143
##              Valiant 2 0.05714286
##           Volvo 142E 1 0.02857143

We’re not done yet ! This function can also be applied to generate contingency tables. Suppose, we want to get the number of cylinders according to the type of transmission (0 = automatic, 1 = manuel) :

mtcars_dupli %>% tabyl(am, cyl)

##  am 4 6  8
##   0 3 5 14
##   1 8 3  2

We get the information but it’s not very pretty right ? don’t panic, the janitor package provides the adorn_ functions which generate a variety of summary information :

mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col"))  # get the rows and columns sum

##     am  4 6  8 Total
##      0  3 5 14    22
##      1  8 3  2    13
##  Total 11 8 16    35

mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages("all") %>%            # getting the proportions
  adorn_pct_formatting() %>%              # formatting values in %
  adorn_ns(position = "rear") %>%         # diplaying counts "behind" the previously calculated percentages
  adorn_title(placement = "combined")     # display a "combined" title (am/cyl) (checkout "top" for another view)

##  am/cyl          4         6          8       Total
##       0  8.6%  (3) 14.3% (5) 40.0% (14)  62.9% (22)
##       1 22.9%  (8)  8.6% (3)  5.7%  (2)  37.1% (13)
##   Total 31.4% (11) 22.9% (8) 45.7% (16) 100.0% (35)

Finally, it’s possible to convert our results into great looking HTML tables using the kable function from the knitr package :

library(knitr)
mtcars_dupli %>% tabyl(am, cyl) %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages("all") %>%            
  adorn_pct_formatting() %>%              
  adorn_ns(position = "rear") %>%         
  adorn_title(placement = "combined") %>% 
  kable()

am/cyl	4	6	8	Total
0	8.6% (3)	14.3% (5)	40.0% (14)	62.9% (22)
1	22.9% (8)	8.6% (3)	5.7% (2)	37.1% (13)
Total	31.4% (11)	22.9% (8)	45.7% (16)	100.0% (35)

That’s it for this tutorial nevertheless I highly recommend taking a look to the official package’s vignette.

Dr. Mohamed El Fodil Ihaddaden, Ph.D

My research interests include Performance Management, Efficiency Analysis and Experimental Economics.