The janitor package in an excellent tool for cleaning a messy database and summarize information. We’ve all heard that data analysts allocate about 70% of their time to data cleaning and wrangling. With janitor, this is straightforward. In this tutorial, we’ll present some (not all !!!) janitor cool functions. Of course, we use the mtcars R base dataset to provide for reproducible examples. Let’s have a look to the variables.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Now, just for the fun, we rename our columns in a messy way :
library(tidyverse)
mtcars_messy <- mtcars %>% rename(
"MILES PER! Galon" = mpg,
"NUMBER OF cylinders" = cyl,
"WEIGHT !!!!???" = wt
)
names(mtcars_messy)
## [1] "MILES PER! Galon" "NUMBER OF cylinders" "disp"
## [4] "hp" "drat" "WEIGHT !!!!???"
## [7] "qsec" "vs" "am"
## [10] "gear" "carb"
One must admit that we usually have to deal with this kind of variables’ names. Tired of renaming them one by one ? one answer : janitor::clean_names()
library(janitor)
mtcars_cleaned <- mtcars_messy %>% clean_names()
names(mtcars_cleaned)
## [1] "miles_per_galon" "number_of_cylinders" "disp"
## [4] "hp" "drat" "weight"
## [7] "qsec" "vs" "am"
## [10] "gear" "carb"
Pretty cool isn’t ? There is another interesting tool in janitor. The get_dupes function allows us to detect duplicates within a dataframe. Just to check how it works, we create duplicated rows:
mtcars <- rownames_to_column(mtcars, var = "car_name") # Transforming our rows' names to a column entitled "car_name"
mtcars_dupli <- mtcars %>% rbind(mtcars[5:7, ]) # We're duplicating the 5, 6 and 7th rows
nrow(mtcars_dupli) # We have indeed 35 rows (3 more than the classic mtcars)
## [1] 35
How can we find duplicates ? easy !
mtcars_dupli %>% get_dupes(car_name)
## car_name dupe_count mpg cyl disp hp drat wt qsec vs am gear
## 1 Duster 360 2 14.3 8 360 245 3.21 3.57 15.84 0 0 3
## 2 Duster 360 2 14.3 8 360 245 3.21 3.57 15.84 0 0 3
## 3 Hornet Sportabout 2 18.7 8 360 175 3.15 3.44 17.02 0 0 3
## 4 Hornet Sportabout 2 18.7 8 360 175 3.15 3.44 17.02 0 0 3
## 5 Valiant 2 18.1 6 225 105 2.76 3.46 20.22 1 0 3
## 6 Valiant 2 18.1 6 225 105 2.76 3.46 20.22 1 0 3
## carb
## 1 4
## 2 4
## 3 2
## 4 2
## 5 1
## 6 1
As you see, the function even indicates the duplication counts (in our case 2 as we’ve added one duplicate to each observation)
I adorn_() tabyl()
With janitor, one can also get valuable information in a tabular form. The tabyl() function works approximately like the table base R but offers a much better aesthetic. Let’s check the differences:
mtcars_dupli$car_name %>% table()
## .
## AMC Javelin Cadillac Fleetwood Camaro Z28 Chrysler Imperial
## 1 1 1 1
## Datsun 710 Dodge Challenger Duster 360 Ferrari Dino
## 1 1 2 1
## Fiat 128 Fiat X1-9 Ford Pantera L Honda Civic
## 1 1 1 1
## Hornet 4 Drive Hornet Sportabout Lincoln Continental Lotus Europa
## 1 2 1 1
## Maserati Bora Mazda RX4 Mazda RX4 Wag Merc 230
## 1 1 1 1
## Merc 240D Merc 280 Merc 280C Merc 450SE
## 1 1 1 1
## Merc 450SL Merc 450SLC Pontiac Firebird Porsche 914-2
## 1 1 1 1
## Toyota Corolla Toyota Corona Valiant Volvo 142E
## 1 1 2 1
not so georgous, I agree. Now, let’s see what tabyl() function does :
mtcars_dupli$car_name %>% tabyl()
## . n percent
## AMC Javelin 1 0.02857143
## Cadillac Fleetwood 1 0.02857143
## Camaro Z28 1 0.02857143
## Chrysler Imperial 1 0.02857143
## Datsun 710 1 0.02857143
## Dodge Challenger 1 0.02857143
## Duster 360 2 0.05714286
## Ferrari Dino 1 0.02857143
## Fiat 128 1 0.02857143
## Fiat X1-9 1 0.02857143
## Ford Pantera L 1 0.02857143
## Honda Civic 1 0.02857143
## Hornet 4 Drive 1 0.02857143
## Hornet Sportabout 2 0.05714286
## Lincoln Continental 1 0.02857143
## Lotus Europa 1 0.02857143
## Maserati Bora 1 0.02857143
## Mazda RX4 1 0.02857143
## Mazda RX4 Wag 1 0.02857143
## Merc 230 1 0.02857143
## Merc 240D 1 0.02857143
## Merc 280 1 0.02857143
## Merc 280C 1 0.02857143
## Merc 450SE 1 0.02857143
## Merc 450SL 1 0.02857143
## Merc 450SLC 1 0.02857143
## Pontiac Firebird 1 0.02857143
## Porsche 914-2 1 0.02857143
## Toyota Corolla 1 0.02857143
## Toyota Corona 1 0.02857143
## Valiant 2 0.05714286
## Volvo 142E 1 0.02857143
We’re not done yet ! This function can also be applied to generate contingency tables. Suppose, we want to get the number of cylinders according to the type of transmission (0 = automatic, 1 = manuel) :
mtcars_dupli %>% tabyl(am, cyl)
## am 4 6 8
## 0 3 5 14
## 1 8 3 2
We get the information but it’s not very pretty right ? don’t panic, the janitor package provides the adorn_ functions which generate a variety of summary information :
mtcars_dupli %>% tabyl(am, cyl) %>%
adorn_totals(where = c("row", "col")) # get the rows and columns sum
## am 4 6 8 Total
## 0 3 5 14 22
## 1 8 3 2 13
## Total 11 8 16 35
mtcars_dupli %>% tabyl(am, cyl) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages("all") %>% # getting the proportions
adorn_pct_formatting() %>% # formatting values in %
adorn_ns(position = "rear") %>% # diplaying counts "behind" the previously calculated percentages
adorn_title(placement = "combined") # display a "combined" title (am/cyl) (checkout "top" for another view)
## am/cyl 4 6 8 Total
## 0 8.6% (3) 14.3% (5) 40.0% (14) 62.9% (22)
## 1 22.9% (8) 8.6% (3) 5.7% (2) 37.1% (13)
## Total 31.4% (11) 22.9% (8) 45.7% (16) 100.0% (35)
Finally, it’s possible to convert our results into great looking HTML tables using the kable function from the knitr package :
library(knitr)
mtcars_dupli %>% tabyl(am, cyl) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages("all") %>%
adorn_pct_formatting() %>%
adorn_ns(position = "rear") %>%
adorn_title(placement = "combined") %>%
kable()
am/cyl | 4 | 6 | 8 | Total |
---|---|---|---|---|
0 | 8.6% (3) | 14.3% (5) | 40.0% (14) | 62.9% (22) |
1 | 22.9% (8) | 8.6% (3) | 5.7% (2) | 37.1% (13) |
Total | 31.4% (11) | 22.9% (8) | 45.7% (16) | 100.0% (35) |
That’s it for this tutorial nevertheless I highly recommend taking a look to the official package’s vignette.