Introducing the tidyr pivot_longer()

When working with a dataframe, it’s usually preferable that you respect the philosophy introduced by Hadley Wickham about the Tidy data where each variable has its own column and each observation its own row.

So, what would you do when you encounter a “non-tidy” dataframe ?

You would surely consider using the tidyr package which provides two important functions: pivot_longer() and pivot_wider(). In this short tutorial, we’ll introduce the pivot_longer(). This function transforms several columns into a column name and a column value, thus making the dataframe longer.

Let’s dive into one example. We use the recent wpp2017 package which provides several interesting datasets to work with. In our case, we’ll work with the migration data:

library(tidyverse)
library(wpp2017)

data("migration")
head(migration)
##   country_code                                                        name
## 1          900                                                       WORLD
## 2          901                                      More developed regions
## 3          902                                      Less developed regions
## 4          941                                   Least developed countries
## 5          934 Less developed regions, excluding least developed countries
## 6          948                     Less developed regions, excluding China
##   1950-1955 1955-1960 1960-1965 1965-1970 1970-1975 1975-1980 1980-1985
## 1     0.000     0.000     0.000     0.000     0.000     0.000     0.000
## 2   698.340   192.409  2529.843  3356.689  6885.076  6685.832  5853.938
## 3  -698.340  -192.409 -2529.843 -3356.689 -6885.076 -6685.832 -5853.938
## 4  -527.004  -578.341  -887.150  -910.978 -4714.401 -4014.041 -5916.591
## 5  -171.336   385.932 -1642.693 -2445.711 -2170.675 -2671.791    62.653
## 6  -515.298  -258.091 -1655.344 -3771.595 -5841.591 -6825.244 -5777.852
##   1985-1990  1990-1995  1995-2000  2000-2005  2005-2010  2010-2015  2015-2020
## 1     0.000      0.000      0.000      0.000      0.000      0.000      0.000
## 2  7329.668  10967.514  13403.910  15655.945  16523.476  11057.230  12063.829
## 3 -7329.668 -10967.514 -13403.910 -15655.945 -16523.476 -11057.230 -12063.829
## 4 -3673.886   1156.029  -3924.069  -4028.563  -9451.006  -5849.633  -5247.802
## 5 -3655.782 -12123.543  -9479.841 -11627.382  -7072.470  -5207.597  -6816.027
## 6 -7234.352 -10243.722 -13146.073 -13941.946 -14514.712  -9646.026 -10760.776
##    2020-2025  2025-2030  2030-2035  2035-2040  2040-2045  2045-2050  2050-2055
## 1      0.000      0.000      0.000      0.000      0.000      0.000      0.000
## 2  11196.036  11716.194  11721.195  11721.191  11721.182  11720.975  11134.952
## 3 -11196.036 -11716.194 -11721.195 -11721.191 -11721.182 -11720.975 -11134.952
## 4  -4181.017  -4265.865  -4290.865  -4255.856  -4255.859  -4255.857  -4043.059
## 5  -7015.019  -7450.329  -7430.330  -7465.335  -7465.323  -7465.118  -7091.893
## 6  -9811.973 -10396.010 -10395.657 -10395.621 -10398.686 -10400.527  -9880.330
##    2055-2060 2060-2065 2065-2070 2070-2075 2075-2080 2080-2085 2085-2090
## 1      0.000     0.000     0.000     0.000     0.000     0.000     0.000
## 2  10548.927  9962.853  9376.811  8790.773  8204.720  7618.665  7032.616
## 3 -10548.927 -9962.853 -9376.811 -8790.773 -8204.720 -7618.665 -7032.616
## 4  -3830.303 -3617.497 -3404.688 -3191.896 -2979.101 -2766.336 -2553.509
## 5  -6718.624 -6345.356 -5972.123 -5598.877 -5225.619 -4852.329 -4479.107
## 6  -9360.328 -8840.264 -8320.235 -7800.192 -7280.191 -6760.214 -6240.219
##   2090-2095 2095-2100
## 1     0.000     0.000
## 2  6446.573  5860.558
## 3 -6446.573 -5860.558
## 4 -2340.726 -2127.959
## 5 -4105.847 -3732.599
## 6 -5720.207 -5200.325

We can check the structure of our data using glimpse():

migration %>% glimpse()
## Rows: 241
## Columns: 32
## $ country_code <int> 900, 901, 902, 941, 934, 948, 1503, 1517, 1502, 1501, ...
## $ name         <chr> "WORLD", "More developed regions", "Less developed reg...
## $ `1950-1955`  <dbl> 0.000, 698.340, -698.340, -527.004, -171.336, -515.298...
## $ `1955-1960`  <dbl> 0.000, 192.409, -192.409, -578.341, 385.932, -258.091,...
## $ `1960-1965`  <dbl> 0.000, 2529.843, -2529.843, -887.150, -1642.693, -1655...
## $ `1965-1970`  <dbl> 0.000, 3356.689, -3356.689, -910.978, -2445.711, -3771...
## $ `1970-1975`  <dbl> 0.000, 6885.076, -6885.076, -4714.401, -2170.675, -584...
## $ `1975-1980`  <dbl> 0.000, 6685.832, -6685.832, -4014.041, -2671.791, -682...
## $ `1980-1985`  <dbl> 0.000, 5853.938, -5853.938, -5916.591, 62.653, -5777.8...
## $ `1985-1990`  <dbl> 0.000, 7329.668, -7329.668, -3673.886, -3655.782, -723...
## $ `1990-1995`  <dbl> 0.000, 10967.514, -10967.514, 1156.029, -12123.543, -1...
## $ `1995-2000`  <dbl> 0.000, 13403.910, -13403.910, -3924.069, -9479.841, -1...
## $ `2000-2005`  <dbl> 0.000, 15655.945, -15655.945, -4028.563, -11627.382, -...
## $ `2005-2010`  <dbl> 0.000, 16523.476, -16523.476, -9451.006, -7072.470, -1...
## $ `2010-2015`  <dbl> 0.000, 11057.230, -11057.230, -5849.633, -5207.597, -9...
## $ `2015-2020`  <dbl> 0.000, 12063.829, -12063.829, -5247.802, -6816.027, -1...
## $ `2020-2025`  <dbl> 0.000, 11196.036, -11196.036, -4181.017, -7015.019, -9...
## $ `2025-2030`  <dbl> 0.000, 11716.194, -11716.194, -4265.865, -7450.329, -1...
## $ `2030-2035`  <dbl> 0.000, 11721.195, -11721.195, -4290.865, -7430.330, -1...
## $ `2035-2040`  <dbl> 0.000, 11721.191, -11721.191, -4255.856, -7465.335, -1...
## $ `2040-2045`  <dbl> 0.000, 11721.182, -11721.182, -4255.859, -7465.323, -1...
## $ `2045-2050`  <dbl> 0.000, 11720.975, -11720.975, -4255.857, -7465.118, -1...
## $ `2050-2055`  <dbl> 0.000, 11134.952, -11134.952, -4043.059, -7091.893, -9...
## $ `2055-2060`  <dbl> 0.000, 10548.927, -10548.927, -3830.303, -6718.624, -9...
## $ `2060-2065`  <dbl> 0.000, 9962.853, -9962.853, -3617.497, -6345.356, -884...
## $ `2065-2070`  <dbl> 0.000, 9376.811, -9376.811, -3404.688, -5972.123, -832...
## $ `2070-2075`  <dbl> 0.000, 8790.773, -8790.773, -3191.896, -5598.877, -780...
## $ `2075-2080`  <dbl> 0.000, 8204.720, -8204.720, -2979.101, -5225.619, -728...
## $ `2080-2085`  <dbl> 0.000, 7618.665, -7618.665, -2766.336, -4852.329, -676...
## $ `2085-2090`  <dbl> 0.000, 7032.616, -7032.616, -2553.509, -4479.107, -624...
## $ `2090-2095`  <dbl> 0.000, 6446.573, -6446.573, -2340.726, -4105.847, -572...
## $ `2095-2100`  <dbl> 0.000, 5860.558, -5860.558, -2127.959, -3732.599, -520...

You see that the dates column should preferably be structured into two columns: A column that we may call time_period and a calumn that we call net_migration_value. With pivot_longer() its extremely easy to do that:

migration %>% pivot_longer(
  
  cols = 3:32,  # The columns that we want to pivot
  
  names_to = "time_period", # the name of the new column name
  
  values_to = "net_migration_value", # the name of the new column value
  
  
)
## # A tibble: 7,230 x 4
##    country_code name  time_period net_migration_value
##           <int> <chr> <chr>                     <dbl>
##  1          900 WORLD 1950-1955                     0
##  2          900 WORLD 1955-1960                     0
##  3          900 WORLD 1960-1965                     0
##  4          900 WORLD 1965-1970                     0
##  5          900 WORLD 1970-1975                     0
##  6          900 WORLD 1975-1980                     0
##  7          900 WORLD 1980-1985                     0
##  8          900 WORLD 1985-1990                     0
##  9          900 WORLD 1990-1995                     0
## 10          900 WORLD 1995-2000                     0
## # ... with 7,220 more rows

And that’s it 😄 !

Avatar
Mohamed El Fodil Ihaddaden
Ph.D candidate in Economics.

My research interests include Performance Management, Efficiency Analysis and Experimental Economics.

Related