(Very) Easy Web Scraping with ralger

ralger is a package that aims to make web scraping in R as easy as possible. To scrape data, you only need two elements: the link of the web page and the HTML or CSS node that references the information you need. Don’t panic, you don’t have to spend hours learning HTML and CSS. You can just use the SelectorGadget Chrome extension. You can check out this tutorial for more information.
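To make the idea of a "node" more concrete, here is a minimal sketch of how a CSS selector picks elements out of a page. It uses the rvest package directly on a small HTML fragment; the fragment, its class names and the names in it are invented for illustration, and this is not necessarily how scrap() is implemented internally.

```r
library(rvest)

# Hypothetical HTML fragment, made up for this illustration
html <- '
<div class="nominee"><a href="/a">Alice</a></div>
<div class="nominee"><a href="/b">Bob</a></div>
<div class="other"><a href="/c">Carol</a></div>
'

# Parse the fragment (on a real site you would pass the page URL instead)
page <- read_html(html)

# ".nominee a" selects every <a> element nested inside an element of class "nominee"
found <- html_nodes(page, ".nominee a")

# Keep only the text of the selected elements
names_found <- html_text(found)
names_found
```

Only the two links inside the "nominee" blocks are selected; the link inside the "other" block is ignored because it does not match the selector.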

scrap()

Let’s dive into an example! Suppose we want to extract all Golden Globes Best Actress nominees (including the winner). In ralger you need only two elements:

The link: https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama

The node: .primary-nominee a

And that’s it, we’re ready to scrape!

library(ralger)

data <- scrap(
  "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama", 
  ".primary-nominee a"
)

data
##  [1] "Renée Zellweger"    "Cynthia Erivo"      "Scarlett Johansson"
##  [4] "Saoirse Ronan"      "Charlize Theron"    "Glenn Close"       
##  [7] "Lady Gaga"          "Nicole Kidman"      "Melissa McCarthy"  
## [10] "Rosamund Pike"      "Frances McDormand"  "Meryl Streep"      
## [13] "Michelle Williams"  "Jessica Chastain"   "Sally Hawkins"     
## [16] "Isabelle Huppert"   "Amy Adams"          "Jessica Chastain"  
## [19] "Ruth Negga"         "Natalie Portman"

Pretty simple, right? I hope so. Anyway, the problem here is that the main page displays only 20 nominees, from 2017 to 2020. What if we wanted to extract all nominees in history? We’d have to go through multiple pages of the website (page indices 0 to 20 in the code below). To do so, we use paste() in conjunction with scrap() as follows:

link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page=" # Mind the change in the link structure "page=" 

node <- ".primary-nominee a" # we use the same node as previously

data_all <- scrap(paste(link, 0:20, sep = ""), node)

data_all
##   [1] "Renée Zellweger"      "Cynthia Erivo"        "Scarlett Johansson"  
##   [4] "Saoirse Ronan"        "Charlize Theron"      "Glenn Close"         
##   [7] "Rosamund Pike"        "Lady Gaga"            "Nicole Kidman"       
##  [10] "Melissa McCarthy"     "Frances McDormand"    "Jessica Chastain"    
##  [13] "Sally Hawkins"        "Meryl Streep"         "Michelle Williams"   
##  [16] "Isabelle Huppert"     "Ruth Negga"           "Natalie Portman"     
##  [19] "Amy Adams"            "Jessica Chastain"     "Brie Larson"         
##  [22] "Saoirse Ronan"        "Cate Blanchett"       "Rooney Mara"         
##  [25] "Alicia Vikander"      "Julianne Moore"       "Reese Witherspoon"   
##  [28] "Jennifer Aniston"     "Felicity Jones"       "Rosamund Pike"       
##  [31] "Cate Blanchett"       "Sandra Bullock"       "Judi Dench"          
##  [34] "Kate Winslet"         "Emma Thompson"        "Jessica Chastain"    
##  [37] "Naomi Watts"          "Rachel Weisz"         "Marion Cotillard"    
##  [40] "Helen Mirren"         "Meryl Streep"         "Glenn Close"         
##  [43] "Viola Davis"          "Rooney Mara"          "Tilda Swinton"       
##  [46] "Natalie Portman"      "Nicole Kidman"        "Jennifer Lawrence"   
##  [49] "Michelle Williams"    "Halle Berry"          "Sandra Bullock"      
##  [52] "Emily Blunt"          "Helen Mirren"         "Carey Mulligan"      
##  [55] "Gabourey Sidibe"      "Kate Winslet"         "Meryl Streep"        
##  [58] "Anne Hathaway"        "Angelina Jolie"       "Kristin Scott Thomas"
##  [61] "Julie Christie"       "Jodie Foster"         "Angelina Jolie"      
##  [64] "Keira Knightley"      "Cate Blanchett"       "Helen Mirren"        
##  [67] "Judi Dench"           "Maggie Gyllenhaal"    "Kate Winslet"        
##  [70] "Penélope Cruz"        "Felicity Huffman"     "Maria Bello"         
##  [73] "Gwyneth Paltrow"      "Charlize Theron"      "Ziyi Zhang"          
##  [76] "Hilary Swank"         "Uma Thurman"          "Scarlett Johansson"  
##  [79] "Nicole Kidman"        "Imelda Staunton"      "Charlize Theron"     
##  [82] "Cate Blanchett"       "Scarlett Johansson"   "Nicole Kidman"       
##  [85] "Uma Thurman"          "Evan Rachel Wood"     "Nicole Kidman"       
##  [88] "Julianne Moore"       "Meryl Streep"         "Salma Hayek"         
##  [91] "Diane Lane"           "Sissy Spacek"         "Halle Berry"         
##  [94] "Judi Dench"           "Nicole Kidman"        "Deep End, The"       
##  [97] "Julia Roberts"        "Joan Allen"           "Björk"               
## [100] "Ellen Burstyn"        "Laura Linney"         "Hilary Swank"        
## [103] "Annette Bening"       "Julianne Moore"       "Meryl Streep"        
## [106] "Sigourney Weaver"     "Cate Blanchett"       "Fernanda Montenegro" 
## [109] "Susan Sarandon"       "Meryl Streep"         "Emily Watson"        
## [112] "Judi Dench"           "Helena Bonham Carter" "Jodie Foster"        
## [115] "Jessica Lange"        "Kate Winslet"         "Brenda Blethyn"      
## [118] "Courtney Love"        "Kristin Scott Thomas" "Meryl Streep"        
## [121] "Emily Watson"         "Sharon Stone"         "Susan Sarandon"      
## [124] "Elisabeth Shue"       "Meryl Streep"         "Emma Thompson"       
## [127] "Jessica Lange"        "Meryl Streep"         "Jodie Foster"        
## [130] "Jennifer Jason Leigh" "Miranda Richardson"   "Holly Hunter"        
## [133] "Juliette Binoche"     "Michelle Pfeiffer"    "Emma Thompson"       
## [136] "Debra Winger"         "Emma Thompson"        "Mary McDonnell"      
## [139] "Michelle Pfeiffer"    "Susan Sarandon"       "Sharon Stone"        
## [142] "Jodie Foster"         "Annette Bening"       "Geena Davis"         
## [145] "Laura Dern"           "Susan Sarandon"       "Kathy Bates"         
## [148] "Anjelica Huston"      "Michelle Pfeiffer"    "Susan Sarandon"      
## [151] "Joanne Woodward"      "Michelle Pfeiffer"    "Sally Field"         
## [154] "Jessica Lange"        "Andie MacDowell"      "Liv Ullmann"         
## [157] "Sigourney Weaver"     "Jodie Foster"         "Shirley MacLaine"    
## [160] "Christine Lahti"      "Meryl Streep"         "Sally Kirkland"      
## [163] "Rachel Levin"         "Barbra Streisand"     "Glenn Close"         
## [166] "Faye Dunaway"         "Marlee Matlin"        "Julie Andrews"       
## [169] "Anne Bancroft"        "Farrah Fawcett"       "Sigourney Weaver"    
## [172] "Whoopi Goldberg"      "Anne Bancroft"        "Cher"                
## [175] "Geraldine Page"       "Meryl Streep"         "Sally Field"         
## [178] "Sissy Spacek"         "Diane Keaton"         "Jessica Lange"       
## [181] "Vanessa Redgrave"     "Shirley MacLaine"     "Debra Winger"        
## [184] "Jane Alexander"       "Bonnie Bedelia"       "Meryl Streep"        
## [187] "Meryl Streep"         "Diane Keaton"         "Jessica Lange"       
## [190] "Sissy Spacek"         "Debra Winger"         "Meryl Streep"        
## [193] "Diane Keaton"         "Sissy Spacek"         "Sally Field"         
## [196] "Katharine Hepburn"    "Mary Tyler Moore"     "Ellen Burstyn"       
## [199] "Nastassja Kinski"     "Deborah Raffin"       "Gena Rowlands"       
## [202] "Sally Field"          "Jill Clayburgh"       "Lisa Eichhorn"       
## [205] "Jane Fonda"           "Marsha Mason"         "Jane Fonda"          
## [208] "Geraldine Page"       "Ingrid Bergman"       "Jill Clayburgh"      
## [211] "Glenda Jackson"       "Jane Fonda"           "Anne Bancroft"       
## [214] "Diane Keaton"         "Kathleen Quinlan"     "Gena Rowlands"       
## [217] "Faye Dunaway"         "Glenda Jackson"       "Sarah Miles"         
## [220] "Talia Shire"          "Liv Ullmann"          "Louise Fletcher"     
## [223] "Faye Dunaway"         "Marilyn Hassett"      "Glenda Jackson"      
## [226] "Karen Black"          "Gena Rowlands"        "Ellen Burstyn"       
## [229] "Faye Dunaway"         "Valerie Perrine"      "Liv Ullmann"         
## [232] "Marsha Mason"         "Ellen Burstyn"        "Barbra Streisand"    
## [235] "Elizabeth Taylor"     "Joanne Woodward"      "Liv Ullmann"         
## [238] "Tuesday Weld"         "Joanne Woodward"      "Diana Ross"          
## [241] "Cicely Tyson"         "Trish Van Devere"     "Jane Fonda"          
## [244] "Vanessa Redgrave"     "Jessica Walter"       "Dyan Cannon"         
## [247] "Glenda Jackson"       "Ali MacGraw"          "Faye Dunaway"        
## [250] "Glenda Jackson"       "Melina Mercouri"      "Sarah Miles"         
## [253] "Geneviève Bujold"     "Jane Fonda"           "Liza Minnelli"       
## [256] "Jean Simmons"         "Maggie Smith"         "Joanne Woodward"     
## [259] "Beryl Reid"           "Mia Farrow"           "Katharine Hepburn"   
## [262] "Vanessa Redgrave"     "Edith Evans"          "Faye Dunaway"        
## [265] "Audrey Hepburn"       "Katharine Hepburn"    "Anne Heywood"        
## [268] "Anouk Aimée"          "Ida Kaminska"         "Virginia McKenna"    
## [271] "Elizabeth Taylor"     "Natalie Wood"         "Samantha Eggar"      
## [274] "Simone Signoret"      "Maggie Smith"         "Julie Christie"      
## [277] "Elizabeth Hartman"    "Anne Bancroft"        "Ava Gardner"         
## [280] "Rita Hayworth"        "Geraldine Page"       "Jean Seberg"         
## [283] "Leslie Caron"         "Marina Vlady"         "Natalie Wood"        
## [286] "Polly Bergen"         "Geraldine Page"       "Rachel Roberts"      
## [289] "Romy Schneider"       "Alida Valli"          "Geraldine Page"      
## [292] "Glynis Johns"         "Melina Mercouri"      "Lee Remick"          
## [295] "Susan Strasberg"      "Shelley Winters"      "Susannah York"       
## [298] "Anne Bancroft"        "Bette Davis"          "Katharine Hepburn"   
## [301] "Geraldine Page"       "Leslie Caron"         "Shirley MacLaine"    
## [304] "Claudia McNeil"       "Natalie Wood"         "Greer Garson"        
## [307] "Doris Day"            "Nancy Kwan"           "Jean Simmons"        
## [310] "Elizabeth Taylor"     "Elizabeth Taylor"     "Lee Remick"          
## [313] "Simone Signoret"      "Katharine Hepburn"    "Audrey Hepburn"      
## [316] "Susan Hayward"        "Ingrid Bergman"       "Deborah Kerr"        
## [319] "Shirley MacLaine"     "Jean Simmons"         "Joanne Woodward"     
## [322] "Marlene Dietrich"     "Deborah Kerr"         "Anna Magnani"        
## [325] "Eva Marie Saint"      "Audrey Hepburn"       "Katharine Hepburn"   
## [328] "Carroll Baker"        "Ingrid Bergman"       "Helen Hayes"         
## [331] "Anna Magnani"         "Grace Kelly"          "Audrey Hepburn"      
## [334] "Shirley Booth"        "Joan Crawford"        "Olivia de Havilland" 
## [337] "Jane Wyman"           "Vivien Leigh"         "Shelley Winters"     
## [340] "Gloria Swanson"       "Bette Davis"          "Olivia de Havilland" 
## [343] "Deborah Kerr"         "Jane Wyman"           "Rosalind Russell"    
## [346] "Rosalind Russell"     "Ingrid Bergman"       "Ingrid Bergman"      
## [349] "Jennifer Jones"
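The paste() call above is vectorized: the single link is combined with each value of 0:20, which gives one URL per page index. As a quick sanity check before scraping, the generated links can be inspected on their own. This is a small base R sketch that only builds the strings and does not contact the website.

```r
# Base link taken from the example above; the page index is appended to it
link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page="

# paste() recycles the single link across the 21 values of 0:20
urls <- paste(link, 0:20, sep = "")

length(urls)   # number of generated links
head(urls, 2)  # first two links, ending in "page=0" and "page=1"
tail(urls, 1)  # last link, ending in "page=20"

# paste0() is a base R shorthand for paste(..., sep = "")
identical(urls, paste0(link, 0:20))
```

Note that 0:20 contains 21 values, so 21 links are passed to scrap().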

tidy_scrap()

Now, imagine that we need a data frame composed of two columns:

  • Actress: The names of Golden Globe Best Actress Nominees,
  • Movie: The movie title for which they were nominated.

To construct our data frame we’ll use the tidy_scrap() function as follows:

links <- paste(link, 0:20, sep = "") # The links to all result pages (page indices 0 to 20)

nodes <- c(".primary-nominee a", ".secondary-nominee")

column_names <- c("Actress", "Movie")


global_df <- tidy_scrap(links, nodes, column_names)
## Warning in (function (..., deparse.level = 1) : number of rows of result is not
## a multiple of vector length (arg 2)
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
head(global_df, n = 10)
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 10 x 2
##    Actress            Movie                   
##    <chr>              <chr>                   
##  1 Renée Zellweger    Judy                    
##  2 Cynthia Erivo      Harriet                 
##  3 Scarlett Johansson Marriage Story          
##  4 Saoirse Ronan      Little Women            
##  5 Charlize Theron    Bombshell               
##  6 Glenn Close        Wife, The               
##  7 Rosamund Pike      Private War, A          
##  8 Lady Gaga          Star Is Born, A (2018)  
##  9 Nicole Kidman      Destroyer               
## 10 Melissa McCarthy   Can You Ever Forgive Me?
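The first warning shown above ("number of rows of result is not a multiple of vector length") has the same wording as the one base R's cbind() emits when it binds vectors of different lengths and recycles the shorter one. That suggests the two nodes may not have returned the same number of elements, although this is an assumption about what happens inside tidy_scrap(), not something confirmed here. The sketch below reproduces that kind of warning with made-up vectors and shows a simple length check.

```r
# Made-up vectors of unequal length, standing in for two scraped columns
actress <- c("A", "B", "C")
movie   <- c("X", "Y")

# Capture the warning instead of printing it, so it can be inspected
warning_text <- NULL
combined <- withCallingHandlers(
  cbind(Actress = actress, Movie = movie),
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

warning_text  # the message raised by cbind()
combined      # the shorter vector is recycled: row 3 reuses "X"

# Comparing lengths up front reveals the mismatch before binding
lengths(list(Actress = actress, Movie = movie))
```

If the lengths differ on real data, it is worth checking the rows near the end of the data frame, since recycled values can silently pair an actress with the wrong movie.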
Mohamed El Fodil Ihaddaden
Ph.D candidate in Economics.

My research interests include Performance Management, Efficiency Analysis and Experimental Economics.