(Very) Easy Web Scraping with ralger

ralger is a package that aims to make web scraping in R as easy as possible. To scrape data, you need only two elements: the link to the web page and the HTML or CSS node that references the information you need. Don’t panic, you don’t have to spend hours learning HTML and CSS. You can simply use the SelectorGadget Chrome extension to find the node. You can check out this tutorial for more information.

scrap()

Let’s dive into an example! Suppose we want to extract all Golden Globes Best Actress nominees (including the winner). In ralger you need only two elements:

The link: https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama

The node: .primary-nominee a

And that’s it, we’re ready to scrape!

library(ralger)

data <- scrap(
  "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama", 
  ".primary-nominee a"
)

data
##  [1] "Renée Zellweger"    "Cynthia Erivo"      "Scarlett Johansson"
##  [4] "Saoirse Ronan"      "Charlize Theron"    "Glenn Close"       
##  [7] "Rosamund Pike"      "Lady Gaga"          "Nicole Kidman"     
## [10] "Melissa McCarthy"   "Frances McDormand"  "Jessica Chastain"  
## [13] "Sally Hawkins"      "Meryl Streep"       "Michelle Williams" 
## [16] "Isabelle Huppert"   "Ruth Negga"         "Natalie Portman"   
## [19] "Amy Adams"          "Jessica Chastain"
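
To make the role of the node more concrete, here is a minimal sketch of what a CSS selector such as .primary-nominee a picks out of a page. It assumes the rvest package is installed, and it runs on a small hypothetical HTML fragment (the markup and the placeholder names are invented for illustration, not copied from the Golden Globes site), so no network access is needed:

```r
library(rvest)

# Hypothetical HTML fragment, loosely imitating a nominee listing
html <- '<html><body>
  <div class="primary-nominee"><a href="/person/a">Actress A</a></div>
  <div class="secondary-nominee">Movie A</div>
  <div class="primary-nominee"><a href="/person/b">Actress B</a></div>
  <div class="secondary-nominee">Movie B</div>
</body></html>'

page <- read_html(html)

# ".primary-nominee a" selects every <a> nested inside an element
# carrying the class "primary-nominee"
actresses <- html_text(html_nodes(page, ".primary-nominee a"))

# ".secondary-nominee" selects the elements with that class directly
movies <- html_text(html_nodes(page, ".secondary-nominee"))

actresses
movies
```

The selector only describes where the text sits in the page structure, which is why the same node can be reused on every page that shares that layout.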

Pretty simple, right? I hope so. Anyway, the problem here is that the main page displays only 20 nominees, from 2017 to 2020. What if we wanted to extract all nominees in history? You’re right, we’d have to go through multiple pages of the website (the code below requests pages 0 to 20). In this context, we need to use paste() in conjunction with scrap() as follows:

link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page=" # Note the "?page=" suffix: a page index is appended to it below

node <- ".primary-nominee a" # we use the same node as previously

data_all <- scrap(paste(link, 0:20, sep = ""), node)

data_all
##   [1] "Renée Zellweger"      "Scarlett Johansson"   "Saoirse Ronan"       
##   [4] "Charlize Theron"      "Cynthia Erivo"        "Glenn Close"         
##   [7] "Lady Gaga"            "Nicole Kidman"        "Melissa McCarthy"    
##  [10] "Rosamund Pike"        "Frances McDormand"    "Sally Hawkins"       
##  [13] "Meryl Streep"         "Michelle Williams"    "Jessica Chastain"    
##  [16] "Isabelle Huppert"     "Amy Adams"            "Jessica Chastain"    
##  [19] "Ruth Negga"           "Natalie Portman"      "Brie Larson"         
##  [22] "Cate Blanchett"       "Rooney Mara"          "Alicia Vikander"     
##  [25] "Saoirse Ronan"        "Julianne Moore"       "Jennifer Aniston"    
##  [28] "Felicity Jones"       "Rosamund Pike"        "Reese Witherspoon"   
##  [31] "Cate Blanchett"       "Judi Dench"           "Kate Winslet"        
##  [34] "Emma Thompson"        "Sandra Bullock"       "Jessica Chastain"    
##  [37] "Marion Cotillard"     "Helen Mirren"         "Naomi Watts"         
##  [40] "Rachel Weisz"         "Meryl Streep"         "Glenn Close"         
##  [43] "Viola Davis"          "Rooney Mara"          "Tilda Swinton"       
##  [46] "Natalie Portman"      "Michelle Williams"    "Halle Berry"         
##  [49] "Nicole Kidman"        "Jennifer Lawrence"    "Sandra Bullock"      
##  [52] "Emily Blunt"          "Helen Mirren"         "Carey Mulligan"      
##  [55] "Gabourey Sidibe"      "Kate Winslet"         "Kristin Scott Thomas"
##  [58] "Meryl Streep"         "Anne Hathaway"        "Angelina Jolie"      
##  [61] "Julie Christie"       "Keira Knightley"      "Cate Blanchett"      
##  [64] "Jodie Foster"         "Angelina Jolie"       "Helen Mirren"        
##  [67] "Kate Winslet"         "Penélope Cruz"        "Judi Dench"          
##  [70] "Maggie Gyllenhaal"    "Felicity Huffman"     "Maria Bello"         
##  [73] "Gwyneth Paltrow"      "Charlize Theron"      "Ziyi Zhang"          
##  [76] "Hilary Swank"         "Uma Thurman"          "Scarlett Johansson"  
##  [79] "Nicole Kidman"        "Imelda Staunton"      "Charlize Theron"     
##  [82] "Cate Blanchett"       "Scarlett Johansson"   "Nicole Kidman"       
##  [85] "Uma Thurman"          "Evan Rachel Wood"     "Nicole Kidman"       
##  [88] "Julianne Moore"       "Meryl Streep"         "Salma Hayek"         
##  [91] "Diane Lane"           "Sissy Spacek"         "Halle Berry"         
##  [94] "Judi Dench"           "Nicole Kidman"        "Deep End, The"       
##  [97] "Julia Roberts"        "Joan Allen"           "Björk"               
## [100] "Ellen Burstyn"        "Laura Linney"         "Hilary Swank"        
## [103] "Meryl Streep"         "Sigourney Weaver"     "Annette Bening"      
## [106] "Julianne Moore"       "Cate Blanchett"       "Fernanda Montenegro" 
## [109] "Susan Sarandon"       "Meryl Streep"         "Emily Watson"        
## [112] "Judi Dench"           "Helena Bonham Carter" "Jodie Foster"        
## [115] "Jessica Lange"        "Kate Winslet"         "Brenda Blethyn"      
## [118] "Emily Watson"         "Courtney Love"        "Kristin Scott Thomas"
## [121] "Meryl Streep"         "Sharon Stone"         "Meryl Streep"        
## [124] "Emma Thompson"        "Susan Sarandon"       "Elisabeth Shue"      
## [127] "Jessica Lange"        "Jodie Foster"         "Jennifer Jason Leigh"
## [130] "Miranda Richardson"   "Meryl Streep"         "Holly Hunter"        
## [133] "Juliette Binoche"     "Michelle Pfeiffer"    "Emma Thompson"       
## [136] "Debra Winger"         "Emma Thompson"        "Sharon Stone"        
## [139] "Mary McDonnell"       "Michelle Pfeiffer"    "Susan Sarandon"      
## [142] "Jodie Foster"         "Annette Bening"       "Geena Davis"         
## [145] "Laura Dern"           "Susan Sarandon"       "Kathy Bates"         
## [148] "Anjelica Huston"      "Michelle Pfeiffer"    "Susan Sarandon"      
## [151] "Joanne Woodward"      "Michelle Pfeiffer"    "Sally Field"         
## [154] "Jessica Lange"        "Andie MacDowell"      "Liv Ullmann"         
## [157] "Sigourney Weaver"     "Jodie Foster"         "Shirley MacLaine"    
## [160] "Christine Lahti"      "Meryl Streep"         "Sally Kirkland"      
## [163] "Glenn Close"          "Faye Dunaway"         "Rachel Levin"        
## [166] "Barbra Streisand"     "Marlee Matlin"        "Julie Andrews"       
## [169] "Anne Bancroft"        "Farrah Fawcett"       "Sigourney Weaver"    
## [172] "Whoopi Goldberg"      "Anne Bancroft"        "Cher"                
## [175] "Geraldine Page"       "Meryl Streep"         "Sally Field"         
## [178] "Diane Keaton"         "Jessica Lange"        "Vanessa Redgrave"    
## [181] "Sissy Spacek"         "Shirley MacLaine"     "Meryl Streep"        
## [184] "Debra Winger"         "Jane Alexander"       "Bonnie Bedelia"      
## [187] "Meryl Streep"         "Diane Keaton"         "Jessica Lange"       
## [190] "Sissy Spacek"         "Debra Winger"         "Meryl Streep"        
## [193] "Sally Field"          "Katharine Hepburn"    "Diane Keaton"        
## [196] "Sissy Spacek"         "Mary Tyler Moore"     "Gena Rowlands"       
## [199] "Ellen Burstyn"        "Nastassja Kinski"     "Deborah Raffin"      
## [202] "Sally Field"          "Jill Clayburgh"       "Lisa Eichhorn"       
## [205] "Jane Fonda"           "Marsha Mason"         "Jane Fonda"          
## [208] "Ingrid Bergman"       "Jill Clayburgh"       "Glenda Jackson"      
## [211] "Geraldine Page"       "Jane Fonda"           "Anne Bancroft"       
## [214] "Diane Keaton"         "Kathleen Quinlan"     "Gena Rowlands"       
## [217] "Faye Dunaway"         "Glenda Jackson"       "Sarah Miles"         
## [220] "Talia Shire"          "Liv Ullmann"          "Louise Fletcher"     
## [223] "Glenda Jackson"       "Karen Black"          "Faye Dunaway"        
## [226] "Marilyn Hassett"      "Gena Rowlands"        "Ellen Burstyn"       
## [229] "Faye Dunaway"         "Valerie Perrine"      "Liv Ullmann"         
## [232] "Marsha Mason"         "Barbra Streisand"     "Elizabeth Taylor"    
## [235] "Joanne Woodward"      "Ellen Burstyn"        "Liv Ullmann"         
## [238] "Diana Ross"           "Cicely Tyson"         "Trish Van Devere"    
## [241] "Tuesday Weld"         "Joanne Woodward"      "Jane Fonda"          
## [244] "Vanessa Redgrave"     "Jessica Walter"       "Dyan Cannon"         
## [247] "Glenda Jackson"       "Ali MacGraw"          "Faye Dunaway"        
## [250] "Glenda Jackson"       "Melina Mercouri"      "Sarah Miles"         
## [253] "Geneviève Bujold"     "Jane Fonda"           "Liza Minnelli"       
## [256] "Jean Simmons"         "Maggie Smith"         "Joanne Woodward"     
## [259] "Beryl Reid"           "Mia Farrow"           "Katharine Hepburn"   
## [262] "Vanessa Redgrave"     "Edith Evans"          "Faye Dunaway"        
## [265] "Audrey Hepburn"       "Katharine Hepburn"    "Anne Heywood"        
## [268] "Anouk Aimée"          "Natalie Wood"         "Ida Kaminska"        
## [271] "Virginia McKenna"     "Elizabeth Taylor"     "Samantha Eggar"      
## [274] "Julie Christie"       "Elizabeth Hartman"    "Simone Signoret"     
## [277] "Maggie Smith"         "Anne Bancroft"        "Ava Gardner"         
## [280] "Rita Hayworth"        "Geraldine Page"       "Jean Seberg"         
## [283] "Leslie Caron"         "Polly Bergen"         "Geraldine Page"      
## [286] "Rachel Roberts"       "Romy Schneider"       "Alida Valli"         
## [289] "Marina Vlady"         "Natalie Wood"         "Geraldine Page"      
## [292] "Susannah York"        "Anne Bancroft"        "Bette Davis"         
## [295] "Katharine Hepburn"    "Glynis Johns"         "Melina Mercouri"     
## [298] "Lee Remick"           "Susan Strasberg"      "Shelley Winters"     
## [301] "Geraldine Page"       "Claudia McNeil"       "Natalie Wood"        
## [304] "Leslie Caron"         "Shirley MacLaine"     "Greer Garson"        
## [307] "Doris Day"            "Nancy Kwan"           "Jean Simmons"        
## [310] "Elizabeth Taylor"     "Elizabeth Taylor"     "Lee Remick"          
## [313] "Simone Signoret"      "Katharine Hepburn"    "Audrey Hepburn"      
## [316] "Susan Hayward"        "Ingrid Bergman"       "Deborah Kerr"        
## [319] "Shirley MacLaine"     "Jean Simmons"         "Joanne Woodward"     
## [322] "Marlene Dietrich"     "Deborah Kerr"         "Anna Magnani"        
## [325] "Eva Marie Saint"      "Audrey Hepburn"       "Katharine Hepburn"   
## [328] "Carroll Baker"        "Ingrid Bergman"       "Helen Hayes"         
## [331] "Anna Magnani"         "Grace Kelly"          "Audrey Hepburn"      
## [334] "Shirley Booth"        "Joan Crawford"        "Olivia de Havilland" 
## [337] "Jane Wyman"           "Vivien Leigh"         "Shelley Winters"     
## [340] "Gloria Swanson"       "Bette Davis"          "Olivia de Havilland" 
## [343] "Deborah Kerr"         "Jane Wyman"           "Rosalind Russell"    
## [346] "Rosalind Russell"     "Ingrid Bergman"       "Ingrid Bergman"      
## [349] "Jennifer Jones"
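
To check which pages the paste() call above requests, the URLs can be built and inspected without doing any scraping. This is a small sketch using only base R; paste0() is the base shorthand for paste(..., sep = ""):

```r
link <- "https://www.goldenglobes.com/winners-nominees/best-performance-actress-motion-picture-drama?page="

# paste() is vectorised: one URL is produced per page index
links <- paste(link, 0:20, sep = "")

length(links)   # number of URLs generated for indices 0 through 20
head(links, 2)  # the first two URLs end in "page=0" and "page=1"

# paste0() builds the same vector without the explicit sep argument
identical(links, paste0(link, 0:20))
```

Because 0:20 includes both endpoints, the vector holds 21 links, one per page index.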

tidy_scrap()

Now, imagine that we need a data frame composed of two columns:

  • Actress: The names of Golden Globe Best Actress Nominees,
  • Movie: The movie title for which they were nominated.

To construct our data frame we’ll use the tidy_scrap() function as follows:

links <- paste(link, 0:20, sep = "") # The same page links (indices 0 to 20) used above

nodes <- c(".primary-nominee a", ".secondary-nominee")

column_names <- c("Actress", "Movie")


global_df <- tidy_scrap(links, nodes, column_names)
## Warning in (function (..., deparse.level = 1) : number of rows of result is not
## a multiple of vector length (arg 2)
## Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
head(global_df, n = 10)
## # A tibble: 10 x 2
##    Actress            Movie                   
##    <chr>              <chr>                   
##  1 Renée Zellweger    Judy                    
##  2 Scarlett Johansson Marriage Story          
##  3 Saoirse Ronan      Little Women            
##  4 Charlize Theron    Bombshell               
##  5 Cynthia Erivo      Harriet                 
##  6 Glenn Close        Wife, The               
##  7 Lady Gaga          Star Is Born, A (2018)  
##  8 Nicole Kidman      Destroyer               
##  9 Melissa McCarthy   Can You Ever Forgive Me?
## 10 Rosamund Pike      Private War, A
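
The first warning shown above has the wording that base R's cbind() uses when it combines vectors of unequal length, which suggests that the two nodes did not return the same number of elements. The sketch below reproduces that situation with hypothetical vectors (the values are placeholders, and how tidy_scrap() binds its columns internally is an assumption here), then shows one possible way to make the mismatch explicit:

```r
# Hypothetical vectors standing in for the results of two nodes
actress <- c("Actress A", "Actress B", "Actress C")
movie   <- c("Movie A", "Movie B")   # one element short

# Capture the warning that cbind() emits for unequal lengths
warn_msg <- tryCatch(
  {
    cbind(actress, movie)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
warn_msg

# One possible workaround: pad the shorter vector with NA before binding,
# so the missing value is visible instead of being silently recycled
padded_movie <- movie
length(padded_movie) <- length(actress)

df <- data.frame(Actress = actress, Movie = padded_movie)
df
```

Padding only avoids the recycling; it cannot tell which row is actually missing a value, so the pairs still deserve a manual check. In the scrap() output above, for instance, element 96 of data_all is "Deep End, The", a movie title sitting among the actress names.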
Mohamed El Fodil Ihaddaden
Ph.D candidate in Economics.

My research interests include Performance Management, Efficiency Analysis and Experimental Economics.