wordcloud: Fish movement / dispersal

I recently discovered an interesting R script by Kay Cichini that automatically extracts the results of a Google Scholar search (for searches with up to 1000 hits). The extracted results can then be used, for example, to analyze the occurrence and frequency of words in the publications’ titles. These frequencies can be presented in a so-called wordcloud, which illustrates how complex a topic is and outlines the most important words connected to a specific scientific search term.

The following wordcloud shows the result of the Google Scholar search term “allintitle: fish dispersal OR movement”, which returns approximately 874 hits. The wordcloud and the title extraction were conducted with a slightly modified version of Kay’s R script “GScholarScraper”, which can be found here. All credit for this script goes to the original author, Kay Cichini. The “fish dispersal OR movement” wordcloud has been made publicly available at http://dx.doi.org/10.6084/m9.figshare.718144.

Wordcloud for Google Scholar Search: "allintitle: fish dispersal OR movement".

And here is the modified script:

# File-Name: GScholarScraper_3.1.R modified
######## Modified: Johannes Radinger 2013-06-05 ########
# Date: 2012-08-22
# Author: Kay Cichini
# Email: ---@gmail.com
# Purpose: Scrape Google Scholar search result
# Packages used: XML
# Licence: CC BY-SA-NC
# Arguments:
# (1) input:
# A search string as used in Google Scholar search dialog
# (2) write:
# Logical, should a table be written to the user's default directory?
# if TRUE ("T") a CSV-file with hyperlinks to the publications will be created.
# Difference to version 3:
# (3) added "since" argument - define year since when publications should be returned..
# defaults to 1900..
# (4) added "citation" argument - logical, if "0" citations are included
# defaults to "1" and no citations will be included..
# added field "YEAR" to output 
# Caveat: if a submitted search string gives more than 1000 hits there seem
# to be some problems (I guess I'm being stopped by Google for roboting the site..)
# And, there is an issue with this error message:
# > Error in htmlParse(URL): 
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured out this one yet.. most likely also a Google blocking mechanism..
# Reconnecting / new IP-address helps..

library(XML)

GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

  # putting together the search-URL:
  URL <- paste("http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=", 
               citation, "&as_ylo=", since, sep = "")
  cat("\nThe URL used is: ", "\n----\n", paste("* ", URL, " *", sep = ""))

  # get content and parse it:
  doc <- htmlParse(URL)

  # number of hits:
  h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
  h2 <- strsplit(h1, " ")[[1]][2]
  num <- as.integer(gsub("[[:punct:]]", "", h2))
  cat("\n\nNumber of hits: ", num, "\n----\n",
      "If this number is far from the returned results\nsomething might have gone wrong..\n\n", sep = "")

  # If there are no results, stop and throw an error message:
  if (is.na(num) || num == 0) {
    stop("\n\n...There is no result for the submitted search string!")
  }

  # 100 results are fetched per page (see '&num=100' below):
  pages.max <- ceiling(num/100)

  # 'start' as used in URL:
  start <- 100 * 1:pages.max - 100

  # Collect URLs as vector:
  URLs <- paste("http://scholar.google.com/scholar?start=", start, "&q=", input, 
                "&num=100&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")

  scraper_internal <- function(x) {

    doc <- htmlParse(x, encoding = "UTF-8")

    # titles:
    tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)

    # publication:
    pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)

    # links (NA where a result title is not hyperlinked,
    # so lengths always match the titles):
    lin <- xpathSApply(doc, "//h3[@class='gs_rt']", function(h) {
      a <- getNodeSet(h, "./a")
      if (length(a) == 0) NA else xmlGetAttr(a[[1]], "href")
    })

    # summaries are truncated, and thus won't be used..
    # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
    # ..to be extended for individual needs
    data.frame(TITLES = tit, PUBLICATION = pub,
               YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
               LINKS = lin, stringsAsFactors = FALSE)
  }

  result <- do.call("rbind", lapply(URLs, scraper_internal))
  if (write == T) {
    result$LINKS <- paste("=Hyperlink(", "\"", result$LINKS, "\"", ")", sep = "")
    write.table(result, "GScholar_Output.CSV", sep = ";",
                row.names = F, quote = F)
  }
  return(result)
}

# my example (spaces must be URL-encoded as '+'):
input <- "allintitle:+fish+dispersal+OR+movement"
df <- GScholar_Scraper(input, citation = 0)

# Histogram of publication years:
year_freq <- df$YEAR[df$YEAR <= 2013]
breaks <- seq(min(year_freq, na.rm = TRUE), max(year_freq, na.rm = TRUE))
hist(year_freq, breaks = breaks, xlab = "Year",
     main = "Publications about 'fish dispersal OR movement'")

######## WORDCLOUD ########

library(tm)
library(wordcloud)
library(RColorBrewer)

corpus <- Corpus(VectorSource(df$TITLES))
corpus <- tm_map(corpus, function(x) removeWords(x, c(stopwords(), "PDF", "B", "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove words containing numbers
# (grepl keeps all rows when nothing matches, unlike d[-grep(...), ]):
d <- d[!grepl("[0-9]", d$word), ]

# print wordcloud:
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:4)]
wordcloud(words = d$word, freq = sqrt(d$freq), scale = c(4.5, 0.1),
          min.freq = sqrt(5), random.order = FALSE, rot.per = .2, colors = pal)
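To keep a copy of the figure (e.g. for uploading to figshare), the wordcloud call can be wrapped in a graphics device. A minimal sketch; the filename is an assumed example, not part of the original script:

```r
# Minimal sketch: write the wordcloud to a PNG file.
# The filename "fish_dispersal_wordcloud.png" is an assumed example.
png("fish_dispersal_wordcloud.png", width = 800, height = 800)
wordcloud(words = d$word, freq = sqrt(d$freq), scale = c(4.5, 0.1),
          min.freq = sqrt(5), random.order = FALSE, rot.per = .2, colors = pal)
dev.off()
```

The same pattern works with `pdf()` or `svg()` if a vector format is preferred.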
