Cengiz Zopluoglu: Compiling Keywords from the Published Articles in Educational and Psychological Measurement

Cengiz Zopluoglu

What are the most commonly used keywords in articles published in educational measurement journals? Are there any trending topics? I decided to run a sample analysis using Educational and Psychological Measurement (EPM). Initially, I planned to compare several journals, including the Journal of Educational Measurement (JEM), Applied Measurement in Education (AME), and Educational Measurement: Issues and Practice (EM:IP). However, I realized that JEM and AME do not require keywords for the papers they publish. So, I focus only on EPM for now; the same approach can be replicated for EM:IP with a few tweaks to the code. This post shows how to automate scraping the desired information from the EPM website, and I plan to write a follow-up post with some analysis of the dataset compiled here.


require(rvest)
require(xml2)
require(knitr)
require(kableExtra)

Scraping Keywords from the EPM Website

First, let’s look at what information we need to scrape from the EPM website and how it is stored there. The table of contents for a given volume and issue can be accessed at https://journals.sagepub.com/toc/epma/i/j, where i is the volume number and j is the issue number. For instance, the link for Volume 79 Issue 2 is https://journals.sagepub.com/toc/epma/79/2. This page lists the titles of the papers published in that issue, with an Abstract link under each title.
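The URL pattern described above can be sketched as a small helper. The function name toc.url is my own choice for illustration, not something from the EPM site; it only builds the string and does not touch the network.

```r
# Hypothetical helper (the name toc.url is an assumption for illustration):
# build the table-of-contents URL for a given EPM volume and issue,
# following the pattern https://journals.sagepub.com/toc/epma/i/j.
toc.url <- function(volume, issue) {
  paste0("https://journals.sagepub.com/toc/epma/", volume, "/", issue)
}

toc.url(79, 2)
# [1] "https://journals.sagepub.com/toc/epma/79/2"
```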

A sample webpage for the table of contents from an EPM issue

At the link for Abstract, one can access the information about the title, authors, publication date, abstract, references, and keywords for each paper.

A sample webpage for the abstract from an EPM paper

Based on my exploration, EPM has been publishing keywords since Volume 63 Issue 1, published in February 2003. That makes 98 issues published with keywords. If there are about 10 articles per issue on average, there may be about 1,000 papers. One option is to extract this information manually by visiting each paper’s page one by one, which would probably take days or weeks. It is therefore wise to automate the process.
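As a quick check on the count of 98 issues, here is the arithmetic spelled out. It assumes, as described later in the post, six issues per volume for Volumes 63 through 78 and two issues of Volume 79 available at the time of writing; the variable names are only illustrative.

```r
first.volume   <- 63   # first volume with keywords (2003)
last.complete  <- 78   # last volume with all six issues published (2018)
issues.per.vol <- 6
issues.in.79   <- 2    # Volume 79 had reached Issue 2 at the time of writing

n.issues <- (last.complete - first.volume + 1) * issues.per.vol + issues.in.79

n.issues        # 98 issues
n.issues * 10   # 980, i.e. roughly 1,000 papers at about 10 papers per issue
```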

Step 1: Retrieving keywords for an article from an abstract link

For the example above, the link for the Abstract is https://journals.sagepub.com/doi/abs/10.1177/0013164418773494. Let’s read this webpage into R.


link = "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
article <- read_html(link)

article


{html_document}
<html lang="en" class="pb-page" data-request-id="76c746d0-aee4-4faa-8964-1b5e3a49f30e">
[1] <head data-pb-dropzone="head">\n<meta http-equiv="Content-Type" ...
[2] <body class="pb-ui">\n<div class="totoplink">\n<a id="skiptocon ...

The object article simply contains the full HTML source code of this webpage. If you want to see it, open the following address in your browser:

view-source:https://journals.sagepub.com/doi/abs/10.1177/0013164418773494

The information we are looking for is somewhere in this source text, and the trick is how to pull it out. At this point, you will need some help from the SelectorGadget app, which can be installed as a Chrome extension. There are plenty of video tutorials on the web about how to use it, so I will skip that. When you open SelectorGadget on the paper’s page and click the box where the keywords are listed, it shows the CSS selector for that element. As can be seen in the screenshot below, the CSS selector for the Keywords box on this page is “.hlFld-KeywordText”.

Finding the CSS selector for the Keyword box in the Abstract page

Now, we can pull that specific piece out of the whole HTML page we read into R earlier.


keywords <- html_text(html_nodes(article,".hlFld-KeywordText"))

keywords


[1] "Keywords multilevel modeling, fixed effects modeling, random coefficients, small samples"

It looks like we got what we wanted, but it needs some polishing. We need to get rid of the word “Keywords” at the beginning, split the comma-separated text into separate strings, and then trim the leading and trailing white space.


keywords  <- substring(keywords,10) 

keywords


[1] "multilevel modeling, fixed effects modeling, random coefficients, small samples"


keywords <- strsplit(keywords,",")[[1]]

keywords


[1] "multilevel modeling"     " fixed effects modeling"
[3] " random coefficients"    " small samples"


keywords <- trimws(keywords)

keywords


[1] "multilevel modeling"    "fixed effects modeling"
[3] "random coefficients"    "small samples"

Nice! So, given a link for a paper abstract, we just retrieved a vector of keywords for that specific paper. Below is a simple function to organize the code above. It takes an abstract link as input and returns the keywords on that link as an output.


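key.words <- function(link){
  
  # Read the abstract page and pull the text of the Keywords box
  article  <- read_html(link)
  keywords <- html_text(html_nodes(article,".hlFld-KeywordText"))
  
  # Drop the leading "Keywords " label (the first 9 characters)
  keywords <- substring(keywords,10)
  
  # Split on commas and trim white space; return NULL if there are no keywords
  if(length(keywords)!=0){
    out <- trimws(strsplit(keywords,",")[[1]])
  } else {
    out <- NULL
  }
  return(out)
  
}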


link = "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"

key.words(link)


[1] "multilevel modeling"    "fixed effects modeling"
[3] "random coefficients"    "small samples"
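The cleaning steps in key.words() rely on the label “Keywords ” being exactly nine characters long. A slightly more defensive variant, sketched below, strips the label with a regular expression only when it is present. The function name parse.keywords is hypothetical, and it works on the raw text that html_text() returns, so it can be tried without any network access.

```r
# Hypothetical helper (not part of the original post): turn the raw text of
# the Keywords box into a character vector of keywords.
parse.keywords <- function(raw) {
  
  # No Keywords box on the page: mirror key.words() and return NULL
  if (length(raw) == 0) return(NULL)
  
  # Remove a leading "Keywords" label, if any, instead of a fixed offset
  cleaned <- sub("^\\s*Keywords\\s*", "", raw[1])
  
  # Split on commas, trim white space, and drop empty pieces
  out <- trimws(strsplit(cleaned, ",", fixed = TRUE)[[1]])
  out[nzchar(out)]
}

raw <- "Keywords multilevel modeling, fixed effects modeling, random coefficients, small samples"
parse.keywords(raw)
```

For the example string this gives the same four keywords as key.words() above. Whether the extra care is needed depends on how consistently the EPM pages format the Keywords box, which I have not checked beyond the examples in this post.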

Step 2: Retrieving abstract links for papers published in an issue

The next task is to extract the abstract links for all papers published in an issue. As mentioned before, the table of contents for each EPM issue follows the format https://journals.sagepub.com/toc/epma/i/j. Suppose we look at the latest issue, https://journals.sagepub.com/toc/epma/79/2. If you select Abstract for any paper on this page using SelectorGadget, it shows the CSS selector “.abstract-link”.

Let’s look at what we can retrieve from this page using the “.abstract-link” node.


link = "https://journals.sagepub.com/toc/epma/79/2"

issue <- read_html(link)


abstract.links <- html_attr(html_nodes(issue,".abstract-link"),"href")

abstract.links


[1] "/doi/abs/10.1177/0013164418773494"
[2] "/doi/abs/10.1177/0013164418773851"
[3] "/doi/abs/10.1177/0013164418777569"
[4] "/doi/abs/10.1177/0013164418777854"
[5] "/doi/abs/10.1177/0013164418783530"
[6] "/doi/abs/10.1177/0013164418790634"
[7] "/doi/abs/10.1177/0013164418791673"
[8] "/doi/abs/10.1177/0013164418777784"
[9] "/doi/abs/10.1177/0013164417733305"

This looks good! It shows that 9 papers were published in Volume 79 Issue 2, and each element is the path portion of the link to a paper’s abstract page.

All we need to do is put “https://journals.sagepub.com” in front of each of these paths.


abstract.links <- paste0("https://journals.sagepub.com",abstract.links)

abstract.links


[1] "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
[2] "https://journals.sagepub.com/doi/abs/10.1177/0013164418773851"
[3] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777569"
[4] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777854"
[5] "https://journals.sagepub.com/doi/abs/10.1177/0013164418783530"
[6] "https://journals.sagepub.com/doi/abs/10.1177/0013164418790634"
[7] "https://journals.sagepub.com/doi/abs/10.1177/0013164418791673"
[8] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777784"
[9] "https://journals.sagepub.com/doi/abs/10.1177/0013164417733305"

Again, let’s organize this code in a simple function. This function takes the volume number and issue number as numeric inputs, and then returns the list of abstract page links as a character vector.


doi <- function(volume,issue){
  
  # Build the table-of-contents link and read the page
  link            <- paste0("https://journals.sagepub.com/toc/epma/",volume,"/",issue)
  toc             <- read_html(link)
  
  # Pull the relative abstract links and prepend the site address
  abstract.links  <- html_attr(html_nodes(toc,".abstract-link"),"href")
  out             <- paste0("https://journals.sagepub.com",abstract.links)
  
  return(out)
  
}


doi(volume=70,issue=3) # Returns the links for abstract pages from papers published in Volume 70 Issue 3


 [1] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355694"
 [2] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355690"
 [3] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355692"
 [4] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355693"
 [5] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355696"
 [6] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344510"
 [7] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344517"
 [8] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344520"
 [9] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355685"
[10] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355687"
[11] "https://journals.sagepub.com/doi/abs/10.1177/0013164410367480"
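One edge case is worth knowing about when prepending the site address with paste0(): if a page yields no “.abstract-link” elements, paste0() does not return an empty vector but a single string holding the bare site address. The sketch below shows the behavior and one possible guard; make.abstract.urls is a hypothetical helper name, and nothing here contacts the website.

```r
# paste0() treats a zero-length argument as "", so this returns one string:
paste0("https://journals.sagepub.com", character(0))
# [1] "https://journals.sagepub.com"

# Hypothetical helper: return an empty vector when there are no links
make.abstract.urls <- function(hrefs, base = "https://journals.sagepub.com") {
  if (length(hrefs) == 0) return(character(0))
  paste0(base, hrefs)
}

make.abstract.urls(character(0))
# character(0)

make.abstract.urls("/doi/abs/10.1177/0013164418773494")
# [1] "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
```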

Step 3: Retrieving all keywords published in an issue

With the two functions from Step 1 and Step 2 in hand, this step is easier. Suppose we are looking at Volume 70 Issue 3. First, extract the links using the function from Step 2. Then, run a for loop to extract the keywords from each link using the function from Step 1. We will also organize them in a tidy data frame.


i = 70  # volume number
j = 3   # issue number

doi.links <- doi(volume=i,issue=j) 

Keywords <- data.frame(article=NULL,keywords=NULL)

for(r in seq_along(doi.links)){
  
  link     <- doi.links[r]
  keywords <- key.words(link)
  
  # Skip articles without keywords (key.words() returns NULL for them)
  if(!is.null(keywords)){
    sub.data <- data.frame(article=r,keywords=keywords)
    Keywords <- rbind(Keywords,sub.data)
  }
}

Keywords$volume <- i
Keywords$issue  <- j


kable(Keywords,format="html",row.names=FALSE,align="cccc")

article	keywords	volume	issue
1	objective level scores	70	3
1	subscore augmentation	70	3
1	reliability	70	3
2	coefficient alpha	70	3
2	reliability	70	3
2	confidence intervals	70	3
2	simulation studies	70	3
2	Fisher’s Z transformation	70	3
3	pilot study	70	3
3	sample size	70	3
3	instrument development	70	3
4	crossed random effects	70	3
4	teacher effects	70	3
4	multilevel	70	3
4	growth model	70	3
4	piecewise	70	3
4	summer learning	70	3
4	value added	70	3
4	Bayesian	70	3
5	Graded Response Model	70	3
5	Multilevel Item Response Theory	70	3
5	health-related quality of life	70	3
5	Explanatory Item Response Model	70	3
5	nonlinear mixed model	70	3
5	PedsQL	70	3
6	university attachment	70	3
6	group attachment	70	3
6	belonging	70	3
6	construct validity	70	3
7	homework purpose	70	3
7	scale development	70	3
7	factor analysis	70	3
7	high school students	70	3
8	confirmatory factor analysis	70	3
8	ordinal data	70	3
8	structural equation modeling	70	3
8	multilevel modeling	70	3
8	APCCS II-HST	70	3
9	test score validity	70	3
9	factor analysis	70	3
9	cross-validation	70	3
9	multisample confirmatory factor analysis	70	3
10	work/family conflict	70	3
10	coworker support	70	3
10	organizational citizenship	70	3
10	coworker backup	70	3

Let’s also organize this in a single function. This function takes the volume number and issue number as numeric inputs, and then returns the keywords published in that issue.


extract.Keywords <- function(i,j){
  
  doi.links <- doi(volume=i,issue=j) 
  
  Keywords <- data.frame(article=NULL,keywords=NULL)
  
  for(r in seq_along(doi.links)){
    
    link     <- doi.links[r]
    keywords <- key.words(link)
    
    # Skip articles without keywords
    if(!is.null(keywords)){
      sub.data <- data.frame(article=r,keywords=keywords)
      Keywords <- rbind(Keywords,sub.data)
    }
  }
  
  Keywords$volume <- i
  Keywords$issue  <- j
  
  return(Keywords)
}

extract.Keywords(i=70,j=3)


   article                                 keywords volume issue
1        1                   objective level scores     70     3
2        1                    subscore augmentation     70     3
3        1                              reliability     70     3
4        2                        coefficient alpha     70     3
5        2                              reliability     70     3
6        2                     confidence intervals     70     3
7        2                       simulation studies     70     3
8        2                Fisher’s Z transformation     70     3
9        3                              pilot study     70     3
10       3                              sample size     70     3
11       3                   instrument development     70     3
12       4                   crossed random effects     70     3
13       4                          teacher effects     70     3
14       4                               multilevel     70     3
15       4                             growth model     70     3
16       4                                piecewise     70     3
17       4                          summer learning     70     3
18       4                              value added     70     3
19       4                                 Bayesian     70     3
20       5                    Graded Response Model     70     3
21       5          Multilevel Item Response Theory     70     3
22       5           health-related quality of life     70     3
23       5          Explanatory Item Response Model     70     3
24       5                    nonlinear mixed model     70     3
25       5                                   PedsQL     70     3
26       6                    university attachment     70     3
27       6                         group attachment     70     3
28       6                                belonging     70     3
29       6                       construct validity     70     3
30       7                         homework purpose     70     3
31       7                        scale development     70     3
32       7                          factor analysis     70     3
33       7                     high school students     70     3
34       8             confirmatory factor analysis     70     3
35       8                             ordinal data     70     3
36       8             structural equation modeling     70     3
37       8                      multilevel modeling     70     3
38       8                             APCCS II-HST     70     3
39       9                      test score validity     70     3
40       9                          factor analysis     70     3
41       9                         cross-validation     70     3
42       9 multisample confirmatory factor analysis     70     3
43      10                     work/family conflict     70     3
44      10                         coworker support     70     3
45      10               organizational citizenship     70     3
46      10                          coworker backup     70     3
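The loop in this step grows the data frame with rbind() one article at a time. An alternative way to assemble the same long-format table is to build one small data frame per article and bind them all at once. The sketch below uses a toy list in place of real key.words() calls (a few keywords copied from the output above, with NULL standing in for an article without keywords), so it runs without network access; the object names are assumptions for illustration.

```r
# Toy stand-in for the result of calling key.words() on three abstract links
kw.by.article <- list(
  c("objective level scores", "subscore augmentation", "reliability"),
  NULL,                                   # an article with no keywords
  c("pilot study", "sample size")
)

# One data frame per article, NULL where there is nothing to record
rows <- lapply(seq_along(kw.by.article), function(r) {
  kw <- kw.by.article[[r]]
  if (is.null(kw)) return(NULL)
  data.frame(article = r, keywords = kw, stringsAsFactors = FALSE)
})

# Drop the NULL entries and stack the rest in a single call
Keywords.sketch <- do.call(rbind, Filter(Negate(is.null), rows))
Keywords.sketch$volume <- 70
Keywords.sketch$issue  <- 3

Keywords.sketch
```

For an issue with about ten articles the difference in speed is negligible; the main appeal is that the article index and the skip rule sit in one place.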

Step 4: Retrieving all keywords published since 2003

As mentioned at the beginning, EPM has been publishing keywords since Volume 63 Issue 1, published in February 2003. Each volume since then has six issues.

Now, let’s create a data frame for the volume and issue information since 2003.


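EPM.issues <- expand.grid(volume=63:79,issue=1:6)

# Add year as a column. The 17 years are recycled across the 102 rows,
# so each volume is paired with its own year (63 with 2003, ..., 79 with 2019).

EPM.issues$year <- 2003:2019
EPM.issues <- EPM.issues[order(EPM.issues$volume),]

# The last four rows (Volume 79, Issues 3 to 6) were not published yet.

EPM.issues <- EPM.issues[1:98,]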

EPM.issues


    volume issue year
1       63     1 2003
18      63     2 2003
35      63     3 2003
52      63     4 2003
69      63     5 2003
86      63     6 2003
2       64     1 2004
19      64     2 2004
36      64     3 2004
53      64     4 2004
70      64     5 2004
87      64     6 2004
3       65     1 2005
20      65     2 2005
37      65     3 2005
54      65     4 2005
71      65     5 2005
88      65     6 2005
4       66     1 2006
21      66     2 2006
38      66     3 2006
55      66     4 2006
72      66     5 2006
89      66     6 2006
5       67     1 2007
22      67     2 2007
39      67     3 2007
56      67     4 2007
73      67     5 2007
90      67     6 2007
6       68     1 2008
23      68     2 2008
40      68     3 2008
57      68     4 2008
74      68     5 2008
91      68     6 2008
7       69     1 2009
24      69     2 2009
41      69     3 2009
58      69     4 2009
75      69     5 2009
92      69     6 2009
8       70     1 2010
25      70     2 2010
42      70     3 2010
59      70     4 2010
76      70     5 2010
93      70     6 2010
9       71     1 2011
26      71     2 2011
43      71     3 2011
60      71     4 2011
77      71     5 2011
94      71     6 2011
10      72     1 2012
27      72     2 2012
44      72     3 2012
61      72     4 2012
78      72     5 2012
95      72     6 2012
11      73     1 2013
28      73     2 2013
45      73     3 2013
62      73     4 2013
79      73     5 2013
96      73     6 2013
12      74     1 2014
29      74     2 2014
46      74     3 2014
63      74     4 2014
80      74     5 2014
97      74     6 2014
13      75     1 2015
30      75     2 2015
47      75     3 2015
64      75     4 2015
81      75     5 2015
98      75     6 2015
14      76     1 2016
31      76     2 2016
48      76     3 2016
65      76     4 2016
82      76     5 2016
99      76     6 2016
15      77     1 2017
32      77     2 2017
49      77     3 2017
66      77     4 2017
83      77     5 2017
100     77     6 2017
16      78     1 2018
33      78     2 2018
50      78     3 2018
67      78     4 2018
84      78     5 2018
101     78     6 2018
17      79     1 2019
34      79     2 2019
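The same table can also be built without relying on recycling or on the position of the last four rows. The sketch below assumes the volume-to-year relationship visible in the table above (Volume 63 in 2003 through Volume 79 in 2019, that is, year = volume + 1940) and drops the unpublished issues by condition. The name EPM.grid is only for illustration.

```r
# Putting issue first makes it vary fastest, so rows come out ordered
# by volume and then by issue.
EPM.grid <- expand.grid(issue = 1:6, volume = 63:79)

# Assumption read off the table above: Volume 63 appeared in 2003
EPM.grid$year <- EPM.grid$volume + 1940

# Drop Volume 79, Issues 3 to 6, which were not published yet
keep     <- !(EPM.grid$volume == 79 & EPM.grid$issue > 2)
EPM.grid <- EPM.grid[keep, c("volume", "issue", "year")]
rownames(EPM.grid) <- NULL

nrow(EPM.grid)   # 98
head(EPM.grid, 3)
tail(EPM.grid, 2)
```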

Now, I will use a for loop to run the extract.Keywords() function from Step 3 for every issue of EPM and compile the results in one data frame. Note that this took about an hour to run, so be patient if you are replicating this.


all.Keywords <- c()

for(k in 1:nrow(EPM.issues)){

  # Keywords for the k-th issue in the list
  key      <- extract.Keywords(i=EPM.issues[k,]$volume,
                               j=EPM.issues[k,]$issue)

  key$year <- EPM.issues[k,]$year
  
  all.Keywords <- rbind(all.Keywords,key)
}
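Because a run of this length involves roughly a thousand page requests, a single failed request can stop the whole loop. Below is a minimal sketch of a more forgiving version: it wraps each issue in tryCatch(), skips issues that fail, and pauses between issues. The function name collect.issues, the one-second default pause, and the stub fake.extract are all my assumptions for illustration; the stub replaces extract.Keywords() so the example runs offline.

```r
# Hypothetical wrapper: apply extract.fun(volume, issue) to every row of
# `issues`, skipping rows that raise an error.
collect.issues <- function(issues, extract.fun, pause = 1) {
  
  pieces <- list()
  
  for (k in seq_len(nrow(issues))) {
    
    key <- tryCatch({
      res      <- extract.fun(issues$volume[k], issues$issue[k])
      res$year <- rep(issues$year[k], nrow(res))
      res
    }, error = function(e) {
      message("Skipping volume ", issues$volume[k], " issue ",
              issues$issue[k], ": ", conditionMessage(e))
      NULL
    })
    
    # Keep only the issues that were retrieved successfully
    if (!is.null(key)) pieces[[length(pieces) + 1]] <- key
    
    Sys.sleep(pause)   # wait a little between issues
  }
  
  do.call(rbind, pieces)
}

# Offline stub standing in for extract.Keywords(); it fails for issue 2
fake.extract <- function(i, j) {
  if (j == 2) stop("simulated network failure")
  data.frame(article = 1, keywords = "reliability", volume = i, issue = j)
}

demo.issues <- data.frame(volume = 63, issue = 1:3, year = 2003)
collect.issues(demo.issues, fake.extract, pause = 0)
```

With the real functions, the call would be collect.issues(EPM.issues, extract.Keywords). Any issues reported as skipped can then be retried separately.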

The dataset compiled by running extract.Keywords() over every issue can be downloaded from this link. It includes a total of 4,061 keywords from 901 articles published in 98 issues since February 2003. Now, we can dive into this dataset and see what it reveals!
