What are the most commonly used keywords in published articles in educational measurement journals? Are there any trending topics? I decided to do a sample analysis in the context of Educational and Psychological Measurement (EPM). Initially, I was planning a comparison across journals including the Journal of Educational Measurement (JEM), Applied Measurement in Education (AME), and Educational Measurement: Issues and Practice (EM:IP). However, I realized that JEM and AME do not require keywords for the papers they publish. So, I focus only on EPM for now; the analysis can be replicated for EM:IP with a few tweaks to the code. This post covers the code to automate scraping the desired information from the EPM website, and I plan to write a follow-up post with an analysis of the dataset compiled here.
require(rvest)
require(xml2)
require(knitr)
require(kableExtra)
First, let’s look at what information one needs to scrape from the EPM website and how this information is stored there. The table of contents for a given volume and issue can be accessed using a link of the form https://journals.sagepub.com/toc/epma/i/j, where i is the volume number and j is the issue number. For instance, to access Volume 79 Issue 2, the link is https://journals.sagepub.com/toc/epma/79/2. This webpage lists the titles of the papers published in that issue, with a link for Abstract under each title. At the link for Abstract, one can access the title, authors, publication date, abstract, references, and keywords for each paper.
Based on my exploration, EPM has been publishing keywords since Volume 63, Issue 1, published in February 2003. That makes 98 issues published with keywords. If, on average, 10 articles are published per issue, there may be about 1,000 papers. One alternative is to extract this information manually by visiting the page of each published paper, which would probably take days or weeks. Therefore, it is wise to automate this process.
For the example above, the link for the Abstract is https://journals.sagepub.com/doi/abs/10.1177/0013164418773494. Let’s read this webpage into R.
link = "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
article <- read_html(link)
article
{html_document}
<html lang="en" class="pb-page" data-request-id="76c746d0-aee4-4faa-8964-1b5e3a49f30e">
[1] <head data-pb-dropzone="head">\n<meta http-equiv="Content-Type" ...
[2] <body class="pb-ui">\n<div class="totoplink">\n<a id="skiptocon ...
The object article simply contains all the HTML source code for this specific webpage. If you want to see it, you can check it from this link:
view-source:https://journals.sagepub.com/doi/abs/10.1177/0013164418773494
The information we are looking for is somewhere in this source text; the trick is pulling it out. At this point, you need some help from the SelectorGadget app, which can be installed as a Chrome extension. There are plenty of video tutorials on the web about how to use it, so I will skip that. When you open SelectorGadget on the paper’s page and click the box where the keywords are listed, it shows the CSS selector for that element. As can be seen in the screenshot below, the CSS selector for the Keywords box on this page is “.hlFld-KeywordText”.
Now, we can pull that specific piece from the whole HTML page we read into R before.
keywords <- html_text(html_nodes(article,".hlFld-KeywordText"))
keywords
[1] "Keywords multilevel modeling, fixed effects modeling, random coefficients, small samples"
It looks like we got what we want, but it needs some polishing. First, we need to strip the word “Keywords” from the beginning, then split the comma-separated string into individual keywords, and finally trim the white space.
keywords <- substring(keywords,10) # drop the first 9 characters ("Keywords ")
keywords
[1] "multilevel modeling, fixed effects modeling, random coefficients, small samples"
keywords <- strsplit(keywords,",")[[1]]
keywords
[1] "multilevel modeling" " fixed effects modeling"
[3] " random coefficients" " small samples"
keywords <- trimws(keywords)
keywords
[1] "multilevel modeling" "fixed effects modeling"
[3] "random coefficients" "small samples"
Nice! So, given a link for a paper abstract, we just retrieved a vector of keywords for that specific paper. Below is a simple function that organizes the code above: it takes an abstract link as input and returns the keywords on that page as output.
key.words <- function(link){
  article  <- read_html(link)
  keywords <- html_text(html_nodes(article,".hlFld-KeywordText"))
  keywords <- substring(keywords,10)   # drop the leading "Keywords " label
  if(length(keywords)!=0){
    out <- trimws(strsplit(keywords,",")[[1]])
  } else {
    out <- NULL                        # no keywords found on the page
  }
  return(out)
}
link = "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
key.words(link)
[1] "multilevel modeling" "fixed effects modeling"
[3] "random coefficients" "small samples"
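As an aside, when calling this function over many links, an intermittent network failure can abort the whole run. A more defensive variant — my own sketch, not part of the original workflow — wraps the request in tryCatch() so a bad link yields NULL instead of an error:

```r
require(rvest)

# Hypothetical defensive version of key.words(): any error while fetching or
# parsing the page (timeout, 404, malformed HTML) is caught and converted to
# NULL, so a long scraping loop can simply skip that link and keep going.
safe.key.words <- function(link){
  tryCatch({
    article  <- read_html(link)
    keywords <- html_text(html_nodes(article,".hlFld-KeywordText"))
    if(length(keywords)==0) return(NULL)
    keywords <- substring(keywords,10)    # drop the leading "Keywords " label
    trimws(strsplit(keywords,",")[[1]])   # split on commas, trim white space
  }, error = function(e) NULL)
}
```

On a well-formed abstract page, safe.key.words() behaves exactly like key.words(); otherwise it quietly returns NULL.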
The next important task is to extract the abstract-page links for all papers published in an issue. As mentioned before, the table of contents for EPM issues follows the format https://journals.sagepub.com/toc/epma/i/j. Suppose we look at the latest issue, https://journals.sagepub.com/toc/epma/79/2. If you select the Abstract link for any paper on this page using SelectorGadget, it will show the CSS selector as “.abstract-link”.
Let’s look at what we can retrieve from this page using the “.abstract-link” node.
link = "https://journals.sagepub.com/toc/epma/79/2"
issue <- read_html(link)
abstract.links <- html_attr(html_nodes(issue,".abstract-link"),"href")
abstract.links
[1] "/doi/abs/10.1177/0013164418773494"
[2] "/doi/abs/10.1177/0013164418773851"
[3] "/doi/abs/10.1177/0013164418777569"
[4] "/doi/abs/10.1177/0013164418777854"
[5] "/doi/abs/10.1177/0013164418783530"
[6] "/doi/abs/10.1177/0013164418790634"
[7] "/doi/abs/10.1177/0013164418791673"
[8] "/doi/abs/10.1177/0013164418777784"
[9] "/doi/abs/10.1177/0013164417733305"
This looks good! It shows there are 9 papers published in Volume 79 Issue 2, and these are the relative links to their abstract pages.
All we need to do is put “https://journals.sagepub.com” in front of these paths (the hrefs already begin with a slash).
abstract.links <- paste0("https://journals.sagepub.com",abstract.links)
abstract.links
[1] "https://journals.sagepub.com/doi/abs/10.1177/0013164418773494"
[2] "https://journals.sagepub.com/doi/abs/10.1177/0013164418773851"
[3] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777569"
[4] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777854"
[5] "https://journals.sagepub.com/doi/abs/10.1177/0013164418783530"
[6] "https://journals.sagepub.com/doi/abs/10.1177/0013164418790634"
[7] "https://journals.sagepub.com/doi/abs/10.1177/0013164418791673"
[8] "https://journals.sagepub.com/doi/abs/10.1177/0013164418777784"
[9] "https://journals.sagepub.com/doi/abs/10.1177/0013164417733305"
Again, let’s organize this code in a simple function. This function takes the volume number and issue number as numeric inputs, and then returns the list of abstract page links as a character vector.
doi <- function(volume,issue){
  link <- paste0("https://journals.sagepub.com/toc/epma/",volume,"/",issue)
  page <- read_html(link)   # read the table-of-contents page
  abstract.links <- html_attr(html_nodes(page,".abstract-link"),"href")
  out  <- paste0("https://journals.sagepub.com",abstract.links)
  return(out)
}
doi(volume=70,issue=3) # Returns the links for abstract pages from papers published in Volume 70 Issue 3
[1] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355694"
[2] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355690"
[3] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355692"
[4] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355693"
[5] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355696"
[6] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344510"
[7] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344517"
[8] "https://journals.sagepub.com/doi/abs/10.1177/0013164409344520"
[9] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355685"
[10] "https://journals.sagepub.com/doi/abs/10.1177/0013164409355687"
[11] "https://journals.sagepub.com/doi/abs/10.1177/0013164410367480"
After creating the two functions in Step 1 and Step 2, this will be easier. Suppose we are looking at Volume 70 Issue 3. First, extract the links using the function from Step 2. Then, run a for loop to extract the keywords from each link using the function from Step 1. We will also organize them in a nice-looking data frame.
i = 70 # volume number
j = 3  # issue number

doi.links <- doi(volume=i,issue=j)

Keywords <- data.frame(article=NULL,keywords=NULL)

for(r in 1:length(doi.links)){
  link     <- doi.links[r]
  keywords <- key.words(link)
  if(!is.null(keywords)){
    sub.data <- data.frame(article=r,keywords=keywords)
    Keywords <- rbind(Keywords,sub.data)
  }
}

Keywords$volume <- i
Keywords$issue  <- j
kable(Keywords,format="html",row.names=FALSE,align="cccc")
article | keywords | volume | issue |
---|---|---|---|
1 | objective level scores | 70 | 3 |
1 | subscore augmentation | 70 | 3 |
1 | reliability | 70 | 3 |
2 | coefficient alpha | 70 | 3 |
2 | reliability | 70 | 3 |
2 | confidence intervals | 70 | 3 |
2 | simulation studies | 70 | 3 |
2 | Fisher’s Z transformation | 70 | 3 |
3 | pilot study | 70 | 3 |
3 | sample size | 70 | 3 |
3 | instrument development | 70 | 3 |
4 | crossed random effects | 70 | 3 |
4 | teacher effects | 70 | 3 |
4 | multilevel | 70 | 3 |
4 | growth model | 70 | 3 |
4 | piecewise | 70 | 3 |
4 | summer learning | 70 | 3 |
4 | value added | 70 | 3 |
4 | Bayesian | 70 | 3 |
5 | Graded Response Model | 70 | 3 |
5 | Multilevel Item Response Theory | 70 | 3 |
5 | health-related quality of life | 70 | 3 |
5 | Explanatory Item Response Model | 70 | 3 |
5 | nonlinear mixed model | 70 | 3 |
5 | PedsQL | 70 | 3 |
6 | university attachment | 70 | 3 |
6 | group attachment | 70 | 3 |
6 | belonging | 70 | 3 |
6 | construct validity | 70 | 3 |
7 | homework purpose | 70 | 3 |
7 | scale development | 70 | 3 |
7 | factor analysis | 70 | 3 |
7 | high school students | 70 | 3 |
8 | confirmatory factor analysis | 70 | 3 |
8 | ordinal data | 70 | 3 |
8 | structural equation modeling | 70 | 3 |
8 | multilevel modeling | 70 | 3 |
8 | APCCS II-HST | 70 | 3 |
9 | test score validity | 70 | 3 |
9 | factor analysis | 70 | 3 |
9 | cross-validation | 70 | 3 |
9 | multisample confirmatory factor analysis | 70 | 3 |
10 | work/family conflict | 70 | 3 |
10 | coworker support | 70 | 3 |
10 | organizational citizenship | 70 | 3 |
10 | coworker backup | 70 | 3 |
Let’s also organize this in a single function. This function takes the volume number and issue number as numeric inputs, and then returns the keywords published in that issue.
extract.Keywords <- function(i,j){
  doi.links <- doi(volume=i,issue=j)
  Keywords  <- data.frame(article=NULL,keywords=NULL)
  for(r in 1:length(doi.links)){
    link     <- doi.links[r]
    keywords <- key.words(link)
    if(!is.null(keywords)){
      sub.data <- data.frame(article=r,keywords=keywords)
      Keywords <- rbind(Keywords,sub.data)
    }
  }
  Keywords$volume <- i
  Keywords$issue  <- j
  return(Keywords)
}
extract.Keywords(i=70,j=3)
article keywords volume issue
1 1 objective level scores 70 3
2 1 subscore augmentation 70 3
3 1 reliability 70 3
4 2 coefficient alpha 70 3
5 2 reliability 70 3
6 2 confidence intervals 70 3
7 2 simulation studies 70 3
8 2 Fisher’s Z transformation 70 3
9 3 pilot study 70 3
10 3 sample size 70 3
11 3 instrument development 70 3
12 4 crossed random effects 70 3
13 4 teacher effects 70 3
14 4 multilevel 70 3
15 4 growth model 70 3
16 4 piecewise 70 3
17 4 summer learning 70 3
18 4 value added 70 3
19 4 Bayesian 70 3
20 5 Graded Response Model 70 3
21 5 Multilevel Item Response Theory 70 3
22 5 health-related quality of life 70 3
23 5 Explanatory Item Response Model 70 3
24 5 nonlinear mixed model 70 3
25 5 PedsQL 70 3
26 6 university attachment 70 3
27 6 group attachment 70 3
28 6 belonging 70 3
29 6 construct validity 70 3
30 7 homework purpose 70 3
31 7 scale development 70 3
32 7 factor analysis 70 3
33 7 high school students 70 3
34 8 confirmatory factor analysis 70 3
35 8 ordinal data 70 3
36 8 structural equation modeling 70 3
37 8 multilevel modeling 70 3
38 8 APCCS II-HST 70 3
39 9 test score validity 70 3
40 9 factor analysis 70 3
41 9 cross-validation 70 3
42 9 multisample confirmatory factor analysis 70 3
43 10 work/family conflict 70 3
44 10 coworker support 70 3
45 10 organizational citizenship 70 3
46 10 coworker backup 70 3
As mentioned at the beginning, EPM has been publishing keywords since Volume 63, Issue 1, published in February 2003, and each volume has had six issues since then.
Now, let’s create a data frame with the volume and issue information since 2003.
EPM.issues <- expand.grid(volume=63:79,issue=1:6)

# Add year as a column. expand.grid() varies volume fastest, so recycling
# 2003:2019 across the 102 rows lines each volume up with its year
# (63 -> 2003, 64 -> 2004, ..., 79 -> 2019).
EPM.issues$year <- 2003:2019

EPM.issues <- EPM.issues[order(EPM.issues$volume),]

# The last four issues (Volume 79, Issues 3 through 6) are not published yet.
EPM.issues <- EPM.issues[1:98,]
EPM.issues
volume issue year
1 63 1 2003
18 63 2 2003
35 63 3 2003
52 63 4 2003
69 63 5 2003
86 63 6 2003
2 64 1 2004
19 64 2 2004
36 64 3 2004
53 64 4 2004
70 64 5 2004
87 64 6 2004
3 65 1 2005
20 65 2 2005
37 65 3 2005
54 65 4 2005
71 65 5 2005
88 65 6 2005
4 66 1 2006
21 66 2 2006
38 66 3 2006
55 66 4 2006
72 66 5 2006
89 66 6 2006
5 67 1 2007
22 67 2 2007
39 67 3 2007
56 67 4 2007
73 67 5 2007
90 67 6 2007
6 68 1 2008
23 68 2 2008
40 68 3 2008
57 68 4 2008
74 68 5 2008
91 68 6 2008
7 69 1 2009
24 69 2 2009
41 69 3 2009
58 69 4 2009
75 69 5 2009
92 69 6 2009
8 70 1 2010
25 70 2 2010
42 70 3 2010
59 70 4 2010
76 70 5 2010
93 70 6 2010
9 71 1 2011
26 71 2 2011
43 71 3 2011
60 71 4 2011
77 71 5 2011
94 71 6 2011
10 72 1 2012
27 72 2 2012
44 72 3 2012
61 72 4 2012
78 72 5 2012
95 72 6 2012
11 73 1 2013
28 73 2 2013
45 73 3 2013
62 73 4 2013
79 73 5 2013
96 73 6 2013
12 74 1 2014
29 74 2 2014
46 74 3 2014
63 74 4 2014
80 74 5 2014
97 74 6 2014
13 75 1 2015
30 75 2 2015
47 75 3 2015
64 75 4 2015
81 75 5 2015
98 75 6 2015
14 76 1 2016
31 76 2 2016
48 76 3 2016
65 76 4 2016
82 76 5 2016
99 76 6 2016
15 77 1 2017
32 77 2 2017
49 77 3 2017
66 77 4 2017
83 77 5 2017
100 77 6 2017
16 78 1 2018
33 78 2 2018
50 78 3 2018
67 78 4 2018
84 78 5 2018
101 78 6 2018
17 79 1 2019
34 79 2 2019
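Incidentally, since Volume 63 corresponds to 2003 and each subsequent volume adds one year, the year column could also be computed directly from the volume number instead of relying on vector recycling. A small alternative sketch:

```r
# Alternative construction of the issue table: derive year arithmetically
# (volume 63 -> 2003, 64 -> 2004, ...), which does not depend on the row
# order produced by expand.grid().
EPM.issues <- expand.grid(volume = 63:79, issue = 1:6)
EPM.issues$year <- EPM.issues$volume - 63 + 2003
EPM.issues <- EPM.issues[order(EPM.issues$volume, EPM.issues$issue), ]
EPM.issues <- EPM.issues[1:98, ]   # drop the four unpublished issues of Volume 79
```

Either version produces the same 98 rows; this one just makes the volume-to-year mapping explicit.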
Now, I will use a for loop to run the extract.Keywords() function from Step 3 for every issue of EPM, and compile the results in one data frame. Note that this took about an hour to run, so be patient if you are replicating it.
all.Keywords <- c()

for(rep in 1:nrow(EPM.issues)){
  key <- extract.Keywords(i=EPM.issues[rep,]$volume,
                          j=EPM.issues[rep,]$issue)
  key$year     <- EPM.issues[rep,]$year
  all.Keywords <- rbind(all.Keywords,key)
}
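Since this loop fires roughly a thousand HTTP requests, you may want a version that is polite to the server and tolerant of occasional failures. The delay length, error handling, and output file name below are my own choices — a sketch, not the code behind the original run:

```r
# Variant of the compilation loop: skip issues that fail to download, and
# pause briefly between requests so the server is not hammered.
all.Keywords <- c()
for(rep in 1:nrow(EPM.issues)){
  key <- tryCatch(extract.Keywords(i = EPM.issues[rep,]$volume,
                                   j = EPM.issues[rep,]$issue),
                  error = function(e) NULL)   # on failure, record nothing
  if(!is.null(key) && nrow(key) > 0){
    key$year     <- EPM.issues[rep,]$year
    all.Keywords <- rbind(all.Keywords, key)
  }
  Sys.sleep(2)   # pause between issues to be polite to the server
}

# Save the result so the hour-long scrape does not need repeating
write.csv(all.Keywords, "epm_keywords.csv", row.names = FALSE)
```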
The dataset obtained from the code above can be downloaded from this link. It includes a total of 4,061 keywords from 901 articles published in 98 issues since February 2003. Now, we can dive into this dataset and see what it reveals!
For attribution, please cite this work as
Zopluoglu (2019, April 1). Cengiz Zopluoglu: Compiling Keywords from the Published Articles in Educational and Psychological Measurement. Retrieved from https://github.com/czopluoglu/website/tree/master/docs/posts/epm-keywords/
BibTeX citation
@misc{zopluoglu2019compiling, author = {Zopluoglu, Cengiz}, title = {Cengiz Zopluoglu: Compiling Keywords from the Published Articles in Educational and Psychological Measurement}, url = {https://github.com/czopluoglu/website/tree/master/docs/posts/epm-keywords/}, year = {2019} }