diff --git a/.gitignore b/.gitignore index f949949..a01c46c 100644 --- a/.gitignore +++ b/.gitignore @@ -19,4 +19,5 @@ env /.Rhistory /_book/ -/.vscode/ \ No newline at end of file +/.vscode/ +.Rproj.user diff --git a/content/chapter02.qmd b/content/chapter02.qmd index d239376..106bd40 100644 --- a/content/chapter02.qmd +++ b/content/chapter02.qmd @@ -5,50 +5,46 @@ import warnings; warnings.filterwarnings('ignore') ``` -**Abstract.** - This chapter is a lightning tour of some of the cool (and informative) things you can do with R and Python. - Starting from a dataset of tweets about COVID-19, we show how you can analyze this data using - text analysis, network analysis, and using geographic information. - The goal of this chapter is not to teach you all these techniques in detail, - rather, each of the examples showcases a possibility and guides you to the chapter where it will be explained in more detail. - So don't worry too much about understanding every line of code, but relax and enjoy the ride! +**Abstract.** This chapter is a lightning tour of some of the cool (and informative) things you can do with R and Python. Starting from a dataset of tweets about COVID-19, we show how you can analyze this data using text analysis, network analysis, and using geographic information. The goal of this chapter is not to teach you all these techniques in detail, rather, each of the examples showcases a possibility and guides you to the chapter where it will be explained in more detail. So don't worry too much about understanding every line of code, but relax and enjoy the ride! -**Keywords.** basics of programming, data analysis +**Keywords.** basics of programming, data analysis **Objectives:** -- Get an overview of the possibilities of R and Python for data analysis and visualization - - Understand how different aspects of data gathering, cleaning, and analysis work together - - Have fun with data and visualizations! +- Get an overview of the possibilities of R and Python for data analysis and visualization +- Understand how different aspects of data gathering, cleaning, and analysis work together +- Have fun with data and visualizations! -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Packages used in this chapter -Since this chapter showcases a wide variety of possibilities, - it relies on quite a number of third party packages. - If needed, you can install these packages with the code below - (see Section [-@sec-installing] for more details): +Since this chapter showcases a wide variety of possibilities, it relies on quite a number of third party packages. 
If needed, you can install these packages with the code below (see Section [-@sec-installing] for more details): -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python chapter02install-python} #| eval: false -!pip3 install pandas matplotlib geopandas +!pip3 install pandas matplotlib geopandas !pip3 install descartes shifterator !pip3 install wordcloud gensim nltk networkx ``` + ## R code + ```{r chapter02install-r} #| eval: false install.packages(c("tidyverse", "igraph","maps", - "quanteda", "quanteda.textplots", + "quanteda", "quanteda.textplots", "quanteda.textstats", "topicmodels")) ``` ::: - After installing, you need to import (activate) the packages every session: -::: {.panel-tabset} +After installing, you need to import (activate) the packages every session: + +::: panel-tabset ## Python code + ```{python chapter02library-python} import re import pandas as pd @@ -63,7 +59,9 @@ from nltk.corpus import stopwords import networkx as nx ``` + ## R code + ```{r chapter02library-r} library(tidyverse) library(lubridate) @@ -79,33 +77,25 @@ library(maps) ## Fun With Tweets {#sec-funtweets} -The goal of this chapter is to showcase how you can use R or Python to quickly and easily -run some impressive analyses of real world data. -For this purpose, we will be using a dataset of tweets about the COVID pandemic that is -engulfing much of the world at the time this book is written. -Of course, tweets are probably only representative for what is said on Twitter, -but the data are (semi-)public and rich, containing text, location, and network characteristics. -This makes them ideal for exploring the many ways in which we can analyze and visualize information -with Python and R. - -Example [-@exm-funtweets] shows how you can read this dataset into memory using a single command. -Note that this does not retrieve the tweets from Twitter itself, but rather downloads -our cached version of the tweets. -In Chapter [-@sec-chap-scraping] we will show how you can download tweets and location data yourself, but to make sure -we can get down to business immediately we will start from this cached version. - -::: {.callout-note appearance="simple" icon=false} +The goal of this chapter is to showcase how you can use R or Python to quickly and easily run some impressive analyses of real world data. For this purpose, we will be using a dataset of tweets about the COVID pandemic that is engulfing much of the world at the time this book is written. Of course, tweets are probably only representative for what is said on Twitter, but the data are (semi-)public and rich, containing text, location, and network characteristics. This makes them ideal for exploring the many ways in which we can analyze and visualize information with Python and R. + +Example [-@exm-funtweets] shows how you can read this dataset into memory using a single command. Note that this does not retrieve the tweets from Twitter itself, but rather downloads our cached version of the tweets. In Chapter [-@sec-chap-scraping] we will show how you can download tweets and location data yourself, but to make sure we can get down to business immediately we will start from this cached version. 
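Once the example below has run, a quick sanity check on the loaded data is to ask for its size and column names. The following is an optional sketch of ours, not part of the numbered example:

```{r tweets-inspect-sketch}
#| eval: false
# Optional sketch: quick sanity check after running the example below
nrow(tw)      # number of tweets
colnames(tw)  # available variables (sender, language, text, retweets, ...)
```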
+ +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funtweets} Retrieving cached tweets about COVID -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python tweets-python} tw = pd.read_csv("https://cssbook.net/d/covid.csv") tw.head() ``` + ## R code + ```{r tweets-r} tw = read_csv("https://cssbook.net/d/covid.csv") head(tw) @@ -114,28 +104,17 @@ head(tw) ::: ::: -As you can see, the dataset contains almost 10000 tweets, listing their -sender, their location and language, the text, the number of retweets, and whether it was a reply. -You can read the start of the three most retweeted messages, which contain one (political) tweet from India -and two seemingly political and factual tweets from the United States. +As you can see, the dataset contains almost 10000 tweets, listing their sender, their location and language, the text, the number of retweets, and whether it was a reply. You can read the start of the three most retweeted messages, which contain one (political) tweet from India and two seemingly political and factual tweets from the United States. -**My first bar plot.** Before diving into the textual, network, and geographic data in the dataset, -let's first make a simple visualization of the date on which the tweets were posted. -Example [-@exm-funtime] does this in two steps: -first, the number of tweets per hour is counted with an aggregation command. -Next, a bar plot is made of this calculated value with some options to make it look relatively clean and professional. -If you want to play around with this, you can for example try to plot the number of tweets per language, -or create a line plot instead of a bar plot. -For more information on visualization, please see Chapter [-@sec-chap-eda]. -See Chapter [-@sec-chap-datawrangling] for an in-depth explanation of the aggregation command. - -::: {.callout-note appearance="simple" icon=false} +**My first bar plot.** Before diving into the textual, network, and geographic data in the dataset, let's first make a simple visualization of the date on which the tweets were posted. Example [-@exm-funtime] does this in two steps: first, the number of tweets per hour is counted with an aggregation command. Next, a bar plot is made of this calculated value with some options to make it look relatively clean and professional. If you want to play around with this, you can for example try to plot the number of tweets per language, or create a line plot instead of a bar plot. For more information on visualization, please see Chapter [-@sec-chap-eda]. See Chapter [-@sec-chap-datawrangling] for an in-depth explanation of the aggregation command. 
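If you want to try the per-language variation suggested above, a minimal sketch (ours, not part of the numbered example; it assumes the `lang` column of the dataset) could look as follows. The bar plot of tweets over time itself follows in the example below.

```{r funtime-lang-sketch}
#| eval: false
# Sketch: count tweets per language and plot the counts as a bar chart
tweets_per_lang = tw %>% 
  count(lang, sort=TRUE)
ggplot(tweets_per_lang, aes(x=reorder(lang, n), y=n)) + 
  geom_col() + coord_flip() + theme_classic() + 
  xlab("Language") + ylab("# of tweets")
```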
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funtime} Barplot of tweets over time -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python funtime-python} #| results: hide tw.index = pd.DatetimeIndex(tw["created_at"]) @@ -143,13 +122,15 @@ tw["status_id"].groupby(pd.Grouper(freq="H")).count().plot(kind="bar") # (note the use of \ to split a long line) ``` + ## R code + ```{r funtime-r} -tweets_per_hour = tw %>% - mutate(hour=round_date(created_at, "hour")) %>% - group_by(hour) %>% summarize(n=n()) -ggplot(tweets_per_hour, aes(x=hour, y=n)) + - geom_col() + theme_classic() + +tweets_per_hour = tw %>% + mutate(hour=round_date(created_at, "hour")) %>% + group_by(hour) %>% summarize(n=n()) +ggplot(tweets_per_hour, aes(x=hour, y=n)) + + geom_col() + theme_classic() + xlab("Time") + ylab("# of tweets") + ggtitle("Number of COVID tweets over time") ``` @@ -159,26 +140,15 @@ ggplot(tweets_per_hour, aes(x=hour, y=n)) + ## Fun With Textual Data {#sec-funtext} -**Corpus Analysis.** Next, we can analyze which hashtags are most frequently used in this dataset. -Example [-@exm-funcloud] does this by creating a *document-term matrix* using the package *quanteda* (in R) -or by manually counting the words using a defaultdict (in Python). -The code shows a number of steps that are made to create the final results, each of which represent -researcher choices about which data to keep and which to discard as noise. -In this case, we select English tweets, convert text to lower case, remove stop words, and keep only words that start with \#, -while dropping words starting with `#corona` and `#covid`. -To play around with this example, -see if you can adjust the code to e.g. include all words or only at-mentions instead of the hashtags -and make a different selection of tweets, for example Spanish language tweets or only popular (retweeted) tweets. -Please see Chapter [-@sec-chap-dtm] if you want to learn more about corpus analysis, -and see Chapter [-@sec-chap-datawrangling] for more information on how to select subsets of your data. - -::: {.callout-note appearance="simple" icon=false} +**Corpus Analysis.** Next, we can analyze which hashtags are most frequently used in this dataset. Example [-@exm-funcloud] does this by creating a *document-term matrix* using the package *quanteda* (in R) or by manually counting the words using a defaultdict (in Python). The code shows a number of steps that are made to create the final results, each of which represent researcher choices about which data to keep and which to discard as noise. In this case, we select English tweets, convert text to lower case, remove stop words, and keep only words that start with #, while dropping words starting with `#corona` and `#covid`. To play around with this example, see if you can adjust the code to e.g. include all words or only at-mentions instead of the hashtags and make a different selection of tweets, for example Spanish language tweets or only popular (retweeted) tweets. Please see Chapter [-@sec-chap-dtm] if you want to learn more about corpus analysis, and see Chapter [-@sec-chap-datawrangling] for more information on how to select subsets of your data. 
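As a starting point for the at-mention variation suggested above, the sketch below mirrors the pipeline of the example that follows but keeps tokens starting with `@` rather than `#`. This is our illustration, not part of the numbered example, and the selection pattern may need tweaking for your data.

```{r funcloud-mentions-sketch}
#| eval: false
# Sketch: same pipeline as the example below, but keeping at-mentions
dtm_mentions = filter(tw, lang=="en") %>% 
  corpus() %>% tokens() %>% 
  dfm(tolower=T) %>% 
  dfm_select(pattern = "@*")
textplot_wordcloud(dtm_mentions, max_words=100)
```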
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funcloud} My First Tag Cloud -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python funcloud-python} #| results: hide freq = defaultdict(int) @@ -189,44 +159,32 @@ for tweet in tw["text"]: wc = WordCloud().generate_from_frequencies(freq) plt.imshow(wc, interpolation="bilinear") plt.axis("off") - +plt.show() ``` + ## R code + ```{r funcloud-r} -dtm_tags = filter(tw, lang=="en") %>% - corpus() %>% tokens() %>% - dfm(tolower = T) %>% - dfm_select(pattern = "#*") %>% - dfm_remove(c("#corona*", "#covid*")) +dtm_tags = filter(tw, lang=="en") %>% + corpus() %>% tokens() %>% + dfm(tolower = T) %>% + dfm_select(pattern = "#*") %>% + dfm_remove(c("#corona*", "#covid*")) textplot_wordcloud(dtm_tags, max_words=100) ``` ::: ::: ::: -**Topic Model.** -Where a word cloud (or tag cloud) shows which words occur most frequently, -a `topic model` analysis shows which words co-occur in the same documents. -Using the most common topic modeling algorithm, Latent Dirichlet Allocation or LDA, -Example [-@exm-funlda] explores the tweets by automatically clustering the tags selected earlier into 10 *topics*. -Topic modeling is non-deterministic -- if you run it again you can get slightly different topics, -and topics are swapped around randomly as the topic numbers have no special meaning. -By setting the computer's *random seed* you can ensure that if you run it again you get the same results. -As you can see, some topics seem easily interpretable (such as topic 2 about social distancing, -and topic 8 on health care), it is always recommended that you inspect the clustered documents -and edge cases in addition to the top words (or tags) as shown here. -You can play around with this example by using a different selection of words -(modifying the code in Example [-@exm-funcloud]) or changing the number of topics. -You can also change (or remove) the random seed and see how running the same model multiple times will give different results. -See @sec-unsupervised for more information about fitting, interpreting, and validating topic models. - -::: {.callout-note appearance="simple" icon=false} +**Topic Model.** Where a word cloud (or tag cloud) shows which words occur most frequently, a `topic model` analysis shows which words co-occur in the same documents. Using the most common topic modeling algorithm, Latent Dirichlet Allocation or LDA, Example [-@exm-funlda] explores the tweets by automatically clustering the tags selected earlier into 10 *topics*. Topic modeling is non-deterministic -- if you run it again you can get slightly different topics, and topics are swapped around randomly as the topic numbers have no special meaning. By setting the computer's *random seed* you can ensure that if you run it again you get the same results. As you can see, some topics seem easily interpretable (such as topic 2 about social distancing, and topic 8 on health care), it is always recommended that you inspect the clustered documents and edge cases in addition to the top words (or tags) as shown here. You can play around with this example by using a different selection of words (modifying the code in Example [-@exm-funcloud]) or changing the number of topics. You can also change (or remove) the random seed and see how running the same model multiple times will give different results. See @sec-unsupervised for more information about fitting, interpreting, and validating topic models. 
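To inspect the clustered documents as recommended above, you can look at the document-topic assignments once the model `m` from the example below has been fitted. A minimal sketch of ours, using functions from the *topicmodels* package:

```{r funlda-inspect-sketch}
#| eval: false
# Sketch: run this after fitting m in the example below
head(topics(m))            # most likely topic per document
head(posterior(m)$topics)  # full document-topic probabilities
```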
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funlda} Topic Model of the COVID tags -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python funlda-python} tags = [ [tag.lower() for tag in re.findall("#\w+", tweet)] for tweet in tw["text"] @@ -240,10 +198,12 @@ for topic, words in m.print_topics(num_words=3): print(f"{topic}: {words}") ``` + ## R code + ```{r funlda-r} set.seed(1) -m = convert(dtm_tags, to="topicmodel") %>% +m = convert(dtm_tags, to="topicmodel") %>% LDA(10, method="Gibbs") terms(m, 5) ``` @@ -252,31 +212,20 @@ terms(m, 5) ::: ## Fun With Visualizing Geographic Information {#sec-fungeo} -For the final set of examples, we will use the location information contained in the Twitter data. -This information is based on what Twitter users enter into their profile, and as such it is incomplete and noisy -with many users giving a nonsensical location such as `Ethereally here' or not filling in any location at all. -However, if we assume that most users that do enter a proper location (such as Lahore or Florida in the top tweets displayed above), -we can use it to map where most tweets are coming from. - -The first step in this analysis is to resolve a name such as `Lahore, Pakistan' to its geographical coordinates (in this case, about 31 degrees north and 74 degrees east). This is called geocoding, and both Google maps and Open Street Maps can be used -to perform this automatically. -As with the tweets themselves, we will use a cached version of the geocoding results here so we can proceed directly. -Please see https://cssbook.net/datasets for the code that was used to create this file so you can play around with it as well. - -Example [-@exm-funmap] shows how this data can be used to create a map of Twitter activity. -First, the cached user data is retrieved, showing the correct location for Lahore but also -illustrating the noisiness of the data with the location "Un peu partout". -Next, this data is `joined` to the Twitter data, so the coordinates are filled in where known. -Finally, we plot this information on a map, showing tweets with more retweets as larger dots. -See Chapter [-@sec-chap-eda] for more information on visualization. - -::: {.callout-note appearance="simple" icon=false} +For the final set of examples, we will use the location information contained in the Twitter data. This information is based on what Twitter users enter into their profile, and as such it is incomplete and noisy with many users giving a nonsensical location such as \`Ethereally here' or not filling in any location at all. However, if we assume that most users that do enter a proper location (such as Lahore or Florida in the top tweets displayed above), we can use it to map where most tweets are coming from. + +The first step in this analysis is to resolve a name such as \`Lahore, Pakistan' to its geographical coordinates (in this case, about 31 degrees north and 74 degrees east). This is called geocoding, and both Google maps and Open Street Maps can be used to perform this automatically. As with the tweets themselves, we will use a cached version of the geocoding results here so we can proceed directly. Please see https://cssbook.net/datasets for the code that was used to create this file so you can play around with it as well. + +Example [-@exm-funmap] shows how this data can be used to create a map of Twitter activity. 
First, the cached user data is retrieved, showing the correct location for Lahore but also illustrating the noisiness of the data with the location "Un peu partout". Next, this data is `joined` to the Twitter data, so the coordinates are filled in where known. Finally, we plot this information on a map, showing tweets with more retweets as larger dots. See Chapter [-@sec-chap-eda] for more information on visualization. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funmap} Location of COVID tweets -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python funmap-python} #| results: hide url = "https://cssbook.net/d/covid_users.csv" @@ -289,44 +238,38 @@ gdf.plot(ax=ax, color="red", alpha=0.2, markersize=tw["retweet_count"]) plt.show() ``` + ## R code + ```{r funmap-r} url = "https://cssbook.net/d/covid_users.csv" users = read_csv(url) tw2 = left_join(tw, users) ggplot(mapping=aes(x=long, y=lat)) + - geom_polygon(aes(group=group), - data=map_data("world"), + geom_polygon(aes(group=group), + data=map_data("world"), fill="lightgray", colour = "white") + - geom_point(aes(size=retweet_count, - alpha=retweet_count), - data=tw2, color="red") + - theme_void() + theme(aspect.ratio=1) + - guides(alpha=FALSE, size=FALSE) + - ggtitle("Location of COVID tweets", + geom_point(aes(size=retweet_count, + alpha=retweet_count), + data=tw2, color="red") + + theme_void() + theme(aspect.ratio=1) + + guides(alpha=FALSE, size=FALSE) + + ggtitle("Location of COVID tweets", "Size indicates number of retweets") ``` ::: ::: ::: -**Combining textual and structured information.** -Since we know the location of a subset of our tweet's users, -we can differentiate between e.g. American, European, and Asian tweets. -Example [-@exm-funcompare] creates a very rough identification of North American tweets, -and uses that to compute the relative frequency of words in those tweets compared to the rest. -Not surprisingly, those tweets are much more about American politics, locations, and institutions. -The other tweets talk about UK politics but also use a variety of names to refer to the pandemic. -To play around with this, see if you can isolate e.g. Asian or South American tweets, -or compare Spanish tweets from different locations. - -::: {.callout-note appearance="simple" icon=false} +**Combining textual and structured information.** Since we know the location of a subset of our tweet's users, we can differentiate between e.g. American, European, and Asian tweets. Example [-@exm-funcompare] creates a very rough identification of North American tweets, and uses that to compute the relative frequency of words in those tweets compared to the rest. Not surprisingly, those tweets are much more about American politics, locations, and institutions. The other tweets talk about UK politics but also use a variety of names to refer to the pandemic. To play around with this, see if you can isolate e.g. Asian or South American tweets, or compare Spanish tweets from different locations. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-funcompare} Corpus comparison: North American tweets vs. the rest -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python funcompare-python} #| results: hide nltk.download("stopwords") @@ -338,24 +281,27 @@ for k in stopwords.words("english"): del cn[k] del cr[k] key = sh.ProportionShift(type2freq_1=cn, type2freq_2=cr) -key.get_shift_graph().plot() +# WvA: It looks like shifterator is not working anymore? 
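# If the shift graph fails, a rough fallback (our sketch, not part of the
# original example) is to compare relative word frequencies directly:
#   freqs = pd.DataFrame({"na": pd.Series(cn), "rest": pd.Series(cr)}).fillna(0)
#   freqs = freqs / freqs.sum()
#   (freqs["na"] - freqs["rest"]).sort_values().tail(15).plot(kind="barh")
# The original call is kept below for reference: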
+# key.get_shift_graph().plot() ``` + ## R code + ```{r funcompare-r} dfm = tw2 %>% mutate(northamerica=ifelse( long < -60 & lat > 25,"N. America","Rest"))%>% - filter(lang=="en") %>% - corpus(docid_field="status_id") %>% + filter(lang=="en") %>% + corpus(docid_field="status_id") %>% tokens(remove_punct=T) %>% tokens_group(northamerica) %>% - dfm(tolower=T) %>% + dfm(tolower=T) %>% dfm_remove(stopwords("en")) %>% dfm_select(min_nchar=4) key = textstat_keyness(dfm, target="N. America") textplot_keyness(key, margin=0.2) + - ggtitle("Words preferred by North Americans", - "(Only English-language tweets)") + + ggtitle("Words preferred by North Americans", + "(Only English-language tweets)") + theme_void() ``` @@ -365,31 +311,15 @@ textplot_keyness(key, margin=0.2) + ## Fun With Networks {#sec-funnet} -Twitter, of course, is a social network as well as a microblogging service: -users are connected to other users because they follow each other and retweet and like each others' tweets. -Using the `reply_to_screen_name` column, we can inspect the reply network contained in the COVID tweet dataset. -Example [-@exm-fungraph] first uses the data summarization commands from tidyverse(R) and pandas(Python) to -create a data frame of connections or `edges` listing how often each user replies each other user. -The second code block shows how the *igraph* (R) and *networkx* (Python) packages are used to convert this edge list into a graph. -From this graph, we select only the largest connected component and use a clustering algorithm to analyze which -nodes (users) form cohesive subnetworks. -Finally, a number of options are used to set the color and size of the edges, nodes, and labels, -and the resulting network is plotted. -As you can see, the central node is Donald Trump, who is replied by a large number of users, -some of which are then replied by other users. -You can play around with different settings for the plot options, -or try to filter e.g. only tweets from a certain language. -You could also easily compute social network metrics such as centrality on this network, -and/or export the network for further analysis in specialized social network analysis software. -See Chapter [-@sec-chap-network] for more information on network analysis, -and Chapter [-@sec-chap-datawrangling] for the summarization commands used to create the edge list. - -::: {.callout-note appearance="simple" icon=false} +Twitter, of course, is a social network as well as a microblogging service: users are connected to other users because they follow each other and retweet and like each others' tweets. Using the `reply_to_screen_name` column, we can inspect the reply network contained in the COVID tweet dataset. Example [-@exm-fungraph] first uses the data summarization commands from tidyverse(R) and pandas(Python) to create a data frame of connections or `edges` listing how often each user replies each other user. The second code block shows how the *igraph* (R) and *networkx* (Python) packages are used to convert this edge list into a graph. From this graph, we select only the largest connected component and use a clustering algorithm to analyze which nodes (users) form cohesive subnetworks. Finally, a number of options are used to set the color and size of the edges, nodes, and labels, and the resulting network is plotted. As you can see, the central node is Donald Trump, who is replied by a large number of users, some of which are then replied by other users. 
You can play around with different settings for the plot options, or try to filter e.g. only tweets from a certain language. You could also easily compute social network metrics such as centrality on this network, and/or export the network for further analysis in specialized social network analysis software. See Chapter [-@sec-chap-network] for more information on network analysis, and Chapter [-@sec-chap-datawrangling] for the summarization commands used to create the edge list. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-fungraph} Reply network in the COVID tweets. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python fungraph-python} edges = tw2[["screen_name", "reply_to_screen_name"]] edges = edges.dropna().rename( @@ -398,20 +328,23 @@ edges = edges.dropna().rename( edges.groupby(["from", "to"]).size().head() ``` + ## R code + ```{r fungraph-r} -edges = tw2 %>% - select(from=screen_name, - to=reply_to_screen_name) %>% +edges = tw2 %>% + select(from=screen_name, + to=reply_to_screen_name) %>% filter(to != "") %>% - group_by(to, from) %>% + group_by(to, from) %>% summarize(n=n()) head(edges) ``` ::: -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python fungraphb-python} #| results: hide g1 = nx.Graph() @@ -429,7 +362,9 @@ nx.draw_networkx_edges(g2, pos) plt.show() ``` + ## R code + ```{r fungraphb-r} # create igraph and select largest component g = graph_from_data_frame(edges) @@ -443,7 +378,7 @@ V(g2)$frame.color = V(g2)$color # Set node (user) and edge (arrow) size V(g2)$size = degree(g2)^.5 V(g2)$label.cex = V(g2)$size/3 -V(g2)$label = ifelse(degree(g2)<=1,"",V(g2)$name) +V(g2)$label = ifelse(degree(g2)<=1,"",V(g2)$name) E(g2)$width = E(g2)$n E(g2)$arrow.size= E(g2)$width/10 plot(g2) @@ -452,22 +387,15 @@ plot(g2) ::: ::: -**Geographic networks.** -In the final example of this chapter, we will combine the geographic and network information to -show which regions of the world interact with each other. -For this, in Example [-@exm-fungeonet] we join the user information to the edges data frame created above twice: -once for the sender, once for the replied-to user. -Then, we adapt the earlier code for plotting the map by adding a line for each node in the network. -As you can see, users in the main regions (US, EU, India) mostly interact with each other, -with almost all regions also interacting with the US. - -::: {.callout-note appearance="simple" icon=false} +**Geographic networks.** In the final example of this chapter, we will combine the geographic and network information to show which regions of the world interact with each other. For this, in Example [-@exm-fungeonet] we join the user information to the edges data frame created above twice: once for the sender, once for the replied-to user. Then, we adapt the earlier code for plotting the map by adding a line for each node in the network. As you can see, users in the main regions (US, EU, India) mostly interact with each other, with almost all regions also interacting with the US. 
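Before moving on to the geographic network, note that the graph object `g2` created in Example [-@exm-fungraph] above can also be used to compute the centrality scores mentioned earlier, or exported for specialized network software. A minimal sketch of ours (the file name is hypothetical):

```{r fungraph-centrality-sketch}
#| eval: false
# Sketch: run after creating g2 in the reply-network example above
sort(degree(g2), decreasing=TRUE)[1:10]  # most replied-to users
write_graph(g2, "reply_network.graphml", format="graphml")
```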
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-fungeonet} Reply Network of Tweets -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python fungeonet-python} #| results: hide u = users.drop(["location"], axis=1) @@ -505,25 +433,26 @@ py = lambda point: point.y # plt.show() ``` + ## R code + ```{r fungeonet-r} -edges2 = edges %>% +edges2 = edges %>% inner_join(users, by=c("from"="screen_name"))%>% - inner_join(users, by=c("to"="screen_name"), - suffix=c("", ".to")) %>% + inner_join(users, by=c("to"="screen_name"), + suffix=c("", ".to")) %>% filter(lat != lat.to | long != long.to ) ggplot(mapping=aes(x = long, y = lat)) + geom_polygon(aes(group=group),map_data("world"), fill="lightgray", colour = "white") + - geom_point(aes(size=retweet_count, + geom_point(aes(size=retweet_count, alpha=retweet_count), data=tw2, color="red")+ geom_curve(aes(xend=long.to,yend=lat.to,size=n), edges2, curvature=.1, alpha=.5) + theme_void() + guides(alpha=FALSE, size=FALSE) + - ggtitle("Retweet network of COVID tweets", + ggtitle("Retweet network of COVID tweets", "Bubble size indicates total no. of retweets") ``` ::: ::: ::: - diff --git a/content/chapter06.qmd b/content/chapter06.qmd index 827a295..1556ee5 100644 --- a/content/chapter06.qmd +++ b/content/chapter06.qmd @@ -80,7 +80,7 @@ This means that the original `d` is overwritten. ::: {.callout-note icon=false collapse=true} In R, the *tidyverse* function `select` is quite versatile. - You can specify multiple columns using `select(d, column1, column2)` + You can specify multiple columns using `select(d, column1, column2)` or by specifying a range of columns: `select(d, column1:column3)`. Both commands keep only the specified columns. As in the example, you can also specify a negative selection with the minus sign: @@ -129,7 +129,7 @@ d2 ```{r data-filter-r} url="https://cssbook.net/d/guns-polls.csv" d = read_csv(url) -d = rename(d, rep=`Republican Support`, +d = rename(d, rep=`Republican Support`, dem=`Democratic Support`) d = select(d, -URL) @@ -195,15 +195,15 @@ d2 = pd.read_csv(url) # Note that when creating a new column, # you have to use df["col"] rather than df.col d2["rep2"] = d2.rep.str.replace("[^0-9\\.]", "") -d2["rep2"] = pd.to_numeric(d2.rep2) -d2["Support2"] = d2.Support.fillna(d.Support.mean()) +d2["rep2"] = pd.to_numeric(d2.rep2, errors='coerce') +d2["Support2"] = d2.Support.fillna(d2.Support.mean()) # Alternatively, clean with .assign # Note the need to use an anonymous function # (lambda) to chain calculations cleaned = d2.assign( rep2=d2.rep.str.replace("[^0-9\\.]", ""), - rep3=lambda d2: pd.to_numeric(d2.rep2), + rep3=lambda d2: pd.to_numeric(d2.rep2, errors='coerce'), Support2=d2.Support.fillna(d2.Support.mean()), ) @@ -222,20 +222,20 @@ cleaned.head() url="https://cssbook.net/d/guns-polls-dirty.csv" d2 = read_csv(url) -# Option 1: clean with direct assignment. +# Option 1: clean with direct assignment. 
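# (the pattern "[^0-9\\.]" used below matches every character that is
#  not a digit or a period, so replacing those matches with "" keeps
#  only the numeric part of the string)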
# Note the need to specify d2$ everywhere d2$rep2=str_replace_all(d2$rep, "[^0-9\\.]", "") d2$rep2 = as.numeric(d2$rep2) -d2$Support2 = replace_na(d2$Support, +d2$Support2 = replace_na(d2$Support, mean(d2$Support, na.rm=T)) # Alternative, clean with mutate -# No need to specify d2$, +# No need to specify d2$, # and we can assign to a new or existing object -cleaned = mutate(d2, +cleaned = mutate(d2, rep2 = str_replace_all(rep, "[^0-9\\.]", ""), rep2 = as.numeric(rep2), - Support2 = replace_na(Support, + Support2 = replace_na(Support, mean(Support, na.rm=TRUE))) # Finally, you can create your own function @@ -270,7 +270,7 @@ Note that all these versions work fine and produce the same result. In the end, it is up to the researcher to determine which feels most natural given the circumstances. As noted above, in R we would generally prefer `mutate` over direct assignment, mostly because it fits nicely into the *tidyverse* workflow and you do not need to repeat the data frame name. -In Python, we would generally prefer the direct assignment, unless a copy of the data with the changes made is convenient, +In Python, we would generally prefer the direct assignment, unless a copy of the data with the changes made is convenient, in which case `assign` can be more useful. ## Grouping and Aggregating {#sec-grouping} @@ -366,7 +366,7 @@ d.groupby("Question").agg({"Support": ["mean", "std"]}) ``` ## R code ```{r aggregate2-r} -d %>% group_by(Question) %>% +d %>% group_by(Question) %>% summarize(m=mean(Support), sd=sd(Support)) ``` ::: @@ -403,8 +403,8 @@ d.head() ``` ## R code ```{r transform-r} -d = d %>% group_by(Question) %>% - mutate(mean = mean(Support), +d = d %>% group_by(Question) %>% + mutate(mean = mean(Support), deviation=Support - mean) head(d) ``` @@ -523,9 +523,9 @@ capital_fr.head() ``` ## R code ```{r capital_1-r} -private_fr = private %>% +private_fr = private %>% select(Year, fr_private=France) -public_fr = public %>% +public_fr = public %>% select(Year, fr_public=France) capital_fr = full_join(private_fr, public_fr) # Data for Figure 3.6 (Piketty, 2014, p 128) @@ -544,7 +544,7 @@ print(f"Pearson correlation: rho={r:.2},p={p:.3}") ## R code ```{r capital_2-r} # Are private and public capital correlated? 
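# (note: these are yearly time series, so the independence assumption
#  of the test is violated; read the correlation as descriptive, as the
#  chapter footnote explains)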
-cor.test(capital_fr$fr_private, +cor.test(capital_fr$fr_private, capital_fr$fr_public) ``` ::: @@ -645,7 +645,7 @@ results.head() ## R code ```{r primary-r} r="https://cssbook.net/d/2016_primary_results.csv" -results = read_csv(r) +results = read_csv(r) head(results) ``` ::: @@ -695,10 +695,10 @@ r.head() ``` ## R code ```{r nested-r} -c = counties %>% +c = counties %>% select("fips", "area_name", "Race_black_pct") -r = results %>% - filter(candidate == "Hillary Clinton") %>% +r = results %>% + filter(candidate == "Hillary Clinton") %>% select(fips, votes, fraction_votes) r = inner_join(r, c) cor.test(r$Race_black_pct, r$fraction_votes) @@ -796,15 +796,15 @@ d = bind_rows( private %>% add_column(type="private"), public %>% add_column(type="public")) countries = c("Germany", "France", "U.K.") -d %>% filter(country %in% countries) %>% - ggplot(aes(x=Year, y=capital, - color=country, lty=type)) + - geom_line()+ +d %>% filter(country %in% countries) %>% + ggplot(aes(x=Year, y=capital, + color=country, lty=type)) + + geom_line()+ ylab("Capital (% of national income)") + - guides(colour=guide_legend("Country"), - linetype=guide_legend("Capital")) + - theme_classic() + - ggtitle("Capital in Europe, 1970 - 2010", + guides(colour=guide_legend("Country"), + linetype=guide_legend("Capital")) + + theme_classic() + + ggtitle("Capital in Europe, 1970 - 2010", "Partial reproduction of Piketty fig 4.4") ``` @@ -880,14 +880,14 @@ d = read_excel(dest,sheet="TS8.2",skip=4) d = d%>% rename("year"=1) #2 Reshape: Pivoting to long, dropping missing -d = d%>%pivot_longer(-year, values_to="share")%>% +d = d%>%pivot_longer(-year, values_to="share")%>% na.omit() #3 Normalize cols = c(NA,"percent","type",NA,"capital_gains") d = d %>% separate(name, into=cols, - sep=" ", extra="merge", fill="right") %>% - mutate(year=as.numeric(year), + sep=" ", extra="merge", fill="right") %>% + mutate(year=as.numeric(year), capital_gains=!is.na(capital_gains)) head(d) ``` @@ -910,13 +910,13 @@ plt.set(xlabel="Year", ylabel="Share of income going to top-1%") ## R code ```{r excel2-r} #4 Filter for the desired data -subset = d %>% filter(year >=1910, - percent=="1%", +subset = d %>% filter(year >=1910, + percent=="1%", capital_gains==F) #5 Analyze and/or visualization ggplot(subset, aes(x=year, y=share, color=type)) + - geom_line() + xlab("Year") + + geom_line() + xlab("Year") + ylab("Share of income going to top-1%") + theme_classic() @@ -936,4 +936,3 @@ ylab("Share of income going to top-1%") + [^5]: See Section [-@sec-datatypes] for more information on working with dictionaries [^6]: Of course, the fact that this is time series data means that the independence assumption of regular correlation is violated badly, so this should be interpreted as a descriptive statistic, e.g. in the years with high private capital there is low public capital and the other way around. - diff --git a/content/chapter08.qmd b/content/chapter08.qmd index f4ed88e..48a3189 100644 --- a/content/chapter08.qmd +++ b/content/chapter08.qmd @@ -7,8 +7,8 @@ At the time of writing this chapter for the published book, `caret` was the state of the art Machine Learning package for R. We now think that the (newer) `tidymodels` package is a better choice in many regards. -For this reason, we are planning to rewrite this chapter using that package. -See the [relevant github issue](https://github.com/vanatteveldt/cssbook/issues/6) for more information. +For this reason, we are planning to rewrite this chapter using that package. 
+See the [relevant github issue](https://github.com/vanatteveldt/cssbook/issues/6) for more information. We also plan to add a section on Encoder Representation / Transformer models, see the [relevant github issue](https://github.com/vanatteveldt/cssbook/issues/4) ::: @@ -114,7 +114,7 @@ in the computational analysis of communication. In this chapter, we focus on *supervised* machine learning (SML) -- a form of machine learning, where we aim to predict a variable -that, for at least a part of our data, is known. SML is usually applied to *classification* and *regression* problems. To illustrate the +that, for at least a part of our data, is known. SML is usually applied to *classification* and *regression* problems. To illustrate the idea, imagine that you are interested in predicting gender, based on Twitter biographies. You determine the gender for some of the biographies yourself and hand these examples over to the computer. The @@ -193,7 +193,7 @@ mod.params df = read.csv("https://cssbook.net/d/media.csv") mod = lm(formula = "newspaper ~ age + gender", data = df) -# summary(mod) would give a lot more info, +# summary(mod) would give a lot more info, # but we only care about the coefficients: mod ``` @@ -368,7 +368,7 @@ print(f"We have {len(X_train)} training and " f"{len(X_test)} test cases.") ```{r preparedata-r} df = read.csv("https://cssbook.net/d/media.csv") df = na.omit(df %>% mutate( - usesinternet=recode(internet, + usesinternet=recode(internet, .default="user", `0`="non-user"))) set.seed(42) @@ -380,10 +380,10 @@ split = initial_split(df, prop = .8) traindata = training(split) testdata = testing(split) -X_train = select(traindata, +X_train = select(traindata, c("age", "gender", "education")) y_train = traindata$usesinternet -X_test = select(testdata, +X_test = select(testdata, c("age", "gender", "education")) y_test = testdata$usesinternet @@ -419,7 +419,7 @@ y_pred = myclassifier.predict(X_test) ``` ## R code ```{r nb-r} -myclassifier = train(x = X_train, y = y_train, +myclassifier = train(x = X_train, y = y_train, method = "naive_bayes") y_pred = predict(myclassifier, newdata = X_test) ``` @@ -577,7 +577,7 @@ calculate $P(features)$ and $P(features|label)$ by just multiplying the probabilities of each individual feature. Let's assume we have three features, $x_1, x_2, x_3$. We now simply calculate the percentage of *all* cases that contain these -features: $P(x_1)$, $P(x_2)$ and $P(x_3)$. +features: $P(x_1)$, $P(x_2)$ and $P(x_3)$. Then we do the same for the conditional probabilities and calculate the percentage of cases @@ -788,13 +788,13 @@ y_pred = myclassifier.predict(X_test_scaled) ## R code ```{r svm-r} #| cache: true -# !!! We normalize our features to have M=0 and -# SD=1. This is necessary as our features are not +# !!! We normalize our features to have M=0 and +# SD=1. This is necessary as our features are not # measured on the same scale, which SVM requires. 
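# (the preProcess = c("center", "scale") argument below performs this
#  standardization inside train())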
# Alternatively, rescale to [0:1] or [-1:1] -myclassifier = train(x = X_train, y = y_train, - preProcess = c("center", "scale"), +myclassifier = train(x = X_train, y = y_train, + preProcess = c("center", "scale"), method = "svmLinear3") y_pred = predict(myclassifier, newdata = X_test) ``` @@ -892,7 +892,7 @@ y_pred = myclassifier.predict(X_test) ## R code ```{r randomforest-r} #| cache: true -myclassifier = train(x = X_train, y = y_train, +myclassifier = train(x = X_train, y = y_train, method = "rf") y_pred = predict(myclassifier, newdata = X_test) ``` @@ -1034,7 +1034,7 @@ Similarly, [@sec-chap-image] will show how a similar technique can be used to ex This involves creating a two-dimensional window over pixels rather than a unidimensional window over words, and often multiple convolutional layers are chained to detect features in increasingly large areas of the image. The underlying technique of convolutional networks, however, is the same in both cases. -## Validation and Best Practices {#sec-validation} +## Validation and Best Practices {#sec-validation} ### Finding a Balance Between Precision and Recall {#sec-balance} In the previous sections, we have learned how to fit different models: @@ -1119,8 +1119,8 @@ One approach is to print a table with three columns: the false positive rate, the true positive rate, and the threshold value. You then decide which FPR--TPR combination is most appealing to you, and use the corresponding threshold value. Alternatively, you can find the threshold value -with the maximum distance between TPR and FPR, an approach also known as Yoden's J (Example 8.9). -Plotting the ROC curve can also help interpreting which +with the maximum distance between TPR and FPR, an approach also known as Yoden's J (Example 8.9). +Plotting the ROC curve can also help interpreting which TPR/FPR combination is most promising (i.e., closest to the upper left corner). ::: {.callout-note appearance="simple" icon=false} @@ -1141,11 +1141,11 @@ print(confusion_matrix(y_test, y_pred)) ``` ## R code ```{r cutoffpoint-r} -m = glm(usesinternet ~ age + gender + education, +m = glm(usesinternet ~ age + gender + education, data=traindata, family="binomial") y_pred = predict(m, newdata = testdata, type = "response") -pred_default = as.factor(ifelse(y_pred>0.5, +pred_default = as.factor(ifelse(y_pred>0.5, "user", "non-user")) print("Confusion matrix, default threshold (0.5)") @@ -1323,7 +1323,7 @@ print(f"M={acc.mean():.2f}, SD={acc.std():.3f}") myclassifier = train(x = X_train, y = y_train, method = "glm", family="binomial", metric="Accuracy", trControl = trainControl( - method = "cv", number = 5, + method = "cv", number = 5, returnResamp ="all", savePredictions=TRUE),) print(myclassifier$resample) print(myclassifier$results) @@ -1413,7 +1413,7 @@ print(classification_report(y_test, search.predict(X_test_scaled))) ::: {.callout-note appearance="simple" icon=false} ::: {#exm-gridsearch3} -A gridsearch in R. +A gridsearch in R. ## R code ```{r gridsearch3-r} #| cache: true @@ -1421,12 +1421,12 @@ A gridsearch in R. 
grid = expand.grid(Loss=c("L1","L2"), cost=c(100,1000)) -# Train the model using our previously defined +# Train the model using our previously defined # parameters gridsearch = train(x = X_train, y = y_train, - preProcess = c("center", "scale"), - method = "svmLinear3", - trControl = trainControl(method = "cv", + preProcess = c("center", "scale"), + method = "svmLinear3", + trControl = trainControl(method = "cv", number = 5), tuneGrid = grid) gridsearch @@ -1484,4 +1484,3 @@ While in the end, you can find a supervised machine learning sampling. [^6]: ([jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html)) - diff --git a/content/chapter10.qmd b/content/chapter10.qmd index 1d2ef6e..c3dc500 100644 --- a/content/chapter10.qmd +++ b/content/chapter10.qmd @@ -1,41 +1,43 @@ # Text as data {#sec-chap-dtm} -::: {.callout-warning} +::: callout-warning # Update planned: R Tidytext -At the time of writing this chapter for the published book, `quanteda` was our package of choice for text analysis in R. -We now think that the `tidytext` package is easier to learn for students who have just leared the `tidyverse` data wrangling package. -For this reason, we are planning to rewrite this chapter using that package. -See the [relevant github issue](https://github.com/vanatteveldt/cssbook/issues/5) for more information. +At the time of writing this chapter for the published book, `quanteda` was our package of choice for text analysis in R. We now think that the `tidytext` package is easier to learn for students who have just leared the `tidyverse` data wrangling package. For this reason, we are planning to rewrite this chapter using that package. See the [relevant github issue](https://github.com/vanatteveldt/cssbook/issues/5) for more information. ::: +```{=html} + +``` {{< include common_setup.qmd >}} -**Abstract.** - This chapter shows how you can analyze texts that are stored as a data frame column or variable using functions from the package *quanteda* in R and the package *sklearn* in Python and R. - Please see Chapter [-@sec-chap-protext] for more information on reading and cleaning text. +**Abstract.** This chapter shows how you can analyze texts that are stored as a data frame column or variable using functions from the package *quanteda* in R and the package *sklearn* in Python and R. Please see Chapter [-@sec-chap-protext] for more information on reading and cleaning text. **Keywords.** Text as Data, Document-Term Matrix **Objectives:** -- Create a document-term matrix from text - - Perform document and feature selection and weighting - - Understand and use more advanced representations such as n-grams and embeddings - -::: {.callout-note icon=false collapse=true} +- Create a document-term matrix from text +- Perform document and feature selection and weighting +- Understand and use more advanced representations such as n-grams and embeddings -This chapter introduces the packages *quanteda* (R) and *sklearn* and *nltk* (Python) for converting text into a document-term matrix. It also introduces the *udpipe* package for natural language processing. -You can install these packages with the code below if needed (see Section [-@sec-installing] for more details): +::: {.callout-note icon="false" collapse="true"} +This chapter introduces the packages *quanteda* (R) and *sklearn* and *nltk* (Python) for converting text into a document-term matrix. 
It also introduces the *udpipe* package for natural language processing. You can install these packages with the code below if needed (see Section [-@sec-installing] for more details): -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python chapter10install-python} #| eval: false !pip3 install ufal.udpipe spacy nltk scikit-learn==0.24.2 !pip3 install gensim==4.0.1 wordcloud nagisa conllu tensorflow==2.5.0 tensorflow-estimator==2.5.0 ``` + ## R code + ```{r chapter10install-r} #| eval: false install.packages(c("glue","tidyverse","quanteda", @@ -43,10 +45,12 @@ install.packages(c("glue","tidyverse","quanteda", "udpipe", "spacyr")) ``` ::: - After installing, you need to import (activate) the packages every session: -::: {.panel-tabset} +After installing, you need to import (activate) the packages every session: + +::: panel-tabset ## Python code + ```{python chapter10library-python} # Standard library and basic data wrangling import os @@ -81,7 +85,9 @@ from ufal.udpipe import Model, Pipeline import conllu ``` + ## R code + ```{r chapter10library-r} library(glue) library(tidyverse) @@ -98,18 +104,15 @@ library(spacyr) ## The Bag of Words and the Term-Document Matrix {#sec-dtm} -Before you can conduct any computational analysis of text, you need to solve a problem: computations are usually done on numerical data -- but you have text. Hence, you must find a way to *represent* the text by numbers. -The document-term matrix (DTM, also called the term-document matrix or TDM) is one common numerical representation of text. -It represents a *corpus* (or set of documents) as a matrix or table, where each row represents a document, each column represents a term (word), -and the numbers in each cell show how often that word occurs in that document. - -::: {.callout-note appearance="simple" icon=false} +Before you can conduct any computational analysis of text, you need to solve a problem: computations are usually done on numerical data -- but you have text. Hence, you must find a way to *represent* the text by numbers. The document-term matrix (DTM, also called the term-document matrix or TDM) is one common numerical representation of text. It represents a *corpus* (or set of documents) as a matrix or table, where each row represents a document, each column represents a term (word), and the numbers in each cell show how often that word occurs in that document. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-dtm} Example document-term matrix -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python dtm-python} #| cache: true texts = [ @@ -121,10 +124,12 @@ d = cv.fit_transform(texts) # Create a dataframe of the word counts to inspect # - todense transforms the dtm into a dense matrix # - get_feature_names() gives a list words -pd.DataFrame(d.todense(), columns=cv.get_feature_names()) +pd.DataFrame(d.todense(), columns=cv.get_feature_names_out()) ``` + ## R code + ```{r dtm-r} #| cache: true texts = c( @@ -138,31 +143,21 @@ convert(d, "matrix") ::: ::: -As an example, Example [-@exm-dtm] shows a DTM made from two lines from the famous poem by Mary Angelou. -The resulting matrix has two rows, one for each line; and 11 columns, one for each unique term (word). -In the columns you see the document frequencies of each term: the word "bird" occurs once in each line, -but the word "with" occurs only in the first line (text1) and not in the second (text2). +As an example, Example [-@exm-dtm] shows a DTM made from two lines from the famous poem by Mary Angelou. 
The resulting matrix has two rows, one for each line; and 11 columns, one for each unique term (word). In the columns you see the document frequencies of each term: the word "bird" occurs once in each line, but the word "with" occurs only in the first line (text1) and not in the second (text2). -In R, you can use the `dfm` function from the *quanteda* package [@quanteda]. -This function can take a vector or column of texts and transforms it directly into a DTM -(which quanteda actually calls a document-*feature* matrix, hence the function name `dfm`). -In Python, you achieve the same by creating an object of the `CountVectorizer` class, which has a `fit_transform` function. +In R, you can use the `dfm` function from the *quanteda* package [@quanteda]. This function can take a vector or column of texts and transforms it directly into a DTM (which quanteda actually calls a document-*feature* matrix, hence the function name `dfm`). In Python, you achieve the same by creating an object of the `CountVectorizer` class, which has a `fit_transform` function. ### Tokenization {#sec-tokenizations} -In order to turn a corpus into a matrix, each text needs to be *tokenized*, -meaning that it must be split into a list (vector) of words. -This seems trivial, as English (and most western) text generally uses spaces to demarcate words. -However, even for English there are a number of edge cases. -For example, should "haven't" be seen as a single word, or two? - -::: {.callout-note appearance="simple" icon=false} +In order to turn a corpus into a matrix, each text needs to be *tokenized*, meaning that it must be split into a list (vector) of words. This seems trivial, as English (and most western) text generally uses spaces to demarcate words. However, even for English there are a number of edge cases. For example, should "haven't" be seen as a single word, or two? +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-tokenize} Differences between tokenizers -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python tokenize-python} #| cache: true text = "I haven't seen John's derring-do" @@ -170,7 +165,9 @@ tokenizer = CountVectorizer().build_tokenizer() print(tokenizer(text)) ``` + ## R code + ```{r tokenize-r} #| cache: true text = "I haven't seen John's derring-do" @@ -180,38 +177,20 @@ tokens(text) ::: ::: -Example [-@exm-tokenize] shows how Python and R deal with the sentence "I haven't seen John's derring-do". -For Python, we first use `CountVectorizer.build_tokenizer` to access the built-in tokenizer. -As you can see in the first line of input, this tokenizes "haven't" to `haven`, -which of course has a radically different meaning. Moreover, it silently drops all single-letter words, -including the `'t`, `'s`, and `I`. +Example [-@exm-tokenize] shows how Python and R deal with the sentence "I haven't seen John's derring-do". For Python, we first use `CountVectorizer.build_tokenizer` to access the built-in tokenizer. As you can see in the first line of input, this tokenizes "haven't" to `haven`, which of course has a radically different meaning. Moreover, it silently drops all single-letter words, including the `'t`, `'s`, and `I`. -In the box "Tokenizing in Python" below, we therefore discuss some alternatives. -For instance, the `TreebankWordTokenizer` included in the *nltk* package is a more reasonable tokenizer and -splits "haven't" into `have` and `n't`, which is a reasonable outcome. 
-Unfortunately, this tokenizer assumes that text has already been split into sentences, -and it also includes punctuation as tokens by default. -To circumvent this, we can introduce a custom tokenizer based on the Treebank tokenizer, -which splits text into sentences (using `nltk.sent_tokenize`) -- see the box for more details. +In the box "Tokenizing in Python" below, we therefore discuss some alternatives. For instance, the `TreebankWordTokenizer` included in the *nltk* package is a more reasonable tokenizer and splits "haven't" into `have` and `n't`, which is a reasonable outcome. Unfortunately, this tokenizer assumes that text has already been split into sentences, and it also includes punctuation as tokens by default. To circumvent this, we can introduce a custom tokenizer based on the Treebank tokenizer, which splits text into sentences (using `nltk.sent_tokenize`) -- see the box for more details. -For R, we simply call the `tokens` function from the *quanteda* package. -This keeps `haven't` and `John's` as a single word, which is probably less desirable than splitting the words -but at least better than outputting the word `haven`. +For R, we simply call the `tokens` function from the *quanteda* package. This keeps `haven't` and `John's` as a single word, which is probably less desirable than splitting the words but at least better than outputting the word `haven`. -As this simple example shows, even a relatively simple sentence is tokenized differently by the tokenizers considered here (and see the box on tokenization in Python). -Depending on the research question, these differences might or might not be important. -However, it is always a good idea to check the output of this (and other) preprocessing steps so you understand -what information is kept or discarded. +As this simple example shows, even a relatively simple sentence is tokenized differently by the tokenizers considered here (and see the box on tokenization in Python). Depending on the research question, these differences might or might not be important. However, it is always a good idea to check the output of this (and other) preprocessing steps so you understand what information is kept or discarded. -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Tokenization in Python -As you can see in the example, the built-in tokenizer in scikit-learnis not actually very good. - For example, *haven't* is tokenized to *haven*, which is an entirely different word. - Fortunately, there are other tokenizers in the *nltk.tokenize* package that do better. +As you can see in the example, the built-in tokenizer in scikit-learnis not actually very good. For example, *haven't* is tokenized to *haven*, which is an entirely different word. Fortunately, there are other tokenizers in the *nltk.tokenize* package that do better. -For example, the `TreebankTokenizer` uses the tokenization rules for the Penn Treebank - to tokenize, which produces better results: +For example, the `TreebankTokenizer` uses the tokenization rules for the Penn Treebank to tokenize, which produces better results: ```{python tokenizealt-python} text = """I haven't seen John's derring-do. @@ -220,9 +199,7 @@ print(TreebankWordTokenizer().tokenize(text)) ``` -Another example is the `WhitespaceTokenizer`, which simply uses whitespace to tokenize, - which can be useful if your input has already been tokenized, - and is used in Example [-@exm-tagcloud] below for tweets to conserve hash tags. 
+Another example is the `WhitespaceTokenizer`, which simply uses whitespace to tokenize, which can be useful if your input has already been tokenized, and is used in Example [-@exm-tagcloud] below for tweets to conserve hash tags. ```{python tokenizealt1-python} #| cache: true @@ -230,20 +207,14 @@ print(WhitespaceTokenizer().tokenize(text)) ``` -You can also write your own tokenizer if needed. - For example, the `TreebankTokenizer` assumes that text has already been split into sentences - (which is why the period is attached to the word *derring-do.*). - The code below shows how we can make our own tokenizer class, - which uses `nltk.sent_tokenize` to first split the text into sentences, - and then uses the `TreebankTokenizer` to tokenize each sentence, - keeping only tokens that include at least one letter character. - Although a bit more complicated, this approach can give you maximum flexibility. +You can also write your own tokenizer if needed. For example, the `TreebankTokenizer` assumes that text has already been split into sentences (which is why the period is attached to the word *derring-do.*). The code below shows how we can make our own tokenizer class, which uses `nltk.sent_tokenize` to first split the text into sentences, and then uses the `TreebankTokenizer` to tokenize each sentence, keeping only tokens that include at least one letter character. Although a bit more complicated, this approach can give you maximum flexibility. ```{python tokenizealt2-python} #| cache: true #| results: hide nltk.download("punkt") ``` + ```{python tokenizealt3-python} class MyTokenizer: def tokenize(self, text): @@ -262,12 +233,13 @@ print(mytokenizer.tokenize(text)) ``` ::: -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-haiku} Tokenization of Japanese verse. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python haiku-python} #| cache: true # this snippet uses the tokenizer created above @@ -279,7 +251,9 @@ print(f"Default: {mytokenizer.tokenize(haiku)}") print(f"Nagisa: {nagisa.tagging(haiku).words}") ``` + ## R code + ```{r haiku-r} #| cache: true haiku = "\u53e4\u6c60\u86d9 @@ -289,33 +263,22 @@ tokens(haiku) ``` ::: -::: {.panel-tabset} +::: panel-tabset ::: ::: ::: -Note that for languages such as Chinese, Japanese, and Korean, which do not use spaces to delimit words, the story is more difficult. -Although a full treatment is beyond the scope of this book, Example [-@exm-haiku] shows a small example of tokenizing Japanese text, -in this case the famous haiku "the sound of water" by Bashō. -The default tokenizer in quanteda actually does a good job, in contrast to the default Python tokenizer -that simply keeps the whole string as one word -(which makes sense since this tokenizer only looks for whitespace or punctuation). -For Python the best bet is to use a custom package for tokenizing Japanese, such as the *nagisa* package. -This package contains a tokenizer which is able to tokenize the Japanese text, and we could use this in the `CountVectorizer` -much like we used the `TreebankWordTokenizer` for English earlier. -Similarly, with heavily inflected languages such as Hungarian or Arabic, -it might be better to use preprocessing tools developed specifically for these languages, but treating those is -beyond the scope of this book. +Note that for languages such as Chinese, Japanese, and Korean, which do not use spaces to delimit words, the story is more difficult. 
Although a full treatment is beyond the scope of this book, Example [-@exm-haiku] shows a small example of tokenizing Japanese text, in this case the famous haiku "the sound of water" by Bashō. The default tokenizer in quanteda actually does a good job, in contrast to the default Python tokenizer that simply keeps the whole string as one word (which makes sense since this tokenizer only looks for whitespace or punctuation). For Python the best bet is to use a custom package for tokenizing Japanese, such as the *nagisa* package. This package contains a tokenizer which is able to tokenize the Japanese text, and we could use this in the `CountVectorizer` much like we used the `TreebankWordTokenizer` for English earlier. Similarly, with heavily inflected languages such as Hungarian or Arabic, it might be better to use preprocessing tools developed specifically for these languages, but treating those is beyond the scope of this book. ### The DTM as a Sparse Matrix {#sec-sparse} -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-sotu} Example document-term matrix -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python sotu-python} # this snippet uses the tokenizer created above # (example "Tokenization with Python") @@ -325,7 +288,9 @@ d = cv.fit_transform(sotu["text"]) d ``` + ## R code + ```{r sotu-r} url = "https://cssbook.net/d/sotu.csv" sotu = read_csv(url) %>% @@ -338,18 +303,15 @@ d ::: ::: -Example [-@exm-sotu] shows a more realistic example. -It downloads all US "State of the Union" speeches and creates a document-term matrix from them. -Since the matrix is now easily too large to print, both Python and R simply list the size of the matrix. -R lists $85$ documents (rows) and $17999$ features (columns), and Python reports that its size is $85\times17185$. -Note the difference in the number of columns (unique terms) due to the differences in tokenization as discussed above. +Example [-@exm-sotu] shows a more realistic example. It downloads all US "State of the Union" speeches and creates a document-term matrix from them. Since the matrix is now easily too large to print, both Python and R simply list the size of the matrix. R lists $85$ documents (rows) and $17999$ features (columns), and Python reports that its size is $85\times17185$. Note the difference in the number of columns (unique terms) due to the differences in tokenization as discussed above. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-freq} A look inside the DTM. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python freq-python} def termstats(dfm, vectorizer): """Helper function to calculate term and @@ -361,10 +323,11 @@ def termstats(dfm, vectorizer): docfreqs = np.bincount(dfm.indices) freq_df = pd.DataFrame( dict(frequency=frequencies, docfreq=docfreqs), - index=vectorizer.get_feature_names(), + index=vectorizer.get_feature_names_out(), ) return freq_df.sort_values("frequency", ascending=False) ``` + ```{python freq-python2} #| cache: true termstats(d, cv).iloc[[0, 10, 100, 1000, 10000]] @@ -373,7 +336,9 @@ indices = [cv.vocabulary_[x] for x in words] d[[[0], [25], [50], [75]], indices].todense() ``` + ## R code + ```{r freq-r} #| cache: true textstat_frequency(d)[c(1, 10, 100, 1000, 10000), ] @@ -385,106 +350,47 @@ as.matrix(d[ ::: ::: -In Example [-@exm-freq] we show how you can look at the content of the DTM. 
First, we show the overall term and document frequencies of each word, where we showcase words at different frequencies. Unsurprisingly, the word *the* tops both charts, but further down there are minor differences. -In all cases, the highly frequent words are mostly functional words like *them* or *first*. More informative words such as *investments* are by their nature used much less often. -Such term statistics are very useful to check for noise in the data and get a feeling of the kind of language that is used. -Second, we take a look at the frequency of these same words in four speeches from Truman to Obama. All use words like *the* and *first*, but none of them talk about *defrauded* -- which is not surprising, since it was only used once in all the speeches in the corpus. - -However, the words that ranked around 1000 in the top frequency are still used in less than half of the documents. -Since there are about 17000 even less frequent words in the corpus, you can imagine that most of the document-term matrix consists of zeros. -The output also noted this *sparsity* in the first output above. -In fact, R reports that the dtm is $91\%$ sparse, meaning 91\% percent of all entries are zero. -Python reports a similar figure, namely that there are only just under 150000 non-zero entries -out of a possible $8\times22219$, which boils down to a 92\% sparse matrix. - -Note that to display the matrix we turned it from a *sparse matrix* representation into a *dense matrix*. -Briefly put, in a dense matrix, all entries are stored as a long list of numbers, including all the zeros. -In a sparse matrix, only the non-zero entries and their location are stored. -This conversion (using the function `as.matrix` and the method `todense` respectively), however, was only performed after selecting a small subset of the data. -In general, it is very inefficient to store and work with the matrix in a `dense` format. -For a reasonably large corpus with tens of thousands of documents and different words, this can quickly run to billions of numbers, -which can cause problems even on modern computers and is, moreover, very inefficient. -Because sparsity values are often higher than 99\%, using a sparse matrix representation can easily reduce storage requirements by a hundred times, and in the process speed up calculations by reducing the number of entries that need to be inspected. -Both *quanteda* and scikit-learnstore DTMs as sparse matrices by default, -and most analysis tools are able to deal with sparse matrices very efficiently -(see, however, Section [-@sec-workflow] for problems with machine learning on sparse matrices in R). - -A final note on the difference between Python and R in this example. -The code in R is much simpler and produces nicer results since it also shows the words and the speech names. -In Python, we wrote our own helper function to create the frequency statistics which is built into the R *quanteda* package. -These differences between Python and R reflect a pattern that is true in many (but not all) cases: -in Python libraries such as numpyand scikit-learnare setup to maximize performance, -while in R a library such as *quanteda* or *tidyverse* is more geared towards ease of use. -For that reason, the DTM in Python does not "remember" the actual words, it uses the index of each word, -so it consumes less memory if you don't need to use the actual words in e.g. a machine learning setup. -R, on the other hand, stores the words and also the document IDs and metadata in the DFM object. 
-This is easier to use if you need to look up a word or document, but it consumes (slightly) more memory. - -::: {.callout-note icon=false collapse=true} -**Python: Why fit_transform?** -In Python, you don't have a function that directly transforms text into a DTM. -Instead, you create an *transformer* called a CountVectorizer, -which can then be used to "vectorize" texts (turn it into a row of numbers) -by counting how often each word occurs. -This uses the `fit_transform` function which is offered by all scikit-learntransformers. -It "fits" the model on the training data, which in this case means learning the vocabulary. -It can then be used to transform other data into a DTM with the exact same columns, -which is often required for algorithms. -Because the feature names (the words themselves) are stored in the CountVectorizer -rather than the document-term matrix, you generally need to keep both objects.
+In Example [-@exm-freq] we show how you can look at the content of the DTM. First, we show the overall term and document frequencies of each word, where we showcase words at different frequencies. Unsurprisingly, the word *the* tops both charts, but further down there are minor differences. In all cases, the highly frequent words are mostly functional words like *them* or *first*. More informative words such as *investments* are by their nature used much less often. Such term statistics are very useful to check for noise in the data and get a feeling of the kind of language that is used. Second, we take a look at the frequency of these same words in four speeches from Truman to Obama. All use words like *the* and *first*, but none of them talk about *defrauded* -- which is not surprising, since it was only used once in all the speeches in the corpus.
+
+However, the words that ranked around 1000 in the top frequency are still used in less than half of the documents. Since there are about 17000 even less frequent words in the corpus, you can imagine that most of the document-term matrix consists of zeros. The output also noted this *sparsity* in the first output above. In fact, R reports that the DTM is $91\%$ sparse, meaning that 91% of all entries are zero. Python reports a similar figure, namely that there are only just under 150000 non-zero entries out of a possible $85\times22219$, which boils down to a 92% sparse matrix.
+
+Note that to display the matrix we turned it from a *sparse matrix* representation into a *dense matrix*. Briefly put, in a dense matrix, all entries are stored as a long list of numbers, including all the zeros. In a sparse matrix, only the non-zero entries and their location are stored. This conversion (using the function `as.matrix` and the method `todense` respectively), however, was only performed after selecting a small subset of the data. In general, it is very inefficient to store and work with the matrix in a `dense` format. For a reasonably large corpus with tens of thousands of documents and different words, this can quickly run to billions of numbers, which can cause problems even on modern computers and is, moreover, very inefficient. Because sparsity values are often higher than 99%, using a sparse matrix representation can easily reduce storage requirements by a hundred times, and in the process speed up calculations by reducing the number of entries that need to be inspected.
Both *quanteda* and scikit-learn store DTMs as sparse matrices by default, and most analysis tools are able to deal with sparse matrices very efficiently (see, however, Section [-@sec-workflow] for problems with machine learning on sparse matrices in R).
+
+A final note on the difference between Python and R in this example. The code in R is much simpler and produces nicer results since it also shows the words and the speech names. In Python, we wrote our own helper function to create the frequency statistics that come built into the R *quanteda* package. These differences between Python and R reflect a pattern that is true in many (but not all) cases: in Python, libraries such as numpy and scikit-learn are set up to maximize performance, while in R a library such as *quanteda* or *tidyverse* is more geared towards ease of use. For that reason, the DTM in Python does not "remember" the actual words; it only uses the index of each word, so it consumes less memory if you don't need to use the actual words in e.g. a machine learning setup. R, on the other hand, stores the words and also the document IDs and metadata in the DFM object. This is easier to use if you need to look up a word or document, but it consumes (slightly) more memory.
+
+::: {.callout-note icon="false" collapse="true"}
+**Python: Why fit_transform?** In Python, you don't have a function that directly transforms text into a DTM. Instead, you create a *transformer* called a CountVectorizer, which can then be used to "vectorize" texts (turn them into rows of numbers) by counting how often each word occurs. This uses the `fit_transform` function which is offered by all scikit-learn transformers. It "fits" the model on the training data, which in this case means learning the vocabulary. It can then be used to transform other data into a DTM with the exact same columns, which is often required for algorithms. Because the feature names (the words themselves) are stored in the CountVectorizer rather than the document-term matrix, you generally need to keep both objects.
:::

### The DTM as a "Bag of Words" {#sec-bagofwords}

-As you can see already in these simple examples, the document-term matrix discards quite a lot of information from text. -Specifically, it disregards the order or words in a text: "John fired Mary" and "Mary fired John" both result in the same DTM, -even though the meaning of the sentences is quite different. -For this reason, a DTM is often called a *bag of words*, in the sense that all words in the document are simply put in a big bag -without looking at the sentences or context of these words. - -Thus, the DTM can be said to be a specific and "lossy" representation of the text, that turns out to be quite useful for certain tasks: -the frequent occurrence of words like "employment", "great", or "I" might well be good indicators that a text is about the economy, -is positive, or contains personal expressions respectively. -As we will see in the next chapter, the DTM representation can be used for many different text analyses, from dictionaries to supervised and unsupervised machine learning. - -Sometimes, however, you need information that is encoded in the order of words. -For example, in analyzing conflict coverage it might be quite important to know who attacks whom, not just that an attack took place. -In the Section [-@sec-ngram] we will look at some ways to create a richer matrix-representation by using word pairs.
-Although it is beyond the scope of this book, -you can also use automatic syntactic analysis to take grammatical relations into account as well. -As is always the case with automatic analyses, it is important to understand what information the computer is looking at, -as the computer cannot find patterns in information that it doesn't have. +As you can see already in these simple examples, the document-term matrix discards quite a lot of information from text. Specifically, it disregards the order or words in a text: "John fired Mary" and "Mary fired John" both result in the same DTM, even though the meaning of the sentences is quite different. For this reason, a DTM is often called a *bag of words*, in the sense that all words in the document are simply put in a big bag without looking at the sentences or context of these words. -### The (Unavoidable) Word Cloud {#sec-wordcloud} +Thus, the DTM can be said to be a specific and "lossy" representation of the text, that turns out to be quite useful for certain tasks: the frequent occurrence of words like "employment", "great", or "I" might well be good indicators that a text is about the economy, is positive, or contains personal expressions respectively. As we will see in the next chapter, the DTM representation can be used for many different text analyses, from dictionaries to supervised and unsupervised machine learning. + +Sometimes, however, you need information that is encoded in the order of words. For example, in analyzing conflict coverage it might be quite important to know who attacks whom, not just that an attack took place. In the Section [-@sec-ngram] we will look at some ways to create a richer matrix-representation by using word pairs. Although it is beyond the scope of this book, you can also use automatic syntactic analysis to take grammatical relations into account as well. As is always the case with automatic analyses, it is important to understand what information the computer is looking at, as the computer cannot find patterns in information that it doesn't have. -One of the most famous text visualizations is without doubt the word cloud. -Essentially, a word cloud is an image where each word is displayed in a size that is representative of its frequency. -Depending on preference, word position and color can be random, depending on word frequency, or in a decorative shape. +### The (Unavoidable) Word Cloud {#sec-wordcloud} -Word clouds are often criticized since they are (sometimes) pretty but mostly not very informative. -The core reason for that is that only a single aspect of the words is visualized (frequency), -and simple word frequency is often not that informative: the most frequent words are generally uninformative "stop words" like "the" and "I". +One of the most famous text visualizations is without doubt the word cloud. Essentially, a word cloud is an image where each word is displayed in a size that is representative of its frequency. Depending on preference, word position and color can be random, depending on word frequency, or in a decorative shape. -For example, Example [-@exm-wordcloud] shows the word cloud for the state of the union speeches downloaded above. -In R, this is done using the *quanteda* function `textplot_wordcloud`. -In Python we need to work a little harder, since it only has the counts, not the actual words. -So, we sum the DTM columns to get the frequency of each word, and combine that with the feature names (words) -from the `CountVectorized` object `cv`. 
Then we can create the word cloud and give it the frequencies to use. -Finally, we plot the cloud and remove the axes. +Word clouds are often criticized since they are (sometimes) pretty but mostly not very informative. The core reason for that is that only a single aspect of the words is visualized (frequency), and simple word frequency is often not that informative: the most frequent words are generally uninformative "stop words" like "the" and "I". -::: {.callout-note appearance="simple" icon=false} +For example, Example [-@exm-wordcloud] shows the word cloud for the state of the union speeches downloaded above. In R, this is done using the *quanteda* function `textplot_wordcloud`. In Python we need to work a little harder, since it only has the counts, not the actual words. So, we sum the DTM columns to get the frequency of each word, and combine that with the feature names (words) from the `CountVectorized` object `cv`. Then we can create the word cloud and give it the frequencies to use. Finally, we plot the cloud and remove the axes. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-wordcloud} Word cloud of the US State of the Union corpus -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python wordcloud-python} #| results: hide #| cache: true def wordcloud(dfm, vectorizer, **options): freq_dict = dict( - zip(vectorizer.get_feature_names(), dfm.sum(axis=0).tolist()[0]) + zip(vectorizer.get_feature_names_out(), dfm.sum(axis=0).tolist()[0]) ) wc = WordCloud(**options) return wc.generate_from_frequencies(freq_dict) @@ -492,8 +398,11 @@ def wordcloud(dfm, vectorizer, **options): wc = wordcloud(d, cv, background_color="white") plt.imshow(wc) plt.axis("off") +plt.show() ``` + ## R code + ```{r wordcloud-r} #| cache: true textplot_wordcloud(d, max_words=200) @@ -502,72 +411,46 @@ textplot_wordcloud(d, max_words=200) ::: ::: -The results from Python and R look different at first -- for one thing, R is nice and round but Python has more colors! -However, if you look at the cloud you can see both are not very meaningful: the largest words are all punctuation or words like -"a", "and", or "the". -You have to look closely to find words like "federal" or "security" that give a hint on what the texts were actually about. +The results from Python and R look different at first -- for one thing, R is nice and round but Python has more colors! However, if you look at the cloud you can see both are not very meaningful: the largest words are all punctuation or words like "a", "and", or "the". You have to look closely to find words like "federal" or "security" that give a hint on what the texts were actually about. ## Weighting and Selecting Documents and Terms {#sec-dtmselect} -So far, the DTMs you made in this chapter simply show the count of each word in each document. -Many words, however, are not informative for many questions. -This is especially apparent if you look at a *word cloud*, -essentially a plot of the most frequent words in a *corpus* (set of documents). +So far, the DTMs you made in this chapter simply show the count of each word in each document. Many words, however, are not informative for many questions. This is especially apparent if you look at a *word cloud*, essentially a plot of the most frequent words in a *corpus* (set of documents). 
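To make the mechanics behind such frequency plots concrete, here is a small self-contained Python sketch (this is ours, not one of the chapter's numbered examples, and the three toy texts are invented): it sums the columns of a sparse DTM to get corpus-wide term frequencies, the same numbers that a word cloud is drawn from.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the economy is growing and employment is up",
    "the employment numbers show the economy is strong",
    "security and defense remain a priority",
]
cv = CountVectorizer()
d = cv.fit_transform(texts)

# Summing the columns of the (sparse) DTM gives corpus-wide term
# frequencies -- exactly the numbers a word cloud is based on
freqs = np.asarray(d.sum(axis=0)).ravel()
terms = cv.get_feature_names_out()
for term, freq in sorted(zip(terms, freqs), key=lambda x: -x[1])[:5]:
    print(term, freq)
```

On a real corpus you would apply the same column sums to the `d` and `cv` objects created above; the principle is identical.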
-::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Vectors and a geometric interpretation of document-term matrices -We said that a document is represented by a "vector" of numbers, where each number (for a document-term matrix) - is the frequency of a specific word in that document. This term is also seen in the name for the tokenizer scikit-learn: - a *vectorizer* or function to turn texts into vectors. +We said that a document is represented by a "vector" of numbers, where each number (for a document-term matrix) is the frequency of a specific word in that document. This term is also seen in the name for the tokenizer scikit-learn: a *vectorizer* or function to turn texts into vectors. -The term *vector* here can be read as just a fancy word for a group of numbers. - In this meaning, the term is also often used in R, where a column of a data frame is called a vector, - and where functions that can be called on a whole vector at once are called *vectorized*. +The term *vector* here can be read as just a fancy word for a group of numbers. In this meaning, the term is also often used in R, where a column of a data frame is called a vector, and where functions that can be called on a whole vector at once are called *vectorized*. -More generally, however, a vector in geometry is a point (or line from the origin) in an $n$-dimensional space, - where $n$ is the length or dimensionality of the vector. - This is also a very useful interpretation for vectors in text analysis: - the dimensionality of the space is the number of unique words (columns) in the document-term matrix, - and each document is a point in that $n$-dimensional space. +More generally, however, a vector in geometry is a point (or line from the origin) in an $n$-dimensional space, where $n$ is the length or dimensionality of the vector. This is also a very useful interpretation for vectors in text analysis: the dimensionality of the space is the number of unique words (columns) in the document-term matrix, and each document is a point in that $n$-dimensional space. -In that interpretation, various geometric distances between documents can be calculated as an indicator for how similar - two documents are. Techniques that reduce the number of columns in the matrix (such as clustering or topic modeling) - can then be seen as dimensionality reduction techniques since they turn the DTM into a matrix with lower dimensionality - (while hopefully retaining as much of the relevant information as possible). +In that interpretation, various geometric distances between documents can be calculated as an indicator for how similar two documents are. Techniques that reduce the number of columns in the matrix (such as clustering or topic modeling) can then be seen as dimensionality reduction techniques since they turn the DTM into a matrix with lower dimensionality (while hopefully retaining as much of the relevant information as possible). ::: -More formally, a document-term matrix can be seen as a representation of data points about documents: -each document (row) is represented as a vector containing the count per word (column). -Although it is a simplification compared to the original text, -an unfiltered document-term matrix contains a lot of relevant information. -For example, if a president uses the word "terrorism" more often than the word "economy", that could be an indication of their policy priorities. 
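To make the geometric interpretation from the box above a bit more tangible, the following Python sketch (again not one of the numbered examples; the toy sentences are invented) computes cosine similarities between document vectors, one common way to quantify how close two documents are in word space:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the president spoke about the economy and jobs",
    "jobs and the economy dominated the speech",
    "the new bridge was opened by the mayor",
]
cv = CountVectorizer()
dtm = cv.fit_transform(texts)

# Each row of the DTM is a point in word space; the cosine of the angle
# between two rows is a common measure of document similarity
print(cosine_similarity(dtm).round(2))
```

Documents that share many words end up with a similarity close to one, while documents with no words in common score zero.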
+More formally, a document-term matrix can be seen as a representation of data points about documents: each document (row) is represented as a vector containing the count per word (column). Although it is a simplification compared to the original text, an unfiltered document-term matrix contains a lot of relevant information. For example, if a president uses the word "terrorism" more often than the word "economy", that could be an indication of their policy priorities. -However, there is also a lot of *noise* crowding out this *signal*: -as seen in the word cloud in the previous section the most frequent words are generally quite uninformative. -The same holds for words that hardly occur in any document (but still require a column to be represented) -and noisy "words" such as punctuation or technical artifacts like HTML code. +However, there is also a lot of *noise* crowding out this *signal*: as seen in the word cloud in the previous section the most frequent words are generally quite uninformative. The same holds for words that hardly occur in any document (but still require a column to be represented) and noisy "words" such as punctuation or technical artifacts like HTML code. -This section will discuss a number of techniques for cleaning a corpus or document-term matrix in order to minimize the amount of noise: removing stop words, cleaning punctuation and other artifacts, and trimming and weighting. -As a running example in this section, we will use a collection of tweets from US president Donald Trump. -Example [-@exm-trumptweets] shows how to load these tweets into a data frame containing the ID and text of the tweets. -As you can see, this dataset contains a lot of non-textual features such as hyperlinks and hash tags as well as regular punctuation and stop words. -Before we can start analyzing this data, we need to decide on and perform multiple cleaning steps such as detailed below. - -::: {.callout-note appearance="simple" icon=false} +This section will discuss a number of techniques for cleaning a corpus or document-term matrix in order to minimize the amount of noise: removing stop words, cleaning punctuation and other artifacts, and trimming and weighting. As a running example in this section, we will use a collection of tweets from US president Donald Trump. Example [-@exm-trumptweets] shows how to load these tweets into a data frame containing the ID and text of the tweets. As you can see, this dataset contains a lot of non-textual features such as hyperlinks and hash tags as well as regular punctuation and stop words. Before we can start analyzing this data, we need to decide on and perform multiple cleaning steps such as detailed below. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-trumptweets} Top words used in Trump Tweets -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python trumptweets-python} url = "https://cssbook.net/d/trumptweets.csv" tweets = pd.read_csv(url, usecols=["status_id", "text"], index_col="status_id") tweets.head() ``` + ## R code + ```{r trumptweets-r} url = "https://cssbook.net/d/trumptweets.csv" tweets = read_csv(url, @@ -578,24 +461,19 @@ head(tweets) ::: ::: -Please note that although tweets are perhaps overused as a source of scientific information, -we use them here because they nicely exemplify issues around non-textual elements such as hyperlinks. -See Chapter [-@sec-chap-scraping] for information on how to use the Twitter and other APIs to collect your own data. 
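Before deciding on those cleaning steps, it can help to get a rough sense of how much non-textual material is actually in the data. The sketch below (not one of the numbered examples; the search strings are crude heuristics) counts the share of tweets containing a link, a hashtag, or an @-mention:

```python
import pandas as pd

url = "https://cssbook.net/d/trumptweets.csv"
tweets = pd.read_csv(url, usecols=["status_id", "text"], index_col="status_id")

# Rough heuristics: share of tweets containing a link, hashtag, or mention
for label, pattern in [("link", "http"), ("hashtag", "#"), ("mention", "@")]:
    share = tweets["text"].str.contains(pattern, regex=False).mean()
    print(f"{label}: {share:.0%}")
```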
+Please note that although tweets are perhaps overused as a source of scientific information, we use them here because they nicely exemplify issues around non-textual elements such as hyperlinks. See Chapter [-@sec-chap-scraping] for information on how to use the Twitter and other APIs to collect your own data. ### Removing stopwords {#sec-stopwords} -A first step in cleaning a DTM is often *stop word removal*. -Words such as "a" and "the" are often called stop words, i.e. words that do not tell us much about the content. -Both *quanteda* and scikit-learninclude built-in lists of stop words, making it very easy to remove the most common words. -Example [-@exm-stopwords] shows the result of specifying "English" stop words to be removed for both packages. - -::: {.callout-note appearance="simple" icon=false} +A first step in cleaning a DTM is often *stop word removal*. Words such as "a" and "the" are often called stop words, i.e. words that do not tell us much about the content. Both *quanteda* and scikit-learninclude built-in lists of stop words, making it very easy to remove the most common words. Example [-@exm-stopwords] shows the result of specifying "English" stop words to be removed for both packages. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-stopwords} Simple stop word removal -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python stopwords-python} #| results: hide #| cache: true @@ -606,9 +484,11 @@ d = cv.fit_transform(tweets.text) wc = wordcloud(d, cv, background_color="white") plt.imshow(wc) plt.axis("off") - +plt.show() ``` + ## R code + ```{r stopwords-r} #| cache: true d = corpus(tweets) %>% @@ -622,53 +502,31 @@ textplot_wordcloud(d, max_words=100) ::: ::: -Note, however, that it might seem easy to list words like "a" and "and", -but as it turns out there is no single well-defined list of stop words, -and (as always) the best choice depends on your data and your research question. +Note, however, that it might seem easy to list words like "a" and "and", but as it turns out there is no single well-defined list of stop words, and (as always) the best choice depends on your data and your research question. -Linguistically, stop words are generally function words or closed word classes such as determiner or pronoun, -with closed classes meaning that while you can coin new nouns, you can't simply invent new determiners or prepositions. -However, there are many different stop word lists around which make different choices and are compatible with -different kinds of preprocessing. -The Python word cloud in Example [-@exm-stopwords] shows a nice example of the importance of matching stopwords with the used -tokenization: a central "word" in the cloud is the contraction *'s*. -We are using the NLTK tokenizer, which splits *'s* from the word it was attached to, but the scikit-learnstop word list -does not include that term. -So, it is important to make sure that the words created by the tokenization match the way that words appear in the stop word list. +Linguistically, stop words are generally function words or closed word classes such as determiner or pronoun, with closed classes meaning that while you can coin new nouns, you can't simply invent new determiners or prepositions. However, there are many different stop word lists around which make different choices and are compatible with different kinds of preprocessing. 
The Python word cloud in Example [-@exm-stopwords] shows a nice example of the importance of matching stopwords with the used tokenization: a central "word" in the cloud is the contraction *'s*. We are using the NLTK tokenizer, which splits *'s* from the word it was attached to, but the scikit-learnstop word list does not include that term. So, it is important to make sure that the words created by the tokenization match the way that words appear in the stop word list. -As an example of the substantive choices inherent in using a stop word lists, -consider the word "will". -As an auxiliary verb, this is probably indeed a stop word: for most substantive questions, there is no difference -whether you will do something or simply do it. -However, "will" can also be a noun (a testament) and a name (e.g. Will Smith). -Simply dropping such words from the corpus can be problematic; see Section [-@sec-nlp] for ways of telling nouns and verbs apart -for more fine-grained filtering. +As an example of the substantive choices inherent in using a stop word lists, consider the word "will". As an auxiliary verb, this is probably indeed a stop word: for most substantive questions, there is no difference whether you will do something or simply do it. However, "will" can also be a noun (a testament) and a name (e.g. Will Smith). Simply dropping such words from the corpus can be problematic; see Section [-@sec-nlp] for ways of telling nouns and verbs apart for more fine-grained filtering. -Moreover, some research questions might actually be interested in certain stop words. -If you are interested in references to the future or specific modalities, -the word might actually be a key indicator. -Similarly, if you are studying self-expression on Internet forums, social identity theory, or populist rhetoric, -words like "I", "us" and "them" can actually be very informative. +Moreover, some research questions might actually be interested in certain stop words. If you are interested in references to the future or specific modalities, the word might actually be a key indicator. Similarly, if you are studying self-expression on Internet forums, social identity theory, or populist rhetoric, words like "I", "us" and "them" can actually be very informative. -For this reason, it is always a good idea to understand and inspect what stop word list you are using, -and use a different one or customize it as needed [see also @nothman18]. -Example [-@exm-stopwords2] shows how you can inspect and customize stop word lists. -For more details on which lists are available and what choices these lists make, -see the package documentation for the *stopwords* package in Python (part of NLTK) and R (part of quanteda) - -::: {.callout-note appearance="simple" icon=false} +For this reason, it is always a good idea to understand and inspect what stop word list you are using, and use a different one or customize it as needed [see also @nothman18]. Example [-@exm-stopwords2] shows how you can inspect and customize stop word lists. 
For more details on which lists are available and what choices these lists make, see the package documentation for the *stopwords* package in Python (part of NLTK) and R (part of quanteda) +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-stopwords2} Inspecting and Customizing stop word lists -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python stopwords2-python} mystopwords = ["go", "to"] + stopwords.words("english") print(f"{len(mystopwords)} stopwords:" f"{', '.join(mystopwords[:5])}...") ``` + ## R code + ```{r stopwords2-r} mystopwords = stopwords("english", source="snowball") @@ -682,34 +540,21 @@ mystopwords[1:5] ### Removing Punctuation and Noise {#sec-punctuation} -Next to stop words, text often contains punctuation and other things that can be considered "noise" for most research questions. -For example, it could contain emoticons or emoji, Twitter hashtags or at-mentions, or HTML tags or other annotations. - -In both Python and R, we can use regular expressions to remove (parts of) words. -As explained above in Section [-@sec-regular], regular expressions are a powerful way to specify (sequences of) characters which are to be kept or removed. -You can use this, for example, to remove things like punctuation, emoji, or HTML tags. -This can be done either before or after tokenizing (splitting the text into words): -in other words, we can clean the raw texts or the individual words (tokens). - -In general, if you only want to keep or remove certain words, it is often easiest to do so after tokenization -using a regular expression to select the words to keep or remove. -If you want to remove parts of words (e.g. to remove the leading "\#" in hashtags) it is easiest to do that before tokenization, -that is, as a preprocessing step before the tokenization. -Similarly, if you want to remove a term that would be split by the tokenization (such as hyperlinks), -if can be better to remove them before the tokenization occurs. - -Example [-@exm-noise] shows how we can use regular expressions to remove noise in Python and R. -For clarity, it shows the result of each processing step on a single tweet that exemplifies many of the problems described above. -To better understand the tokenization process, we print the tokens in that tweet separated by a vertical bar (`|`). -As a first cleaning step, we will use a regular expression to remove hyperlinks and HTML entities like `&` from the untokenized texts. -Since both hyperlinks and HTML entities are split over multiple tokens, it would be hard to remove them after tokenization. - -::: {.callout-note appearance="simple" icon=false} +Next to stop words, text often contains punctuation and other things that can be considered "noise" for most research questions. For example, it could contain emoticons or emoji, Twitter hashtags or at-mentions, or HTML tags or other annotations. + +In both Python and R, we can use regular expressions to remove (parts of) words. As explained above in Section [-@sec-regular], regular expressions are a powerful way to specify (sequences of) characters which are to be kept or removed. You can use this, for example, to remove things like punctuation, emoji, or HTML tags. This can be done either before or after tokenizing (splitting the text into words): in other words, we can clean the raw texts or the individual words (tokens). 
+ +In general, if you only want to keep or remove certain words, it is often easiest to do so after tokenization using a regular expression to select the words to keep or remove. If you want to remove parts of words (e.g. to remove the leading "\#" in hashtags) it is easiest to do that before tokenization, that is, as a preprocessing step before the tokenization. Similarly, if you want to remove a term that would be split by the tokenization (such as hyperlinks), if can be better to remove them before the tokenization occurs. + +Example [-@exm-noise] shows how we can use regular expressions to remove noise in Python and R. For clarity, it shows the result of each processing step on a single tweet that exemplifies many of the problems described above. To better understand the tokenization process, we print the tokens in that tweet separated by a vertical bar (`|`). As a first cleaning step, we will use a regular expression to remove hyperlinks and HTML entities like `&` from the untokenized texts. Since both hyperlinks and HTML entities are split over multiple tokens, it would be hard to remove them after tokenization. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-noise} Cleaning a single tweet at the text and token level -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python noise1-python} #| cache: true id = "x263687274812813312" @@ -732,7 +577,9 @@ tweet_tokens = [ print("After pruning tokens:") print(" | ".join(tweet_tokens)) ``` + ## R code + ```{r noise1-r} #| cache: true id="x263687274812813312" @@ -757,31 +604,19 @@ print(paste(tweet_tokens, collapse=" | ")) ::: ::: -Regular expressions are explained fully in Section [-@sec-regular], so we will keep the explanation short: -the bar `|` splits the pattern in two parts, i.e. it will match if it finds either of the subpatterns. -The first pattern looks for the literal text `http`, followed by an optional `s` and the sequence `://`. -Then, it takes all non-whitespace characters it finds, i.e. the pattern ends at the next whitespace or end of the text. -The second pattern looks for an ampersand (`&`) followed by one or more letters (`\\w+`), followed by a semicolon (`;`). -This matches HTML escapes like `&` for an ampersand. - -In the next step, we process the tokenized text to remove every token that is either a stopword or does not start with a letter. -In Python, this is done by using a list comprehension (`[process(item) for item in list]`) for tokenizing each document; and a nested list comprehension for filtering each token in each document. -In R this is not needed as the `tokens_\*` functions are *vectorized*, that is, they directly run over all the tokens. +Regular expressions are explained fully in Section [-@sec-regular], so we will keep the explanation short: the bar `|` splits the pattern in two parts, i.e. it will match if it finds either of the subpatterns. The first pattern looks for the literal text `http`, followed by an optional `s` and the sequence `://`. Then, it takes all non-whitespace characters it finds, i.e. the pattern ends at the next whitespace or end of the text. The second pattern looks for an ampersand (`&`) followed by one or more letters (`\\w+`), followed by a semicolon (`;`). This matches HTML escapes like `&` for an ampersand. 
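To see this kind of pattern in isolation, here is a minimal standalone Python sketch; the example tweet text is invented, and the exact pattern used in Example [-@exm-noise] may differ slightly in its details:

```python
import re

text = "Great event tonight! https://t.co/abc123 Thanks &amp; see you soon #jobs"

# Two alternatives separated by |: a URL (http, optional s, ://, then all
# non-whitespace characters) or an HTML escape such as &amp;
pattern = r"https?://\S+|&\w+;"
print(re.sub(pattern, "", text))
```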
-Comparing R and Python, we see that the different tokenization functions mean that `#trump` is removed in R (since it is a token that does not start with a letter), -but in Python the tokenization splits the `#` from the name and the resulting token `trump` is kept. -If we would have used a different tokenizer for Python (e.g. the `WhitespaceTokenizer`) this would have been different again. -This underscores the importance of inspecting and understanding the results of the specific tokenizer used, -and to make sure that subsequent steps match these tokenization choices. -Concretely, with the `TreebankWordtokenizer` we would have had to also remove hashtags at the text level rather than the token level. +In the next step, we process the tokenized text to remove every token that is either a stopword or does not start with a letter. In Python, this is done by using a list comprehension (`[process(item) for item in list]`) for tokenizing each document; and a nested list comprehension for filtering each token in each document. In R this is not needed as the `tokens_\*` functions are *vectorized*, that is, they directly run over all the tokens. -::: {.callout-note appearance="simple" icon=false} +Comparing R and Python, we see that the different tokenization functions mean that `#trump` is removed in R (since it is a token that does not start with a letter), but in Python the tokenization splits the `#` from the name and the resulting token `trump` is kept. If we would have used a different tokenizer for Python (e.g. the `WhitespaceTokenizer`) this would have been different again. This underscores the importance of inspecting and understanding the results of the specific tokenizer used, and to make sure that subsequent steps match these tokenization choices. Concretely, with the `TreebankWordtokenizer` we would have had to also remove hashtags at the text level rather than the token level. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-tagcloud} Cleaning the whole corpus and making a tag cloud -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python tagcloud-python} #| results: hide #| cache: true @@ -800,9 +635,11 @@ dtm_emoji = cv.fit_transform(tokens) wc = wordcloud(dtm_emoji, cv, background_color="white") plt.imshow(wc) plt.axis("off") - +plt.show() ``` + ## R code + ```{r tagcloud-r} #| cache: true dfm_cleaned = tweets %>% @@ -819,45 +656,25 @@ textplot_wordcloud(dfm_cleaned, max_words=100, ::: ::: -As a final example, Example [-@exm-tagcloud] shows how to filter tokens for the whole corpus, but rather than removing hashtags it keeps only the hashtags to produce a tag cloud. -In R, this is mostly a pipeline of *quanteda* functions to create the corpus, tokenize, keep only hashtags, and create a DFM. -To spice up the output we use the *RColorBrewer* package to set random colors for the tags. -In Python, you can see that we now have a nested list comprehension, where the outer loop iterates over the texts and the inner loop iterates over the tokens in each text. -Next, we make a `do_nothing` function for the vectorizer since the results are already tokenized. -Note that we need to disable lowercasing as otherwise it will try to call `.lower()` on the token lists. +As a final example, Example [-@exm-tagcloud] shows how to filter tokens for the whole corpus, but rather than removing hashtags it keeps only the hashtags to produce a tag cloud. In R, this is mostly a pipeline of *quanteda* functions to create the corpus, tokenize, keep only hashtags, and create a DFM. 
To spice up the output we use the *RColorBrewer* package to set random colors for the tags. In Python, you can see that we now have a nested list comprehension, where the outer loop iterates over the texts and the inner loop iterates over the tokens in each text. Next, we make a `do_nothing` function for the vectorizer since the results are already tokenized. Note that we need to disable lowercasing as otherwise it will try to call `.lower()` on the token lists. -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Lambda functions in Python. -Sometimes, we need to define a function that is very simple and that we need only once. - An example for such a throwaway function is `do_nothing` in Example [-@exm-tagcloud]. - Instead of defining a reusable function with the `def` keyword and then to call it by its name when we need it later, - we can therefore also directly define an unnamed function when we need it with the `lambda` keyword. - The syntax is simple: `lambda argument: returnvalue`. - A function that maps a value onto itself can therefore be written as `lambda x: x`. - In Example [-@exm-tagcloud], instead of defining a named function, - we could therefore also simply write `v = CountVectorizer(tokenizer=lambda x: x, lowercase=False)`. - The advantages are that it saves you two lines of code here and you don't clutter your environment with functions you do not intend to re-use anyway. - The disadvantage is that it may be less clear what is happening, at least for people not familiar with lambda functions. +Sometimes, we need to define a function that is very simple and that we need only once. An example for such a throwaway function is `do_nothing` in Example [-@exm-tagcloud]. Instead of defining a reusable function with the `def` keyword and then to call it by its name when we need it later, we can therefore also directly define an unnamed function when we need it with the `lambda` keyword. The syntax is simple: `lambda argument: returnvalue`. A function that maps a value onto itself can therefore be written as `lambda x: x`. In Example [-@exm-tagcloud], instead of defining a named function, we could therefore also simply write `v = CountVectorizer(tokenizer=lambda x: x, lowercase=False)`. The advantages are that it saves you two lines of code here and you don't clutter your environment with functions you do not intend to re-use anyway. The disadvantage is that it may be less clear what is happening, at least for people not familiar with lambda functions. ::: ### Trimming a DTM {#sec-trimdtm} -The techniques above both drop terms from the DTM based on specific choices or patterns. -It can also be beneficial to trim a DTM by removing words that occur very infrequently or overly frequently. -For the former, the reason is that if a word only occurs in a very small percentage of documents it is unlikely to be very informative. -Overly frequent words, for example occurring in more than half or 75\% of all documents, function basically like stopwords for this corpus. -In many cases, this can be a result of the selection strategy. If we select all tweets containing "Trump", the word Trump itself is no longer informative about their content. -It can also be that some words are used as standard phrases, for example "fellow Americans" in state of the union speeches. -If every president in the corpus uses those terms, they are no longer informative about differences between presidents. 
- -::: {.callout-note appearance="simple" icon=false} +The techniques above both drop terms from the DTM based on specific choices or patterns. It can also be beneficial to trim a DTM by removing words that occur very infrequently or overly frequently. For the former, the reason is that if a word only occurs in a very small percentage of documents it is unlikely to be very informative. Overly frequent words, for example occurring in more than half or 75% of all documents, function basically like stopwords for this corpus. In many cases, this can be a result of the selection strategy. If we select all tweets containing "Trump", the word Trump itself is no longer informative about their content. It can also be that some words are used as standard phrases, for example "fellow Americans" in state of the union speeches. If every president in the corpus uses those terms, they are no longer informative about differences between presidents. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-trimming} Trimming a Document-Term Matrix -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python trimming-python} #| cache: true print(f"# of words before trimming: {d.shape[1]}") @@ -871,7 +688,9 @@ d_trim = cv_trim.fit_transform(tweets.text) print(f" after trimming: {d_trim.shape[1]}") ``` + ## R code + ```{r trimming-r} #| cache: true glue("# of words before trimming: {ncol(d)}") @@ -885,33 +704,27 @@ glue("# of word after trimming: {ncol(d_trim)}") ::: ::: -Example [-@exm-trimming] shows how you can use the *relative document frequency* to trim a DTM in Python and R. -We keep only words with a document frequency of between 0.5\% and 75\%. +Example [-@exm-trimming] shows how you can use the *relative document frequency* to trim a DTM in Python and R. We keep only words with a document frequency of between 0.5% and 75%. Although these are reasonable numbers every choice depends on the corpus and the research question, so it can be a good idea to check which words are dropped. -Note that dropping words that occur almost never should normally not influence the results that much, since those words do not occur anyway. -However, trimming a DTM to e.g. at least 1\% document frequency often radically reduces the number of words (columns) in the DTM. -Since many algorithms have to assign weights or parameters to each word, this can provide a significant improvement in computing speed or memory use. +Note that dropping words that occur almost never should normally not influence the results that much, since those words do not occur anyway. However, trimming a DTM to e.g. at least 1% document frequency often radically reduces the number of words (columns) in the DTM. Since many algorithms have to assign weights or parameters to each word, this can provide a significant improvement in computing speed or memory use. ### Weighting a DTM {#sec-dtmweight} -The DTMs created above all use the raw frequencies as cell values. -It can also be useful to weight the words so more informative words have a higher weight than less informative ones. -A common technique for this is *tf$\cdot$idf* weighting. -This stands for *term frequency $\cdot$ inverse document frequency* and weights each occurrence by its raw frequency (term frequency) corrected for how often it occurs in all documents (inverse document frequency). In a formula, the most common implementation of this weight is given as follows: +The DTMs created above all use the raw frequencies as cell values. 
It can also be useful to weight the words so more informative words have a higher weight than less informative ones. A common technique for this is *tf*$\cdot$idf weighting. This stands for *term frequency* $\cdot$ inverse document frequency and weights each occurrence by its raw frequency (term frequency) corrected for how often it occurs in all documents (inverse document frequency). In a formula, the most common implementation of this weight is given as follows: $tf\cdot idf(t,d)=tf(t,d)\cdot idf(t)=f_{t,d}\cdot -\log \frac{n_t}{N}$ Where $f_{t,d}$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents in which term $t$ occurs. In other words, the term frequency is weighted by the negative log of the fraction of documents in which that term occurs. Since $\log(1)$ is zero, terms that occur in every document are disregarded, and in general the less frequent a term is, the higher the weight will be. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-tfidf} Tf$\cdot$Idf weighting -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python tfidf-python} #| cache: true tfidf_vectorizer = TfidfVectorizer( @@ -925,7 +738,9 @@ indices = [ d_w[[[0], [25], [50], [75]], indices].todense() ``` + ## R code + ```{r tfidf-r} #| cache: true d_tf = corpus(sotu) %>% @@ -940,77 +755,45 @@ as.matrix( ::: ::: -tf$\cdot$idf weighting is a fairly common technique and can improve the results of subsequent analyses such as supervised machine learning. -As such, it is no surprise that it is easy to apply this in both Python and R, as shown in Example [-@exm-tfidf]. -This example uses the same data as Example [-@exm-sotu] above, so you can compare the resulting weighted values with the results reported there. -As you can see, the tf$\cdot$idf weighting in both languages have roughly the same effect: -very frequent terms such as *the* are made less important compared to less frequent words such as *submit*. -For example, in the raw frequencies for the 1965 Johnson speech, *the* occurred 355 times compared to *submit* only once. -In the weighted matrix, the weight for *submit* is four times as low as the weight for *the*. +tf$\cdot$idf weighting is a fairly common technique and can improve the results of subsequent analyses such as supervised machine learning. As such, it is no surprise that it is easy to apply this in both Python and R, as shown in Example [-@exm-tfidf]. This example uses the same data as Example [-@exm-sotu] above, so you can compare the resulting weighted values with the results reported there. As you can see, the tf$\cdot$idf weighting in both languages have roughly the same effect: very frequent terms such as *the* are made less important compared to less frequent words such as *submit*. For example, in the raw frequencies for the 1965 Johnson speech, *the* occurred 355 times compared to *submit* only once. In the weighted matrix, the weight for *submit* is four times as low as the weight for *the*. -There are two more things to note if you compare the examples from R and Python. -First, to make the two cases somewhat comparable we have to use two options for R, namely to set the term frequency to proportional (`scheme_tf='prop'`), -and to add smoothing to the document frequencies (`smooth=1`). 
-Without those options, the counts for the first columns would all be zero (since they occur in all documents, and $\log \frac{85}{85}=0$), -and the other counts would be greater than one since they would only be weighted, not normalized. +There are two more things to note if you compare the examples from R and Python. First, to make the two cases somewhat comparable we have to use two options for R, namely to set the term frequency to proportional (`scheme_tf='prop'`), and to add smoothing to the document frequencies (`smooth=1`). Without those options, the counts for the first columns would all be zero (since they occur in all documents, and $\log \frac{85}{85}=0$), and the other counts would be greater than one since they would only be weighted, not normalized. -Even with those options the results are still different (in details if not in proportions), -mainly because R normalizes the frequencies before weighting, while Python normalizes after the weighting. -Moreover, Python by default uses L2 normalization, meaning that the length of the document vectors will be one, -while R uses L1 normalization, that is, the row sums are one (before weighting). -Both R and Python have various parameters to control these choices which are explained in their respective help pages. -However, although the differences in absolute values look large, the relative effect of making more frequent terms less important is the same, -and the specific weighting scheme and options will probably not matter that much for the final results. -However, it is always good to be aware of the specific options available and try out which work best for your specific research question. +Even with those options the results are still different (in details if not in proportions), mainly because R normalizes the frequencies before weighting, while Python normalizes after the weighting. Moreover, Python by default uses L2 normalization, meaning that the length of the document vectors will be one, while R uses L1 normalization, that is, the row sums are one (before weighting). Both R and Python have various parameters to control these choices which are explained in their respective help pages. However, although the differences in absolute values look large, the relative effect of making more frequent terms less important is the same, and the specific weighting scheme and options will probably not matter that much for the final results. However, it is always good to be aware of the specific options available and try out which work best for your specific research question. ## Advanced Representation of Text {#sec-ngram} -The examples above all created document-term matrices where each column actually represents a word. -There is more information in a text, however, than pure word counts. -The phrases: *the movie was not good, it was in fact quite bad* and *the movie was not bad, in fact it was quite good* -have exactly the same word frequencies, but are quite different in meaning. -Similarly, *the new kings of York* and *the kings of New York* refer to very different people. +The examples above all created document-term matrices where each column actually represents a word. There is more information in a text, however, than pure word counts. The phrases: *the movie was not good, it was in fact quite bad* and *the movie was not bad, in fact it was quite good* have exactly the same word frequencies, but are quite different in meaning. Similarly, *the new kings of York* and *the kings of New York* refer to very different people. 
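To make the first point concrete, here is a minimal sketch of our own (not one of this chapter's numbered examples) that builds a DTM for exactly these two phrases using *scikit-learn*'s `CountVectorizer`; the two rows come out identical, even though the sentences mean roughly the opposite of each other.

```python
# Sketch: a pure bag-of-words DTM cannot tell these two phrases apart.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the movie was not good, it was in fact quite bad",
    "the movie was not bad, in fact it was quite good",
]
cv = CountVectorizer()
dtm = cv.fit_transform(texts)
print(cv.get_feature_names_out())
print(dtm.toarray())  # both rows contain exactly the same counts
```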
-Of course, in the end which aspect of the meaning of a text is important depends on your research question:
-if you want to know the sentiment about the movie, it is important to take a word like "not" into account;
-but if you are interested in the topic or genre of the review, or the extremity of the language used, this might not be relevant.
+Of course, in the end which aspect of the meaning of a text is important depends on your research question: if you want to know the sentiment about the movie, it is important to take a word like "not" into account; but if you are interested in the topic or genre of the review, or the extremity of the language used, this might not be relevant.

-The core idea of this section is that in many cases this information can be captured in a DTM by having the columns represent different information than just words, for example word combinations or groups of related words.
-This is often called *feature engineering*, as we are using our domain expertise to find the right features (columns, independent variables) to capture the relevant meaning for our research question.
-If we are using other columns than words it is also technically more correct to use the name *document-feature matrix*, as *quanteda* does, but we will stick to the most common name here and simply continue using the name DTM.
+The core idea of this section is that in many cases this information can be captured in a DTM by having the columns represent different information than just words, for example word combinations or groups of related words. This is often called *feature engineering*, as we are using our domain expertise to find the right features (columns, independent variables) to capture the relevant meaning for our research question. If we are using other columns than words it is also technically more correct to use the name *document-feature matrix*, as *quanteda* does, but we will stick to the most common name here and simply continue using the name DTM.

### $n$-grams {#sec-ngrams}

-The first feature we will discuss are n-grams.
-The simplest case is a bigram (or 2-gram), where each feature is a pair of adjacent words.
-The example used above, *the movie was not bad*, will yield the following bigrams: *the-movie*, *movie-was*, *was-not*, and *not-bad*.
-Each of those bigrams is then treated as a feature, that is, a DTM would contain one column for each word pair.
-
-As you can see in this example, we can now see the difference between *not-bad* and *not-good*.
-The downside of using n-grams is that there are many more unique word pairs than unique words,
-so the resulting DTM will have many more columns.
-Moreover, there is a bigger *data scarcity problem*, as each of those pairs will be less frequent,
-making it more difficult to find sufficient examples of each to generalize over.
+The first feature we will discuss is the n-gram. The simplest case is a bigram (or 2-gram), where each feature is a pair of adjacent words. The example used above, *the movie was not bad*, will yield the following bigrams: *the-movie*, *movie-was*, *was-not*, and *not-bad*. Each of those bigrams is then treated as a feature, that is, a DTM would contain one column for each word pair.

-Although bigrams are the most frequent use case, trigrams (3-grams) and (rarely) higher-order n-grams can also be used.
-As you can imagine, this will create even bigger DTMs and worse data scarcity problems,
-so even more attention must be paid to feature selection and/or trimming.
+As you can see in this example, we can now see the difference between *not-bad* and *not-good*. The downside of using n-grams is that there are many more unique word pairs than unique words, so the resulting DTM will have many more columns. Moreover, there is a bigger *data scarcity problem*, as each of those pairs will be less frequent, making it more difficult to find sufficient examples of each to generalize over. -::: {.callout-note appearance="simple" icon=false} +Although bigrams are the most frequent use case, trigrams (3-grams) and (rarely) higher-order n-grams can also be used. As you can imagine, this will create even bigger DTMs and worse data scarcity problems, so even more attention must be paid to feature selection and/or trimming. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-ngram} Generating n-grams -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python ngram-python} #| cache: true cv = CountVectorizer(ngram_range=(1, 3), tokenizer=mytokenizer.tokenize) cv.fit_transform(["This is a test"]) -cv.get_feature_names() +cv.get_feature_names_out() ``` + ## R code + ```{r ngram-r} #| cache: true text = "This is a test" @@ -1022,20 +805,15 @@ tokens(text) %>% ::: ::: -Example [-@exm-ngram] shows how n-grams can be created and used in Python and R. -In Python, you can pass the `ngram_range=(n, m)` option to the vectorizer, -while R has a `tokens_ngrams(n:m)` function. -Both will post-process the tokens to create all n-grams in the range of n to m. -In this example, we are asking for unigrams (i.e., the words themselves), bigrams and trigrams of a simple example sentence. -Both languages produce the same output, with R separating the words with an underscore while Python uses a simple space. - -::: {.callout-note appearance="simple" icon=false} +Example [-@exm-ngram] shows how n-grams can be created and used in Python and R. In Python, you can pass the `ngram_range=(n, m)` option to the vectorizer, while R has a `tokens_ngrams(n:m)` function. Both will post-process the tokens to create all n-grams in the range of n to m. In this example, we are asking for unigrams (i.e., the words themselves), bigrams and trigrams of a simple example sentence. Both languages produce the same output, with R separating the words with an underscore while Python uses a simple space. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-ngram2} Words and bigrams containing "government" -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python ngram2-python} #| cache: true cv = CountVectorizer( @@ -1046,7 +824,9 @@ ts = termstats(dfm, cv) ts.filter(like="government", axis=0).head(10) ``` + ## R code + ```{r ngram2-r} #| cache: true sotu_tokens = corpus(sotu) %>% @@ -1064,45 +844,23 @@ textstat_frequency(dfm_bigram) %>% ::: ::: -Example [-@exm-ngram2] shows how you can generate n-grams for a whole corpus. -In this case, we create a DTM of the state of the union matrix with all bigrams included. -A glance at the frequency table for all words containing *government* shows that, -besides the word itself and its plural and possessive forms, the bigrams include compound words (federal and local government), -phrases with the government as subject (the government can and must), and nouns for which the government is an adjective -(government spending and government programs). +Example [-@exm-ngram2] shows how you can generate n-grams for a whole corpus. In this case, we create a DTM of the state of the union matrix with all bigrams included. 
A glance at the frequency table for all words containing *government* shows that, besides the word itself and its plural and possessive forms, the bigrams include compound words (federal and local government), phrases with the government as subject (the government can and must), and nouns for which the government is an adjective (government spending and government programs). -You can imagine that including all these words as features will add many possibilities for analysis of the DTM -which would not be possible in a normal bag-of-words approach. -The terms local and federal government can be quite important to understand policy positions, -but for e.g. sentiment analysis a bigram like *not good* would also be insightful -(but make sure "not" is not on your stop word list!). +You can imagine that including all these words as features will add many possibilities for analysis of the DTM which would not be possible in a normal bag-of-words approach. The terms local and federal government can be quite important to understand policy positions, but for e.g. sentiment analysis a bigram like *not good* would also be insightful (but make sure "not" is not on your stop word list!). ### Collocations {#sec-collocations} -A special case of n-grams are collocations. -In the strict corpus linguistic sense of the word, collocations are pairs of words that occur more frequently than expected -based on their underlying occurrence. -For example, the phrase *crystal clear* presumably occurs much more often than would be expected by chance given -how often *crystal* and *clear* occur separately. -Collocations are important for text analysis since they often have a specific meaning, -for example because they refer to names such as *New York* or disambiguate a term like *sound* in *sound asleep*, -a *sound proposal*, or *loud sound*. - -Example [-@exm-colloc] shows how to identify the most "surprising" collocations using R and Python. -For Python, we use the *gensim* package which we will also use for topic modeling in Section [-@sec-unsupervised]. -This package has a `Phrases` class which can identify the bigrams in a list of tokens. -In R, we use the `textstat_collocations` function from *quanteda*. -These packages each use a different implementation: *gensim* uses pointwise mutual information, -i.e. how much information about finding the second word does seeing the first word give you? -Quanteda estimates an interaction parameter in a loglinear model. -Nonetheless, both methods give very similar results, with Saddam Hussein, the Iron Curtain, Al Qaida, and red tape topping the list for each. - -::: {.callout-note appearance="simple" icon=false} +A special case of n-grams are collocations. In the strict corpus linguistic sense of the word, collocations are pairs of words that occur more frequently than expected based on their underlying occurrence. For example, the phrase *crystal clear* presumably occurs much more often than would be expected by chance given how often *crystal* and *clear* occur separately. Collocations are important for text analysis since they often have a specific meaning, for example because they refer to names such as *New York* or disambiguate a term like *sound* in *sound asleep*, a *sound proposal*, or *loud sound*. + +Example [-@exm-colloc] shows how to identify the most "surprising" collocations using R and Python. For Python, we use the *gensim* package which we will also use for topic modeling in Section [-@sec-unsupervised]. 
This package has a `Phrases` class which can identify the bigrams in a list of tokens. In R, we use the `textstat_collocations` function from *quanteda*. These packages each use a different implementation: *gensim* uses pointwise mutual information, i.e. how much information about finding the second word does seeing the first word give you? Quanteda estimates an interaction parameter in a loglinear model. Nonetheless, both methods give very similar results, with Saddam Hussein, the Iron Curtain, Al Qaida, and red tape topping the list for each.
+
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-colloc}
Identifying and applying collocations in the US State of the Union.

-::: {.panel-tabset}
+::: panel-tabset
## Python code
+
```{python colloc-python}
#| results: hide
tokenized_texts = [mytokenizer.tokenize(t) for t in sotu.text]
@@ -1114,9 +872,11 @@ phrases_model = Phrases(tokens, min_count=10, scoring="npmi", threshold=0.5)
score_dict = phrases_model.export_phrases()
scores = pd.DataFrame(score_dict.items(), columns=["phrase", "score"])
```
+
```{python colloc-python2}
scores.sort_values("score", ascending=False).head()
```
+
```{python colloc-python3}
#| results: hide
phraser = Phraser(phrases_model)
@@ -1124,10 +884,13 @@ tokens_phrases = [phraser[doc] for doc in tokens]
cv = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
dtm = cv.fit_transform(tokens_phrases)
```
+
```{python colloc-python4}
termstats(dtm, cv).filter(like="hussein", axis=0)
```
+
## R code
+
```{r colloc-r}
#| cache: true
sotu_tokens = corpus(sotu) %>%
@@ -1154,66 +917,31 @@ textstat_frequency(dfm) %>%
:::
:::

-The next block demonstrates how to use these collocations in further processing.
-In R, we filter the collocations list on $lambda>8$ and use the `tokens_compound` function to compound bigrams from that list.
-As you can see in the term frequencies filtered on "Hussein", the regular terms (apart from the possessive) are removed and the compounded term now has 26 occurrences.
-For Python, we use the `PhraseTransformer` class, which is an adaptation of the `Phrases` class to the scikit-learnmethodology.
-After setting a standard threshold of 0.7, we can use `fit_transform` to change the tokens.
-The term statistics again show how the individual terms are now replaced by their compound.
+The next block demonstrates how to use these collocations in further processing. In R, we filter the collocations list on $\lambda>8$ and use the `tokens_compound` function to compound bigrams from that list. As you can see in the term frequencies filtered on "Hussein", the regular terms (apart from the possessive) are removed and the compounded term now has 26 occurrences. For Python, we use the `PhraseTransformer` class, which is an adaptation of the `Phrases` class to the scikit-learn methodology. After setting a standard threshold of 0.7, we can use `fit_transform` to change the tokens. The term statistics again show how the individual terms are now replaced by their compound.

### Word Embeddings {#sec-wordembeddings}

-A recent addition to the text analysis toolbox are *word embeddings*.
-Although it is beyond the scope of this book to give a full explanation of the algorithms behind word embeddings,
-they are relatively easy to understand and use at an intuitive level.
-
-The first core idea behind word embeddings is that the meaning of a word can be expressed using a relatively small *embedding vector*, generally consisting of around 300 numbers which can be interpreted as dimensions of meaning.
-The second core idea is that these embedding vectors can be derived by scanning the context of each word in millions and millions of documents. - -These embedding vectors can then be used as features or DTM columns for further analysis. -Using embedding vectors instead of word frequencies has the advantages of strongly reducing the dimensionality of the DTM: -instead of (tens of) thousands of columns for each unique word we only need hundreds of columns for the embedding vectors. -This means that further processing can be more efficient as fewer parameters need to be fit, -or conversely that more complicated models can be used without blowing up the parameter space. -Another advantage is that a model can also give a result for words it never saw before, as these words most likely will have an embedding vector and so can be fed into the model. -Finally, since words with similar meanings should have similar vectors, -a model fit on embedding vectors gets a "head start" since the vectors for words like "great" and "fantastic" will already be relatively close to each other, while all columns in a normal DTM are treated independently. - -The assumption that words with similar meanings have similar vectors can also be used directly to extract synonyms. -This can be very useful, for example for (semi-)automatically expanding a dictionary for a concept. -Example [-@exm-embedding] shows how to download and use pre-trained embedding vectors to extract synonyms. -First, we download a very small subset of the pre-trained Glove embedding vectors[^1], -wrapping the download call in a condition to only download it when needed. - -Then, for Python, we use the excellent support from the *gensim* package to load the embeddings into a `KeyedVectors` object. -Although not needed for the rest of the example, we create a *Pandas* data frame from the internal embedding values so the internal structure becomes clear: each row is a word, and the columns (in this case 50) are the different (semantic) dimensions that characterize that word according to the embeddings model. -This data frame is sorted on the first dimension, which shows that negative values on that dimension are related to various sports. -Next, we switch back to the `KeyedVectors` object to get the most similar words to the word *fraud*, which is apparently related to similar words like *bribery* and *corruption* but also to words like *charges* and *alleged*. -These similarities are a good way to (semi-)automatically expand a dictionary: start from a small list of words, -find all words that are similar to those words, and if needed manually curate that list. -Finally, we use the embeddings to solve the "analogies" that famously showcase the geometric nature of these vectors: -if you take the vector for *king*, subtract the vector for *man* and add that for *woman*, -the closest word to the resulting vector is *queen*. -Amusingly, it turns out that soccer is a female form of football, probably showing the American cultural origin of the source material. - -For R, there was less support from existing packages so we decided to use the opportunity to show both the conceptual simplicity of embeddings vectors and the power of matrix manipulation in R. -Thus, we directly read in the word vector file which has a head line and then on each line a word followed by its 50 values. -This is converted to a matrix with the row names showing the word, -which we normalize to (Euclidean) length of one for each vector for easier processing. 
-To determine similarity, we take the cosine distance between the vector representing a word with all other words in the matrix. -As you might remember from algebra, the cosine distance is the dot product between the vectors normalized to have length one -(just like Pearson's product--moment correlation is the dot product between the vectors normalized to z-scores per dimension). -Thus, we can simply multiply the normalized target vector with the normalized matrix to get the similarity scores. -These are then sorted, renamed, and the top values are taken using the basic functions from Chapter [-@sec-chap-datawrangling]. -Finally, analogies are solved by simply adding and subtracting the vectors as explained above, and then listing the closest words to the resulting vector -(excluding the words in the analogy itself). - -::: {.callout-note appearance="simple" icon=false} +A recent addition to the text analysis toolbox are *word embeddings*. Although it is beyond the scope of this book to give a full explanation of the algorithms behind word embeddings, they are relatively easy to understand and use at an intuitive level. + +The first core idea behind word embeddings is that the meaning of a word can be expressed using a relatively small *embedding vector*, generally consisting of around 300 numbers which can be interpreted as dimensions of meaning. The second core idea is that these embedding vectors can be derived by scanning the context of each word in millions and millions of documents. + +These embedding vectors can then be used as features or DTM columns for further analysis. Using embedding vectors instead of word frequencies has the advantages of strongly reducing the dimensionality of the DTM: instead of (tens of) thousands of columns for each unique word we only need hundreds of columns for the embedding vectors. This means that further processing can be more efficient as fewer parameters need to be fit, or conversely that more complicated models can be used without blowing up the parameter space. Another advantage is that a model can also give a result for words it never saw before, as these words most likely will have an embedding vector and so can be fed into the model. Finally, since words with similar meanings should have similar vectors, a model fit on embedding vectors gets a "head start" since the vectors for words like "great" and "fantastic" will already be relatively close to each other, while all columns in a normal DTM are treated independently. + +The assumption that words with similar meanings have similar vectors can also be used directly to extract synonyms. This can be very useful, for example for (semi-)automatically expanding a dictionary for a concept. Example [-@exm-embedding] shows how to download and use pre-trained embedding vectors to extract synonyms. First, we download a very small subset of the pre-trained Glove embedding vectors[^chapter10-1], wrapping the download call in a condition to only download it when needed. + +[^chapter10-1]: The full embedding models can be downloaded from https://nlp.stanford.edu/projects/glove/. To make the file easier to download, we took only the 10000 most frequent words of the smallest embeddings file (the 50 dimension version of the 6B tokens model). For serious applications you probably want to download the larger files, in our experience the 300 dimension version usually gives good results. 
Note that the files on that site are in a slightly different format which lacks the initial header line, so if you want to use other vectors for the examples here you can convert them with the `glove2word2vec` function in the *gensim* package. For R, you can also simply omit the `skip=1` argument as apart from the header line the formats are identical.
+
+Then, for Python, we use the excellent support from the *gensim* package to load the embeddings into a `KeyedVectors` object. Although not needed for the rest of the example, we create a *Pandas* data frame from the internal embedding values so the internal structure becomes clear: each row is a word, and the columns (in this case 50) are the different (semantic) dimensions that characterize that word according to the embeddings model. This data frame is sorted on the first dimension, which shows that negative values on that dimension are related to various sports. Next, we switch back to the `KeyedVectors` object to get the most similar words to the word *fraud*, which is apparently related to similar words like *bribery* and *corruption* but also to words like *charges* and *alleged*. These similarities are a good way to (semi-)automatically expand a dictionary: start from a small list of words, find all words that are similar to those words, and if needed manually curate that list. Finally, we use the embeddings to solve the "analogies" that famously showcase the geometric nature of these vectors: if you take the vector for *king*, subtract the vector for *man* and add that for *woman*, the closest word to the resulting vector is *queen*. Amusingly, it turns out that soccer is a female form of football, probably showing the American cultural origin of the source material.
+
+For R, there was less support from existing packages so we decided to use the opportunity to show both the conceptual simplicity of embedding vectors and the power of matrix manipulation in R. Thus, we directly read in the word vector file, which has a header line and then on each line a word followed by its 50 values. This is converted to a matrix with the row names showing the word, which we normalize to (Euclidean) length of one for each vector for easier processing. To determine similarity, we take the cosine similarity between the vector representing a word and the vectors of all other words in the matrix. As you might remember from algebra, the cosine similarity is the dot product between the vectors normalized to have length one (just like Pearson's product--moment correlation is the dot product between the vectors normalized to z-scores per dimension). Thus, we can simply multiply the normalized target vector with the normalized matrix to get the similarity scores. These are then sorted, renamed, and the top values are taken using the basic functions from Chapter [-@sec-chap-datawrangling]. Finally, analogies are solved by simply adding and subtracting the vectors as explained above, and then listing the closest words to the resulting vector (excluding the words in the analogy itself).
+
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-embedding}
Using word embeddings for finding similar and analogous words.
-::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python embeddings0-python} # Download the model if needed glove_fn = "glove.6B.50d.10k.w2v.txt" @@ -1221,6 +949,7 @@ url = f"https://cssbook.net/d/{glove_fn}" if not os.path.exists(glove_fn): urllib.request.urlretrieve(url, glove_fn) ``` + ```{python embeddings1-python} #| results: hide # Load the vectors @@ -1243,7 +972,9 @@ for x in words: y = analogy("man", x, "woman") print(f"Man is to {x} as woman is to {y}") ``` + ## R code + ```{r embeddings0-r} #| cache: true glove_fn = "glove.6B.50d.10k.w2v.txt" @@ -1287,66 +1018,35 @@ for (x in words) { ### Linguistic Preprocessing {#sec-nlp} -A final technique to be discussed here is the use of linguistic preprocessing steps to enrich and filter a DTM. -So far, all techniques discussed here are language independent. -However, there are also many language-specific tools for automatically enriching text developed by computational linguistics communities around the world. -Two techniques will be discussed here as they are relatively widely available for many languages and easy and quick to apply: *Part-of-speech tagging* and *lemmatizing*. - -In *part-of-speech tagging* or POS-tagging, each word is enriched with information on its function in the sentence: verb, noun, determiner etc. -For most languages, this can be determined with very high accuracy, although sometimes text can be ambiguous: -in one famous example, the word flies in *fruit flies* is generally a noun (fruit flies are a type of fly), but it can also be a verb (if fruit could fly). -Although there are different sets of POS tags used by different tools, there is broad agreement on the core set of tags listed in Table [-@tbl-postags]. - -|Part of speech | Example | UDPipe/Spacy Tag | Penn Treebank Tag| -|-|-|-|-| -|Noun | apple | NOUN | NN, NNS| -|Proper Name | Carlos | PROPN | NNP| -|Verb | write | VERB | VB, VBD, VBP, ..| -|Auxiliary verb | be, have | AUX | (same as verb)| -|Adjective | quick | ADJ | JJ, JJR, JJS| -|Adverb | quickly | ADV | RB| -|Pronoun | I, him | PRON | PRP| -|Adposition | of, in | ADP | IN| -|Determiner | the, a | DET | DT| +A final technique to be discussed here is the use of linguistic preprocessing steps to enrich and filter a DTM. So far, all techniques discussed here are language independent. However, there are also many language-specific tools for automatically enriching text developed by computational linguistics communities around the world. Two techniques will be discussed here as they are relatively widely available for many languages and easy and quick to apply: *Part-of-speech tagging* and *lemmatizing*. + +In *part-of-speech tagging* or POS-tagging, each word is enriched with information on its function in the sentence: verb, noun, determiner etc. For most languages, this can be determined with very high accuracy, although sometimes text can be ambiguous: in one famous example, the word flies in *fruit flies* is generally a noun (fruit flies are a type of fly), but it can also be a verb (if fruit could fly). Although there are different sets of POS tags used by different tools, there is broad agreement on the core set of tags listed in Table [-@tbl-postags]. + +| Part of speech | Example | UDPipe/Spacy Tag | Penn Treebank Tag | +|----------------|----------|------------------|-------------------| +| Noun | apple | NOUN | NN, NNS | +| Proper Name | Carlos | PROPN | NNP | +| Verb | write | VERB | VB, VBD, VBP, .. 
| +| Auxiliary verb | be, have | AUX | (same as verb) | +| Adjective | quick | ADJ | JJ, JJR, JJS | +| Adverb | quickly | ADV | RB | +| Pronoun | I, him | PRON | PRP | +| Adposition | of, in | ADP | IN | +| Determiner | the, a | DET | DT | + : Overview of part-of-speech (POS) tags. {#tbl-postags} -POS tags are useful since they allow us for example to analyze only the *nouns* if we care about the things that are discussed, only the *verbs* if we care about actions that are described, or only the *adjectives* if we care about the characteristics given to a noun. -Moreover, knowing the POS tag of a word can help disambiguate it. -For example, like as a verb (I like books) is generally positive, but like as a preposition (a day like no other) has no clear sentiment attached. - -*Lemmatizing* is a technique for reducing each word to its root or *lemma* (plural: lemmata). -For example, the lemma of the verb *reads* is (to) *read* and the lemma of the noun *books* is *book*. -Lemmatizing is useful since for most of our research questions we do not care about these different conjugations of the same word. -By lemmatizing the texts, we do not need to include all conjugations in a dictionary, -and it reduces the dimensionality of the DTM -- and thus also the data scarcity. - -Note that lemmatizing is related to a technique called *stemming*, which removes known suffixes (endings) from words. -For example, for English it will remove the "s" from both reads and books. -Stemming is much less sophisticated than lemmatizing, however, and will trip over irregular conjugations -(e.g. *are* as a form of to be) and regular word endings that look like conjugations (e.g. *virus* will be stemmed to *viru*). -English has relatively simple conjugations and stemming can produce adequate results. -For morphologically richer languages such as German or French, however, it is strongly advised to use lemmatizing instead of stemming. -Even for English we would generally advise lemmatization since it is so easy nowadays and will yield better results than stemming. - -For Example [-@exm-udpipe], we use the *UDPipe* natural language processing toolkit [@udpipe], -a "Pipeline" that parses text into "Universal Dependencies", a representation of the syntactic structure of the text. -For R, we can immediately call the `udpipe` function from the package of the same name. -This parses the given text and returns the result as a data frame with one token (word) per row, -and the various features in the columns. -For Python, we need to take some more steps ourselves. -First, we download the English models if they aren't present. -Second, we load the model and create a pipeline with all default settings, -and use that to parse the same sentence. -Finally, we use the *conllu* package to read the results into a form that can be turned into a data frame. - -In both cases, the resulting tokens clearly show some of the potential advantages of linguistic processing: -the lemma column shows that it correctly deals with irregular verbs and plural forms. -Looking at the upos (universal part-of-speech) column, John is recognized as a proper name (PROPN), bought as a verb, and knives as a noun. -Finally, the `head_token_id` and `dep_rel` columns represent the syntactic information in the sentence: -"Bought" (token 2) is the root of the sentence, and "John" is the subject (nsubj) while "knives" is the object of the buying. 
- -::: {.callout-note appearance="simple" icon=false} +POS tags are useful since they allow us for example to analyze only the *nouns* if we care about the things that are discussed, only the *verbs* if we care about actions that are described, or only the *adjectives* if we care about the characteristics given to a noun. Moreover, knowing the POS tag of a word can help disambiguate it. For example, like as a verb (I like books) is generally positive, but like as a preposition (a day like no other) has no clear sentiment attached. + +*Lemmatizing* is a technique for reducing each word to its root or *lemma* (plural: lemmata). For example, the lemma of the verb *reads* is (to) *read* and the lemma of the noun *books* is *book*. Lemmatizing is useful since for most of our research questions we do not care about these different conjugations of the same word. By lemmatizing the texts, we do not need to include all conjugations in a dictionary, and it reduces the dimensionality of the DTM -- and thus also the data scarcity. + +Note that lemmatizing is related to a technique called *stemming*, which removes known suffixes (endings) from words. For example, for English it will remove the "s" from both reads and books. Stemming is much less sophisticated than lemmatizing, however, and will trip over irregular conjugations (e.g. *are* as a form of to be) and regular word endings that look like conjugations (e.g. *virus* will be stemmed to *viru*). English has relatively simple conjugations and stemming can produce adequate results. For morphologically richer languages such as German or French, however, it is strongly advised to use lemmatizing instead of stemming. Even for English we would generally advise lemmatization since it is so easy nowadays and will yield better results than stemming. +For Example [-@exm-udpipe], we use the *UDPipe* natural language processing toolkit [@udpipe], a "Pipeline" that parses text into "Universal Dependencies", a representation of the syntactic structure of the text. For R, we can immediately call the `udpipe` function from the package of the same name. This parses the given text and returns the result as a data frame with one token (word) per row, and the various features in the columns. For Python, we need to take some more steps ourselves. First, we download the English models if they aren't present. Second, we load the model and create a pipeline with all default settings, and use that to parse the same sentence. Finally, we use the *conllu* package to read the results into a form that can be turned into a data frame. + +In both cases, the resulting tokens clearly show some of the potential advantages of linguistic processing: the lemma column shows that it correctly deals with irregular verbs and plural forms. Looking at the upos (universal part-of-speech) column, John is recognized as a proper name (PROPN), bought as a verb, and knives as a noun. Finally, the `head_token_id` and `dep_rel` columns represent the syntactic information in the sentence: "Bought" (token 2) is the root of the sentence, and "John" is the subject (nsubj) while "knives" is the object of the buying. 
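As a brief aside before the UDPipe example below, the stemming pitfalls mentioned above are easy to check for yourself. The following is only a sketch of ours using NLTK's `PorterStemmer` (it is not part of the UDPipe example):

```python
# Sketch: a rule-based stemmer handles regular plurals but not irregular
# forms, and truncates words that merely look like conjugations.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["reads", "books", "are", "virus"]:
    print(word, "->", stemmer.stem(word))
# 'reads' and 'books' are reduced to 'read' and 'book', but 'are' is not
# mapped to its lemma 'be', and 'virus' loses its final letter.
```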
+ +::: {.callout-note appearance="simple" icon="false"} ```{python udipe0} #| echo: false udpipe_model = "english-ewt-ud-2.4-190531.udpipe" @@ -1357,12 +1057,12 @@ if not os.path.exists(udpipe_model): urllib.request.urlretrieve(url, udpipe_model) ``` - ::: {#exm-udpipe} Using UDPipe to analyze a sentence -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python udpipe-python} #| cache: true udpipe_model = "english-ewt-ud-2.4-190531.udpipe" @@ -1373,7 +1073,9 @@ tokenlist = conllu.parse(pipeline.process(text)) pd.DataFrame(tokenlist[0]) ``` + ## R code + ```{r udpipe-r} #| cache: true udpipe("John bought new knives", "english") %>% @@ -1383,26 +1085,17 @@ udpipe("John bought new knives", "english") %>% ::: ::: -The syntactic relations can be useful if you need to differentiate between who is doing something and whom it was done to. -For example, one of the authors of this book used syntactic relations to analyze conflict coverage, -where there is an important difference between attacking and getting attacked [@clause]. -However, in most cases you probably don't need this information and analyzing dependency graphs is relatively complex. -We would advise you to almost always consider lemmatizing and tagging your texts, as lemmatizing is simply so much better than stemming -(especially for languages other than English), and the part-of-speech can be very useful for analyzing different aspects of a text. - -If you only need the lemmatizer and tagger, you can speed up processing by setting `udpipe(.., parser='none')` (R) or setting the third argument to Pipeline (the parser) to `Pipeline.NONE` (Python). -Example [-@exm-nouncloud] shows how this can be used to extract only the nouns from the most recent state of the union speeches, -create a DTM with these nouns, and then visualize them as a word cloud. -As you can see, these words (such as student, hero, childcare, healthcare, and terrorism), are much more indicative of the topic of a text than the general words used earlier. -In the next chapter we will show how you can further analyze these data, for example by analyzing usage patterns per person or over time, or using an unsupervised topic model to cluster words into topics. +The syntactic relations can be useful if you need to differentiate between who is doing something and whom it was done to. For example, one of the authors of this book used syntactic relations to analyze conflict coverage, where there is an important difference between attacking and getting attacked [@clause]. However, in most cases you probably don't need this information and analyzing dependency graphs is relatively complex. We would advise you to almost always consider lemmatizing and tagging your texts, as lemmatizing is simply so much better than stemming (especially for languages other than English), and the part-of-speech can be very useful for analyzing different aspects of a text. -::: {.callout-note appearance="simple" icon=false} +If you only need the lemmatizer and tagger, you can speed up processing by setting `udpipe(.., parser='none')` (R) or setting the third argument to Pipeline (the parser) to `Pipeline.NONE` (Python). Example [-@exm-nouncloud] shows how this can be used to extract only the nouns from the most recent state of the union speeches, create a DTM with these nouns, and then visualize them as a word cloud. As you can see, these words (such as student, hero, childcare, healthcare, and terrorism), are much more indicative of the topic of a text than the general words used earlier. 
In the next chapter we will show how you can further analyze these data, for example by analyzing usage patterns per person or over time, or using an unsupervised topic model to cluster words into topics. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-nouncloud} Nouns used in the most recent State of the Union addresses -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python nouncloud-python} #| results: hide #| cache: true @@ -1424,7 +1117,9 @@ plt.imshow(wc) plt.axis("off") ``` + ## R code + ```{r nouncloud-r} #| cache: true tokens = sotu %>% @@ -1445,32 +1140,18 @@ nouns %>% ::: ::: -As an alternative to UDPipe, you can also use Spacy, -which is another free and popular natural language toolkit. -It is written in Python, but the *spacyr* package offers an easy way to use it from R. -For R users, installation of *spacyr* on MacOS and Linux is easy, -but note that on Windows there are some additional steps, see -[cran.r-project.org/web/packages/spacyr/readme/README.html](https://cran.r-project.org/web/packages/spacyr/readme/README.html) for more details. +As an alternative to UDPipe, you can also use Spacy, which is another free and popular natural language toolkit. It is written in Python, but the *spacyr* package offers an easy way to use it from R. For R users, installation of *spacyr* on MacOS and Linux is easy, but note that on Windows there are some additional steps, see [cran.r-project.org/web/packages/spacyr/readme/README.html](https://cran.r-project.org/web/packages/spacyr/readme/README.html) for more details. -Example [-@exm-spacy] shows how you can use Spacy to analyze the proverb "all roads lead to Rome" in Spanish. -In the first block, the Spanish language model is downloaded (this is only needed once). -The second block loads the language model and parses the sentence. -You can see that the output is quite similar to UDPipe, but one additional feature is the inclusion of -*Named Entity Recognition*: -Spacy can automatically identify persons, locations, organizations and other entities. -In this example, it identifies "Rome" as a location. -This can be very useful to extract e.g. all persons from a newspaper corpus automatically. -Note that in R, you can use the *quanteda* function `as.tokens` to directly use the Spacy output in quanteda. +Example [-@exm-spacy] shows how you can use Spacy to analyze the proverb "all roads lead to Rome" in Spanish. In the first block, the Spanish language model is downloaded (this is only needed once). The second block loads the language model and parses the sentence. You can see that the output is quite similar to UDPipe, but one additional feature is the inclusion of *Named Entity Recognition*: Spacy can automatically identify persons, locations, organizations and other entities. In this example, it identifies "Rome" as a location. This can be very useful to extract e.g. all persons from a newspaper corpus automatically. Note that in R, you can use the *quanteda* function `as.tokens` to directly use the Spacy output in quanteda. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-spacy} Using Spacy to analyze a Spanish sentence. 
-::: {.panel-tabset} +::: panel-tabset ## Python code -First, download the model using the command below and restart python: -(in jupyter, try `!python` or `!python3` instead of plain `python`) +First, download the model using the command below and restart python: (in jupyter, try `!python` or `!python3` instead of plain `python`) ```{bash} #| eval: false @@ -1478,6 +1159,7 @@ python -m spacy download es_core_news_sm ``` ## R code + ```{r spacymodel-r} #| eval: false # Only needed once @@ -1487,8 +1169,9 @@ spacy_download_langmodel("es_core_news_sm") ``` ::: -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python spacy-python} #| cache: true nlp = spacy.load("es_core_news_sm") @@ -1508,7 +1191,9 @@ pd.DataFrame( ) ``` + ## R code + ```{r spacy-r} #| eval: false # I couldn't get this to work properly in the renv environment @@ -1521,47 +1206,18 @@ spacy_finalize() ::: ::: -As you can see, nowadays there are a number of good and relatively easy to use linguistic toolkits that can be used. -Especially *Stanza* [@stanza] is also a very good and flexible toolkit with support for multiple (human) languages and good integration especially with Python. -If you want to learn more about natural language processing, the book *Speech and Language Processing* by Jurafsky and Martin is a very good starting point (@jurafsky)[^2]. +As you can see, nowadays there are a number of good and relatively easy to use linguistic toolkits that can be used. Especially *Stanza* [@stanza] is also a very good and flexible toolkit with support for multiple (human) languages and good integration especially with Python. If you want to learn more about natural language processing, the book *Speech and Language Processing* by Jurafsky and Martin is a very good starting point (@jurafsky)[^chapter10-2]. + +[^chapter10-2]: See [web.stanford.edu/\~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/) for their draft of a new edition, which is (at the time of writing) free to download. ## Which Preprocessing to Use? -This chapter has shown how to create a DTM and especially introduced a number of different steps that can be used to clean and preprocess the DTM before analysis. -All of these steps are used by text analysis practitioners and in the relevant literature. -However, no study ever uses all of these steps on top of each other. -This of courses raises the question of how to know which preprocessing steps to use for your research question. - -First, there are a number of things that you should (almost) always do. -If your data contains noise such as boilerplate language, HTML artifacts, etc., you should generally strip these out before proceeding. -Second, text almost always has an abundance of uninformative (stop) words and a very long tail of very rare words. -Thus, it is almost always a good idea to use a combination of stop word removal, trimming based on document frequency, and/or `tf.idf` weighting. -Note that when using a stop word list, you should always manually inspect and/or fine-tune the word list to make sure it matches your domain and research question. - -The other steps such as n-grams, collocations, and tagging and lemmatization are more optional but can be quite important depending on the specific research. -For this (and for choosing a specific combination of trimming and weighting), it is always good to know your domain well, look at the results, and think whether you think they make sense. 
-Using the example given above, bigrams can make more sense for sentiment analysis (since *not good* is quite different from *good*), -but for analyzing the topic of texts it may be less important. - -Ultimately, however, many of these questions have no good theoretical answer, and the only way to find a good preprocessing "pipeline" for your research question is to try many different -options and see which works best. -This might feel like "cheating" from a social science perspective, since it is generally frowned upon to just test many different statistical models and report on what works best. -There is a difference, however, between substantive statistical modeling where you actually want to understand the mechanisms, -and technical processing steps where you just want the best possible measurement of an underlying variable (presumably to be used in a subsequent substantive model). -@mousetrap uses the analogy of the mouse trap and the human condition: in engineering you want to make the best possible mouse trap, -while in social science we want to understand the human condition. -For the mouse trap, it is OK if it is a black box for which we have no understanding of how it works, as long as we are sure that it does work. -For the social science model, this is not the case as it is exactly the inner workings we are interested in. - -Technical (pre)processing steps such as those reviewed in this chapter are primarily engineering devices: -we don't really care how something like `tfc.idf` works, as long as it produces the best possible measurement of the variables we need for our analysis. -In other words, it is an engineering challenge, not a social science research question. -As a consequence, the key criterion by which to judge these steps is validity, not explainability. -Thus, it is fine to try out different options, as long as you validate the results properly. -If you have many different choices to evaluate against some metric such as performance on a subsequent prediction task, -using the split-half or cross-validation techniques discussed in chapter Chapter [-@sec-chap-introsml] are also relevant here to avoid biasing the evaluation. - -[^1]: The full embedding models can be downloaded from https://nlp.stanford.edu/projects/glove/. To make the file easier to download, we took only the 10000 most frequent words of the smallest embeddings file (the 50 dimension version of the 6B tokens model). For serious applications you probably want to download the larger files, in our experience the 300 dimension version usually gives good results. Note that the files on that site are in a slightly different format which lacks the initial header line, so if you want to use other vectors for the examples here you can convert them with the `glove2word2vec` function in the *gensim* package. For R, you can also simply omit the `skip=1` argument as apart from the header line the formats are identical. - -[^2]: See [web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/) for their draft of a new edition, which is (at the time of writing) free to download. +This chapter has shown how to create a DTM and especially introduced a number of different steps that can be used to clean and preprocess the DTM before analysis. All of these steps are used by text analysis practitioners and in the relevant literature. However, no study ever uses all of these steps on top of each other. 
This of course raises the question of how to know which preprocessing steps to use for your research question.
+
+First, there are a number of things that you should (almost) always do. If your data contains noise such as boilerplate language, HTML artifacts, etc., you should generally strip these out before proceeding. Second, text almost always has an abundance of uninformative (stop) words and a very long tail of very rare words. Thus, it is almost always a good idea to use a combination of stop word removal, trimming based on document frequency, and/or `tf.idf` weighting. Note that when using a stop word list, you should always manually inspect and/or fine-tune the word list to make sure it matches your domain and research question.
+
+The other steps such as n-grams, collocations, and tagging and lemmatization are more optional but can be quite important depending on the specific research. For this (and for choosing a specific combination of trimming and weighting), it is always good to know your domain well, look at the results, and consider whether they make sense. Using the example given above, bigrams can make more sense for sentiment analysis (since *not good* is quite different from *good*), but for analyzing the topic of texts it may be less important.
+
+Ultimately, however, many of these questions have no good theoretical answer, and the only way to find a good preprocessing "pipeline" for your research question is to try many different options and see which works best. This might feel like "cheating" from a social science perspective, since it is generally frowned upon to just test many different statistical models and report on what works best. There is a difference, however, between substantive statistical modeling where you actually want to understand the mechanisms, and technical processing steps where you just want the best possible measurement of an underlying variable (presumably to be used in a subsequent substantive model). @mousetrap uses the analogy of the mouse trap and the human condition: in engineering you want to make the best possible mouse trap, while in social science we want to understand the human condition. For the mouse trap, it is OK if it is a black box for which we have no understanding of how it works, as long as we are sure that it does work. For the social science model, this is not the case as it is exactly the inner workings we are interested in.
+Technical (pre)processing steps such as those reviewed in this chapter are primarily engineering devices: we don't really care how something like `tf.idf` works, as long as it produces the best possible measurement of the variables we need for our analysis. In other words, it is an engineering challenge, not a social science research question. As a consequence, the key criterion by which to judge these steps is validity, not explainability. Thus, it is fine to try out different options, as long as you validate the results properly. If you have many different choices to evaluate against some metric such as performance on a subsequent prediction task, the split-half or cross-validation techniques discussed in Chapter [-@sec-chap-introsml] are also relevant here to avoid biasing the evaluation.
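To close with a sketch of how such a validation could look in practice (this is our own illustration, not an example from the book; it assumes you already have a list of raw `texts` and corresponding `labels` for some downstream prediction task), the preprocessing choices can be treated as hyperparameters of a *scikit-learn* pipeline and compared with cross-validation:

```python
# Sketch: compare preprocessing options by cross-validated performance on a
# downstream classification task. Assumes `texts` and `labels` exist.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],  # raw counts vs tf.idf
    "vect__ngram_range": [(1, 1), (1, 2)],           # unigrams vs uni+bigrams
    "vect__min_df": [1, 0.005],                      # no trimming vs trim rare words
}
search = GridSearchCV(pipe, grid, cv=5, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Whichever combination wins is then simply the better measurement instrument for this particular task; as argued above, what justifies the choice is the (cross-)validated performance, not a theoretical preference for one preprocessing step over another.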
diff --git a/content/chapter11.qmd b/content/chapter11.qmd index 1f14e23..e280c30 100644 --- a/content/chapter11.qmd +++ b/content/chapter11.qmd @@ -1,55 +1,49 @@ +# Automatic analysis of text {#sec-chap-text} -# Automatic analysis of text {#sec-chap-text} - -::: {.callout-warning} +::: callout-warning # Update planned: R Tidymodels and tidytext -At the time of writing this chapter for the published book, `quanteda` and `caret` were our packages of choice for machine learning and text analysis in R, -even though (as noted below) we felt that especially supervised text analysis was much better supported in Python than in R. -We now think that the `tidymodels` and `tidytext` packages are a better choice for text analysis, -especially for students who have just learned to work with the `tidyverse` data wrangling package. -For this reason, we are planning to rewrite this chapter using those packages. -See the relevant github issues for [tidytext](https://github.com/vanatteveldt/cssbook/issues/5) and [tidymodels](https://github.com/vanatteveldt/cssbook/issues/5) for more information. +At the time of writing this chapter for the published book, `quanteda` and `caret` were our packages of choice for machine learning and text analysis in R, even though (as noted below) we felt that especially supervised text analysis was much better supported in Python than in R. We now think that the `tidymodels` and `tidytext` packages are a better choice for text analysis, especially for students who have just learned to work with the `tidyverse` data wrangling package. For this reason, we are planning to rewrite this chapter using those packages. See the relevant github issues for [tidytext](https://github.com/vanatteveldt/cssbook/issues/5) and [tidymodels](https://github.com/vanatteveldt/cssbook/issues/5) for more information. ::: - +```{=html} - +``` {{< include common_setup.qmd >}} - -**Abstract.** - In this chapter, we discuss different approaches to the automatic analysis of text; or automated content analysis. We combine techniques from earlier chapters, such as transforming texts into a matrix of term frequencies and machine learning. In particular, we describe three different approaches (dictionary-based analyses, supervised machine learning, unsupervised machine learning). The chapter provides guidance on how to conduct such analyses, and also on how to decide which of the approaches is most suitable for which types of question. +**Abstract.** In this chapter, we discuss different approaches to the automatic analysis of text; or automated content analysis. We combine techniques from earlier chapters, such as transforming texts into a matrix of term frequencies and machine learning. In particular, we describe three different approaches (dictionary-based analyses, supervised machine learning, unsupervised machine learning). The chapter provides guidance on how to conduct such analyses, and also on how to decide which of the approaches is most suitable for which types of question. 
**Keywords.** dictionary approaches, supervised machine learning, unsupervised machine learning, topic models, automated content analysis, sentiment analysis **Objectives:** -- Understand different approaches to automatic analysis of text - - Be able to decide on whether to use a dictionary approach, supervised machine learning, or unsupervised machine learning - - Be able to use these techniques +- Understand different approaches to automatic analysis of text +- Be able to decide on whether to use a dictionary approach, supervised machine learning, or unsupervised machine learning +- Be able to use these techniques -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Packages used in this chapter -This chapter uses the basic text and data handling that were described in [@sec-chap-dtm] (*tidyverse*, *readtext*, and *quanteda* for R, *pandas* and *nltk* for Python). - For supervised text analysis, we use *quanteda.textmodels* in R, and *sklearn* and *keras* in Python. - For topic models we use *topicmodels* (R) and *gensim* (Python). - You can install these packages with the code below if needed (see [@sec-installing] for more details): +This chapter uses the basic text and data handling that were described in [@sec-chap-dtm] (*tidyverse*, *readtext*, and *quanteda* for R, *pandas* and *nltk* for Python). For supervised text analysis, we use *quanteda.textmodels* in R, and *sklearn* and *keras* in Python. For topic models we use *topicmodels* (R) and *gensim* (Python). You can install these packages with the code below if needed (see [@sec-installing] for more details): -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python chapter11install-python} #| eval: false !pip3 install nltk scikit-learn pandas !pip3 install gensim eli5 keras keras_preprocessing tensorflow ``` + ## R code + ```{r chapter11install-r} #| eval: false install.packages(c("tidyverse", "readtext", @@ -57,10 +51,12 @@ install.packages(c("tidyverse", "readtext", "topicmodels", "keras", "topicdoc", "MLmetrics")) ``` ::: - After installing, you need to import (activate) the packages every session: -::: {.panel-tabset} +After installing, you need to import (activate) the packages every session: + +::: panel-tabset ## Python code + ```{python chapter11library-python} # General packages and dictionary analysis import os @@ -70,7 +66,6 @@ import urllib.request import re import pickle import nltk -import eli5 import joblib import requests import pandas as pd @@ -86,7 +81,7 @@ from sklearn.pipeline import make_pipeline, Pipeline from sklearn.model_selection import GridSearchCV from sklearn import metrics import joblib -import eli5 +# import eli5 from nltk.sentiment import vader from nltk.sentiment.vader import SentimentIntensityAnalyzer @@ -105,7 +100,9 @@ from gensim.models.ldamodel import LdaModel from gensim.models.coherencemodel import CoherenceModel ``` + ## R code + ```{r chapter11library-r} # General packages and dictionary analysis library(glue) @@ -127,148 +124,41 @@ library(topicdoc) ::: ::: -In earlier chapters, you learned about both supervised and unsupervised machine learning as well about dealing with texts. -This chapter brings together these elements and discusses how to combine them to automatically analyze large corpora of texts. After presenting guidelines for choosing an appropriate approach in [@sec-deciding] and downloading an example dataset in [@sec-reviewdataset], we discuss multiple techniques in detail. 
-We begin with a very simple top-down approach in [@sec-dictionary], in which we count occurrences of words from an *a priori* defined list of words. In [@sec-supervised], we still use pre-defined categories that we want to code, but let the machine "learn" the rules of the coding itself. Finally, in [@sec-unsupervised], we employ a bottom-up approach in which we do not use any *a priori* defined lists or coding schemes, but inductively extract topics from our data. +In earlier chapters, you learned about both supervised and unsupervised machine learning as well as about dealing with texts. This chapter brings together these elements and discusses how to combine them to automatically analyze large corpora of texts. After presenting guidelines for choosing an appropriate approach in [@sec-deciding] and downloading an example dataset in [@sec-reviewdataset], we discuss multiple techniques in detail. We begin with a very simple top-down approach in [@sec-dictionary], in which we count occurrences of words from an *a priori* defined list of words. In [@sec-supervised], we still use pre-defined categories that we want to code, but let the machine "learn" the rules of the coding itself. Finally, in [@sec-unsupervised], we employ a bottom-up approach in which we do not use any *a priori* defined lists or coding schemes, but inductively extract topics from our data. ## Deciding on the Right Method {#sec-deciding} -When thinking about the computational analysis of texts, it is -important to realize that there is no method that is *the one* to do -so. While there are good choices and bad choices, we also cannot say -that one method is necessarily and always superior to another. Some -methods are more fashionable than others. For instance, there has been -a growing interest in topic models (see [@sec-unsupervised]) in the -past few years. There are indeed very good applications for such models, -they are also sometimes applied to research questions and/or data -where they make much less sense. As always, the choice of method -should follow the research question and not the other way round. We -therefore caution you about reading [@sec-chap-text] selectively because you -want, for instance, to learn about supervised machine learning or about -unsupervised topic models. Instead, you should be aware of very -different approaches to make an informed decision on what to use when. - -@Boumans2016 provide useful guidelines for this. They place -automatic text analysis approaches on a continuum from deductive (or -top-down) to inductive (or bottom-up). At the deductive end of the -spectrum, they place dictionary approaches -([@sec-dictionary]). Here, the researcher has strong *a priori* -(theoretical) assumptions (for instance, which topics exist in a news -data set; or which words are positive or negative) and can compile -lists of words or rules based on these assumptions. The computer then -only needs to execute these rules. At the inductive end of the -spectrum, in contrast, lie approaches such as topic models -([@sec-unsupervised]) where little or no *a priori* assumptions are -made, and where we exploratively look for patterns in the data. Here, -we typically do not know which topics exist in advance.
Supervised -approaches ([@sec-supervised]) can be placed in between: here, we do -define categories *a priori* (we do know which topics exist, and given -an article, we know to which topic it belongs), but we do not have -any set of rules: we do not know which words to look for or which -exact rules to follow. These rules are to be "learned" by the computer -from the data. - -Before we get into the details and implementations, let us discuss some -use cases of the three main approaches for the -computational analysis of text: dictionary (or rule-based) approaches, -supervised machine learning, and unsupervised machine learning. - -Dictionary approaches excel under three conditions. First, the -variable we want to code is *manifest and concrete* rather than -*latent and abstract*: names of actors, specific physical -objects, specific phrases, etc., rather than feelings, frames, or -topics. Second, all synonyms to be included must be known -beforehand. And third, the dictionary entries must not have multiple -meanings. -For instance, coding for how often gun control is mentioned in political -speeches fits these criteria. There are only so many ways to talk -about it, and it is rather unlikely that speeches about other topics -contain a phrase like "gun control". Similarly, if we want to find -references to Angela Merkel, Donald Trump, or any other well-known -politician, we can just directly search for their names -- even though -problems arise when people have very common surnames and are referred -to by their surnames only. - -Sadly, most interesting concepts are more complex to code. Take a -seemingly straightforward problem: distinguishing whether a news -article is about the economy or not. This is really easy to do for -humans: there may be some edge cases, but in general, people rarely -need longer than a few seconds to grasp whether an article is about the -economy rather than about sports, culture, etc. Yet, many of these -articles won't directly state that they are about the economy by -explicitly using the word "economy". - -We may think of extending our dictionary not only with `econom.+` (a -regular expression that includes economists, economic, and so on), but -also come up with other words like "stock exchange", "market", -"company." Unfortunately, we will quickly run into a problem that we also -faced when we discussed the precision-recall trade-off in -[@sec-validation]: the more terms we add to our -dictionary, the more false positives we will get: articles about -the geographical space called "market", about some celebrity being seen -in "company" of someone else, and so on. - -From this example, we can conclude that often (1) it is easy for -humans to decide to which class a text belongs, but (2) it is very -hard for humans to come up with a list of words (or rules) on which -their judgment is based. Such a situation is perfect for applying -supervised machine learning: after all, it won't take us much time to -annotate, say, 1000 articles based on whether they are about the -economy or not (probably this takes less time than thoroughly -fine tuning a list of words to include or exclude); and the difficult part, -deciding on the exact rules underlying the decision to classify an -article as economic is done by the computer in seconds. Supervised -machine learning, therefore, has replaced dictionary approaches in -many areas. 
- -Both dictionary (or rule-based) approaches and supervised machine learning assume that you -know in advance which categories (positive versus negative; sports -versus economy versus politics; …) exist. The big strength of unsupervised -approaches such as topic models is that you can also apply them -without this knowledge. They therefore allow you to find patterns -in data that you did not expect and can generate new insights. This -makes them particularly suitable for explorative research questions. -Using them for confirmatory tests, in contrast, is less defensible: -after all, if we are interested in knowing whether, say, news site A -published more about the economy than news site B, then it would be -a bit weird to pretend not to know that the topic "economy" exists. -Also practically, mapping the resulting topics that the topic model -produces onto such *a priori* existing categories can be challenging. - -Despite all differences, all approaches share one requirement: you -need to "Validate. Validate. Validate" [@Grimmer2013]. Though -it has been done in the past, simply applying a dictionary without -comparing the performance to manual coding of the same concepts -is not acceptable; neither is using a supervised machine learning -classifier without doing the same; or blindly trusting a topic model -without at least manually checking whether the scores the model assigns -to documents really capture what the documents are about. +When thinking about the computational analysis of texts, it is important to realize that there is no method that is *the one* to do so. While there are good choices and bad choices, we also cannot say that one method is necessarily and always superior to another. Some methods are more fashionable than others. For instance, there has been a growing interest in topic models (see [@sec-unsupervised]) in the past few years. There are indeed very good applications for such models, but they are also sometimes applied to research questions and/or data where they make much less sense. As always, the choice of method should follow the research question and not the other way round. We therefore caution you about reading [@sec-chap-text] selectively because you want, for instance, to learn about supervised machine learning or about unsupervised topic models. Instead, you should be aware of very different approaches to make an informed decision on what to use when. + +@Boumans2016 provide useful guidelines for this. They place automatic text analysis approaches on a continuum from deductive (or top-down) to inductive (or bottom-up). At the deductive end of the spectrum, they place dictionary approaches ([@sec-dictionary]). Here, the researcher has strong *a priori* (theoretical) assumptions (for instance, which topics exist in a news data set; or which words are positive or negative) and can compile lists of words or rules based on these assumptions. The computer then only needs to execute these rules. At the inductive end of the spectrum, in contrast, lie approaches such as topic models ([@sec-unsupervised]) where few or no *a priori* assumptions are made, and where we exploratively look for patterns in the data. Here, we typically do not know which topics exist in advance.
Supervised approaches ([@sec-supervised]) can be placed in between: here, we do define categories *a priori* (we do know which topics exist, and given an article, we know to which topic it belongs), but we do not have any set of rules: we do not know which words to look for or which exact rules to follow. These rules are to be "learned" by the computer from the data. + +Before we get into the details and implementations, let us discuss some use cases of the three main approaches for the computational analysis of text: dictionary (or rule-based) approaches, supervised machine learning, and unsupervised machine learning. + +Dictionary approaches excel under three conditions. First, the variable we want to code is *manifest and concrete* rather than *latent and abstract*: names of actors, specific physical objects, specific phrases, etc., rather than feelings, frames, or topics. Second, all synonyms to be included must be known beforehand. And third, the dictionary entries must not have multiple meanings. For instance, coding for how often gun control is mentioned in political speeches fits these criteria. There are only so many ways to talk about it, and it is rather unlikely that speeches about other topics contain a phrase like "gun control". Similarly, if we want to find references to Angela Merkel, Donald Trump, or any other well-known politician, we can just directly search for their names -- even though problems arise when people have very common surnames and are referred to by their surnames only. + +Sadly, most interesting concepts are more complex to code. Take a seemingly straightforward problem: distinguishing whether a news article is about the economy or not. This is really easy to do for humans: there may be some edge cases, but in general, people rarely need longer than a few seconds to grasp whether an article is about the economy rather than about sports, culture, etc. Yet, many of these articles won't directly state that they are about the economy by explicitly using the word "economy". + +We may think of extending our dictionary not only with `econom.+` (a regular expression that includes economists, economic, and so on), but also come up with other words like "stock exchange", "market", "company." Unfortunately, we will quickly run into a problem that we also faced when we discussed the precision-recall trade-off in [@sec-validation]: the more terms we add to our dictionary, the more false positives we will get: articles about the geographical space called "market", about some celebrity being seen in "company" of someone else, and so on. + +From this example, we can conclude that often (1) it is easy for humans to decide to which class a text belongs, but (2) it is very hard for humans to come up with a list of words (or rules) on which their judgment is based. Such a situation is perfect for applying supervised machine learning: after all, it won't take us much time to annotate, say, 1000 articles based on whether they are about the economy or not (probably this takes less time than thoroughly fine tuning a list of words to include or exclude); and the difficult part, deciding on the exact rules underlying the decision to classify an article as economic is done by the computer in seconds. Supervised machine learning, therefore, has replaced dictionary approaches in many areas. + +Both dictionary (or rule-based) approaches and supervised machine learning assume that you know in advance which categories (positive versus negative; sports versus economy versus politics; ...) 
exist. The big strength of unsupervised approaches such as topic models is that you can also apply them without this knowledge. They therefore allow you to find patterns in data that you did not expect and can generate new insights. This makes them particularly suitable for explorative research questions. Using them for confirmatory tests, in contrast, is less defensible: after all, if we are interested in knowing whether, say, news site A published more about the economy than news site B, then it would be a bit weird to pretend not to know that the topic "economy" exists. Also practically, mapping the resulting topics that the topic model produces onto such *a priori* existing categories can be challenging. + +Despite all differences, all approaches share one requirement: you need to "Validate. Validate. Validate" [@Grimmer2013]. Though it has been done in the past, simply applying a dictionary without comparing the performance to manual coding of the same concepts is not acceptable; neither is using a supervised machine learning classifier without doing the same; or blindly trusting a topic model without at least manually checking whether the scores the model assigns to documents really capture what the documents are about. ## Obtaining a Review Dataset {#sec-reviewdataset} -For the sections on dictionary and supervised approaches we will use a dataset of movie reviews -from the IMDB database [@aclimdb]. -This dataset is published as a compressed set of folders, with separate folders for the train and test datasets and subfolders for positive and negative reviews. -Lots of other review datasets are available online, for example for Amazon review data ([jmcauley.ucsd.edu/data/amazon/](https://jmcauley.ucsd.edu/data/amazon/)). - -The IMDB dataset we will use is a relatively large file and it requires bit of processing, -so it is smart to *cache* the data rather than downloading and processing it every time you need it. -This is done in [@exm-reviewdata], which also serves as a nice example of how to download and process files. -Both R and Python follow the same basic pattern. -First, we check whether the cached file exists, and if it does we read the data from that file. -For R, we use the standard *RDS* format, while for Python we use a compressed *pickle* file. -The format of the data is also slightly different, following the convention for each language: -In R we use the data frame returned by `readtext`, -which can read files from a folder or zip archive and return a data frame containing one text per row. -In Python, we have separate lists for the train and test datasets and for the full texts and labels: -`text_train` are the training texts and `y_train` are the corresponding labels. - -::: {.callout-note appearance="simple" icon=false} +For the sections on dictionary and supervised approaches we will use a dataset of movie reviews from the IMDB database [@aclimdb]. This dataset is published as a compressed set of folders, with separate folders for the train and test datasets and subfolders for positive and negative reviews. Lots of other review datasets are available online, for example the Amazon review data ([jmcauley.ucsd.edu/data/amazon/](https://jmcauley.ucsd.edu/data/amazon/)). + +The IMDB dataset we will use is a relatively large file and it requires a bit of processing, so it is smart to *cache* the data rather than downloading and processing it every time you need it. This is done in [@exm-reviewdata], which also serves as a nice example of how to download and process files.
Both R and Python follow the same basic pattern. First, we check whether the cached file exists, and if it does we read the data from that file. For R, we use the standard *RDS* format, while for Python we use a compressed *pickle* file. The format of the data is also slightly different, following the convention for each language: In R we use the data frame returned by `readtext`, which can read files from a folder or zip archive and return a data frame containing one text per row. In Python, we have separate lists for the train and test datasets and for the full texts and labels: `text_train` are the training texts and `y_train` are the corresponding labels. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-reviewdata} Downloading and caching IMDB review data. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python data-python} filename = "reviewdata.pickle.bz2" if os.path.exists(filename): @@ -301,7 +191,9 @@ else: with bz2.BZ2File(filename, "w") as zipfile: pickle.dump(data, zipfile) ``` + ## R code + ```{r data-r} #| cache: true filename = "reviewdata.rds" @@ -330,44 +222,21 @@ if (file.exists(filename)) { ::: ::: -If the cached data file does not exist yet, -the file is downloaded from the Internet. -In R, we then extract the file and call `readtext` on the resulting folder. -This automatically creates columns for the subfolders, so in this case for the dataset and label. -After this, we remove the download file and the extracted folder, -clean up the `reviewdata`, and save it to the `reviewdata.rds` file. -In Python, we can extract files from the downloaded file directly, -so we do not need to explicitly extract it. -We loop over all files in the archive, and use a regular expression to -select only text files and extract the label and dataset name -(see [@sec-regular] for more information about regular expressions). -Then, we extract the text from the archive, and add the text and the label to the appropriate list. -Finally, the data is saved as a compressed pickle file, -so the next time we run this cell it does not need to download the file again. +If the cached data file does not exist yet, the file is downloaded from the Internet. In R, we then extract the file and call `readtext` on the resulting folder. This automatically creates columns for the subfolders, so in this case for the dataset and label. After this, we remove the download file and the extracted folder, clean up the `reviewdata`, and save it to the `reviewdata.rds` file. In Python, we can extract files from the downloaded file directly, so we do not need to explicitly extract it. We loop over all files in the archive, and use a regular expression to select only text files and extract the label and dataset name (see [@sec-regular] for more information about regular expressions). Then, we extract the text from the archive, and add the text and the label to the appropriate list. Finally, the data is saved as a compressed pickle file, so the next time we run this cell it does not need to download the file again. ## Dictionary Approaches to Text Analysis {#sec-dictionary} -A straightforward way to automatically analyze text is to compile a -list of terms you are interested in and simply count how often they -occur in each document. For example, if you are interested in finding out -whether mentions of political parties in news articles change over -the years, you only need to compile a list of all party names and -write a small script to count them. 
+A straightforward way to automatically analyze text is to compile a list of terms you are interested in and simply count how often they occur in each document. For example, if you are interested in finding out whether mentions of political parties in news articles change over the years, you only need to compile a list of all party names and write a small script to count them. -Historically, this is how sentiment analysis was -done. Example [-@exm-sentsimple] shows how to do a simple sentiment analysis -based on a list of positive and negative words. The logic is -straightforward: you count how often each positive word occurs in a -text, you do the same for the negative words, and then determine which -occur more often. - -::: {.callout-note appearance="simple" icon=false} +Historically, this is how sentiment analysis was done. Example [-@exm-sentsimple] shows how to do a simple sentiment analysis based on a list of positive and negative words. The logic is straightforward: you count how often each positive word occurs in a text, you do the same for the negative words, and then determine which occur more often. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-sentsimple} Different approaches to a simple dictionary-based sentiment analysis: counting and summing all words using a for-loop over all reviews (Python) versus constructing a term-document matrix and looking up the words in there (R). Note that both approaches would be possible in either language. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python sentsimple-python} #| cache: true poswords = "https://cssbook.net/d/positive.txt" @@ -387,7 +256,9 @@ for review in text_train[:100]: scores.append(sum(sentimentdict.get(word, 0) for word in words)) print(scores) ``` + ## R code + ```{r sentsimple-r} #| cache: true poswords = "https://cssbook.net/d/positive.txt" @@ -409,62 +280,23 @@ head(scores) ::: ::: -As you may already realize, there are a lot of downsides to this -approach. Most notably, our bag-of-words approach does not allow us to -account for negation: "not good" will be counted as -positive. Relatedly, we cannot handle modifiers such as "very -good". Also, all words are either positive or negative, while -"great" should be more positive than "good". More advanced -dictionary-based sentiment analysis packages like Vader [@Hutto2014] -or SentiStrength [@Thelwall2012] include such functionalities. Yet, as we will -discuss in Section [-@sec-supervised], also these off-the-shelf -packages perform very poorly in many sentiment analysis tasks, -especially outside of the domains they were developed for. -Dictionary-based sentiment analysis has been shown to be problematic -when analyzing news content [e.g. @Gonzalez-Bailon2015;@ - Boukes2019]. They are problematic when accuracy at the sentence -level is important, but may be satisfactory with longer texts for -comparatively easy tasks such as movie review classification -[@Reagan2017], where there is clear ground truth data and the -genre convention implies that the whole text is evaluative and -evaluates one object (the film). - -Still, there are many use cases where dictionary approaches work very -well. Because your list of words can contain anything, not just -positive or negative words, dictionary approaches have been used, for -instance, to measure the use of racist words or swearwords in online -fora [e.g., @Tulkens2016]. 
Dictionary approaches are simple to understand and -straightforward, which can be a good argument for using them when it -is important that the method is no black-box but fully transparent -even without technical knowledge. Especially when the dictionary already -exists or is easy to create, it is also a very cheap method. -However, this is at the expense of their limitation to only performing well when measuring easy to operationalize concepts. To put it bluntly: it's great for measuring the visibility of parties or organizations in the news, but it's not -good for measuring concepts such as emotions or frames. - -What gave dictionary approaches a bit of a bad name is that many -researchers applied them without validating them. This is especially -problematic when a dictionary is applied in a slightly different -domain than that for which it was originally made. - -If you want to use a dictionary-based approach, we advise the -following procedure: - -- Construct a dictionary based on theoretical considerations and by closely reading a sample of example texts. - - Code some articles manually and compare with the automated coding. - - Improve your dictionary and check again. - - Manually code a validation dataset of sufficient size. The required size depends a bit on how balanced your data is -- if one code occurs very infrequently, you will need more data. - - Calculate the agreement. You could use standard intercoder reliability measures used in manual content analysis, but we would also advise you to calculate precision and recall (see Section [-@sec-validation]). - -Very extensive dictionaries will have a high recall (it becomes -increasingly unlikely that you "miss" a relevant document), but -often suffer from low precision (more documents will contain one of -the words even though they are irrelevant). Vice versa, a very short -dictionary will often be very precise, but miss a lot of documents. -It depends on your research question where the right balance lies, but -to substantially interpret your results, you need to be able to -quantify the performance of your dictionary-based approach. - -::: {.callout-note icon=false collapse=true} +As you may already realize, there are a lot of downsides to this approach. Most notably, our bag-of-words approach does not allow us to account for negation: "not good" will be counted as positive. Relatedly, we cannot handle modifiers such as "very good". Also, all words are either positive or negative, while "great" should be more positive than "good". More advanced dictionary-based sentiment analysis packages like Vader [@Hutto2014] or SentiStrength [@Thelwall2012] include such functionalities. Yet, as we will discuss in Section [-@sec-supervised], even these off-the-shelf packages perform very poorly in many sentiment analysis tasks, especially outside of the domains they were developed for. Dictionary-based sentiment analysis has been shown to be problematic when analyzing news content [e.g. @Gonzalez-Bailon2015; @Boukes2019]. Such approaches are problematic when accuracy at the sentence level is important, but may be satisfactory with longer texts for comparatively easy tasks such as movie review classification [@Reagan2017], where there is clear ground truth data and the genre convention implies that the whole text is evaluative and evaluates one object (the film). + +Still, there are many use cases where dictionary approaches work very well.
Because your list of words can contain anything, not just positive or negative words, dictionary approaches have been used, for instance, to measure the use of racist words or swearwords in online fora [e.g., @Tulkens2016]. Dictionary approaches are simple to understand and straightforward, which can be a good argument for using them when it is important that the method is not a black box but fully transparent even without technical knowledge. Especially when the dictionary already exists or is easy to create, it is also a very cheap method. However, this comes at the expense of only performing well for concepts that are easy to operationalize. To put it bluntly: it's great for measuring the visibility of parties or organizations in the news, but it's not good for measuring concepts such as emotions or frames. + +What gave dictionary approaches a bit of a bad name is that many researchers applied them without validating them. This is especially problematic when a dictionary is applied in a slightly different domain than that for which it was originally made. + +If you want to use a dictionary-based approach, we advise the following procedure: + +- Construct a dictionary based on theoretical considerations and by closely reading a sample of example texts. +- Code some articles manually and compare with the automated coding. +- Improve your dictionary and check again. +- Manually code a validation dataset of sufficient size. The required size depends a bit on how balanced your data is -- if one code occurs very infrequently, you will need more data. +- Calculate the agreement. You could use standard intercoder reliability measures used in manual content analysis, but we would also advise you to calculate precision and recall (see Section [-@sec-validation]). + +Very extensive dictionaries will have a high recall (it becomes increasingly unlikely that you "miss" a relevant document), but often suffer from low precision (more documents will contain one of the words even though they are irrelevant). Vice versa, a very short dictionary will often be very precise, but miss a lot of documents. It depends on your research question where the right balance lies, but to substantially interpret your results, you need to be able to quantify the performance of your dictionary-based approach. + +::: {.callout-note icon="false" collapse="true"} ## How many documents do you need to calculate agreement with human annotators? To determine the number of documents one needs to determine the agreement between a human and a machine, one can follow the same standards that are recommended for traditional manual content analysis. @@ -474,152 +306,47 @@ For instance, @Krippendorff2004 provides a convenience table to look up the requ ## Supervised Text Analysis: Automatic Classification and Sentiment Analysis {#sec-supervised} -For many applications, there are good reasons to use the dictionary -approach presented in the previous section. First, it is intuitively -understandable and results can -- in principle -- -even be verified by hand, which can be an advantage when transparency -or communicability is of high importance. Second, it is very easy to -use. But as we have discussed in [@sec-deciding], dictionary approaches -in general perform less well the more abstract, non-manifest, or -complex a concept becomes.
In the next section, we will make the case -that topics, but also sentiment, in fact, are quite a complex concepts -that are often hard to capture with dictionaries (or at least, crafting -a custom dictionary would be difficult). For instance, while "positive" -and "negative" seem straightforward categories at first sight, -the more we think about it, the more apparent it becomes how context-dependent -it actually is: in a dataset about the economy and stock -market returns, "increasing" may indicate something positive, -in a dataset about unemployment rates the same word would be something -negative. Thus, machine learning can be a more appropriate technique for such tasks. +For many applications, there are good reasons to use the dictionary approach presented in the previous section. First, it is intuitively understandable and results can -- in principle -- even be verified by hand, which can be an advantage when transparency or communicability is of high importance. Second, it is very easy to use. But as we have discussed in [@sec-deciding], dictionary approaches in general perform less well the more abstract, non-manifest, or complex a concept becomes. In the next section, we will make the case that topics, but also sentiment, are in fact quite complex concepts that are often hard to capture with dictionaries (or at least, crafting a custom dictionary would be difficult). For instance, while "positive" and "negative" seem straightforward categories at first sight, the more we think about it, the more apparent it becomes how context-dependent they actually are: in a dataset about the economy and stock market returns, "increasing" may indicate something positive, while in a dataset about unemployment rates the same word would be something negative. Thus, machine learning can be a more appropriate technique for such tasks. ### Putting Together a Workflow {#sec-workflow} -With the knowledge we gained in previous chapters, it is not difficult -to set up a supervised machine learning classifier to automatically -determine, for instance, the topic of a news article. - -Let us recap the building blocks that we need. In -[@sec-chap-introsml], you learned how to use different -classifiers, how to evaluate them, and how to choose the best -settings. However, in these examples, we used numerical data as -features; now, we have text. In [@sec-chap-dtm], -you learned how to turn text into numerical -features. And that's all we need to get started! - -Typical examples for supervised machine learning in the analysis of -communication include the classification of topics -[e.g., @Scharkow2011], frames [e.g., @Burscher2014], -user characteristics such as gender or ideology, -or sentiment. - -Let us consider the case of sentiment analysis in more -detail. Classical sentiment analysis is done with a dictionary -approach: you take a list of positive words, a list of negative words, -and count which occur more frequently. Additionally, one may attach a weight to -each word, such that "perfect" gets a higher weight than "good", -for instance. An obvious drawback is that these pure -bag-of-words approaches cannot cope with negation ("not good") and -intensifiers ("very good"), which is why extensions have been -developed that take these (and other features, such as punctuation) -into account [@Thelwall2012;@Hutto2014;@DeSmedt2012].
- -But while available off-the-shelf packages that implement these -extended dictionary-based methods are very easy to use (in fact, they -spit out a sentiment score with one single line of code), it is -questionable how well they work in practice. After all, "sentiment" -is not exactly a clear, manifest concept for which we can enumerate a -list of words. It has been shown that results obtained with multiple -of these packages correlate very poorly with each other and with human -annotations [@Boukes2019;@Chan2021]. - -Consequently, it has been suggested that it is better to use -supervised machine learning to automatically code the sentiment of -texts [@Gonzalez-Bailon2015;@vermeer2019seeing]. However, you may need to annotate documents from your own dataset: training a classifier -on, for instance, movie reviews and then using it to predict sentiment -in political texts violates the assumption that training set, test -set, and the unlabeled data that are to be classified are (at least in -principle and approximately) drawn from the same population. - -To illustrate the workflow, we will use the ACL IMDB dataset, a large -dataset that consists of a training dataset of 25000 movie -reviews (of which 12500 are positive and 12500 are negative) -and an equally sized test dataset [@aclimdb]. It can be -downloaded at -[ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) - -These data do not come in one file, but rather in a set of text files -that are sorted in different folders named after the dataset to which they -belong (`test` or `train`) and their label (`pos` and `neg`). This -means that we cannot simply use a pre-defined function to read them, -but we need to think of a way of reading the content into a -data structure that we can use. -This data was loaded in [@exm-reviewdata] above. - -::: {.callout-note icon=false collapse=true} +With the knowledge we gained in previous chapters, it is not difficult to set up a supervised machine learning classifier to automatically determine, for instance, the topic of a news article. + +Let us recap the building blocks that we need. In [@sec-chap-introsml], you learned how to use different classifiers, how to evaluate them, and how to choose the best settings. However, in these examples, we used numerical data as features; now, we have text. In [@sec-chap-dtm], you learned how to turn text into numerical features. And that's all we need to get started! + +Typical examples for supervised machine learning in the analysis of communication include the classification of topics [e.g., @Scharkow2011], frames [e.g., @Burscher2014], user characteristics such as gender or ideology, or sentiment. + +Let us consider the case of sentiment analysis in more detail. Classical sentiment analysis is done with a dictionary approach: you take a list of positive words, a list of negative words, and count which occur more frequently. Additionally, one may attach a weight to each word, such that "perfect" gets a higher weight than "good", for instance. An obvious drawback is that these pure bag-of-words approaches cannot cope with negation ("not good") and intensifiers ("very good"), which is why extensions have been developed that take these (and other features, such as punctuation) into account [@Thelwall2012; @Hutto2014; @DeSmedt2012]. 
+ +But while available off-the-shelf packages that implement these extended dictionary-based methods are very easy to use (in fact, they spit out a sentiment score with one single line of code), it is questionable how well they work in practice. After all, "sentiment" is not exactly a clear, manifest concept for which we can enumerate a list of words. It has been shown that results obtained with multiple of these packages correlate very poorly with each other and with human annotations [@Boukes2019; @Chan2021]. + +Consequently, it has been suggested that it is better to use supervised machine learning to automatically code the sentiment of texts [@Gonzalez-Bailon2015; @vermeer2019seeing]. However, you may need to annotate documents from your own dataset: training a classifier on, for instance, movie reviews and then using it to predict sentiment in political texts violates the assumption that training set, test set, and the unlabeled data that are to be classified are (at least in principle and approximately) drawn from the same population. + +To illustrate the workflow, we will use the ACL IMDB dataset, a large dataset that consists of a training dataset of 25000 movie reviews (of which 12500 are positive and 12500 are negative) and an equally sized test dataset [@aclimdb]. It can be downloaded at [ai.stanford.edu/\~amaas/data/sentiment/aclImdb_v1.tar.gz](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) + +These data do not come in one file, but rather in a set of text files that are sorted in different folders named after the dataset to which they belong (`test` or `train`) and their label (`pos` and `neg`). This means that we cannot simply use a pre-defined function to read them, but we need to think of a way of reading the content into a data structure that we can use. This data was loaded in [@exm-reviewdata] above. + +::: {.callout-note icon="false" collapse="true"} ## Sparse versus dense matrices in Python and R - In a - document-term matrix, you would typically find a lot of zeros: most - words do *not* appear in any given document. For instance, the - reviews in the IMDB dataset contain more than 100000 unique - words. Hence, the matrix has more than 100000 columns. Yet, most - reviews only consist of a couple of hundred words. As a - consequence, more than 99\% of the cells in the table contain a - zero. In a sparse matrix, we do not store all these zeros, but only - store the values for cells that actually contain a value. This - drastically reduces the memory needed. But even if you have a huge - amount of memory, this does not solve the issue: in R, the number of - cells in a matrix is limited to 2147483647. It is therefore - impossible to store a matrix with 100000 features and 25000 - documents as a dense matrix. Unfortunately, many models that you can - run via *caret* in R will convert your sparse document-term - matrix to a dense matrix, and hence are effectively only usable for - very small datasets. An alternative is using the *quanteda* package, - which does use sparse matrices throughout. However, at the time of - writing this book, quanteda only provides a very limited number of - models. As all of these problems do not arise in *scikit-learn*, - you may want to consider using Python for many text classification tasks. +In a document-term matrix, you would typically find a lot of zeros: most words do *not* appear in any given document. For instance, the reviews in the IMDB dataset contain more than 100000 unique words. 
Hence, the matrix has more than 100000 columns. Yet, most reviews only consist of a couple of hundred words. As a consequence, more than 99% of the cells in the table contain a zero. In a sparse matrix, we do not store all these zeros, but only store the values for cells that actually contain a value. This drastically reduces the memory needed. But even if you have a huge amount of memory, this does not solve the issue: in R, the number of cells in a matrix is limited to 2147483647. It is therefore impossible to store a matrix with 100000 features and 25000 documents as a dense matrix. Unfortunately, many models that you can run via *caret* in R will convert your sparse document-term matrix to a dense matrix, and hence are effectively only usable for very small datasets. An alternative is using the *quanteda* package, which does use sparse matrices throughout. However, at the time of writing this book, quanteda only provides a very limited number of models. As all of these problems do not arise in *scikit-learn*, you may want to consider using Python for many text classification tasks. ::: -Let us now train our first classifier. We choose a Naïve Bayes -classifier with a simple count vectorizer ([@exm-imdbbaseline]). In -the Python example, pay attention to the fitting of the vectorizer: we -fit on the training data *and* transform the training data with -it, but we only transform the test data *without re-fitting the - vectorizer*. Fitting, here, includes the decision about which words to -include (by definition, words that are not present in the training -data are not included; but we could also choose additional -constraints, such as excluding very rare or very common words), but -also assigning an (internally used) identifier (variable name) to each -word. If we fit the classifier again, these would not be -compatible any more. In R, the same is achieved in a slightly -different way: two term-document matrices are created independently, -before they are matched in such a way that only the features that are -present in the training matrix are retained in the test matrix. - -::: {.callout-note icon=false collapse=true} - -A word that is not present in the training data, but is present - in the test data, is thus ignored. If you want to use the - information such out-of-vocabulary words can entail (e.g., they may - be synonyms), consider using a word embedding approach (see [@sec-wordembeddings]) -::: +Let us now train our first classifier. We choose a Naïve Bayes classifier with a simple count vectorizer ([@exm-imdbbaseline]). In the Python example, pay attention to the fitting of the vectorizer: we fit on the training data *and* transform the training data with it, but we only transform the test data *without re-fitting the vectorizer*. Fitting, here, includes the decision about which words to include (by definition, words that are not present in the training data are not included; but we could also choose additional constraints, such as excluding very rare or very common words), but also assigning an (internally used) identifier (variable name) to each word. If we fit the classifier again, these would not be compatible any more. In R, the same is achieved in a slightly different way: two term-document matrices are created independently, before they are matched in such a way that only the features that are present in the training matrix are retained in the test matrix. 
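To make the fit-versus-transform distinction tangible, here is a small self-contained sketch with made-up documents (not part of the book's IMDB pipeline): the vocabulary is fixed when the vectorizer is fitted on the training texts, and a word that only occurs in the test texts simply gets no column.

```python
# Toy illustration of fitting a vectorizer on training data only.
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["a good movie", "a bad movie"]
test_texts = ["a terrible movie"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary
X_test = vectorizer.transform(test_texts)  # reuses it; "terrible" gets no column

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the training texts
print(X_test.toarray())  # only "movie" is counted for the test document
```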
-We do not necessarily expect this first model to be the best -classifier we can come up with, but it provides us with a reasonable -baseline. In fact, even without any further adjustments, it works -reasonably well: precision is higher for positive reviews and recall -is higher for negative reviews (classifying a positive review as -negative happens twice as much as the reverse), but none of the values -is concerningly low. +::: {.callout-note icon="false" collapse="true"} +A word that is not present in the training data, but is present in the test data, is thus ignored. If you want to use the information such out-of-vocabulary words can entail (e.g., they may be synonyms), consider using a word embedding approach (see [@sec-wordembeddings]) +::: -::: {.callout-note appearance="simple" icon=false} +We do not necessarily expect this first model to be the best classifier we can come up with, but it provides us with a reasonable baseline. In fact, even without any further adjustments, it works reasonably well: precision is higher for positive reviews and recall is higher for negative reviews (classifying a positive review as negative happens twice as much as the reverse), but none of the values is concerningly low. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-imdbbaseline} Training a Naïve Bayes classifier with simple word counts as features -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python imdbbaseline-python} #| cache: true vectorizer = CountVectorizer(stop_words="english") @@ -634,7 +361,9 @@ y_pred = nb.predict(X_test) rep = metrics.classification_report(y_test, y_pred) print(rep) ``` + ## R code + ```{r imdbbaseline-r} #| cache: true dfm_train = reviewdata %>% @@ -670,41 +399,13 @@ bind_rows(results, .id="label") ### Finding the Best Classifier {#sec-bestclassifier} -Let us start by comparing the two simple classifiers we know (Naïve -Bayes and Logistic Regression (see [@sec-nb2dnn]) and the two -vectorizers that transform our texts into two numerical -representations that we know: word counts and `tf.idf` scores -(see [@sec-chap-dtm]). - -We can also tune some things in the vectorizer, such as filtering out -stopwords, or specifying a minimum number (or proportion) of documents -in which a word needs to occur in order to be included, or the maximum -number (or proportion) of documents in which it is allowed to -occur. For instance, it could make sense to say that a word that -occurs in less than $n=5$ documents is probably a spelling mistake or -so unusual that it just unnecessarily bloats our feature matrix; and -on the other hand, a word that is so common that it occurs in more -than 50\% of all documents is so common that it does not help us to -distinguish between different classes. - -We can try all of these things out by hand by just re-running the code -from [@exm-imdbbaseline] and only changing the line in which the -vectorizer is specified and the line in which the classifier is -specified. -However, copy-pasting essentially the -same code is generally not a good idea, as it makes your code unnecessary -long and increases the likelihood of errors creeping in when you, for -instance, need to apply the same changes to multiple copies of the -code. A more elegant approach is outlined in -[@exm-basiccomparisons]: We define a function that gives us a short -summary of only the output we are interested in, and then use a -for-loop to iterate over all configurations we want to evaluate, fit -them and call the function we defined before. 
In fact, with 23 lines -of code, we manage to compare four different models, while we already -needed 15 lines (in [@exm-imdbbaseline]) to evaluate only one model. - -::: {.callout-note appearance="simple" icon=false} +Let us start by comparing the two simple classifiers we know (Naïve Bayes and Logistic Regression, see [@sec-nb2dnn]) and the two vectorizers that transform our texts into two numerical representations that we know: word counts and `tf.idf` scores (see [@sec-chap-dtm]). + +We can also tune some things in the vectorizer, such as filtering out stopwords, or specifying a minimum number (or proportion) of documents in which a word needs to occur in order to be included, or the maximum number (or proportion) of documents in which it is allowed to occur. For instance, it could make sense to say that a word that occurs in fewer than $n=5$ documents is probably a spelling mistake or so unusual that it just unnecessarily bloats our feature matrix; and on the other hand, a word that occurs in more than 50% of all documents is so common that it does not help us to distinguish between different classes. +We can try all of these things out by hand by just re-running the code from [@exm-imdbbaseline] and only changing the line in which the vectorizer is specified and the line in which the classifier is specified. However, copy-pasting essentially the same code is generally not a good idea, as it makes your code unnecessarily long and increases the likelihood of errors creeping in when you, for instance, need to apply the same changes to multiple copies of the code. A more elegant approach is outlined in [@exm-basiccomparisons]: We define a function that gives us a short summary of only the output we are interested in, and then use a for-loop to iterate over all configurations we want to evaluate, fit them and call the function we defined before. In fact, with 23 lines of code, we manage to compare four different models, while we already needed 15 lines (in [@exm-imdbbaseline]) to evaluate only one model. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-basiccomparisons} An example of a custom function to give a brief overview of the performance of four simple vectorizer-classifier combinations. @@ -716,6 +417,7 @@ def short_classification_report(y_test, y_pred): re = metrics.recall_score(y_test, y_pred, pos_label=label) print(f"{label}:\t{pr:0.2f}\t\t{re:0.2f}") ``` + ```{python basiccomparisons-python} #| cache: true configs = [ @@ -745,41 +447,16 @@ for name, vectorizer, classifier in configs: ::: ::: -The output of this little example already gives us quite a bit of -insight into how to tackle our specific classification tasks: first, we -see that a $tf\cdot idf$ classifier seems to be slightly but -consistently superior to a count classifier (this is often, but not -always the case). Second, we see that the logistic regression performs -better than the Naïve Bayes classifier (again, this is often, but not -always, the case). In particular, in our case, the logistic regression -improved on the excessive misclassification of positive reviews as -negative, and achieves a very balanced performance. - -There may be instances where one nevertheless may want to use a Count -Vectorizer with a Naïve Bayes classifier instead (especially if it -is too computationally expensive to estimate the other model), but for -now, we may settle on the best performing combination, logistic -regression with a `tf.idf` vectorizer.
You could also try fitting -a Support Vector Machine instead, but we have little reason to believe -that our data isn't linearly separable, which means that there is -little reason to believe that the SVM will perform better. Given the -good performance we already achieved, we decide to stick to the -logistic regression for now. - -We can now go as far as we like, include more models, use -crossvalidation and gridsearch (see -[@sec-crossvalidation]), etc. However, our workflow now -consists of *two* steps: fitting/transforming our input data -using a vectorizer, and fitting a classifier. To make things easier, -in scikit-learn, both steps can be combined into a so-called -pipe. [@exm-basicpipe] shows how the loop in -[@exm-basiccomparisons] can be re-written using pipes (the -result stays the same). - -::: {.callout-note appearance="simple" icon=false} +The output of this little example already gives us quite a bit of insight into how to tackle our specific classification tasks: first, we see that a $tf\cdot idf$ classifier seems to be slightly but consistently superior to a count classifier (this is often, but not always the case). Second, we see that the logistic regression performs better than the Naïve Bayes classifier (again, this is often, but not always, the case). In particular, in our case, the logistic regression improved on the excessive misclassification of positive reviews as negative, and achieves a very balanced performance. + +There may be instances where one nevertheless may want to use a Count Vectorizer with a Naïve Bayes classifier instead (especially if it is too computationally expensive to estimate the other model), but for now, we may settle on the best performing combination, logistic regression with a `tf.idf` vectorizer. You could also try fitting a Support Vector Machine instead, but we have little reason to believe that our data isn't linearly separable, which means that there is little reason to believe that the SVM will perform better. Given the good performance we already achieved, we decide to stick to the logistic regression for now. +We can now go as far as we like, include more models, use crossvalidation and gridsearch (see [@sec-crossvalidation]), etc. However, our workflow now consists of *two* steps: fitting/transforming our input data using a vectorizer, and fitting a classifier. To make things easier, in scikit-learn, both steps can be combined into a so-called pipe. [@exm-basicpipe] shows how the loop in [@exm-basiccomparisons] can be re-written using pipes (the result stays the same). + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-basicpipe} Instead of fitting vectorizer and classifier separately, they can be combined in a pipeline. + ```{python basicpipe-python} #| cache: true for name, vectorizer, classifier in configs: @@ -794,26 +471,11 @@ for name, vectorizer, classifier in configs: ::: ::: -Such a pipeline lends itself very well to performing a -gridsearch. [@exm-gridsearchlogreg] gives you an example. With -`LogisticRegression?` and `TfIdfVectorizer?`, we can get a list of all -possible hyperparameters that we may want to tune. For instance, these -could be the minimum and maximum frequency for words to be included or -whether we want to use only unigrams (single words) or also bigrams -(combinations of two words, see [@sec-ngram]). -For the Logistic Regression, it may be the -regularization hyperparameter C, which applies a penalty for too -complex models. 
We can put all values for these parameters -that we want to consider in a dictionary, with a descriptive key (i.e., a string with the step of the pipeline followed by two underscores and the name of the hyperparameter) and a list of all values we want to consider as the corresponding value. - -The gridsearch procedure will then estimate all combinations of all -values, using cross-validation (see [@sec-validation]). In -our example, we have $2 x 2 x 2 x 2 x 3 = 24$ -different models, and $24 models x 5 folds = 120$ models to -estimate. Hence, it may take you some time to run the code. - -::: {.callout-note appearance="simple" icon=false} +Such a pipeline lends itself very well to performing a gridsearch. [@exm-gridsearchlogreg] gives you an example. With `LogisticRegression?` and `TfIdfVectorizer?`, we can get a list of all possible hyperparameters that we may want to tune. For instance, these could be the minimum and maximum frequency for words to be included or whether we want to use only unigrams (single words) or also bigrams (combinations of two words, see [@sec-ngram]). For the Logistic Regression, it may be the regularization hyperparameter C, which applies a penalty for too complex models. We can put all values for these parameters that we want to consider in a dictionary, with a descriptive key (i.e., a string with the step of the pipeline followed by two underscores and the name of the hyperparameter) and a list of all values we want to consider as the corresponding value. + +The gridsearch procedure will then estimate all combinations of all values, using cross-validation (see [@sec-validation]). In our example, we have $2 x 2 x 2 x 2 x 3 = 24$ different models, and $24 models x 5 folds = 120$ models to estimate. Hence, it may take you some time to run the code. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-gridsearchlogreg} A gridsearch to find the best hyperparameters for a pipeline consisting of a vectorizer and a classifier. Note that we can tune any parameter that either the vectorizer or the classifier accepts as an input, not only the four hyperparameters we chose in this example. @@ -842,33 +504,11 @@ print(short_classification_report(y_test, pred)) ::: ::: -We see that we could further improve our model to precision and recall -values of 0.90, by excluding extremely infrequent and extremely -frequent words, including both unigrams and bigrams (which, we may -speculate, help us to account for the "not good" versus "not", -"good" problem), and changing the default penalty of $C=1$ to $C=100$. - -Let us, just for the sake of it, compare the performance of our model -with an off-the-shelf sentiment analysis package, in this case Vader -[@Hutto2014]. For any text, it will directly estimate sentiment -scores (more specifically, a positivity score, a negativity score, a -neutrality score, and a compound measure that combines them), without -any need to have training data. However, as Example [-@exm-vader] shows, such -a method is clearly inferior to a supervised machine learning -approach. While in almost all cases (except for $n=11$ cases), Vader was able to -make a choice (getting scores of 0 is a notorious problem in very -short texts), precision and recall are clearly worse than even the -simple baseline model we started with, and much worse than those of -the final model we finished with. In fact, we miss half (!) of the -negative reviews. There are probably very few applications in the -analysis of communication in which we would find this acceptable. 
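To see exactly where such a dictionary-based approach goes wrong, a confusion matrix is a useful complement to precision and recall. The sketch below assumes that `pred` holds the Vader-based labels computed in Example [-@exm-vader] and that `y_test` contains the manual labels used throughout this chapter.

```{python vaderconfusion-sketch}
#| eval: false
# Cross-tabulate actual versus predicted labels to see, for instance, how many
# negative reviews end up classified as positive. `pred` and `y_test` are
# assumed to come from the Vader example and the earlier train/test split.
from sklearn import metrics
import pandas as pd

print(metrics.confusion_matrix(y_test, pred))
print(pd.crosstab(pd.Series(list(y_test), name="actual"),
                  pd.Series(list(pred), name="predicted")))
```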
-It is important to highlight that this is not because the off-the-shelf -package we chose is a particularly bad one (on the contrary, it is -actually comparatively good), but because of the inherent limitations -of dictionary-based sentiment analysis. - -::: {.callout-note appearance="simple" icon=false} +We see that we could further improve our model to precision and recall values of 0.90, by excluding extremely infrequent and extremely frequent words, including both unigrams and bigrams (which, we may speculate, help us to account for the "not good" versus "not", "good" problem), and changing the default penalty of $C=1$ to $C=100$. +Let us, just for the sake of it, compare the performance of our model with an off-the-shelf sentiment analysis package, in this case Vader [@Hutto2014]. For any text, it will directly estimate sentiment scores (more specifically, a positivity score, a negativity score, a neutrality score, and a compound measure that combines them), without any need to have training data. However, as Example [-@exm-vader] shows, such a method is clearly inferior to a supervised machine learning approach. While in almost all cases (except for $n=11$ cases), Vader was able to make a choice (getting scores of 0 is a notorious problem in very short texts), precision and recall are clearly worse than even the simple baseline model we started with, and much worse than those of the final model we finished with. In fact, we miss half (!) of the negative reviews. There are probably very few applications in the analysis of communication in which we would find this acceptable. It is important to highlight that this is not because the off-the-shelf package we chose is a particularly bad one (on the contrary, it is actually comparatively good), but because of the inherent limitations of dictionary-based sentiment analysis. + +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-vader} For the sake of comparison, we calculate how an off-the-shelf sentiment analysis package would have performed in this task @@ -892,38 +532,19 @@ print(metrics.classification_report(y_test, pred)) ::: ::: -We need to keep in mind, though, that with this dataset, we chose one -of the easiest sentiment analysis tasks: a set of long, rather formal -texts (compared to informal short social media messages), that -evaluate exactly one entity (one film), and that are not ambiguous at -all. Many applications that communication scientists are interested -in are much less straightforward. Therefore, however tempting it may be -to use an off-the-shelf package, doing so requires a thorough test -based on at least some human-annotated data. +We need to keep in mind, though, that with this dataset, we chose one of the easiest sentiment analysis tasks: a set of long, rather formal texts (compared to informal short social media messages), that evaluate exactly one entity (one film), and that are not ambiguous at all. Many applications that communication scientists are interested in are much less straightforward. Therefore, however tempting it may be to use an off-the-shelf package, doing so requires a thorough test based on at least some human-annotated data. ### Using the Model {#sec-usingmodel} -So far, we have focused on training and evaluating models, almost -forgetting why we were doing this in the first place: to use them to -predict the label for new data that we did not annotate. 
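In its simplest form, that usage step is just a call to `.predict()` on texts the model has never seen. The minimal sketch below assumes that `pipe` is a vectorizer–classifier pipeline already fitted on `text_train` and `y_train` (as in the pipeline examples above); the two example reviews are made up for illustration.

```{python predictnew-sketch}
#| eval: false
# Predict labels for new, unannotated texts with an already fitted pipeline.
# `pipe` is assumed to be fitted as in the examples above; the texts are
# invented examples.
unseen_texts = [
    "What a wonderful movie, I enjoyed every minute of it",
    "A dull, predictable mess. I want those two hours back",
]
for text, label in zip(unseen_texts, pipe.predict(unseen_texts)):
    print(f"{label}: {text}")
```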
+So far, we have focused on training and evaluating models, almost forgetting why we were doing this in the first place: to use them to predict the label for new data that we did not annotate. -Of course, we could always re-train the model when we need to use it --- but that has two downsides: first, as you may have seen, it may -actually take considerable time to train it, and second, you need to -have the training data available, which may be a problem both in terms -of storage space and of copyright and/or privacy if you want to share -your classifier with others. +Of course, we could always re-train the model when we need to use it -- but that has two downsides: first, as you may have seen, it may actually take considerable time to train it, and second, you need to have the training data available, which may be a problem both in terms of storage space and of copyright and/or privacy if you want to share your classifier with others. -Therefore, it makes sense to save both our classifier and our -vectorizer to a file, so that we can reload them later -(Example [-@exm-reuse]). Keep in mind that you have to re-use *both* --- after all, the columns of your feature matrix will be different (and hence, completely useless for the classifier) when -fitting a new vectorizer. Therefore, as you see, you do not do any fitting any longer, and only use the `.transform()` method of the (already fitted) vectorizer and the `.predict()` method of the (already fitted) classifier. +Therefore, it makes sense to save both our classifier and our vectorizer to a file, so that we can reload them later (Example [-@exm-reuse]). Keep in mind that you have to re-use *both* -- after all, the columns of your feature matrix will be different (and hence, completely useless for the classifier) when fitting a new vectorizer. Therefore, as you see, you do not do any fitting any longer, and only use the `.transform()` method of the (already fitted) vectorizer and the `.predict()` method of the (already fitted) classifier. In R, you have no vectorizer you could save -- but because in contrast to Python, both your DTM and your classifier include the feature names, it suffices to save the classifier only (using `saveRDS(myclassifier, "myclassifier.rds")`) and using on a new DTM later on. You do need to remember, though, how you constructed the DTM (e.g., which preprocessing steps you took), to make sure that the features are comparable. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-reuse} Saving and loading a vectorizer and a classifier @@ -959,20 +580,9 @@ for review, label in zip(new_texts, pred): ::: ::: -Another thing that we might want to do is to get a better idea of the -features that the model uses to arrive at its prediction; in our -example, what actually characterizes the best and the worst -reviews. Example [-@exm-eli5] shows how this can be done in one line of code -using *eli5* -- a package that aims to "*e*xplain [the model] -*l*ike *I*'m *5* years old". Here, we re-use the -`pipe` we constructed earlier to provide both the vectorizer and the -classifier to *eli5* -- if we had only provided the -classifier, then the feature names would have been internal -identifiers (which are meaningless to us) rather than human-readable -words. 
- -::: {.callout-note appearance="simple" icon=false} +Another thing that we might want to do is to get a better idea of the features that the model uses to arrive at its prediction; in our example, what actually characterizes the best and the worst reviews. Example [-@exm-eli5] shows how this can be done in one line of code using *eli5* -- a package that aims to "*e*xplain \[the model\] *l*ike *I*'m *5* years old". Here, we re-use the `pipe` we constructed earlier to provide both the vectorizer and the classifier to *eli5* -- if we had only provided the classifier, then the feature names would have been internal identifiers (which are meaningless to us) rather than human-readable words. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-eli5} Using eli5 to get the most predictive features @@ -983,22 +593,18 @@ pipe = make_pipeline( LogisticRegression(solver="liblinear"), ) pipe.fit(text_train, y_train) -print(eli5.format_as_text(eli5.explain_weights(pipe))) +# print(eli5.format_as_text(eli5.explain_weights(pipe))) ``` ::: ::: -We can also use eli5 to explain how the classifier arrived at a -prediction for a specific document, by using different shades of green -and red to explain how much different features contributed to the -classification, and in which direction (Example [-@exm-eli5b]). - -::: {.callout-note appearance="simple" icon=false} +We can also use eli5 to explain how the classifier arrived at a prediction for a specific document, by using different shades of green and red to explain how much different features contributed to the classification, and in which direction (Example [-@exm-eli5b]). +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-eli5b} -Using eli5 to explain a prediction -## Python code +Using eli5 to explain a prediction \## Python code + ```{python eli5b-python} #| cache: true # WvA: This doesn't work outside a notebook, should probably call other functions @@ -1009,31 +615,25 @@ Using eli5 to explain a prediction ### Deep Learning {#sec-deeplearning} -Deep learning models were introduced in [@sec-deeplearning] as a (relatively) new class of models in -supervised machine learning. -Using the Python *keras* package you can define various model architectures such as Convolutional or Recurrent Neural Networks. -Although it is beyond the scope of this chapter to give a detailed treatment of building and training deep learning models, in this section we do give an example of using a Convolutional Neural Network using pre-trained word embeddings. -We would urge anyone who is interested in machine learning for text analysis to continue studying deep learning, probably starting with the excellent book by @goldberg2017. - -Impressively, in R you can now also use the *keras* package to train deep learning models, as shown in the example. -Similar to how *spacyr* works ([@sec-nlp]), the R package actually installs and calls Python behind the screens using the *reticulate* package. -Although the resulting models are relatively similar, it is less easy to build and debug the models in R - because most of the documentation and community examples are written in Python. -Thus in the end, we probably recommend people who want to dive into deep learning should choose Python rather than R. +Deep learning models were introduced in [@sec-deeplearning] as a (relatively) new class of models in supervised machine learning. Using the Python *keras* package you can define various model architectures such as Convolutional or Recurrent Neural Networks. 
Although it is beyond the scope of this chapter to give a detailed treatment of building and training deep learning models, in this section we do give an example of using a Convolutional Neural Network using pre-trained word embeddings. We would urge anyone who is interested in machine learning for text analysis to continue studying deep learning, probably starting with the excellent book by @goldberg2017. -::: {.callout-note appearance="simple" icon=false} +Impressively, in R you can now also use the *keras* package to train deep learning models, as shown in the example. Similar to how *spacyr* works ([@sec-nlp]), the R package actually installs and calls Python behind the screens using the *reticulate* package. Although the resulting models are relatively similar, it is less easy to build and debug the models in R because most of the documentation and community examples are written in Python. Thus in the end, we probably recommend people who want to dive into deep learning should choose Python rather than R. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-rnndata} Loading Dutch Sentiment Data [from @vanatteveldt2021] -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python rnndata-python} url = "https://cssbook.net/d/dutch_sentiment.csv" h = pd.read_csv(url) h.head() ``` + ## R code + ```{r rnndata-r} #| cache: true url="https://cssbook.net/d/dutch_sentiment.csv" @@ -1044,13 +644,13 @@ head(d) ::: ::: -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-rnnmodel} Deep Learning: Defining a Recursive Neural Network -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python rnnmodel-python} # Tokenize texts tokenizer = Tokenizer(num_words=9999) @@ -1088,7 +688,9 @@ preds = Dense(1, activation="tanh")(m) m = Model(sequence_input, preds) m.summary() ``` + ## R code + ```{r rnnmodel-r} #| cache: true text_vectorization = layer_text_vectorization( @@ -1113,13 +715,13 @@ model ::: ::: -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-rnn} Deep Learning: Training and Testing the model -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python rnn-python} #| cache: true # Split data into train and test @@ -1142,7 +744,9 @@ acc = sum(correct) / len(pred) print(f"Accuracy: {acc}") ``` + ## R code + ```{r rnn-r} #| cache: true # Split data into train and test @@ -1163,146 +767,63 @@ print(glue("Accuracy: {eval['accuracy']}")) ::: ::: -First, [@exm-rnndata] loads a dataset described by @vanatteveldt2021, which consists of Dutch economic news headlines with a sentiment value. -Next, in [@exm-rnnmodel] a model is defined consisting of several layers, -corresponding roughly to [@fig-cnn]. -First, an *embedding* layer transforms the textual input into a semantic vector for each word. -Next, the *convolutional* layer defines features (filters) over windows of vectors, -which are then pooled in the *max-pooling* layer. -This results in a vector of detected features for each document, -which are then used in a regular (hidden) *dense* layer followed by an output layer. +First, [@exm-rnndata] loads a dataset described by @vanatteveldt2021, which consists of Dutch economic news headlines with a sentiment value. Next, in [@exm-rnnmodel] a model is defined consisting of several layers, corresponding roughly to [@fig-cnn]. First, an *embedding* layer transforms the textual input into a semantic vector for each word. 
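To make that first step concrete, the toy sketch below shows only the shape transformation an embedding layer performs; the vocabulary size, embedding dimension, and word ids are made up and unrelated to the model defined above, and the import path assumes a TensorFlow-backed *keras* installation.

```{python embeddingshape-sketch}
#| eval: false
# An embedding layer maps each word index to a dense vector, so a batch of
# padded sequences of shape (batch, sequence_length) becomes
# (batch, sequence_length, embedding_dim). All sizes here are illustrative.
import numpy as np
from tensorflow.keras.layers import Embedding

embedding = Embedding(input_dim=10000, output_dim=128)
fake_batch = np.array([[12, 5, 301, 0, 0]])  # one padded sequence of word ids
print(embedding(fake_batch).shape)           # (1, 5, 128)
```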
Next, the *convolutional* layer defines features (filters) over windows of vectors, which are then pooled in the *max-pooling* layer. This results in a vector of detected features for each document, which are then used in a regular (hidden) *dense* layer followed by an output layer. -Finally, in [@exm-rnn] we train the model on 4000 documents and test it against the remaining documents. -The Python model, which uses pre-trained word embeddings (the `w2v_320d` file downloaded at the top), -achieves a mediocre accuracy of about 56\% (probably due to the low number of training sentences). -The R model, which trains the embedding layer as part of the model, performs more poorly at 44\% as this model is even more dependent on large training data to properly estimate the embedding layer. +Finally, in [@exm-rnn] we train the model on 4000 documents and test it against the remaining documents. The Python model, which uses pre-trained word embeddings (the `w2v_320d` file downloaded at the top), achieves a mediocre accuracy of about 56% (probably due to the low number of training sentences). The R model, which trains the embedding layer as part of the model, performs more poorly at 44% as this model is even more dependent on large training data to properly estimate the embedding layer. ## Unsupervised Text Analysis: Topic Modeling {#sec-unsupervised} -In [@sec-clustering], we discussed how clustering techniques can be used to find patterns in data, -such as which cases or respondents are most similar. -Similarly, especially in survey research it is common to use factor analysis to discover (or confirm) variables that form a scale. +In [@sec-clustering], we discussed how clustering techniques can be used to find patterns in data, such as which cases or respondents are most similar. Similarly, especially in survey research it is common to use factor analysis to discover (or confirm) variables that form a scale. -In essence, the idea behind these techniques is similar: -by understanding the regularities in the data (which cases or variables behave similarly), -you can describe the relevant information in the data with fewer data points. -Moreover, assuming that the regularities capture interesting information and the deviations from these regularities are mostly -uninteresting noise, these clusters of cases or variables can actually be substantively informative. +In essence, the idea behind these techniques is similar: by understanding the regularities in the data (which cases or variables behave similarly), you can describe the relevant information in the data with fewer data points. Moreover, assuming that the regularities capture interesting information and the deviations from these regularities are mostly uninteresting noise, these clusters of cases or variables can actually be substantively informative. -Since a document-term matrix (DTM) is "just" a matrix, you can also apply these clustering techniques to the DTM -to find groups of words or documents. You can therefore use any of the techniques we described in [@sec-chap-eda], and in particular clustering techniques such as $k$-means clustering (see [@sec-clustering]) to group documents that use similar words together. +Since a document-term matrix (DTM) is "just" a matrix, you can also apply these clustering techniques to the DTM to find groups of words or documents. 
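As a quick illustration of that idea, the sketch below clusters the rows of a document-term matrix with $k$-means; `dtm` is assumed to be a scikit-learn DTM such as the one created for the topic model later in this chapter, and the number of clusters is an arbitrary choice.

```{python dtmkmeans-sketch}
#| eval: false
# Treat every row of a DTM as a feature vector and cluster documents with
# k-means. `dtm` is assumed to exist (see the topic model example below).
from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=123)
clusters = km.fit_predict(dtm)
print(clusters[:10])  # cluster assignment of the first ten documents
```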
You can therefore use any of the techniques we described in [@sec-chap-eda], and in particular clustering techniques such as $k$-means clustering (see [@sec-clustering]) to group documents that use similar words together. -It can be very instructive to do this, and we encourage you to play around with such techniques. However, in recent years, a set of models called *topic models* have become especially popular for the unsupervised analysis of texts. Very much what like what you would do with other unsupervised techniques, also in topic modeling, you group words and documents into "topics", consisting of words and documents that co-vary. -If you see the word "agriculture" in a news article, there is a good chance you might find words such as "farm" or "cattle", -and there is a lower chance you will find a word like "soldier". -In other words, the words "agriculture" and "farm" generally occur in the same kind of documents, so they can be said to be part of the same topic. -Similarly, two documents that share a lot of words are probably about the same topic, -and if you know what topic a document is on (e.g., an agricultural topic), you are better able to guess what words might occur in that document (e.g., "cattle"). +It can be very instructive to do this, and we encourage you to play around with such techniques. However, in recent years, a set of models called *topic models* have become especially popular for the unsupervised analysis of texts. Very much what like what you would do with other unsupervised techniques, also in topic modeling, you group words and documents into "topics", consisting of words and documents that co-vary. If you see the word "agriculture" in a news article, there is a good chance you might find words such as "farm" or "cattle", and there is a lower chance you will find a word like "soldier". In other words, the words "agriculture" and "farm" generally occur in the same kind of documents, so they can be said to be part of the same topic. Similarly, two documents that share a lot of words are probably about the same topic, and if you know what topic a document is on (e.g., an agricultural topic), you are better able to guess what words might occur in that document (e.g., "cattle"). -Thus, we can formulate the goal of topic modeling as: given a corpus, find a set of $n$ topics, consisting of specific words and/or documents, that minimize the mistakes we would make if we try to reconstruct the corpus from the topics. -This is similar to regression where we try to find a line that minimizes the prediction error. +Thus, we can formulate the goal of topic modeling as: given a corpus, find a set of $n$ topics, consisting of specific words and/or documents, that minimize the mistakes we would make if we try to reconstruct the corpus from the topics. This is similar to regression where we try to find a line that minimizes the prediction error. -In early research on document clustering, a technique called Latent Semantic Analysis (LSA) essentially used a factor analysis technique called Singular Value Decomposition (see [@sec-pcasvd]) on the DTM. -This has yielded promising results in information retrieval (i.e., document search) and studying human memory and language use. -However, it has a number of drawbacks including factor loadings that can be difficult to interpret substantively and is not a good way of dealing with words that can have multiple meanings [@lsa]. 
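For the curious, LSA essentially amounts to a truncated SVD of the (possibly weighted) document-term matrix. The sketch below shows this with scikit-learn, again assuming a `dtm` object like the one used elsewhere in this chapter; ten dimensions is an arbitrary choice.

```{python lsasvd-sketch}
#| eval: false
# A minimal LSA-style decomposition of a DTM with TruncatedSVD.
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=10, random_state=123)
doc_vectors = lsa.fit_transform(dtm)   # documents in the reduced space
print(doc_vectors.shape)
print(lsa.components_.shape)           # loading of each term on each dimension
```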
+In early research on document clustering, a technique called Latent Semantic Analysis (LSA) essentially used a factor analysis technique called Singular Value Decomposition (see [@sec-pcasvd]) on the DTM. This has yielded promising results in information retrieval (i.e., document search) and studying human memory and language use. However, it has a number of drawbacks including factor loadings that can be difficult to interpret substantively and is not a good way of dealing with words that can have multiple meanings [@lsa]. ### Latent Dirichlet Allocation (LDA) {#sec-lda} -The most widely used technique for topic modeling is Latent Dirichlet Allocation [LDA, @blei03]. -Although the goal of LDA is the same as for clustering techniques, it starts from the other end with what is called a *generative model*. -A generative model is a (simplified) formal model of how the data is assumed to have been generated. -For example, if we would have a standard regression model predicting income based on age and education level, -the implicit generative model is that to determine someone's income, you take their age and education level, -multiply them both by their regression parameters, and then add the intercept and some random error. -Of course, we know that's not actually how most companies determine wages, but it can be a useful starting point to analyze, e.g., labor market discrimination. - -The generative model behind LDA works as follows. -Assume that you are a journalist writing a 500 word news item. -First, you would choose one or more *topics* to write about, -for example 70\% healthcare and 30\% economy. -Next, for each word in the item, you randomly pick one of these topics based on their respective weight. -Finally, you pick a random word from the words associated with that topic, -where again each word has a certain probability for that topic. -For example, "hospital" might have a high probability for healthcare while "effectiveness" might have a lower probability but could still occur. - -As said, we know (or at least strongly suspect) that this is not how journalists actually write their stories. -However, this generative model helps understand the substantive interpretation of topics. -Moreover, LDA is a *mixture model*, meaning it allows for each document to be about multiple topics, and for each word to occur in multiple topics. -This matches with the fact that in many cases, our documents are in fact about multiple topics, -from a news article about the economic effects of the COVID virus to an open survey answer containing multiple reasons for supporting a certain candidate. -Additionally, since topic assignment is based on what other words occur in a document, -the word "pupil" could be assigned either to a "biology" topic or to an "education" topic, depending -on whether the document talks about eyes and lenses or about teachers and classrooms. - -![Latent Dirichlet Allocation in ``Plate Model'' notation (source: Blei et al, 2003)](img/lda.png){#fig-lda} - -Figure [-@fig-lda] is a more formal notation of the same generative model. -Starting from the left, for each document you pick a set of topics $\Theta$. -This set of topics is drawn from a *Dirichlet distribution* which itself has a parameter $\alpha$ -(see note). -Next, for each word you select a single topic $z$ from the topics in that document. -Finally, you pick an actual word $w$ from the words in that topic, again controlled by a parameter $\beta$. 
- -Now, if we know which words and documents are in which topics, we can start generating the documents in the corpus. -In reality, of course, we have the reverse situation: -we know the documents, and we want to know the topics. -Thus, the task of LDA is to find the parameters that have the highest chance of generating these documents. -Since only the word frequencies are observed, this is a latent variable model where we want to find the -most likely values for the (latent) topic $z$ for each word in each document. - -Unfortunately, there is no simple analytic solution to calculate these topic assignments -like there is for OLS regression. -Thus, like other more complicated statistical models such as multilevel regression, -we need to use an iterative estimation that progressively optimizes the assignment to improve the fit until it converges. - -An estimation method that is often used for LDA is Gibbs sampling. -Simply put, this starts with a random assignment of topics to words. -Then, in each iteration, it reconsiders each word and recomputes what likely topics for that word are -given the other topics in that document and the topics in which that word occurs in other documents. -Thus, if a document already contains a number of words placed in a certain topic, -a new word is more likely to be placed in that topic as well. -After enough iterations, this converges to a solution. - -::: {.callout-note icon=false collapse=true} +The most widely used technique for topic modeling is Latent Dirichlet Allocation [LDA, @blei03]. Although the goal of LDA is the same as for clustering techniques, it starts from the other end with what is called a *generative model*. A generative model is a (simplified) formal model of how the data is assumed to have been generated. For example, if we would have a standard regression model predicting income based on age and education level, the implicit generative model is that to determine someone's income, you take their age and education level, multiply them both by their regression parameters, and then add the intercept and some random error. Of course, we know that's not actually how most companies determine wages, but it can be a useful starting point to analyze, e.g., labor market discrimination. + +The generative model behind LDA works as follows. Assume that you are a journalist writing a 500 word news item. First, you would choose one or more *topics* to write about, for example 70% healthcare and 30% economy. Next, for each word in the item, you randomly pick one of these topics based on their respective weight. Finally, you pick a random word from the words associated with that topic, where again each word has a certain probability for that topic. For example, "hospital" might have a high probability for healthcare while "effectiveness" might have a lower probability but could still occur. + +As said, we know (or at least strongly suspect) that this is not how journalists actually write their stories. However, this generative model helps understand the substantive interpretation of topics. Moreover, LDA is a *mixture model*, meaning it allows for each document to be about multiple topics, and for each word to occur in multiple topics. This matches with the fact that in many cases, our documents are in fact about multiple topics, from a news article about the economic effects of the COVID virus to an open survey answer containing multiple reasons for supporting a certain candidate. 
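To make the generative story above concrete, here is a toy simulation in *numpy*: $\theta$ is the per-document topic mixture drawn from a Dirichlet with parameter $\alpha$, the matrix `phi` plays the role of the per-topic word distributions governed by $\beta$, and for every word we first draw a topic $z$ and then a word $w$. All sizes and values are made up for illustration.

```{python ldagenerative-sketch}
#| eval: false
# Toy simulation of the LDA generative model (not an estimation procedure).
import numpy as np

rng = np.random.default_rng(123)
n_topics, n_words = 3, 8
alpha = [0.3] * n_topics                              # document-topic prior
phi = rng.dirichlet([0.1] * n_words, size=n_topics)   # word distribution per topic

theta = rng.dirichlet(alpha)       # topic mixture for one document
document = []
for _ in range(20):                # generate a 20-"word" document
    z = rng.choice(n_topics, p=theta)   # pick a topic for this word
    w = rng.choice(n_words, p=phi[z])   # pick a word from that topic
    document.append(w)
print(theta.round(2), document)
```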
Additionally, since topic assignment is based on what other words occur in a document, the word "pupil" could be assigned either to a "biology" topic or to an "education" topic, depending on whether the document talks about eyes and lenses or about teachers and classrooms. + +![Latent Dirichlet Allocation in \`\`Plate Model'' notation (source: Blei et al, 2003)](img/lda.png){#fig-lda} + +Figure [-@fig-lda] is a more formal notation of the same generative model. Starting from the left, for each document you pick a set of topics $\Theta$. This set of topics is drawn from a *Dirichlet distribution* which itself has a parameter $\alpha$ (see note). Next, for each word you select a single topic $z$ from the topics in that document. Finally, you pick an actual word $w$ from the words in that topic, again controlled by a parameter $\beta$. + +Now, if we know which words and documents are in which topics, we can start generating the documents in the corpus. In reality, of course, we have the reverse situation: we know the documents, and we want to know the topics. Thus, the task of LDA is to find the parameters that have the highest chance of generating these documents. Since only the word frequencies are observed, this is a latent variable model where we want to find the most likely values for the (latent) topic $z$ for each word in each document. + +Unfortunately, there is no simple analytic solution to calculate these topic assignments like there is for OLS regression. Thus, like other more complicated statistical models such as multilevel regression, we need to use an iterative estimation that progressively optimizes the assignment to improve the fit until it converges. + +An estimation method that is often used for LDA is Gibbs sampling. Simply put, this starts with a random assignment of topics to words. Then, in each iteration, it reconsiders each word and recomputes what likely topics for that word are given the other topics in that document and the topics in which that word occurs in other documents. Thus, if a document already contains a number of words placed in a certain topic, a new word is more likely to be placed in that topic as well. After enough iterations, this converges to a solution. + +::: {.callout-note icon="false" collapse="true"} ## The Dirichlet Distribution and its Hyperparameters -The Dirichlet distribution can be seen as a distribution over multinomial distributions, - that is, every draw from a Dirichlet distribution results in a multinomial distribution. - An easy way to visualize this is to see the Dirichlet distribution as a bag of dice. - You draw a die from the bag, and each die is a distribution over the numbers one to six. - -This distribution is controlled by a parameter called alpha ($\alpha$), - which is often called a *hyperparameter* because it is a parameter that controls how other parameters - (the actual topic distributions) are estimated, similar to, e.g., the learning speed in many machine learning models. - This alpha hyperparameter controls what kind of dice there are in the bag. - A high alpha means that the dice are generally fair, i.e., give a uniform multinomial distribution. - For topic models, this means that documents will in general contain an even spread of multiple topics. - A low alpha means that each die is unfair in the sense of having a strong preference for some number(s), as if these numbers are weighted. You can then draw a die that prefers ones, or a die that prefers sixes. 
- For topic models this means that each document tends to have one or two dominant topics. - Finally, alpha can be symmetric (meaning dice are unfair, but randomly, so in the end each topic has the same chance) - or asymmetric (they are still unfair, and now also favor some topics more than others). - This would correspond to some topics being more likely to occur in all documents. - -In our experience, most documents actually do have one or two dominant topics, - and some topics are actually more prevalent across many documents then others - -- especially if you consider that procedural words and boilerplate also need to be fit into a topic unless they are filtered out beforehand. - Thus, we would generally recommend a relatively low and asymmetric alpha, - and in fact *gensim* provides an algorithm to find, based on the data, an alpha that corresponds to this recommendation (by specifying `alpha='auto'`). - In R, we would recommend picking a lower alpha than the default value, probably around $\alpha=5/K$, - and optionally try using an asymmetric alpha if you find some words that occur across multiple topics. - -To get a more intuitive understanding of the effects of alpha, - please see [cssbook.net/lda](https://cssbook.net/lda) for additional material and visualizations. +The Dirichlet distribution can be seen as a distribution over multinomial distributions, that is, every draw from a Dirichlet distribution results in a multinomial distribution. An easy way to visualize this is to see the Dirichlet distribution as a bag of dice. You draw a die from the bag, and each die is a distribution over the numbers one to six. + +This distribution is controlled by a parameter called alpha ($\alpha$), which is often called a *hyperparameter* because it is a parameter that controls how other parameters (the actual topic distributions) are estimated, similar to, e.g., the learning speed in many machine learning models. This alpha hyperparameter controls what kind of dice there are in the bag. A high alpha means that the dice are generally fair, i.e., give a uniform multinomial distribution. For topic models, this means that documents will in general contain an even spread of multiple topics. A low alpha means that each die is unfair in the sense of having a strong preference for some number(s), as if these numbers are weighted. You can then draw a die that prefers ones, or a die that prefers sixes. For topic models this means that each document tends to have one or two dominant topics. Finally, alpha can be symmetric (meaning dice are unfair, but randomly, so in the end each topic has the same chance) or asymmetric (they are still unfair, and now also favor some topics more than others). This would correspond to some topics being more likely to occur in all documents. + +In our experience, most documents actually do have one or two dominant topics, and some topics are actually more prevalent across many documents then others -- especially if you consider that procedural words and boilerplate also need to be fit into a topic unless they are filtered out beforehand. Thus, we would generally recommend a relatively low and asymmetric alpha, and in fact *gensim* provides an algorithm to find, based on the data, an alpha that corresponds to this recommendation (by specifying `alpha='auto'`). In R, we would recommend picking a lower alpha than the default value, probably around $\alpha=5/K$, and optionally try using an asymmetric alpha if you find some words that occur across multiple topics. 
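As a sketch of how these recommendations translate to *gensim*, the code below re-uses the `corpus` and `vocab` objects created for the topic model later in this chapter; `'auto'` learns an asymmetric prior from the data, and alternatively you can pass an explicit per-topic vector (here the $5/K$ rule of thumb mentioned above, used purely as an illustration).

```{python ldaalpha-sketch}
#| eval: false
# Two hedged ways of setting alpha in gensim; `corpus` and `vocab` are assumed
# to be the objects built in the LDA example below.
from gensim.models import LdaModel

k = 10
lda_auto = LdaModel(corpus, id2word=vocab, num_topics=k,
                    alpha="auto", random_state=123)
lda_manual = LdaModel(corpus, id2word=vocab, num_topics=k,
                      alpha=[5 / k] * k, random_state=123)
```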
+ +To get a more intuitive understanding of the effects of alpha, please see [cssbook.net/lda](https://cssbook.net/lda) for additional material and visualizations. ::: ### Fitting an LDA Model {#sec-ldafit} -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-lda} LDA Topic Model of Obama's State of the Union speeches. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python lda1-python} #| results: hide url = "https://cssbook.net/d/sotu.csv" @@ -1313,11 +834,12 @@ cv = CountVectorizer(min_df=0.01, stop_words="english") dtm = cv.fit_transform(p_obama) dtm corpus = matutils.Sparse2Corpus(dtm, documents_columns=False) -vocab = dict(enumerate(cv.get_feature_names())) +vocab = dict(enumerate(cv.get_feature_names_out())) lda = LdaModel( corpus, id2word=vocab, num_topics=10, random_state=123, alpha="asymmetric" ) ``` + ```{python lda1-python-output} pd.DataFrame( { @@ -1326,7 +848,9 @@ pd.DataFrame( } ) ``` + ## R code + ```{r lda1-r} #| cache: true url = "https://cssbook.net/d/sotu.csv" @@ -1351,51 +875,23 @@ terms(lda, 10) ::: ::: -Example [-@exm-lda] shows how you can fit an LDA model in Python or R. -As example data, we use Obama's State of the Union Speeches using the corpus introduced in Chapter [-@sec-chap-dtm]. -Since such a speech generally touches on many different topics, we choose to first split by paragraph -as these will be more semantically coherent (for Obama, at least). -In R, we use the `corpus_reshape` function to split the paragraphs, -while in Python we use *pandas*' `str.split`, which creates a list or paragraphs for each text, -which we then convert into a paragraph per row using `explode`. -Converting this to a DTM we get a reasonably sized matrix of 738 paragraphs and 746 unique words. - -Next, we fit the actual LDA model using the package *gensim* (Python) and *topicmodels* (R). -Before we can do this, we need to convert the DTM format into a format accepted by that package. -For Python, this is done using the `Sparse2Corpus` helper function while in R this is done with the *quanteda* `convert` function. -Then, we fit the model, asking for 10 topics to be identified in these paragraphs. -There are three things to note in this line. -First, we specify a *random seed* of 123 to make sure the analysis is replicable. -Second, we specify an "asymmetric" of `1/1:10`, meaning the first topic has alpha 1, the second 0.5, etc. (in R). -In Python, instead of using the default of `alpha='symmetric'`, we set `alpha='asymmetric'`, which uses the formula -$\frac{1}{topic\_index + \sqrt{num\_topics}}$ to determine the priors. At the cost of a longer estimation time, we can even -specify `alpha='auto'`, which will learn an asymmetric prior from the data. See the note on hyperparameters for more information. -Third, for Python we also need to specify the vocabulary names since these are not included in the DTM. - -The final line generates a data frame of top words per topic for first inspection -(which in Python requires separating the words from their weights in a list comprehension and converting it to a data frame for easy viewing). -As you can see, most topics are interpretable and somewhat coherent: for example, topic 1 seems to be about education and jobs, -while topic 2 is health care. You also see that the word "job" occurs in multiple topics (presumably because unemployment was a pervasive concern during Obama's tenure). 
-Also, some topics like topic 3 are relatively more difficult to interpret from this table. -A possible reason for this is that not every paragraph actually has policy content. -For example, the first paragraph of his first State of the Union was: -*Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States -- she's around here somewhere*. -None of these words really fit a "topic" in the normal meaning of that term, -but all of these words need to be assigned a topic in LDA. -Thus, you often see "procedural" or "boilerplate" topics such as topic 3 occurring in LDA outputs. - -Finally, note that we showed the R results here. As *gensim* uses a different estimation algorithm -(and scikit-learnuses a different tokenizer and stopword list), results will not be identical, -but should be mostly similar. +Example [-@exm-lda] shows how you can fit an LDA model in Python or R. As example data, we use Obama's State of the Union Speeches using the corpus introduced in Chapter [-@sec-chap-dtm]. Since such a speech generally touches on many different topics, we choose to first split by paragraph as these will be more semantically coherent (for Obama, at least). In R, we use the `corpus_reshape` function to split the paragraphs, while in Python we use *pandas*' `str.split`, which creates a list or paragraphs for each text, which we then convert into a paragraph per row using `explode`. Converting this to a DTM we get a reasonably sized matrix of 738 paragraphs and 746 unique words. + +Next, we fit the actual LDA model using the package *gensim* (Python) and *topicmodels* (R). Before we can do this, we need to convert the DTM format into a format accepted by that package. For Python, this is done using the `Sparse2Corpus` helper function while in R this is done with the *quanteda* `convert` function. Then, we fit the model, asking for 10 topics to be identified in these paragraphs. There are three things to note in this line. First, we specify a *random seed* of 123 to make sure the analysis is replicable. Second, we specify an "asymmetric" of `1/1:10`, meaning the first topic has alpha 1, the second 0.5, etc. (in R). In Python, instead of using the default of `alpha='symmetric'`, we set `alpha='asymmetric'`, which uses the formula $\frac{1}{topic\_index + \sqrt{num\_topics}}$ to determine the priors. At the cost of a longer estimation time, we can even specify `alpha='auto'`, which will learn an asymmetric prior from the data. See the note on hyperparameters for more information. Third, for Python we also need to specify the vocabulary names since these are not included in the DTM. + +The final line generates a data frame of top words per topic for first inspection (which in Python requires separating the words from their weights in a list comprehension and converting it to a data frame for easy viewing). As you can see, most topics are interpretable and somewhat coherent: for example, topic 1 seems to be about education and jobs, while topic 2 is health care. You also see that the word "job" occurs in multiple topics (presumably because unemployment was a pervasive concern during Obama's tenure). Also, some topics like topic 3 are relatively more difficult to interpret from this table. A possible reason for this is that not every paragraph actually has policy content. For example, the first paragraph of his first State of the Union was: *Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States -- she's around here somewhere*. 
None of these words really fit a "topic" in the normal meaning of that term, but all of these words need to be assigned a topic in LDA. Thus, you often see "procedural" or "boilerplate" topics such as topic 3 occurring in LDA outputs. + +Finally, note that we showed the R results here. As *gensim* uses a different estimation algorithm (and scikit-learnuses a different tokenizer and stopword list), results will not be identical, but should be mostly similar. ### Analyzing Topic Model Results {#sec-ldainspect} -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-ldaresults} Analyzing and inspecting LDA results. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python ldaresults1-python} #| cache: true topics = pd.DataFrame( @@ -1410,7 +906,9 @@ tpd.head() for docid in [622, 11, 322]: print(f"{docid}: {list(p_obama)[docid]}") ``` + ## R code + ```{r ldaresults1-r} #| cache: true topics = posterior(lda)$topics %>% @@ -1430,120 +928,52 @@ for (id in c("text7.73", "text5.1", "text2.12")) { ::: ::: -Example [-@exm-ldaresults] shows how you can combine the LDA results (topics per document) -with the original document metadata. -This could be your starting point for substantive analyses of the results, -for example to investigate relations between topics or between, e.g., time or partisanship and topic use. +Example [-@exm-ldaresults] shows how you can combine the LDA results (topics per document) with the original document metadata. This could be your starting point for substantive analyses of the results, for example to investigate relations between topics or between, e.g., time or partisanship and topic use. -You can also use this to find specific documents for reading. -For example, we noted above that topic 3 is difficult to interpret. -As you can see in the table in Example [-@exm-ldaresults] (which is sorted by value of topic 3), -most of the high scoring documents are the first paragraph in each speech, -which do indeed contain the "Madam speaker" boilerplate noted above. -The other three documents are all calls for bipartisanship and support. -As you can see from this example, carefully inspecting the top documents for each topic -is very helpful for making sense of the results. +You can also use this to find specific documents for reading. For example, we noted above that topic 3 is difficult to interpret. As you can see in the table in Example [-@exm-ldaresults] (which is sorted by value of topic 3), most of the high scoring documents are the first paragraph in each speech, which do indeed contain the "Madam speaker" boilerplate noted above. The other three documents are all calls for bipartisanship and support. As you can see from this example, carefully inspecting the top documents for each topic is very helpful for making sense of the results. ### Validating and Inspecting Topic Models {#sec-ldavalidate} -As we saw in the previous subsection, running a topic model is relatively easy. -However, that doesn't mean that the resulting topic model will always be useful. -As with all text analysis techniques, *validation* is the key to good analysis: -are you measuring what you want to measure? And how do you know? - -For topic modeling (and arguably for all text analysis), -the first step after fitting a model is inspecting the results and establishing face validity. 
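Reading the documents that score highest on a topic is one of the quickest ways to establish such face validity. The sketch below ranks the paragraphs by the weight of one (hypothetical) topic, re-using the `lda`, `corpus`, and `p_obama` objects from the examples above.

```{python topdocs-sketch}
#| eval: false
# Rank documents by the weight gensim assigns to a given topic and print the
# top ones. The topic number is a made-up example.
topic_of_interest = 2
paragraphs = list(p_obama)

weights = []
for i, bow in enumerate(corpus):
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    weights.append((dist.get(topic_of_interest, 0.0), i))

for weight, i in sorted(weights, reverse=True)[:5]:
    print(f"{weight:.2f} {paragraphs[i][:80]}")
```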
-Top words per topic such as those listed above are a good place to start, -but we would really encourage you to also look at the top documents per topic to better understand how words are used in context. -Also, it is good to inspect the relationships between topics and look at documents that load high on multiple topics to understand the relationship. - -If the only goal is to get an explorative understanding of the corpus, -for example as a first step before doing a dictionary analysis or manual coding, -just face validity is probably sufficient. -For a more formal validation, however, it depends on the reason for using topic modeling. - -If you are using topic modeling in a true unsupervised sense, i.e., without a predefined analytic schema in mind, -it is difficult to assess whether the model measures what you want to measure -- -because the whole point is that you don't know what you want to measure. -That said, however, you can have the general criteria that the model needs to achieve *coherence* -and *interpretability*, meaning that words and documents that share a topic -are also similar semantically. - -In their excellent paper on the topic, @chang09 propose two formal tasks to judge this -using manual (or crowd) coding: in *word intrusion*, a coder is asked to pick the "odd one out" from a list -where one other word is mixed in a group of topic words. -In *topic intrusion*, the coder is presented with a document and a set of topics that occur in the document, -and is asked to spot the one topic that was not present according to the model. -In both tasks, if the coder is unable to identify the intruding word or topic, apparently the model does not fit -our intuitive notion of "aboutness" or semantic similarity. -Perhaps their most interesting finding is that goodness-of-fit measures like perplexity[^1] -are actually not good predictors of the interpretability of the resulting models. - -If you are using topic models in a more confirmatory manner, -that is, if you wish the topics to match some sort of predefined categorization, -you should use regular gold standard techniques for validation: -code a sufficiently large random sample of documents with your predefined categories, -and test whether the LDA topics match those categories. -In general, however, in such cases it is a better idea to use a dictionary or supervised analysis technique -as topic models often do not exactly capture our categories. After all, unsupervised techniques mainly -excel in bottom-up and explorative analyses (Section [-@sec-deciding]). - -[^1]: Perplexity is a measure to compare and evaluate topic models using log-likelihood in order to estimate how well a model predicts a sample. See the note 'how many topics' below for example code on how to compute perplexity +As we saw in the previous subsection, running a topic model is relatively easy. However, that doesn't mean that the resulting topic model will always be useful. As with all text analysis techniques, *validation* is the key to good analysis: are you measuring what you want to measure? And how do you know? + +For topic modeling (and arguably for all text analysis), the first step after fitting a model is inspecting the results and establishing face validity. Top words per topic such as those listed above are a good place to start, but we would really encourage you to also look at the top documents per topic to better understand how words are used in context. 
Also, it is good to inspect the relationships between topics and look at documents that load high on multiple topics to understand the relationship. + +If the only goal is to get an explorative understanding of the corpus, for example as a first step before doing a dictionary analysis or manual coding, just face validity is probably sufficient. For a more formal validation, however, it depends on the reason for using topic modeling. + +If you are using topic modeling in a true unsupervised sense, i.e., without a predefined analytic schema in mind, it is difficult to assess whether the model measures what you want to measure -- because the whole point is that you don't know what you want to measure. That said, however, you can have the general criteria that the model needs to achieve *coherence* and *interpretability*, meaning that words and documents that share a topic are also similar semantically. + +In their excellent paper on the topic, @chang09 propose two formal tasks to judge this using manual (or crowd) coding: in *word intrusion*, a coder is asked to pick the "odd one out" from a list where one other word is mixed in a group of topic words. In *topic intrusion*, the coder is presented with a document and a set of topics that occur in the document, and is asked to spot the one topic that was not present according to the model. In both tasks, if the coder is unable to identify the intruding word or topic, apparently the model does not fit our intuitive notion of "aboutness" or semantic similarity. Perhaps their most interesting finding is that goodness-of-fit measures like perplexity[^chapter11-1] are actually not good predictors of the interpretability of the resulting models. + +[^chapter11-1]: Perplexity is a measure to compare and evaluate topic models using log-likelihood in order to estimate how well a model predicts a sample. See the note 'how many topics' below for example code on how to compute perplexity + +If you are using topic models in a more confirmatory manner, that is, if you wish the topics to match some sort of predefined categorization, you should use regular gold standard techniques for validation: code a sufficiently large random sample of documents with your predefined categories, and test whether the LDA topics match those categories. In general, however, in such cases it is a better idea to use a dictionary or supervised analysis technique as topic models often do not exactly capture our categories. After all, unsupervised techniques mainly excel in bottom-up and explorative analyses (Section [-@sec-deciding]). ### Beyond LDA {#sec-beyondlda} -This chapter focused on regular or "vanilla" LDA topic modeling. -Since the seminal publication, however, a large amount of variations and extensions on LDA have been proposed. -These include, for example, *Dynamic Topic Models* (which incorporate time; @dynamiclda) and -*Correlated Topic Models* (which explicitly model correlation between topics; @correlatedlda). -Although it is beyond the scope of this book to describe these models in detail, -the interested reader is encouraged to learn more about these models. +This chapter focused on regular or "vanilla" LDA topic modeling. Since the seminal publication, however, a large amount of variations and extensions on LDA have been proposed. These include, for example, *Dynamic Topic Models* (which incorporate time; @dynamiclda) and *Correlated Topic Models* (which explicitly model correlation between topics; @correlatedlda). 
Although it is beyond the scope of this book to describe these models in detail, the interested reader is encouraged to learn more about these models. -Especially noteworthy are *Structural Topic Models* (R package *stm*; @stm), -which allow you to model covariates as topic or word predictors. -This allows you, for example, to model topic shifts over time or -different words for the same topic based on, e.g., Republican or Democrat presidents. +Especially noteworthy are *Structural Topic Models* (R package *stm*; @stm), which allow you to model covariates as topic or word predictors. This allows you, for example, to model topic shifts over time or different words for the same topic based on, e.g., Republican or Democrat presidents. -Python users should check out *Hierarchical Topic Modeling* [@hierarchicallda]. -In hierarchical topic modeling, rather than the researcher specifying a fixed number of topics, -the model returns a hierarchy of topics from few general topics to a large number of specific topics, -allowing for a more flexible exploration and analysis of the data. +Python users should check out *Hierarchical Topic Modeling* [@hierarchicallda]. In hierarchical topic modeling, rather than the researcher specifying a fixed number of topics, the model returns a hierarchy of topics from few general topics to a large number of specific topics, allowing for a more flexible exploration and analysis of the data. -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## How many topics? -With topic modeling, the most important researcher choices are the *number of topics* and the value of *alpha*. - These choices are called *hyperparameters*, since they determine how the model parameters (e.g. words per topic) are found. - -There is no good theoretical solution to determine the "right" number of topics for a given corpus and research question. - Thus, a sensible approach can be to ask the computer to try many models, and see which works best. - Unfortunately, because this is an unsupervised (inductive) method, - there is no single metric that determines how good a topic model is. - -There are a number of such metrics proposed in the literature, of which we will introduce two. - *Perplexity* is a score of how well the LDA model can fit (predict) the actual word distribution - (or in other words: how "perplexed" the model is seeing the corpus). - *Coherence* is a measure of how semantically coherent the topics are by checking how often the top token co-occurs in documents in each topic [@mimno11]. - -The code below shows how these can be calculated for a range of topic counts, and the same code could be used for trying different values of *alpha*. - For both measures, lower values are better, and both essentially keep decreasing as you add more topics. - What you are looking for is the *inflection point* (or "elbow point") where it goes from a steep decrease to a more gradual decrease. - For coherence, this seems to be at 10 topics, while for perplexity this is at 20 topics. - -There are two very important caveats to make here, however. - First, these metrics are no substitute for human validation and the best model according to these metrics is not always the most interpretable or coherent model. - In our experience, most metrics give a higher topic count that would be optimal from an interpretability perspective, but of course that also depends on how we operationalize interpretability. 
- Nonetheless, these topic numbers are probably more indicative of a range of counts that should be inspected manually, rather than giving a definitive answer.
-
-Second, the code below was written so it is easy to understand and quick to run.
- For real use in a research project, it is advised to include a broader range of topic counts and also vary the $\alpha$.
- Moreover, it is smart to run each count multiple times so you get an indication of the variance as well as a single point
- (it is quite likely that the local minimum for coherence at $k=10$ is an outlier that will disappear if more runs are averaged).
- Finally, especially for a goodness-of-fit measure like perplexity it is better to split the data into a training and test set
- (see Section [-@sec-workflow] for more details).
-
-::: {.panel-tabset}
+With topic modeling, the most important researcher choices are the *number of topics* and the value of *alpha*. These choices are called *hyperparameters*, since they determine how the model parameters (e.g., words per topic) are found.
+
+There is no good theoretical solution to determine the "right" number of topics for a given corpus and research question. Thus, a sensible approach can be to ask the computer to try many models, and see which works best. Unfortunately, because this is an unsupervised (inductive) method, there is no single metric that determines how good a topic model is.
+
+There are a number of such metrics proposed in the literature, of which we will introduce two. *Perplexity* is a score of how well the LDA model can fit (predict) the actual word distribution (or in other words: how "perplexed" the model is when seeing the corpus). *Coherence* is a measure of how semantically coherent the topics are by checking how often the top tokens in each topic co-occur in documents [@mimno11].
+
+The code below shows how these can be calculated for a range of topic counts, and the same code could be used for trying different values of *alpha*. For both measures, lower values are better, and both essentially keep decreasing as you add more topics. What you are looking for is the *inflection point* (or "elbow point") where it goes from a steep decrease to a more gradual decrease. For coherence, this seems to be at 10 topics, while for perplexity this is at 20 topics.
+
+There are two very important caveats to make here, however. First, these metrics are no substitute for human validation, and the best model according to these metrics is not always the most interpretable or coherent model. In our experience, most metrics give a higher topic count than would be optimal from an interpretability perspective, but of course that also depends on how we operationalize interpretability. Nonetheless, these topic numbers are probably more indicative of a range of counts that should be inspected manually, rather than giving a definitive answer.
+
+Second, the code below was written so it is easy to understand and quick to run. For real use in a research project, it is advised to include a broader range of topic counts and also vary the $\alpha$. Moreover, it is smart to run each count multiple times so you get an indication of the variance as well as a single point (it is quite likely that the local minimum for coherence at $k=10$ is an outlier that will disappear if more runs are averaged). Finally, especially for a goodness-of-fit measure like perplexity, it is better to split the data into a training and test set (see Section [-@sec-workflow] for more details).
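To make that last point concrete, here is a minimal sketch (ours, not the example code below) of how held-out perplexity could be computed with *gensim* using a train/test split. The `tokenized_texts` list is a hypothetical stand-in for your own preprocessed corpus, and the chunk is not evaluated here:

```{python ldaholdout-python}
#| eval: false
import random
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical input: a list of tokenized documents (list of lists of tokens)
tokenized_texts = [["economy", "jobs", "growth"], ["virus", "health", "vaccine"]] * 50

# Randomly split the documents into a training and a test set
random.seed(42)
random.shuffle(tokenized_texts)
split = int(len(tokenized_texts) * 0.8)
train_texts, test_texts = tokenized_texts[:split], tokenized_texts[split:]

# Build the vocabulary and bag-of-words corpora from the training set only
dictionary = Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
test_corpus = [dictionary.doc2bow(t) for t in test_texts]

# Fit LDA on the training documents and score the held-out documents
lda = LdaModel(train_corpus, id2word=dictionary, num_topics=5, random_state=42)
bound = lda.log_perplexity(test_corpus)   # per-word log-likelihood bound
print("Held-out perplexity:", 2 ** (-bound))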
+ +::: panel-tabset ## Python code + ```{python ldacoherence-python} #| cache: true #| results: hide @@ -1567,7 +997,9 @@ result.plot(x="k", y=["perplexity", "coherence"]) plt.show() ``` + ## R code + ```{r ldacoherence-r} #| cache: true results = list() @@ -1590,7 +1022,6 @@ bind_rows(results, .id="k") %>% ::: ::: - ```{bash cleanup} #| echo: false rm -f myclassifier.pkl myvectorizer.pkl diff --git a/content/chapter13.qmd b/content/chapter13.qmd index 2538c76..7676356 100644 --- a/content/chapter13.qmd +++ b/content/chapter13.qmd @@ -1,8 +1,12 @@ # Network Data {#sec-chap-network} {{< include common_setup.qmd >}} - - +```{=html} + +``` **Abstract.** Many types of data, especially social media data, can often be represented as networks. This chapter introduces *igraph* (R and Python) and *networkx* (Python) to showcase how to deal with such data, perform Social Network Analysis (SNA), and represent it visually. @@ -11,25 +15,27 @@ Many types of data, especially social media data, can often be represented as ne **Objectives:** -- Understand how can networks be represented and visualized - - Conduct basic description of networks - - Perform Social Network Analysis +- Understand how can networks be represented and visualized +- Conduct basic description of networks +- Perform Social Network Analysis -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Packages used in this chapter -This chapter uses functions from the package *igraph* in R and the package *networkx* in Python. - In Python we will also use the *python-louvain* packages which introduces the Louvain clustering functions in *community*. +This chapter uses functions from the package *igraph* in R and the package *networkx* in Python. In Python we will also use the *python-louvain* packages which introduces the Louvain clustering functions in *community*. -You can install these packages with the code below if needed (see Section [-@sec-installing] for more details): +You can install these packages with the code below if needed (see Section [-@sec-installing] for more details): -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python chapter13install-python} #| eval: false !pip3 install networkx matplotlib python-louvain community ``` + ## R code + ```{r chapter13install-r} #| eval: false # Common packages used in this book @@ -38,10 +44,12 @@ install.packages(c("glue", "tidyverse")) install.packages(c("igraph")) ``` ::: - After installing, you need to import (activate) the packages every session: -::: {.panel-tabset} +After installing, you need to import (activate) the packages every session: + +::: panel-tabset ## Python code + ```{python chapter13library-python} import urllib.request @@ -52,7 +60,9 @@ import community import community.community_louvain as community_louvain ``` + ## R code + ```{r chapter13library-r} library(glue) library(igraph) @@ -63,26 +73,29 @@ library(tidyverse) ## Representing and Visualizing Networks {#sec-graph} -How can networks help us to understand and represent social problems? How can we use social media as a source for small and large-scale network analysis? In the computational analysis of communication these questions become highly relevant given the huge amount of social media data produced every minute on the Internet. 
In fact, although graph theory and SNA were already being used during the last two decades of the 20th century, we can say that the widespread adoption of the Internet and especially social networking services such as Twitter and Facebook really unleashed their potential. Firstly, computers made it easier to compute graph measures and visualize their general and communal structures. Secondly, the emergence of a big spectrum of social media network sites (i.e. Facebook, Twitter, Sina Weibo, Instagram, Linkedin, etc.) produced an unprecedented number of online social interactions, which still is certainly an excellent arena to apply this framework. Thus, the use of social media as a source for network analysis has become one of the most exciting and promising areas in the field of computational social science.
+How can networks help us to understand and represent social problems? How can we use social media as a source for small and large-scale network analysis? In the computational analysis of communication these questions become highly relevant given the huge amount of social media data produced every minute on the Internet. In fact, although graph theory and SNA were already being used during the last two decades of the 20th century, we can say that the widespread adoption of the Internet and especially social networking services such as Twitter and Facebook really unleashed their potential. Firstly, computers made it easier to compute graph measures and visualize their general and communal structures. Secondly, the emergence of a broad spectrum of social networking sites (e.g., Facebook, Twitter, Sina Weibo, Instagram, LinkedIn) produced an unprecedented number of online social interactions, which are certainly still an excellent arena in which to apply this framework. Thus, the use of social media as a source for network analysis has become one of the most exciting and promising areas in the field of computational social science.

This section presents a brief overview of graph structures (nodes and edges) and types (directed, weighted, etc.), together with their representations in R and Python. We also include visual representations and basic graph analysis.

-A graph is a structure derived from a set of elements and their relationships. The element could be a neuron, a person, an organization, a street, or even a message, and the relationship could be a synapse, a trip, a commercial agreement, a drive connection or a content transmission. This is a different way to represent, model and analyze the world: instead of having rows and columns as in a typical data frame, in a graph we have *nodes* (components) and *edges* (relations).
-The mathematical representation of a graph $G=(V,E)$ is based on a set of nodes (also called vertices): $\{v_{1}, v_{2},\ldots v_{n}\}$ and the edges or pair of nodes: $\{(v_{1}, v_{2}), (v_{1}, v_{3}), (v_{2},v_{3}) \ldots (v_{m}, v_{n}) \in E\}$ As you may imagine, it is a very versatile procedure to represent many kinds of situations that include social, media, or political interactions. In fact, if we go back to 1934 we can see how graph theory (originally established in the 18th century) was first applied to the representation of social interactions [@moreno1934shall] in order to measure the attraction and repulsion of individuals of a social group[^1].
+A graph is a structure derived from a set of elements and their relationships.
The element could be a neuron, a person, an organization, a street, or even a message, and the relationship could be a synapse, a trip, a commercial agreement, a drive connection or a content transmission. This is a different way to represent, model and analyze the world: instead of having rows and columns as in a typical data frame, in a graph we have *nodes* (components) and *edges* (relations). The mathematical representation of a graph $G=(V,E)$ is based on a set of nodes (also called vertices): $\{v_{1}, v_{2},\ldots v_{n}\}$ and the edges or pair of nodes: $\{(v_{1}, v_{2}), (v_{1}, v_{3}), (v_{2},v_{3}) \ldots (v_{m}, v_{n}) \in E\}$ As you may imagine, it is a very versatile procedure to represent many kinds of situations that include social, media, or political interactions. In fact, if we go back to 1934 we can see how graph theory (originally established in the 18th century) was first applied to the representation of social interactions [@moreno1934shall] in order to measure the attraction and repulsion of individuals of a social group[^chapter13-1]. + +[^chapter13-1]: See also the mathematical problem of the *Seven Bridges of Königsberg*, formulated by Leonhard Euler in 1736, which is considered the basis of graph theory. Inspired by a city divided by a river and connected by several bridges, the problem consisted of walking through the whole city crossing each bridge exactly once. + +The network approach in social sciences has an enormous potential to model and predict *social actions*. There is empirical evidence that we can successfully apply this framework to explain distinct phenomena such as political opinions, obesity, and happiness, given the influence of our friends (or even of the friends of our friends) over our behavior [@christakis2009connected]. The network created by this sophisticated structure of human and social connections is an ideal scenario to understand how close we are to each other in terms of degrees of separation [@watts2004six] in small (e.g., a school) and large-scale (e.g., a global pandemic) social dynamics. Moreover, the network approach can help us to track the propagation either of a virus in epidemiology, or a fake news story in political and social sciences, such as in the work by @vosoughi2018spread. -The network approach in social sciences has an enormous potential to model and predict *social actions*. There is empirical evidence that we can successfully apply this framework to explain distinct phenomena such as political opinions, obesity, and happiness, given the influence of our friends (or even of the friends of our friends) over our behavior [@christakis2009connected]. The network created by this sophisticated structure of human and social connections is an ideal scenario to understand how close we are to each other in terms of degrees of separation [@watts2004six] in small (e.g., a school) and large-scale (e.g., a global pandemic) social dynamics. Moreover, the network approach can help us to track the propagation either of a virus in epidemiology, or a fake news story in political and social sciences, such as in the work by -@vosoughi2018spread. +Now, let us show you how to create and visualize network structures in R and Python. As we mentioned above, the structure of a graph is based on nodes and edges, which are the fundamental components of any network. 
Suppose that we want to model the social network of five American politicians in 2017 (Donald Trump, Bernie Sanders, Hillary Clinton, Barack Obama and John McCain), based on their *imaginary* connections on Facebook (friending) and Twitter (following)[^chapter13-2]. Technically, the base of any graph is a list of edges (written as pair of nodes that indicate the relationships) and a list of nodes (some nodes might be isolated without any connection!). For instance, the friendship on Facebook between two politicians would normally be expressed as two strings separated by comma (e.g., "Hillary Clinton", "Donald Trump"). In Example [-@exm-graph] we use libraries *igraph* (R)[^chapter13-3] and *networkx* (Python) to create from scratch a simple graph with five nodes and five edges, using the above-mentioned structure of pairs of nodes (notice that we only include the edges while the vertices are automatically generated). -Now, let us show you how to create and visualize network structures in R and Python. As we mentioned above, the structure of a graph is based on nodes and edges, which are the fundamental components of any network. Suppose that we want to model the social network of five American politicians in 2017 (Donald Trump, Bernie Sanders, Hillary Clinton, Barack Obama and John McCain), based on their *imaginary* connections on Facebook (friending) and Twitter (following)[^2]. Technically, the base of any graph is a list of edges (written as pair of nodes that indicate the relationships) and a list of nodes (some nodes might be isolated without any connection!). For instance, the friendship on Facebook between two politicians would normally be expressed as two strings separated by comma (e.g., "Hillary Clinton", "Donald Trump"). In Example [-@exm-graph] we use libraries *igraph* (R)[^3] and *networkx* (Python) to create from scratch a simple graph with five nodes and five edges, using the above-mentioned structure of pairs of nodes (notice that we only include the edges while the vertices are automatically generated). +[^chapter13-2]: The connections among these politicians on Facebook and Twitter in the examples are of course purely fictional and were created *ad hoc* to illustrate small social networks. -::: {.callout-note appearance="simple" icon=false} +[^chapter13-3]: You can use this library in Python with the adapted package *python-igraph*. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-graph} Imaginary Facebook network of 5 American politicians -::: {.panel-tabset} - +::: panel-tabset ## Python code + ```{python graph-python} edges = [ ("Hillary Clinton", "Donald Trump"), @@ -96,7 +109,9 @@ g1.add_edges_from(edges) print("Nodes:", g1.number_of_nodes(), "Edges: ", g1.number_of_edges()) print(g1.edges) ``` + ## R code + ```{r graph-r} edges=c("Hillary Clinton", "Donald Trump", "Bernie Sanders","Hillary Clinton", @@ -112,12 +127,13 @@ g1 In both cases we generated a graph object `g1` which contains the structure of the network and different attributes (such as `number_of_nodes()` in *networkx*). You can add/remove nodes and edges to/from this initial graph, or even modify the names of the vertices. One of the most useful functions is the visualization of the network (`plot` in *igraph* and `draw` or `draw_networkx` in *networkx*). Example [-@exm-visgraph] shows a basic visualization of the imaginary network of friendships of five American politicians on Facebook. 
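Before turning to that visualization, here is a minimal sketch (ours, not one of the book's examples) of the add/remove/rename operations mentioned above. It works on a copy of `g1` so the original graph used in the rest of this chapter is left untouched; the added node and the new label are purely illustrative, and the chunk is not evaluated here:

```{python graphmodify-python}
#| eval: false
import networkx as nx

# Work on a copy so the g1 object used in later examples stays intact
g_copy = g1.copy()

# Add and remove nodes and edges on the existing graph object
g_copy.add_node("Joe Biden")                   # an isolated node, no edges yet
g_copy.add_edge("Joe Biden", "Barack Obama")   # now connect it
g_copy.remove_edge("John McCain", "Donald Trump")
g_copy.remove_node("John McCain")

# Rename (relabel) vertices; by default this returns a new graph
g_copy = nx.relabel_nodes(g_copy, {"Hillary Clinton": "H. Clinton"})
print("Nodes:", g_copy.number_of_nodes(), "Edges:", g_copy.number_of_edges())
```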
-::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-visgraph} Visualization of a simple graph. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python visgraph-python} #| results: hide nx.draw_networkx(g1) @@ -132,7 +148,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r visgraph-r} plot(g1) ``` @@ -142,15 +160,15 @@ plot(g1) Using network terminology, either nodes or edges can be *adjacent* or not. In the figure we can say that nodes representing Donald Trump and John McCain are adjacent because they are connected by an edge that depicts their friendship. Moreover, the edges representing the friendships between John McCain and Donald Trump, and Hillary Clinton and Donald Trump, are also adjacent because they share one node (Donald Trump). -Now that you know the relevant terminology and basics of working with graphs, you might be wondering: what if I want to do the same with Twitter? Can I represent the relationships between users in the very same way as Facebook? Well, when you model networks it is extremely important that you have a clear definition of what you mean with nodes and edges, in order to maintain a coherent interpretation of the graph. In both, Facebook and Twitter, the nodes represent the users, but the edges might not be the same. In Facebook, an edge represents the friendship between two users and this link *has no direction* (once a user accepts a friend request, both users become friends). In the case of Twitter, an edge could represent various relationships. For example, it could mean that two users follow each other, or that one user is following another user, but not the other way around! In the latter case, the edge *has a direction*, which you can establish in the graph. When you give directions to the edges you are creating a *directed graph*. In Example [-@exm-directed] the directions are declared with the order of the pair of nodes: the first position is for the "from" and the second for the "to". In *igraph* (R) we set the argument `directed` of the function `make_graph` to `TRUE`. In *networkx* (Python), you use the class `DiGraph` instead of `Graph` to create the object `g2`. - -::: {.callout-note appearance="simple" icon=false} +Now that you know the relevant terminology and basics of working with graphs, you might be wondering: what if I want to do the same with Twitter? Can I represent the relationships between users in the very same way as Facebook? Well, when you model networks it is extremely important that you have a clear definition of what you mean with nodes and edges, in order to maintain a coherent interpretation of the graph. In both, Facebook and Twitter, the nodes represent the users, but the edges might not be the same. In Facebook, an edge represents the friendship between two users and this link *has no direction* (once a user accepts a friend request, both users become friends). In the case of Twitter, an edge could represent various relationships. For example, it could mean that two users follow each other, or that one user is following another user, but not the other way around! In the latter case, the edge *has a direction*, which you can establish in the graph. When you give directions to the edges you are creating a *directed graph*. In Example [-@exm-directed] the directions are declared with the order of the pair of nodes: the first position is for the "from" and the second for the "to". 
In *igraph* (R) we set the argument `directed` of the function `make_graph` to `TRUE`. In *networkx* (Python), you use the class `DiGraph` instead of `Graph` to create the object `g2`. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-directed} Creating a directed graph -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python directed-python} edges += [ ("Hillary Clinton", "Bernie Sanders"), @@ -162,7 +180,9 @@ print("Nodes:", g2.number_of_nodes(), "Edges: ", g2.number_of_edges()) print(g2.edges) ``` + ## R code + ```{r directed-r} edges = c(edges, "Hillary Clinton", "Bernie Sanders", @@ -176,12 +196,13 @@ print(g2) In the new graph the edges represent the action of following a user on Twitter. The first declared edge indicates that Hillary Clinton follows Donald Trump, but does not indicate the opposite. In order to provide the directed graph with more *arrows* we included in `g2` two new edges (Obama following Clinton and Clinton following Sanders), so we can have a couple of reciprocal relationships besides the unidirectional ones. You can visualize the directed graph in Example [-@exm-visdirected] and see how the edges now contain useful arrows. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-visdirected} Visualization of a directed graph. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python visdirected-python} #| results: hide nx.draw_networkx(g2) @@ -196,7 +217,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r visdirected-r} plot(g2) ``` @@ -204,14 +227,15 @@ plot(g2) ::: ::: -The edges and nodes of our graph can also have weights and features or attributes. When the edges have specific values that depict a feature of every pair of nodes (i.e., the distance between two cities) we say that we have a *weighted graph*. This type of graph is extremely useful for creating a more accurate representation of a network. For example, in our hypothetical network of American politicians on Twitter (`g2`) we can assign weights to the edges by including the number of likes that each politician has given to the followed user. This value can serve as a measure of the distance between the nodes (i.e., the higher the number of likes the shorter the social distance). In Example [-@exm-weighted] we include the weights for each edge: Clinton has given five likes to Trumps' tweets, Sanders 20 to Clinton's messages, and so on. In the plot you can see how the sizes of the lines between the nodes change as a function of the weights. +The edges and nodes of our graph can also have weights and features or attributes. When the edges have specific values that depict a feature of every pair of nodes (i.e., the distance between two cities) we say that we have a *weighted graph*. This type of graph is extremely useful for creating a more accurate representation of a network. For example, in our hypothetical network of American politicians on Twitter (`g2`) we can assign weights to the edges by including the number of likes that each politician has given to the followed user. This value can serve as a measure of the distance between the nodes (i.e., the higher the number of likes the shorter the social distance). In Example [-@exm-weighted] we include the weights for each edge: Clinton has given five likes to Trumps' tweets, Sanders 20 to Clinton's messages, and so on. In the plot you can see how the sizes of the lines between the nodes change as a function of the weights. 
-::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-weighted} Visualization of a weighted graph -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python weighted-python} #| results: hide edges_w = [ @@ -252,7 +276,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r weighted-r} E(g2)$weight = c(5, 20, 30, 40, 50, 10, 15) plot(g2, edge.label = E(g2)$weight) @@ -261,14 +287,15 @@ plot(g2, edge.label = E(g2)$weight) ::: ::: -You can include more properties of the components of your graph. Imagine you want to use the number of followers of each politician to determine the size of the nodes, or the gender of the user to establish a color. In Example [-@exm-weighted2] we added the variable *followers* to each of the nodes and asked the packages to plot the network using this value as the size parameter (in fact we multiplied the values by 0.001 to make it realistic on the screen, but you could also normalize these values when needed). We also included the variable *party* that was later recoded in a new one called *color* in order to represent Republicans with red and Democrats with blue. You may need to add other features to the nodes or edges, but with this example you have an overview of what you can do. +You can include more properties of the components of your graph. Imagine you want to use the number of followers of each politician to determine the size of the nodes, or the gender of the user to establish a color. In Example [-@exm-weighted2] we added the variable *followers* to each of the nodes and asked the packages to plot the network using this value as the size parameter (in fact we multiplied the values by 0.001 to make it realistic on the screen, but you could also normalize these values when needed). We also included the variable *party* that was later recoded in a new one called *color* in order to represent Republicans with red and Democrats with blue. You may need to add other features to the nodes or edges, but with this example you have an overview of what you can do. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-weighted2} Visualization of a weighted graph including vertex sizes. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python weighted2-python} #| results: hide attrs = { @@ -302,7 +329,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r weighted2-r} V(g2)$followers = c(100000, 200000, 50000,500000, 40000) @@ -321,15 +350,15 @@ plot(g2, edge.label = E(g2)$weight, ::: ::: -We can mention a third type of graphs: the *induced subgraphs*, which are in fact subsets of nodes and edges of a bigger graph. We can represent these subsets as $G' = V', E'$. In Example [-@exm-subgraph] we extract two induced subgraphs from our original network of American politicians on Facebook (`g1`): the first (`g3`) is built with the edges that contain only Democrat nodes, and the second (`g4`) with edges formed by Republican nodes. There is also a special case of an induced subgraph, called a *clique*, which is an independent or complete subset of an undirected graph (each node of the clique must be connected to the rest of the nodes of the subgraph). - -::: {.callout-note appearance="simple" icon=false} +We can mention a third type of graphs: the *induced subgraphs*, which are in fact subsets of nodes and edges of a bigger graph. We can represent these subsets as $G' = V', E'$. 
In Example [-@exm-subgraph] we extract two induced subgraphs from our original network of American politicians on Facebook (`g1`): the first (`g3`) is built with the edges that contain only Democrat nodes, and the second (`g4`) with edges formed by Republican nodes. There is also a special case of an induced subgraph, called a *clique*, which is an independent or complete subset of an undirected graph (each node of the clique must be connected to the rest of the nodes of the subgraph). +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-subgraph} Induced subgraphs for Democrats and Republicans -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python subgraph-python} # Democrats: g3 = g1.subgraph(["Hillary Clinton", "Bernie Sanders", "Barack Obama"]) @@ -342,7 +371,9 @@ print("Nodes:", g4.number_of_nodes(), "Edges: ", g4.number_of_edges()) print(g4.edges) ``` + ## R code + ```{r subgraph-r} # Democrats: g3 = induced_subgraph(g1, c(1,3,4)) @@ -358,15 +389,15 @@ print(g4) Keep in mind that in network visualization you can always configure the size, shape and color of your nodes or edges. It is out of the scope of this book to go into more technical details, but you can always check the online documentation of the recommended libraries. -So far we have created networks from scratch, but most of the time you will have to create a graph from an existing data file. This means that you will need an input data file with the graph structure, and some functions to load them as objects onto your workspace in R or Python. You can import graph data from different specific formats (e.g., Graph Modeling Language (GML), GraphML, JSON, etc.), but one popular and standardized procedure is to obtain the data from a text file containing a list of edges or a matrix. In Example [-@exm-read] we illustrate how to read graph data in *igraph* and *networkx* using a simple adjacency list that corresponds to our original imaginary Twitter network of American politicians (`g2`). - -::: {.callout-note appearance="simple" icon=false} +So far we have created networks from scratch, but most of the time you will have to create a graph from an existing data file. This means that you will need an input data file with the graph structure, and some functions to load them as objects onto your workspace in R or Python. You can import graph data from different specific formats (e.g., Graph Modeling Language (GML), GraphML, JSON, etc.), but one popular and standardized procedure is to obtain the data from a text file containing a list of edges or a matrix. In Example [-@exm-read] we illustrate how to read graph data in *igraph* and *networkx* using a simple adjacency list that corresponds to our original imaginary Twitter network of American politicians (`g2`). +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-read} Reading a graph from a file -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python read-python} url = "https://cssbook.net/d/poltwit.csv" fn, _headers = urllib.request.urlretrieve(url) @@ -374,7 +405,9 @@ g2 = nx.read_adjlist(fn, create_using=nx.DiGraph, delimiter=",") print("Nodes:", g2.number_of_nodes(), "Edges: ", g2.number_of_edges()) ``` + ## R code + ```{r read-r} edges = read_csv( "https://cssbook.net/d/poltwit.csv", @@ -394,15 +427,15 @@ This section gives an overview of the existing measures to conduct Social Networ ### Paths and Reachability {#sec-paths} -The first idea that comes to mind when analyzing a graph is to understand how their nodes are connected. 
When multiple edges create a network we can observe how the vertices constitute one or many paths that can be described. In this sense, a `sequence` between node *x* and node *y* is a path where each node is `adjacent` to the previous. In the imaginary social network of friendship of American politicians contained in the undirected graph `g1`, we can determine the sequences or simple paths between any pair of politicians. As shown in Example [-@exm-path] we can use the function `all_simple_paths` contained in both *igraph* (R) and *networkx* (Python), to obtain the two possible routes between Barack Obama and John McCain. The shortest path includes the nodes Hillary Clinton and Donald Trump; and the longer includes Sanders, Clinton, and Trump. - -::: {.callout-note appearance="simple" icon=false} +The first idea that comes to mind when analyzing a graph is to understand how their nodes are connected. When multiple edges create a network we can observe how the vertices constitute one or many paths that can be described. In this sense, a `sequence` between node *x* and node *y* is a path where each node is `adjacent` to the previous. In the imaginary social network of friendship of American politicians contained in the undirected graph `g1`, we can determine the sequences or simple paths between any pair of politicians. As shown in Example [-@exm-path] we can use the function `all_simple_paths` contained in both *igraph* (R) and *networkx* (Python), to obtain the two possible routes between Barack Obama and John McCain. The shortest path includes the nodes Hillary Clinton and Donald Trump; and the longer includes Sanders, Clinton, and Trump. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-path} Possible paths between two nodes in the imaginary Facebook network of American politicians -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python path-python} for path in nx.all_simple_paths( g1, source="Barack Obama", target="John McCain" @@ -410,7 +443,9 @@ for path in nx.all_simple_paths( print(path) ``` + ## R code + ```{r path-r} all_simple_paths(g1, "Barack Obama","John McCain", mode = c("all")) @@ -421,12 +456,13 @@ all_simple_paths(g1, "Barack Obama","John McCain", One specific type of path is the one in which the initial node is the same than the final node. This closed path is called a *circuit*. To understand this concept let us recover the inducted subgraph of Democrat politicians (`g3`) in which we only have three nodes. If you plot this graph, as we do in Example [-@exm-circuit], you can clearly visualize how a circuit works. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-circuit} Visualization of a circuit. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python circuit-python} #| results: hide nx.draw_networkx(g3) @@ -440,7 +476,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r circuit-r} plot(g3) ``` @@ -448,14 +486,15 @@ plot(g3) ::: ::: -In SNA it is extremely important to be able to describe the possible paths since they help us to estimate the reachability of the vertices. For instance, if we go back to our original graph of American politicians on Facebook (`g1`) visualized in Example [-@exm-visgraph], we can see that Sanders is reachable from McCain because there is a path between them (McCain--Trump--Clinton--Sanders). Moreover, we observe that this social network is fully *connected* because you can reach any given node from any other node in the graph. 
But it might not always be that way. Imagine that we remove the friendship of Clinton and Trump by deleting that specific edge. As you can observe in Example [-@exm-component], when we create and visualize the graph `g6` without this edge we can see that the network is no longer fully connected and it has two *components*. Technically speaking, we would say for example that the subgraph of Republicans is a connected component of the network of American politicians, given that this connected subgraph is part of the bigger graph while not connected to it. +In SNA it is extremely important to be able to describe the possible paths since they help us to estimate the reachability of the vertices. For instance, if we go back to our original graph of American politicians on Facebook (`g1`) visualized in Example [-@exm-visgraph], we can see that Sanders is reachable from McCain because there is a path between them (McCain--Trump--Clinton--Sanders). Moreover, we observe that this social network is fully *connected* because you can reach any given node from any other node in the graph. But it might not always be that way. Imagine that we remove the friendship of Clinton and Trump by deleting that specific edge. As you can observe in Example [-@exm-component], when we create and visualize the graph `g6` without this edge we can see that the network is no longer fully connected and it has two *components*. Technically speaking, we would say for example that the subgraph of Republicans is a connected component of the network of American politicians, given that this connected subgraph is part of the bigger graph while not connected to it. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-component} Visualization of connected components. -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python component-python} #| results: hide # Remove the friendship between Clinton and Trump @@ -472,7 +511,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r component-r} #Remove the friendship between Clinton and Trump g6 = delete.edges(g1, E(g1, P= @@ -483,15 +524,15 @@ plot(g6) ::: ::: -When analyzing paths and reachability you may be interested in knowing the distances in your graph. One common question is what is the average path length of a social network, or in other words, what is the average of the shortest distance between each pair of vertices in the graph? This *mean distance* can tell you a lot about how close the nodes in the network are: the shorter the distance the closer the nodes are. Moreover, you can estimate the specific distance (shortest path) between two specific nodes. As shown in Example [-@exm-distance] we can estimate the average path length (1.7) in our imaginary Facebook network of American politicians using the functions `mean_distance` in *igraph* and `average_shortest_path_length` in *networkx*. In this example we also estimate the specific distance in the network between Obama and McCain (3) using the function `distances` in *igraph* and estimating the length (`len`) of the shortest path (first result of `shortest_simple_paths` minus 1) in *networkx*. - -::: {.callout-note appearance="simple" icon=false} +When analyzing paths and reachability you may be interested in knowing the distances in your graph. One common question is what is the average path length of a social network, or in other words, what is the average of the shortest distance between each pair of vertices in the graph? 
This *mean distance* can tell you a lot about how close the nodes in the network are: the shorter the distance the closer the nodes are. Moreover, you can estimate the specific distance (shortest path) between two specific nodes. As shown in Example [-@exm-distance] we can estimate the average path length (1.7) in our imaginary Facebook network of American politicians using the functions `mean_distance` in *igraph* and `average_shortest_path_length` in *networkx*. In this example we also estimate the specific distance in the network between Obama and McCain (3) using the function `distances` in *igraph* and estimating the length (`len`) of the shortest path (first result of `shortest_simple_paths` minus 1) in *networkx*. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-distance} Estimating distances in the network -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python distance-python} print( "Average path length in Facebook network: ", @@ -501,7 +542,9 @@ paths = list(nx.shortest_simple_paths(g1, "Barack Obama", "John McCain")) print("Distance between Obama and McCain", len(paths[0]) - 1) ``` + ## R code + ```{r distance-r} glue("Average path length in Facebook network: ", mean_distance(g1, directed = T)) @@ -514,21 +557,23 @@ glue("Distance between Obama and McCain", ::: ::: -In terms of distance, we can also wonder what the edges or nodes that share a border with any given vertex are. In the first case, we can identify the *incident edges* that go out or into one vertex. As shown in Example [-@exm-incident], by using the functions `incident` in *igraph* and `edges` in *networkx* we can easily get incident edges of John McCain in the Facebook Network (`g1`), which is just one single edge that joins Trump with McCain. In the second case, we can also identify its adjacent nodes, or in other words its *neighbors*. In the very same example, we use `neighbors` (same function in *igraph* and *networkx*) to obtain all the nodes one step away from McCain (in this case only Trump). - -::: {.callout-note appearance="simple" icon=false} +In terms of distance, we can also wonder what the edges or nodes that share a border with any given vertex are. In the first case, we can identify the *incident edges* that go out or into one vertex. As shown in Example [-@exm-incident], by using the functions `incident` in *igraph* and `edges` in *networkx* we can easily get incident edges of John McCain in the Facebook Network (`g1`), which is just one single edge that joins Trump with McCain. In the second case, we can also identify its adjacent nodes, or in other words its *neighbors*. In the very same example, we use `neighbors` (same function in *igraph* and *networkx*) to obtain all the nodes one step away from McCain (in this case only Trump). +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-incident} Incident edges and neighbors of J. McCain the imaginary Facebook Network -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python incident-python} print("Incident edges of John McCain:", g1.edges("John McCain")) print("Neighbors of John McCain", [n for n in g1.neighbors("John McCain")]) ``` + ## R code + ```{r incident-r} #mode: all, out, in glue("Incident edges of John McCain in", @@ -544,42 +589,49 @@ neighbors(g1, V(g1)["John McCain"], mode="all") There are some other interesting descriptors of social networks. 
One of the most common measures is the *density* of the graph, which accounts for the proportion of edges relative to all possible ties in the network. In simpler words, the density tells us, on a scale from 0 to 1, how connected the nodes of a graph are. This can be estimated for both undirected and directed graphs. Using the functions `edge_density` in *igraph* and `density` in *networkx* we obtain a density of 0.5 (middle level) in the imaginary Facebook network of American politicians (undirected graph) and 0.35 in the Twitter network (directed graph).
-In undirected graphs we can also measure *transitivity* (also known as *clustering coefficient*) and *diameter*. The first is a key property of social networks that refers to the ratio of triangles over the total amount of connected triples. It is to say that we wonder how likely it is that two nodes are connected if they share a mutual neighbor. Applying the function `transitivity` (included in *igraph* and *networkx*) to `g1` we can see that this tendency is of 0.5 in the Facebook network (there is a 50\% probability that two politicians
-are friends when they have a common contact). The second descriptor, the diameter, depicts the length of the network in terms of the longest geodesic distance[^4]. We use the function `diameter` (included in *igraph* and *networkx*) in the Facebook network and get a diameter of 3, which you can also check if you go back to the visualization of `g1` in Example [-@exm-visgraph].
+In undirected graphs we can also measure *transitivity* (also known as the *clustering coefficient*) and *diameter*. The first is a key property of social networks that refers to the ratio of triangles to the total number of connected triples. That is to say, we ask how likely it is that two nodes are connected if they share a mutual neighbor. Applying the function `transitivity` (included in *igraph* and *networkx*) to `g1` we can see that this tendency is 0.5 in the Facebook network (there is a 50% probability that two politicians are friends when they have a common contact). The second descriptor, the diameter, depicts the length of the network in terms of the longest geodesic distance[^chapter13-4]. We use the function `diameter` (included in *igraph* and *networkx*) in the Facebook network and get a diameter of 3, which you can also check if you go back to the visualization of `g1` in Example [-@exm-visgraph].
+
+[^chapter13-4]: The *geodesic distance* is the number of edges on the shortest path between two vertices.
+
+Additionally, in directed graphs we can calculate the *reciprocity*, which is just the proportion of reciprocal ties in a social network and can be computed with the function `reciprocity` (included in *igraph* and *networkx*). For the imaginary Twitter network (directed graph) we get a reciprocity of 0.57 (which is not bad for a Twitter graph, where important people usually have many more followers than accounts they follow!). In Example [-@exm-density] we show how to estimate these four measures in R and Python. Notice that in some of the network descriptors you have to decide whether or not to include the edge weights for computation (in the provided examples we did not take these weights into account).
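To see where these numbers come from, density can also be computed by hand from the number of nodes $N$ and edges $E$: $2E/(N(N-1))$ for undirected graphs and $E/(N(N-1))$ for directed graphs. The sketch below (ours, not Example [-@exm-density]) assumes the `g1` and `g2` objects created earlier in this chapter and is not evaluated here:

```{python densitybyhand-python}
#| eval: false
import networkx as nx

n1, e1 = g1.number_of_nodes(), g1.number_of_edges()  # 5 nodes, 5 edges (undirected)
n2, e2 = g2.number_of_nodes(), g2.number_of_edges()  # 5 nodes, 7 edges (directed)

# Undirected density: realized edges over the n*(n-1)/2 possible ties
print(2 * e1 / (n1 * (n1 - 1)), nx.density(g1))  # both should give 0.5
# Directed density: realized arcs over the n*(n-1) possible ordered ties
print(e2 / (n2 * (n2 - 1)), nx.density(g2))      # both should give 0.35
```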
-::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-density} Estimations of density, transitivity, diameter and reciprocity -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python density-python} # Density in Facebook network: nx.density(g1) ``` + ```{python density-python2} # Density in Twitter network: nx.density(g2) ``` + ```{python density-python3} # Transitivity in Facebook network: nx.transitivity(g1) ``` + ```{python density-python4} # Diameter in Facebook network nx.diameter(g1, e=None, usebounds=False) ``` + ```{python density-python5} # Reciprocity in Twitter network: nx.reciprocity(g2) ``` + ## R code + ```{r density-r} # Density in Facebook network: edge_density(g1) @@ -600,34 +652,41 @@ reciprocity(g2) Now let us move to *centrality measures*. Centrality is probably the most common, popular, or known measure in the analysis of social networks because it gives you a clear idea of the importance of any of the nodes within a graph. Using its measures you can pose many questions such as which is the most central person in a network of friends on Facebook, who can be considered an opinion leader on Twitter or who is an influencer on Instagram. Moreover, knowing the specific importance of every node of the network can help us to visualize or label only certain vertices that overpass a previously determined threshold, or to use the color or size to distinguish the most central nodes from the others. There are four typical centrality measures: *degree*, *closeness*, *eigenvector* and *betweenness*. -The *degree* of a node refers to the number of ties of that vertex, or in other words, to the number of edges that are incident to that node. This definition is constant for undirected graphs in which the directions of the links are not declared. In the case of directed graphs, you will have three options to measure the degree. First, you can think of the number of edges pointing *into* a node, which we call *indegree*; second, we have the number of edges pointing *out* of a node, or *outdegree*. In addition, we could also have the total number of edges pointing (in and out) any node. Degree, as well as other measures of centrality mentioned below, can be expressed in absolute numbers, but we can also *normalize*[^5] these measures for better interpretation and comparison. We will prefer this latter approach in our examples, which is also the default option in many SNA packages. +The *degree* of a node refers to the number of ties of that vertex, or in other words, to the number of edges that are incident to that node. This definition is constant for undirected graphs in which the directions of the links are not declared. In the case of directed graphs, you will have three options to measure the degree. First, you can think of the number of edges pointing *into* a node, which we call *indegree*; second, we have the number of edges pointing *out* of a node, or *outdegree*. In addition, we could also have the total number of edges pointing (in and out) any node. Degree, as well as other measures of centrality mentioned below, can be expressed in absolute numbers, but we can also *normalize*[^chapter13-5] these measures for better interpretation and comparison. We will prefer this latter approach in our examples, which is also the default option in many SNA packages. -We can then estimate the degree of two of our example networks. 
In Example [-@exm-centrality1], we first estimate the degree of each of the five American politicians in the imaginary Facebook network, which is an undirected graph; and then the total degree in the Twitter network, which is a directed graph. For both cases, we use the functions `degree` in *igraph* (R) and `degree_centrality` in *networkx* (Python). We later compute the `in` and `out` degree for the Twitter network. Using *igraph* we again used the function `degree` but now adjust the parameter `mode` to `in` or `out`, respectively. Using *networkx*, we employ the functions `in_degree_centrality` and `out_degree_centrality`. +[^chapter13-5]: The approach is to divide by the maximum possible number of vertices ($N$) minus 1, or by $N-1$. We may also estimate the `weighted degree` of a node, which is the same degree but ponderated by the weight of the edges. -::: {.callout-note appearance="simple" icon=false} +We can then estimate the degree of two of our example networks. In Example [-@exm-centrality1], we first estimate the degree of each of the five American politicians in the imaginary Facebook network, which is an undirected graph; and then the total degree in the Twitter network, which is a directed graph. For both cases, we use the functions `degree` in *igraph* (R) and `degree_centrality` in *networkx* (Python). We later compute the `in` and `out` degree for the Twitter network. Using *igraph* we again used the function `degree` but now adjust the parameter `mode` to `in` or `out`, respectively. Using *networkx*, we employ the functions `in_degree_centrality` and `out_degree_centrality`. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-centrality1} Computing degree centralities in undirected and directed graphs -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python centrality1-python} # Degree centrality of Facebook network (undirected): print(nx.degree_centrality(g1)) ``` + ```{python centrality1-python2} # Degree centrality of Twitter network (directed): print(nx.degree_centrality(g2)) ``` + ```{python centrality1-python3} # In degree centrality of Twitter network (directed): print(nx.in_degree_centrality(g2)) ``` + ```{python centrality1-python4} # Out degree centrality of Twitter network (directed): print(nx.out_degree_centrality(g2)) ``` + ## R code + ```{r centrality1-r} # Degree centrality of Facebook network (undirected): print(degree(g1, normalized = T)) @@ -649,26 +708,30 @@ There are three other types of centrality measures. *Closeness centrality* refer As shown in Example [-@exm-centrality2], we can obtain these three measures from undirected graphs using the functions `closeness`, `eigen_centrality` and `betweenness` in *igraph*, and `closeness_centrality`, `eigenvector_centrality` and `betweenness_centrality` in *networkx*. If we take a look to the centrality measures for every politician of the imaginary Facebook network we see that Clinton seems to be a very important and central node of the graph, just coinciding with the above-mentioned findings based on the degree. It is not a rule that we obtain the very same trend in each of the centrality measures but it is likely that they have similar results although they are looking for different dimensions of the same construct. 
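As a complement to the example below, here is a small sketch of ours (not part of the book's examples) that collects all four centrality measures for `g1` in a single *pandas* data frame, which makes it easier to compare the politicians at a glance; the chunk is not evaluated here:

```{python centralitytable-python}
#| eval: false
import pandas as pd
import networkx as nx

# One row per politician, one column per centrality measure
centralities = pd.DataFrame({
    "degree": nx.degree_centrality(g1),
    "closeness": nx.closeness_centrality(g1),
    "eigenvector": nx.eigenvector_centrality(g1),
    "betweenness": nx.betweenness_centrality(g1),
})
print(centralities.round(2).sort_values("degree", ascending=False))
```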
-::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-centrality2} Estimations of closeness, eigenvector and betweenness centralities -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python centrality2-python} # Closeness centrality of Facebook network (undirected): print(nx.closeness_centrality(g1)) ``` + ```{python centrality2-python2} # Eigenvector centrality of Facebook network (undirected): print(nx.eigenvector_centrality(g1)) ``` + ```{python centrality2-python3} # Betweenness centrality of Facebook network (undirected): print(nx.betweenness_centrality(g1)) ``` + ## R code + ```{r centrality2-r} # Closeness centrality of Facebook network (undirected): print(closeness(g1, normalized = T)) @@ -683,12 +746,13 @@ print(betweenness(g1, normalized = T)) We can use these centrality measures in many ways. For example, you can take the degree centrality as a parameter of the node size and labeling when plotting the network. This may be of great utility since the reader can visually identify the most important nodes of the network while minimizing the visual impact of those that are less central. In Example [-@exm-plotsize] we decided to specify the size of the nodes (parameters `vertex.size` in *igraph* and `node_size` in *networkx*) with the degree centrality of each of the American politicians in the Twitter network (directed graph) contained in `g2`. We also used the degree centrality to filter the labels in the graph, and then included only those that overpassed a threshold of 0.5 (parameters `vertex.label` in *igraph* and `labels` in *networkx*). These two simple parameters of the plot give you a fair image of the potential of the centrality measures to describe and understand your social network. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-plotsize} Using the degree centrality to change the size and labels of the nodes -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python plotsize-python} #| results: hide size = list(nx.degree_centrality(g2).values()) @@ -710,7 +774,9 @@ plt.box(False) plt.show() ``` + ## R code + ```{r plotsize-r} plot(g2, vertex.label.cex = 2, vertex.size= degree(g2, normalized = T)*40, @@ -723,24 +789,26 @@ plot(g2, vertex.label.cex = 2, ### Clustering and Community Detection {#sec-communitydetection} -One of the greatest potentials of SNA is the ability to identify how nodes are interconnected and thus define *communities* within a graph. This is to say that most of the time the nodes and edges in our network are not distributed homogeneously, but they tend to form clusters that can later be interpreted. In a social network you can think for example of the principle of *homophily*, which is the tendency of human beings to associate and interact with similar individuals; or you can think of extrinsic factors (e.g., economic or legal) that may generate the cohesion of small groups of citizens that belong to a wider social structure. While it is of course difficult to make strong claims regarding the underlying causes, we can use different computational approaches to model and detect possible communities that emerge from social networks and even interpret and label those groups. The creation of clusters as an unsupervised machine learning technique was introduced in Section [-@sec-clustering] for structured data and in Section [-@sec-unsupervised] for text analysis (topic modeling). 
We will use some similar unsupervised approaches for community detection in social networks. +One of the greatest potentials of SNA is the ability to identify how nodes are interconnected and thus define *communities* within a graph. This is to say that most of the time the nodes and edges in our network are not distributed homogeneously, but they tend to form clusters that can later be interpreted. In a social network you can think for example of the principle of *homophily*, which is the tendency of human beings to associate and interact with similar individuals; or you can think of extrinsic factors (e.g., economic or legal) that may generate the cohesion of small groups of citizens that belong to a wider social structure. While it is of course difficult to make strong claims regarding the underlying causes, we can use different computational approaches to model and detect possible communities that emerge from social networks and even interpret and label those groups. The creation of clusters as an unsupervised machine learning technique was introduced in Section [-@sec-clustering] for structured data and in Section [-@sec-unsupervised] for text analysis (topic modeling). We will use some similar unsupervised approaches for community detection in social networks. -Many social and communication questions may arise when clustering a network. The identification of subgroups can tell us how diverse and fragmented a network is, or how the behavior of a specific community relates to other groups and to the entire graph. Moreover, the concentration of edges in some nodes of the graph would let us know about the social structure of the networks which in turn would mean a better understanding of its inner dynamic. It is true that the computational analyst will need more than the provided algorithms when labeling the groups to understand the communities, which means that you must become familiar with the way the graph has been built and what the nodes, edges or weights represent. +Many social and communication questions may arise when clustering a network. The identification of subgroups can tell us how diverse and fragmented a network is, or how the behavior of a specific community relates to other groups and to the entire graph. Moreover, the concentration of edges in some nodes of the graph would let us know about the social structure of the networks which in turn would mean a better understanding of its inner dynamic. It is true that the computational analyst will need more than the provided algorithms when labeling the groups to understand the communities, which means that you must become familiar with the way the graph has been built and what the nodes, edges or weights represent. -A first step towards an analysis of subgroups within a network is to find the available complete subgraphs in an undirected graph. As we briefly explained at the end of [@sec-graph], these independent subgraphs are called `cliques` and refer to subgroups where every vertex is connected to every other vertex. We can find the `maximal cliques` (a clique is maximal when it cannot be extended to a bigger clique) in the imaginary undirected graph of American politicians on Facebook (`g1`) by using the functions `max_cliques` in *igraph* [@eppstein2010listing] and `max_cliques` in *networkx* [@cazals2008note]. As you can see in [@exm-cliques], we obtain a total of three subgraphs, one representing the Democrats, another the Republicans, and one more the connector of the two parties (Clinton--Trump). 
- -::: {.callout-note appearance="simple" icon=false} +A first step towards an analysis of subgroups within a network is to find the available complete subgraphs in an undirected graph. As we briefly explained at the end of [@sec-graph], these independent subgraphs are called `cliques` and refer to subgroups where every vertex is connected to every other vertex. We can find the `maximal cliques` (a clique is maximal when it cannot be extended to a bigger clique) in the imaginary undirected graph of American politicians on Facebook (`g1`) by using the functions `max_cliques` in *igraph* [@eppstein2010listing] and `find_cliques` in *networkx* [@cazals2008note]. As you can see in [@exm-cliques], we obtain a total of three subgraphs, one representing the Democrats, another the Republicans, and one more the connector of the two parties (Clinton--Trump). +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-cliques} Finding all the maximal cliques in an undirected graph -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python cliques-python} -print(f"Number of cliques: {nx.graph_number_of_cliques(g1)}") +print(f"Number of cliques: {len(list(nx.find_cliques(g1)))}") print(f"Cliques: {list(nx.find_cliques(g1))}") ``` + ## R code + ```{r cliques-r} glue("Number of cliques: {clique_num(g1)}") max_cliques(g1) @@ -749,19 +817,22 @@ max_cliques(g1) ::: ::: -Now, in order to properly detect communities we will apply some common algorithms to obtain the most likely subgroups in a social network. The first of these models is the so called *edge-between* or Girvan--Newman algorithm [@newman2004finding]. This algorithm is based on divisive hierarchical clustering (explained in [@sec-clustering]) by breaking down the graph into pieces and iteratively removing edges from the original one. Specifically, the Girvan--Newman approach uses the betweenness centrality measure to remove the most central edge at each iteration. You can easily visualize this splitting process in a dendogram, as we do in [@exm-girvan], where we estimated `cl_girvan` to detect possible communities in the Facebook network. We used the functions `cluster_edge_betweenness` in *igraph* and `girvan_newman` in *networkx*. +Now, in order to properly detect communities we will apply some common algorithms to obtain the most likely subgroups in a social network. The first of these models is the so-called *edge-betweenness* or Girvan--Newman algorithm [@newman2004finding]. This algorithm is based on divisive hierarchical clustering (explained in [@sec-clustering]): it breaks the graph into pieces by iteratively removing edges from the original one. Specifically, the Girvan--Newman approach uses the betweenness centrality measure to remove the most central edge at each iteration. You can easily visualize this splitting process in a dendrogram, as we do in [@exm-girvan], where we estimated `cl_girvan` to detect possible communities in the Facebook network. We used the functions `cluster_edge_betweenness` in *igraph* and `girvan_newman` in *networkx*. -::: {.callout-note appearance="simple" icon=false} +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-girvan} Clustering with Girvan--Newman. 
-::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python girvan-python} cl_girvan = nxcom.girvan_newman(g1) # Note: see R output for a nice visualization ``` + ## R code + ```{r girvan-r} cl_girvan = cluster_edge_betweenness(g1) dendPlot(cl_girvan, mode="hclust") @@ -774,16 +845,15 @@ When you look at the figure you will notice that the final leaves correspond to With community detection algorithms we can then estimate the length (number of suggested clusters), membership (to which cluster belongs each node) and modularity (how good is the clustering). In the case of *igraph* in R we apply the functions `length` (base), `membership` and `modularity` over the produced clustering object (i.e., `cl_girvan`). In the case of *networkx* in Python we first have to specify that we want to use the first component of the divisions (out of four) using the function `next`. Then, we can apply the functions `len` (base) and `modularity` to get the descriptors, and print the first division (stored as `communities1`) to obtain the membership. -These functions are demonstrated in @exm-girvan2. Note that since we will be showing these properties for multiple clustering algorithms below, -we create a convenience function `summarize_clustering` to display them. - -::: {.callout-note appearance="simple" icon=false} +These functions are demonstrated in @exm-girvan2. Note that since we will be showing these properties for multiple clustering algorithms below, we create a convenience function `summarize_clustering` to display them. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-girvan2} Community detection with Girvan--Newman -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python girvan2-python} def summarize_clustering(graph, clustering): print(f"Length {len(clustering)}") @@ -795,7 +865,9 @@ def summarize_clustering(graph, clustering): c1 = next(cl_girvan) summarize_clustering(g1, c1) ``` + ## R code + ```{r girvan2-r} summarize_clustering = function(clustering) { print(glue("Length {length(clustering)}")) @@ -815,15 +887,17 @@ summarize_clustering(cl_girvan) We can estimate the communities for our network using many other more clustering algorithms, such as the *Louvain algorithm*, the *Propagating Label algorithm*, and *Greedy Optimization*, among others. Similar to Girvan--Newman, the Louvain algorithm uses the measure of modularity to obtain a multi-level optimization [@blondel2008fast] and its goal is to obtain optimized clusters which minimize the number of edges between the communities and maximize the number of edges within the same community. For its part, the Greedy Optimization algorithm is also based on the modularity indicator [@clauset2004finding]. It does not consider the edges' weights and works by initially setting each vertex in its own community and then joining two communities to increase modularity until obtaining the maximum modularity. Finally, the Propagating Label algorithm -- which takes into account edges' weights -- initializes each node with a unique label and then iteratively each vertex adopts the label of its neighbors until all nodes have the most common label of their neighbors [@raghavan2007near]. The process can be conducted asynchronously (as done in our example), synchronously or semi-synchronously (it might produce different results). 
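Since Girvan--Newman, Louvain, and Greedy Optimization all revolve around *modularity*, it may help to see what that measure actually is. In the notation below (ours, not the book's), for an undirected graph with $m$ edges, adjacency matrix $A$, node degrees $k_i$, and community assignments $c_i$:

$$Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$$

where $\delta(c_i, c_j)$ equals 1 when nodes $i$ and $j$ are assigned to the same community and 0 otherwise. In words, $Q$ is the share of edges that fall within communities minus the share expected if the same edges were rewired at random while preserving each node's degree. Values around 0 mean the partition is no better than chance, while values closer to 1 indicate clearly separated communities.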
-In [@exm-clustalgo] we use `cluster_louvain`, `cluster_fast_greedy` and `cluster_label_prop` in *igrapgh* (R) and `best_partition`, `greedy_modularity_communities` and `asyn_lpa_communities` in *networkx* (Python). You can see that the results are quite similar[^6] and it is pretty clear that there are two communities in the Facebook network: Democrats and Republicans! +In [@exm-clustalgo] we use `cluster_louvain`, `cluster_fast_greedy` and `cluster_label_prop` in *igrapgh* (R) and `best_partition`, `greedy_modularity_communities` and `asyn_lpa_communities` in *networkx* (Python). You can see that the results are quite similar[^chapter13-6] and it is pretty clear that there are two communities in the Facebook network: Democrats and Republicans! -::: {.callout-note appearance="simple" icon=false} +[^chapter13-6]: This similarity is because our example network is extremely small. In larger networks, the results might not be that similar. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-clustalgo} Community detection with Louvain Propagating Label and Greedy Optimization -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python clustalgo-python} # Louvain: # (Note that the Louvain output is a dict of {member: cluster} rather than a @@ -833,6 +907,7 @@ cl_louvain= [{k for (k,v) in cl_louvain.items() if v == cluster} for cluster in set(cl_louvain.values())] summarize_clustering(g1, cl_louvain) ``` + ```{python clustalgo-python2} # Greedy optimization: # (Note that the nxcom output is a generator, sorting it turns it into a list) @@ -840,6 +915,7 @@ cl_greedy = nxcom.greedy_modularity_communities(g1) cl_greedy = sorted(cl_greedy, key=len, reverse=True) summarize_clustering(g1, cl_greedy) ``` + ```{python clustalgo-python3} # Propagating label: cl_propagation = nxcom.asyn_lpa_communities(g1) @@ -847,7 +923,9 @@ cl_propagation = nxcom.asyn_lpa_communities(g1) cl_propagation = sorted(cl_propagation, key=len, reverse=True) summarize_clustering(g1, cl_propagation) ``` + ## R code + ```{r clustalgo-r} # Louvain: cl_louvain = cluster_louvain(g1) @@ -867,13 +945,13 @@ summarize_clustering(cl_propagation) We can plot each of those clusters for better visualization of the communities. In Example [-@exm-plotcluster] we generate the plots with the *Greedy Optimization* algorithm in R and the *Louvain* algorithm in Python, and we get two identical results. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-plotcluster} Plotting clusters with Greedy optimization in R and Louvain in Python -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python plotcluster-python} #| results: hide # From which cluster does each node originate? @@ -892,7 +970,9 @@ nx.draw_networkx_edges(g1, pos, alpha=0.3) plt.show(g1) ``` + ## R code + ```{r plotcluster-r} plot(cl_louvain, g1) ``` @@ -904,15 +984,17 @@ There are more ways to obtain subgraphs of your network (such as the `K-core dec So far we have seen how to conduct SNA over "artificial" graphs for the sake of simplicity. However, the representation and analysis of "real world" networks will normally be more challenging because of their size or their complexity. To conclude this chapter we will show you how to apply some of the explained concepts to real data. 
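Before turning to real data, note that the K-core decomposition mentioned above also takes just one call. The sketch below assumes the undirected Facebook graph `g1` and the `nx` import from the earlier examples; it is an illustration rather than part of the chapter's own example set:

```python
# Minimal sketch: k-core decomposition of the undirected Facebook graph g1.
# core_number() returns, for every node, the largest k such that the node
# is part of a subgraph in which all members have at least k connections.
print(nx.core_number(g1))

# k_core() returns that subgraph itself; with k=2 we keep only nodes that
# retain at least two connections inside the remaining subgraph.
g1_core2 = nx.k_core(g1, k=2)
print(g1_core2.nodes())
```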
-Using the Twitter API (see Section [-@sec-apis]), we retrieved the names of the first 100 followers of the five most important politicians in Spain by 2017 (Mariano Rajoy, Pedro Sánchez, Albert Rivera, Alberto Garzón and Pablo Iglesias). With this information we produced an undirected graph[^7] of the "friends" of these Spanish politicians in order to understand how these leaders where connected through their followers. In Example [-@exm-friends1] we load the data into a graph object `g_friends` that contains the 500 edges of the network. As we may imagine the five mentioned politicians were normally the most central nodes, but if we look at the degree, betweenness and closeness centralities we can easily get some of the relevant nodes of the Twitter network: CEARefugio, elenballesteros or Unidadpopular. These accounts deserve special attention since they contribute to the connection of the main leaders of that country. In fact, if we conduct clustering analysis using Louvain algorithm we will find a high modularity (0.77, which indicates that the clusters are well separated) and not surprisingly five clusters. +Using the Twitter API (see Section [-@sec-apis]), we retrieved the names of the first 100 followers of the five most important politicians in Spain by 2017 (Mariano Rajoy, Pedro Sánchez, Albert Rivera, Alberto Garzón and Pablo Iglesias). With this information we produced an undirected graph[^chapter13-7] of the "friends" of these Spanish politicians in order to understand how these leaders were connected through their followers. In Example [-@exm-friends1] we load the data into a graph object `g_friends` that contains the 500 edges of the network. As we might expect, the five politicians mentioned above are among the most central nodes, but if we look at the degree, betweenness and closeness centralities we can also identify other relevant nodes of the Twitter network: CEARefugio, elenballesteros or Unidadpopular. These accounts deserve special attention since they help connect the main political leaders of the country. In fact, if we conduct a clustering analysis using the Louvain algorithm we find a high modularity (0.77, which indicates that the clusters are well separated) and, not surprisingly, five clusters. -::: {.callout-note appearance="simple" icon=false} +[^chapter13-7]: We deliberately omitted the directions of the edges given their impossible reciprocity. 
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-friends1} Loading and analyzing a real network of Spanish politicians and their followers on Twitter -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python friends1-python} url = "https://cssbook.net/d/friends3.csv" fn, _headers = urllib.request.urlretrieve(url) @@ -920,27 +1002,33 @@ g_friends = nx.read_adjlist(fn, create_using=nx.Graph, delimiter=";") print(f"Nodes: {g_friends.number_of_nodes()}, " f"Edges: {g_friends.number_of_edges()}") ``` + ```{python friends1-python2} # Degree centrality: def sort_by_value(dict): return sorted(dict.items(), key=lambda item: item[1], reverse=True) print(sort_by_value(nx.degree_centrality(g_friends))) ``` + ```{python friends1-python3} # Betweenness centrality: print(sort_by_value(nx.betweenness_centrality(g_friends))) ``` + ```{python friends1-python4} # Closeness centrality: print(sort_by_value(nx.closeness_centrality(g_friends))) ``` + ```{python friends1-python5} # Clustering with Louvain: cluster5 = community_louvain.best_partition(g_friends) print("Length: ", len(set(cluster5.values()))) print("Modularity: " f"{community_louvain.modularity(cluster5,g_friends):.2f}") ``` + ## R code + ```{r friends1-r} edges = read_delim("https://cssbook.net/d/friends3.csv", col_names=FALSE, delim=";") @@ -964,13 +1052,13 @@ print(glue("Modularity: {modularity(cluster5)}")) When we visualize the clusters in the network (Example [-@exm-friends2]) using the degree centrality for the size of the node, we can locate the five politicians in the center of the clusters (depicted with different colors). More interesting, we can see that even when some users follow two of the political leaders, they are just assigned to one of the clusters. This the case of the node joining Garzón and Sánchez who is assigned to the Sánchez's cluster, or the node joining Garzón and Rajoy who is assigned to Rajoy's cluster. In the plot you can also see two more interesting facts. First, we can see a triangle that groups Sánchez, Garzón and Iglesias, which are leaders of the left-wing parties in Spain. Second, some pair of politicians (such as Iglesias--Garzón or Sánchez--Rivera) share more friends than the other possible pairs. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-friends2} Visualizing the network of Spanish politicians and their followers on Twitter and plotting its clusters -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python friends2-python} #| results: hide pos = nx.spring_layout(g_friends) @@ -998,7 +1086,9 @@ nx.draw_networkx_edges(g_friends, pos, alpha=0.5) plt.show(g_friends) ``` + ## R code + ```{r friends2-r} plot(cluster5, g_friends, vertex.label.cex = 2, vertex.size=degree(g_friends, normalized=T)*40, @@ -1009,18 +1099,3 @@ plot(cluster5, g_friends, vertex.label.cex = 2, ::: ::: ::: - -[^1]: See also the mathematical problem of the *Seven Bridges of Königsberg*, formulated by Leonhard Euler in 1736, which is considered the basis of graph theory. Inspired by a city divided by a river and connected by several bridges, the problem consisted of walking through the whole city crossing each bridge exactly once. - -[^2]: The connections among these politicians on Facebook and Twitter in the examples are of course purely fictional and were created *ad hoc* to illustrate small social networks. - -[^3]: You can use this library in Python with the adapted package *python-igraph*. 
- -[^4]: The *geodesic distance* is the shortest number of edges between two vertices - -[^5]: The approach is to divide by the maximum possible number of vertices ($N$) minus 1, or by $N-1$. We may also estimate the `weighted degree` of a node, which is the same degree but ponderated by the weight of the edges. - -[^6]: This similarity is because our example network is extremely small. In larger networks, the results might not be that similar. - -[^7]: We deliberately omitted the directions of the edges given their impossible reciprocity. - diff --git a/content/chapter14.qmd b/content/chapter14.qmd index c5715f8..36db245 100644 --- a/content/chapter14.qmd +++ b/content/chapter14.qmd @@ -2,32 +2,33 @@ {{< include common_setup.qmd >}} -**Abstract.** -Digitally collected data often does not only contain texts, but also audio, images, and videos. Instead of using only textual features as we did in previous chapters, we can also use pixel values -to analyze images. First, we will see how to use existing libraries, commercial services or APIs to conduct multimedia analysis (i.e., optical character recognition, speech-to-text or object recognition). Then we will show how to store, represent, and convert image data in order to use it as an input in our computational analysis. We will focus on image analysis using machine learning classification techniques based on deep learning, and will explain how to build (or fine-tune) a Convolutional Neural Network (CNN) by ourselves. +**Abstract.** Digitally collected data often does not only contain texts, but also audio, images, and videos. Instead of using only textual features as we did in previous chapters, we can also use pixel values to analyze images. First, we will see how to use existing libraries, commercial services or APIs to conduct multimedia analysis (i.e., optical character recognition, speech-to-text or object recognition). Then we will show how to store, represent, and convert image data in order to use it as an input in our computational analysis. We will focus on image analysis using machine learning classification techniques based on deep learning, and will explain how to build (or fine-tune) a Convolutional Neural Network (CNN) by ourselves. **Keywords.** image, audio, video, multimedia, image classification, deep learning **Objectives:** -- Learn how to transform multimedia data into useful inputs for computational analysis - - Understand how to conduct deep learning to automatic classification of images +- Learn how to transform multimedia data into useful inputs for computational analysis +- Understand how to conduct deep learning to automatic classification of images -::: {.callout-note icon=false collapse=true} +::: {.callout-note icon="false" collapse="true"} ## Packages used in this chapter -This chapter uses *tesseract* (generic) and Google 'Cloud Speech' API (*googleLanguageR* and *google-cloud-language* in Python) to convert images or audio files into text. We will use *PIL* (Python) and *imagemagic* (generic) to convert pictures as inputs; and *Tensorflow* and *keras* (both in Python and R) to build and fine-tune CNNs. +This chapter uses *tesseract* (generic) and Google 'Cloud Speech' API (*googleLanguageR* and *google-cloud-language* in Python) to convert images or audio files into text. We will use *PIL* (Python) and *imagemagic* (generic) to convert pictures as inputs; and *Tensorflow* and *keras* (both in Python and R) to build and fine-tune CNNs. 
-You can install these and other auxiliary packages with the code below if needed (see Section [-@sec-installing] for more details): +You can install these and other auxiliary packages with the code below if needed (see Section [-@sec-installing] for more details): -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python chapter14install-python} #| eval: false !pip3 install Pillow requests numpy sklearn !pip3 install tensorflow keras ``` + ## R code + ```{r chapter14install-r} #| eval: false install.packages(c("magick", "glue","lsa", @@ -35,10 +36,12 @@ install.packages(c("magick", "glue","lsa", "tensorflow","keras")) ``` ::: - After installing, you need to import (activate) the packages every session: -::: {.panel-tabset} +After installing, you need to import (activate) the packages every session: + +::: panel-tabset ## Python code + ```{python chapter14library-python} import matplotlib.pyplot as plt from PIL import Image @@ -56,7 +59,9 @@ import keras from tensorflow.keras.applications.resnet50 import ResNet50 ``` + ## R code + ```{r chapter14library-r} library(magick) library(lsa) @@ -72,8 +77,7 @@ library(keras) ## Beyond Text Analysis: Images, Audio and Video {#sec-beyond} -A book about the *computational analysis of communication* would be incomplete without a chapter dedicated to analyzing visual data. -In fact, if you think of the possible contents derived from social, cultural and political dynamics in the current digital landscape, you will realize that written content is only a limited slice of the bigger cake. Humans produce much more oral content than text messages, and are more agile in deciphering sounds and visual content. Digitalization of social and political life, as well as the explosion of self-generated digital content in the web and social media, have provoked an unprecedented amount of multimedia content that deserve to be included in many types of research. +A book about the *computational analysis of communication* would be incomplete without a chapter dedicated to analyzing visual data. In fact, if you think of the possible contents derived from social, cultural and political dynamics in the current digital landscape, you will realize that written content is only a limited slice of the bigger cake. Humans produce much more oral content than text messages, and are more agile in deciphering sounds and visual content. Digitalization of social and political life, as well as the explosion of self-generated digital content in the web and social media, have provoked an unprecedented amount of multimedia content that deserve to be included in many types of research. Just imagine a collection of digital recorded radio stations, or the enormous amount of pictures produced every day on Instagram, or even the millions of videos of social interest uploaded on Youtube. These are definitely goldmines for social researchers who traditionally used manual techniques to analyze just a very small portion of this multimedia content. However, it is also true that computational techniques to analyze audio, images or video are still little developed in social sciences given the difficulty of application for non-computational practitioners and the novelty of the discoveries in fields such as computer vision. 
@@ -83,9 +87,11 @@ You are probably already familiar with digital formats of images (.jpg, .bmp, .g In the case of audio, there are many useful computational approaches to do research over these contents: from voice recognition, audio sentiment analysis or sound classification, to automatic generation of music. Recent advances in the field of artificial intelligence have created a prosperous and diversified field with multiple academic and commercial applications. Nevertheless, computational social scientists can obtain great insights just by using specific applications such as speech-to-text transformation and then apply text analytics (already explained in chapters 9, 10, and 11) to the results. As you will see in Section [-@sec-apivisions], there are some useful libraries in R and Python to use pre-trained models to transcribe voice in different languages. -Even when this approach is quite limited (just a small portion of the audio analytics world) and constrained (we will not address how to create the models), it will show how a specific, simple and powerful application of the automatic analysis of audio inputs can help answering many social questions (e.g., what are the topics of a natural conversation, what are the sentiments expressed in the scripts of radio news pieces, or which actors are named in oral speeches of any political party). In fact, automated analysis of audio can enable new research questions, different from those typically applied to text analysis. This is the case of the research by @knox2021dynamic, who used a computational approach over audio data from the Supreme Court Oral Arguments (407 arguments and 153 hours of audio, comprising over 66000 justice utterances and 44 million moments) to demonstrate that some crucial information such as the skepticism of legal arguments was transmitted by vocal delivery (e.g., speech tone), something indecipherable to text analysis. Or we could also mention the work by @dietrich2019pitch who computationally analyzed the vocal pitch of more than 70000 Congressional floor audio speeches and found that female members of the Congress spoke with greater *emotional intensity* when talking about women. +Even when this approach is quite limited (just a small portion of the audio analytics world) and constrained (we will not address how to create the models), it will show how a specific, simple and powerful application of the automatic analysis of audio inputs can help answering many social questions (e.g., what are the topics of a natural conversation, what are the sentiments expressed in the scripts of radio news pieces, or which actors are named in oral speeches of any political party). In fact, automated analysis of audio can enable new research questions, different from those typically applied to text analysis. This is the case of the research by @knox2021dynamic, who used a computational approach over audio data from the Supreme Court Oral Arguments (407 arguments and 153 hours of audio, comprising over 66000 justice utterances and 44 million moments) to demonstrate that some crucial information such as the skepticism of legal arguments was transmitted by vocal delivery (e.g., speech tone), something indecipherable to text analysis. Or we could also mention the work by @dietrich2019pitch who computationally analyzed the vocal pitch of more than 70000 Congressional floor audio speeches and found that female members of the Congress spoke with greater *emotional intensity* when talking about women. 
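To give a flavor of what the speech-to-text step mentioned above looks like in code, here is a minimal, heavily simplified sketch in Python. It assumes a Google Cloud account with credentials already configured and the *google-cloud-speech* client library (one way of reaching the Speech-to-Text API; the chapter's own examples may rely on a different client), and the audio file name is purely hypothetical:

```python
# Minimal sketch: transcribing a short local audio file with Google's
# Speech-to-Text API. Assumes the google-cloud-speech package is installed
# and GOOGLE_APPLICATION_CREDENTIALS points to a valid service-account key.
from google.cloud import speech

client = speech.SpeechClient()
with open("interview.wav", "rb") as f:  # hypothetical audio file
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
response = client.recognize(config=config, audio=audio)
transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```

The transcript is ordinary text, so the techniques from chapters 9, 10, and 11 apply to it directly.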
+ +On the other hand, applying computational methods to video input is probably the most challenging task in spite of the recent and promising advances in computer vision. For the sake of space, we will not cover specific video analytics in this chapter, but it is important to let you know that most of the computational analysis of video is based on the inspection of image and audio contents. With this standard approach you need to specify which key frames you are going to extract from the video (for example take a still image every 1000 frames) and then apply computer vision techniques (such as object detection) to those independent images. Check for example version 3 of the object detection architecture *You Only Look Once Take* (YOLOv3)[^chapter14-1] created by @yolov3, which uses a pre-trained Convolutional Neural Network (CNN) (see Section [-@sec-cnn]) to locate objects within the video (Figure [-@fig-yolo]). To answer many social science questions you might complement this frame-to-frame image analysis with an analysis of audio features. In any case, this approach will not cover some interesting aspects of the video such as the camera frame shots and movements, or the editing techniques, which certainly give more content information. -On the other hand, applying computational methods to video input is probably the most challenging task in spite of the recent and promising advances in computer vision. For the sake of space, we will not cover specific video analytics in this chapter, but it is important to let you know that most of the computational analysis of video is based on the inspection of image and audio contents. With this standard approach you need to specify which key frames you are going to extract from the video (for example take a still image every 1000 frames) and then apply computer vision techniques (such as object detection) to those independent images. Check for example version 3 of the object detection architecture *You Only Look Once Take* (YOLOv3)[^1] created by @yolov3, which uses a pre-trained Convolutional Neural Network (CNN) (see Section [-@sec-cnn]) to locate objects within the video (Figure [-@fig-yolo]). To answer many social science questions you might complement this frame-to-frame image analysis with an analysis of audio features. In any case, this approach will not cover some interesting aspects of the video such as the camera frame shots and movements, or the editing techniques, which certainly give more content information. +[^chapter14-1]: https://pjreddie.com/darknet/yolo/ ![A screen shot of a real-time video analyzed by YOLOv3 on its website https://pjreddie.com/darknet/yolo/](img/ch15_yolo.png){#fig-yolo} @@ -93,21 +99,23 @@ On the other hand, applying computational methods to video input is probably the In the following sections we will show you how to deal with multimedia contents from scratch, with special attention to image classification using state-of-the-art libraries. However, it might be a good idea to begin by using existing libraries that directly implement multimedia analyses or by connecting to commercial services to deploy classification tasks remotely using their APIs. There is a vast variety of available libraries and APIs, which we cannot cover in this book, but we will briefly mention some of them that may be useful in the computational analysis of communication. -One example in the field of visual analytics is the *optical character recognition* (OCR). 
It is true that you can train your own models to deploy multi-class classification and predict every letter, number or symbol in an image, but it will be a task that will take you a lot of effort. Instead, there are specialized libraries in both R and Python such as *tesseract* that deploy this task in seconds with high accuracy. It is still possible that you will have to apply some pre-processing to the input images in order to get them in good shape. This means that you may need to use packages such as *PIL* or *Magick* to improve the quality of the image by cropping it or by reducing the background noise. In the case of PDF files you will have to convert them first into images and then apply OCR. +One example in the field of visual analytics is the *optical character recognition* (OCR). It is true that you can train your own models to deploy multi-class classification and predict every letter, number or symbol in an image, but this would take a lot of effort. Instead, there are specialized libraries in both R and Python such as *tesseract* that perform this task in seconds with high accuracy. It is still possible that you will have to apply some pre-processing to the input images in order to get them in good shape. This means that you may need to use packages such as *PIL* or *Magick* to improve the quality of the image by cropping it or by reducing the background noise. In the case of PDF files you will have to convert them first into images and then apply OCR. In the case of more complex audio and image documents you can use more sophisticated services provided by private companies (e.g., Google, Amazon, Microsoft, etc.). These commercial services have already deployed their own machine learning models with very good results. Sometimes you can even customize some of their models, but as a rule their internal features and configuration are not transparent to the user. Moreover, these services offer friendly APIs and, usually, a free quota to deploy your first exercises. -To work with audio files, many social researchers might need to convert long conversations, radio programs, or interviews to plain text. For this propose, *Google Cloud* offer the service *Speech-to-Text*[^2] that remotely transcribes the audio to a text format supporting multiple languages (more than 125!). With this service you can remotely use the advanced deep learning models created by Google Platform from your own local computer (you must have an account and connect with the proper packages such as *googleLanguageR* or *google-cloud-language* in Python). +To work with audio files, many social researchers might need to convert long conversations, radio programs, or interviews to plain text. For this purpose, *Google Cloud* offers the service *Speech-to-Text*[^chapter14-2] that remotely transcribes the audio to a text format supporting multiple languages (more than 125!). With this service you can remotely use the advanced deep learning models created by the Google Cloud Platform from your own local computer (you must have an account and connect with the proper packages such as *googleLanguageR* or *google-cloud-language* in Python). -If you apply either OCR to images or Speech-to-Text recognition to audio content you will have juicy plain text to conduct NLP, sentiment analysis, topic modelling, among other techniques (see Chapter 11). 
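To make the OCR route concrete, here is a minimal sketch in Python. It assumes the *pytesseract* wrapper and the underlying Tesseract engine are installed (neither is listed among this chapter's packages), and the file name is hypothetical; the gray-scale conversion with *PIL* stands in for the kind of pre-processing discussed above:

```python
# Minimal OCR sketch: open an image, convert it to gray-scale with PIL as a
# simple pre-processing step, and let Tesseract extract the text.
import pytesseract
from PIL import Image

img = Image.open("scanned_page.png").convert("L")  # hypothetical scan
text = pytesseract.image_to_string(img, lang="eng")
print(text[:500])
```

The returned string is plain text, ready for exactly the techniques referenced above.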
Thus, it is very likely that you will have to combine different libraries and services to perform a complete computational pipeline, even jumping from R to Python, and vice versa! +[^chapter14-2]: https://cloud.google.com/speech-to-text -Finally, we would like to mention the existence of the commercial services of *autotaggers*, such as Google's Cloud Vision, Microsoft's Computer Vision or Amazon's Recognition. For example, if you connect to the services of Amazon's Recognition you can not only detect and classify images, but also conduct sentiment analysis over faces or predict sensitive contents within the images. As in the case of Google Cloud, you will have to obtain commercially sold credentials to be able to connect to Amazon's Recognition API (although you get a free initial "quota" of API access calls before you are required to pay for usage). This approach has two main advantages. The first is the access to a very well trained and validated model (continuously re-trained) over millions of images and with the participation of thousands of coders. The second is the scalability because you can store and analyze images at scale at a very good speed using cloud computing services. +If you apply either OCR to images or Speech-to-Text recognition to audio content you will have juicy plain text to conduct NLP, sentiment analysis, topic modelling, among other techniques (see Chapter 11). Thus, it is very likely that you will have to combine different libraries and services to perform a complete computational pipeline, even jumping from R to Python, and vice versa! + +Finally, we would like to mention the existence of the commercial services of *autotaggers*, such as Google's Cloud Vision, Microsoft's Computer Vision or Amazon's Recognition. For example, if you connect to the services of Amazon's Recognition you can not only detect and classify images, but also conduct sentiment analysis over faces or predict sensitive contents within the images. As in the case of Google Cloud, you will have to obtain commercially sold credentials to be able to connect to Amazon's Recognition API (although you get a free initial "quota" of API access calls before you are required to pay for usage). This approach has two main advantages. The first is the access to a very well trained and validated model (continuously re-trained) over millions of images and with the participation of thousands of coders. The second is the scalability because you can store and analyze images at scale at a very good speed using cloud computing services. ![A photograph of refugees on a lifeboat, used as an input for Amazon's Recognition API. The commercial service detects in the pictures classes such as clothing, apparel, human, person, life jacket or vest.](img/ch15_refugees.png){#fig-refugees} -As an example, you can use Amazon's Recognition to detect objects in a news photograph of refugees in a lifeboat (Figure [-@fig-refugees]) and you will obtain a set of accurate labels: *Clothing* (99.95\%), *Apparel* (99.95\%), *Human* (99.70\%), *Person* (99.70\%), *Life jacket* (99.43\%) and *Vest* (99.43\%). With a lower confidence you will also find labels such as *Coat* (67.39\%) and *People* (66.78\%). This example also highlights the need for validation, and the difficulty of grasping complex concepts in automated analyses: while all of these labels are arguably correct, it is safe to say that they fail to actually grasp the essence of the picture and the social context. 
One may even go as far as saying that -- knowing the picture is about refugees -- some of these labels, were they given by a human to describe the picture, would sound pretty cynical. +As an example, you can use Amazon's Recognition to detect objects in a news photograph of refugees in a lifeboat (Figure [-@fig-refugees]) and you will obtain a set of accurate labels: *Clothing* (99.95%), *Apparel* (99.95%), *Human* (99.70%), *Person* (99.70%), *Life jacket* (99.43%) and *Vest* (99.43%). With a lower confidence you will also find labels such as *Coat* (67.39%) and *People* (66.78%). This example also highlights the need for validation, and the difficulty of grasping complex concepts in automated analyses: while all of these labels are arguably correct, it is safe to say that they fail to actually grasp the essence of the picture and the social context. One may even go as far as saying that -- knowing the picture is about refugees -- some of these labels, were they given by a human to describe the picture, would sound pretty cynical. -In Section [-@sec-cnn] we will use this very same image (stored as `myimg2_RGB`) to detect objects using a classification model trained with an open-access database of images (ImageNet). You will find that there are some different predictions in both methods, but especially that the time to conduct the classification is shorter in the commercial service, since we don't have to train or choose a model. As you may imagine, you can neither modify the commercial models nor have access to their internal details, which is a strong limitation if you want to build your own customized classification system. +In Section [-@sec-cnn] we will use this very same image (stored as `myimg2_RGB`) to detect objects using a classification model trained with an open-access database of images (ImageNet). You will find that there are some different predictions in both methods, but especially that the time to conduct the classification is shorter in the commercial service, since we don't have to train or choose a model. As you may imagine, you can neither modify the commercial models nor have access to their internal details, which is a strong limitation if you want to build your own customized classification system. ## Storing, Representing, and Converting Images {#sec-storing} @@ -117,13 +125,13 @@ To perform basic image manipulation we have to: (i) load images and transform th You can load any image as an object into your workspace as we show in Example [-@exm-loadimg]. In this case we load two pictures of refugees published by mainstream media in Europe (see @amores2019visual), one is a JPG and the other is a PNG file. For this basic loading step we used the `open` function of the `Image` module in *pil* and `image_read` function in *imagemagik*. The JPG image file is a $805\times 453$ picture with the color model *RGB* and the PNG is a $1540\times 978$ picture with the color model *RGBA*. As you may notice the two objects have different formats, sizes and color models, which means that there is little analysis you can do if you don't create a standard mathematical representation of both. 
-::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-loadimg} Loading JPG and PNG pictures as objects -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python loadimg-python} myimg1 = Image.open( requests.get("https://cssbook.net/d/259_3_32_15.jpg", stream=True).raw @@ -135,7 +143,9 @@ print(myimg1) print(myimg2) ``` + ## R code + ```{r loadimg-r} myimg1 = image_read( "https://cssbook.net/d/259_3_32_15.jpg") @@ -155,13 +165,13 @@ In a black-and-white picture we will only have one color (gray-scale), with the In Example [-@exm-imagel] we convert our original JPG picture to gray-scale and then create an object with the mathematical representation (a $453 \times 805$ matrix). -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-imagel} Converting images to gray-scale and creating a two-dimensional array -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python imagel-python} #| cache: true myimg1_L = myimg1.convert("L") @@ -171,7 +181,9 @@ print(type(myimg1_L_array)) print(myimg1_L_array.shape) ``` + ## R code + ```{r imagel-r} #| cache: true myimg1_L = image_convert(myimg1, colorspace = "gray") @@ -185,19 +197,23 @@ print(dim(myimg1_L_array)) ::: ::: -By contrast, color images will have multiple color channels that depend on the color model you chose. One standard color model is the three-channel RGB (*red*, *green* and *blue*), but you can find other variations in the chosen colors and the number of channels such as: RYB (*red*, *yellow* and *blue*), RGBA (*red*, *green*, *blue* and *alpha*[^3] ) or CMYK (*cyan*, *magenta*, *yellow* and *key*[^4]). Importantly, while schemes used for printing such as CMYK are *substractive* (setting all colors to their highest value results in black, setting them to their lowest value results in white), schemes used for computer and television screens (such as RGB) are *additive*: setting all of the colors to their maximal value results in white (pretty much the opposite as what you got with your paintbox in primary school). +By contrast, color images will have multiple color channels that depend on the color model you chose. One standard color model is the three-channel RGB (*red*, *green* and *blue*), but you can find other variations in the chosen colors and the number of channels such as: RYB (*red*, *yellow* and *blue*), RGBA (*red*, *green*, *blue* and *alpha*[^chapter14-3] ) or CMYK (*cyan*, *magenta*, *yellow* and *key*[^chapter14-4]). Importantly, while schemes used for printing such as CMYK are *substractive* (setting all colors to their highest value results in black, setting them to their lowest value results in white), schemes used for computer and television screens (such as RGB) are *additive*: setting all of the colors to their maximal value results in white (pretty much the opposite as what you got with your paintbox in primary school). + +[^chapter14-3]: Alpha refers to the opacity of each pixel. + +[^chapter14-4]: Key refers to *black*. We will mostly use RGB in this book since it is the most used representation in the state-of-the-art literature in computer vision given that normally these color channels yield more accurate models. RGB's mathematical representation will be a three-dimensional matrix or a collection of three two-dimensional arrays (one for each color) as we showed in figure [-@fig-pixel]. 
Then an RGB $224 \times 224$ picture will have 50176 pixel intensities for each of the three colors, or in other words a total of 150528 integers! Now, in Example [-@exm-imagergb] we convert our original JPG file to a RGB object and then create a new object with the mathematical representation (a $453 \times 805 \times 3$ matrix). -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-imagergb} Converting images to RGB color model and creating three two-dimensional arrays -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python imagergb-python} myimg1_RGB = myimg1.convert("RGB") @@ -207,7 +223,9 @@ print(type(myimg1_RGB_array)) print(myimg1_RGB_array.shape) ``` + ## R code + ```{r imagergb-r} myimg1_RGB = image_convert(myimg1, colorspace = "RGB") print(class(myimg1_RGB)) @@ -220,18 +238,17 @@ print(dim(myimg1_RGB_array)) ::: ::: -Instead of pixels, there are other ways to store digital images. One of them is the *vector graphics*, with formats such as .ai, .eps, .svg or .drw. Differently to bitmap images, they don't have a grid of dots but a set of *paths* (lines, triangles, square, curvy shapes, etc.) that have a start and end point, so simple and complex images are created with paths. The great advantage of this format is that images do not get "pixelated" when you enlarge them because the paths can easily be transformed while remaining smooth. However, to obtain the standard mathematical representation of images you can convert the vector graphics to raster graphics (the way back is a bit more difficult and often only possible by approximation). - -Sometimes you need to convert your image to a specific size. For example, in the case of image classification this is a very important step since all the input images of the model must have the same size. For this reason, one of the most common tasks in the preprocessing stage is to change the dimensions of the image in order to adjust width and height to a specific size. In Example [-@exm-resize] we use the `resize` method provided by *pil* and the `image_scale` function in *imagemagik* to reduce the first of our original pictures in RGB (`myimg1_RGB`) to 25\% . Notice that we first obtain the original dimensions of the photograph -(i.e. `myimg1_RGB.width` or `image_info(myimg1_RGB)['width'][[1]]`) and then multiply it by 0.25 in order to obtain the new size which is the argument required by the functions. +Instead of pixels, there are other ways to store digital images. One of them is the *vector graphics*, with formats such as .ai, .eps, .svg or .drw. Differently to bitmap images, they don't have a grid of dots but a set of *paths* (lines, triangles, square, curvy shapes, etc.) that have a start and end point, so simple and complex images are created with paths. The great advantage of this format is that images do not get "pixelated" when you enlarge them because the paths can easily be transformed while remaining smooth. However, to obtain the standard mathematical representation of images you can convert the vector graphics to raster graphics (the way back is a bit more difficult and often only possible by approximation). -::: {.callout-note appearance="simple" icon=false} +Sometimes you need to convert your image to a specific size. For example, in the case of image classification this is a very important step since all the input images of the model must have the same size. 
For this reason, one of the most common tasks in the preprocessing stage is to change the dimensions of the image in order to adjust width and height to a specific size. In Example [-@exm-resize] we use the `resize` method provided by *pil* and the `image_scale` function in *imagemagik* to reduce the first of our original pictures in RGB (`myimg1_RGB`) to 25% . Notice that we first obtain the original dimensions of the photograph (i.e. `myimg1_RGB.width` or `image_info(myimg1_RGB)['width'][[1]]`) and then multiply it by 0.25 in order to obtain the new size which is the argument required by the functions. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-resize} -Resize to 25\% and visualize a picture +Resize to 25% and visualize a picture -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python resize-python} #| results: hide #| cache: true @@ -241,7 +258,9 @@ height = int(myimg1_RGB.height * 0.25) myimg1_RGB_25 = myimg1_RGB.resize((width, height)) plt.imshow(myimg1_RGB_25) ``` + ## R code + ```{r resize-r} #| cache: true #Resize and visalize myimg1. Reduce to 25% @@ -255,20 +274,22 @@ plot(myimg1_RGB_25) Now, using the same functions of the latter example, we specify in Example [-@exm-resize2] how to resize the same picture to $224 \times 244$, which is one of the standard dimensions in computer vision. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-resize2} Resize to $224 \times 224$ and visualize a picture -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python resize2-python} #| results: hide # Resize to 224 x 224 myimg1_RGB_224 = myimg1_RGB.resize((224, 224)) plt.imshow(myimg1_RGB_224) ``` + ## R code + ```{r resize2-r} # Resize and visalize myimg1. Resize to 224 x 224 # The ! is used to specify an exact width and height @@ -280,17 +301,17 @@ plot(myimg1_RGB_224) ::: ::: -You may have noticed that the new image has now the correct width and height but that it looks deformed. The reason is that the original picture was not squared and our order was to force it to fit into a $224 \times 224$ square, losing its original aspect. There are different alternatives to solving this issue, but probably the most extended is to *crop* the original image to create a squared picture. As you can see in Example [-@exm-crop] we can create a function that first determines the orientation of the picture (vertical versus horizontal) and then cut the margins (up and down if it is vertical; and left and right if it is horizontal) to create a square. After applying this ad hoc function `crop` to the original image we can resize again to obtain a non-distorted $224 \times 224$ image. +You may have noticed that the new image has now the correct width and height but that it looks deformed. The reason is that the original picture was not squared and our order was to force it to fit into a $224 \times 224$ square, losing its original aspect. There are different alternatives to solving this issue, but probably the most extended is to *crop* the original image to create a squared picture. As you can see in Example [-@exm-crop] we can create a function that first determines the orientation of the picture (vertical versus horizontal) and then cut the margins (up and down if it is vertical; and left and right if it is horizontal) to create a square. After applying this ad hoc function `crop` to the original image we can resize again to obtain a non-distorted $224 \times 224$ image. 
Of course you are now losing part of the picture information, so you may think of other alternatives such as filling a couple of sides with blank pixels (or `padding`) in order to create the square by adding information instead of removing it. -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-crop} Function to crop the image to create a square and the resize the picture -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python crop-python} #| results: hide # Crop and resize to 224 x 224 @@ -319,7 +340,9 @@ myimg1_RGB_crop_224 = myimg1_RGB_crop.resize((224, 224)) plt.imshow(myimg1_RGB_crop_224) ``` + ## R code + ```{r crop-r} #Crop and resize to 224 x 224 #Create function @@ -345,16 +368,15 @@ plot(myimg1_RGB_crop_224) ::: ::: -You can also adjust the orientation of the image, flip it, or change its background, among other commands. These techniques might be useful for creating extra images in order to enlarge the training set in image classification (see Section [-@sec-cnn]). This is called *data augmentation* and consists of duplicating the initial examples on which the model was trained and altering them so that the algorithm can be more robust and generalize better. In Example [-@exm-rotate] we used the `rotate` method in *pil* and `image_rotate` function in *imagemagik* -to rotate 45 degrees the above resized image `myimg1_RGB_224` to see how easily we can get an alternative picture with similar information to include in an augmented training set. - -::: {.callout-note appearance="simple" icon=false} +You can also adjust the orientation of the image, flip it, or change its background, among other commands. These techniques might be useful for creating extra images in order to enlarge the training set in image classification (see Section [-@sec-cnn]). This is called *data augmentation* and consists of duplicating the initial examples on which the model was trained and altering them so that the algorithm can be more robust and generalize better. In Example [-@exm-rotate] we used the `rotate` method in *pil* and `image_rotate` function in *imagemagik* to rotate 45 degrees the above resized image `myimg1_RGB_224` to see how easily we can get an alternative picture with similar information to include in an augmented training set. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-rotate} Rotating a picture 45 degrees -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python rotate-python} #| results: hide #| cache: true @@ -363,7 +385,9 @@ myimg1_RGB_224_rot = myimg1_RGB_224.rotate(-45) plt.imshow(myimg1_RGB_224_rot) ``` + ## R code + ```{r rotate-r} #Rotate 45 degrees #| cache: true @@ -375,15 +399,15 @@ plot(myimg1_RGB_224_rot) ::: ::: -Finally, the numerical representation of visual content can help us to *compare* pictures in order to find similar or even duplicate images. Let's take the case of RGB images which in Example [-@exm-imagergb] we showed how to transform to a three two-dimensional array. If we now convert the three-dimensional matrix of the image into a flattened vector we can use this simpler numerical representation to estimate similarities. Specifically, as we do in Example [-@exm-flatten], we can take the vectors of two *flattened images* of resized $15 \times 15$ images to ease computation (`img_vect1` and `img_vect2`) and use the *cosine similarity* to estimate how akin those images are. 
We stacked the two vectors in a matrix and then used the `cosine_similarity` function of the `metrics` module of the *sklearn* package in Python and the `cosine` function of the *lsa* package in R. - -::: {.callout-note appearance="simple" icon=false} +Finally, the numerical representation of visual content can help us to *compare* pictures in order to find similar or even duplicate images. Let's take the case of RGB images which in Example [-@exm-imagergb] we showed how to transform to a three two-dimensional array. If we now convert the three-dimensional matrix of the image into a flattened vector we can use this simpler numerical representation to estimate similarities. Specifically, as we do in Example [-@exm-flatten], we can take the vectors of two *flattened images* of resized $15 \times 15$ images to ease computation (`img_vect1` and `img_vect2`) and use the *cosine similarity* to estimate how akin those images are. We stacked the two vectors in a matrix and then used the `cosine_similarity` function of the `metrics` module of the *sklearn* package in Python and the `cosine` function of the *lsa* package in R. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-flatten} Comparing two flattened vectors to detect similarities between images -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python flatten-python} # Create two 15x15 small images to compare @@ -405,7 +429,9 @@ sim_mat = cosine_similarity(matrix) sim_mat ``` + ## R code + ```{r flatten-r} #Create two 15x15 small images to compare @@ -436,51 +462,54 @@ As you can see in the resulting matrix when the images are compared with themsel ## Image Classification {#sec-cnn} -The implementation of computational image classification can help to answer many scientific questions, from testing some traditional hypotheses to opening new fields of interest in social science research. Just think about the potential of detecting at scale *who* appears in news photographs or what are the facial *emotions* expressed in the profiles of a social network. Moreover, imagine you can automatically label whether an image contains a certain action or not. For example, this is the case of @williams2020images who conducted a binary classification of pictures related to the *Black Lives Matter* movement in order to model if a picture was a protest or not, which can help to understand the extent to which the media covered a relevant social and political issue. +The implementation of computational image classification can help to answer many scientific questions, from testing some traditional hypotheses to opening new fields of interest in social science research. Just think about the potential of detecting at scale *who* appears in news photographs or what are the facial *emotions* expressed in the profiles of a social network. Moreover, imagine you can automatically label whether an image contains a certain action or not. For example, this is the case of @williams2020images who conducted a binary classification of pictures related to the *Black Lives Matter* movement in order to model if a picture was a protest or not, which can help to understand the extent to which the media covered a relevant social and political issue. 
There are many other excellent examples of how you can adopt image classification tasks to answer specific research questions in social sciences such as those of @horiuchi2012should who detected smiles in images of politicians to estimate the effects of facial appearance on election outcomes; or the work by @peng2018same who used automated recognition of facial traits in American politicians to investigating the bias of media portrayals. In this section, we will learn how to conduct computational image classification which is probably the most extended computer vision application in communication and social sciences (see Table [-@tbl-visionlingo] for some terminology). We will first discuss how to apply a *shallow* algorithm and then a deep-learning approach, given a labelled data set. -|Computer vision lingo | Definition| -|-|-| -|bitmap | Format to store digital images using a rectangular grid of points of colors. Also called "raster image".| -|pixel | Stands for "picture element" and is the smallest point of a bitmap image| -|color model | Mathematical representation of colors in a picture. The standard in computer vision is RGB, but there are others such as RYB, RGBA or CMYK.| -|vector graphic | Format to store digital images using lines and curves formed by points.| -|data augmentation | Technique to increase the training set of images by creating new ones base on the modification of some of the originals (cropping, rotating, etc.)| -|image classification | Machine learning task to predict a class of an image based on a model. State-of-the-art image classification is conducted with Convolutional Neural Networks (CNN). Related tasks are object detection and image segmentation.| -|activation function | Parameter of a CNN that defines the output of a layer given the inputs of the previous layer. Some usual activation functions in image classification are sigmoid, softmax, or RELU.| -|loss function | Parameter of a CNN which accounts for the difference between the prediction and the target variable (confidence in the prediction). A common one in image classification is the cross entropy loss.| -|optimization | Parameter of a CNN that updates weights and biases in order to reduce the error. Some common optimizers in image classification are Stochastic Gradient Descent and ADAM.| -|transfer learning | Using trained layers of other CNN architectures to fine tune a new model investing less resources (e.g. training data).| -: Some computer vision concepts used in computational analysis of communication {#tbl-visionlingo} - -Technically, in an image classification task we train a model with examples (e.g., a corpus of pictures with labels) in order to predict the category of any given new sample. It is the same logic used in supervised text classification explained in Section [-@sec-supervised] but using images instead of texts. For example, if we show many pictures of cats and houses the algorithm would learn the constant features in each and will tell you with some degree of confidence if a new picture contains either a cat or a house. It is the same with letters, numbers, objects or faces, and you can apply either binary or multi-class classification. Just think when your vehicle registration plate is recognized by a camera or when your face is automatically labelled in pictures posted on Facebook. 
+| Computer vision lingo | Definition | +|------------------------------------|------------------------------------| +| bitmap | Format to store digital images using a rectangular grid of points of colors. Also called "raster image". | +| pixel | Stands for "picture element" and is the smallest point of a bitmap image | +| color model | Mathematical representation of colors in a picture. The standard in computer vision is RGB, but there are others such as RYB, RGBA or CMYK. | +| vector graphic | Format to store digital images using lines and curves formed by points. | +| data augmentation | Technique to increase the training set of images by creating new ones base on the modification of some of the originals (cropping, rotating, etc.) | +| image classification | Machine learning task to predict a class of an image based on a model. State-of-the-art image classification is conducted with Convolutional Neural Networks (CNN). Related tasks are object detection and image segmentation. | +| activation function | Parameter of a CNN that defines the output of a layer given the inputs of the previous layer. Some usual activation functions in image classification are sigmoid, softmax, or RELU. | +| loss function | Parameter of a CNN which accounts for the difference between the prediction and the target variable (confidence in the prediction). A common one in image classification is the cross entropy loss. | +| optimization | Parameter of a CNN that updates weights and biases in order to reduce the error. Some common optimizers in image classification are Stochastic Gradient Descent and ADAM. | +| transfer learning | Using trained layers of other CNN architectures to fine tune a new model investing less resources (e.g. training data). | + +: Some computer vision concepts used in computational analysis of communication {#tbl-visionlingo} + +Technically, in an image classification task we train a model with examples (e.g., a corpus of pictures with labels) in order to predict the category of any given new sample. It is the same logic used in supervised text classification explained in Section [-@sec-supervised] but using images instead of texts. For example, if we show many pictures of cats and houses the algorithm would learn the constant features in each and will tell you with some degree of confidence if a new picture contains either a cat or a house. It is the same with letters, numbers, objects or faces, and you can apply either binary or multi-class classification. Just think when your vehicle registration plate is recognized by a camera or when your face is automatically labelled in pictures posted on Facebook. Beyond image classification we have other specific tasks in computer vision such as *object detection* or *semantic segmentation* (Figure 14.4). To conduct object detection we first have to locate all the possible objects contained in a picture by predicting a bounding box (i.e., the four points corresponding to the vertical and horizontal coordinates of the center of the object), which is normally a regression task. Once the bounding boxes are placed around the objects, we must apply multi-class classification as explained earlier. In the case of semantic segmentation, instead of classifying objects, we classify each pixel of the image according to the class of the object the pixel belongs to, which means that different objects of the same class might not be distinguished. 
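To make this distinction more tangible, the minimal sketch below (in Python, with made-up numbers rather than the output of any model used in this chapter) shows what the two kinds of output typically look like as data structures: object detection returns a bounding box and a class label per detected object, while semantic segmentation returns a class label for every single pixel.

```python
import numpy as np

# Object detection: one entry per detected object, each with a bounding
# box (center x, center y, width, height, in pixels) and a class label.
detections = [
    {"box": (120, 80, 60, 40), "label": "person", "confidence": 0.91},
    {"box": (300, 150, 90, 55), "label": "boat", "confidence": 0.78},
]

# Semantic segmentation: a label for every pixel of a (toy) 4x6 image,
# so different objects of the same class are not distinguished.
# 0 = background, 1 = person, 2 = boat
segmentation_mask = np.array([
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

print(len(detections), "objects detected")
print("Pixels per class:", dict(zip(*np.unique(segmentation_mask, return_counts=True))))
```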
See @geron2019hands for a more detailed explanation and graphical examples of object detection versus image segmentation.

-![ Object detection (left) versus semantic segmentation (right). Source: @geron2019hands](img/ch15_location.png){#fig-location}
+![Object detection (left) versus semantic segmentation (right). Source: @geron2019hands](img/ch15_location.png){#fig-location}

It is beyond the scope of this book to address the implementation of object detection or semantic segmentation, but we will focus on how to conduct basic image classification in state-of-the-art libraries in R and Python. As you may have imagined, we will need some already-labelled images to have a proper training set. It is also out of the scope of this chapter to collect and annotate the images, which is the reason why we will mostly rely on pre-existing image databases (i.e., MNIST or Fashion MNIST) and pre-trained models (i.e., CNN architectures).

### Basic Classification with Shallow Algorithms {#sec-shallow}

-In Chapter [-@sec-chap-introsml] we introduced you to the exciting world of machine learning and in Section [-@sec-supervised] we introduced the *supervised* approach to classify texts. Most of the discussed models used so-called *shallow* algorithms such as Naïve Bayes or Support Vector Machines rather than the various large neural network models called *deep learning*. As we will see in the next section, deep neural networks are nowadays the best option for complex tasks in image classification. However, we will now explain how to conduct simple multi-class classification of images that contain numbers with a shallow algorithm.
+In Chapter [-@sec-chap-introsml] we introduced you to the exciting world of machine learning and in Section [-@sec-supervised] we introduced the *supervised* approach to classify texts. Most of the discussed models used so-called *shallow* algorithms such as Naïve Bayes or Support Vector Machines rather than the various large neural network models called *deep learning*. As we will see in the next section, deep neural networks are nowadays the best option for complex tasks in image classification. However, we will now explain how to conduct simple multi-class classification of images that contain numbers with a shallow algorithm.

-Let us begin by training a model to recognize numbers using 70000 small images of digits handwritten from the Modified National Institute of Standards and Technology (MNIST) dataset ([@lecun1998gradient]). This popular training corpus contains gray-scale examples of numbers written by American students and workers and it is usually employed to test machine learning models (60000 for training and 10000 for testing). The image sizes are $28 \times 28$, which generates 784 features for each image, with pixels values from white to black represented by a 0--255 scales. In Figure [-@fig-numbers] you can observe the first 10 handwritten numbers used in both training and test set.
+Let us begin by training a model to recognize numbers using 70000 small images of handwritten digits from the Modified National Institute of Standards and Technology (MNIST) dataset [@lecun1998gradient]. This popular training corpus contains gray-scale examples of numbers written by American students and workers and it is usually employed to test machine learning models (60000 for training and 10000 for testing). The image sizes are $28 \times 28$, which generates 784 features for each image, with pixel values from white to black represented on a 0--255 scale.
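To get a feeling for what these 784 features are, the following minimal sketch (using a random array as a stand-in for a real digit; the actual data are loaded in Example [-@exm-mnist] below) flattens a $28 \times 28$ grid into a single feature vector and rescales the 0--255 values to the 0--1 range used later for the neural networks.

```python
import numpy as np

# Stand-in for one MNIST digit: a 28x28 grid of integer gray values (0-255)
rng = np.random.default_rng(42)
digit = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flattening the grid row by row yields the 784 features used by the classifier
features = digit.reshape(-1)
print(features.shape)                   # (784,)
print(features.min(), features.max())   # values somewhere between 0 and 255

# Rescaling to the 0-1 range, as done later for the neural network examples
features_scaled = features / 255.0
print(features_scaled.max() <= 1.0)
```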
In Figure [-@fig-numbers] you can observe the first 10 handwritten numbers used in both training and test set. ![First 10 handwritten digits from the training and test set of the MNIST.](img/ch15_numbers.png){#fig-numbers} -You can download the MNIST images from its project web page[^5], but many libraries also offer this dataset. In Example [-@exm-mnist] we use the `read_mnist` function from the *dslabs* package (Data Science Labs) in R and the `fetch_openml` function from the *sklearn* package (`datasets` module) in Python to read and load a `mnist` object into our workspace. We then create the four necessary objects (`X_train`, `X_test`, `y_train`, `y_test`) to generate a ML model and print the first numbers in training and test sets and check they coincide with those in [-@fig-numbers]. +You can download the MNIST images from its project web page[^chapter14-5], but many libraries also offer this dataset. In Example [-@exm-mnist] we use the `read_mnist` function from the *dslabs* package (Data Science Labs) in R and the `fetch_openml` function from the *sklearn* package (`datasets` module) in Python to read and load a `mnist` object into our workspace. We then create the four necessary objects (`X_train`, `X_test`, `y_train`, `y_test`) to generate a ML model and print the first numbers in training and test sets and check they coincide with those in [-@fig-numbers]. -::: {.callout-note appearance="simple" icon=false} +[^chapter14-5]: http://yann.lecun.com/exdb/mnist/ +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-mnist} Loading MNIST dataset and preparing training and test sets -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python mnist-python0} #| echo: false # fetch_openml is really slow and caching does not really seem to work, so do manual caching. We can probably do this better somehow @@ -499,6 +528,7 @@ else: with cache.open('wb') as f: pickle.dump([X_train, X_test, y_train, y_test], f) ``` + ```{python mnist-python} #| eval: false mnist = fetch_openml("mnist_784", version=1) @@ -507,23 +537,30 @@ y = y.astype(np.uint8) X_train, X_test = X[:60000], X[60000:] y_train, y_test = y[:60000], y[60000:] ``` + ```{python mnist-python2} print("Numbers in training set= ", y_train[0:10]) print("Numbers in test set= ", y_test[0:10]) ``` + ## R code + ```{r mnist-r} -mnist = read_mnist() +#| cache: true +# read_mnist keeps timeouting for me, so disabled for now +# Maybe use nist <- keras::dataset_mnist() ? But have to adapt code below as well + +# mnist = read_mnist() -X_train = mnist$train$images -y_train = factor(mnist$train$labels) -X_test = mnist$test$images -y_test = factor(mnist$test$labels) +# X_train = mnist$train$images +# y_train = factor(mnist$train$labels) +# X_test = mnist$test$images +# y_test = factor(mnist$test$labels) -print("Numbers in training set = ") -print(factor(y_train[1:10]), max.levels = 0) -print("Numbers in test set = ") -print(factor(y_test[1:10]), max.levels = 0) +# print("Numbers in training set = ") +# print(factor(y_train[1:10]), max.levels = 0) +# print("Numbers in test set = ") +# print(factor(y_test[1:10]), max.levels = 0) ``` ::: ::: @@ -531,15 +568,15 @@ print(factor(y_test[1:10]), max.levels = 0) Once we are ready to model the numbers we choose one of the shallow algorithms explained in Section [-@sec-nb2dnn] to deploy a binary or multi-class image classification task. 
In the case of binary, we should select a number of reference (for instance "3") and then create the model of that number against all the others (to answer questions such as "What's the probability of this digit of being number 3?"). On the other hand, if we choose multi-class classification our model can predict any of the ten numbers (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) included in our examples. -Now, we used the basic concepts of the Random Forest algorithm (see [@sec-randomforest]) to create and fit a model with 100 trees (`forest_clf`). In Example [-@exm-multiclass] we use again the *randomForest* package in R and *sklearn* package in Python to estimate a model for the ten classes using the corpus of 60000 images (classes were similarly balanced, $\sim$9--11\% each). As we do in the examples, you can check the predictions for the first ten images of the test set (`X_test`), which correctly correspond to the right digits, and also check the (`predictions`) for the whole test set and then get some metrics of the model. The accuracy is over 0.97 which means the classification task is performed very well. - -::: {.callout-note appearance="simple" icon=false} +Now, we used the basic concepts of the Random Forest algorithm (see [@sec-randomforest]) to create and fit a model with 100 trees (`forest_clf`). In Example [-@exm-multiclass] we use again the *randomForest* package in R and *sklearn* package in Python to estimate a model for the ten classes using the corpus of 60000 images (classes were similarly balanced, \$\sim\$9--11% each). As we do in the examples, you can check the predictions for the first ten images of the test set (`X_test`), which correctly correspond to the right digits, and also check the (`predictions`) for the whole test set and then get some metrics of the model. The accuracy is over 0.97 which means the classification task is performed very well. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-multiclass} Modeling the handwritten digits with RandomForest and predicting some outcomes -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python multiclass-python} #| cache: true forest_clf = RandomForestClassifier(n_estimators=100, random_state=42) @@ -553,16 +590,19 @@ predictions = forest_clf.predict(X_test) print("Overall Accuracy: ", accuracy_score(y_test, predictions)) ``` + ## R code + ```{r multiclass-r} #| cache: true #Multiclass classification with RandomForest -rf_clf = randomForest(X_train, y_train, ntree=100) -rf_clf -predict(rf_clf, X_test[1:10,]) -predictions = predict(rf_clf, X_test) -cm = confusionMatrix(predictions, y_test) -print(cm$overall["Accuracy"]) + +# rf_clf = randomForest(X_train, y_train, ntree=100) +# rf_clf +# predict(rf_clf, X_test[1:10,]) +# predictions = predict(rf_clf, X_test) +# cm = confusionMatrix(predictions, y_test) +# print(cm$overall["Accuracy"]) ``` ::: ::: @@ -578,21 +618,25 @@ One of the simplest DNNs architectures is the Multilayer Perceptron (MLP) which We can use MLPs for binary and multi-class classification. In the first case, we normally use a single output neuron with the *sigmoid* or *logistic* activation function (probability from 0 to 1) (see [@sec-logreg]); and in the second case we will need one output neuron per class with the *softmax* activation function (probabilities from 0 to 1 for each class but they must add up to 1 if the classes are exclusive. This is the function used in multinomial logistic regression). 
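As a small numerical illustration of that last point (a sketch with made-up scores, not one of the book's examples), the softmax function turns the ten raw output scores into probabilities that sum to one, and the predicted class is simply the one with the highest probability:

```python
import numpy as np

# Made-up raw output scores ("logits") for the ten digit classes 0-9
logits = np.array([1.2, -0.3, 0.5, 4.1, 0.0, 2.2, -1.0, 0.7, 1.5, 0.1])

# Softmax: exponentiate and normalize so that the values sum to 1
probabilities = np.exp(logits) / np.exp(logits).sum()

print(probabilities.round(3))
print("Sum of probabilities:", probabilities.sum())  # (approximately) 1.0
print("Predicted class:", probabilities.argmax())    # 3, the highest score
```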
To predict probabilities, in both cases we will need a *loss* function and the one that is normally recommended is the *cross entropy loss* or simply *log loss*.

-The state-of-the-art library for neural networks in general and for computer vision in particular is *TensorFlow*[^6] (originally created by Google and later publicly released) and the high-level Deep Learning API *Keras*, although you can find other good implementation packages such as *PyTorch* (created by Facebook), which has many straightforward functionalities and has also become popular in recent years (see for example the image classification tasks for social sciences conducted in *PyTorch* by @williams2020images). All these packages have current versions for both R and Python.
+The state-of-the-art library for neural networks in general and for computer vision in particular is *TensorFlow*[^chapter14-6] (originally created by Google and later publicly released) and the high-level Deep Learning API *Keras*, although you can find other good implementation packages such as *PyTorch* (created by Facebook), which has many straightforward functionalities and has also become popular in recent years (see for example the image classification tasks for social sciences conducted in *PyTorch* by @williams2020images). All these packages have current versions for both R and Python.

-Now, let's train an MLP to build an image classifier to recognize fashion items using the Fashion MNIST dataset[^7]. This dataset contains 70000 (60000 for training and 10000 for test) gray scale examples ($28\times 28$) of ten different classes that include ankle boots, bags, coats, dresses, pullovers, sandals, shirts, sneakers, t-shirts/tops and trousers (Figure [-@fig-fashion]). If you compare this dataset with the MINST, you will find that figures of fashion items are more complex than handwritten digits, which normally generates a lower accuracy in supervised classification.
+[^chapter14-6]: We will deploy *TensorFlow 2* in our exercises.

-![Examples of Fashion MNIST items.](img/ch15_fashion.png){#fig-fashion}
+Now, let's train an MLP to build an image classifier to recognize fashion items using the Fashion MNIST dataset[^chapter14-7]. This dataset contains 70000 (60000 for training and 10000 for test) gray scale examples ($28\times 28$) of ten different classes that include ankle boots, bags, coats, dresses, pullovers, sandals, shirts, sneakers, t-shirts/tops and trousers (Figure [-@fig-fashion]). If you compare this dataset with the MNIST, you will find that figures of fashion items are more complex than handwritten digits, which normally generates a lower accuracy in supervised classification.
+
+[^chapter14-7]: https://github.com/zalandoresearch/fashion-mnist

-You can use *Keras* to load the Fashion MNIST. In Example [-@exm-fashion] we load the complete dataset and create the necessary objects for modeling (`X_train_full`, `y_train_full`, `X_test`, `y_test`). In addition we rescaled all the input features from 0--255 to 0--1 by dividing them by 255 in order to apply Gradient Decent. Then, we obtained three sets with $28\times 28$ arrays: 60000 in the training, and 10000 in the test.
We could also generate here a validation set (e.g., `X_valid` and `y_valid`) with a given amount of records extracted from the training set (e.g., 5000), but as you will later see *Keras* allows us to automatically generate the validation set as a proportion of the training set (e.g., 0.1, which would be 6000 records in our example) when fitting the model (check the importance to work with a validation set to avoid over-fitting, explained in Section [-@sec-train]).
+![Examples of Fashion MNIST items.](img/ch15_fashion.png){#fig-fashion}

-::: {.callout-note appearance="simple" icon=false}
+You can use *Keras* to load the Fashion MNIST. In Example [-@exm-fashion] we load the complete dataset and create the necessary objects for modeling (`X_train_full`, `y_train_full`, `X_test`, `y_test`). In addition we rescaled all the input features from 0--255 to 0--1 by dividing them by 255 in order to apply Gradient Descent. Then, we obtained three sets with $28\times 28$ arrays: 60000 in the training, and 10000 in the test. We could also generate here a validation set (e.g., `X_valid` and `y_valid`) with a given number of records extracted from the training set (e.g., 5000), but as you will later see *Keras* allows us to automatically generate the validation set as a proportion of the training set (e.g., 0.1, which would be 6000 records in our example) when fitting the model (note the importance of working with a validation set to avoid over-fitting, explained in Section [-@sec-train]).
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-fashion}
Loading Fashion MNIST dataset and preparing training, test and validation sets

-::: {.panel-tabset}
+::: panel-tabset
## Python code
+
```{python fashion-python}
fashion_mnist = keras.datasets.fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
@@ -604,7 +648,9 @@ X_test = X_test / 255.0

print(X_train.shape, X_test.shape)
```
+
## R code
+
```{r fashion-r}
fashion_mnist <- dataset_fashion_mnist()
c(X_train, y_train) %<-% fashion_mnist$train
@@ -621,17 +667,17 @@ print(dim(X_test))
:::
:::

-The next step is to design the architecture of our model. There are three ways to create the models in *Keras* (*sequential*, *functional*, or *subclassing*), but there are thousands of ways to configure a deep neural network. In the case of this MLP, we have to include first an input layer with the `input_shape` equal to the image dimension ($28\times 28$ for 784 neurons). At the top of the MLP you will need a output layer with 10 neurons (the number of possible outcomes in our multi-class classification task) and a *softmax* activation function for the final probabilities for each class.
+The next step is to design the architecture of our model. There are three ways to create the models in *Keras* (*sequential*, *functional*, or *subclassing*), but there are thousands of ways to configure a deep neural network. In the case of this MLP, we have to include first an input layer with the `input_shape` equal to the image dimension ($28\times 28$ for 784 neurons). At the top of the MLP you will need an output layer with 10 neurons (the number of possible outcomes in our multi-class classification task) and a *softmax* activation function for the final probabilities for each class. In Example [-@exm-mlp] we use the *sequential* model to design our MLP layer by layer, including the above-mentioned input and output layers.
In the middle, there are many options for the configuration of the *hidden* layers: number of layers, number of neurons, activation functions, etc. As we know that each hidden layer will help to model different patterns of the image, it would be fair to include at least two of them with different numbers of neurons (significantly reducing this number in the second one) and transmit its information using the *relu* activation function. What we actually do is create an object called `model` which saves the proposed architecture. We can use the method `summary` to obtain a clear representation of the created neural network and the number of parameters of the model (266610 in this case!). -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-mlp} Creating the architecture of the MLP with Keras -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python mlp-python} model = keras.models.Sequential( [ @@ -644,15 +690,18 @@ model = keras.models.Sequential( model.summary() ``` + ## R code + ```{r mlp-r} -model = keras_model_sequential() -model %>% -layer_flatten(input_shape = c(28, 28)) %>% -layer_dense(units=300, activation="relu") %>% -layer_dense(units=100, activation="relu") %>% -layer_dense(units=10, activation="softmax") -model +# WvA: keras_model_sequential gives an error :( +# model = keras_model_sequential() +# model %>% +# layer_flatten(input_shape = c(28, 28)) %>% +# layer_dense(units=300, activation="relu") %>% +# layer_dense(units=100, activation="relu") %>% +# layer_dense(units=10, activation="softmax") +# model ``` ::: ::: @@ -660,15 +709,17 @@ model The next steps will be to `compile`, `fit`, and `evaluate` the model, similarly to what you have already done in previous exercises of this book. In Example [-@exm-model] we first include the parameters (loss, optimizer, and metrics) of the compilation step and fit the model, which might take some minutes (or even hours depending on your dataset, the architecture of you DNN and, of course, your computer). -When fitting the model you have to separate your training set into phases or *epochs*. A good rule of thumb to choose the optimal number of epochs is to stop a few iterations after the test loss stops improving[^8] (here we chose five epochs for the example). You will also have to set the proportion of the training set that will become the validation set (in this case 0.1). In addition, you can use the parameter `verbose` to choose whether to see the progress (1 for progress bar and 2 for one line per epoch) or not (0 for silent) of the training process. By using the method `evaluate` you can then obtain the final loss and accuracy, which in this case is 0.84 (but you can reach up 0.88 if you fit it with 25 epochs!). +When fitting the model you have to separate your training set into phases or *epochs*. A good rule of thumb to choose the optimal number of epochs is to stop a few iterations after the test loss stops improving[^chapter14-8] (here we chose five epochs for the example). You will also have to set the proportion of the training set that will become the validation set (in this case 0.1). In addition, you can use the parameter `verbose` to choose whether to see the progress (1 for progress bar and 2 for one line per epoch) or not (0 for silent) of the training process. By using the method `evaluate` you can then obtain the final loss and accuracy, which in this case is 0.84 (but you can reach up 0.88 if you fit it with 25 epochs!). 
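If you prefer not to pick the number of epochs entirely by hand, *Keras* also provides an `EarlyStopping` callback that stops training once the validation loss has stopped improving for a given number of epochs. A minimal sketch (reusing the compiled `model`, `X_train`, and `y_train` from the examples; the patience of three epochs is just an illustrative choice):

```python
from tensorflow import keras

# Stop training when the validation loss has not improved for three
# consecutive epochs, and restore the weights of the best epoch seen.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

# Reusing the compiled `model`, `X_train` and `y_train` from the examples
history = model.fit(
    X_train, y_train, epochs=50, validation_split=0.1,
    callbacks=[early_stop], verbose=0
)
```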
-::: {.callout-note appearance="simple" icon=false} +[^chapter14-8]: The train loss/accuracy will gradually be better and better. And the test loss/accuracy as well, in the beginning. But then, at some point train loss/acc improves but test loss/acc stops getting better. If we keep training the model for more epochs, we are just overfitting on the train set, which of course we do not want to. Specifically, we do not want to simply stop at the iteration where we got the best loss/acc for the test set, because then we are overfitting on the test set. Hence practitioners often let it run for a few more epochs after hitting the best loss/acc for the test set. Then, a final check on the validation set will really tell us how well we do out of sample. +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-model} Compiling fitting and evaluating the model for the MLP -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python model-python} model.compile( loss="sparse_categorical_crossentropy", @@ -680,30 +731,32 @@ history = model.fit(X_train, y_train, epochs=5, verbose=0, validation_split=0.1) print("Evaluation: ") print(model.evaluate(X_test, y_test, verbose=0)) ``` + ## R code + ```{r model-r} #| cache: true -model %>% compile(optimizer = "sgd", metrics = c("accuracy"), - loss = "sparse_categorical_crossentropy") -history = model %>% fit(X_train, y_train,validation_split=0.1, epochs=5, verbose=0) -print(history$metrics) -score = model %>% evaluate(X_test, y_test, verbose = 0) -print("Evaluation") -print(score) +# model %>% compile(optimizer = "sgd", metrics = c("accuracy"), +# loss = "sparse_categorical_crossentropy") +# history = model %>% fit(X_train, y_train,validation_split=0.1, epochs=5, verbose=0) +# print(history$metrics) +# score = model %>% evaluate(X_test, y_test, verbose = 0) +# print("Evaluation") +# print(score) ``` ::: ::: ::: -Finally, you can use the model to predict the classes of any new image (using `predict_classes`). In Example [-@exm-predict] we used the model to predict the classes of the first six elements of the test set. If you go back to [-@fig-fashion] you can compare these predictions ("ankle boot", "pullover", "trouser", "trouser", "shirt", and "tTrouser") with the actual first six images of the test set, and see how accurate our model was. - -::: {.callout-note appearance="simple" icon=false} +Finally, you can use the model to predict the classes of any new image (using `predict_classes`). In Example [-@exm-predict] we used the model to predict the classes of the first six elements of the test set. If you go back to [-@fig-fashion] you can compare these predictions ("ankle boot", "pullover", "trouser", "trouser", "shirt", and "tTrouser") with the actual first six images of the test set, and see how accurate our model was. 
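Note that `predict_classes` is only available in older versions of *Keras*; in recent versions of *TensorFlow* it has been removed, and the usual replacement is to take the `argmax` of the predicted probabilities yourself. A minimal sketch, reusing the fitted `model`, `X_test`, and the `class_names` list used in Example [-@exm-predict]:

```python
import numpy as np

# predict() returns one probability per class for every image ...
probabilities = model.predict(X_test[:6], verbose=0)

# ... and the predicted class is the index of the highest probability,
# which replaces the older predict_classes() shortcut.
y_pred = np.argmax(probabilities, axis=1)
print([class_names[i] for i in y_pred])
```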
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-predict} Predicting classes using the MLP -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python predict-python} #| cache: true X_new = X_test[:6] @@ -712,43 +765,55 @@ class_pred = [class_names[i] for i in y_pred] class_pred ``` + ## R code + ```{r predict-r} #| cache: true -img = X_test[1:6, , , drop = FALSE] -class_pred = model %>% predict(img, verbose=0) %>% k_argmax() -class_pred +# img = X_test[1:6, , , drop = FALSE] +# class_pred = model %>% predict(img, verbose=0) %>% k_argmax() +# class_pred ``` ::: ::: ::: -Using the above-described concepts and code you may try to train a new MLP using color images of ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck) using the CIFAR-10 and CIFAR-100 datasets[^9]! +Using the above-described concepts and code you may try to train a new MLP using color images of ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck) using the CIFAR-10 and CIFAR-100 datasets[^chapter14-9]! + +[^chapter14-9]: https://www.cs.toronto.edu/ kriz/cifar.html ### Re-using an Open Source CNN {#sec-tuning} -Training complex images such as photographs is normally a more sophisticated task if we compare it to the examples included in the last sections. On the one hand, it might not be a good idea to build a deep neural network from scratch as we did in section [-@sec-deep] to train a MLP. This means that you can re-use some lower layers of other DNNs and deploy *transfer learning* to save time with less training data. On the other hand, we should also move from traditional MLPs to other kinds of DNNs such as Convolutional Neural Networks (CNNs) which are nowadays the state-of-the-art approach in computer vision. Moreover, to get good results we should also build or explore different CNNs architectures that can produce more accurate predictions in image classification. In this section we will show how to re-use an open source CNN architecture and will suggest an example of how to fine-tune an existing CNN for a social science problem. +Training complex images such as photographs is normally a more sophisticated task if we compare it to the examples included in the last sections. On the one hand, it might not be a good idea to build a deep neural network from scratch as we did in section [-@sec-deep] to train a MLP. This means that you can re-use some lower layers of other DNNs and deploy *transfer learning* to save time with less training data. On the other hand, we should also move from traditional MLPs to other kinds of DNNs such as Convolutional Neural Networks (CNNs) which are nowadays the state-of-the-art approach in computer vision. Moreover, to get good results we should also build or explore different CNNs architectures that can produce more accurate predictions in image classification. In this section we will show how to re-use an open source CNN architecture and will suggest an example of how to fine-tune an existing CNN for a social science problem. -As explained in Section [-@sec-cnnbasis] a CNN is a specific type of DNN that has had great success in complex visual tasks (image classification, object detection or semantic segmentation) and voice recognition[^10]. Instead of using *fully connected* layers like in a typical MLP, a CNN uses only *partially connected* layers inspired on how "real" neurons connect in the visual cortex: some neurons only react to stimuli located in a limited *receptive field*. 
In other words, in a CNN every neuron is connected to some neurons of the previous layer (and not to all of them), which significantly reduces the amount of information transmitted to the next layer and helps the DNN to detect complex patterns. Surprisingly, this reduction in the number of parameters and weights involved in the model works better for larger and more complex images, different from those shown in MNIST. +As explained in Section [-@sec-cnnbasis] a CNN is a specific type of DNN that has had great success in complex visual tasks (image classification, object detection or semantic segmentation) and voice recognition[^chapter14-10]. Instead of using *fully connected* layers like in a typical MLP, a CNN uses only *partially connected* layers inspired on how "real" neurons connect in the visual cortex: some neurons only react to stimuli located in a limited *receptive field*. In other words, in a CNN every neuron is connected to some neurons of the previous layer (and not to all of them), which significantly reduces the amount of information transmitted to the next layer and helps the DNN to detect complex patterns. Surprisingly, this reduction in the number of parameters and weights involved in the model works better for larger and more complex images, different from those shown in MNIST. -Building a CNN is quite similar to a MLP, except for the fact that you will have to work with *convolutional* and *pooling* layers. The convolutional layers include a *bias term* and are the most important blocks of a CNN because they establish the specific connections among the neurons. In simpler words: a given neuron of a high-level layer is connected only to a rectangular group of neurons (the receptive field) of the low-level layer and not to all of them[^11]. For more technical details of the basis of a CNN you can go to specific literature such as @geron2019hands. +[^chapter14-10]: CNNs have also a great performance in natural language processing. -Instead of building a CNN from scratch, there are many pre-trained and open-source architectures that have been optimized for image classification. Besides a stack of convolutional and pooling layers, these architectures normally include some fully connected layers and a regular output layer for prediction (just like in MLPs). We can mention here some of these architectures: LeNet-5, AlexNet, GoogLeNet, VGGNet, ResNet, Xception or SENet[^12]. All these CNNs have been previously tested in image classification with promising results, but you still have to look at the internal composition of each of them and their metrics to choose the most appropriate for you. You can implement and train most of them from scratch either in *keras* or *PyTorch*, or you can just use them directly or even fine-tune the pre-trained model in order to save time. +Building a CNN is quite similar to a MLP, except for the fact that you will have to work with *convolutional* and *pooling* layers. The convolutional layers include a *bias term* and are the most important blocks of a CNN because they establish the specific connections among the neurons. In simpler words: a given neuron of a high-level layer is connected only to a rectangular group of neurons (the receptive field) of the low-level layer and not to all of them[^chapter14-11]. For more technical details of the basis of a CNN you can go to specific literature such as @geron2019hands. 
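To give an idea of how these building blocks fit together before we turn to pre-trained architectures, here is a minimal sketch (not one of the architectures discussed below) of a small CNN for $28\times 28$ gray scale images, stacking convolutional and pooling layers in front of the dense layers we already know from the MLP:

```python
from tensorflow import keras

cnn = keras.models.Sequential([
    # Convolutional layer: 32 filters of 3x3 pixels scanning the image
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Pooling layer: aggregate each 2x2 patch, halving width and height
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    # Flatten the feature maps and classify as in the MLP above
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
cnn.summary()
```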
-Let's use the pre-trained model of a Residual Network (ResNet) with 50 layers, also known as *ResNet50*, to show you how to deploy a multi-class classifier over pictures. The ResNet architecture (also with 34, 101 and 152 layers) is based on residual learning and uses *skip connections*, which means that the input layer not only feeds the next layer but this signal is also added to the output of another high-level layer. This allows you to have a much deeper network and in the case of ResNet152 it has achieved a top-five error rate of 3.6\%. As we do in Example [-@exm-resnet50], you can easily import into your workspace a ResNet50 architecture and include the pre-trained weights of a model trained with ImageNet (uncomment the second line of the code to visualize the complete model!). +[^chapter14-11]: If the input layer (in the case of color images there are three sublayers, one per color channel) and the convolutional layers are of different sizes we can apply techniques such as *zero padding* (adding zeros around the inputs) or spacing out the receptive fields (each shift from one receptive field to the other will be a *stride*). In order to transmit the weights from the receptive fields to the neurons, the convolutional layer will automatically generate some *filters* to create *features maps*, which are the areas of the input that mostly activate those filters. Additionally, by creating subsamples of the inputs, the pooling layers will reduce the number of parameters, the computational effort of the network and the risk of overfitting. The pooling layers *aggregates* the inputs using a standard arithmetic function such as minimum, maximum or mean. -::: {.callout-note appearance="simple" icon=false} +Instead of building a CNN from scratch, there are many pre-trained and open-source architectures that have been optimized for image classification. Besides a stack of convolutional and pooling layers, these architectures normally include some fully connected layers and a regular output layer for prediction (just like in MLPs). We can mention here some of these architectures: LeNet-5, AlexNet, GoogLeNet, VGGNet, ResNet, Xception or SENet[^chapter14-12]. All these CNNs have been previously tested in image classification with promising results, but you still have to look at the internal composition of each of them and their metrics to choose the most appropriate for you. You can implement and train most of them from scratch either in *keras* or *PyTorch*, or you can just use them directly or even fine-tune the pre-trained model in order to save time. +[^chapter14-12]: The description of technical details of all of these architectures is beyond the scope of this book, but besides the specific scientific literature of each architecture, some packages such as *keras* usually include basic documentation. + +Let's use the pre-trained model of a Residual Network (ResNet) with 50 layers, also known as *ResNet50*, to show you how to deploy a multi-class classifier over pictures. The ResNet architecture (also with 34, 101 and 152 layers) is based on residual learning and uses *skip connections*, which means that the input layer not only feeds the next layer but this signal is also added to the output of another high-level layer. This allows you to have a much deeper network and in the case of ResNet152 it has achieved a top-five error rate of 3.6%. 
As we do in Example [-@exm-resnet50], you can easily import into your workspace a ResNet50 architecture and include the pre-trained weights of a model trained with ImageNet (uncomment the second line of the code to visualize the complete model!).

-::: {.callout-note appearance="simple" icon=false}
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-resnet50}
Loading and visualizing the ResNet50 architecture

-::: {.panel-tabset}
+::: panel-tabset
## Python code
+
```{python resnet50-python}
model_rn50 = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
```
+
## R code
+
```{r resnet50-r}
model_resnet50 = application_resnet50(weights="imagenet")
```
@@ -756,17 +821,17 @@ model_resnet50 = application_resnet50(weights="imagenet")
:::
:::

-ImageNet is a corpus of labelled images based on the WordNet hierarchy. ResNet uses a subset of ImageNet with \ 1000 examples for each of the 1000 classes for a total corpus of roughly 1350000 pictures (1200000 for training, 100000 for test, and 50000 for validation).
+ImageNet is a corpus of labelled images based on the WordNet hierarchy. ResNet uses a subset of ImageNet with about 1000 examples for each of the 1000 classes for a total corpus of roughly 1350000 pictures (1200000 for training, 100000 for test, and 50000 for validation).

In Example [-@exm-newimages] we crop a part of our second example picture of refugees arriving at the European coast (`myimg2_RGB`) in order to get just the sea landscape. With the created `model_resnet50` we then ask for up to three predictions of the class of the photograph in Example [-@exm-resnetpredictions].

-::: {.callout-note appearance="simple" icon=false}
-
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-newimages}
Cropping an image to get a picture of a sea landscape

-::: {.panel-tabset}
+::: panel-tabset
## Python code
+
```{python newimages-python}
#| results: hide
def plot_color_image(image):
@@ -785,55 +850,61 @@ plot_color_image(tf_images[0])

plt.show()
```
+
## R code
+
```{r newimages-r}
#| cache: true
-picture1 = image_crop(myimg2_RGB, "224x224+50+50")
-plot(picture1)
-picture1 = as.integer(picture1[[1]])
-#drop the extra channel for comparision
-picture1 = picture1[,,-4]
-picture1 = array_reshape(picture1, c(1, dim(picture1)))
-picture1 = imagenet_preprocess_input(picture1)
+# picture1 = image_crop(myimg2_RGB, "224x224+50+50")
+# plot(picture1)
+# picture1 = as.integer(picture1[[1]])
+# #drop the extra channel for comparison
+# picture1 = picture1[,,-4]
+# picture1 = array_reshape(picture1, c(1, dim(picture1)))
+# picture1 = imagenet_preprocess_input(picture1)
```
:::
:::
:::

-::: {.callout-note appearance="simple" icon=false}
-
+::: {.callout-note appearance="simple" icon="false"}
::: {#exm-resnetpredictions}
Predicting the class of the first image

-::: {.panel-tabset}
+::: panel-tabset
## Python code
+
```{python resnetpredictions-python}
inputs = tf.keras.applications.resnet50.preprocess_input(tf_images * 255)
Y_proba = model_rn50.predict(inputs, verbose=0)
preds = tf.keras.applications.resnet50.decode_predictions(Y_proba, top=3)
print(preds[0])
```
+
## R code
+
```{r resnetpredictions-r}
#| cache: true
-preds1 = model_resnet50 %>% predict(picture1)
-imagenet_decode_predictions(preds1, top = 3)[[1]]
+# preds1 = model_resnet50 %>% predict(picture1)
+# imagenet_decode_predictions(preds1, top = 3)[[1]]
```
:::
:::
:::

-As you can see in the Python and R outputs[^13], the best guess of the model is a *sandbar*, which is very close to the real picture that contains sea water, mountains and sky.
However, it seems that the model is confusing sand with sea. Other results in the Python model are *seashore* and *cliff*, which are also very close to real sea landscape. Nevertheless, in the case of the R prediction the model detects a *submarine* and a *gray whale*, which revels that predictions are not 100\% accurate yet.
+As you can see in the Python and R outputs[^chapter14-13], the best guess of the model is a *sandbar*, which is very close to the real picture that contains sea water, mountains and sky. However, it seems that the model is confusing sand with sea. Other results in the Python model are *seashore* and *cliff*, which are also very close to the real sea landscape. Nevertheless, in the case of the R prediction the model detects a *submarine* and a *gray whale*, which reveals that predictions are not 100% accurate yet.

-If we do the same with another part of that original picture and focus only on the group of refugees in a lifeboat arriving at the European coast, we will get a different result! In Example [-@exm-newimages2] we crop again (`myimg2_RGB`) and get a new framed picture. Then in Example [-@exm-resnetpredictions2] we re-run the prediction task using the model *ResNet50* trained with ImageNet and get a correct result: both predictions coincide to see a *lifeboat*, which is a good tag for the image we want to classify. Again, other lower-level predictions can seem accurate (*speedboat*) and totally inaccurate (*volcano*, *gray whale* or *amphibian*).
+[^chapter14-13]: Outputs in Python and R might differ a little bit since the cropping of the new images was similar but not identical.

+If we do the same with another part of that original picture and focus only on the group of refugees in a lifeboat arriving at the European coast, we will get a different result! In Example [-@exm-newimages2] we crop again (`myimg2_RGB`) and get a new framed picture. Then in Example [-@exm-resnetpredictions2] we re-run the prediction task using the model *ResNet50* trained with ImageNet and get a correct result: both predictions coincide in identifying a *lifeboat*, which is a good tag for the image we want to classify. Again, other lower-level predictions can seem accurate (*speedboat*) and totally inaccurate (*volcano*, *gray whale* or *amphibian*).
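If you want to run the same pipeline on an image file of your own, the steps are always the same: load and resize the picture to $224 \times 224$ pixels, apply the ResNet50 preprocessing, predict, and decode the predictions. A minimal sketch in Python (the file name is just a placeholder for any local image):

```python
import numpy as np
import tensorflow as tf

model_rn50 = tf.keras.applications.resnet50.ResNet50(weights="imagenet")

# "my_photo.jpg" is a placeholder: any local image file will do
img = tf.keras.utils.load_img("my_photo.jpg", target_size=(224, 224))
x = tf.keras.utils.img_to_array(img)          # shape (224, 224, 3)
x = np.expand_dims(x, axis=0)                 # batch of one image
x = tf.keras.applications.resnet50.preprocess_input(x)

probs = model_rn50.predict(x, verbose=0)
print(tf.keras.applications.resnet50.decode_predictions(probs, top=3)[0])
```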
+::: {.callout-note appearance="simple" icon="false"} ::: {#exm-newimages2} Cropping an image to get a picture of refugees in a lifeboat -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python newimages2-python} #| cache: true #| results: hide @@ -841,70 +912,47 @@ plot_color_image(tf_images[1]) plt.show() ``` + ## R code + ```{r newimages2-r} #| cache: true -picture2 = image_crop(myimg2_RGB, "224x224+1000") -plot(picture2) -picture2 = as.integer(picture2[[1]]) -#drop the extra channel for comparision -picture2 = picture2[,,-4] -picture2 = array_reshape(picture2, c(1, dim(picture2))) -picture2 = imagenet_preprocess_input(picture2) +# picture2 = image_crop(myimg2_RGB, "224x224+1000") +# plot(picture2) +# picture2 = as.integer(picture2[[1]]) +# #drop the extra channel for comparision +# picture2 = picture2[,,-4] +# picture2 = array_reshape(picture2, c(1, dim(picture2))) +# picture2 = imagenet_preprocess_input(picture2) ``` ::: ::: ::: -::: {.callout-note appearance="simple" icon=false} - +::: {.callout-note appearance="simple" icon="false"} ::: {#exm-resnetpredictions2} Predicting the class of the second image -::: {.panel-tabset} +::: panel-tabset ## Python code + ```{python resnetpredictions2-python} print(preds[1]) ``` + ## R code + ```{r resnetpredictions2-r} #| cache: true -preds2 = model_resnet50 %>% predict(picture2) -imagenet_decode_predictions(preds2, top = 3)[[1]] +# preds2 = model_resnet50 %>% predict(picture2) +# imagenet_decode_predictions(preds2, top = 3)[[1]] ``` ::: ::: ::: -These examples show you how to use an open-source and pre-trained CNN that has 1000 classes and has been trained on images that we do not have control of. However, you may want to build your own classifier with your own training data, but using part of an existing architecture. This is called fine-tuning and you can follow a good example in social science in @williams2020images in which the authors reuse RestNet18 to build binary and multi-class classifiers adding their own data examples over the pre-trained CNN[^14]. - -So far we have covered the main techniques, methods, and services to analyze multimedia data, specifically images. It is up to you to choose which library or service to use, and you will find most of them in R and Python, using the basic concepts explained in this chapter. If you are interested in deepening your understanding of multimedia analysis, we encourage you explore this emerging and exciting field of expertise given the enormous importance it will no doubt have in the near future. - -[^1]: https://pjreddie.com/darknet/yolo/ - -[^2]: https://cloud.google.com/speech-to-text - -[^3]: Alpha refers to the opacity of each pixel. - -[^4]: Key refers to *black*. - -[^5]: http://yann.lecun.com/exdb/mnist/ - -[^6]: We will deploy *TensorFlow**2* in our exercises. - -[^7]: https://github.com/zalandoresearch/fashion-mnist +These examples show you how to use an open-source and pre-trained CNN that has 1000 classes and has been trained on images that we do not have control of. However, you may want to build your own classifier with your own training data, but using part of an existing architecture. This is called fine-tuning and you can follow a good example in social science in @williams2020images in which the authors reuse RestNet18 to build binary and multi-class classifiers adding their own data examples over the pre-trained CNN[^chapter14-14]. -[^8]: The train loss/accuracy will gradually be better and better. And the test loss/accuracy as well, in the beginning. 
But then, at some point train loss/acc improves but test loss/acc stops getting better. If we keep training the model for more epochs, we are just overfitting on the train set, which of course we do not want to. Specifically, we do not want to simply stop at the iteration where we got the best loss/acc for the test set, because then we are overfitting on the test set. Hence practitioners often let it run for a few more epochs after hitting the best loss/acc for the test set. Then, a final check on the validation set will really tell us how well we do out of sample.
-
-[^9]: https://www.cs.toronto.edu/ kriz/cifar.html
-
-[^10]: CNNs have also a great performance in natural language processing.
-
-[^11]: If the input layer (in the case of color images there are three sublayers, one per color channel) and the convolutional layers are of different sizes we can apply techniques such as *zero padding* (adding zeros around the inputs) or spacing out the receptive fields (each shift from one receptive field to the other will be a *stride*). In order to transmit the weights from the receptive fields to the neurons, the convolutional layer will automatically generate some *filters* to create *features maps*, which are the areas of the input that mostly activate those filters. Additionally, by creating subsamples of the inputs, the pooling layers will reduce the number of parameters, the computational effort of the network and the risk of overfitting. The pooling layers *aggregates* the inputs using a standard arithmetic function such as minimum, maximum or mean.
-
-[^12]: The description of technical details of all of these architectures is beyond the scope of this book, but besides the specific scientific literature of each architecture, some packages such as *keras* usually include basic documentation.
-
-[^13]: Outputs in Python and R might differ a little bit since the cropping of the new images were similar but not identical.
-
-[^14]: The examples are provided in Python with the package *PyTorch*, which is quite friendly if you are already familiar to *Keras*.
+[^chapter14-14]: The examples are provided in Python with the package *PyTorch*, which is quite friendly if you are already familiar with *Keras*.
+So far we have covered the main techniques, methods, and services to analyze multimedia data, specifically images. It is up to you to choose which library or service to use, and you will find most of them in R and Python, using the basic concepts explained in this chapter. If you are interested in deepening your understanding of multimedia analysis, we encourage you to explore this emerging and exciting field of expertise given the enormous importance it will no doubt have in the near future.
diff --git a/renv.lock b/renv.lock index f96bd88..4dc59fd 100644 --- a/renv.lock +++ b/renv.lock @@ -1,6 +1,6 @@ { "R": { - "Version": "4.2.2", + "Version": "4.3.1", "Repositories": [ { "Name": "CRAN", @@ -9,7 +9,7 @@ ] }, "Python": { - "Version": "3.10.6", + "Version": "3.10.12", "Type": "virtualenv", "Name": "./renv/python/virtualenvs/renv-python-3.10" }, @@ -526,10 +526,10 @@ }, "class": { "Package": "class", - "Version": "7.3-20", + "Version": "7.3-22", "Source": "Repository", "Repository": "CRAN", - "Hash": "da09d82223e669d270e47ed24ac8686e", + "Hash": "f91f6b29f38b8c280f2b9477787d4bb2", "Requirements": [ "MASS" ] @@ -2278,10 +2278,10 @@ }, "reticulate": { "Package": "reticulate", - "Version": "1.28", + "Version": "1.34.0", "Source": "Repository", "Repository": "CRAN", - "Hash": "86c441bf33e1d608db773cb94b848458", + "Hash": "a69f815bcba8a055de0b08339b943f9e", "Requirements": [ "Matrix", "Rcpp", @@ -2290,6 +2290,7 @@ "jsonlite", "png", "rappdirs", + "rlang", "withr" ] }, diff --git a/requirements.txt b/requirements.txt index 4354337..0df4d3a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,129 +1,157 @@ -absl-py==1.4.0 -adjustText==0.7.3 +absl-py==2.0.0 +adjustText==0.8 +asttokens==2.4.1 astunparse==1.6.3 async-generator==1.10 -attrs==22.2.0 -bioinfokit==2.1.0 -blis==0.7.9 -cachetools==5.3.0 -catalogue==2.0.8 -certifi==2022.12.7 -charset-normalizer==3.0.1 -click==8.1.3 +attrs==23.1.0 +bioinfokit==2.1.3 +blinker==1.7.0 +blis==0.7.11 +cachetools==5.3.2 +catalogue==2.0.10 +certifi==2023.11.17 +charset-normalizer==3.3.2 +click==8.1.7 click-plugins==1.1.1 cligj==0.7.2 +comm==0.2.0 community==1.0.0b1 -confection==0.0.4 -conllu==4.5.2 -contourpy==1.0.7 +confection==0.1.4 +conllu==4.5.3 +contourpy==1.2.0 cssselect==1.2.0 -cycler==0.11.0 -cymem==2.0.7 -Cython==0.29.33 +cycler==0.12.1 +cymem==2.0.8 +Cython==3.0.6 +debugpy==1.8.0 +decorator==5.1.1 descartes==1.1.0 dyNET38==2.1 eli5==0.13.0 en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl es-core-news-sm==3.1.0 -exceptiongroup==1.1.0 -Fiona==1.8.22 -Flask==2.2.2 -flatbuffers==23.1.21 -fonttools==4.38.0 +exceptiongroup==1.2.0 +executing==2.0.1 +fiona==1.9.5 +Flask==3.0.0 +flatbuffers==23.5.26 +fonttools==4.46.0 fst-pso==1.8.1 FuzzyTM==2.0.5 -gast==0.4.0 -gensim==4.3.0 -geopandas==0.12.2 -google-auth==2.16.0 -google-auth-oauthlib==0.4.6 +gast==0.5.4 +gensim==4.3.2 +geopandas==0.14.1 +google-auth==2.24.0 +google-auth-oauthlib==1.1.0 google-pasta==0.2.0 graphviz==0.20.1 -grpcio==1.51.1 +grpcio==1.59.3 h11==0.14.0 -h5py==3.8.0 -idna==3.4 +h5py==3.10.0 +idna==3.6 +ipykernel==6.27.1 +ipython==8.18.1 itsdangerous==2.1.2 +jedi==0.19.1 Jinja2==3.1.2 -joblib==1.2.0 -keras==2.11.0 +joblib==1.3.2 +jupyter_client==8.6.0 +jupyter_core==5.5.0 +keras==2.15.0 Keras-Preprocessing==1.1.2 -kiwisolver==1.4.4 +kiwisolver==1.4.5 langcodes==3.3.0 -libclang==15.0.6.1 -lxml==4.9.2 -Markdown==3.4.1 -MarkupSafe==2.1.2 -matplotlib==3.6.3 -matplotlib-venn==0.11.7 +libclang==16.0.6 +lxml==4.9.3 +Markdown==3.5.1 +MarkupSafe==2.1.3 +matplotlib==3.8.2 +matplotlib-inline==0.1.6 +matplotlib-venn==0.11.9 miniful==0.0.6 -munch==2.5.0 -murmurhash==1.0.9 -nagisa==0.2.8 -networkx==3.0 +ml-dtypes==0.2.0 +munch==4.0.0 +murmurhash==1.0.10 +nagisa==0.2.9 +nest-asyncio==1.5.8 +networkx==3.2.1 nltk==3.8.1 -numpy==1.24.1 +numpy==1.26.2 oauthlib==3.2.2 opt-einsum==3.3.0 -outcome==1.2.0 -packaging==23.0 -pandas==1.5.3 -pathy==0.10.1 -patsy==0.5.3 -Pillow==9.4.0 -preshed==3.0.8 -protobuf==3.19.6 
-pyasn1==0.4.8 -pyasn1-modules==0.2.8 +outcome==1.3.0.post0 +packaging==23.2 +pandas==2.1.3 +parso==0.8.3 +pathy==0.10.3 +patsy==0.5.4 +pexpect==4.9.0 +Pillow==10.1.0 +platformdirs==4.0.0 +preshed==3.0.9 +prompt-toolkit==3.0.41 +protobuf==4.23.4 +psutil==5.9.6 +ptyprocess==0.7.0 +pure-eval==0.2.2 +pyasn1==0.5.1 +pyasn1-modules==0.3.0 pydantic==1.8.2 pyFUME==0.2.25 -pyparsing==3.0.9 -pyproj==3.4.1 +Pygments==2.17.2 +pyparsing==3.1.1 +pyproj==3.6.1 PySocks==1.7.1 python-dateutil==2.8.2 python-louvain==0.16 -pytz==2022.7.1 -regex==2022.10.31 -requests==2.28.2 +pytz==2023.3.post1 +pyzmq==25.1.1 +regex==2023.10.3 +requests==2.31.0 requests-oauthlib==1.3.1 rsa==4.9 -scikit-learn==0.24.2 -scipy==1.10.0 -seaborn==0.12.2 -selenium==4.8.0 -shapely==2.0.0 +scikit-learn==1.3.2 +scipy==1.11.4 +seaborn==0.13.0 +selenium==4.15.2 +shapely==2.0.2 shifterator==0.3.0 -simpful==2.9.0 +simpful==2.11.1 six==1.16.0 -smart-open==6.3.0 +smart-open==6.4.0 sniffio==1.3.0 sortedcontainers==2.4.0 spacy==3.1.7 spacy-legacy==3.0.12 -spacy-loggers==1.0.4 -srsly==2.4.5 -statsmodels==0.13.5 +spacy-loggers==1.0.5 +srsly==2.4.8 +stack-data==0.6.3 +statsmodels==0.14.0 tabulate==0.9.0 -tensorboard==2.11.2 -tensorboard-data-server==0.6.1 +tensorboard==2.15.1 +tensorboard-data-server==0.7.2 tensorboard-plugin-wit==1.8.1 -tensorflow==2.11.0 -tensorflow-estimator==2.11.0 -tensorflow-io-gcs-filesystem==0.30.0 -termcolor==2.2.0 +tensorflow==2.15.0 +tensorflow-estimator==2.15.0 +tensorflow-io-gcs-filesystem==0.34.0 +termcolor==2.4.0 textwrap3==0.9.2 thinc==8.0.17 -threadpoolctl==3.1.0 -tqdm==4.64.1 -trio==0.22.0 -trio-websocket==0.9.2 +threadpoolctl==3.2.0 +tornado==6.4 +tqdm==4.66.1 +traitlets==5.14.0 +trio==0.23.1 +trio-websocket==0.11.1 typer==0.4.2 -typing_extensions==4.4.0 -ufal.udpipe==1.2.0.3 -urllib3==1.26.14 +typing_extensions==4.8.0 +tzdata==2023.3 +ufal.udpipe==1.3.1.1 +urllib3==2.1.0 wasabi==0.10.1 -Werkzeug==2.2.2 -wordcloud==1.8.2.2 +wcwidth==0.2.12 +Werkzeug==3.0.1 +wordcloud==1.9.2 wrapt==1.14.1 wsproto==1.2.0 xlrd==2.0.1