This repository relates to an article published in the July 2022 issue of First Monday. A pre-print version was published on SSRN in November 2021.
For this article, I first extracted the 300,000 posts that garnered the most attention each month of 2020 on pages administered mainly in Belgium, Canada, France and Switzerland. After filtering this initial 13.4M-post sample, as described in the article, I kept a final sample of 3.3M posts in French.
One of the filtering steps involved determining the language of each post. This was done with the following Python script:
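As a rough illustration of that step, here is a minimal sketch of a language filter. The input file name and the `message` column are assumptions, and `langdetect` stands in for whatever library the actual script relies on.

```python
# Minimal sketch: keep only posts detected as French.
# Assumes a CSV export with a "message" column; the real script may differ.
import pandas as pd
from langdetect import detect, LangDetectException

def detect_language(text):
    """Return an ISO 639-1 code, or None when detection fails (empty or odd text)."""
    try:
        return detect(text)
    except LangDetectException:
        return None

posts = pd.read_csv("posts_2020.csv")                  # hypothetical input file
posts["lang"] = posts["message"].astype(str).map(detect_language)
posts_fr = posts[posts["lang"] == "fr"]                # keep French-language posts only
posts_fr.to_csv("posts_2020_fr.csv", index=False)
```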
The pages in this final sample were then manually classified into two categories (criteria described in the article): media and non-media. The following four CSV files show how pages were classified in each country, along with the number of posts and the sum of interactions from each page (counting only the posts included in my sample):
- belgique2020-pages-fb-fr.csv 🇧🇪
- canada2020-pages-fb-fr.csv 🇨🇦
- france2020-pages-fb-fr.csv 🇫🇷
- suisse2020-pages-fb-fr.csv 🇨🇭
Those results are also summarized in the following graph.
CrowdTangle's ToS do not allow the sharing of raw data. However, a summary of interaction types by subcorpus (8 subcorpora in total: one per country and per type [media vs non-media]) can be found in the following CSV file:
To extract unigrams, bigrams and trigrams from each of the 8 subcorpora, I used this Python script:
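The actual script is linked above; the sketch below only shows the general idea, assuming each post's text and interaction count sit in `message` and `interactions` columns of a CSV file, and using a deliberately naive tokenizer.

```python
# Minimal sketch: emit (n-gram, interactions) rows for unigrams, bigrams and trigrams.
# The real scripts write one output file per n-gram type (hence 24 files in total).
import csv
import re

def ngrams(tokens, n):
    """Yield space-joined n-grams from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

outputs = {n: open(f"subcorpus_{n}grams.csv", "w", newline="", encoding="utf-8")
           for n in (1, 2, 3)}
writers = {n: csv.writer(f) for n, f in outputs.items()}

with open("subcorpus.csv", newline="", encoding="utf-8") as f_in:
    for row in csv.DictReader(f_in):
        tokens = re.findall(r"\w+", row["message"].lower())
        for n in (1, 2, 3):
            for gram in ngrams(tokens, n):
                writers[n].writerow([gram, row["interactions"]])

for f in outputs.values():
    f.close()
```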
All n-grams were then cleaned up (to remove residual punctuation or stray whitespace characters, for example) and standardized using this script:
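As an illustration only, the kind of clean-up involved might look like the following; the exact rules applied by the actual script may differ.

```python
# Hypothetical clean-up: collapse odd whitespace, strip leftover punctuation, lowercase.
import re
import unicodedata

def normalize(term):
    term = unicodedata.normalize("NFC", term)   # unify accented-character encodings
    term = re.sub(r"\s+", " ", term)            # \s covers non-breaking/thin spaces in Python 3
    term = re.sub(r"[^\w\s'-]", "", term)       # drop residual punctuation
    return term.strip().lower()

print(normalize("  «\u202fCovid-19\u202f» "))   # -> "covid-19"
```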
The 24 CSV files (3 n-gram types × 2 categories × 4 countries) produced by these scripts were between 3.6M and 37.3M lines long. Each line contained a term and the interaction figure for the post in which it was found. To find the frequency of each term and to weigh it by interactions, as described in the article, a pivot table was computed using pandas. An example of the code used for the Belgium corpus CSV files is found in this notebook:
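The notebook linked above contains the real code; a minimal version of that pivot, with assumed file and column names, could be:

```python
# Count each term's occurrences and sum the interactions of the posts it appeared in.
import pandas as pd

ngrams = pd.read_csv("belgique2020-media-bigrams.csv",   # hypothetical file name
                     names=["term", "interactions"])

weighted = pd.pivot_table(ngrams, index="term", values="interactions",
                          aggfunc=["count", "sum"])
weighted.columns = ["frequency", "weighted_interactions"]
weighted = weighted.sort_values("weighted_interactions", ascending=False)
print(weighted.head(20))
```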
I then compared media and non-media unigrams, bigrams and trigrams for each country. This was done in a Jupyter notebook for each country, producing graphs with Plotly Express for Python. The raw notebooks are too big to be shared directly on GitHub, so they were placed on a personal server in HTML format:
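The comparison criteria themselves are described in the article. Purely to illustrate the charting step, a horizontal bar chart of the most characteristic terms could be produced with Plotly Express as follows; the dataframe layout, column names and values are placeholders, not actual findings.

```python
# Hypothetical dataframe of terms with a "characteristicness" score for one subcorpus.
import pandas as pd
import plotly.express as px

top_terms = pd.DataFrame({
    "term": ["terme a", "terme b", "terme c"],   # placeholder terms, not actual findings
    "score": [0.9, 0.8, 0.7],
})

fig = px.bar(top_terms.sort_values("score"), x="score", y="term",
             orientation="h", title="Most characteristic bigrams (illustrative data)")
fig.show()
```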
This step is, IMHO, the most relevant one, and the most revealing of what a newsless Facebook would look like.
For example, the most characteristic bigrams of the Canadian media and non-media subcorpora really show how different Facebook would be without news.
In the paper, compound graphs were published showing which terms were most characteristic in all four countries for media pages...
... and for non-media pages.
The numbers to the right of the bars show the number of countries (two or more) in which each term was found.
The last step involved exploratory topic modeling on all 8 subcorpora with BERTopic, using the following script:
- topicBERT.py (including my own list of stopwords in the file blabla.py)
I used BERTopic with three different models (a minimal sketch of one such run appears after this list):
- spaCy's `fr_core_news_md` model (2 runs with different parameters)
- FlauBERT
- CamemBERT
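topicBERT.py contains the actual parameters. The sketch below only illustrates the general shape of one such run, here with the spaCy backend: the input file, column name and stop words are assumptions, and the short stop-word list stands in for the one kept in blabla.py.

```python
# Minimal sketch of one BERTopic run (spaCy backend, 12 topics, 1-2 terms per n-gram,
# 8 terms per topic). File names and parameters are illustrative, not those of topicBERT.py.
import pandas as pd
import spacy
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

stopwords_fr = ["être", "avoir", "faire"]   # stand-in for the list kept in blabla.py

docs = pd.read_csv("subcorpus_juin2020.csv")["message"].dropna().tolist()  # assumed layout

# spaCy's French vectors as the embedding backend; pipeline components are not needed.
nlp = spacy.load("fr_core_news_md",
                 exclude=["tagger", "parser", "ner", "attribute_ruler", "lemmatizer"])

# Unigrams and bigrams (the real script works on lemmas), with the custom stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords_fr)
topic_model = BERTopic(embedding_model=nlp, vectorizer_model=vectorizer,
                       nr_topics=12, top_n_words=8)
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info().to_csv("topics_juin2020.csv", index=False)
```

A FlauBERT or CamemBERT run would swap in a different embedding backend while keeping the same vectorizer and topic parameters.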
The four runs were performed on each month for all 8 subcorpora. Since topic modeling is extremely memory-intensive, some months with a very large amount of material had to be split in two (such as the non-media French subcorpus). Below are examples of the topics produced for the month of June for both the media and non-media subcorpora, by country and by model.
Topics for the media subcorpora (June 2020):
- In Belgium 🇧🇪:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
- In Canada 🇨🇦:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
- In France 🇫🇷:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
- In Switzerland 🇨🇭:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
Topics for the non-media subcorpora (June 2020):
- In Belgium 🇧🇪:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
- In Canada 🇨🇦:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
- In France 🇫🇷 (here, with the FlauBERT and CamemBERT models, the subcorpus had to be split in half, so the CSV files contain 24 topics instead of 12):
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (24 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (24 topics, 1-2 lemmas per term, 8 terms per topic)
- In Switzerland 🇨🇭:
  - As given by the first run with spaCy's model (20 topics, 2 lemmas per term, 20 terms per topic)
  - As given by the second run with spaCy's model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the FlauBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
  - As given by the CamemBERT model (12 topics, 1-2 lemmas per term, 8 terms per topic)
I found that asking the models to provide either one or two lemmas per term (unigrams or bigrams) produced richer topics. I also found that CamemBERT produced much more coherent, robust and easy-to-interpret topics with French-language text.
The following figure, published in the article, presents a compound of all 384 tables produced by my topic-modeling runs, which together contain more than 5,000 topics.
I will gladly answer any questions from researchers wanting to reproduce these findings or replicate them in another context: roy.jean-hugues@uqam.ca