Presidential Election 2017 in France: Twitter Analysis

In 2017, Emmanuel Macron and Marine Le Pen were the final two candidates in the French Presidential Election. The two candidates had drastically different approaches to governing, and as such, the election was a major topic of discussion on Twitter.

Objective

The objective of this case study was to analyzie twitter discussions happening in France in time before the election to compare popularity of both candidates, and to point media influencers.

Approach

Use data gathered from Twitter API
Process data in Spark with PySpark
Visualize insights with Python library Matplotlib

Challenges

Deciding between DataFrame and RDD
Deciphering cause of errors from PySpark
Retreiving data using .count() and .collect() methods on nested data
PySpark Docker crashing

Tweet Analysis

To load data, we feed .json file into Spark docker container:


spark_df = spark.read.json('./data/french_tweets.json')

Our DataFrame has quite comples structure. Below is an example of first 50 lines (out of 608):


root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- media_url: string (nullable = true)
 |    |    |    |-- media_url_https: string (nullable = true)
 |    |    |    |-- sizes: struct (nullable = true)
 |    |    |    |    |-- large: struct (nullable = true)
 |    |    |    |    |    |-- h: long (nullable = true)
 |    |    |    |    |    |-- resize: string (nullable = true)
 |    |    |    |    |    |-- w: long (nullable = true)
 |    |    |    |    |-- medium: struct (nullable = true)
 |    |    |    |    |    |-- h: long (nullable = true)
 |    |    |    |    |    |-- resize: string (nullable = true)
 |    |    |    |    |    |-- w: long (nullable = true)
 |    |    |    |    |-- small: struct (nullable = true)
 |    |    |    |    |    |-- h: long (nullable = true)
 |    |    |    |    |    |-- resize: string (nullable = true)
 |    |    |    |    |    |-- w: long (nullable = true)
 |    |    |    |    |-- thumb: struct (nullable = true)
 |    |    |    |    |    |-- h: long (nullable = true)
 |    |    |    |    |    |-- resize: string (nullable = true)
 |    |    |    |    |    |-- w: long (nullable = true)
 |    |    |    |-- source_status_id: long (nullable = true)
 |    |    |    |-- source_status_id_str: string (nullable = true)
 |    |    |    |-- source_user_id: long (nullable = true)
 |    |    |    |-- source_user_id_str: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- url: string (nullable = true)
 |    |-- symbols: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)

Reading in JSON file to RDD for processing:

def failsafe_decode(row):
    try:
        return json.loads(row)
    except:
        return None
    
rdd = sc.textFile("./data/french_tweets.json")

tweets_rdd = rdd.map(lambda x: failsafe_decode(x)).filter(lambda x: x!=None).cache()

Which candidate is more popular?

For analysis, we first wanted to see who was mentioned more often - Emmanuel Macron or Marine Le Pen. We used the following code:

#counting the number of original tweets mentioning names of candidates:
origins = (spark_df.select('retweeted', 'text')
           .where((spark_df.retweeted == 'false')&(spark_df.text.contains('EmmanuelMacron'))))
origins.count()

We got this numbers:

Emmanuel Macron: 2,185 mentions (63%)
Marine Le Pen: 1,279 mentions (36%)

Since the elections are in the past, we can compare our numbers to the voting results:

Who else were mentioned

For the next step, we wanted to find out who else were mentioned often in tweets. For that, we used RDD file:

user_mentions = (tweets_rdd.flatMap(lambda row: entity_fix(row))
              .filter(lambda empty: (empty != []) and (empty != 'x'))
              .map(lambda name: name['screen_name']))

user_mentions_list = user_mentions.collect()
user_mentions_count = Counter(user_mentions_list)
users = []
counts = []

sort_users = sorted(user_mentions_count.items(), key=lambda x: x[1], reverse=True)

for i in sort_users:
    users.append(i[0])
    counts.append(i[1])

fig, ax = plt.subplots(figsize=(12,5))
ax.title.set_text('Top 10 Mentioned Users')
ax.bar(users[0:10], counts[0:10])
plt.xticks(rotation=45)
fig.tight_layout()

Besides Emmanual Macron and MLP, who are these other mentions?

TLMP: It's only TV, a French television show known as Touche pas à mon poste ! that was covering the election heavily.
BFMTV: A 24 hour news and weather channel.
cyrilhanouna: Cyril Hanouna is the host of TLMP.
JLMelenchon: Jean-Luc Melenchon was another candidate running for the French presidency in 2017.
dupontaignon: Nicolas Dupont-Aignan former wine maker and president of Debout la France.
narcissik13: BANNED FROM TWITTER.

What hashtags did people use most often

Using same approach, we found the most popular hashtags:

hashtags = (tweets_rdd.flatMap(lambda row: hashtag_fix(row))
          .filter(lambda empty: (empty != []) and (empty != 'x'))
          .map(lambda text: text['text']))

hashtags_list = hashtags.collect()
hashtags_count = Counter(hashtags_list)
hashtags_count
hashtags = []
use_counts = []

sort_hashtags = sorted(hashtags_count.items(), key=lambda xg: xg[1], reverse=True)

for j in sort_hashtags:
    hashtags.append(j[0])
    use_counts.append(j[1])
    
hashtags

fig1, ax1 = plt.subplots(figsize=(12,5))
ax1.title.set_text('Top 10 Hashtags')
ax1.bar(hashtags[0:10], use_counts[0:10], color='blue')
plt.xticks(rotation=45)
fig1.tight_layout()

Conclusion

Our exploratory data analysis shows that Twitter can be used as an instument to forecast president election results.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
Untitled.ipynb		Untitled.ipynb
helper.py		helper.py
helpers.ipynb		helpers.ipynb
reading_data.ipynb		reading_data.ipynb
top10hashtags.png		top10hashtags.png
top10users.png		top10users.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Presidential Election 2017 in France: Twitter Analysis

Table of Content

Objective

Approach

Challenges

Tweet Analysis

Reading in JSON file to RDD for processing:

Which candidate is more popular?

Who else were mentioned

What hashtags did people use most often

Conclusion

About

Releases

Packages

Contributors 2

Languages

JuliaSokolova/2017_French_Presidential_election_tweets_analysis-

Folders and files

Latest commit

History

Repository files navigation

Presidential Election 2017 in France: Twitter Analysis

Table of Content

Objective

Approach

Challenges

Tweet Analysis

Reading in JSON file to RDD for processing:

Which candidate is more popular?

Who else were mentioned

What hashtags did people use most often

Conclusion

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages