Can I predict/find influential users within a community?
Business promotion. By identifying influential users, or predicting who will become influential, you can target advertising around those users to promote a business or its products. For example, an influential user could be incentivized to visit a store and draw in people with common interests.
PySpark is the Python API for Apache Spark (Spark also offers APIs for R, Java, and Scala). It can run on top of the Hadoop Distributed File System (HDFS) and lets developers distribute computations across a cluster.
A Spark program consists of a driver (the main function), which sets up the environment and coordinates the job, and worker nodes, which carry out the computations.
Data is distributed to those worker nodes through a data structure known as a Resilient Distributed Dataset (RDD): an immutable collection of records, often key-value pairs or lists, partitioned across the cluster. A number of packages are included with PySpark, but to solve our community detection problem we will be using a third-party package called GraphFrames.
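As a minimal sketch (the data and names here are made up for illustration), an RDD of key-value pairs can be created and transformed like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# parallelize() splits this small in-memory list of key-value pairs
# into partitions that Spark distributes to the worker nodes.
follows = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 2)])

# reduceByKey runs on each partition in parallel, then merges the
# partial results, summing the counts per user.
totals = follows.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # e.g. [('alice', 5), ('bob', 5)]
```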
GraphFrames is a third-party package built for Spark that specializes in graph computations. Spark does ship with a built-in graph library, GraphX, but it is only available from Scala. GraphFrames instead builds on DataFrames, a tabular data structure layered on top of RDDs.
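As a hedged sketch, assuming GraphFrames has been installed (for example via the `--packages` option when launching PySpark) and using made-up vertex and edge data:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphDemo").getOrCreate()

# GraphFrames expects a vertex DataFrame with an "id" column and an
# edge DataFrame with "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
    ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")],
    ["src", "dst"])

g = GraphFrame(vertices, edges)
g.degrees.show()  # number of edges touching each vertex
```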
GraphFrames has a built-in community detection algorithm we can use, the Label Propagation Algorithm (LPA): each node starts with its own label and repeatedly adopts the most common label among its neighbors, so densely connected groups of users converge to a shared label that marks their community.
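A minimal sketch of invoking it, reusing the GraphFrame `g` built above (`maxIter=5` is an arbitrary choice, not a tuned value):

```python
# labelPropagation runs LPA for a fixed number of iterations and
# returns the vertices with an added "label" column; vertices that
# share a label belong to the same detected community.
communities = g.labelPropagation(maxIter=5)
communities.select("id", "label").show()
```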
There are many different factors that determine a user's influence, so there is no clear-cut answer to this. One idea is the PageRank algorithm, which scores a node's "importance" by the number and importance of its incoming edges. Depending on the social platform, different factors would matter (followers, retweets, mentions, and so on).
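GraphFrames exposes PageRank directly; here is a hedged sketch on the same graph `g` (the parameter values are the conventional defaults, not tuned for any particular platform):

```python
# resetProbability is the chance of jumping to a random node (0.15 is
# the conventional value); maxIter bounds the number of iterations.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)

# Vertices come back with a "pagerank" column; the highest-ranked
# users are candidate influencers.
ranks.vertices.orderBy("pagerank", ascending=False).show()
```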
Note: Because this is an unpublished research project, there is no actual code to show for the community detection itself; only a broad overview of the plan is given.
If you have any questions regarding PySpark, I will try my best to answer.
My email: jtlien@usc.edu
To learn more, Apache Spark's website (https://spark.apache.org/) contains an overview of the API.
Just as a quick demonstration of PySpark, I have included a Python file that estimates pi using Monte Carlo simulation.
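The original file is not reproduced here, but a minimal sketch of the standard approach looks like this (the sample count is arbitrary):

```python
import random
from pyspark.sql import SparkSession

def inside(_):
    # Sample one random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

if __name__ == "__main__":
    spark = SparkSession.builder.appName("EstimatePi").getOrCreate()
    n = 1_000_000  # number of sampled points

    # Distribute the sampling across the cluster and count the hits;
    # the fraction landing inside the quarter circle approximates pi/4.
    hits = spark.sparkContext.parallelize(range(n)).filter(inside).count()
    print(f"Pi is roughly {4.0 * hits / n}")
    spark.stop()
```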
In order to run the code, some environment setup is required. A few things need to be installed:
- Spark 3.0.1
- Java JDK 1.8
- Python 3
- Scala 2.12
In addition to downloading these, you need to set up your computer's environment variables (a sketch of one approach follows). You can ask me for assistance with this if you want.
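As one hedged example, the variables can also be set from inside Python before Spark starts; the paths below are placeholders for wherever you installed each tool, and setting them system-wide works equally well:

```python
import os

# Placeholder paths: point these at your actual installations.
os.environ["JAVA_HOME"] = "/path/to/jdk1.8"
os.environ["SPARK_HOME"] = "/path/to/spark-3.0.1"
os.environ["PYSPARK_PYTHON"] = "python3"  # interpreter used by workers

# If Spark starts, the variables are wired up correctly.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SetupCheck").getOrCreate()
print(spark.version)  # expect 3.0.1
spark.stop()
```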