Spark
- Provide a Spark application that can parse an input file of associates, each with an arbitrary number of tests, and predict the chance that each is dropped.
- Output concise equations that relate each exam type with the predicted drop chance.
- Determine accuracy metrics by week, so that, combined with supplementary data, the optimal time to decide whether to drop an associate can be determined.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Reference: Spark documentation
- Speed
- In-memory processing
- Real-time processing
- Machine Learning Capabilities
- Rich Libraries
- SparkSQL: Helps port existing data loads onto a Spark backend with minimal rewriting of the application stack; only SQL syntax is needed.
The current code in SparkAnalysis takes four arguments: the Spark master, the input file, the output file, and a control data output file.

On an EMR cluster, in the same directory as the jar, run:

```
spark-submit --master yarn --class com.revature.Driver Biforce_Analysis.jar p3in/spark_input.csv sparkOutput controlOutput
```

where spark_input.csv is in HDFS under ~/p3in. Output is written to an S3 bucket automatically (consider changing this and just specifying the S3 location in the input).

Locally on Cloudera or HortonWorks, create a directory p3out, touch the files sparkOutput.txt and controlOutput.txt, set s3Location to an empty string, and recompile the jar. Then run:

```
spark-submit --class com.revature.Driver Biforce_Analysis.jar p3in/spark_input.csv p3out/sparkOutput.txt p3out/controlOutput.txt
```
Note that TWO files are written to the output. The first contains the predicted release chances of every associate and is located under the 'sparkOutput' folder on 's3://revature-analytics-dev'. The second contains the output equations as well as accuracy statistics when tests are limited to a certain week, and is located under the 'controlOutput' folder in the same bucket.
- The only useful data for model building is test data from associates that are employed or dropped. Test data from those in training are filtered out.
- This data is then split by associate ids (70/30) into model and control data, for building the model and testing its accuracy.
- The input file contains rows of test scores. It contains 11 columns, of which 5 are used: _c1: test_type, _c3: score, _c4: test period, _c9: associate id, _c10: associate status.
- If the number of columns is changed in the future, column names as well as RDD row indices will have to be changed. I do not believe it has a significant effect on performance.
- The output file (sparkOutput) contains rows of associate predictions. It contains 4 columns with no header: associate_id,% Chance to Fail,Most Recent Week,Prediction.
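The filtering and 70/30 split described above can be sketched in plain Python (not the project's Spark code; the sample rows, the status code for 'in training', and the function names are invented for illustration):

```python
import csv
import io
import random

# Illustrative sketch only -- plain Python, not the project's Spark code.
# Column positions follow the README: _c9 = associate id, _c10 = status.
# The sample rows and the status code 2 for 'in training' are made up.
SAMPLE = """1,a,b,70.0,1,c,d,e,f,101,0
1,a,b,85.0,2,c,d,e,f,102,1
2,a,b,60.0,1,c,d,e,f,103,2
"""

def load_usable_rows(text):
    rows = list(csv.reader(io.StringIO(text)))
    # Keep only tests from associates who are dropped (0) or employed (1);
    # tests from associates still in training are filtered out.
    return [r for r in rows if r[10] in ("0", "1")]

def split_by_associate(rows, model_frac=0.7, seed=42):
    # Split 70/30 by associate id, so that every test of a given associate
    # lands entirely in either the model set or the control set.
    ids = sorted({r[9] for r in rows})
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * model_frac)
    model_ids = set(ids[:cut])
    model = [r for r in rows if r[9] in model_ids]
    control = [r for r in rows if r[9] not in model_ids]
    return model, control

rows = load_usable_rows(SAMPLE)
model, control = split_by_associate(rows)
```

In the real pipeline the same filter and split run as Spark transformations over the full input file.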
Logistic regression

The following steps are done for each test type (Exam, Verbal, Project):
- Available data (With a status of 0 for 'Dropped' or 1 for 'Employed') is first split into 10 buckets.
- A percentage chance of being dropped is found for each bucket.
- The natural log of the ratio of dropped to passed is found for each.
- This linearizes the model so that simple linear regression can be applied.
- The final equation is output as a 2d array of doubles.
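The bucketing and linearization steps above can be sketched in plain Python (the data is synthetic and the details of the project's actual ModelFunction may differ; this only illustrates the technique):

```python
import math

# Sketch of the linearized logistic fit described above: bucket the
# scores, take the log-odds of being dropped per bucket, then fit a
# simple least-squares line. The data below is synthetic.

def fit_log_odds(pairs, n_buckets=10):
    """pairs: (score, status) with status 0 = dropped, 1 = employed.
    Returns (b0, b1) such that log-odds-of-drop ~= b0 + b1 * score."""
    lo = min(s for s, _ in pairs)
    hi = max(s for s, _ in pairs)
    width = (hi - lo) / n_buckets or 1.0
    counts = [[0, 0] for _ in range(n_buckets)]  # [dropped, employed]
    for score, status in pairs:
        b = min(int((score - lo) / width), n_buckets - 1)
        counts[b][status] += 1
    # The natural log of the dropped:employed ratio per bucket
    # linearizes the model.
    xs, ys = [], []
    for b, (dropped, employed) in enumerate(counts):
        if dropped > 0 and employed > 0:
            xs.append(lo + (b + 0.5) * width)
            ys.append(math.log(dropped / employed))
    # Simple linear regression on (bucket midpoint, log-odds).
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

def drop_chance(score, b0, b1):
    # Invert the fitted log-odds back into a probability of being dropped.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * score)))

# Synthetic data: low scores are mostly drops, high scores mostly employed.
pairs = []
for s in range(5, 100, 10):
    pairs += [(s, 0)] * max(1, (100 - s) // 10) + [(s, 1)] * max(1, s // 10)
b0, b1 = fit_log_odds(pairs)
```

With drop-heavy low scores, the fitted slope b1 comes out negative, so higher scores map to lower predicted drop chances.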
- In order to make use of RDD methods, the following approach was used. Given:
- Three independent drop chances, T1, T2, T3 and three coefficients of determination (r^2), ie r1, r2, r3.
- A battery can have multiple tests of any given type, for example: T1_a, T1_b, T1_c, T2_a, T2_b, T3_a ...
- We want to weight each test by its r^2 value in the following form, where, for the above example, rs = r1×3+r2×2+r3×1 = r1×(#of test type 1 available) + r2×(# of type 2 available)...
- Resultant_drop_chance = T1_a×r1/rs + T1_b×r1/rs + T1_c×r1/rs + T2_a×r2/rs + T2_b×r2/rs + T3_a×r3/rs...
- Solution:
- Multiply each test by its associated r^2
- Sum up these products per battery with reduceByKey
- Sum up the individual r^2's
- Divide the total by rs to get (T1_a×r1 + T1_b×r1 + T1_c×r1 + T2_a×r2 + T2_b×r2 + T3_a×r3)/rs for each battery.
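The r²-weighted combination above, sketched in plain Python (the r² values and the example battery are invented; in the project the sums run over RDDs keyed by battery):

```python
# Sketch of the r^2-weighted combination described above. The r^2 values
# and the battery below are made up for illustration. In the Spark code
# the two sums would be computed with reduceByKey over battery keys.
R2 = {"Exam": 0.8, "Verbal": 0.5, "Project": 0.3}  # assumed values

def combined_drop_chance(tests):
    """tests: list of (test_type, partial_drop_chance) for one battery."""
    # rs = r1*(# of type-1 tests) + r2*(# of type-2 tests) + ...
    rs = sum(R2[t] for t, _ in tests)
    # Multiply each test's partial chance by its type's r^2, then sum.
    weighted = sum(R2[t] * p for t, p in tests)
    return weighted / rs

battery = [("Exam", 0.6), ("Exam", 0.4), ("Verbal", 0.7), ("Project", 0.2)]
chance = combined_drop_chance(battery)
```

For this battery rs = 0.8×2 + 0.5 + 0.3 = 2.4 and the weighted sum is 1.21, giving a combined drop chance of about 0.504.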
- The control output file is used for determining week-by-week accuracy, and logging of the equations.
- These equations are used to find partial chances of being dropped.
- These partial chances are weighted according to the strength of their test's correlation to the drop chance.
- They are finally combined into a single drop chance.
- The prediction is made by splitting the control data's drop chances at the point that yields the fewest incorrect predictions.
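That final split can be sketched in plain Python with made-up control data (the project's actual tie-breaking and comparison direction may differ):

```python
# Sketch of the final split: pick the threshold on the control data's
# combined drop chances that misclassifies the fewest associates, where
# chance >= threshold predicts 'dropped'. The data below is invented.

def best_threshold(control):
    """control: list of (drop_chance, status), status 0 = dropped, 1 = employed."""
    candidates = sorted({c for c, _ in control})
    best, best_errors = candidates[0], len(control) + 1
    for t in candidates:
        # An error is a predicted-dropped that was employed, or vice versa.
        errors = sum(
            1 for chance, status in control
            if (chance >= t) != (status == 0)
        )
        if errors < best_errors:
            best, best_errors = t, errors
    return best, best_errors

control = [(0.9, 0), (0.8, 0), (0.7, 1), (0.6, 0), (0.3, 1), (0.2, 1)]
threshold, errors = best_threshold(control)
```

Here a threshold of 0.6 misclassifies only the employed associate at 0.7, the minimum for this sample.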
- Consider using Parquet files rather than text.
- Further optimize ModelFunction for accuracy and (potentially) speed.
- Implement Maximum Likelihood Estimator (MLE) method to find the logistic regression parameters b_0 and b_1.
- Currently 90% accurate by week 4 with 850 associates, 170 of which are confirmed or dropped.
- Currently takes about 2.5 min to run.
- Allow output files to be appended rather than overwritten.
- Name columns in the datasets so they don't have the confusing default names of _c0, _c1, _c2...
- Drop unnecessary columns. Coordinate with the ETL/Oozie team so they aren't included in the input at all.
- Consider including assessment category (_c5) if more data is available.
- Sentiment analysis of comments.
- An alternative model and some general advice from dec-17-big-data batch member Timothy Law. Reference: Google Doc