Dashboard #1

Open: wants to merge 2 commits into master from dashboard

1 change: 1 addition & 0 deletions Procfile
@@ -0,0 +1 @@
web: streamlit run --server.enableCORS false --server.port $PORT web_app.py
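Note: this Procfile tells Heroku how to start the app: Streamlit serves `web_app.py` on the platform-assigned `$PORT`, with Streamlit's CORS check disabled (commonly needed when running behind Heroku's routing layer). Locally, the equivalent is simply `streamlit run web_app.py`.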
4 changes: 4 additions & 0 deletions README.md
@@ -84,6 +84,10 @@ As additional stories are collected and labeled, their URL should also be collected

While the model developed here did a good job of classifying news stories, continual advancements in NLP, such as the recent GPT-3, should make detection harder. For example, if GPT-3 were told to generate a news story in the style of a New York Times writer, the result would be hard to detect, since our model does not actually check the accuracy of a story, only its style.

# Deployed Model

View the deployed model, built with Streamlit and hosted on Heroku, [here](https://agile-tor-23064.herokuapp.com/). To view the code specific to the web app, go to the [dashboard](https://github.com/merb92/fake-news-classification/tree/dashboard) branch and refer to the [blog post](https://medium.com/analytics-vidhya/deploy-an-nlp-model-with-streamlit-and-heroku-5f0ae4b9048c).

# For Further Information

Please review the narrative of the analysis in the [Jupyter notebooks](index.ipynb), see the [presentation](fake_news_classification.pdf), or read the related blog articles on the [project as a whole](https://merb92.medium.com/too-good-to-be-true-nlp-c97868c2db55) and on [deploying it](https://medium.com/analytics-vidhya/deploy-an-nlp-model-with-streamlit-and-heroku-5f0ae4b9048c).
1 change: 1 addition & 0 deletions gist_stopwords.txt
@@ -0,0 +1 @@
0o,0s,3a,3b,3d,6b,6o,a,a1,a2,a3,a4,ab,able,about,above,abst,ac,accordance,according,accordingly,across,act,actually,ad,added,adj,ae,af,affected,affecting,affects,after,afterwards,ag,again,against,ah,ain,ain't,aj,al,all,allow,allows,almost,alone,along,already,also,although,always,am,among,amongst,amoungst,amount,an,and,announce,another,any,anybody,anyhow,anymore,anyone,anything,anyway,anyways,anywhere,ao,ap,apart,apparently,appear,appreciate,appropriate,approximately,ar,are,aren,arent,aren't,arise,around,as,a's,aside,ask,asking,associated,at,au,auth,av,available,aw,away,awfully,ax,ay,az,b,b1,b2,b3,ba,back,bc,bd,be,became,because,become,becomes,becoming,been,before,beforehand,begin,beginning,beginnings,begins,behind,being,believe,below,beside,besides,best,better,between,beyond,bi,bill,biol,bj,bk,bl,bn,both,bottom,bp,br,brief,briefly,bs,bt,bu,but,bx,by,c,c1,c2,c3,ca,call,came,can,cannot,cant,can't,cause,causes,cc,cd,ce,certain,certainly,cf,cg,ch,changes,ci,cit,cj,cl,clearly,cm,c'mon,cn,co,com,come,comes,con,concerning,consequently,consider,considering,contain,containing,contains,corresponding,could,couldn,couldnt,couldn't,course,cp,cq,cr,cry,cs,c's,ct,cu,currently,cv,cx,cy,cz,d,d2,da,date,dc,dd,de,definitely,describe,described,despite,detail,df,di,did,didn,didn't,different,dj,dk,dl,do,does,doesn,doesn't,doing,don,done,don't,down,downwards,dp,dr,ds,dt,du,due,during,dx,dy,e,e2,e3,ea,each,ec,ed,edu,ee,ef,effect,eg,ei,eight,eighty,either,ej,el,eleven,else,elsewhere,em,empty,en,end,ending,enough,entirely,eo,ep,eq,er,es,especially,est,et,et-al,etc,eu,ev,even,ever,every,everybody,everyone,everything,everywhere,ex,exactly,example,except,ey,f,f2,fa,far,fc,few,ff,fi,fifteen,fifth,fify,fill,find,fire,first,five,fix,fj,fl,fn,fo,followed,following,follows,for,former,formerly,forth,forty,found,four,fr,from,front,fs,ft,fu,full,further,furthermore,fy,g,ga,gave,ge,get,gets,getting,gi,give,given,gives,giving,gj,gl,go,goes,going,gone,got,gotten,gr,greetings,gs,gy,h,h2,h3,had,hadn,hadn't,happens,hardly,has,hasn,hasnt,hasn't,have,haven,haven't,having,he,hed,he'd,he'll,hello,help,hence,her,here,hereafter,hereby,herein,heres,here's,hereupon,hers,herself,hes,he's,hh,hi,hid,him,himself,his,hither,hj,ho,home,hopefully,how,howbeit,however,how's,hr,hs,http,hu,hundred,hy,i,i2,i3,i4,i6,i7,i8,ia,ib,ibid,ic,id,i'd,ie,if,ig,ignored,ih,ii,ij,il,i'll,im,i'm,immediate,immediately,importance,important,in,inasmuch,inc,indeed,index,indicate,indicated,indicates,information,inner,insofar,instead,interest,into,invention,inward,io,ip,iq,ir,is,isn,isn't,it,itd,it'd,it'll,its,it's,itself,iv,i've,ix,iy,iz,j,jj,jr,js,jt,ju,just,k,ke,keep,keeps,kept,kg,kj,km,know,known,knows,ko,l,l2,la,largely,last,lately,later,latter,latterly,lb,lc,le,least,les,less,lest,let,lets,let's,lf,like,liked,likely,line,little,lj,ll,ll,ln,lo,look,looking,looks,los,lr,ls,lt,ltd,m,m2,ma,made,mainly,make,makes,many,may,maybe,me,mean,means,meantime,meanwhile,merely,mg,might,mightn,mightn't,mill,million,mine,miss,ml,mn,mo,more,moreover,most,mostly,move,mr,mrs,ms,mt,mu,much,mug,must,mustn,mustn't,my,myself,n,n2,na,name,namely,nay,nc,nd,ne,near,nearly,necessarily,necessary,need,needn,needn't,needs,neither,never,nevertheless,new,next,ng,ni,nine,ninety,nj,nl,nn,no,nobody,non,none,nonetheless,noone,nor,normally,nos,not,noted,nothing,novel,now,nowhere,nr,ns,nt,ny,o,oa,ob,obtain,obtained,obviously,oc,od,of,off,often,og,oh,oi,oj,ok,okay,ol,old,om,omitted,on,once,one,ones,only,onto,oo,op,oq,or,ord,os,ot,other,others,otherwise,ou,ought,our,ours,ourselves,out,outside,over,overall,
ow,owing,own,ox,oz,p,p1,p2,p3,page,pagecount,pages,par,part,particular,particularly,pas,past,pc,pd,pe,per,perhaps,pf,ph,pi,pj,pk,pl,placed,please,plus,pm,pn,po,poorly,possible,possibly,potentially,pp,pq,pr,predominantly,present,presumably,previously,primarily,probably,promptly,proud,provides,ps,pt,pu,put,py,q,qj,qu,que,quickly,quite,qv,r,r2,ra,ran,rather,rc,rd,re,readily,really,reasonably,recent,recently,ref,refs,regarding,regardless,regards,related,relatively,research,research-articl,respectively,resulted,resulting,results,rf,rh,ri,right,rj,rl,rm,rn,ro,rq,rr,rs,rt,ru,run,rv,ry,s,s2,sa,said,same,saw,say,saying,says,sc,sd,se,sec,second,secondly,section,see,seeing,seem,seemed,seeming,seems,seen,self,selves,sensible,sent,serious,seriously,seven,several,sf,shall,shan,shan't,she,shed,she'd,she'll,shes,she's,should,shouldn,shouldn't,should've,show,showed,shown,showns,shows,si,side,significant,significantly,similar,similarly,since,sincere,six,sixty,sj,sl,slightly,sm,sn,so,some,somebody,somehow,someone,somethan,something,sometime,sometimes,somewhat,somewhere,soon,sorry,sp,specifically,specified,specify,specifying,sq,sr,ss,st,still,stop,strongly,sub,substantially,successfully,such,sufficiently,suggest,sup,sure,sy,system,sz,t,t1,t2,t3,take,taken,taking,tb,tc,td,te,tell,ten,tends,tf,th,than,thank,thanks,thanx,that,that'll,thats,that's,that've,the,their,theirs,them,themselves,then,thence,there,thereafter,thereby,thered,therefore,therein,there'll,thereof,therere,theres,there's,thereto,thereupon,there've,these,they,theyd,they'd,they'll,theyre,they're,they've,thickv,thin,think,third,this,thorough,thoroughly,those,thou,though,thoughh,thousand,three,throug,through,throughout,thru,thus,ti,til,tip,tj,tl,tm,tn,to,together,too,took,top,toward,towards,tp,tq,tr,tried,tries,truly,try,trying,ts,t's,tt,tv,twelve,twenty,twice,two,tx,u,u201d,ue,ui,uj,uk,um,un,under,unfortunately,unless,unlike,unlikely,until,unto,uo,up,upon,ups,ur,us,use,used,useful,usefully,usefulness,uses,using,usually,ut,v,va,value,various,vd,ve,ve,very,via,viz,vj,vo,vol,vols,volumtype,vq,vs,vt,vu,w,wa,want,wants,was,wasn,wasnt,wasn't,way,we,wed,we'd,welcome,well,we'll,well-b,went,were,we're,weren,werent,weren't,we've,what,whatever,what'll,whats,what's,when,whence,whenever,when's,where,whereafter,whereas,whereby,wherein,wheres,where's,whereupon,wherever,whether,which,while,whim,whither,who,whod,whoever,whole,who'll,whom,whomever,whos,who's,whose,why,why's,wi,widely,will,willing,wish,with,within,without,wo,won,wonder,wont,won't,words,world,would,wouldn,wouldnt,wouldn't,www,x,x1,x2,x3,xf,xi,xj,xk,xl,xn,xo,xs,xt,xv,xx,y,y2,yes,yet,yj,yl,you,youd,you'd,you'll,your,youre,you're,yours,yourself,yourselves,you've,yr,ys,yt,z,zero,zi,zz
103 changes: 103 additions & 0 deletions model_helper_functions.py
@@ -0,0 +1,103 @@
from sklearn.metrics import classification_report, plot_confusion_matrix
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize


def passthrough(doc):
    """Passthrough function for use in the pipeline because the text is already tokenized"""
    return doc

def confusion_matrix_and_classification_report(estimator, X, y, labels, set_name):
    """
    Display a Classification Report and Confusion Matrix for the given data.
    """

    predictions = estimator.predict(X)

    print(f'Classification Report for {set_name} Set')
    print(classification_report(y, predictions, target_names=labels))

    # Confusion matrix with raw counts
    matrix = plot_confusion_matrix(estimator,
                                   X,
                                   y,
                                   display_labels=labels,
                                   cmap=plt.cm.Blues,
                                   xticks_rotation=70,
                                   values_format='d')
    matrix.ax_.set_title(f'{set_name} Set Confusion Matrix, without Normalization')

    plt.show()

    # Confusion matrix normalized over the true labels
    matrix = plot_confusion_matrix(estimator,
                                   X,
                                   y,
                                   display_labels=labels,
                                   cmap=plt.cm.Blues,
                                   xticks_rotation=70,
                                   normalize='true')
    matrix.ax_.set_title(f'{set_name} Set Confusion Matrix, with Normalization')

    plt.show()

class LemmaTokenizer:
    """Callable tokenizer that lemmatizes each token with WordNet."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in doc]

def remove_stopwords(doc):
    """Remove the stopwords from the input document"""
    stop_words = stopwords.words('english')
    return [token for token in doc if ((token not in stop_words) and (token.lower() not in stop_words))]

def lowercase_tokens(doc):
    """Lowercase all letters in doc"""
    return [token.lower() for token in doc]

def lowercase_and_remove_stopwords(doc):
    """Remove stopwords and lowercase tokens"""
    stop_words = stopwords.words('english')
    return [token.lower() for token in doc if token.lower() not in stop_words]

def lower_unless_all_caps(string_):
    """
    Make all words in the input string lowercase unless that
    word is in all caps
    """
    words = string_.split()
    processed_words = [w.lower() if not (w.isupper() and len(w) > 1) else w for w in words]
    return ' '.join(processed_words)

def remove_single_characters(word_list, exception_list):
    """Remove all the single characters, except those on the exception list"""
    return [w for w in word_list if (len(w) > 1 or w in exception_list)]

def remove_words(word_list, words_to_remove):
    """Remove all the words in the words_to_remove list from the word_list"""
    return [w for w in word_list if w not in words_to_remove]

def tokenize_and_normalize_title_and_text(title, text):
    """Combine, tokenize, and normalize the title and text of a news story"""

    URL_REGEX = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    TWITTER_HANDLE_REGEX = r'(?<=^|(?<=[^\w]))(@\w{1,15})\b'
    DATE_WORDS = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday',
                  'saturday', 'sunday', 'january', 'february', 'march', 'april',
                  'may', 'june', 'july', 'august', 'september', 'october',
                  'november', 'december']

    title_text = ' '.join([title, text])
    title_text = re.sub(URL_REGEX, '{link}', title_text)  # collapse URLs to a placeholder
    title_text = re.sub(TWITTER_HANDLE_REGEX, '@twitter-handle', title_text)  # collapse handles
    title_text = lower_unless_all_caps(title_text)
    title_text = re.sub(r'\d+', ' ', title_text)  # drop digits
    title_text = re.sub(r'\(reuters\)', ' ', title_text)  # drop the Reuters byline tag
    tokens = word_tokenize(title_text)
    tokens = remove_single_characters(tokens, ['i', '!'])
    tokens = remove_words(tokens, ["'s"])
    tokens = remove_words(tokens, DATE_WORDS)

    return tokens
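For reviewers, a minimal sketch of what the new tokenizer produces; the title and text strings below are invented for illustration, and running it needs the NLTK `punkt` data:

```python
# Illustrative only: exercise the new tokenizer on made-up inputs.
from model_helper_functions import tokenize_and_normalize_title_and_text

title = "BREAKING: Senate Vote Set For Tuesday"
text = "Details at https://example.com/story, said @SomeReporter on Wednesday."
print(tokenize_and_normalize_title_and_text(title, text))
# URLs are replaced with '{link}', Twitter handles with '@twitter-handle',
# digits and weekday/month words are dropped, and all-caps words such as
# 'BREAKING' keep their case while everything else is lowercased.
```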
1 change: 1 addition & 0 deletions nltk.txt
@@ -0,0 +1 @@
punkt
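Note: Heroku's Python buildpack reads `nltk.txt` at build time and downloads the listed NLTK data packages; `punkt` is needed by the `word_tokenize` call in `model_helper_functions.py`.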
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
nltk==3.4.5
matplotlib==3.1.1
pandas==1.0.3
streamlit==0.69.2
numpy==1.16.5
scikit_learn==0.23.2
1 change: 1 addition & 0 deletions runtime.txt
@@ -0,0 +1 @@
python-3.6.9
77 changes: 77 additions & 0 deletions web_app.py
@@ -0,0 +1,77 @@
# Imports
import streamlit as st
import pickle
import pandas as pd
import numpy as np

import model_helper_functions

# Constants
MODEL_PATH = './'
MODEL_FILE_NAME = 'rf_tfidf_plus_guardian_model.sav'
RANDOM_STATE = 42
DATA_PATH = './'

# Local Model Helper Function and Stopwords list
with open(DATA_PATH + "gist_stopwords.txt", "r") as gist_file:
    expanded_stopwords = gist_file.read().split(",")

# Remove a few entries so the model does not treat them as stopwords
expanded_stopwords.remove('via')
expanded_stopwords.remove('eu')
expanded_stopwords.remove('uk')

def lowercase_and_only_expanded_stopwords(doc):
    """Lowercase tokens and keep only those in the expanded stopwords list"""
    return [token.lower() for token in doc if token.lower() in expanded_stopwords]

# Load pipeline
@st.cache(allow_output_mutation=True)
def load_pipeline(model_path=MODEL_PATH, model_file_name=MODEL_FILE_NAME):
    """
    Load the Text Processing and Classifier Pipeline
    """
    with open(model_path + model_file_name, 'rb') as f:
        return pickle.load(f)

pipeline = load_pipeline()


st.title('News Classification')

st.write("""
Enter the title and text of a news story and a trained random forest
classifier will classify it as Truthful or Fake. Please note that the
algorithm is not checking the facts of the news story, it is basing
the classification on the style of the text of the story; specifically, it
is basing the classification only on the stop words (common words) in
the story and its title.
""")

news_title = st.text_input('Enter a News Title')

if news_title:
    news_story = st.text_area('Enter a News Story', height=400)

    if news_story:
        tokens = model_helper_functions.tokenize_and_normalize_title_and_text(news_title, news_story)
        stop_words_only = lowercase_and_only_expanded_stopwords(tokens)
        if len(stop_words_only) == 0:
            st.write('There were no stopwords in your news title and story.')
        else:
            class_ = pipeline.predict([tokens])[0]
            class_text = 'Fake' if class_ == 0 else 'Truthful'

            probability = round(pipeline.predict_proba([tokens])[0][class_] * 100, 2)
            st.subheader('Classification Results')
            st.write(f'Your news story is classified as {class_text} with a {probability}% probability.')
            st.subheader('Your news story with only stop words:')
            st.write(' '.join(stop_words_only))
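For completeness, a sketch of the same predict/predict_proba path outside the Streamlit UI; the model file name and the 0 = Fake class mapping come from the code above, and the pickled pipeline is assumed to be in the working directory:

```python
# Sketch of the prediction path without Streamlit (assumes the pickled
# pipeline file referenced in web_app.py is present on disk).
import pickle

import model_helper_functions

with open('rf_tfidf_plus_guardian_model.sav', 'rb') as f:
    pipeline = pickle.load(f)

tokens = model_helper_functions.tokenize_and_normalize_title_and_text(
    'An Example Title', 'An example news story body.')
class_ = pipeline.predict([tokens])[0]                     # 0 = Fake, 1 = Truthful
probability = pipeline.predict_proba([tokens])[0][class_]  # confidence in that class
print('Fake' if class_ == 0 else 'Truthful', f'{probability:.2%}')
```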