- Good Outcome: Create a website that takes in any text, finds misinformation in it using data from an authoritative source (we used “Politifact” for this), and highlights it.
- Better Outcome: Being a bit more ambitious, we also wanted to create a Chrome extension that parses the text on the web pages the user is browsing and flags misinformation in the same way.
- Best Outcome: If time permitted, we also wanted to extend the matching server to look for “similar” statements in addition to exact matches.
Our architecture has three parts: the client, the server, and the database. Before the server boots, a web scraper builds the database from Politifact.com; we rerun the scraper periodically, whenever we feel the server's database should be updated. When the server boots, it reads everything in the database and builds a search index from it. The client then sends the statements it wants checked to the server (a search engine service), which responds with the results of those searches.
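The boot-and-search flow can be sketched roughly as follows. This is a minimal Python illustration, not our server code (the real server is written in Java); the column names `statement`, `url`, and `misinfo_type`, the `FactCheck` class, and the sample rows are all made up for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class FactCheck:
    """Value stored per statement: the fact-check URL and the misinformation type."""
    url: str
    misinfo_type: str


# Made-up rows standing in for the scraped .csv database (hypothetical columns).
SAMPLE_CSV = """statement,url,misinfo_type
The moon is made of cheese,https://example.com/fact-check/1,False
Water boils at 50 degrees Celsius at sea level,https://example.com/fact-check/2,Mostly False
"""


def build_index(csv_text: str) -> dict[str, FactCheck]:
    """Read every row once at start-up and key it by its statement text."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[row["statement"]] = FactCheck(row["url"], row["misinfo_type"])
    return index


def search(index: dict[str, FactCheck], statements: list[str]) -> dict[str, FactCheck]:
    """Return only the statements that have an exact match in the index."""
    return {s: index[s] for s in statements if s in index}


index = build_index(SAMPLE_CSV)
results = search(index, ["The moon is made of cheese", "The sky is blue"])
for statement, hit in results.items():
    print(f"{statement!r} -> {hit.misinfo_type} ({hit.url})")
```

Because the index is built once and each lookup is a single dictionary access, the per-statement cost does not grow with the size of the database.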
When first creating the program, we had three problems to deal with.
- Obtaining the necessary data from Politifact. We did this with a Python web scraper that processes each page of Politifact's most recent articles and writes the results to a .csv file, which serves as our database. Since the pages we were scraping were somewhat complicated, we used BeautifulSoup4 to parse the HTML and make it easier to find the data we needed. Fortunately, Politifact's consistent use of divs made this easy for us. The statements were also processed and adjusted to make them easier for the server to read.
- Creating the search server. The server is essentially a search engine that takes a string and searches for it in our prepared database. In this version the search is an exact string match, but it can be extended to similar matches. On booting up, the server reads the data from the CSV file, processes it, and stores each misinformation statement in an easily searchable in-memory index. To keep lookups cheap, we used a HashMap. To scale the server we used multithreading, and to avoid the concurrency issues that brings, we switched to a ConcurrentHashMap. Because a map stores a single value per key, we created a class that holds both the URL and the Misinformation Type, so that one object is associated with each key. Finally, using the Sun HTTP package, the server receives data from the client and sends responses in a fast, multithreaded way.
- Creating the client. The client is a simple HTML website that lets end users type any content into a text box to test it for misinformation. The text is split into sentences (on both the client and the server side) using RegEx. The text is then sent using synchronous XMLHttpRequest (XHR) calls to a URL in this format: [server].com/query?s=[URI-encoded-text]. The server responds with the relevant data (the URL and type of misinformation) in JSON, which the client parses and uses. Splitting the sentences was difficult because we had to ensure we split the same way on both the server and the client. The challenge was that the client's sentences may be surrounded by other, unknown text. To minimize this issue, we removed every character that is neither alphanumeric nor whitespace, such as '%' and '!', both when searching and when creating the HashMap.
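The client-side steps above (split into sentences, strip punctuation, build the query URL) might look something like the sketch below. It is written in Python for illustration only, since the actual client is JavaScript. The splitting regex, the whitespace collapsing in `normalize`, the one-request-per-sentence assumption, and the `example.com` server name are our assumptions here, not the exact rules the project uses.

```python
import re
from urllib.parse import quote


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a simplification)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence: str) -> str:
    """Remove characters that are neither alphanumeric nor whitespace.

    Collapsing runs of whitespace afterwards is an extra assumption of this sketch.
    """
    kept = re.sub(r"[^A-Za-z0-9\s]", "", sentence)
    return " ".join(kept.split())


def query_url(server: str, sentence: str) -> str:
    """Build a URL of the form [server]/query?s=[URI-encoded-text]."""
    return f"{server}/query?s={quote(normalize(sentence), safe='')}"


text = "Taxes went up 50%! Really? Yes."
sentences = split_sentences(text)
urls = [query_url("https://example.com", s) for s in sentences]
for url in urls:
    print(url)
```

Applying the same `normalize` step to the keys when the index is built and to each incoming query is what lets an exact-match lookup tolerate stray punctuation around a sentence.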
The Chrome extension was quite challenging, and as a result, while the extension is functional, it is not ready for everyday use. To build a Chrome extension, we had to learn how browser extensions work: how they access the browser's content and how they manipulate the HTML structure so the end user can see the additional information we wanted to add to the page they were browsing. The Chrome extension essentially acts as a different sort of client, so it has to send data to and receive data from the server in the same way.
The first challenge of the Chrome extension was actually reading all the text from a page. While this might seem easy at first, modern websites often use a technique called AJAX to fetch data from the server after the page has loaded, so some of the text is not in the initial HTML. We initially tried to read the HTML DOM using only regular JavaScript, but in the end decided to redo all the extension code using jQuery to deal with this issue. jQuery's DOM manipulation tools make it easier to detect things like DOM subtree modifications, allowing us to see new HTML that arrives from the server via an AJAX request.
The second problem was ensuring the browser wouldn't hang. We initially used synchronous XMLHttpRequest (XHR) calls to get data from the server, but in a Chrome extension this causes the website to hang until all the requests are done. Since our server is running on just one dyno on Heroku, each request can take a long time to be handled. This would not work from a usability perspective: no user is going to wait 10-15 seconds just for one loaded page to be processed. Therefore, we decided to use AJAX asynchronously instead, which keeps the website responsive while the extension waits for the server's response.
The third issue was dealing with CORS. Browsers restrict which responses a page can read from a different domain, and CORS is the mechanism a server uses to say which origins are allowed. This is done for security, but as a result, we had to ensure our server was adding the correct Access-Control-Allow-Origin header to its responses. This took time since we had to learn what CORS was, identify the specific issue, and figure out how to solve it. In the end, we just added a CORS header to every response sent by our server saying it could be read from any origin, meaning our client could be on any domain and still receive the server's response.
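To illustrate the fix, here is a small, self-contained sketch of a server that adds the wildcard CORS header to every response. It uses Python's standard `http.server` as a stand-in; our real server is Java, and the `QueryHandler` name and the `{"matches": []}` response body are hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder body; the real server would return the search results.
        body = json.dumps({"matches": []}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # "*" tells the browser that a page on any origin may read this response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the OS for any free port; the server runs in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), QueryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Make one request and inspect the header a browser would check.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/query?s=test")
resp = conn.getresponse()
cors_header = resp.getheader("Access-Control-Allow-Origin")
payload = json.loads(resp.read())
conn.close()

server.shutdown()
server.server_close()
print(cors_header)
```

A wildcard is the simplest choice for a public, read-only lookup service like this one; a server that handled credentials or private data would normally list specific origins instead.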
The Chrome extension works on local, short HTML files. However, it runs into issues on large, external sites, and we believe fixing these would require significant changes to the code. Unfortunately, we do not have enough experience or knowledge in web development to fully understand why we are running into these issues (they are likely related to CORS and other techniques real web developers use, to performance issues caused by memory problems, or to both). To test the extension for now, we created a sample.html file, which is in the Chrome extension folder. Steps to run and test this are in the README.