Skip to content

Latest commit

 

History

History
60 lines (43 loc) · 8.52 KB

README.md

File metadata and controls

60 lines (43 loc) · 8.52 KB

A multithreaded Tomcat web-app multiple heuristically informed searches that generates a word cloud from the top (n) words associated with a chosen search term and categorizes the search term using a Neural Network

Contents

Running the Program

Requirements for Running Locally

  • Java 8 (Untested on alternative versions of Java)

Libraries and Development Tools

How to Run

  • Project .jar files were not generated considering the project .war file is compiled and built via Tomcat. For information on how to compile and deploy the .war archive see How to Deploy a WAR file to Tomcat

Alternatively

  • The Program may be accessed via Heroku. (May take a couple of seconds to load initially, application is set to sleep)

Overview

The goal of the project is to develop a multithreaded AI search application that can generate a word cloud from the top (n) words associated with an internet search term.

The program takes a Query Word from the user and using Jsoup, returns the top resulting web-pages associated with that word. Jsoup then scrapes the content of those pages and heuristically scores the page content using multiple scoring functions and JFuzzyLogic allowing for smart scoring and word-frequency matching. The content contained within the top heuristically scoring pages is then processed and mapped to it's frequency of occurrence.

The result of the frequency matching consists of the top (n) words associated with the user's query. The resulting top words then get returned to the user in image form, with the most frequent being the largest and the lesser words scaled representing their frequency in a spiral formation.

Implementation

The project is driven using both a Recursive Beam Search & Recursive Best First Search, both searches being implemented as greedy searches meaning they are driven solely on nodal heuristic value.

  • Best First Search

    • Generates and scores children nodes on the fly and places those nodes into a priority queue ordered by the score of the child nodes, meaning the best child will always be polled first. The process of scoring a node involves checking the URL, title, heading and paragraph tags for occurrences of the query word and administering appropriate scores respective of the location of the query word location within that webpage. Once the score is tallied, the logarithmic value is obtained and that gets passed into relevant Fuzzy Inference System along with the depth level of the child where the relevance of the child is evaluated by the ruleset and defuzzifier.

      Once all children of the initial node have been scored and placed into the queue, the queue is polled and the highest scoring child removed, but not before the contents of the page are mapped and more children are generated from this node. This process repeats and the node generation function is called recursively on the new best scoring node in the queue until a stop condition is reached. A closed list is used to make sure pages aren't revisited.

      Best First Search turned out to be an ideal algorithm considering the given search space, especially since more nodes than the Beam Search will be polled and mapped, allowing for a wider range of results rather than a small sample size.

      Executed in linear space with complexity of O(bd) and time complexity of O(bd).
  • Beam Search

    • This Beam Search implementation works by generating and scoring (n) nodes initially and keeping the best two nodes in a reversed priority queue, meaning unlike R-BFS, the lowest node will always be polled first. If a queue with two nodes with the values [95, 50] receives a new node with the value of [100], the smallest scoring node which will always be the last node placed in the queue is removed and discarded, the new queue is now [100, 95] giving it a LIFO structure.

      Only the nodes on the queue used to generate children, if one of these nodes has a child with a score greater than that of the last node into the queue, offer it to the queue otherwise and check the new node for any children better. Like R-BFS, this search has its own unique Fuzzy Inference System. The recursive process is repeated until either a stop condition is encountered or both nodes in the queue have been polled and neither of their children have scored high enough to be placed into the queue. Like the Best First Search, a closed list is used to keep track of visited pages.

      Beam search is a very strong search, but while it can be good it does have its drawbacks. Beam can potentially chuck away a very relevant node and return sub-optimal results, but could also be fortunate enough to identify and traverse a very promising path faster than Recursive Best First Search, even if it's not the absolute optimal path. When testing Beam and Best First Search with a timer the bigger the search space the bigger the time gap between the two searches. Beam works best with an unambiguous space.

      If no stop conditions were implemented, the time for Beam to complete would completely depend on the accuracy of the heuristic, giving it a worst case time complexity of going straight to the bottom of the tree O(bd). Since beam only stores n nodes at each level, the worst case space complexity is O(bd). The linear memory consumption means beam can probe fairly deep into larger spaces and find solutions other searches may not reach.
  • Fuzzy Implementation

Recursive Best First Search For a detailed description of the Fuzzy Implementation please see the fuzzy file

Beam Search For a detailed description of the Fuzzy Implementation please see the fuzzy file

  • Word Classification using Encog

    • Using Encog Core 3.4, a Neural Network was created and saved locally. The Neural Network was trained and tested on locally fed data that was written to ensure it yielded reasonable classification output. The Neural Network has one hidden layer and uses the resilient propagation optimization algorithm, which uses a specific update value for every neuron connection, the values are automatically determined unlike the learning rate of back-propagation algorithms.

      The Neural Network allows for the categorization of the query word, the 5 most associated words with each of the 8 possible categories of classification are compared against the final word frequency map and scored depending on the occurrence of that word anywhere in the node and amplified depending on the frequency. A normalized result is generated, scaled between a value of 0 and 3, and then fed into the neural network via an 8 indexed array. Based on the data it was trained with, the Neural Network looks at the input and attempts to provide a result based on the input data and the data it was previously trained on.
  • Executing the Program

    • The application allows for the user to select the Branching Factor, Max Depth, Search Algorithm and the amount of words to return via the word cloud. Once a search is executed, the search for the most relevant nodes begin, once the conditions have been met and the search runs it course, the nodal content is scored, mapped and returned to the Service Handler where, using Base64 and a Logarithmic Spiral Placer, the words are placed on a canvas with sizes representing their frequencies. The image is then displayed with details about the specific search presented to the user.