Skip to content

Latest commit

 

History

History
408 lines (325 loc) · 18.4 KB

README.md

File metadata and controls

408 lines (325 loc) · 18.4 KB

LOGM 655 - Text Mining

Course Syllabus - Fall 2018

MW 1200-1330 Bldg. 640 Rm. 326 (3.0 Credit Hours)

Welcome!

Text mining, or natural language processing (NLP) is the organization, classification, labeling and extraction of information from structured and unstructured text data. In these days of more information readily available through the internet, analysts and decision makers find themselves overloaded with data. Text mining is an application which can help analysts glean necessary information either for general understanding about a corpus of text documents, or for putting text into a form useful for the application of alternative analysis techniques.

This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools as well as its general mindset. Students will learn concepts, techniques and tools they need to deal with text mining practice in a Joint military context.

Course Learning Objectives

The primary objective of this course is to help students understand and implement the techniques/processes of text mining to make better decisions. More specifically, at the completion of the course, students should be able to:

  • Understanding when, where, and how to perform text mining to appropriate problems and data sets.
  • Code and understand code using relevant R packages in text mining.
  • Lead and work within a group of researchers analyzing a specific text mining problem.
  • Use software to analyze text data and communicate the results of their analyses.

Pre-requisites & Co-requisites

Unless otherwise waived by the course instructor, students are required to have completed the following courses prior to enrolling in LOGM 655 - Text Mining.

  • None

Requests to waive these required courses must be made in accordance with AFIT EN policy guidelines and must be approved by the course instructor. Waivers my be approved for students that have has successfully completed a similar courses or if special arrangements are made to meet AFIT requirements.

Instructor and Contact Information

Course Resources (Required & Optional)

There is no required text for this course however, there are a number of freely available resources that I will refer to throughout the quarter. These resources are listed below (with links provided). Other references will be provided using the course Github Page

Grading Policy & Course Deliverables

Your final course grade will be determined according to the following requirements and their respective weights. These deliverables are described at the bottom of this syllabus.

In this course, grades will be assigned according to the values shown in the table below. The following sections detail the requirements of each graded deliverable.

Numerical grade range Corresponding letter grade
(1.00 - 0.93] A
(0.93 - 0.90] A-
(0.90 - 0.87] B+
(0.87 - 0.83] B
(0.83 - 0.80] B-
(0.80 - 0.77] C

Github

Information about course requirements distributed and collected using GitHub. Thus, each student will first need to create an account on GitHub, if they don’t already have one.

Tip: Don’t use your awesome.student@afit.edu email address to create your account. Use an email address that you’ll have access to after you graduate. If desired, you can add your AFIT address as a secondary email.

After you’ve created your account on GitHub, click HERE to access the course “Roll Call”. Read and follow the instructions to post your comment so I know that you’ve successfully created an account.

Software/Computer Programming

A key component of this course involves developing the skills and knowledge to create reproducible & dynamic data products to present results from your research. In previous offerings of this course, I allowed students to use any software package to complete their assignments. This became difficult, for the students to complete their work and for me to grade them. So, I’ve decided to require you to use the R programming language to complete and submit your assignments.

I realize that some of you may be new to coding or may have never coded before. Don’t worry, you don’t need an extensive background in R, , or HTML to be successful in this course. Each week I’ll provide you with LOTS of code examples that you can copy/paste to help you get more familiar with the software. Also, many created several demo presentations to get you up to speed and I’m always willing to help out when needed. The first link provided for Week 1 in the schedule below walks you through the process of getting the R/RStudio tool-chain installed and ready for the course.

Each of these tools listed below will be used this quarter:

  • R Project for Statistical Computing
  • RStudio IDE
  • Git/Github
  • LaTeX/Mathjax
  • Pandoc Markdown
  • HTML5, CSS3, and JavaScript (don’t need to know these - already built in!)

Course Schedule

The course schedule below is TENTATIVE and is therefore subject to change. Any changes made to this schedule will be clearly communicated by the instructor.

Week Dates Lesson Description Learning Material Presenter
1 10/01-10/05 Course Intro/Software Intro DataCamp & Github Maj Freels
2 10/08-10/12 Extracting Data from Documents Maj Freels
3 10/15-10/19 Regular expressions & text parsing Maj Freels
4 10/22-10/26 Methods for word relationship analysis Maj Freels
5 10/28-11/02 Sentiment analysis Maj Freels
6 11/05-11/09 Topic modeling Student Led Student Group
7 11/12-11/16 Parts of speech tagging Student Led Student Group
8 11/19-11/23 Document summarization Student Led Student Group
9 11/26-11/30 Named entity recognition Student Led Student Group
10 12/03-12/07 In class project work None Maj. Freels
11 12/10-12/13 Student project reports None Student Group

Other Noteworthy Dates

In addition to the course schedule, you should take note of the following dates.

  • 08 Oct - Columbus Day (No Classes)
  • 12 Nov - Veteran’s Day (No Classes)
  • 22 Nov - Thanksgiving Day (No Classes)
  • 23 Nov - AETC Family Day (Classes will meet - TBD)
  • 07 Dec - Fall Quarter Classes End
  • 10 Dec - Final Exam Week Begins
  • 13 Sep - Final Exam Week Ends

Important Policy Statements

Academic Integrity Policy Statement

All students must adhere to the highest standards of academic integrity. Students are prohibited from engaging in plagiarism, cheating, misrepresentation, or any other act constituting a lack of academic integrity. Failure on the part of any individual to practice academic integrity is not condoned and will not be tolerated. Individuals who violate this policy are subject to adverse administrative action including disenrollment from school and disciplinary action. Individuals subject to the Uniform Code of Military Justice may be prosecuted under the UCMJ. Violations by government civilian employees may result in administrative disciplinary action without regard to otherwise applicable criminal or civil sanctions for violations of related laws. (References: Student Handbook, ENOI 36-107, Academic Integrity)

Attendance Policy Statement

Attendance at all class sessions and exams is mandatory for military and civilians assigned to AFIT as full-time students except for extenuating circumstances. Part-time students are expected to attend scheduled classes, and absences should be explained to the instructor. The student should provide advance notice, if possible. Scheduled classes and exams are defined by the instructor and they are documented in the course schedule. (References: Student Handbook, Graduate School Catalog)

Academic Grievance Policy Statement

AFIT and the Graduate School of Engineering and Management affirm the right of each student to resolve grievances with the Institution. Students are guaranteed the right of fair hearing and appeal in all matters of judgment of academic performance. Procedures are detailed in ENOI 36-138, Student Academic Performance Appeals.

Additional Notes

My Teaching Philosophy

As AFIT graduates, you should be expected to know how to approach and solve real-world problems AND present your results in a meaningful way so that decision makers can make defensible decisions.

As AFIT instructors, we do a disservice to our students by not teaching new and improved ways to produce and share your results. Further, we do a disservice by teaching you to solve problems using tools that you won’t have access to after leaving AFIT. Therefore, I re-built this course using the R/RStudio tool-chain to help you produce better results…faster.

Challenge your instructor

Challenging your instructor can often be a good thing. If you can’t trip me up in this class from time to time, you’re not trying. Discussion leads to a more interesting class, so questions are always good.

Course deliverables

In-class group activities

Many data science projects cannot be completed by one person - especially if the project has a deadline. Thus, it’s important that every analyst be able to work with other analysts in a team - and even with other teams of analysts. In this course students will exercise their teamwork skills by completing two in-class activities as described below. Teams must work with each other and must develop their method for working successfully within a group.

For both of these activities students will be separated into four teams

  • the number of students in each team will depend on the number of students registered for the course. Each team will address one of the following text mining goals:

    • Topic modeling
    • Parts of speech tagging
    • Document summarization
    • Named entity recognition

Student teams will be assigned to lead class discussions for both class meetings during one week of the quarter - the specific weeks are specified in the course syllabus. What the teams should during these class meetings is described below. My goal is to balance the number of students in each team - which means that some students may not be assigned to their prefereed topic. However, preference will be given to students who can show that they are working on thesis projects related to one of these topic areas.

Activity #1 Overview of Current Methods (20 pts):

During the first class meeting of their assigned week, teams will present a detailed overview of the current methods used to accomplish their chosen goal. Teams should present the concept and mathematical theory behind several methods used to accomplish their chosen text mining goal. Teams can use any publicly available resource, so long as they cite them in their presentation.

This activity is worth 20% of your overall course grade. Student grades for this activity will be graded based on (1) completion of the assignment, and (2) the degree to which each student contributed to the team in completing this activity. I will assign this value based a survey each student will fill out to describe the level of participation for each member of the team. The survey will be used to distribute the grade appropriately as demonstrated by your effort towards assisting the group accomplish the assignment. Survey scores will be based on a scale from 1 to 5, where a ‘5’ means that the student was very involved in the effort and made important contributions a ‘1’ means that a student was not involved and made few (if any) contributions.

Activity #2 R Programming Tutorial (20 pts):

During the second class meeting of their assigned week, teams will give a tutorial on how to use the R programming language to implement the text mining methods and accomplish the goal. There are a number of excellent resources to help with this project.

Students will cite all sources from which they obtained any examples or data used in their tutorial and will clearly indicate which packages they used obtain their results.

This activity is worth 20% of your overall course grade. Student grades for this activity will be graded based on (1) completion of the assignment, and (2) the degree to which each student contributed to the team in completing this activity. I will assign this value based a survey each student will fill out to describe the level of participation for each member of the team. The survey will be used to distribute the grade appropriately as demonstrated by your effort towards assisting the group accomplish the assignment. Survey scores will be based on a scale from 1 to 5, where a ‘5’ means that the student was very involved in the effort and made important contributions a ‘1’ means that a student was not involved and made few (if any) contributions.

The group assignments are as shown in the table below

Parts of Speech Tagging Named Entity Recognition Topic Modeling Document Summarization
Walton Beveridge Kline Speranza
Hyder Kane Westerman Kilgore
NA Kolano St. Peter Buck
NA NA NA Gold

Final project

This course requires completion of an individual project that implements one or more of the the methods presented in the class. The intent of the project is to apply one or multiple concepts learned within the class to an actual data set in order to perform knowledge discovery. The project must be approved by Maj Freels, NLT 19 October 2018. Students may use an application within their thesis project for work within this class. The data set used for this project can be either pre-formatted text or something in textual form. At a bare minimum, the text should include more than 5,000 words and should be in English. While these requirements are minimums, the intent is that the problem is analyzed according to the concepts presented in this course and that the results are articulated in a clear manner.

Deliverables

This project is worth 60% of your total grade for this class. You will provide two deliverables.

  1. Project IPR - Due Friday, 3 November (10 pts) - This should be a focused 4-5 slide presentation (no more than 5 slides). The goal of these slides is to describe what you are doing, why and where you are in your analysis. You will not present this presentation; make sure the slides can speak for themselves (no use of notes in powerpoint – just the slides). At a minimum, the slides should include the following:
  • Title Slide
  • Purpose
    • What are you doing?
    • Why?
  • Background
    • Context into your analysis
    • Insight into your topic with rudimentary analysis
  • Interim Analysis
  • Work not used in any other class previously
  • Discuss Next Steps
  1. Project Report - Due Friday, 8 December (50 pts) - For this part of the project assignment, provide a written report describing your analysis of your text and how it informs on the problem laid out in your project proposal. Focus on insights, trends and analysis. This is where you are allowed to put the work you’ve done, but remember this document should flow. This document (not including appendices) should be no longer than 20 pages (single spaced), with any charts/graphs relevant to the discussion imbedded and referenced in the body of the document. Bulky or large figures or charts and examples of your code should be put into the appendix(cies). The format of the report should have at least the following headings:
  • Abstract
  • Problem Background (ending with a clear statement of the problem)
  • Methodology (describe the data you used and your analysis)
  • Findings and Conclusions (to include any future work)

Think of this report as something that could be read by a research sponsor. All reports will be made available to the rest of the class (unless there are circumstances that might prevent this).