---
title: "4 years of The Hacker News, in 5 charts"
output: html_document
---
## Introduction
[Hacker News](https://news.ycombinator.com/) is one of my favorite sites for catching up on technology and startup news, but navigating the minimalist website can sometimes be tedious. In this post I want to show, in as non-technical a fashion as I can, how this social news site can be analyzed, present some initial results, and share a few ideas about where to take the analysis next.
To avoid dealing with SQL, I downloaded a Hacker News dataset from [David Robinson's website](http://varianceexplained.org/); it contains one million Hacker News article titles from September 2013 to June 2017.
To begin, let's look at a visualization of the most common words in Hacker News titles.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)
library(lubridate)
library(tidytext)
library(stringr)
library(ggplot2)
library(ggthemes)
library(reshape2)
library(wordcloud)
library(tidyr)
library(igraph)
library(ggraph)
```
```{r}
hackernews <- read_csv("stories_1000000.csv.gz") %>%
  mutate(time = as.POSIXct(time, origin = "1970-01-01"),
         month = round_date(time, "month"))
```
## Some initial simple exploration
Before we get into the statistical analysis, the first step is to look at the most frequent words that appeared on Hacker News titles from September 2013 to June 2017.
```{r}
hackernews %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 8000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title = "The Most Common Words in Hacker News Titles") +
  theme_fivethirtyeight()
```
For the most part, this is a fairly standard list of common words for Hacker News titles. The top word is "hn", because "Ask HN" and "Show HN" posts are part of the social news site's structure. The words that follow, such as "google", "data", "app", "web" and "startup", are all what we would expect from a site like Hacker News.
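To see where that top word "hn" comes from, here is a quick, rough check of how often titles start with "Ask HN" or "Show HN" (this assumes those posts keep their usual title prefixes):
```{r}
# Rough count of "Ask HN" and "Show HN" posts, matched case-insensitively
# at the start of the title.
hackernews %>%
  mutate(post_type = case_when(
    str_detect(title, regex("^ask hn", ignore_case = TRUE)) ~ "Ask HN",
    str_detect(title, regex("^show hn", ignore_case = TRUE)) ~ "Show HN",
    TRUE ~ "Other"
  )) %>%
  count(post_type, sort = TRUE)
```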
```{r}
tidy_hacker <- hackernews %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words)
```
```{r}
tidy_hacker %>%
  count(word, sort = TRUE)
```
## Simple Sentiment Analysis
Let's address the topic of sentiment analysis. Sentiment analysis detects the sentiment of a body of text in terms of polarity (positive or negative). When used, particularly at scale, it can show you how people feel towards the topics that are important to you.
We can analyze the word counts that contribute to each sentiment, and find out how much each word in the Hacker News titles contributed to positive or negative sentiment.
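Before joining the titles against a lexicon, it helps to peek at the lexicon itself. A minimal look at the Bing lexicon that ships with tidytext, which simply labels each word as positive or negative:
```{r}
# How many positive vs. negative words the Bing lexicon contains.
get_sentiments("bing") %>%
  count(sentiment)
```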
```{r}
bing_word_counts <- tidy_hacker %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts
```
```{r}
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "The Most Common Positive and Negative Words in Hacker News Titles",
       y = "Contribution to sentiment", x = NULL) +
  coord_flip() +
  theme_fivethirtyeight()
```
A word cloud is a good way to identify trends and patterns that would otherwise be unclear or hard to see in a tabular format. We can also compare the most frequent positive and negative words in a comparison cloud.
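For reference, a plain (non-comparison) cloud of the most frequent title words can be drawn first; a minimal sketch using the wordcloud package loaded above:
```{r}
# A simple word cloud of the most frequent title words,
# before splitting them by sentiment.
tidy_hacker %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```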
```{r}
tidy_hacker %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
```
## Relationship between words
We often want to understand the relationships between words in a document. What sequences of words are common across the text? Given a sequence of words, which word is most likely to follow? Which words have the strongest relationships with each other? Many interesting text analyses are based on these relationships. When we examine pairs of two consecutive words, they are called "bigrams".
```{r}
hacker_bigrams <- hackernews %>%
  unnest_tokens(bigram, title, token = "ngrams", n = 2)
hacker_bigrams
```
```{r}
bigrams_separated <- hacker_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
```
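To make the "which word is most likely to follow?" question above concrete, here is a small sketch using the filtered bigrams; the word "machine" is just an illustrative choice:
```{r}
# Given the word "machine", which next words are most likely?
bigrams_filtered %>%
  filter(word1 == "machine") %>%
  count(word2, sort = TRUE) %>%
  mutate(probability = n / sum(n)) %>%
  head(10)
```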
```{r}
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
```
```{r}
top_bigrams <- bigrams_united %>%
  count(bigram) %>%
  filter(n > 1000)
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col() +
  labs(title = "The Most Common Bigrams in Hacker News") +
  coord_flip() +
  theme_fivethirtyeight()
```
The most common bigram in the Hacker News data is "machine learning", followed by "silicon valley".
The challenge in analyzing text data, as mentioned earlier, is understanding what the words mean. The word "deep" means something different when it is paired with "water" than when it is paired with "learning". As a result, a simple summary of word counts will likely be misleading unless the analysis relates each word to the other words that appear around it, rather than assuming words are chosen independently.
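We can check this directly in our data by looking at the most common right-hand neighbours of "deep" in the filtered bigrams:
```{r}
# The words that most often follow "deep" in the titles,
# which pin down how the word is actually used.
bigrams_filtered %>%
  filter(word1 == "deep") %>%
  count(word2, sort = TRUE) %>%
  head(10)
```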
```{r}
bigram_graph <- bigram_counts %>%
  filter(n > 600) %>%
  graph_from_data_frame()
bigram_graph
```
## Networks of words
Word network analysis is one method for encoding the relationships between words in a text and constructing a network of the linked words. This technique is based on the assumption that language and knowledge can be modeled as networks of words and the relations between them.
For the Hacker News data, we can visualize some details of the text structure. For example, we can see pairs or triplets that form common short phrases ("social media network" or "neural networks").
```{r}
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point(color = "plum4", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Network in Hacker News Dataset Titles") +
  theme_fivethirtyeight()
```
This type of network analysis mainly shows us the important nouns in a text and how they are related. The resulting graph can be used to get a quick visual summary of the text and to find the most relevant excerpts.
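One rough way to quantify which words anchor the network is node degree, i.e. how many other words each word is linked to; a small sketch using igraph, which is already loaded:
```{r}
# The most connected words in the bigram graph, by node degree.
degree(bigram_graph) %>%
  sort(decreasing = TRUE) %>%
  head(10)
```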
Once we have the capability to automatically derive insights from text like this, we can translate those insights into actions.
No structured survey data predicts customer behavior as well as actual voice-of-customer text such as comments and messages.