Job-Market-Explorer

SCRAPING USING R programming

Using web scraping techniques, I extract internship data and analyze it for interesting patterns. I built this project myself, from the initial idea to the implementation in RStudio. It showcases my skills in mining information from the web, analyzing it, and applying statistical tests to draw conclusions from messy data.

Set Up Your Environment

Install some packages that will be handy. Open your R console or script and run:

install.packages("rvest")

install.packages("dplyr")

rvest is a great package for web scraping, and dplyr will help us manipulate and analyze the data.

Write Code to Scrape Job Listings

library(rvest)

library(dplyr)

Link = "https://internshala.com/internships/analytics,data-analysis-internship/"

Create an html document from a url

page = read_html(Link)

#html_nodes : selects parts of a document using CSS selectors

html_text will extract text from selected nodes

name = page %>% html_nodes(".profile .view_detail_button") %>% html_text()

View(name)

Company = page %>% html_nodes(".link_display_like_text") %>% html_text()

install.packages("stringr")

library(stringr)

Place = page %>% html_nodes("#location_names .view_detail_button") %>% html_text() %>% head(40) %>% str_sub(1, 40)

Stipend = page %>% html_nodes(".stipend") %>% html_text()

Duration = page %>% html_nodes(".other_detail_item+ .other_detail_item .item_body") %>% html_text()
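To see how the CSS selectors and html_text() fit together without requesting the live site, here is a minimal sketch that parses a small inline HTML snippet. The markup and class names are made up for illustration and are not the real Internshala page structure.

```r
library(rvest)

# Hypothetical markup standing in for a listings page (not the real site structure)
sample_html <- paste0(
  '<div class="listing"><h3 class="title"> Data Analyst </h3>',
  '<span class="pay">5,000 /month</span></div>',
  '<div class="listing"><h3 class="title"> BI Intern </h3>',
  '<span class="pay">Unpaid</span></div>'
)

# read_html() also accepts a literal HTML string
sample_page <- read_html(sample_html)

# Select nodes by CSS class, extract their text, and trim stray whitespace
titles <- sample_page %>% html_nodes(".listing .title") %>% html_text() %>% trimws()
pay    <- sample_page %>% html_nodes(".listing .pay") %>% html_text() %>% trimws()

print(titles)
print(pay)
```

Trimming with trimws() is worth keeping for the real page too, since scraped text often carries leading and trailing whitespace.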

Create a dataframe

Internship.data <- data.frame(name , Company , Stipend , Duration, Place)
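data.frame() needs every column to have a compatible length, and scraped fields can come back with different lengths when a listing is missing one of the nodes. The sketch below checks the lengths first; the vectors are hypothetical stand-ins for the scraped ones.

```r
# Hypothetical scraped vectors; one listing had no location node
name_demo    <- c("Data Analyst", "BI Intern", "ML Intern")
company_demo <- c("Acme", "Globex", "Initech")
place_demo   <- c("Delhi", "Mumbai")

# Collect the length of each field in one named vector
field_lengths <- lengths(list(name = name_demo, Company = company_demo, Place = place_demo))
print(field_lengths)

# Warn before building the data frame if the fields do not line up
if (length(unique(field_lengths)) > 1) {
  message("Fields differ in length; data.frame() would fail or misalign rows.")
}
```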

#print the data frame

print(Internship.data)

ANALYSIS OF STIPEND

Summary of stipend distribution

summary(Internship.data$Stipend)

#It seems like the Stipend column is being treated as a character instead of a numeric variable. To perform numerical analysis such as calculating mean, median, min, and max, we need to convert it to a numeric format.

Convert Stipend column to numeric

#Internship.data$Stipend <- as.numeric(Internship.data$Stipend)

Check the data types

str(Internship.data)

Remove the "/month", "/week" and "+ Incentives" text from the Stipend column, and replace "Unpaid" with 0

Internship.data$Stipend <- gsub("/month", "", Internship.data$Stipend)

Internship.data$Stipend <- gsub("/week", "", Internship.data$Stipend)

Internship.data$Stipend <- gsub("+ Incentives", "", Internship.data$Stipend, fixed = TRUE)

Internship.data$Stipend <- gsub("Unpaid", "0", Internship.data$Stipend)
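Even after this cleanup, a stipend given as a range is still text. One possible way to get a single number is to take the midpoint of the range. The sketch below assumes stipend strings that look like the hypothetical examples shown; the real scraped format may differ, and weekly and monthly amounts are not put on a common scale here.

```r
# Hypothetical examples of scraped stipend strings
stipend_raw <- c("5,000-10,000 /month", "8,000 /month", "Unpaid", "2,000 /week + Incentives")

# Convert one stipend string to a number (helper name is illustrative)
stipend_to_number <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)                 # drop thousands separators
  if (grepl("Unpaid", x, fixed = TRUE)) return(0)     # unpaid counts as zero
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+", x))[[1]])
  if (length(nums) == 0) return(NA_real_)             # nothing numeric found
  mean(nums)                                          # midpoint of a range, or the single value
}

stipend_num <- vapply(stipend_raw, stipend_to_number, numeric(1), USE.NAMES = FALSE)
print(stipend_num)
summary(stipend_num)
```

With a numeric column like this, summary() reports the minimum, median, mean and maximum instead of only the length and class.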

library(ggplot2)

Create a bar chart for top companies

ggplot(Internship.data, aes(x = Company)) + geom_bar(stat = "count", fill = "skyblue") + theme_minimal() + labs(title = "Top Companies Offering Internships", x = "Company", y = "Count")

#the x-axis labels are not legible, so rotate them by 90 degrees

ggplot(Internship.data, aes(x = Company)) + geom_bar(stat = "count", fill = "skyblue") + theme_minimal() + labs(title = "Top Companies Offering Internships", x = "Company", y = "Count")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
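The chart above shows every company. To restrict it to the most frequent ones, the counts can be computed first and then trimmed. This is a sketch on a hypothetical Company column, and the cutoff top_n_companies is an assumed value.

```r
library(dplyr)
library(ggplot2)

# Hypothetical company column (stand-in for Internship.data$Company)
demo <- data.frame(Company = c("Acme", "Globex", "Acme", "Initech", "Acme", "Globex"))

top_n_companies <- 2  # assumed cutoff

top_companies <- demo %>%
  count(Company, sort = TRUE) %>%   # one row per company, most frequent first
  head(top_n_companies)

# Horizontal bars keep long company names readable
p <- ggplot(top_companies, aes(x = reorder(Company, n), y = n)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top Companies Offering Internships", x = "Company", y = "Count")
print(p)
```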

Create a bar chart for popular locations

ggplot(Internship.data, aes(x = Place)) + geom_bar(stat = "count", fill = "lightgreen") + theme_minimal() + labs(title = "Popular Internship Locations", x = "Location", y = "Count")

#ERROR : Aesthetics must be either length 1 or the same as the data (40)

So, create the table of counts manually

Create a table of location counts

location_counts <- table(Place)

Convert the table to a data frame

location_data <- as.data.frame(location_counts)

Rename the columns for better clarity

colnames(location_data) <- c("Place", "Count")

Order the data frame by count

location_data <- location_data[order(location_data$Count, decreasing = TRUE), ]

Print the resulting data frame

print(location_data)
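Once the counts are in their own data frame, they can be plotted with geom_col(), which takes the bar height from a column instead of counting rows. A minimal sketch on hypothetical locations:

```r
library(ggplot2)

# Hypothetical locations (stand-in for the scraped Place vector)
place_demo <- c("Delhi", "Mumbai", "Delhi", "Work From Home", "Delhi", "Mumbai")

# Same steps as above: count, convert to a data frame, rename the columns
location_demo <- as.data.frame(table(place_demo), stringsAsFactors = FALSE)
colnames(location_demo) <- c("Place", "Count")

# geom_col() uses the Count column directly as the bar height
p <- ggplot(location_demo, aes(x = reorder(Place, -Count), y = Count)) +
  geom_col(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Popular Internship Locations", x = "Location", y = "Count")
print(p)
```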

Internship.data$Duration <- gsub("Months", "", Internship.data$Duration)

Create a histogram for internship durations

Internship.data$Duration <- as.numeric(Internship.data$Duration)

ggplot(Internship.data, aes(x = Duration)) + geom_histogram(fill = "coral", binwidth = 1, boundary = 0) + theme_minimal() + labs(title = "Duration Trends", x = "Duration", y = "Count")
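Removing only the word "Months" leaves values such as "1 Month" or "6 Weeks" as text, and as.numeric() turns those into NA. The sketch below is one possible way to put durations on a common scale in months. The input strings are hypothetical, and the weeks-to-months factor (12 / 52) is an approximation chosen for illustration.

```r
# Hypothetical duration strings as they might be scraped
duration_raw <- c("3 Months", "1 Month", "6 Weeks")

# Pull out the leading number from each string
value <- as.numeric(sub("^\\s*([0-9.]+).*$", "\\1", duration_raw))

# Flag entries expressed in weeks and convert them to months
is_weeks <- grepl("week", duration_raw, ignore.case = TRUE)
duration_months <- ifelse(is_weeks, value * 12 / 52, value)

print(duration_months)
```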

Create a boxplot for stipend analysis

ggplot(Internship.data, aes(y = Stipend)) + geom_boxplot(fill = "lightblue") + theme_minimal() + labs(title = "Stipend Analysis", x = "", y = "Stipend")

(This plot is not meaningful, because each Stipend value is a range stored as text rather than a single number.)

Create a scatter plot for stipend vs. duration

ggplot(Internship.data, aes(x = Duration, y = Stipend)) + geom_point(color = "darkorange") + theme_minimal() + labs(title = "Stipend vs. Duration Analysis", x = "Duration", y = "Stipend")

ggplot(Internship.data, aes(x = Duration, y = Stipend)) + geom_col(color = "darkorange") + theme_minimal() + labs(title = "Stipend vs. Duration Analysis", x = "Duration", y = "Stipend")

install.packages("tm")

install.packages("wordcloud")

library(tm)

Text analysis on internship titles involves extracting insights from the text data in the internship title column.

'name' is the column with internship titles

corpus <- Corpus(VectorSource(Internship.data$name))

dtm <- DocumentTermMatrix(corpus)

word_freq <- colSums(as.matrix(dtm))

Create a bar chart for word frequency

barplot(sort(word_freq, decreasing = TRUE)[1:10], col = "purple", main = "Top 10 Words in Internship Titles")

STEP-BY-STEP GUIDE FOR THIS CODE

Preprocessing:

Tokenization: Break down the titles into individual words or tokens.

Lowercasing: Convert all words to lowercase to ensure consistency.

Removing Stop Words: Eliminate common words (e.g., "and," "the") that don't carry significant meaning.

Stemming/Lemmatization: Reduce words to their root form for consistency.

Creating a Document-Term Matrix (DTM):

Convert the tokenized and preprocessed titles into a matrix format, where rows represent documents (internship titles) and columns represent unique words.

Word Frequency Analysis:

Calculate the frequency of each word in the document-term matrix. Identify the most common words.

Visualization:

Create visualizations such as bar charts or word clouds to represent the frequency of words. This provides a quick overview of the most prevalent terms.
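The steps in this guide can be sketched with the tm package as follows. The titles are hypothetical stand-ins for Internship.data$name, and stemming is left out here because tm's stemDocument() needs the additional SnowballC package.

```r
library(tm)

# Hypothetical internship titles
titles <- c("Data Analyst Intern", "Business Analyst", "Data Science and Analytics")

corpus <- VCorpus(VectorSource(titles))

# Preprocessing: lowercase, strip punctuation, drop English stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix: one row per title, one column per word
dtm <- DocumentTermMatrix(corpus)

# Word frequency: column sums, most frequent first
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(word_freq, 10))

# Visualization: bar chart of the most frequent words
barplot(head(word_freq, 10), col = "purple", las = 2,
        main = "Top Words in Internship Titles")
```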

hypothesis testing

(just my assumptions)

Hypothesis: Internships in Certain Locations Have Higher Stipends

Null Hypothesis (H0): The average stipend for internships is the same across all locations.

Alternative Hypothesis (H1): The average stipend for internships varies by location.

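location_groups <- split(Internship.data$Stipend, Internship.data$Place)

result <- aov(Stipend ~ Place, data = Internship.data)

summary(result)

The scraped Stipend values are ranges stored as text, so this test cannot be run until Stipend is converted to a single numeric value.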

Hypothesis: Internships in Different Places Have Different Durations

Null Hypothesis (H0): The average duration for internships is the same across all places.

Alternative Hypothesis (H1): The average duration for internships varies by place.

Check the levels and counts of 'Place'

table(Internship.data$Place)

This will display the counts for each level. If any level has very few observations, it might be a good idea to either combine categories or ensure that you have a reasonable amount of data for each level before conducting the ANOVA test.
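Combining sparse categories could look like the sketch below, which lumps every place with fewer than min_count listings into "Other". The data and the threshold are assumptions for illustration.

```r
# Hypothetical places (stand-in for Internship.data$Place)
place <- c("Delhi", "Delhi", "Mumbai", "Mumbai", "Mumbai", "Pune", "Jaipur", "Delhi")

min_count <- 2  # assumed minimum number of listings per level

# Find the levels that occur fewer than min_count times
counts <- table(place)
rare <- names(counts)[counts < min_count]

# Replace rare levels with a single "Other" level
place_lumped <- ifelse(place %in% rare, "Other", place)
print(table(place_lumped))
```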

Convert 'Place' to a factor with specified contrasts

unique_levels <- unique(Internship.data$Place)

Internship.data$Place <- factor(Internship.data$Place, levels = unique_levels)

Specify contrasts directly

contrasts(Internship.data$Place) <- contr.treatment(length(unique_levels))

Perform ANOVA test

result <- aov(Duration ~ Place, data = Internship.data)

summary(result)

Result:

Based on this output, it does not appear that the 'Place' variable has a significant effect on the 'Duration' variable. The high p-value (0.589) suggests that there is not enough evidence to reject the null hypothesis, which posits that there are no differences in mean 'Duration' across different levels of 'Place'.
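The p-value can also be read from the fitted model in code rather than from the printed table. The sketch below uses a small hypothetical data set in place of Internship.data, and the 0.05 significance level is an assumed convention.

```r
# Hypothetical data (stand-in for Internship.data)
demo <- data.frame(
  Place = factor(rep(c("Delhi", "Mumbai", "Pune"), each = 4)),
  Duration = c(2, 3, 3, 6, 1, 3, 4, 6, 2, 2, 3, 6)
)

fit <- aov(Duration ~ Place, data = demo)

# The first row of the ANOVA table belongs to the Place term
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)

alpha <- 0.05  # assumed significance level
if (p_value < alpha) {
  message("Reject H0: mean Duration differs across places.")
} else {
  message("Fail to reject H0: no evidence that mean Duration differs across places.")
}
```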
