The fundamental unit of any organization is data. Data drives decision-making in most organizations, for example:
- Where to locate a new franchise
- What customers to target in marketing
- Where bottlenecks exist in a process
- How customers feel about a product
Fig. 1a Graph of data value vs. data age, depicting the importance of individual data and aggregate data.
Data needs to be in a format that allows for qualitative, quantitative, and statistical analysis. In an ideal world, data is well organized, properly formatted, and has no missing elements. In reality, data is often unformatted, formatted in a way that is not conducive to analysis, or missing critical pieces.
Fig. 1b Well formatted data
Fig. 1c Poorly formatted data with missing values
Data scientists turn data from various sources into actionable information.
- Collecting the data in the raw form
- Data munging and data wrangling to make it useful for analysis and visualization.
- Cleaning of data to deal with missing values
- Curation of data to make it available for reuse and preservation
Big data is a relatively recent term that describes data sets so large and complex that traditional methods of storing and processing them are not sufficient. The need for and importance of big data can be illustrated with the following statistics:
Every 60 seconds there are
- Over 100,000 tweets.
- 695,000 Facebook status updates.
- 11 million instant messages.
- 700,000+ Google searches.
- 168 million+ emails sent.
- 1,820 TB of data created.
- 217 new mobile web users.
The statistics above are just a glimpse of the volume of data, which is constantly growing:
- Every day over 2.5 quintillion bytes of data is being generated
- 90% of the world’s data has been generated over the past two years
- Data from multiple sources is being integrated into single massive data sets.
Due to the complexity of the term itself, there is no single agreed-upon definition of “Big Data”. One possible definition:
Big data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding of an industry, business, the environment, medicine, disease control, science, and the human interactions and expectations.
- The Large Hadron Collider would generate 5 × 10^20 bytes per day if all of its sensors were turned on, almost 200 times more than all other data sources in the world combined.
- The Square Kilometer Array radio telescope is expected to collect 14 exabytes of data per day for analysis
- Walmart generates over 1 million customer transactions per hour that are curated in a multi-petabyte database for trend analysis
- Very large, distributed aggregations of loosely structured data, often incomplete
- In excess of multiple petabytes or exabytes of data
- Billions of records about people or transactions
- Loosely-structured and often distributed data
- Flat schemas with few complex interrelationships
- Time series data containing time-stamped events
- Connections between data elements that must be probabilistically inferred through machine learning
Larger data sets allow for more detailed analysis and application to social sciences, biology, pharmacology, business, marketing and more. Data is everywhere and a lot of it is free. Organizations don't necessarily have to build their own massive data repositories before starting with big data analytics. Steps taken by many companies and government agencies to put large amounts of information into the public domain have made large volumes of data accessible to everyone.
Some of the important sources of data are:
- There are nearly five billion web pages
- Collected data includes network traffic, site and page visits, page navigation, page searches
- Also known as "Internet trail" or "Net trail"
- Content generated by millions of users on social media, including Facebook, Twitter, Instagram, blogs, YouTube, forums, wikis, and so forth
- Computer and mobile device log files
- Includes web site tracking information, application logs, sensor data such as check-ins and other location tracking
- Radio Frequency Identifiers
- Tags for tracking merchandise and shipments, mobile payments, sports performance measurement, and automated toll collection
- GPS tracking data generated by mobile devices
- Tracking of movement of equipment, vehicles, and people
- Weather conditions
- Tidal movements
- Seismic activity
- Transactional activities such as purchases, registration, manufacturing
- Social science data, e.g., census, polls
- Health care data
- Education, law and order, economic activity, agriculture, food production
- “Big Data” such as radio telescopes, particle physics
Big data promises tremendous insights, but with the terabytes and petabytes of data pouring into organizations today, traditional architectures are not up to the challenge. Big data brings many challenges of its own:
With the enormous amount of data available, the major challenge is to leverage the value that the data has to offer. Big data requires complex analysis within relatively short time spans in order to detect trends and make decisions. Analysis techniques include, among many others (a small A/B-testing sketch in R follows this list):
- A/B Testing
- Visualization
- Machine Learning
- Time Series Analysis
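As a glimpse of the first of these techniques, the sketch below runs a simple A/B test in R using the built-in prop.test() function; the conversion counts and visitor totals are made-up numbers for illustration only.
# Hypothetical results: variant A converted 120 of 1,000 visitors,
# variant B converted 150 of 1,000 visitors
conversions <- c(120, 150)
visitors <- c(1000, 1000)
# Two-sample test for equality of proportions: a small p-value suggests
# the two variants convert at genuinely different rates
prop.test(conversions, visitors)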
Even before analysis can begin, obtaining usable data is a challenge in itself:
- Data is not free.
- Data is in a format not conducive to analysis.
- Data contains missing values or bad entries.
- Data is not downloadable.
Storage of such enormous amounts of data is a challenge in itself. The system needs to be able to deal with terabytes or petabytes of data on a daily basis.
Curation of data deals with addressing the quality of data. Data has real value only if it is accurate and timely and can thus support the decision-making process. Poor information quality can be costly:
- One study estimates that on average bad information costs businesses up to 10% of revenue
- Another study pegs the loss at over $600 billion annually in the U.S. alone
Timely retrieval of meaningful data from the entire data set is one of the most important challenges.
Sharing and transferring data is another concern, as no readily available platform can move data at this scale; organizations tend to invest heavily in special architectures and infrastructure to facilitate data sharing and transfer.
Visualization helps in extracting meaningful information by processing the data and representing it in a form from which insights can be easily drawn.
Data security becomes a major concern, especially when it comes to credit card data, personal identification information, or other sensitive assets.
Traditional data storage technologies, including text files, XML, and relational databases, reach their limits when used to store very large amounts of data. Furthermore, the data needed for analysis includes not only text and numeric data, but also unstructured data such as text files, video, audio, blogs, sensor data, and geospatial data. These hurdles make storing big data challenging, and non-relational databases provide a good alternative. A non-relational database does not use the table/key model that relational database management systems (RDBMS) promote. It can handle large amounts of data and accommodate unstructured data easily. Fetching data from a non-relational database can be remarkably faster than from a relational database, because queries do not have to traverse many table and key combinations.
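As an illustration only, the sketch below stores and queries documents in MongoDB, a popular non-relational (document-oriented) database, from R. It assumes a MongoDB server is running locally and that the mongolite package is installed; the collection and field names are invented for the example.
library(mongolite)

# Connect to a (hypothetical) local MongoDB instance
orders <- mongo(collection = "orders", db = "shop", url = "mongodb://localhost")

# Insert schema-free documents straight from a data frame
orders$insert(data.frame(customer = c("A12", "B07"), amount = c(19.99, 250.00)))

# Query by field value with a JSON filter; no table joins are involved
orders$find('{"amount": {"$gt": 100}}')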
Volume is one of the core defining attributes of “big data”. Big data implies enormous amounts of structured and unstructured data generated by social and sensor networks, transaction and search history, and manual data collection. For example, 100 terabytes of data are uploaded daily to Facebook; Akamai analyzes 75 million events a day to target online ads; Walmart handles 1 million customer transactions every single hour.
Data comes from a variety of sources and contains both structured and unstructured data. Data types are not restricted to simply numbers and short text fields, but also include images, emails, text messages, web pages, blog entries, documents, audio, video, and time series.
The flow of data that needs to be stored and analyzed is continuous. Human interactions, business processes, machines, and networks generate data continuously and in enormous quantities. The data is often analyzed in real time to gain a strategic advantage, allowing companies to, for example, display personalized ads based on your recent search, viewing, and purchase history. Sampling can help mitigate some of the problems of large data volume and velocity. For example, every minute of every day we upload 100 hours of video to YouTube, send over 200 million emails, and post 300,000 tweets.
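As a small sketch of how sampling tames volume and velocity, the R snippet below draws a 1% random sample of rows from a data frame; big_df here is a stand-in built on the spot for a genuinely large table.
big_df <- data.frame(user = 1:1e6, value = rnorm(1e6))  # stand-in for a very large table
set.seed(42)                                             # make the sample reproducible
idx <- sample(nrow(big_df), size = round(0.01 * nrow(big_df)))
small_df <- big_df[idx, ]                                # explore the sample instead of the full data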
Data veracity characterizes the inherent noise, biases, abnormalities, and mistakes present in virtually all data streams. “Dirty” data presents a significant risk, as analyses based on “bad” data are incorrect. Data must be cleaned in real time, and processes must be established to keep “dirty” data from accumulating. A data scientist often has to work as a “data janitor” before analyzing the data.
Even when data is not “dirty”, biased, or abnormal, it may still not be valid for the intended use. Data that is valid for the intended use is essential to making decisions based on it.
Volatility characterizes the degree to which data changes over time. Decisions and analyses are based on data that has an “expiration date”. Data scientists must define at what point in time a data stream is no longer relevant and cannot be used to make a decision.
Viscosity measures resistance to flow in the volume of data: difficulty navigating the dataset, slow data flow rates, or the complexity of the data processing required. Technologies to deal with viscosity include improved streaming, agile integration buses, and complex event processing.
Virality measures how quickly data is spread and shared to each unique node. Time is an important characteristic along with rate of proliferation. Virality of data can provide companies with instant insights into the target areas to launch marketing campaigns.
Concepts Pharma has built a data repository that collects self-reported eating habits of clinical trial participants through a mobile app. The translational medicine group is using the data to determine whether the drug in trial causes digestive issues when taken with certain food groups. Which of the V’s should be of most concern to them?
- Veracity
- Volume
- Volatility
- Velocity
- Variety
Answer at the end of chapter
Carrying out a "Big Data" project requires thoughtful planning. The project must have clearly defined objectives and "questions" to be answered through analysis. The project plan must also address where the data will come from, the processes for collecting, cleaning, and loading the data, and the infrastructure used to house it. Finally, the plan must state how the data is expected to be analyzed, how it will be kept free of identifiable properties, and how personal data will be kept confidential. A data scientist or data analyst planning a big data project should address:
- Objectives
- Data
- Process
- Infrastructure
- Analytics
- Governance and Privacy
Objectives need to be clearly defined, with a proper outline of every step of the project. The data scientist needs to answer questions such as:
- What is the purpose of the data project?
- How is the data going to be used?
- What is the business or organizational value of the data project?
- What data needs to be collected?
- Where will the data come from?
- Internal systems?
- Social networks?
- External data sources?
- What is the structure of the data?
- Quantitative or qualitative?
- What is the quality of the data?
- How will it be collected?
- Who is involved in collection of the data?
- How will the data be cleaned?
- How will the data be loaded and transferred?
- What kind of analysis needs to be done?
- Real time analysis?
- Where will the data be stored?
- What database or data store will be needed based on the volume, complexity, type, and required access of the data?
- What hardware is needed to support responsive access to the data?
- Who will manage the data store?
- Who will supply the data store?
- How will the data be presented?
- Tables?
- Visualizations?
- What predictive models will be built?
- How will the data from different sources be combined?
- What skills are needed to do the analysis?
- What programs or applications need to be built or purchased?
- Organizations must be transparent in how they manage personal data and how they use it.
- Government regulations may limit which data can be collected and how that data can be stored, transferred, or used.
- Organizations must protect private data and not allow persons to be “identifiable”.
Before we start with the basics of R, let's make sure we have the latest version of R. To install R on your computer, go to the R home page and follow the instructions there: http://www.r-project.org/ We recommend the use of RStudio, a powerful IDE for R. RStudio is also free and can be downloaded from its home page: http://www.rstudio.com/
Why R? As we learned in the first chapter, dealing with big data poses many challenges, and R provides an excellent platform for addressing them, as we will see as we go further. R provides a powerful environment that runs on several platforms; it can process enormous amounts of data in one go, or millions of chunks of data one by one. R also lets you deal with bad or missing data, and it makes reshaping and restructuring data easy.
Data is stored as objects in R. Objects are created in several ways, illustrated briefly after this list:
- Reading data from an external file
- Retrieving data from a URL
- Creating an object directly from the command line
- Instantiating an object from within a program
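A minimal sketch of a few of these approaches; the file name and URL below are placeholders for illustration, not real resources:
# Create an object by reading a local file (the file name is a placeholder)
sales <- read.csv("mydata.csv", header = TRUE)

# Create an object by retrieving data from a URL (the address is a placeholder)
remote <- read.csv("http://example.com/data/mydata.csv")

# Create an object directly from the command line
x <- c(10, 20, 30)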
R can be used directly to evaluate simple or complex expressions:
> 12*21
[1] 252
> ((2^3)*5)-1
[1] 39
> sqrt(4)* exp(2)
[1] 14.77811
Note: sqrt and exp are built-in R functions for square root and exponential, respectively.
Assignment of a value to a variable can be done in two ways in R:
> x=12
> x
[1] 12
> word = "Hello"
> word
[1] "Hello"
> x <- 12
> x
[1] 12
> word <- "Hello"
> word
[1] "Hello"
The latter method (<-) is more frequently used by R users. Object names are case sensitive and cannot contain spaces. An object identifier must start with a letter and may contain letters, digits, periods, and underscores thereafter.
Note that R is case sensitive which means that R treats the object names "AP" and "ap" as different objects. Accessing files is also most commonly case sensitive, so there’s a difference between “AirPassengers.txt” and “airpassengers.txt”.
R functions are invoked by name. Help for any built-in function or dataset can be accessed by adding a question mark (?) in front of the function or dataset name.
> ?sum
> sum(1,2,30)
[1] 33
> r <- c(1,2,3,4,5,6,7)
> mean(r)
[1] 4
User-defined functions are an important part of programming in R. They allow code to be reused.
> fraction<-function(x,y){
+ result <-x/y
+ print (result)
+ }
> fraction(3,2)
[1] 1.5
Concatenating numeric or character values using the built-in c() function results in an indexable array.
> a<-c(1,2,3,4,5,6,7,8)
> a
[1] 1 2 3 4 5 6 7 8
> a[2]
[1] 2
> a + 10
[1] 11 12 13 14 15 16 17 18
> a
[1] 1 2 3 4 5 6 7 8
> b <- a/2
> b
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
> c <- a + b
> c
[1] 1.5 3.0 4.5 6.0 7.5 9.0 10.5 12.0
A sequence or range of numbers can be generated in R using the ":" operator.
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 5:12
[1] 5 6 7 8 9 10 11 12
> 3:-3
[1] 3 2 1 0 -1 -2 -3
> 2*1:5
[1] 2 4 6 8 10
> 2*(1:5)
[1] 2 4 6 8 10
> a<-c(1:25)
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
More advanced sequences can be generated using the built-in R function seq().
# Increment by 3
> seq(from=5,to=15,by=3)
[1] 5 8 11 14
# divide in 6 parts
> seq(from=1,to=10,length=6)
[1] 1.0 2.8 4.6 6.4 8.2 10.0
# divide in 4 parts with decrement of 2.5
> seq(from=100,length=4,by=-2.5)
[1] 100.0 97.5 95.0 92.5
# divide in parts equal to the vector range
> x <-10:20
> seq(from=50,to=52,along=x)
[1] 50.0 50.2 50.4 50.6 50.8 51.0 51.2 51.4 51.6 51.8 52.0
Sequences are essentially arrays, and particular elements of a sequence can be extracted with the [] subscript operator. Subscripting in R is much more flexible than in many other programming languages.
> # extract the 3rd element
> x[3]
[1] 12
> # extract all BUT the 3rd element
> x[-3]
[1] 10 11 13 14 15 16 17 18 19 20
Concatenation can be used in conjunction with sequencing to retrieve a subset of elements.
> x
[1] 10 11 12 13 14 15 16 17 18 19 20
> #retrieve the 5th and 7th elements
> x[c(5,7)]
[1] 14 16
> #retrieve all but the 3rd, 5th, and 9th elements
> x[c(-3,-5,-9)]
[1] 10 11 13 15 16 17 19 20
Specific elements meeting a logical criterion can be selected using subscripting.
> x
[1] 10 11 12 13 14 15 16 17 18 19 20
> #extract all elements greater than 14
> x[x>14]
[1] 15 16 17 18 19 20
> z<-c(T,T,F,T,F,T,F)
> z
[1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
> z[z==T]
[1] TRUE TRUE TRUE TRUE
A great strength of R lies in the flexibility it provides. The functions ls() and rm(), named after the familiar shell commands, list and delete objects in the current R session.
> ls()
[1] "a" "b" "c" "r" "word" "x" "z"
> rm("word")
> ls()
[1] "a" "b" "c" "r" "x" "z"
#remove all objects from current session
rm(list=ls())
Comments are an essential part of any program. Scripts should be commented so that you or others understand the intent of the commands and functions. Any text after a hash mark (#) is ignored by R.
Data frames are the most common compound data structure used in R, alongside scalar values and collections of values (vectors and sequences). They are loosely similar to C++ and Java objects or C structs. A data frame is composed of multiple named components, each of which is commonly a sequence (column).
A data frame is often created by loading data from an external file or created internally. Data frames are essentially spreadsheets of columns and rows.
> x<-1:10
> y<-seq(from=100,to=300,by=5)
> # create a new data frame 'df'
> df<-data.frame(x,y)
Error in data.frame(x, y) :
arguments imply differing number of rows: 10, 41
> y<-seq(from=100,to=300,length=10)
> df<-data.frame(x,y)
Note: To combine two vectors into a data frame, they have to be of the same length. Individual elements of a data frame can be accessed the same way as an array, using the [] subscript operator.
> df
x y
1 1 100.0000
2 2 122.2222
3 3 144.4444
4 4 166.6667
5 5 188.8889
6 6 211.1111
7 7 233.3333
8 8 255.5556
9 9 277.7778
10 10 300.0000
> df[6,2]
[1] 211.1111
# Accessing entire column
> df[,2]
[1] 100.0000 122.2222 144.4444 166.6667 188.8889 211.1111 233.3333 255.5556 277.7778 300.0000
# Accessing entire row
> df[1,]
x y
1 1 100
Details about what a variable is storing can be determined using the str() function. Other functions that report the dimensions of a variable are dim(), length(), ncol(), nrow(), etc.
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x: int 1 2 3 4 5 6 7 8 9 10
$ y: num 100 122 144 167 189 ...
> ncol(df)
[1] 2
> nrow(df)
[1] 10
> length(df)
[1] 2
> dim(df)
[1] 10 2
> length(df$x)
[1] 10
> dim(df)[1]
[1] 10
#referencing the last element
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[length(x)]
[1] 10
R comes with many built-in datasets that can be used for practice. A complete list of the built-in datasets is available on the homepage of the datasets package: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
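The same list can also be explored from inside an R session:
> data()         # list the datasets available in the attached packages
> ?discoveries   # open the help page for one particular built-in dataset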
The discoveries dataset contains the number of “great” inventions and scientific discoveries in each year from 1860 to 1959.
> discoveries
Time Series:
Start = 1860
End = 1959
Frequency = 1
[1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2 3 7 12 3 10 9 2 3 7
[33] 7 2 3 3 6 2 4 3 5 2 2 4 0 4 2 5 2 3 3 6 5 8 3 6 6 0 5 2 2 2 6 3
[65] 4 4 2 2 4 7 5 3 3 0 2 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1
[97] 0 0 2 0
# converting the built-in dataset into a data frame
> Discoveries<- data.frame(year=1860:1959,count=discoveries)
> head(Discoveries)
year count
1 1860 5
2 1861 3
3 1862 0
4 1863 2
5 1864 0
6 1865 3
The head() and tail() functions list the first or last six rows of a data frame. These functions come in handy when dealing with larger datasets.
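For example, tail() shows the last six rows of the Discoveries data frame built above (values taken from the discoveries series listed earlier):
> tail(Discoveries)
    year count
95  1954     1
96  1955     1
97  1956     0
98  1957     0
99  1958     2
100 1959     0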
Given a logical statement, any() tests if at least one value in the set meets the criterion.
> any(Discoveries[,2]<0)
[1] FALSE
> any(Discoveries[,1] < 1860 | Discoveries[,1] > 1959)
[1] FALSE
Since we will mainly be dealing with big data through R, descriptive statistics functions such as mean(), max(), and which() will be very helpful for navigating a huge dataset quickly and accurately.
> mean(Discoveries[,2])
[1] 3.1
> round(mean(Discoveries[,2]))
[1] 3
> max(Discoveries[,2])
[1] 12
> which(Discoveries[,2]==12)
[1] 26
> Discoveries[26,]
year count
26 1885 12
To obtain quick summary statistics on a data object, use the summary() function.
> summary(Discoveries)
year count
Min. :1860 Min. : 0.0
1st Qu.:1885 1st Qu.: 2.0
Median :1910 Median : 3.0
Mean :1910 Mean : 3.1
3rd Qu.:1934 3rd Qu.: 4.0
Max. :1959 Max. :12.0
To sum a column in a data frame, use the colSums() function.
> colSums(Discoveries[2])
count
310
Note 1: colSums() requires a two-dimensional object. Discoveries[2] (no comma) returns a one-column data frame, which works, whereas Discoveries[,2] would return a plain vector and cause an error. Note 2: Note the camel casing in the function name. camelCase is the practice of writing compound words or phrases such that each word or abbreviation after the first begins with a capital letter. Remember that R is case sensitive.
Queries are a central part of data analysis: given a huge dataset, we often need to know something about particular records. We use simple or complex queries to do statistical analysis on the data or to search for something specific within it.
# how many years were fewer than 5 discoveries observed?
> length(which(Discoveries[,2] < 5))
[1] 79
# In which years were fewer than 5 discoveries observed?
> Discoveries[(which(Discoveries[,2] < 5)),]
year count
2 1861 3
3 1862 0
4 1863 2
5 1864 0
... (75 more rows)
# List of years with 0 discoveries
> Discoveries[(which(Discoveries[,2] == 0)),1]
[1] 1862 1864 1881 1904 1917 1933 1956 1957 1959
Missing data values in a data frame are encoded as NA. A missing value blocks the calculation of summary statistics or numeric expressions. Missing values can be removed using the built-in R function na.omit(). Some functions accept an na.rm argument to ignore missing values during calculation. To find out whether a dataset has missing values, functions such as any() and is.na() are used. Let's explore these functions using R's built-in dataset "airquality", which contains measurements of daily air quality in New York City from May through September 1973.
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> mean(airquality$Solar.R)
[1] NA
> any(is.na(airquality))
[1] TRUE
> mean(airquality$Solar.R,na.rm=TRUE)
[1] 185.9315
> which(is.na(airquality$Solar.R))
[1] 5 6 11 27 96 97 98
> air_complete<-na.omit(airquality)
> head(air_complete)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
R commands can be saved in a text file and loaded on demand rather than typed in over and over. Create a text file in a text editor and save it with the .R extension. Use the source() function to load and execute the script.
# Simple R script: created.R
x<-1:10
y<-seq(from=100,to=300,length=10)
df<-data.frame(x,y)
> source("created.R")
> df
x y
1 1 100.0000
2 2 122.2222
3 3 144.4444
4 4 166.6667
5 5 188.8889
6 6 211.1111
7 7 233.3333
8 8 255.5556
9 9 277.7778
10 10 300.0000
The source() function comes in especially handy when you are dealing with a huge dataset and loading the data takes time.
In conditional execution, code statements are only executed if certain conditions are TRUE. The if() statement is used to construct conditional execution paths. The conditional code statements are enclosed in curly braces { and }.
> a<-10
> b<-5
> if(a < b) {
+ print("a is less than b")
+ }
The comparison operators used in conditions are:

Operator | Semantics |
---|---|
== | Equality |
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
Logical operators combine multiple conditions:

Operator | Semantics |
---|---|
&& | AND (both statements are true) |
! | NOT (the statement is false) |
> if (sum(1:10) >= sqrt(75)) {
+ print("true")
+ } else {
+ print("false")
+ }
[1] "true"
The ifelse() function provides a more compact syntax for if-else constructs.
> ifelse(sum(1:5) >= 10, "it's greater", "it's smaller")
[1] "it's greater"
The switch() function is used when a variable can take multiple values and different logic is needed for each case.
> name <- readline(prompt="Enter a name: ")
Enter a name: Michelle
> switch(name,
+ Michelle={
+ print("Hi Michelle! How are you?") # any logical statement for Michelle
+ },
+ John={
+ print("Hi John! How are you?") # any logical statement for John
+ },
+ {
+ print("default") # default logic
+ }
+ )
[1] "Hi Michelle! How are you?"
R supports two common forms of iteration (looping):
- restricted iteration, which executes commands a fixed number of times: the for loop
- unrestricted iteration, in which the loop runs until some condition is no longer true: the while loop
The for loop runs a fixed number of times based on the values assigned to an index or looping variable.
> for (i in 1:3) {
+ print(paste("i =",i))
+ }
[1] "i = 1"
[1] "i = 2"
[1] "i = 3"
> i
[1] 3
Instead of looping a fixed number of times, a for loop can also iterate over a set. The loop variable takes on each value in the set one at a time.
> cities <- c("Boston","New York","San Francisco")
> for (city in cities) {
+ print(city)
+ }
[1] "Boston"
[1] "New York"
[1] "San Francisco"
For loops can be nested to run through each row and column of a matrix.
> mat<- matrix(nrow=4, ncol=5, sample(0:1))
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 1 1 1 1
[3,] 0 0 0 0 0
[4,] 1 1 1 1 1
> for (i in 1:nrow(mat) ) {
+ for (j in 1:ncol(mat)){
+ if(mat[i,j] == 1){
+ mat[i,j]<- "Michelle"
+ }
+ else{
+ mat[i,j]<- "John"
+ }
+ }
+ }
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] "John" "John" "John" "John" "John"
[2,] "Michelle" "Michelle" "Michelle" "Michelle" "Michelle"
[3,] "John" "John" "John" "John" "John"
[4,] "Michelle" "Michelle" "Michelle" "Michelle" "Michelle"
In unrestricted iteration, the loop executes the loop statements until a condition is no longer true.
> x <-0
> while (x < 10) {
+ print (x)
+ x <-x + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
The apply() family of functions offers a more concise alternative to explicit loops and produces the same results.
It applies a function to components of a list or other object and returns the results as a list, a vector, or a matrix.
> x <- matrix(c(1:10), ncol=5, byrow=TRUE)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
> apply(x, 1, mean)
[1] 3 8
> apply(x, 2, mean)
[1] 3.5 4.5 5.5 6.5 7.5
Note: apply(x, 1, mean) calculates the mean of each of the two rows of x, and apply(x, 2, mean) calculates the mean of each of the five columns of x.
In the above example, apply() extracts each row or column as a vector, one at a time, and passes it to the mean() function, replacing an explicit loop. The column-wise version could also be written as:
> avgs<-numeric(5)
> for(i in 1:5){
+ avgs[i]<-mean(x[,i])
+ }
> avgs
[1] 3.5 4.5 5.5 6.5 7.5
#OR
> apply(x, 2, mean)
[1] 3.5 4.5 5.5 6.5 7.5
Explicit loops can become very slow, especially over large datasets. The apply functions are more concise and often faster, largely because the functions they call (such as mean() or colMeans()) are vectorized and implemented in compiled C or Fortran code rather than interpreted R code.
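A rough way to compare approaches yourself is with system.time(). The sketch below creates a matrix m purely for this test and times an explicit loop, apply(), and the fully vectorized colMeans(); exact timings vary by machine.
m <- matrix(rnorm(1e6), ncol = 100)   # 10,000 rows x 100 columns of random numbers

# Explicit for loop over the columns
system.time({
  means <- numeric(ncol(m))
  for (j in 1:ncol(m)) means[j] <- mean(m[, j])
})

# The same computation with apply()
system.time(apply(m, 2, mean))

# A fully vectorized alternative for this particular task
system.time(colMeans(m))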
- lapply - the L in lapply stands for list; lapply() returns a list of the same length as its input.
> x<- list(a<-c(1:20),b<-c(10:20),c<-c(20:30))
> x
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[[2]]
[1] 10 11 12 13 14 15 16 17 18 19 20
[[3]]
[1] 20 21 22 23 24 25 26 27 28 29 30
> results<-lapply(x,mean)
> results
[[1]]
[1] 10.5
[[2]]
[1] 15
[[3]]
[1] 25
> class(results)
[1] "list"
- sapply - the S stands for simplify. sapply() works like lapply() but, instead of returning a list, it returns a simplified vector (or matrix) where possible.
> results<-sapply(x,mean)
> results
[1] 10.5 15.0 25.0
> class(results)
[1] "numeric"
- tapply - applies a function to subsets of a vector, where the subsets are defined by another vector, usually a factor.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> y <- factor(rep(letters[1:5], each = 4))
> y
[1] a a a a b b b b c c c c d d d d e e e e
Levels: a b c d e
> tapply(x, y, mean)
a b c d e
2.5 6.5 10.5 14.5 18.5
- mapply - applies a function to the 1st elements of each argument, then the 2nd elements of each, and so on.
> mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
The split-apply-combine strategy is to:
- Break up a big problem into manageable pieces (split).
- Perform operations on each piece separately (apply).
- Combine the output of each piece into a single result (combine).
The plyr package provides intuitive functions for the split-apply-combine strategy. As an example, the steps below compute per-species means for the built-in iris dataset:
- Split the iris dataset into three parts.
- Remove the species name variable from the data.
- Calculate the mean of each variable for the three different parts separately.
- Combine the output into a single data frame.
> library (plyr)
> ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Or, equivalently, using adply() on the built-in iris3 array:
> iris_mean <- adply(iris3,3,colMeans)
> iris_mean
X1 Sepal L. Sepal W. Petal L. Petal W.
1 Setosa 5.006 3.428 1.462 0.246
2 Versicolor 5.936 2.770 4.260 1.326
3 Virginica 6.588 2.974 5.552 2.026
> class(iris_mean)
[1] "data.frame"
Note: You need to install the plyr package to run this code.
The ability to read and write external text files is an essential part of data processing. Many data sets are stored in simple text files. Excel and other programs can export and import text files in certain formats.
Let's load the built-in data set AirPassengers, containing monthly international airline passenger counts between 1949 and 1960. After displaying the data set, copy the data into a simple text file.
> AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
While R has several functions for reading files, the most commonly used function for reading text files is read.table().
> ap<-read.table("airPassengers.txt",header=TRUE,sep="")
> ap
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
The read.table() function has a skip=x parameter which allows you to skip a given number of lines at the top of the file.
> ap<-read.table("airPassengers.txt",skip=4,header=TRUE,sep="")
> ap
X1952 X171 X180 X193 X181 X183 X218 X230 X242 X209 X191 X172 X194
1 1953 196 196 236 235 229 243 264 272 237 211 180 201
2 1954 204 188 235 227 234 264 302 293 259 229 203 229
3 1955 242 233 267 269 270 315 364 347 312 274 237 278
4 1956 284 277 317 313 318 374 413 405 355 306 271 306
5 1957 315 301 356 348 355 422 465 467 404 347 305 336
6 1958 340 318 362 348 363 435 491 505 404 359 310 337
7 1959 360 342 406 396 420 472 548 559 463 407 362 405
8 1960 417 391 419 461 472 535 622 606 508 461 390 432
New columns can be added to a data set using the cbind() function. Here a Total column is added that sums every existing column of each row; note that rowSums(ap) also includes the year column in the sum.
> ap$Total<-cbind(rowSums(ap))
> ap
X1952 X171 X180 X193 X181 X183 X218 X230 X242 X209 X191 X172 X194 Total
1 1953 196 196 236 235 229 243 264 272 237 211 180 201 4653
2 1954 204 188 235 227 234 264 302 293 259 229 203 229 4821
3 1955 242 233 267 269 270 315 364 347 312 274 237 278 5363
4 1956 284 277 317 313 318 374 413 405 355 306 271 306 5895
5 1957 315 301 356 348 355 422 465 467 404 347 305 336 6378
6 1958 340 318 362 348 363 435 491 505 404 359 310 337 6530
7 1959 360 342 406 396 420 472 548 559 463 407 362 405 7099
8 1960 417 391 419 461 472 535 622 606 508 461 390 432 7674
A data object can be exported to a file using the write.table() function.
> getwd()
[1] "C:/Users/Martin/Downloads"
> write.table(ap,"AirPassNG.txt",col.names=NA,
+ row.names=TRUE,quote=FALSE,sep=",")
Note: R requires the use of a forward slash ('/') to separate directories (folders), not the backslash ('\') used by Windows paths.
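To verify the export, the file can be read back in. Because the table was written with row names and comma separators, a matching read.table() call (a sketch; adjust the path to your own file) recovers the data:
> # Read the exported file back, treating the first column as row names
> ap2 <- read.table("AirPassNG.txt", header=TRUE, sep=",", row.names=1)
> head(ap2, 3)
ap2 should contain the same rows and columns as ap.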