The fundamental unit of any organization is data. Data drives decision-making in most organizations, for example:
- Where to locate a new franchise
- What customers to target in marketing
- Where bottlenecks exist in a process
- How customers feel about a product
Fig. 1a Graph of data value vs. data age, depicting the importance of individual data and aggregate data.
Data needs to be in a format that allows for qualitative, quantitative, and statistical analysis. In an ideal world, data is well organized, properly formatted, and has no missing elements. In reality, data is often unformatted, formatted in a way that is not conducive to analysis, or missing critical pieces.
Fig. 1b Well formatted data
Fig. 1c Poorly formatted data with missing values
Data scientists turn data from various sources into actionable information.
- Collecting the data in the raw form
- Data munging and data wrangling to make it useful for analysis and visualization.
- Cleaning of data to deal with missing values
- Curation of data to make it available for reuse and preservation
Big data is a relatively recent term that describes data sets so large and complex that traditional methods of storing and processing them are not sufficient. The need for and importance of big data can be illustrated with the following statistics:
Every 60 seconds there are
- Over 100,000 tweets.
- 695,000 Facebook status updates.
- 11 million instant messages.
- 700,000+ Google searches.
- 168 million+ emails sent.
- 1,820 TB of data created.
- 217 new mobile web users.
The statistics above are just a glimpse of the volume of data, which is constantly growing:
- Every day over 2.5 quintillion bytes of data is being generated
- 90% of the world’s data has been generated over the past two years
- Data from multiple sources is being integrated into single massive data sets.
Due to the complexity of the term itself, there is no single agreed-upon definition of “Big Data”. One possible definition:
Big data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding of an industry, business, the environment, medicine, disease control, science, and the human interactions and expectations.
- The Large Hadron Collider would generate 5 × 10^20 bytes per day if all of its sensors were turned on, almost 200 times more than all other data sources in the world combined.
- The Square Kilometer Array radio telescope is expected to collect 14 exabytes of data per day for analysis
- Walmart generates over 1 million customer transactions per hour that are curated in a multi-petabyte database for trend analysis
- Very large, distributed aggregations of loosely structured data, often incomplete
- In excess of multiple petabytes or exabytes of data
- Billions of records about people or transactions
- Loosely-structured and often distributed data
- Flat schemas with few complex interrelationships
- Time series data containing time-stamped events
- Connections between data elements that must be probabilistically inferred through machine learning
Larger data sets allow for more detailed analysis and application to social sciences, biology, pharmacology, business, marketing and more. Data is everywhere and a lot of it is free. Organizations don't necessarily have to build their own massive data repositories before starting with big data analytics. Steps taken by many companies and government agencies to put large amounts of information into the public domain have made large volumes of data accessible to everyone.
Some of the important sources of data are:
- There are nearly five billion web pages
- Collected data includes network traffic, site and page visits, page navigation, page searches
- Also known as "Internet trail" or "Net trail"
- Content generated by millions of users on social media, including Facebook, Twitter, Instagram, blogs, YouTube, forums, wikis, and so forth
- Computer and mobile device log files
- Includes web site tracking information, application logs, sensor data such as check-ins and other location tracking
- Radio Frequency Identifiers
- Tags for tracking merchandise and shipments, mobile payments, sports performance measurement, and automated toll collection
- GPS tracking data generated by mobile devices
- Tracking of movement of equipment, vehicles, and people
- Weather conditions
- Tidal movements
- Seismic activity
- Transactional activities such as purchases, registration, manufacturing
- Social science data, e.g., census, polls
- Health care data
- Education, law and order, economic activity, agriculture, food production
- “Big Data” such as radio telescopes, particle physics
Big data promises tremendous insights, but with the terabytes and petabytes of data pouring into organizations today, traditional architectures are not up to the challenge. Big data brings many challenges of its own:
With the enormous amount of data available, the major challenge is to leverage the value that the data has to offer. Big data requires complex analysis within relatively short time spans in order to detect trends and make decisions. Analysis techniques include, among many others (a small A/B-testing sketch in R follows this list):
- A/B Testing
- Visualization
- Machine Learning
- Time Series Analysis
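As a glimpse of the first of these techniques, the sketch below runs a simple A/B test in R using the built-in prop.test() function; the conversion counts and visitor totals are made-up numbers for illustration only.
# Hypothetical results: variant A converted 120 of 1,000 visitors,
# variant B converted 150 of 1,000 visitors
conversions <- c(120, 150)
visitors <- c(1000, 1000)
# Two-sample test for equality of proportions: a small p-value suggests
# the two variants convert at genuinely different rates
prop.test(conversions, visitors)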
Even before analysis can begin, obtaining usable data is a challenge in itself:
- Data is not free.
- Data is in a format not conducive to analysis.
- Data contains missing values or bad entries.
- Data is not downloadable.
Storage of such enormous amounts of data is a challenge in itself. The system needs to be able to deal with terabytes or petabytes of data on a daily basis.
Curation of data deals with addressing the quality of data. Data has real value only if it is accurate and timely and can thus support the decision-making process. Poor information quality can be costly:
- One study estimates that on average bad information costs businesses up to 10% of revenue
- Another study pegs the loss at over $600 billion annually in the U.S. alone
Timely retrieval of meaningful data from the entire data set is one of the most important challenges.
Sharing and transferring data is another concern, as no readily available platform can move data at this scale; organizations tend to invest heavily in special architectures and infrastructure to facilitate data sharing and transfer.
Visualization helps in extracting meaningful information by processing the data and representing it in a form from which insights can be easily drawn.
Data security becomes a major concern, especially when it comes to credit card data, personal identification information, or other sensitive assets.
Traditional data storage technologies, including text files, XML, and relational databases, reach their limits when used to store very large amounts of data. Furthermore, the data needed for analysis includes not only text and numeric data, but also unstructured data such as text files, video, audio, blogs, sensor data, and geospatial data. These hurdles make storing big data challenging, and non-relational databases provide a good alternative. A non-relational database does not use the table/key model that relational database management systems (RDBMS) promote. It can handle large amounts of data and accommodate unstructured data easily. Fetching data from a non-relational database can be remarkably faster than from a relational database, because queries do not have to traverse many table and key combinations.
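As an illustration only, the sketch below stores and queries documents in MongoDB, a popular non-relational (document-oriented) database, from R. It assumes a MongoDB server is running locally and that the mongolite package is installed; the collection and field names are invented for the example.
library(mongolite)

# Connect to a (hypothetical) local MongoDB instance
orders <- mongo(collection = "orders", db = "shop", url = "mongodb://localhost")

# Insert schema-free documents straight from a data frame
orders$insert(data.frame(customer = c("A12", "B07"), amount = c(19.99, 250.00)))

# Query by field value with a JSON filter; no table joins are involved
orders$find('{"amount": {"$gt": 100}}')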
Volume is one of the core defining attributes of “big data”. Big data implies enormous amounts of structured and unstructured data generated by social and sensor networks, transaction and search history, and manual data collection. For example, 100 terabytes of data are uploaded daily to Facebook; Akamai analyzes 75 million events a day to target online ads; Walmart handles 1 million customer transactions every single hour.
Data comes from a variety of sources and contains both structured and unstructured data. Data types are not restricted to simply numbers and short text fields, but also include images, emails, text messages, web pages, blog entries, documents, audio, video, and time series.
The flow of data that needs to be stored and analyzed is continuous. Human interactions, business processes, machines, and networks generate data continuously and in enormous quantities. The data is often analyzed in real time to gain a strategic advantage, allowing companies to, for example, display personalized ads based on your recent search, viewing, and purchase history. Sampling can help mitigate some of the problems of large data volume and velocity. For example, every minute of every day we upload 100 hours of video to YouTube, send over 200 million emails, and post 300,000 tweets.
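As a small sketch of how sampling tames volume and velocity, the R snippet below draws a 1% random sample of rows from a data frame; big_df here is a stand-in built on the spot for a genuinely large table.
big_df <- data.frame(user = 1:1e6, value = rnorm(1e6))  # stand-in for a very large table
set.seed(42)                                             # make the sample reproducible
idx <- sample(nrow(big_df), size = round(0.01 * nrow(big_df)))
small_df <- big_df[idx, ]                                # explore the sample instead of the full data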
Data veracity characterizes the inherent noise, biases, abnormalities, and mistakes present in virtually all data streams. “Dirty” data presents a significant risk, as analyses based on “bad” data are incorrect. Data must be cleaned in real time, and processes must be established to keep “dirty” data from accumulating. A data scientist often has to work as a “data janitor” before analyzing the data.
Even when data is not “dirty”, biased, or abnormal, it may still not be valid for the intended use. Data that is valid for the intended use is essential to making decisions based on it.
Volatility characterizes the degree to which data changes over time. Decisions and analyses are based on data that has an “expiration date”. Data scientists must define at what point in time a data stream is no longer relevant and cannot be used to make a decision.
Viscosity measures resistance to flow in the volume of data: difficulty navigating the dataset, slow data flow rates, or the complexity of the data processing required. Technologies to deal with viscosity include improved streaming, agile integration buses, and complex event processing.
Virality measures how quickly data is spread and shared to each unique node. Time is an important characteristic along with rate of proliferation. Virality of data can provide companies with instant insights into the target areas to launch marketing campaigns.
Concepts Pharma has built a data repository that collects self-reported eating habits of clinical trial participants through a mobile app. The translational medicine group is using the data to determine whether the drug in trial causes digestive issues when taken with certain food groups. Which of the V’s should be of most concern to them?
- Veracity
- Volume
- Volatility
- Velocity
- Variety
Answer at the end of chapter
Carrying out a "Big Data" project requires thoughtful planning. The project must have clearly defined objectives and "questions" to be answered through analysis. The project plan must also address where the data will come from, the processes for collecting, cleaning, and loading the data, and the infrastructure used to house it. Finally, the plan must state how the data is expected to be analyzed, how it will be kept free of identifiable properties, and how personal data will be kept confidential. A data scientist or data analyst planning a big data project should address:
- Objectives
- Data
- Process
- Infrastructure
- Analytics
- Governance and Privacy
Objectives need to be clearly defined, with a proper outline of every step of the project. The data scientist needs to answer questions such as:
- What is the purpose of the data project?
- How is the data going to be used?
- What is the business or organizational value of the data project?
- What data needs to be collected?
- Where will the data come from?
- Internal systems?
- Social networks?
- External data sources?
- What is the structure of the data?
- Quantitative or qualitative?
- What is the quality of the data?
- How will it be collected?
- Who is involved in collection of the data?
- How will the data be cleaned?
- How will the data be loaded and transferred?
- What kind of analysis needs to be done?
- Real time analysis?
- Where will the data be stored?
- What database or data store will be needed based on the volume, complexity, type, and required access of the data?
- What hardware is needed to support responsive access to the data?
- Who will manage the data store?
- Who will supply the data store?
- How will the data be presented?
- Tables?
- Visualizations?
- What predictive models will be built?
- How will the data from different sources be combined?
- What skills are needed to do the analysis?
- What programs or applications need to be built or purchased?
- Organizations must be transparent in how they manage personal data and how they use it.
- Government regulations may limit which data can be collected and how that data can be stored, transferred, or used.
- Organizations must protect private data and not allow persons to be “identifiable”.
Before we start with the basics of R, let's make sure we have the latest version of R. To install R on your computer, go to the R home page and follow the instructions there: http://www.r-project.org/ We recommend the use of RStudio, a powerful IDE for R. RStudio is also free and can be downloaded from its home page: http://www.rstudio.com/
Why R? As we learned in the first chapter, dealing with big data poses many challenges, and R provides an excellent platform for addressing them, as we will see as we go further. R provides a powerful environment that runs on several platforms; it can process enormous amounts of data in one go, or millions of chunks of data one by one. R also lets you deal with bad or missing data, and it makes reshaping and restructuring data easy.
Data is stored as objects in R. Objects are created in several ways, illustrated briefly after this list:
- Reading data from an external file
- Retrieving data from a URL
- Creating an object directly from the command line
- Instantiating an object from within a program
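A minimal sketch of a few of these approaches; the file name and URL below are placeholders for illustration, not real resources:
# Create an object by reading a local file (the file name is a placeholder)
sales <- read.csv("mydata.csv", header = TRUE)

# Create an object by retrieving data from a URL (the address is a placeholder)
remote <- read.csv("http://example.com/data/mydata.csv")

# Create an object directly from the command line
x <- c(10, 20, 30)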
R can be used directly to evaluate simple or complex expressions:
> 12*21
[1] 252
> ((2^3)*5)-1
[1] 39
> sqrt(4)* exp(2)
[1] 14.77811
Note: sqrt and exp are built-in R functions for square root and exponential, respectively.
Assignment of a value to a variable can be done in two ways in R:
> x=12
> x
[1] 12
> word = "Hello"
> word
[1] "Hello"
> x <- 12
> x
[1] 12
> word <- "Hello"
> word
[1] "Hello"
The latter method (<-) is more frequently used by R users. Object names are case sensitive and cannot contain spaces. An object identifier must start with a letter and may contain letters, digits, periods, and underscores thereafter.
Note that R is case sensitive which means that R treats the object names "AP" and "ap" as different objects. Accessing files is also most commonly case sensitive, so there’s a difference between “AirPassengers.txt” and “airpassengers.txt”.
R functions are invoked by name. Help for any built-in function or dataset can be accessed by adding a question mark (?) in front of the function or dataset name.
> ?sum
> sum(1,2,30)
[1] 33
> r <- c(1,2,3,4,5,6,7)
> mean(r)
[1] 4
User-defined functions are an important part of programming in R. They allow code to be reused.
> fraction<-function(x,y){
+ result <-x/y
+ print (result)
+ }
> fraction(3,2)
[1] 1.5
Concatenating numeric or character values using the built-in c() function results in an indexable array.
> a<-c(1,2,3,4,5,6,7,8)
> a
[1] 1 2 3 4 5 6 7 8
> a[2]
[1] 2
> a + 10
[1] 11 12 13 14 15 16 17 18
> a
[1] 1 2 3 4 5 6 7 8
> b <- a/2
> b
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
> c <- a + b
> c
[1] 1.5 3.0 4.5 6.0 7.5 9.0 10.5 12.0
A sequence or range of numbers can be generated in R using the ":" operator.
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 5:12
[1] 5 6 7 8 9 10 11 12
> 3:-3
[1] 3 2 1 0 -1 -2 -3
> 2*1:5
[1] 2 4 6 8 10
> 2*(1:5)
[1] 2 4 6 8 10
> a<-c(1:25)
> a
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
More advanced sequences can be generated using the built-in R function seq().
# Increment by 3
> seq(from=5,to=15,by=3)
[1] 5 8 11 14
# divide in 6 parts
> seq(from=1,to=10,length=6)
[1] 1.0 2.8 4.6 6.4 8.2 10.0
# divide in 4 parts with decrement of 2.5
> seq(from=100,length=4,by=-2.5)
[1] 100.0 97.5 95.0 92.5
# divide in parts equal to the vector range
> x <-10:20
> seq(from=50,to=52,along=x)
[1] 50.0 50.2 50.4 50.6 50.8 51.0 51.2 51.4 51.6 51.8 52.0
Sequences are essentially arrays, and particular elements of a sequence can be extracted with the [] subscript operator. Subscripting in R is much more flexible than in many other programming languages.
> # extract the 3rd element
> x[3]
[1] 12
> # extract all BUT the 3rd element
> x[-3]
[1] 10 11 13 14 15 16 17 18 19 20
Concatenation can be used in conjunction with sequencing to retrieve a subset of elements.
> x
[1] 10 11 12 13 14 15 16 17 18 19 20
> #retrieve the 5th and 7th elements
> x[c(5,7)]
[1] 14 16
> #retrieve all but the 3rd, 5th, and 9th elements
> x[c(-3,-5,-9)]
[1] 10 11 13 15 16 17 19 20
Specific elements meeting a logical criterion can be selected using subscripting.
> x
[1] 10 11 12 13 14 15 16 17 18 19 20
> #extract all elements greater than 14
> x[x>14]
[1] 15 16 17 18 19 20
> z<-c(T,T,F,T,F,T,F)
> z
[1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
> z[z==T]
[1] TRUE TRUE TRUE TRUE
A great strength of R lies in the flexibility it provides. The functions ls() and rm(), named after the familiar shell commands, list and delete objects in the current R session.
> ls()
[1] "a" "b" "c" "r" "word" "x" "z"
> rm("word")
> ls()
[1] "a" "b" "c" "r" "x" "z"
#remove all objects from current session
rm(list=ls())
Comments are an essential part of any program. Scripts should be commented so that you or others understand the intent of the commands and functions. Any text after a hash mark (#) is ignored by R.
Data frames are the most common compound data structure used in R, alongside scalar values and collections of values (vectors and sequences). They are loosely similar to C++ and Java objects or C structs. A data frame is composed of multiple named components, each of which is commonly a sequence (column).
A data frame is often created by loading data from an external file or created internally. Data frames are essentially spreadsheets of columns and rows.
> x<-1:10
> y<-seq(from=100,to=300,by=5)
> # create a new data frame 'df'
> df<-data.frame(x,y)
Error in data.frame(x, y) :
arguments imply differing number of rows: 10, 41
> y<-seq(from=100,to=300,length=10)
> df<-data.frame(x,y)
Note: To combine two vectors into a data frame, they have to be of the same length. Individual elements of a data frame can be accessed the same way as an array, using the [] subscript operator.
> df
x y
1 1 100.0000
2 2 122.2222
3 3 144.4444
4 4 166.6667
5 5 188.8889
6 6 211.1111
7 7 233.3333
8 8 255.5556
9 9 277.7778
10 10 300.0000
> df[6,2]
[1] 211.1111
# Accessing entire column
> df[,2]
[1] 100.0000 122.2222 144.4444 166.6667 188.8889 211.1111 233.3333 255.5556 277.7778 300.0000
# Accessing entire row
> df[1,]
x y
1 1 100
Details about what a variable is storing can be determined using the str() function. Other functions that report the dimensions of a variable are dim(), length(), ncol(), nrow(), etc.
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x: int 1 2 3 4 5 6 7 8 9 10
$ y: num 100 122 144 167 189 ...
> ncol(df)
[1] 2
> nrow(df)
[1] 10
> length(df)
[1] 2
> dim(df)
[1] 10 2
> length(df$x)
[1] 10
> dim(df)[1]
[1] 10
#referencing the last element
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x[length(x)]
[1] 10
R comes with many built-in datasets that can be used for practice. A complete list of the built-in datasets is available on the homepage of the datasets package: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
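The same list can also be explored from inside an R session:
> data()         # list the datasets available in the attached packages
> ?discoveries   # open the help page for one particular built-in dataset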
The discoveries dataset contains the number of “great” inventions and scientific discoveries in each year from 1860 to 1959.
> discoveries
Time Series:
Start = 1860
End = 1959
Frequency = 1
[1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2 3 7 12 3 10 9 2 3 7
[33] 7 2 3 3 6 2 4 3 5 2 2 4 0 4 2 5 2 3 3 6 5 8 3 6 6 0 5 2 2 2 6 3
[65] 4 4 2 2 4 7 5 3 3 0 2 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1
[97] 0 0 2 0
# converting the built-in dataset into a data frame
> Discoveries<- data.frame(year=1860:1959,count=discoveries)
> head(Discoveries)
year count
1 1860 5
2 1861 3
3 1862 0
4 1863 2
5 1864 0
6 1865 3
The head() and tail() functions list the first or last six rows of a data frame. These functions come in handy when dealing with larger datasets.
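For example, tail() shows the last six rows of the Discoveries data frame built above (values taken from the discoveries series listed earlier):
> tail(Discoveries)
    year count
95  1954     1
96  1955     1
97  1956     0
98  1957     0
99  1958     2
100 1959     0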
Given a logical statement, any() tests if at least one value in the set meets the criterion.
> any(Discoveries[,2]<0)
[1] FALSE
> any(Discoveries[,1] < 1860 | Discoveries[,1] > 1959)
[1] FALSE
Since we will mainly be dealing with big data through R, descriptive statistics functions such as mean(), max(), and which() will be very helpful for navigating a huge dataset quickly and accurately.
> mean(Discoveries[,2])
[1] 3.1
> round(mean(Discoveries[,2]))
[1] 3
> max(Discoveries[,2])
[1] 12
> which(Discoveries[,2]==12)
[1] 26
> Discoveries[26,]
year count
26 1885 12
To obtain quick summary statistics on a data object, use the summary() function.
> summary(Discoveries)
year count
Min. :1860 Min. : 0.0
1st Qu.:1885 1st Qu.: 2.0
Median :1910 Median : 3.0
Mean :1910 Mean : 3.1
3rd Qu.:1934 3rd Qu.: 4.0
Max. :1959 Max. :12.0
To sum a column in a data frame, use the colSums() function.
> colSums(Discoveries[2])
count
310
Note 1: colSums() requires a two-dimensional object. Discoveries[2] (no comma) returns a one-column data frame, which works, whereas Discoveries[,2] would return a plain vector and cause an error. Note 2: Note the camel casing in the function name. camelCase is the practice of writing compound words or phrases such that each word or abbreviation after the first begins with a capital letter. Remember that R is case sensitive.
Queries are a central part of data analysis: given a huge dataset, we often need to know something about particular records. We use simple or complex queries to do statistical analysis on the data or to search for something specific within it.
# how many years were fewer than 5 discoveries observed?
> length(which(Discoveries[,2] < 5))
[1] 79
# In which years were fewer than 5 discoveries observed?
> Discoveries[(which(Discoveries[,2] < 5)),]
year count
2 1861 3
3 1862 0
4 1863 2
5 1864 0
... (75 more rows)
# List of years with 0 discoveries
> Discoveries[(which(Discoveries[,2] == 0)),1]
[1] 1862 1864 1881 1904 1917 1933 1956 1957 1959
Missing data values in a data frame are encoded as NA. A missing value blocks the calculation of summary statistics or numeric expressions. Missing values can be removed using the built-in R function na.omit(). Some functions accept an na.rm argument to ignore missing values during calculation. To find out whether a dataset has missing values, functions such as any() and is.na() are used. Let's explore these functions using R's built-in dataset "airquality", which contains measurements of daily air quality in New York City from May through September 1973.
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> mean(airquality$Solar.R)
[1] NA
> any(is.na(airquality))
[1] TRUE
> mean(airquality$Solar.R,na.rm=TRUE)
[1] 185.9315
> which(is.na(airquality$Solar.R))
[1] 5 6 11 27 96 97 98
> air_complete<-na.omit(airquality)
> head(air_complete)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
R commands can be saved in a text file and loaded on demand rather than typed in over and over. Create a text file in a text editor and save it with the .R extension. Use the source() function to load and execute the script.
# Simple R script: created.R
x<-1:10
y<-seq(from=100,to=300,length=10)
df<-data.frame(x,y)
> source("created.R")
> df
x y
1 1 100.0000
2 2 122.2222
3 3 144.4444
4 4 166.6667
5 5 188.8889
6 6 211.1111
7 7 233.3333
8 8 255.5556
9 9 277.7778
10 10 300.0000
The source() function comes in especially handy when you are dealing with a huge dataset and loading the data takes time.
In conditional execution, code statements are only executed if certain conditions are TRUE. The if() statement is used to construct conditional execution paths. The conditional code statements are enclosed in curly braces { and }.
> a<-10
> b<-5
> if(a < b) {
+ print("a is less than b")
+ }
The comparison operators used in conditions are:

Operator | Semantics |
---|---|
== | Equality |
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
Logical operators combine multiple conditions:

Operator | Semantics |
---|---|
&& | AND (both statements are true) |
! | NOT (the statement is false) |
> if (sum(1:10) >= sqrt(75)) {
+ print("true")
+ } else {
+ print("false")
+ }
[1] "true"
The ifelse() function provides a more compact syntax for if-else constructs.
> ifelse(sum(1:5) >= 10, "it's greater", "it's smaller")
[1] "it's greater"
The switch() function is used when a variable can take multiple values and different logic is needed for each case.
> name <- readline(prompt="Enter a name: ")
Enter a name: Michelle
> switch(name,
+ Michelle={
+ print("Hi Michelle! How are you?") # any logical statement for Michelle
+ },
+ John={
+ print("Hi John! How are you?") # any logical statement for John
+ },
+ {
+ print("default") # default logic
+ }
+ )
[1] "Hi Michelle! How are you?"
R supports two common forms of iteration (looping):
- restricted iteration, which executes commands a fixed number of times: the for loop
- unrestricted iteration, in which the loop runs until some condition is no longer true: the while loop
The for loop runs a fixed number of times based on the values assigned to an index or looping variable.
> for (i in 1:3) {
+ print(paste("i =",i))
+ }
[1] "i = 1"
[1] "i = 2"
[1] "i = 3"
> i
[1] 3
Instead of looping a fixed number of times, a for loop can also iterate over a set. The loop variable takes on each value in the set one at a time.
> cities <- c("Boston","New York","San Francisco")
> for (city in cities) {
+ print(city)
+ }
[1] "Boston"
[1] "New York"
[1] "San Francisco"
For loops can be nested to run through each row and column of a matrix.
> mat<- matrix(nrow=4, ncol=5, sample(0:1))
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 1 1 1 1
[3,] 0 0 0 0 0
[4,] 1 1 1 1 1
> for (i in 1:nrow(mat) ) {
+ for (j in 1:ncol(mat)){
+ if(mat[i,j] == 1){
+ mat[i,j]<- "Michelle"
+ }
+ else{
+ mat[i,j]<- "John"
+ }
+ }
+ }
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] "John" "John" "John" "John" "John"
[2,] "Michelle" "Michelle" "Michelle" "Michelle" "Michelle"
[3,] "John" "John" "John" "John" "John"
[4,] "Michelle" "Michelle" "Michelle" "Michelle" "Michelle"
In unrestricted iteration, the loop executes the loop statements until a condition is no longer true.
> x <-0
> while (x < 10) {
+ print (x)
+ x <-x + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
The apply() family of functions offers a more concise alternative to explicit loops and produces the same results.
It applies a function to components of a list or other object and returns the results as a list, a vector, or a matrix.
> x <- matrix(c(1:10), ncol=5, byrow=TRUE)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
> apply(x, 1, mean)
[1] 3 8
> apply(x, 2, mean)
[1] 3.5 4.5 5.5 6.5 7.5
Note: apply(x, 1, mean) calculates the mean of each of the two rows of x, and apply(x, 2, mean) calculates the mean of each of the five columns of x.
In the above example, apply() extracts each row or column as a vector, one at a time, and passes it to the mean() function, replacing an explicit loop. The column-wise version could also be written as:
> avgs<-numeric(5)
> for(i in 1:5){
+ avgs[i]<-mean(x[,i])
+ }
> avgs
[1] 3.5 4.5 5.5 6.5 7.5
#OR
> apply(x, 2, mean)
[1] 3.5 4.5 5.5 6.5 7.5
Explicit loops can become very slow, especially over large datasets. The apply functions are more concise and often faster, largely because the functions they call (such as mean() or colMeans()) are vectorized and implemented in compiled C or Fortran code rather than interpreted R code.
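A rough way to compare approaches yourself is with system.time(). The sketch below creates a matrix m purely for this test and times an explicit loop, apply(), and the fully vectorized colMeans(); exact timings vary by machine.
m <- matrix(rnorm(1e6), ncol = 100)   # 10,000 rows x 100 columns of random numbers

# Explicit for loop over the columns
system.time({
  means <- numeric(ncol(m))
  for (j in 1:ncol(m)) means[j] <- mean(m[, j])
})

# The same computation with apply()
system.time(apply(m, 2, mean))

# A fully vectorized alternative for this particular task
system.time(colMeans(m))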
- lapply - the L in lapply stands for list; lapply() returns a list of the same length as its input.
> x<- list(a<-c(1:20),b<-c(10:20),c<-c(20:30))
> x
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[[2]]
[1] 10 11 12 13 14 15 16 17 18 19 20
[[3]]
[1] 20 21 22 23 24 25 26 27 28 29 30
> results<-lapply(x,mean)
> results
[[1]]
[1] 10.5
[[2]]
[1] 15
[[3]]
[1] 25
> class(results)
[1] "list"
- sapply - the S stands for simplify. sapply() works like lapply() but, instead of returning a list, it returns a simplified vector (or matrix) where possible.
> results<-sapply(x,mean)
> results
[1] 10.5 15.0 25.0
> class(results)
[1] "numeric"
- tapply - applies a function to subsets of a vector, where the subsets are defined by another vector, usually a factor.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> y <- factor(rep(letters[1:5], each = 4))
> y
[1] a a a a b b b b c c c c d d d d e e e e
Levels: a b c d e
> tapply(x, y, mean)
a b c d e
2.5 6.5 10.5 14.5 18.5
- mapply - applies a function to the 1st elements of each argument, then the 2nd elements of each, and so on.
> mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
The split-apply-combine strategy is to:
- Break up a big problem into manageable pieces (split).
- Perform operations on each piece separately (apply).
- Combine the output of each piece into a single result (combine).
The plyr package provides intuitive functions for the split-apply-combine strategy. As an example, the steps below compute per-species means for the built-in iris dataset:
- Split the iris dataset into three parts.
- Remove the species name variable from the data.
- Calculate the mean of each variable for the three different parts separately.
- Combine the output into a single data frame.
> library (plyr)
> ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Or, equivalently, using adply() on the built-in iris3 array:
> iris_mean <- adply(iris3,3,colMeans)
> iris_mean
X1 Sepal L. Sepal W. Petal L. Petal W.
1 Setosa 5.006 3.428 1.462 0.246
2 Versicolor 5.936 2.770 4.260 1.326
3 Virginica 6.588 2.974 5.552 2.026
> class(iris_mean)
[1] "data.frame"
Note: You need to install the plyr package to run this code.
The ability to read and write external text files is an essential part of data processing. Many data sets are stored in simple text files. Excel and other programs can export and import text files in certain formats.
Let's load the built-in data set AirPassengers, containing monthly international airline passenger counts between 1949 and 1960. After displaying the data set, copy the data into a simple text file.
> AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
While R has several functions for reading files, the most commonly used function for reading text files is read.table().
> ap<-read.table("airPassengers.txt",header=TRUE,sep="")
> ap
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
The read.table() function has a skip=x parameter which allows you to skip a given number of lines at the top of the file.
> ap<-read.table("airPassengers.txt",skip=4,header=TRUE,sep="")
> ap
X1952 X171 X180 X193 X181 X183 X218 X230 X242 X209 X191 X172 X194
1 1953 196 196 236 235 229 243 264 272 237 211 180 201
2 1954 204 188 235 227 234 264 302 293 259 229 203 229
3 1955 242 233 267 269 270 315 364 347 312 274 237 278
4 1956 284 277 317 313 318 374 413 405 355 306 271 306
5 1957 315 301 356 348 355 422 465 467 404 347 305 336
6 1958 340 318 362 348 363 435 491 505 404 359 310 337
7 1959 360 342 406 396 420 472 548 559 463 407 362 405
8 1960 417 391 419 461 472 535 622 606 508 461 390 432
New columns can be added to a data set using the cbind() function. Here a Total column is added that sums every existing column of each row; note that rowSums(ap) also includes the year column in the sum.
> ap$Total<-cbind(rowSums(ap))
> ap
X1952 X171 X180 X193 X181 X183 X218 X230 X242 X209 X191 X172 X194 Total
1 1953 196 196 236 235 229 243 264 272 237 211 180 201 4653
2 1954 204 188 235 227 234 264 302 293 259 229 203 229 4821
3 1955 242 233 267 269 270 315 364 347 312 274 237 278 5363
4 1956 284 277 317 313 318 374 413 405 355 306 271 306 5895
5 1957 315 301 356 348 355 422 465 467 404 347 305 336 6378
6 1958 340 318 362 348 363 435 491 505 404 359 310 337 6530
7 1959 360 342 406 396 420 472 548 559 463 407 362 405 7099
8 1960 417 391 419 461 472 535 622 606 508 461 390 432 7674
A data object can be exported to a file using the write.table() function.
> getwd()
[1] "C:/Users/Martin/Downloads"
> write.table(ap,"AirPassNG.txt",col.names=NA,
+ row.names=TRUE,quote=FALSE,sep=",")
Note: R requires the use of a forward slash ('/') to separate directories (folders), not the backslash ('\') used by Windows paths.
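To verify the export, the file can be read back in. Because the table was written with row names and comma separators, a matching read.table() call (a sketch; adjust the path to your own file) recovers the data:
> # Read the exported file back, treating the first column as row names
> ap2 <- read.table("AirPassNG.txt", header=TRUE, sep=",", row.names=1)
> head(ap2, 3)
ap2 should contain the same rows and columns as ap.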