<!DOCTYPE html>
<html >
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>An Introduction to Statistical Programming Methods with R</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<meta name="description" content="This book is under construction and serves as a reference for students or other interested readers who intend to learn the basics of statistical programming using the R language. The book will provide the reader with notions of data management, manipulation and analysis as well as of reproducible research, result-sharing and version control.">
<meta name="generator" content="bookdown 0.1.16 and GitBook 2.6.7">
<meta property="og:title" content="An Introduction to Statistical Programming Methods with R" />
<meta property="og:type" content="book" />
<meta property="og:description" content="This book is under construction and serves as a reference for students or other interested readers who intend to learn the basics of statistical programming using the R language. The book will provide the reader with notions of data management, manipulation and analysis as well as of reproducible research, result-sharing and version control." />
<meta name="github-repo" content="smac-group/ds" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="An Introduction to Statistical Programming Methods with R" />
<meta name="twitter:description" content="This book is under construction and serves as a reference for students or other interested readers who intend to learn the basics of statistical programming using the R language. The book will provide the reader with notions of data management, manipulation and analysis as well as of reproducible research, result-sharing and version control." />
<meta name="author" content="Matthew Beckman, Stéphane Guerrier, Justin Lee & Roberto Molinari">
<meta name="date" content="2017-08-16">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<script src="libs/jquery-2.2.3/jquery.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #007020; font-weight: bold; } /* Keyword */
code > span.dt { color: #902000; } /* DataType */
code > span.dv { color: #40a070; } /* DecVal */
code > span.bn { color: #40a070; } /* BaseN */
code > span.fl { color: #40a070; } /* Float */
code > span.ch { color: #4070a0; } /* Char */
code > span.st { color: #4070a0; } /* String */
code > span.co { color: #60a0b0; font-style: italic; } /* Comment */
code > span.ot { color: #007020; } /* Other */
code > span.al { color: #ff0000; font-weight: bold; } /* Alert */
code > span.fu { color: #06287e; } /* Function */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #880000; } /* Constant */
code > span.sc { color: #4070a0; } /* SpecialChar */
code > span.vs { color: #4070a0; } /* VerbatimString */
code > span.ss { color: #bb6688; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #19177c; } /* Variable */
code > span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code > span.op { color: #666666; } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #bc7a00; } /* Preprocessor */
code > span.at { color: #7d9029; } /* Attribute */
code > span.do { color: #ba2121; font-style: italic; } /* Documentation */
code > span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
</style>
<link rel="stylesheet" href="style.css" type="text/css" />
</head>
<body>
<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">
<div class="book-summary">
<nav role="navigation">
<ul class="summary">
<li><a href="./">An Introduction to Statistical Programming Methods with R</a></li>
<li class="divider"></li>
<li class="chapter" data-level="1" data-path=""><a href="#introduction"><i class="fa fa-check"></i><b>1</b> Introduction</a><ul>
<li class="chapter" data-level="1.1" data-path=""><a href="#r-and-rstudio"><i class="fa fa-check"></i><b>1.1</b> <code>R</code> and <code>RStudio</code></a><ul>
<li class="chapter" data-level="1.1.1" data-path=""><a href="#why-r"><i class="fa fa-check"></i><b>1.1.1</b> Why <code>R</code>?</a></li>
<li class="chapter" data-level="1.1.2" data-path=""><a href="#getting-started-with-r"><i class="fa fa-check"></i><b>1.1.2</b> Getting started with <code>R</code></a></li>
<li class="chapter" data-level="1.1.3" data-path=""><a href="#about-rstudio"><i class="fa fa-check"></i><b>1.1.3</b> About RStudio</a></li>
<li class="chapter" data-level="1.1.4" data-path=""><a href="#conventions"><i class="fa fa-check"></i><b>1.1.4</b> Conventions</a></li>
<li class="chapter" data-level="1.1.5" data-path=""><a href="#simple-calculations"><i class="fa fa-check"></i><b>1.1.5</b> Simple calculations</a></li>
<li class="chapter" data-level="1.1.6" data-path=""><a href="#getting-help"><i class="fa fa-check"></i><b>1.1.6</b> Getting help</a></li>
<li class="chapter" data-level="1.1.7" data-path=""><a href="#installing-packages"><i class="fa fa-check"></i><b>1.1.7</b> Installing packages</a></li>
</ul></li>
<li class="chapter" data-level="1.2" data-path=""><a href="#basic-probability-and-statistics-with-r"><i class="fa fa-check"></i><b>1.2</b> Basic Probability and Statistics with <code>R</code></a><ul>
<li class="chapter" data-level="1.2.1" data-path=""><a href="#probability-distributions"><i class="fa fa-check"></i><b>1.2.1</b> Probability Distributions</a></li>
<li class="chapter" data-level="" data-path=""><a href="#problem"><i class="fa fa-check"></i>Problem</a></li>
<li class="chapter" data-level="" data-path=""><a href="#answer"><i class="fa fa-check"></i>Answer</a></li>
<li class="chapter" data-level="1.2.2" data-path=""><a href="#summary-statistics"><i class="fa fa-check"></i><b>1.2.2</b> Summary Statistics</a></li>
<li class="chapter" data-level="1.2.3" data-path=""><a href="#numerical-input"><i class="fa fa-check"></i><b>1.2.3</b> Numerical Input</a></li>
<li class="chapter" data-level="1.2.4" data-path=""><a href="#factor-input"><i class="fa fa-check"></i><b>1.2.4</b> Factor Input</a></li>
<li class="chapter" data-level="1.2.5" data-path=""><a href="#dataset-inputs"><i class="fa fa-check"></i><b>1.2.5</b> Dataset inputs</a></li>
</ul></li>
<li class="chapter" data-level="1.3" data-path=""><a href="#main-references"><i class="fa fa-check"></i><b>1.3</b> Main references</a></li>
<li class="chapter" data-level="1.4" data-path=""><a href="#licence"><i class="fa fa-check"></i><b>1.4</b> Licence</a></li>
<li class="chapter" data-level="1.5" data-path=""><a href="#acknowledgments"><i class="fa fa-check"></i><b>1.5</b> Acknowledgments</a></li>
</ul></li>
<li class="divider"></li>
<li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
</ul>
</nav>
</div>
<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./">An Introduction to Statistical Programming Methods with R</a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">
<section class="normal" id="section-">
<div id="header">
<h1 class="title">An Introduction to Statistical Programming Methods with R</h1>
<h4 class="author"><em>Matthew Beckman, Stéphane Guerrier, Justin Lee & Roberto Molinari</em></h4>
<h4 class="date"><em>2017-08-16</em></h4>
</div>
<div id="introduction" class="section level1">
<h1><span class="header-section-number">Chapter 1</span> Introduction</h1>
<p>This book is currently under development and has been designed as a support for students who are following (or are interested in) courses that provide the basic knowledge needed to master “statistical programming” with R. By statistical programming we mean the area of computer programming that focuses on implementing methods that not only manage data but also extract meaningful information from it. The importance of this area comes from the increased collection of data from sources such as academic research, public institutions and private companies, which has required a corresponding increase in data management and analysis tools. Consequently, the need to develop applications and methods that deliver these tools has led to a surge in the demand for expertise not only in computer programming but also in statistical and numerical analysis. Indeed, while it is essential to master the basics of programming to build the necessary software, it is now also paramount to understand the programming tools that can effectively find, extract and analyse information to achieve the required goals.</p>
<p>Within this framework, the statistical software <code>R</code> has seen a rise in use due to its flexibility as a language that builds a bridge between software development and data analysis. There are of course many other programming languages that have advantages over <code>R</code> but, as explained further on, <code>R</code> is able to develop and quickly adapt to the different needs of the data management and analysis community, while at the same time making use of other languages in order to deliver computationally efficient solutions (as well as other interesting features described below). With this premise, this book intends to present the basic tools for statistical programming and software development using the wide variety of tools made available through <code>R</code>, from method-specific packages to version control programs. The general goals of the book are therefore the following:</p>
<ul>
<li>understand data structures in order to manage data, computer memory and computations in an appropriate manner;</li>
<li>manipulate data structures through controls, instructions and tailored functions in order to achieve the desired output;</li>
<li>create software platforms (packages and web applications) that collect the developed functions in order to respond to a certain need;</li>
<li>learn how to manage software development via version control tools (GitHub) and create documentation for this software (with embedded code) to allow others to make use of the software.</li>
</ul>
<p>These goals are common to any basic programming course, but here they will be heavily focused on the use and development of statistical tools. In fact, as highlighted earlier, it has become increasingly important to include statistical methodologies within the programming framework, thereby allowing software not only to manage data efficiently but also to extract and analyse data in an appropriate manner. The rest of this introductory chapter presents the R software by explaining why it is used for this book and by describing the basic notation and tools that need to be known in order to better grasp its contents.</p>
<div class="rmdimportant">
This document is <strong>under development</strong> and it is therefore preferable to always access the text online to be sure you are using the most up-to-date version. Because it is still being developed, you may encounter errors ranging from broken code to typos or poorly explained topics. If you do, please let us know! Simply send an email to ???????? or add an issue to the GitHub repository used for this document (which can be accessed here????) and we will make the changes as soon as possible. In addition, if you know RMarkdown and are familiar with GitHub, you can make a pull request and fix an issue yourself. If you’re not familiar with these tools, they will be explained later on in the book itself.
</div>
<p></p>
<div id="r-and-rstudio" class="section level2">
<h2><span class="header-section-number">1.1</span> <code>R</code> and <code>RStudio</code></h2>
<p>The statistical computing language <code>R</code> has become widely used in academia and industry. Having started as an open-source language to make different statistical and analytical tools available to researchers and the public, it steadily developed into one of the major software languages. It not only allows users to develop up-to-date, sound and flexible analytical tools but also to include these tools within a platform which is well integrated with other important programming languages as well as with communication and version-control features. This integration is also possible thanks to the development of the <code>RStudio</code> interface, which provides a pleasant and functional user interface for <code>R</code> as well as an efficient Integrated Development Environment (IDE) in which different programming languages, web applications and other important tools are available to the user.</p>
<div id="why-r" class="section level3">
<h3><span class="header-section-number">1.1.1</span> Why <code>R</code>?</h3>
<p>There are many reasons to use <code>R</code> nowadays, the first of which is the fact that it is free and open-source software, which <em>per se</em> does not necessarily imply that it is good software (although it is also that). The reason why this is an important feature is that the results of any code or program developed in the <code>R</code> environment can easily be replicated, therefore ensuring accessibility and transparency for the general user. More importantly, however, this replicability of results is accompanied by a wide variety of packages made available through the <code>R</code> environment, in which users can find a diversity of code, functions and features designed to tackle a large number of programming and analytical tasks. Moreover, these packages are relatively simple to create and are extremely useful for code-sharing purposes since they bundle the code, functions and external dependencies, allowing anyone to install all of these features at once in an easy and efficient manner.</p>
<p>In addition to its accessibility and code-sharing features, <code>R</code> has acquired visibility and importance mainly due to the cutting-edge tools that it makes available to the general user. Indeed, a growing area of research both in academia and in industry is Statistics and Machine Learning, through which it is possible to find, extract and make efficient use of the increasing amount of data and information being collected. Many of the latest methods and approaches, from data-mining techniques to predictive analysis, are available in <code>R</code> and, given its open and extensible nature, new methods are likely to be made available to all users through <code>R</code> as they appear. For this reason, individuals, companies and organizations have a keen interest in acquiring and developing expertise in <code>R</code>, since it provides appropriate tools for a wide range of data-based analysis and decision-making processes.</p>
<p>Like any other software, there are of course some drawbacks to using <code>R</code>. Firstly, the large number of user-contributed packages can make usage and bug-reporting problematic. Although this does not represent a major problem, since many forums exist and problems are usually fixed quickly, package updates or deletions can create issues for other existing packages that depend on them. Although this is rare, some packages can consequently become obsolete and need to be fixed because of these dependency issues. Another drawback is the extensive use of computer memory that <code>R</code> entails, since its commands generally give little relevance to this issue. However, many different solutions are being developed to deal with this problem, along with the increased memory made available by current operating systems.</p>
<p>With a view to improving the usage of computer memory, <code>R</code> has been developing efficient and “seamless” connections with high-performance languages, which allow functions and packages to make use of them, thereby greatly lightening and accelerating computations made through <code>R</code>. An important example of this is given by the connections made available to the <code>C++</code> language. In this book we will discuss the connections with this language, which are particularly well implemented, but other high-performance languages such as <code>C</code> and <code>FORTRAN</code> can also be used.</p>
</div>
<div id="getting-started-with-r" class="section level3">
<h3><span class="header-section-number">1.1.2</span> Getting started with <code>R</code></h3>
<p>As mentioned earlier, <code>R</code> can be thought of as a programming language as well as a software environment for statistical programming. Since it is a free and open-source software, all you will need to do is to download it from the following link:</p>
<ul>
<li><a href="https://cran.r-project.org/"><code>R</code></a> .</li>
</ul>
<p>Once you’ve downloaded and installed <code>R</code> on your computer you will be able to start using the programming language and packages that the <code>R</code> environment provides. Nevertheless, to make full use of the latest developments and features of this software, in this book we recommend using the IDE called <code>RStudio</code> which can be downloaded from the following link:</p>
<ul>
<li><a href="https://www.rstudio.com/">RStudio</a> .</li>
</ul>
<div class="rmdimportant">
You cannot use <code>RStudio</code> without having installed <code>R</code> on your computer.
</div>
<p></p>
</div>
<div id="about-rstudio" class="section level3">
<h3><span class="header-section-number">1.1.3</span> About RStudio</h3>
<p><code>RStudio</code> is a customizable IDE for the <code>R</code> environment where the user can have an easily accessible overview of the working directory, files, plots, data, objects and many other features that are useful to work efficiently with <code>R</code>. Moreover, it is possible to create projects, each of which provides a self-contained environment for sets of specific functions and files aimed at dealing with various tasks.</p>
<p>A compact overview of the interface can be found in the RStudio “<a href="https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf">cheatsheet</a>”.</p>
<p>In addition, <code>RStudio</code> provides embedded functions that allow you to synchronize your work on GitHub as well as a set of powerful tools to save and communicate results (whether they be simulations, data analysis or presenting and making available a new package to other users). Some examples of these tools are <code>Rmarkdown</code> and <code>Shiny Web App</code>, which can be used respectively to write up results with embedded <code>R</code> code and to create an online application that provides a user interface to supply data and retrieve results using <code>R</code>. GitHub and <code>Rmarkdown</code> will be described in more depth in the first chapters of this book in order to provide the reader with the version-control and annotation tools that are useful for the following chapters.</p>
</div>
<div id="conventions" class="section level3">
<h3><span class="header-section-number">1.1.4</span> Conventions</h3>
<p>Throughout this book, <code>R</code> code will be typeset using a <code>monospace</code> font which is syntax highlighted. For example:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">a =<span class="st"> </span>pi
b =<span class="st"> </span><span class="fl">0.5</span>
<span class="kw">sin</span>(a*b)</code></pre></div>
<p>Similarly, <code>R</code> output lines (that usually appear in your Console) will begin with <code>##</code> and will not be syntax highlighted. The output of the above example is the following:</p>
<pre><code>## [1] 1</code></pre>
<p>Aside from <code>R</code> code and its outputs, this book will also insert boxes that draw the reader’s attention to details that can be important, curious or purely informative in nature. An example of these boxes was seen at the beginning of this introduction, where the “under construction” nature of this book was pointed out to the reader. The following boxes and symbols are used to represent different kinds of information:</p>
<div class="rmdimportant">
This is an important piece of information.
</div>
<p></p>
<div class="rmdnote">
This is some additional information that could be useful to the reader.
</div>
<p></p>
<div class="rmdcaution">
This is something that the reader should be cautious about but that should not create major problems if not considered.
</div>
<p></p>
<div class="rmdwarning">
This is a warning which should be considered by the reader to avoid problems of various kinds.
</div>
<p></p>
<div class="rmdtip">
This is a tip for the reader when following or developing something based on this book.
</div>
<p></p>
</div>
<div id="simple-calculations" class="section level3">
<h3><span class="header-section-number">1.1.5</span> Simple calculations</h3>
<p>The <code>R</code> environment can also serve as an advanced calculator, which means that it can be used for simple calculations. The table below shows a few examples: the first column gives a mathematical expression, the second gives the equivalent expression in <code>R</code> and the third gives the result output by <code>R</code>.</p>
<table>
<thead>
<tr class="header">
<th align="left">Math.</th>
<th align="left">R</th>
<th align="left">Result</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">2+2</td>
<td align="left"><code>2+2</code></td>
<td align="left"><code>4</code></td>
</tr>
<tr class="even">
<td align="left"><span class="math inline">\(\frac{4}{2}\)</span></td>
<td align="left"><code>4/2</code></td>
<td align="left"><code>2</code></td>
</tr>
<tr class="odd">
<td align="left"><span class="math inline">\(3 \cdot 2^{-0.8}\)</span></td>
<td align="left"><code>3*2^(-0.8)</code></td>
<td align="left"><code>1.723048</code></td>
</tr>
<tr class="even">
<td align="left"><span class="math inline">\(\sqrt{2}\)</span></td>
<td align="left"><code>sqrt(2)</code></td>
<td align="left"><code>1.414214</code></td>
</tr>
<tr class="odd">
<td align="left"><span class="math inline">\(\pi\)</span></td>
<td align="left"><code>pi</code></td>
<td align="left"><code>3.141593</code></td>
</tr>
<tr class="even">
<td align="left"><span class="math inline">\(\ln(2)\)</span></td>
<td align="left"><code>log(2)</code></td>
<td align="left"><code>0.6931472</code></td>
</tr>
<tr class="odd">
<td align="left"><span class="math inline">\(\log_{3}(9)\)</span></td>
<td align="left"><code>log(9, base = 3)</code></td>
<td align="left"><code>2</code></td>
</tr>
<tr class="even">
<td align="left"><span class="math inline">\(e^{1.1}\)</span></td>
<td align="left"><code>exp(1.1)</code></td>
<td align="left"><code>3.004166</code></td>
</tr>
<tr class="odd">
<td align="left"><span class="math inline">\(\cos(\sqrt{0.9})\)</span></td>
<td align="left"><code>cos(sqrt(0.9))</code></td>
<td align="left"><code>0.5827536</code></td>
</tr>
</tbody>
</table>
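<p>As a minimal sketch of how such calculations can be combined, the results can be stored in objects and reused. The names <code>x</code> and <code>y</code> are arbitrary and only used for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Store intermediate values and reuse them in later calculations
x = 3 * 2^(-0.8)      # same expression as in the table
y = log(9, base = 3)  # logarithm of 9 in base 3
round(x, 6)           # x rounded to six decimal places
x * y                 # product of the two stored results</code></pre></div>
<p>Here <code>round()</code> is only used to control the number of decimal places that are shown.</p>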
</div>
<div id="getting-help" class="section level3">
<h3><span class="header-section-number">1.1.6</span> Getting help</h3>
<p>In the previous section we presented some examples of how <code>R</code> can be used as a calculator and we have already seen several functions such as <code>sqrt()</code> or <code>log()</code>. To obtain documentation about a function in <code>R</code>, simply put a question mark in front of the function name (or wrap the function name in <code>help()</code>) and its documentation will be displayed. For example, if you are interested in learning about the function <code>log()</code> you simply type:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">?log</code></pre></div>
<p>which will display something similar to:</p>
<div class="figure">
<img src="images/example-log-help.png" />
</div>
<p>The <code>R</code> documentation can sometimes be very technical or hard to interpret. In these cases, the best way to understand a function is often to look for help with a search engine: you will probably find forums such as “Cross Validated” or other “Stack Exchange” sites in which the questions you have about a function have already been asked and answered by other users.</p>
<div class="rmdtip">
You can often use the error message to search for answers about a problem you may have with a function.
</div>
<p></p>
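<p>Besides the question mark, base <code>R</code> offers a few other ways to reach the documentation. The following is a short sketch using <code>log()</code> as the example, and the keyword <code>"logarithm"</code> is only an illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">help(log)                 # same as ?log
help.search("logarithm")  # search the installed documentation by keyword
example(log)              # run the examples given in the help page of log()
args(log)                 # show the arguments of log() without opening the help page</code></pre></div>
<p>The keyword search is useful when you know what you want to do but not the name of the function that does it.</p>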
</div>
<div id="installing-packages" class="section level3">
<h3><span class="header-section-number">1.1.7</span> Installing packages</h3>
<p><code>R</code> comes with a number of built-in functions but one of its main strengths is the large number of packages that you can install. These packages provide additional functions, features and data to the <code>R</code> environment. If you want to do something in <code>R</code> that is not available by default, there is a good chance that a package exists that responds to your needs. In this case, an appropriate way to find a package is to use the search option of the CRAN repository, which is the official network of FTP and web servers that store up-to-date versions of code and documentation for <code>R</code> (see the CRAN website). Another general approach is simply to use a search engine and type the keywords of the tools you are looking for followed by “R package”.</p>
<p><code>R</code> packages can be installed in various ways but in this section we will only discuss the most straightforward approach, which is through the <code>install.packages()</code> function. Another way is to use the “Tools -&gt; Install Packages…” path from the dropdown menus in <code>RStudio</code>, but the <code>install.packages()</code> function has the advantage of working on any platform on which the <code>R</code> environment runs. It must be underlined that these approaches require the packages to be available within the CRAN repository. However, there is a growing number of packages that are under development or completed and are made available through other repositories. For this setting, Chapter ????? (github) will show other ways of installing packages from a commonly used repository called “GitHub”.</p>
<p>Sticking for the moment to the packages available in the CRAN repository, using the <code>install.packages()</code> function is quite simple. For example, if you want to install the package <code>devtools</code> you can simply write:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">install.packages</span>(<span class="st">"devtools"</span>)</code></pre></div>
<p>Once a package is installed it is not directly usable within your <code>R</code> session. To use it, you will have to “load” the package into your current <code>R</code> session, which is generally done through the function <code>library()</code>. For example, after having installed the <code>devtools</code> package, in order to use it within your session you would write:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(devtools)</code></pre></div>
<p>Once this is done, all the functions and documentation of this package are available and can be used within your current session. However, once you close your <code>R</code> session, all loaded packages will be unloaded and you will have to load them again if you want to use them in a new <code>R</code> session.</p>
<div class="rmdnote">
Please note that although packages need to be loaded in each session in which you want to use them, they need to be installed only once. The only exception to this rule is when you need to update the package or reinstall it for some reason.
</div>
<p></p>
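<p>Since a package only needs to be installed once, a script can check whether it is already available before installing it. The following is a minimal sketch of this idea. The helper <code>install_if_missing()</code> is a hypothetical name introduced here for illustration and is not part of <code>R</code>, while <code>requireNamespace()</code> and <code>install.packages()</code> are base functions:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: install a CRAN package only when it is missing
install_if_missing = function(pkg) {
  # requireNamespace() returns FALSE when the package cannot be found
  missing_pkg = !requireNamespace(pkg, quietly = TRUE)
  if (missing_pkg) {
    install.packages(pkg)
  }
  invisible(missing_pkg)  # TRUE if an installation was attempted
}

# "stats" ships with R, so nothing is installed by this call
install_if_missing("stats")</code></pre></div>
<p>Calling <code>install_if_missing("devtools")</code> would install <code>devtools</code> only the first time, but <code>library(devtools)</code> would still be needed in every new session.</p>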
</div>
</div>
<div id="basic-probability-and-statistics-with-r" class="section level2">
<h2><span class="header-section-number">1.2</span> Basic Probability and Statistics with <code>R</code></h2>
<p>The <code>R</code> environment provides an up-to-date and efficient programming language to develop different tools and applications. Nevertheless, its main functionality lies in the core statistical framework and tools that constitute the basis of this language. Indeed, this book aims to introduce and describe the methods and approaches of statistical programming, which require a basic knowledge of Probability and Statistics in order to grasp the logic and usefulness of the features presented in this book.</p>
<p>For this reason, we will briefly take the reader through some of the basic functions that are available within <code>R</code> to obtain probabilities based on parametric distributions, compute summary statistics and understand basic data structures. The part on data structures is only an introduction, and a more in-depth description of different data structures will be given in Chapter ???.</p>
<div id="probability-distributions" class="section level3">
<h3><span class="header-section-number">1.2.1</span> Probability Distributions</h3>
<p>Probability distributions can be uniquely characterized by different functions such as, for example, their density or distribution functions. Based on these it is possible to compute theoretical quantiles and also randomly sample observations from them. Replacing the <code>R</code> syntax for a given probability distribution with the general syntax <code>name</code>, all these functions and calculations are made available in <code>R</code> through the built-in functions:</p>
<ul>
<li><code>dname</code> calculates the value of the density function (pdf);</li>
<li><code>pname</code> calculates the value of the distribution function (cdf);</li>
<li><code>qname</code> calculates the value of the theoretical quantile;</li>
<li><code>rname</code> generates a random sample from a particular distribution.</li>
</ul>
<p>Note that, when using these functions in practice, <code>name</code> is replaced with the syntax used in <code>R</code> to denote a specific probability distribution. For example, if we wish to deal with a Uniform probability distribution, then <code>name</code> is replaced by <code>unif</code> and, continuing the example, the function used to randomly generate observations from a Uniform distribution is therefore <code>runif</code>. <code>R</code> provides these functions for a wide variety of probability distributions that include, but are not limited to: Gaussian (or Normal), Binomial, Chi-square, Exponential, F-distribution, Geometric, Poisson, Student-t and Uniform. In order to get an idea of how these functions can be used, below is an example of a problem that can be solved using them.</p>
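<p>Before turning to that problem, the four prefixes can be sketched for the Uniform distribution on the interval from 0 to 1. The evaluation point 0.3, the sample size and the seed are arbitrary choices made only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(123)  # arbitrary seed, only to make the random draws reproducible

density_value  &lt;- dunif(0.3, min = 0, max = 1)  # pdf at 0.3, equal to 1 on [0, 1]
cdf_value      &lt;- punif(0.3, min = 0, max = 1)  # P(X &lt;= 0.3), equal to 0.3
quantile_value &lt;- qunif(0.3, min = 0, max = 1)  # quantile of order 0.3, equal to 0.3
random_sample  &lt;- runif(5, min = 0, max = 1)    # five random observations

density_value
cdf_value
quantile_value
random_sample</code></pre></div>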
</div>
<div id="problem" class="section level3 unnumbered">
<h3>Problem</h3>
<p>Assume that the test scores of a college entrance exam follow a Normal distribution. Furthermore, suppose that the mean test score is 70 and that the standard deviation is 15. How would we find the percentage of students scoring 90 or more in this exam?</p>
</div>
<div id="answer" class="section level3 unnumbered">
<h3>Answer</h3>
<p>In this case, we consider a random variable <span class="math inline">\(X\)</span> that is normally distributed as follows: <span class="math inline">\(X \sim N(\mu=70, \sigma^2=225)\)</span> where <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma^2\)</span> represent the mean and variance of the distribution respectively (the variance being the square of the standard deviation, <span class="math inline">\(15^2 = 225\)</span>). Since we are looking for the probability of students scoring higher than 90, we are interested in finding <span class="math inline">\(\mathbb{P}(X > x)\)</span> with <span class="math inline">\(x = 90\)</span> and therefore we look at the upper tail of the Normal distribution. To find this probability we need the distribution function (<code>pname</code>), in which we replace <code>name</code> with the <code>R</code> syntax for the Normal distribution: <code>norm</code>. The distribution function in <code>R</code> has various arguments to be specified in order to compute a probability. For the Normal distribution, these can be found by typing <code>?pnorm</code> in the Console and are:</p>
<ul>
<li><code>q</code>: the quantile we are interested in (e.g. 90);</li>
<li><code>mean</code>: the mean of the distribution (e.g. 70);</li>
<li><code>sd</code>: the standard deviation of the distribution (e.g. 15);</li>
<li><code>lower.tail</code>: a boolean determining whether to compute the probability of being smaller than or equal to the given quantile (i.e. <span class="math inline">\(\mathbb{P}(X \leq x)\)</span>), which corresponds to the default value <code>TRUE</code>, or larger than it (i.e. <span class="math inline">\(\mathbb{P}(X > x)\)</span>), which requires specifying the value <code>FALSE</code>.</li>
</ul>
<p>Knowing these arguments, it is now possible to compute the probability we are interested in as follows:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">pnorm</span>(<span class="dt">q =</span> <span class="dv">90</span>, <span class="dt">mean =</span> <span class="dv">70</span>, <span class="dt">sd =</span> <span class="dv">15</span>, <span class="dt">lower.tail =</span> <span class="ot">FALSE</span>) </code></pre></div>
<pre><code>## [1] 0.09121122</code></pre>
<p>As we can see from the output, there is roughly a 9% probability of students scoring 90 or more in the exam.</p>
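<p>As a sanity check, the same probability can be recovered in other ways. The sketch below (with variable names chosen only for this illustration) uses the complement rule, standardizes the score to a standard Normal, and then inverts the calculation with the quantile function <code>qnorm</code>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Upper-tail probability computed directly, as in the answer above
p_upper &lt;- pnorm(q = 90, mean = 70, sd = 15, lower.tail = FALSE)

# Same probability via the complement rule: P(X &gt; 90) = 1 - P(X &lt;= 90)
p_complement &lt;- 1 - pnorm(q = 90, mean = 70, sd = 15)

# Same probability after standardizing: z = (90 - 70) / 15
p_standard &lt;- pnorm(q = (90 - 70) / 15, lower.tail = FALSE)

# The quantile function inverts the calculation and recovers the score
score &lt;- qnorm(p = p_upper, mean = 70, sd = 15, lower.tail = FALSE)

all.equal(p_upper, p_complement)  # TRUE
all.equal(p_upper, p_standard)    # TRUE
all.equal(score, 90)              # TRUE</code></pre></div>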
</div>
<div id="summary-statistics" class="section level3">
<h3><span class="header-section-number">1.2.2</span> Summary Statistics</h3>
<p>While the previous functions deal with theoretical distributions, it is also necessary to deal with real data from which we would like to extract information. This data, for example, can be simulated using the function <code>rname</code> (e.g. <code>rnorm</code>) and, supposing we don’t know from which distribution it is generated, we would be interested in understanding the behavior of the data in order to eventually identify a distribution and estimate its parameters.</p>
<p>The use of certain functions varies according to the nature of the inputs since these can be, for example, numerical or factors.</p>
</div>
<div id="numerical-input" class="section level3">
<h3><span class="header-section-number">1.2.3</span> Numerical Input</h3>
<p>A first step in analysing numerical inputs is to compute summary statistics of the data which, in this section, we generally denote as <code>x</code> (we will discuss the structure of this data in more detail in the following chapters). For central tendency or spread statistics of a numerical input, we can use the following <code>R</code> built-in functions:</p>
<ul>
<li><code>mean</code> calculates the mean of an input <code>x</code>;</li>
<li><code>median</code> calculates the median of an input <code>x</code>;</li>
<li><code>var</code> calculates the variance of an input <code>x</code>;</li>
<li><code>sd</code> calculates the standard deviation of an input <code>x</code>;</li>
<li><code>IQR</code> calculates the interquartile range of an input <code>x</code>;</li>
<li><code>min</code> calculates the minimum value of an input <code>x</code>;</li>
<li><code>max</code> calculates the maximum value of an input <code>x</code>;</li>
<li><code>range</code> returns a vector containing the minimum and maximum of all given arguments;</li>
<li><code>summary</code> returns a vector containing a mixture of the above functions (i.e. mean, median, first and third quartile, minimum, maximum).</li>
</ul>
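<p>A minimal sketch of these functions applied to simulated data is given below. The sample size, the distribution parameters and the seed are arbitrary assumptions made for illustration, so the printed values are not meant to be compared with any real dataset.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)                             # arbitrary seed for reproducibility
x &lt;- rnorm(n = 100, mean = 70, sd = 15)  # 100 simulated observations

mean(x)     # sample mean
median(x)   # sample median
var(x)      # sample variance
sd(x)       # sample standard deviation, i.e. sqrt(var(x))
IQR(x)      # third quartile minus first quartile
range(x)    # c(min(x), max(x))
summary(x)  # minimum, quartiles, median, mean and maximum at once</code></pre></div>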
</div>
<div id="factor-input" class="section level3">
<h3><span class="header-section-number">1.2.4</span> Factor Input</h3>
<p>If the data of interest is a factor with different categories or levels, then it obviously cannot be treated as a numerical variable and other statistics need to be computed. For example, for a factor input we can extract counts (and, from these, percentages) to summarize the variable by using <code>table</code>. Using functions and data structures that will be described in the following chapters, below we create an example dataset with 80 observations of three different colors: 20 being <code>Yellow</code>, 10 being <code>Green</code> and 50 being <code>Blue</code>. We then apply the <code>table</code> function to it:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">table</span>(<span class="kw">as.factor</span>(<span class="kw">c</span>(<span class="kw">rep</span>(<span class="st">"Yellow"</span>, <span class="dv">20</span>), <span class="kw">rep</span>(<span class="st">"Green"</span>, <span class="dv">10</span>), <span class="kw">rep</span>(<span class="st">"Blue"</span>, <span class="dv">50</span>))))</code></pre></div>
<pre><code>##
## Blue Green Yellow
## 50 10 20</code></pre>
<p>By doing so we obtain a frequency (count) table of the colors.</p>
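<p>The counts can be turned into percentages with the base <code>R</code> function <code>prop.table</code>. The sketch below rebuilds the same example data; the object names <code>colors</code>, <code>counts</code> and <code>percentages</code> are chosen only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Same example data as above, stored in a variable
colors &lt;- factor(c(rep("Yellow", 20), rep("Green", 10), rep("Blue", 50)))

counts      &lt;- table(colors)             # frequency (count) table
percentages &lt;- 100 * prop.table(counts)  # share of each level, in percent

percentages  # Blue 62.5, Green 12.5, Yellow 25</code></pre></div>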
</div>
<div id="dataset-inputs" class="section level3">
<h3><span class="header-section-number">1.2.5</span> Dataset inputs</h3>
<p>In many cases, when dealing with data we are actually dealing with datasets (see Chapter ???) where variables of different nature are aligned together (usually in columns). For datasets there is another convenient way to get simple summary statistics, which consists of applying the function <code>summary</code> to the dataset itself (instead of simply a numerical input as seen earlier).</p>
<p>As an example, let us explore the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris</a> flower dataset contained in the <code>R</code> built-in <code>datasets</code> package. The dataset consists of 50 samples from each of three species of Iris (Setosa, Virginica and Versicolor). Four features were measured on each sample: the length and the width (in centimeters) of both the sepals and the petals. This dataset is widely used as an example since it was used by Fisher to develop a linear discriminant model with which he intended to distinguish the three species from each other using combinations of these four features.</p>
<p>Let us apply the <code>summary</code> function to this dataset to output the minimum, first quartile, median, mean, third quartile and maximum (for the numerical variables in the dataset) and frequency counts (for factor inputs).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">summary</span>(iris)</code></pre></div>
<pre><code>## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
## </code></pre>
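<p>The reason <code>summary</code> reports different statistics for different columns is that it adapts to the class of each variable. A small sketch of how this could be checked on <code>iris</code> follows; the object name <code>col_classes</code> is chosen only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Class of each column: four numeric measurements and one factor
col_classes &lt;- sapply(iris, class)
col_classes

# A numeric column yields the six-number summary ...
summary(iris$Sepal.Length)

# ... while a factor column yields the counts per level
summary(iris$Species)</code></pre></div>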
</div>
</div>
<div id="main-references" class="section level2">
<h2><span class="header-section-number">1.3</span> Main references</h2>
<p>This is not the first (or the last) book that has been written explaining and describing statistical programming in <code>R</code>. Indeed, this can be seen as a book that brings together and reorganizes information and material from other sources structuring and tailoring it to a course in basic statistical programming. The main references (which are far from being an exhaustive review of literature) that can be used to have a more in-depth view of different aspects treated in this book are:</p>
<ul>
<li><span class="citation">Wickham (<a href="#ref-wickham2014advanced">2014</a>)</span> : a more technical and advanced introduction to <code>R</code>;</li>
<li><span class="citation">Xie (<a href="#ref-xie2015">2015</a>)</span> : an overview of document generation in <code>R</code>;</li>
<li>…</li>
</ul>
</div>
<div id="licence" class="section level2">
<h2><span class="header-section-number">1.4</span> Licence</h2>
<p>We probably should pick a licence… How about: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License? We should probably move this to the end of this section, no?</p>
</div>
<div id="acknowledgments" class="section level2">
<h2><span class="header-section-number">1.5</span> Acknowledgments</h2>
<p>…</p>
</div>
</div>
<h3>References</h3>
<div id="refs" class="references">
<div id="ref-wickham2014advanced">
<p>Wickham, Hadley. 2014. <em>Advanced R</em>. CRC Press.</p>
</div>
<div id="ref-xie2015">
<p>Xie, Yihui. 2015. <em>Dynamic Documents with R and Knitr</em>. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. <a href="http://yihui.name/knitr/" class="uri">http://yihui.name/knitr/</a>.</p>
</div>
</div>
</section>
</div>
</div>
</div>
<script src="libs/gitbook-2.6.7/js/app.min.js"></script>
<script src="libs/gitbook-2.6.7/js/lunr.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-search.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-sharing.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-fontsettings.js"></script>
<script src="libs/gitbook-2.6.7/js/plugin-bookdown.js"></script>
<script src="libs/gitbook-2.6.7/js/jquery.highlight.js"></script>
<script>
require(["gitbook"], function(gitbook) {
gitbook.start({
"sharing": {
"facebook": true,
"twitter": true,
"google": false,
"weibo": false,
"instapper": false,
"vk": false,
"all": ["facebook", "google", "twitter", "weibo", "instapaper"]
},
"fontsettings": {
"theme": "white",
"family": "sans",
"size": 2
},
"edit": {
"link": "https://github.com/rstudio/bookdown-demo/edit/master/%s",
"text": "Edit"
},
"download": null,
"toc": {
"collapse": "subsection"
},
"search": false
});
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
if (location.protocol !== "file:" && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>