<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<title>Descriptive statistics</title>
<meta content="" name="description">
<meta content="" name="keywords">
<!-- Favicons -->
<link href="assets/img/Favicon-1.png" rel="icon">
<link href="assets/img/Favicon-1.png" rel="apple-touch-icon">
<!-- Google Fonts -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Raleway:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">
<!-- Vendor CSS Files -->
<link href="assets/vendor/aos/aos.css" rel="stylesheet">
<link href="assets/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
<link href="assets/vendor/bootstrap-icons/bootstrap-icons.css" rel="stylesheet">
<link href="assets/vendor/boxicons/css/boxicons.min.css" rel="stylesheet">
<link href="assets/vendor/glightbox/css/glightbox.min.css" rel="stylesheet">
<link href="assets/vendor/swiper/swiper-bundle.min.css" rel="stylesheet">
<!-- Creating a python code section-->
<link rel="stylesheet" href="assets/css/prism.css">
<script src="assets/js/prism.js"></script>
<!-- Template Main CSS File -->
<link href="assets/css/style.css" rel="stylesheet">
<!-- To set the icon, visit https://fontawesome.com/account-->
<script src="https://kit.fontawesome.com/5d25c1efd3.js" crossorigin="anonymous"></script>
<!-- end of icon-->
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<!-- =======================================================
* Template Name: iPortfolio
* Updated: Sep 18 2023 with Bootstrap v5.3.2
* Template URL: https://bootstrapmade.com/iportfolio-bootstrap-portfolio-websites-template/
* Author: BootstrapMade.com
* License: https://bootstrapmade.com/license/
======================================================== -->
</head>
<body>
<!-- ======= Mobile nav toggle button ======= -->
<i class="bi bi-list mobile-nav-toggle d-xl-none"></i>
<!-- ======= Header ======= -->
<header id="header">
<div class="d-flex flex-column">
<div class="profile">
<img src="assets/img/myphoto.jpeg" alt="" class="img-fluid rounded-circle">
<h1 class="text-light"><a href="index.html">Arun</a></h1>
<div class="social-links mt-3 text-center">
<a href="https://www.linkedin.com/in/arunp77/" target="_blank" class="linkedin"><i class="bx bxl-linkedin"></i></a>
<a href="https://github.com/arunp77" target="_blank" class="github"><i class="bx bxl-github"></i></a>
<a href="https://twitter.com/arunp77_" target="_blank" class="twitter"><i class="bx bxl-twitter"></i></a>
<a href="https://www.instagram.com/arunp77/" target="_blank" class="instagram"><i class="bx bxl-instagram"></i></a>
<a href="https://arunp77.medium.com/" target="_blank" class="medium"><i class="bx bxl-medium"></i></a>
</div>
</div>
<nav id="navbar" class="nav-menu navbar">
<ul>
<li><a href="index.html#hero" class="nav-link scrollto active"><i class="bx bx-home"></i> <span>Home</span></a></li>
<li><a href="index.html#about" class="nav-link scrollto"><i class="bx bx-user"></i> <span>About</span></a></li>
<li><a href="index.html#resume" class="nav-link scrollto"><i class="bx bx-file-blank"></i> <span>Resume</span></a></li>
<li><a href="index.html#portfolio" class="nav-link scrollto"><i class="bx bx-book-content"></i> <span>Portfolio</span></a></li>
<li><a href="index.html#skills-and-tools" class="nav-link scrollto"><i class="bx bx-wrench"></i> <span>Skills and Tools</span></a></li>
<li><a href="index.html#language" class="nav-link scrollto"><i class="bi bi-menu-up"></i> <span>Languages</span></a></li>
<li><a href="index.html#awards" class="nav-link scrollto"><i class="bi bi-award-fill"></i> <span>Awards</span></a></li>
<li><a href="index.html#professionalcourses" class="nav-link scrollto"><i class="bx bx-book-alt"></i> <span>Professional Certification</span></a></li>
<li><a href="index.html#publications" class="nav-link scrollto"><i class="bx bx-news"></i> <span>Publications</span></a></li>
<li><a href="index.html#extra-curricular" class="nav-link scrollto"><i class="bx bx-rocket"></i> <span>Extra-Curricular Activities</span></a></li>
<!-- <li><a href="#contact" class="nav-link scrollto"><i class="bx bx-envelope"></i> <span>Contact</span></a></li> -->
</ul>
</nav><!-- .nav-menu -->
</div>
</header><!-- End Header -->
<main id="main">
<!-- ======= Breadcrumbs ======= -->
<section id="breadcrumbs" class="breadcrumbs">
<div class="container">
<div class="d-flex justify-content-between align-items-center">
<h2></h2>
<ol>
<li><a href="machine-learning.html" class="clickable-box">Content section</a></li>
<li><a href="index.html#portfolio" class="clickable-box">Portfolio section</a></li>
</ol>
</div>
</div>
</section><!-- End Breadcrumbs -->
<!-- Right dropdown menu -->
<div class="right-side-list">
<div class="dropdown">
<button class="dropbtn"><strong>Shortcuts:</strong></button>
<div class="dropdown-content">
<ul>
<li><a href="cloud-compute.html"><i class="fas fa-cloud"></i> Cloud</a></li>
<li><a href="AWS-GCP.html"><i class="fas fa-cloud"></i> AWS-GCP</a></li>
<li><a href="amazon-s3.html"><i class="fas fa-cloud"></i> AWS S3</a></li>
<li><a href="ec2-confi.html"><i class="fas fa-server"></i> EC2</a></li>
<li><a href="Docker-Container.html"><i class="fab fa-docker" style="color: rgb(29, 27, 27);"></i> Docker</a></li>
<li><a href="Jupyter-nifi.html"><i class="fab fa-python" style="color: rgb(34, 32, 32);"></i> Jupyter-nifi</a></li>
<li><a href="snowflake-task-stream.html"><i class="fas fa-snowflake"></i> Snowflake</a></li>
<li><a href="data-model.html"><i class="fas fa-database"></i> Data modeling</a></li>
<li><a href="sql-basics.html"><i class="fas fa-table"></i> QL</a></li>
<li><a href="sql-basic-details.html"><i class="fas fa-database"></i> SQL</a></li>
<li><a href="Bigquerry-sql.html"><i class="fas fa-database"></i> BigQuery</a></li>
<li><a href="scd.html"><i class="fas fa-archive"></i> SCD</a></li>
<li><a href="sql-project.html"><i class="fas fa-database"></i> SQL project</a></li>
<!-- Add more subsections as needed -->
</ul>
</div>
</div>
</div>
<!-- ======= Portfolio Details Section ======= -->
<section id="portfolio-details" class="portfolio-details">
<div class="container">
<div class="row gy-4">
<div class="col-lg-8">
<div class="portfolio-details-slider swiper">
<div class="swiper-wrapper align-items-center">
<div class="swiper-slide">
<h1>Descriptive statistics</h1>
<figure>
<img src="assets/img/data-engineering/classification.png" alt="" style="max-width: 50%; max-height: 50%;">
<figcaption></figcaption>
</figure>
</div>
</div>
<div class="swiper-pagination"></div>
</div>
</div>
</div>
<section id="introdction">
<h2>Descriptive statistics</h2>
Descriptive statistics is a branch of statistics that deals with the collection, organization, analysis, and presentation of data. It involves summarizing and describing the main features of a dataset, such as the central tendency, variability, and distribution of the data.
<figure>
<img src="assets/img/data-engineering/descriptive-stat.png" alt="" style="max-width: 90%; max-height: 90%;">
<figcaption>Image credit: Scribbr</figcaption>
</figure>
Some common measures of descriptive statistics include:
<ol>
<li><strong>Measures of central tendency: </strong>
<ul>
<li><p><strong>Mean: </strong>The mean is the arithmetic average of a dataset: add up all the values and divide by the total number of values. If the values
\(x_1, x_2, \ldots, x_k\) occur with frequencies \(f_1, f_2, \ldots, f_k\), then</p>
$$\mu = \frac{\sum_{i=1}^{k} f_i x_i}{N}, \qquad N = \sum_{i=1}^{k} f_i$$
<p>i.e.</p>
$$\text{Mean} = \frac{\text{sum of all values}}{\text{total number of values}}$$
<p><strong>Example:</strong> if we have a dataset of test scores for a class of students: 70, 80, 90, 85, and 75, we can calculate the mean by adding up all the scores and dividing by the total number of scores: Mean = (70 + 80 + 90 + 85 + 75) / 5 = 80. So the mean test score for the class is 80.</p>
<p>The mean is commonly used in statistics to summarize and describe a dataset, and is often used as a benchmark for making comparisons between different groups or distributions. However, the mean can be affected by extreme values or outliers, which can skew the results. In such cases, it may be more appropriate to use other measures of central tendency, such as the median or mode, to represent the typical or central value of the dataset.</p>
</li>
<li><p><strong>Median: </strong>
The median is the middle value of a dataset when the values are arranged in order of magnitude. It is used to represent the typical or central value when the data are skewed or have outliers.</p>
<ul>
<li><strong>How to calculate?:</strong> To calculate the median, follow these steps:</li>
</ul>
<ol>
<li>Arrange the values in the dataset in order from smallest to largest (or vice versa).</li>
<li>If the dataset has an odd number of values, the median is the middle value. For example, in the dataset {1, 3, 5, 7, 9}, the median is 5 because it is the middle value.</li>
<li>If the dataset has an even number of values, the median is the average of the two middle values. For example, in the dataset {1, 3, 5, 7, 9, 11}, the two middle values are 5 and 7, so the median is (5+7)/2 = 6.</li>
</ol>
<p>The median is a useful measure of central tendency for datasets that have outliers or extreme values, as it is less sensitive to these values than the mean. Additionally, the median is appropriate for ordinal data, where the values have an inherent order but the distance between values is not meaningful (e.g. ranks, grades).</p>
</li>
<li><p><strong>Mode: </strong>
The mode is the value that occurs most frequently in a dataset. It is used to represent the most common or typical value when the data are categorical or have a discrete distribution. Unlike the mean and median, the mode depends only on how often each value occurs, not on the numerical size of the values.</p>
<ul>
<li><p><strong>How to calculate?:</strong> The mode can be calculated for any type of data, including nominal, ordinal, interval, and ratio data. In a dataset with a single mode, there is only one value that occurs more frequently than any other value. However, it is also possible to have datasets with multiple modes, where there are several values that occur with the same highest frequency.</p>
<p><strong>Example:</strong> Here is an example of how to calculate the mode for a dataset of heights:</p>
<ol>
<li><p>Sort the dataset in ascending order: 62, 64, 66, 66, 68, 68, 68, 70, 70, 72.</p></li>
<li><p>Count the frequency of each value: 62 (1), 64 (1), 66 (2), 68 (3), 70 (2), 72 (1).</p></li>
<li><p>Identify the value with the highest frequency: 68.</p></li>
<li><p>The mode of the dataset is 68, indicating that 68 is the most common height in the dataset.</p></li>
</ol>
<p>Note that a dataset has no mode if all the values occur with the same frequency. The mode can also be a poor measure of central tendency for continuous data, where few values repeat exactly, or when the most frequent value lies far from the bulk of the data.</p>
</li>
<li><p>The mode is often used in conjunction with other measures of central tendency, such as mean and median, to gain a better understanding of the underlying distribution of the data. It is especially useful for describing skewed distributions, where the mean and median may not accurately represent the central tendency of the data.</p></li>
</ul>
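<p>As a minimal sketch, the three measures of central tendency can be checked with Python's standard <code>statistics</code> module, reusing the example data from this section. The variable names are illustrative only.</p>

```python
from statistics import mean, median, multimode

scores = [70, 80, 90, 85, 75]                        # test scores from the mean example
heights = [62, 64, 66, 66, 68, 68, 68, 70, 70, 72]   # heights from the mode example

mean_score = mean(scores)                    # (70 + 80 + 90 + 85 + 75) / 5
median_odd = median([1, 3, 5, 7, 9])         # odd count: the middle value
median_even = median([1, 3, 5, 7, 9, 11])    # even count: average of 5 and 7
modes = multimode(heights)                   # list of all most frequent values

print(mean_score, median_odd, median_even, modes)
```

<p><code>multimode</code> returns a list, so a dataset with several equally frequent values yields all of them rather than a single mode.</p>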
</li>
</ul>
<p><strong>Choice of measure: </strong>The choice of which measure of central tendency to use depends on the nature of the data and the research question. The mean is commonly used when the data are roughly symmetric (for example, normally distributed) and free of strong outliers. The median is used when the data are skewed or have outliers. The mode is used when the data are categorical or have a discrete distribution.</p>
</li>
<li><strong>Measures of variability: </strong>
<p>Measures of variability are statistical measures that describe the spread or dispersion of a dataset. Some common measures of variability include:</p>
<ul>
<li><p><strong>Range:</strong> The range is the difference between the maximum and minimum values in a dataset. It is the simplest measure of variability but can be heavily influenced by outliers. It is calculated using the formula:</p>
$$\text{Range} = \text{max value} - \text{min value}$$
<p><strong>Example:</strong> if a dataset consists of the following values: 2, 5, 7, 8, 12, the range would be calculated as:</p>
<p>Range = 12 - 2 = 10</p>
</li>
<li><p><strong>Variance:</strong> The variance measures how much the values in a dataset vary from the mean. It is the average of the squared differences between each value and the mean. For a population it is calculated using the formula:</p>
$$\text{Variance} = \sigma^2 = \frac{\sum (x-\mu)^2}{n}$$
<p>where \(\sum\) denotes summation over all values, \(x\) represents each value in the dataset, \(\mu\) represents the mean of the dataset, and \(n\) represents the number of values in the dataset.</p>
<p>Variance is commonly used in statistical analysis. Because the differences are squared, it is strongly influenced by extreme values.</p>
<p><strong>Example:</strong> If a dataset consists of the following values: 10, 15, 20, 25, 30, and the mean is calculated to be 20, the variance would be calculated as:</p>
$$\text{Variance} = \frac{(10-20)^2 + (15-20)^2 + (20-20)^2 + (25-20)^2 + (30-20)^2}{5} = \frac{250}{5} = 50$$
</li>
<li><p><strong>Standard deviation:</strong> Standard deviation is a measure of how spread out a set of data is from its mean or average. It tells you how much the data deviates from the average. A low standard deviation indicates that the data is clustered closely around the mean, while a high standard deviation indicates that the data is spread out over a larger range of values. It is a commonly used measure of variability and is often preferred over the variance because it is expressed in the same units as the original data. The formula for standard deviation is:</p>
$$\sigma = \sqrt{\frac{\sum (x-\mu)^2}{n}}$$
<p>(Standard deviation of the population)</p>
<p>where:</p>
<ul>
<li>\(\sigma\) is the standard deviation</li>
<li>\(\sum\) denotes summation over all the data points</li>
<li>\(x\) is each individual data point</li>
<li>\(\mu\) is the mean or average of the data</li>
<li>\(n\) is the total number of data points</li>
</ul>
<p><strong>Method:</strong> To find the standard deviation, subtract the mean from each data point, square each difference, sum the squared differences, divide by the total number of data points, and finally take the square root of the result.</p>
<p><strong>Example:</strong> let's say you have the following set of data: {2, 4, 6, 8, 10}.</p>
<ul>
<li>First, find the mean: \(\mu = (2 + 4 + 6 + 8 + 10) / 5 = 6\).</li>
<li>Next, calculate the difference between each data point and the mean: (2 - 6) = -4, (4 - 6) = -2, (6 - 6) = 0, (8 - 6) = 2, (10 - 6) = 4.</li>
<li>Then, square each of these differences: \((-4)^2 = 16, (-2)^2 = 4, (0)^2 = 0, (2)^2 = 4, (4)^2 = 16.\)</li>
<li>Add up all the squared differences: 16 + 4 + 0 + 4 + 16 = 40.</li>
<li>Divide by the total number of data points: 40 / 5 = 8.</li>
<li>Finally, take the square root of the result: \(\sqrt{8} \approx 2.83\). So, the standard deviation of this set of data is approximately 2.83.</li>
</ul>
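<p>The calculation above can be sketched in a few lines of Python. This is only an illustration of the population formula (division by \(n\)); the variable names are assumptions, and <code>statistics.pstdev</code> from the standard library is used as a cross-check.</p>

```python
from math import sqrt
from statistics import pstdev

data = [2, 4, 6, 8, 10]      # data set from the standard deviation example

mu = sum(data) / len(data)                       # mean of the data
squared_diffs = [(x - mu) ** 2 for x in data]    # squared deviations from the mean
variance = sum(squared_diffs) / len(data)        # population variance (divide by n)
sigma = sqrt(variance)                           # population standard deviation

print(mu, variance, round(sigma, 2))
print(pstdev(data))          # library cross-check of sigma
```

<p>For a sample rather than a full population, <code>statistics.stdev</code> divides by \(n - 1\) instead of \(n\).</p>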
</li>
<li><p><strong>Interquartile range (IQR):</strong> The IQR is the difference between the third quartile (the value below which 75% of the data falls) and the first quartile (the value below which 25% of the data falls). It is a measure of the spread of the middle 50% of the data and is less influenced by extreme values than the range.</p>
<p>The formula for calculating the IQR is as follows:</p>
$$\text{IQR} =Q_3 -Q_1$$
<p>Where \(Q_3\) is the third quartile and \(Q_1\) is the first quartile. The quartiles are calculated by dividing the dataset into four equal parts. The first quartile (i.e. \(Q_1\)) represents the 25th percentile of the dataset, and the third quartile (i.e. \(Q_3\)) represents the 75th percentile.</p>
<figure>
<img src="assets/img/data-engineering/IQR.png" alt="" style="max-width: 90%; max-height: 90%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="https://commons.wikimedia.org/wiki/File:Boxplot_vs_PDF.svg">Jhguch at en.wikipedia</a>, <a href="https://creativecommons.org/licenses/by-sa/2.5">CC BY-SA 2.5</a>, via Wikimedia Commons</figcaption>
</figure>
<p><strong>Example:</strong> Consider the following dataset: 1, 3, 5, 6, 7, 8, 9, 10, 11, 15.</p>
<ul>
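<li>Splitting the sorted data into a lower half (1, 3, 5, 6, 7) and an upper half (8, 9, 10, 11, 15) and taking the median of each half gives a first quartile (\(Q_1\)) of 5 and a third quartile (\(Q_3\)) of 10. Other quartile conventions interpolate between values and give slightly different results. Therefore, the IQR is:</li>
</ul>
$$\text{IQR} = Q_3 - Q_1 = 10 - 5 = 5$$
<ul>
<li>This means that the middle 50% of the dataset (between the 25th and 75th percentiles) falls within a range of 5.</li>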
</ul>
<blockquote>
<p><strong>Quartiles:</strong> Quartiles are a way to divide a dataset into four equal parts or quarters. Quartiles are used to understand the distribution of a dataset and to calculate other measures of variability such as the interquartile range.
There are three quartiles that divide a dataset into four parts:</p>
<ul>
<li>The first quartile (\(Q_1\)) is the 25th percentile of the dataset. It divides the dataset into the bottom 25% and the top 75%.</li>
<li>The second quartile (\(Q_2\)) is the median of the dataset. It divides the dataset into two equal parts.</li>
<li>The third quartile (\(Q_3\)) is the 75th percentile of the dataset. It divides the dataset into the bottom 75% and the top 25%.</li>
</ul>
</blockquote>
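<p>Quartiles can be computed under several conventions, so software and hand calculations do not always agree. The sketch below, using the example data set, compares a simple median-of-halves rule (the helper <code>iqr_median_of_halves</code> is an illustrative name, not a library function) with the interpolating <code>statistics.quantiles</code> function from the Python standard library.</p>

```python
from statistics import median, quantiles

data = [1, 3, 5, 6, 7, 8, 9, 10, 11, 15]   # data set from the IQR example

def iqr_median_of_halves(xs):
    """IQR with Q1 and Q3 taken as the medians of the lower and upper halves."""
    xs = sorted(xs)
    half = len(xs) // 2
    q1 = median(xs[:half])     # lower half (middle value left out when n is odd)
    q3 = median(xs[-half:])    # upper half
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_of_halves(data)

# statistics.quantiles interpolates between neighbouring values instead
p25, p50, p75 = quantiles(data, n=4, method="inclusive")

print(q1, q3, iqr)
print(p25, p50, p75, p75 - p25)
```

<p>Whichever convention is chosen, it should be stated and used consistently when comparing datasets.</p>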
</li>
<li><p><strong>Mean absolute deviation (MAD):</strong> The mean absolute deviation (MAD) is a measure of variability that indicates how much the observations in a dataset deviate, on average, from the mean of the dataset. The MAD is the average of the absolute differences between each value and the mean. It is a robust measure of variability that is less sensitive to outliers than the variance and standard deviation.</p>
<p><strong>Formula:</strong> MAD is calculated by finding the absolute difference between each data point and the mean, then taking the average of those absolute differences. The formula for calculating MAD is as follows:</p>
$$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|$$
<p>Where \(n\) is the number of observations in the dataset, \(x_i\) is the value of the ith observation, \(\mu\) is the mean of the dataset, and \(\sum\) represents the sum of the absolute differences.</p>
<p><strong>Example:</strong> For example, consider the following dataset: 2, 3, 5, 6, 7, 8, 9, 10, 11, 15</p>
<p>To calculate the MAD, we first find the mean of the dataset:</p>
<p>\(\mu\) = (2 + 3 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 15) / 10 = 7.6</p>
<p>Next, we find the absolute difference between each data point and the mean: |2 - 7.6| = 5.6, |3 - 7.6| = 4.6, |5 - 7.6| = 2.6, |6 - 7.6| = 1.6, |7 - 7.6| = 0.6, |8 - 7.6| = 0.4, |9 - 7.6| = 1.4, |10 - 7.6| = 2.4, |11 - 7.6| = 3.4, |15 - 7.6| = 7.4.</p>
<p>Then we take the average of those absolute differences:</p>
<p>\(\text{MAD} = (1/10) \times (5.6 + 4.6 + 2.6 + 1.6 + 0.6 + 0.4 + 1.4 + 2.4 + 3.4 + 7.4) = 30 / 10 = 3\)</p>
<p>The MAD for this dataset is 3, which means that, on average, each observation deviates from the mean by 3 units.</p>
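<p>A minimal Python sketch of the MAD formula follows. To keep it short it reuses the small data set {2, 4, 6, 8, 10} from the standard deviation example; the function name is an illustrative choice.</p>

```python
def mean_absolute_deviation(xs):
    """Average absolute distance of the values from their mean."""
    mu = sum(xs) / len(xs)
    return sum(abs(x - mu) for x in xs) / len(xs)

data = [2, 4, 6, 8, 10]      # data set from the standard deviation example
mad = mean_absolute_deviation(data)
print(mad)
```

<p>For this data set the absolute deviations are 4, 2, 0, 2 and 4, so the MAD is 12 / 5 = 2.4, somewhat smaller than the standard deviation of about 2.83 because large deviations are not squared.</p>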
</li>
</ul>
<p>These measures of variability are useful in providing information about how much the values in a dataset vary from each other. The appropriate measure to use depends on the specific characteristics of the data and the research question being asked.</p>
</li>
<li><strong>Measures of distribution: </strong>
<p>Skewness and kurtosis are two statistical measures used to describe the shape of a probability distribution.</p>
<ul>
<li><p><strong>Skewness:</strong> Skewness measures the degree of asymmetry in a distribution. A distribution with a positive skewness has a longer tail on the positive side of the mean, while a negative skewness means the tail is longer on the negative side of the mean. A perfectly symmetrical distribution has a skewness of zero.</p>
<table>
<tr>
<td><img src="/assets/img/data-engineering/Pos-skew.jpeg" alt="Positive Skew"></td>
<td><img src="/assets/img/data-engineering/neg-skew.jpeg" alt="Negative Skew"></td>
<td><img src="/assets/img/data-engineering/zero-skew.png" alt="Zero Skew"></td>
</tr>
</table>
<p>(<a href="https://www.analyticsvidhya.com/blog/2021/08/a-guide-to-complete-statistics-for-data-science-beginners/">Image credit</a>)</p>
<p>Here are three common measures of skewness:</p>
<ol>
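<li><p><strong>Pearson's mode skewness (Pearson's first coefficient of skewness):</strong></p>
$$\text{Skewness} = \frac{\text{Mean}-\text{Mode}}{\text{Standard deviation}}.$$
<p>This coefficient compares the mean with the mode and does not use the third moment of the distribution. When the mode is poorly defined, Pearson's second coefficient, \(3(\text{Mean}-\text{Median})/\text{Standard deviation}\), is often used instead.</p>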
</li>
<li><p><strong>Sample skewness:</strong> This is a formula that uses the sample mean, standard deviation, and third central moment to estimate the skewness of the distribution. The formula for sample skewness is:</p>
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma_s}\right)^3$$
<p>(known as the adjusted Fisher-Pearson standardized moment coefficient)</p>
<p>where \(n\) is the sample size, \(\mu\) is the sample mean, \(x_i\) is the \(i\)-th observation in the sample, and \(\sigma_s\) is the sample standard deviation.</p>
<blockquote>
<p><strong>Sample standard deviation:</strong> The sample standard deviation measures the spread of the data around the mean. It tells you how much the individual data points deviate from the mean, on average. Note that the sample standard deviation is calculated using \(n - 1\) in the denominator instead of \(n\), which is known as Bessel's correction. This is because using \(n\) instead of \(n-1\) tends to underestimate the true variance of the population from which the sample was drawn.</p>
<p>Formula:</p>
$$\sigma_s = \sqrt{\frac{\sum_{i=1}^{n} (x_i-\mu)^2}{n-1}}$$
<p>Care should be taken when computing the standard deviation, because the population standard deviation differs from the sample standard deviation. If the problem describes a situation dealing with a sample or subset of a group, then the sample standard deviation, \(\sigma_s\), should be used.</p>
</blockquote>
<p><strong>How to Transform Skewed Data?</strong> The graph of skewed data may be transformed into a symmetrical, balanced bell curve shape by changing the data using various methods. The selection of which method to use depends on the characteristic of the data set and its behavior. Here are the most common ways of correcting the skewness of data distribution:</p>
<ul>
<li>Logarithmic transformation</li>
<li>Square root transformation</li>
<li>Inverse transformation</li>
<li>Box-Cox transformation</li>
</ul>
<p>It is important to note that transforming the data may not always be necessary or appropriate. The choice of transformation depends on the distribution of the data, the research question, and the statistical model being used. In addition, some transformations may change the interpretation of the data, so it is important to carefully consider the implications of any transformations before applying them.</p>
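<p>As a rough illustration of the first two transformations, the sketch below applies a square root and a logarithm to a small, made-up right-skewed data set and compares the resulting skewness. The data values and the helper <code>moment_skewness</code> are assumptions for this example; the Box-Cox transformation is not shown because it additionally requires estimating a power parameter.</p>

```python
import math

def moment_skewness(xs):
    """Third standardized moment (population form, divides by n)."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]   # made-up, strongly right-skewed values

raw_skew = moment_skewness(values)
sqrt_skew = moment_skewness([math.sqrt(x) for x in values])
log_skew = moment_skewness([math.log(x) for x in values])   # needs strictly positive values

print(round(raw_skew, 2), round(sqrt_skew, 2), round(log_skew, 2))
```

<p>On this data the logarithm reduces the skewness more than the square root, but neither removes it completely, which is why the transformed distribution should always be inspected rather than assumed to be symmetric.</p>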
</li>
<li><p><strong>Quartile skewness:</strong> This measure of skewness (also called Bowley's skewness) is based on how far the first and third quartiles lie from the median. Specifically, the quartile skewness is defined as:</p>
$$\text{Skewness} = \frac{Q_1 + Q_3 - 2 \times \text{median}}{Q_3 - Q_1}$$
<p>where \(Q_1\) and \(Q_3\) are the first and third quartiles of the distribution, and the median is the second quartile.</p>
</li>
</ol>
<p>Each of these measures of skewness has its own strengths and weaknesses, and the choice of measure may depend on the context and purpose of the analysis.</p>
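<p>The sample skewness and quartile skewness formulas can be sketched with the Python standard library as follows. The function names are illustrative, and the quartiles come from <code>statistics.quantiles</code> with its interpolating <code>"inclusive"</code> method, which is one of several possible quartile conventions.</p>

```python
from statistics import mean, stdev, quantiles

def sample_skewness(xs):
    """Adjusted Fisher-Pearson coefficient: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(xs)
    m = mean(xs)
    s = stdev(xs)                      # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def quartile_skewness(xs):
    """(Q1 + Q3 - 2 * median) / (Q3 - Q1), with interpolated quartiles."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return (q1 + q3 - 2 * q2) / (q3 - q1)

symmetric = [2, 4, 6, 8, 10]
data = [2, 3, 5, 6, 7, 8, 9, 10, 11, 15]    # data set from the MAD example

print(sample_skewness(symmetric), quartile_skewness(symmetric))
print(sample_skewness(data), quartile_skewness(data))
```

<p>The two coefficients need not agree: the moment-based value reacts to the single large observation 15, while the quartile-based value only looks at the middle half of the data.</p>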
</li>
<li><p><strong>Kurtosis:</strong> Kurtosis is a statistical measure that describes the shape of a distribution, traditionally in terms of its peakedness or flatness compared to the normal distribution and, more precisely, in terms of how heavy its tails are. A distribution with high kurtosis has heavy tails, so the data tend to contain more outliers or extreme values, while a distribution with low kurtosis has light tails and fewer outliers.</p>
<p><strong>How to calculate kurtosis:</strong> Mathematically speaking, kurtosis is the standardized fourth moment of a distribution. Moments are a set of measurements that tell you about the shape of a distribution.</p>
<p>Moments are standardized by dividing them by the standard deviation raised to the appropriate power.</p>
<ul>
<li><p><strong>Kurtosis of a population:</strong> The following formula describes the kurtosis of a population:</p>
$$\text{Kurtosis} = \tilde{\mu}_4 = \frac{\mu_4}{\sigma^4}.$$
<p>Where:</p>
<ul>
<li>\(\tilde{\mu}_4\) is the standardized fourth moment</li>
<li>\(\mu_4\) is the unstandardized central fourth moment</li>
<li>\(\sigma\) is the standard deviation</li>
</ul>
</li>
<li><p><strong>Kurtosis of a sample:</strong> The kurtosis of a sample is an estimate of the kurtosis of the population.</p>
<p>It might seem natural to calculate a sample’s kurtosis as the fourth moment of the sample divided by its standard deviation to the fourth power. However, this leads to a biased estimate.</p>
<p>The formula for the unbiased estimate of excess kurtosis includes a lengthy correction based on the sample size:</p>
$$\text{Kurtosis} = \frac{(n+1)\,n\,(n-1)}{(n-2)(n-3)}\,\frac{\sum (x_i -\mu)^4}{\left(\sum (x_i - \mu)^2\right)^2}- 3\,\frac{(n-1)^2}{(n-2)(n-3)}$$
<p>Where</p>
<ul>
<li>\(n\) is the sample size</li>
<li>\(x_i\) are observations of the variable x</li>
<li>\(\mu\) is the mean of the variable x.</li>
</ul>
</li>
</ul>
<p><strong>Types of kurtosis:</strong> Examples of kurtosis include:</p>
<ol>
<li><p><strong>Mesokurtic distribution:</strong> A mesokurtic distribution has an excess kurtosis of zero (a kurtosis of 3) and is similar in shape to the normal distribution. It has a moderate degree of peakedness and is neither too flat nor too peaked.</p>
</li>
<li><p><strong>Leptokurtic distribution:</strong> A leptokurtic distribution has an excess kurtosis greater than zero and is more peaked than the normal distribution. It has heavier tails and more outliers than a normal distribution.</p>
</li>
<li><p><strong>Platykurtic distribution:</strong> A platykurtic distribution has an excess kurtosis less than zero and is flatter than the normal distribution. It has lighter tails and fewer extreme values than a normal distribution.</p>
</li>
</ol>
<figure>
<img src="assets/img/data-engineering/kurtosis.png" alt="" style="max-width: 90%; max-height: 90%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong> Scribbr</figcaption>
</figure>
<p>It's important to note that kurtosis can only be interpreted in the context of the specific distribution being analyzed. A high or low kurtosis value does not necessarily indicate that the data are problematic or that any action needs to be taken. Rather, kurtosis can provide insight into the shape of the distribution and can help to identify potential issues with the data.</p>
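<p>The sketch below contrasts the plain standardized fourth moment with the bias-corrected sample excess kurtosis, written here in terms of the sample standard deviation. The function names are illustrative, and the corrected estimator is the one commonly implemented in spreadsheet and statistics packages, stated here as an assumption rather than as the only possible choice.</p>

```python
from statistics import mean, stdev

def population_kurtosis(xs):
    """Standardized fourth moment mu_4 / sigma^4 (equals 3 for a normal distribution)."""
    n = len(xs)
    m = mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2

def sample_excess_kurtosis(xs):
    """Bias-corrected excess kurtosis; needs at least four observations."""
    n = len(xs)
    m = mean(xs)
    s = stdev(xs)                                    # n - 1 denominator
    fourth = sum(((x - m) / s) ** 4 for x in xs)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return scale * fourth - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

data = [2, 4, 6, 8, 10]
print(population_kurtosis(data), sample_excess_kurtosis(data))
```

<p>For these evenly spaced values the standardized fourth moment is 1.7 and the corrected excess kurtosis is -1.2, both pointing to a platykurtic (light-tailed) shape.</p>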
</li>
</ul>
</li>
</ol>
Descriptive statistics are commonly used in fields such as business, economics, psychology, sociology, and healthcare, among others. They are an important tool for making informed decisions and drawing meaningful conclusions from data.
</section>
<section id="Probability">
<h2>Probability distributions and hypothesis testing</h2>
<ul>
<li>Probability is a subject that deals with uncertainty.</li>
<li>In everyday terminology, probability can be thought of as a numerical measure of the likelihood that a particular event will occur.</li>
<li>Probability values are assigned on a scale from <code>0</code> to <code>1</code>, with values near <code>0</code> indicating that an event is unlikely to occur and those near <code>1</code> indicating that an event is likely to take place.</li>
<li>Suppose that an event <code>E</code> can happen in <code>h</code> ways out of a total of <code>n</code> possible equally likely ways. Then the probability of occurrence of the event (called its success) is denoted by
$$p=Pr\{E\}=\frac{h}{n} ~~~~~~~~~~~~~~~ (\text{success probability}) $$
</li>
<li>The probability of non-occurrence of the event (called its failure) is denoted by
$$q = 1 - p \quad \Rightarrow \quad p + q = 1 $$
</li>
</ul>
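<p>As a small illustration of the definition above, the sketch below counts the favourable outcomes of a hypothetical event (rolling an even number with a fair six-sided die; the event is an assumption chosen for this example) and computes \(p = h/n\) and \(q = 1 - p\) exactly.</p>

```python
from fractions import Fraction

# Sample space of a fair six-sided die: n equally likely outcomes
outcomes = [1, 2, 3, 4, 5, 6]

# Hypothetical event E: the roll is even
favourable = [o for o in outcomes if o % 2 == 0]

h, n = len(favourable), len(outcomes)
p = Fraction(h, n)   # success probability p = h / n
q = 1 - p            # failure probability q = 1 - p

print(f"p = {p}, q = {q}, p + q = {p + q}")
```

<p>Using <code>Fraction</code> keeps the arithmetic exact, so the identity \(p + q = 1\) holds without rounding error.</p>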
<h3>Conditional probability; Independent and dependent events</h3>
<ul>
<li><p>If \(E_1\) and \(E_2\) are two events, the probability that \(E_2\) occurs given that \(E_1\) has occurred is denoted by
$$Pr\{E_2|E_1\}, ~~~~~\text{or}~~~~~ Pr\{E_2 ~\text{given} ~E_1\},$$
and is called the conditional probability of \(E_2\) given that \(E_1\) has occurred.</p>
</li>
<li>If the occurrence or non-occurrence of \(E_1\) does not affect the probability of occurrence of \(E_2\), then
$$Pr\{E_2 | E_1\}=Pr\{E_2\}$$
and we say that \(E_1\) and \(E_2\) are independent events; otherwise, they are dependent events.
</li>
<li>If we denote by (\(E_1~ E_2\)) the event that "both \(E_1\) and \(E_2\) occur," sometimes called a compound event, then
$$Pr\{E_1~ E_2\} = Pr\{E_1\} Pr\{E_2 | E_1\}$$
</li>
<li>Similarly, for three events \((E_1~ E_2~ E_3)\),
$$Pr\{E_1~ E_2 ~E_3\} = Pr\{ E_1 \} Pr\{ E_2 | E_1 \} Pr\{ E_3 | E_1 ~E_2\} $$
<p>If these events are independent, then</p>
$$Pr\{E_1 ~ E_2 \} = Pr\{ E_1 \} Pr\{ E_2 \}$$
<p>and similarly</p>
$$Pr\{E_1 ~ E_2~ E_3\}=Pr\{ E_1 \} Pr\{ E_2 \} Pr\{E_3\}$$
</li>
</ul>
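<p>The multiplication rule for compound events can be checked on a small example. The sketch below assumes a hypothetical deck of 52 cards containing 4 aces and asks for the probability that two cards drawn without replacement are both aces. It computes \(Pr\{E_1\} Pr\{E_2 | E_1\}\) and compares the result with a brute-force enumeration of all ordered draws.</p>

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical deck: 4 aces ("A") and 48 other cards ("x")
deck = ["A"] * 4 + ["x"] * 48

# Multiplication rule: Pr{E1 E2} = Pr{E1} * Pr{E2 | E1}
p_e1 = Fraction(4, 52)            # first card is an ace
p_e2_given_e1 = Fraction(3, 51)   # second card is an ace, given the first was
p_both_rule = p_e1 * p_e2_given_e1

# Brute-force check: enumerate all ordered draws of two distinct cards
pairs = list(permutations(range(len(deck)), 2))
both_aces = sum(1 for i, j in pairs if deck[i] == "A" and deck[j] == "A")
p_both_enum = Fraction(both_aces, len(pairs))

print(p_both_rule, p_both_enum)
```

<p>The two draws are dependent events here, because removing an ace changes the probability of drawing another one.</p>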
<h3>Mutually exclusive events</h3>
<ul>
<li>Two or more events are called mutually exclusive if the occurrence of any one of them excludes the occurrence of the others.
Thus if \(E_1\) and \(E_2\) are mutually exclusive events, then
$$Pr\{ E_1~ E_2 \} = 0.$$
</li>
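<li>If (\(E_1 + E_2\)) denotes the event that "either \(E_1\) or \(E_2\) or both occur", then
$$Pr\{ E_1 + E_2 \} = Pr\{ E_1 \} + Pr\{ E_2 \} - Pr\{ E_1 ~ E_2 \}.$$
For mutually exclusive events the last term is zero, so \(Pr\{ E_1 + E_2 \} = Pr\{ E_1 \} + Pr\{ E_2 \}\).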
</li>
</ul>
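<p>The addition rule can be verified by direct counting. The sketch below uses a fair six-sided die and two hypothetical events chosen for illustration (an even roll, and a roll of 4 or more), which are not mutually exclusive.</p>

```python
from fractions import Fraction

outcomes = set(range(1, 7))                   # fair six-sided die
e1 = {o for o in outcomes if o % 2 == 0}      # hypothetical event: even roll
e2 = {o for o in outcomes if o >= 4}          # hypothetical event: roll of 4 or more

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

lhs = pr(e1.union(e2))                             # Pr{E1 + E2}: either or both occur
rhs = pr(e1) + pr(e2) - pr(e1.intersection(e2))    # addition rule
print(lhs, rhs)
```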
<h2>Random Variables</h2>
Random variables play an important role in describing, measuring and analyzing uncertain events. A random variable is a function that maps
every outcome in the sample space to a real number. A random variable can be classified as:
<ul>
<li><strong>Discrete random variable: </strong>
<ul>
<li>Takes on a countable number of distinct values.</li>
<li>Examples include the number of heads in multiple coin tosses or the count of occurrences in a specific time period.</li>
<li>Discrete random variables are described using the "Probability Mass Function (PMF)" and the "Cumulative Distribution Function (CDF)".</li>
<li>The PMF is the probability that a random variable X takes a specific value k; for example, the probability that the number of
fraudulent transactions at an e-commerce platform is 10 is written as \(P(X=10)\). The CDF, on the other hand, is the probability that X takes a value less than or equal to k; for k = 10 this
is written as \(P(X\leq 10)\).
</li>
</ul>
</li>
<li><strong>Continuous random variable</strong>
<ul>
<li>A random variable X which can take any value from an uncountably infinite set of values, such as an interval of real numbers, is called a continuous random variable.</li>
<li>Examples include measurements like height, weight, or time intervals.</li>
<li>Continuous random variables are described using the "Probability Density Function (PDF)" and the "Cumulative Distribution Function (CDF)".
The PDF describes the probability that a continuous random variable \(X\) takes a value in a small neighbourhood of "\(x\)", per unit length of that neighbourhood, and is given by:
$$f(x) = \lim_{\delta x \rightarrow 0} \frac{P[x\leq X \leq x+\delta x]}{\delta x}.$$
The CDF of a continuous random variable is the probability that the random variable \(X\) takes a value less than or equal to a value "\(a\)". Mathematically:
$$F(a) = P(X \leq a) = \int_{-\infty }^{a} f(x)\, dx.$$
</li>
</ul>
</li>
</ul>
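<p>The relationship between PMF, PDF and CDF can be checked numerically. The following is a minimal sketch using <code>scipy.stats</code>; the binomial parameters and the evaluation point are hypothetical values chosen for illustration. In the discrete case the CDF is a running sum of the PMF, and in the continuous case it is the integral of the PDF up to the point of interest.</p>

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete case: X ~ Binomial(n=20, p=0.3) (hypothetical parameters)
X = stats.binom(20, 0.3)
pmf_at_10 = X.pmf(10)                        # P(X = 10)
cdf_at_10 = X.cdf(10)                        # P(X <= 10)
cdf_by_sum = X.pmf(np.arange(0, 11)).sum()   # CDF as a sum of PMF values 0..10

# Continuous case: Z ~ Normal(0, 1)
Z = stats.norm(0, 1)
a = 1.0
cdf_at_a = Z.cdf(a)                            # F(a) = P(Z <= a)
cdf_by_integral, _ = quad(Z.pdf, -np.inf, a)   # integral of f(x) from -inf to a

print(pmf_at_10, cdf_at_10, cdf_by_sum)
print(cdf_at_a, cdf_by_integral)
```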
<h2>Probability distributions</h2>
<h2>Types of probability distributions</h2>
<p>There are two types of probability distributions:</p>
<h3>1. Discrete</h3>
<p>A discrete probability distribution assigns probabilities to a finite or countably infinite number of possible outcomes. There are several types of discrete probability distributions, including:</p>
<ol>
<li><p><strong>Bernoulli distribution:</strong> The Bernoulli distribution is a simple probability distribution that describes the probability of success or failure in a single trial of a binary experiment.
The Bernoulli distribution has two possible outcomes:
<ul>
<li>Success (with probability \(p\))</li>
<li>Failure (with probability \(1-p\))</li>
</ul>
The formula for the Bernoulli distribution is:</p>
$$P(X=x) = p^x \times (1-p)^{(1-x)}$$
<p>where \(X\) is the random variable, \(x\) is the outcome (either <code>0</code> or <code>1</code>), and \(p\) is the probability of success.</p>
<figure>
<img src="assets/img/data-engineering/Berno-pmf.png" alt="" style="max-width: 90%; max-height: 90%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="index.html">Arun Kumar Pandey</a></figcaption>
</figure>
</li>
<li><p><strong>Binomial distribution:</strong> The binomial distribution describes the probability of getting a certain number of successes in a fixed number of independent trials of a binary experiment.
The binomial distribution has two parameters: \(n\), the number of trials, and \(p\), the probability of success in each trial. The formula for the binomial distribution is:</p>
$$P(X=x) = ^nC_x ~ p^x ~ (1-p)^{(n-x)}$$
<p>where \(X\) is the random variable representing the number of successes, \(x\) is the observed number of successes, \(n\) is the number of trials, \(p\) is the probability of success, and
$$^nC_x = \frac{n!}{x! (n-x)!}$$
is the binomial coefficient, which represents the number of ways to choose \(x\) objects from a set of n objects.</p>
</li>
<table>
<thead>
<tr>
<th>Statistics</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean</td>
<td>\(\mu=n p\)</td>
</tr>
<tr>
<td>Variance</td>
<td>\(\sigma^2 = n p (1-p)\)</td>
</tr>
<tr>
<td>Standard deviation</td>
<td>\(\sigma = \sqrt{n p (1-p)}\)</td>
</tr>
<tr>
<td>Moment coefficient of skewness</td>
<td>\(\alpha_3 = \frac{1-2p}{\sqrt{n p (1-p)}}\)</td>
</tr>
<tr>
<td>Moment coefficient of Kurtosis</td>
<td>\(\alpha_4 = 3+ \frac{1-6 p (1-p)}{n p (1-p)}\)</td>
</tr>
</tbody>
</table>
<figure>
<img src="assets/img/data-engineering/Binomial.png" alt="" style="max-width: 90%; max-height: 90%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong> A link to generate the plot: <a href="https://homepage.divms.uiowa.edu/~mbognar/applets/bin.html" target="_blank">Click here</a>(<a href="index.html">Arun Kumar Pandey</a>)</figcaption>
</figure>
<li><p><strong>Poisson distribution:</strong> The Poisson distribution is used to describe the probability of a certain number of events occurring in a fixed time interval when the events occur independently and at a constant rate. The Poisson distribution has one parameter: \(\lambda\), which represents the expected number of events in the time interval. The formula for the Poisson distribution is:</p>
<p>$$P(X=x) = e^{-\lambda} \frac{\lambda^x}{x!}$$</p>
<p>where \(X\) is the random variable representing the number of events, \(x\) is the number of events, \(e\) is the mathematical constant, \(\lambda\) is the expected number of events, and \(x!\) is the factorial function.</p>
</li>
<table>
<thead>
<tr>
<th>Statistics</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean</td>
<td>\(\mu=\lambda \)</td>
</tr>
<tr>
<td>Variance</td>
<td>\(\sigma^2 = \lambda \)</td>
</tr>
<tr>
<td>Standard deviation</td>
<td>\(\sigma = \sqrt{\lambda}\)</td>
</tr>
<tr>
<td>Moment coefficient of skewness</td>
<td>\(\alpha_3 = \frac{1}{\sqrt{\lambda}}\)</td>
</tr>
<tr>
<td>Moment coefficient of Kurtosis</td>
<td>\(\alpha_4 = 3+ \frac{1}{\lambda}\)</td>
</tr>
</tbody>
</table>
<figure>
<img src="assets/img/data-engineering/Possion.png" alt="" style="max-width: 90%; max-height: 90%;">
</figure>
<p><a href="https://homepage.divms.uiowa.edu/~mbognar/applets/bin.html">A link to generate the plot</a></p>
<p>The PMF is a function that gives the probability of each possible value of the random variable. The PMF for the Bernoulli distribution has two values
(\(p\) and \(1-p\)), the PMF for the binomial distribution has \(n+1\) values (corresponding to the number of successes), and the PMF for the Poisson
distribution has an infinite number of values (corresponding to the number of events).</p>
<blockquote>
<p><strong>probability mass functions (PMFs):</strong> A probability mass function (PMF) is a function that gives the probability of each possible value of a discrete random variable. It is a way of summarizing the probability distribution of a discrete random variable.
The PMF is defined for all possible values of the random variable and satisfies the following properties:</p>
<ul>
<li>The value of the PMF at any possible value of the random variable is a non-negative number.</li>
<li>The sum of the PMF over all possible values of the random variable is equal to one.</li>
</ul>
<p>The PMF is often represented graphically using a histogram or bar graph. The height of each bar represents the probability of the corresponding value of the random variable.</p>
<p><strong>Example:</strong> consider a fair six-sided die. The random variable X can take on values of 1, 2, 3, 4, 5, or 6, each with probability 1/6. The PMF for this random variable is:</p>
<p>P(X = 1) = 1/6</p>
<p>P(X = 2) = 1/6</p>
<p>P(X = 3) = 1/6</p>
<p>P(X = 4) = 1/6</p>
<p>P(X = 5) = 1/6</p>
<p>P(X = 6) = 1/6</p>
<p>Plotted as a bar graph, this PMF consists of six bars of equal height 1/6.</p>
</blockquote>
</ol>
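<p>The closed-form statistics listed for the binomial and Poisson distributions can be compared against <code>scipy.stats</code>. The sketch below uses hypothetical parameter values. Note that SciPy reports <em>excess</em> kurtosis, so 3 is added to obtain the moment coefficient of kurtosis \(\alpha_4\) used in the tables.</p>

```python
import numpy as np
from scipy.stats import bernoulli, binom, poisson

n, p, lam = 10, 0.3, 4.0   # hypothetical parameters

# Bernoulli PMF: P(X=0) = 1 - p and P(X=1) = p
bern_pmf = bernoulli.pmf([0, 1], p)

# Binomial PMF: n + 1 values (0..n successes) that sum to one
binom_pmf = binom.pmf(np.arange(0, n + 1), n, p)
binom_total = binom_pmf.sum()

# Mean, variance, skewness and excess kurtosis as computed by SciPy
b_mean, b_var, b_skew, b_exkurt = (float(v) for v in binom.stats(n, p, moments="mvsk"))
p_mean, p_var, p_skew, p_exkurt = (float(v) for v in poisson.stats(lam, moments="mvsk"))

# Closed-form values from the tables
npq = n * p * (1 - p)
print("binomial mean:", b_mean, n * p)
print("binomial variance:", b_var, npq)
print("binomial alpha_3:", b_skew, (1 - 2 * p) / np.sqrt(npq))
print("binomial alpha_4:", b_exkurt + 3, 3 + (1 - 6 * p * (1 - p)) / npq)
print("Poisson mean and variance:", p_mean, p_var, lam)
print("Poisson alpha_3:", p_skew, 1 / np.sqrt(lam))
print("Poisson alpha_4:", p_exkurt + 3, 3 + 1 / lam)
```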
<h3>2. Continuous</h3>
<p>Continuous probability distributions are used to model continuous random variables, which can take on any value in a given range. Unlike discrete random variables, which take on only a finite or countably infinite set of possible values, continuous random variables can take on an uncountably infinite set of possible values.</p>
<p>There are several common continuous probability distributions, including:</p>
<ol>
<li><p><strong>Normal distribution:</strong> also known as the Gaussian distribution, this is a bell-shaped distribution that is symmetric around the mean. It is commonly used to model measurements that are expected to be normally distributed, such as heights or weights of individuals in a population. The probability density function (PDF) of the normal distribution is:</p>
$$f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
<p>where \(x\) is the value of the random variable, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.</p>
<figure>
<img src="assets/img/data-engineering/normal-df.png" alt="" style="max-width: 80%; max-height: 80%;">
</figure>
<p><strong>Empirical rule:</strong> The Empirical Rule, also known as the 68-95-99.7 Rule, is a rule of thumb for the normal distribution. It states that:</p>
<ul>
<li>Approximately 68% of the data falls within one standard deviation of the mean.</li>
<li>Approximately 95% of the data falls within two standard deviations of the mean.</li>
<li>Approximately 99.7% of the data falls within three standard deviations of the mean.</li>
</ul>
<p>This means that if a distribution is approximately normal, we can use these percentages to estimate the proportion of data that falls within a certain range of values.</p>
<figure>
<img src="assets/img/data-engineering/normal-df2.png" alt="" style="max-width: 80%; max-height: 80%;">
</figure>
<p><strong>Example:</strong> if we know that a distribution is approximately normal with a mean of 50 and a standard deviation of 10, we can use the Empirical Rule to estimate the proportion of data that falls within certain ranges:</p>
<ul>
<li>Approximately 68% of the data falls between 40 and 60 (one standard deviation from the mean).</li>
<li>Approximately 95% of the data falls between 30 and 70 (two standard deviations from the mean).</li>
<li>Approximately 99.7% of the data falls between 20 and 80 (three standard deviations from the mean).</li>
</ul>
<p>It's important to note that the percentages in the Empirical Rule are rounded values that hold for every normal distribution. For data that are only approximately normal they are themselves approximate, and the rule does not apply to non-normal distributions.</p>
<table>
<thead>
<tr>
<th>Statistics</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean</td>
<td>\(\mu\)</td>
</tr>
<tr>
<td>Variance</td>
<td>\(\sigma^2 \)</td>
</tr>
<tr>
<td>Standard deviation</td>
<td>\(\sigma \)</td>
</tr>
<tr>
<td>Moment coefficient of skewness</td>
<td>\(\alpha_3 = 0\)</td>
</tr>
<tr>
<td>Moment coefficient of Kurtosis</td>
<td>\(\alpha_4 = 3\)</td>
</tr>
<tr>
<td>Mean deviation</td>
<td>\(\sigma\sqrt{\frac{2}{\pi}} = 0.7979 ~ \sigma \)</td>
</tr>
</tbody>
</table>
</li>
<li><strong>Uniform distribution:</strong> this is a distribution in which all values in a given range are equally likely to occur. The PDF of the uniform distribution is:
$$f(x)= \begin{cases}
\frac{1}{b-a}, & a \leq x \leq b \\
0, & \text{otherwise}
\end{cases}$$
<p>where \(x\) is the random variable, \(a\) is the lower bound of the range, and \(b\) is the upper bound of the range.</p>
<figure>
<img src="assets/img/data-engineering/uni-dist1.png" alt="" style="max-width: 60%; max-height: 60%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="index.html">Arun Kumar Pandey</a></figcaption>
</figure>
</li>
<li><strong>Exponential distribution:</strong> this is a distribution that is commonly used to model the time between events that occur at a constant rate. The PDF of the exponential distribution is:
<p>$$ f(x; \lambda) =
\begin{cases}
\lambda e^{-\lambda x}, & x \geq 0 \\
0, & x < 0
\end{cases} $$</p>
<p>where \(x\) is the random variable, and \(\lambda \) is the rate parameter.</p>
<table>
<thead>
<tr>
<th>Statistics</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean</td>
<td>\(E[X]=\frac{1}{\lambda}\)</td>
</tr>
<tr>
<td>Median</td>
<td>\(m[X] =\frac{\ln 2}{\lambda} < E[X]\)</td>
</tr>
<tr>
<td>Variance</td>
<td>\(\mathrm{Var}[X]=\frac{1}{\lambda^2}\)</td>
</tr>
<tr>
<td>Moments</td>
<td>\(E[X^n]=\frac{n!}{\lambda^n}\)</td>
</tr>
</tbody>
</table>
<figure>
<img src="assets/img/data-engineering/expo-distri.png" alt="" style="max-width: 60%; max-height: 60%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="index.html">Arun Kumar Pandey</a></figcaption>
</figure>
</li>
<li><strong>Gamma distribution:</strong> this is a distribution that is used to model the sum of several exponentially distributed random variables. The PDF of the gamma distribution is:
$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k \Gamma(k)}$$
<p>where \(x\) is the random variable, \(k\) is the shape parameter, \(\theta\) is the scale parameter, and \(\Gamma(k)\) is the gamma function.</p>
<figure>
<img src="assets/img/data-engineering/eexpon.png" alt="" style="max-width: 60%; max-height: 60%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="https://commons.wikimedia.org/wiki/File:Gamma_distribution_pdf.svg">Gamma_distribution_pdf.png: MarkSweep and Cburnettderivative work: Autopilot</a>, <a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a>, via Wikimedia Commons</figcaption>
</figure>
<p>The probability distribution is an essential concept in probability theory and is used to calculate the expected values, variances, and other statistical properties of random variables. Understanding probability distributions is important in fields such as statistics, physics, engineering, finance, and many others where randomness plays a role.</p>
</li>
</ol>
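<p>Some of the properties quoted above for continuous distributions can be reproduced with <code>scipy.stats</code>. The sketch below recomputes the Empirical Rule percentages for the example with mean 50 and standard deviation 10, and checks the mean, median and variance of an exponential distribution. The rate \(\lambda = 2\) is a hypothetical value; SciPy parameterizes the exponential distribution by <code>scale = 1 / lambda</code>.</p>

```python
import numpy as np
from scipy.stats import norm, expon

# Empirical rule: probability within k standard deviations of the mean
mu, sigma = 50, 10
within = {
    k: norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    for k in (1, 2, 3)
}
for k, prob in within.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")

# Exponential distribution with rate lambda (SciPy uses scale = 1 / lambda)
lam = 2.0   # hypothetical rate parameter
E = expon(scale=1 / lam)
print("mean:", E.mean(), 1 / lam)
print("median:", E.median(), np.log(2) / lam)
print("variance:", E.var(), 1 / lam**2)
```

<p>The exact probabilities are close to, but not exactly, 68%, 95% and 99.7%, which is why the rule is stated with the word "approximately".</p>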
<h2>Central Limit theorem (CLT)</h2>
<p>The central limit theorem (CLT) is a fundamental concept in statistics and probability theory. It states that under certain conditions, the sampling distribution of the mean of a random sample drawn from any population will approximate a normal distribution, regardless of the shape of the original population distribution.</p>
<p>Specifically, the CLT states that as the sample size n increases, the sampling distribution of the mean approaches a normal distribution with mean equal to the population mean and standard deviation equal to the population standard deviation divided by the square root of the sample size. This means that even if the population distribution is not normal, the distribution of sample means will tend to be normal if the sample size is sufficiently large.</p>
<p>The conditions necessary for the CLT to hold are:</p>
<ul>
<li><strong>Random sampling:</strong> The samples must be drawn at random from the population.</li>
<li><strong>Independence:</strong> Each sample observation must be independent of all the others.</li>
<li><strong>Finite variance:</strong> The population distribution must have a finite variance.</li>
</ul>
<p>The CLT has many important practical applications, as it allows us to make inferences about population means and proportions based on samples drawn from the population. It is also used in hypothesis testing, confidence interval estimation, and in the construction of many statistical models.</p>
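<p>The statement of the CLT can be illustrated with a short simulation. The sketch below assumes a hypothetical skewed population (exponential with mean 1 and standard deviation 1), draws many independent samples of size \(n\), and compares the spread of the sample means with the predicted value \(\sigma/\sqrt{n}\). The sample size, number of repetitions and seed are arbitrary choices.</p>

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical skewed population: exponential with mean 1 and standard deviation 1
pop_mean, pop_sd = 1.0, 1.0
n = 50              # size of each sample
n_samples = 20000   # number of repeated samples

# Draw n_samples independent samples of size n and compute each sample mean
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())
print("sd of sample means:  ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", pop_sd / np.sqrt(n))
```

<p>Even though the population is far from normal, a histogram of <code>sample_means</code> would look approximately bell-shaped and centred on the population mean.</p>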
<h3>Application of CLT</h3>
<p>The central limit theorem (CLT) has many important applications in statistics and data analysis. Here are a few examples:</p>
<ol>
<li><strong>Estimating population parameters:</strong> The CLT can be used to estimate population parameters, such as the population mean or proportion, based on a sample drawn from the population. For example, if we want to estimate the average height of all adults in a country, we can take a random sample of heights and use the CLT to construct a confidence interval for the population mean.</li>
<li><strong>Hypothesis testing:</strong> The CLT is often used in hypothesis testing to determine whether a sample is likely to have come from a particular population. For example, if we want to test whether the mean salary of a group of employees is different from the mean salary of all employees in the company, we can use the CLT to calculate the probability of observing a sample mean as extreme as the one we observed if the null hypothesis (i.e., the mean salaries are equal) is true.</li>
<li><strong>Machine learning:</strong> The CLT is often invoked to justify normality assumptions in statistical models. In linear regression, for example, it motivates the assumption that the errors or residuals are approximately normally distributed, and in models such as logistic regression it underlies the large-sample normal approximations used for inference on the coefficients.</li>
</ol>
<p><strong>Formula:</strong> The CLT can be stated compactly as follows.
If \(X\) is a random variable with mean \(\mu\) and standard deviation \(\sigma\), then the distribution of the sample mean \(\bar{X}\)
of a random sample of size \(n\) from \(X\) approaches a normal distribution with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) as \(n\) gets larger. For large \(n\), this can be expressed mathematically as:</p>
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\sim N(0,1) \quad \text{(approximately)}$$
<p>where \(N(0,1)\) represents a <em><strong>standard normal distribution</strong></em> with mean <code>0</code> and standard deviation <code>1</code>.</p>
<p>In practice, the CLT is often used to calculate confidence intervals for population means or proportions. The formula for a confidence interval for
the population mean based on a sample mean \(\bar{X}\) and a sample standard deviation \(s\) is:</p>
$$\bar{X} \pm z^* \left(\frac{s}{\sqrt{n}}\right)$$
<p>where \(z^*\) is the appropriate critical value from the standard normal distribution based on the desired level of confidence.</p>
<p><strong>Note:</strong> To calculate the value of \(z^*\) for a given level of confidence, we need to use a standard normal distribution table (Z-table or normal probability table) or statistical software (from open-source tools such as R, Python, and GNU Octave to commercial packages such as SPSS, SAS, and Stata). For example, if we want to find the critical value for a 95% confidence level, we would look up the corresponding value in a standard normal distribution table or use the formula:</p>
$$z^* = \text{invNorm}(1 - \frac{\alpha}{2})$$
<p>where invNorm is the inverse cumulative distribution function of the standard normal distribution, and \(\alpha\) is the significance level, which is equal to 1 - confidence level.</p>
<blockquote>
<p><a href="https://www.mathsisfun.com/data/standard-normal-distribution-table.html">standard normal distribution table</a></p>
</blockquote>
<p>For a 95% confidence level, \(\alpha\) is 0.05, so we would have:</p>
$$z^* = \text{invNorm}(1 - 0.05/2) = \text{invNorm}(0.975) = 1.96$$
<p>Therefore, the critical value \(z^*\) for a 95% confidence level is 1.96.</p>
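<p>In Python, the role of invNorm is played by <code>scipy.stats.norm.ppf</code>, the inverse CDF (percent point function) of the normal distribution. The sketch below computes \(z^*\) for a 95% confidence level and uses it to build a confidence interval. The sample mean, sample standard deviation and sample size are made-up numbers for illustration only.</p>

```python
import numpy as np
from scipy.stats import norm

confidence = 0.95
alpha = 1 - confidence
z_star = norm.ppf(1 - alpha / 2)   # inverse CDF of the standard normal

# Hypothetical sample summary (illustrative numbers only)
sample_mean, s, n = 172.0, 8.0, 100

margin = z_star * s / np.sqrt(n)   # z* times the standard error
ci = (sample_mean - margin, sample_mean + margin)

print("z* =", round(z_star, 2))
print("confidence interval:", ci)
```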
<h3>Normal distribution vs the standard normal distribution</h3>
<ul>
<li>The standard normal distribution, also called the z-distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.</li>
<li>All normal distributions, like the standard normal distribution, are unimodal and symmetrically distributed with a bell-shaped curve.</li>
<li>Every normal distribution is a version of the standard normal distribution that’s been stretched or squeezed and moved horizontally right or left.</li>
<li>The mean determines where the curve is centered. Increasing the mean moves the curve right, while decreasing it moves the curve left.</li>
</ul>
<table>
<thead>
<tr>
<th>Curve </th>
<th>Position or shape (relative to standard normal distribution)</th>
</tr>
</thead>
<tbody>
<tr>
<td>A (M = 0, SD = 1)</td>
<td>Standard normal distribution</td>
</tr>
<tr>
<td>B (M = 0, SD = 0.5)</td>
<td>Squeezed, because SD < 1</td>
</tr>
<tr>
<td>C (M = 0, SD = 2)</td>
<td>Stretched, because SD > 1</td>
</tr>
<tr>
<td>D (M = 1, SD = 1)</td>
<td>Shifted right, because M > 0</td>
</tr>
<tr>
<td>E (M = –1, SD = 1)</td>
<td>Shifted left, because M < 0</td>
</tr>
</tbody>
</table>
<figure>
<img src="assets/img/data-engineering/snd-nd.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="https://www.scribbr.com/statistics/standard-normal-distribution/">Scribbr</a></figcaption>
</figure>
<h3>Standardizing a normal distribution</h3>
<ul>
<li>When you standardize a normal distribution, the mean becomes 0 and the standard deviation becomes 1. This allows you to easily calculate the probability of certain values occurring in your distribution, or to compare data sets with different means and standard deviations.</li>
<li>While data points are referred to as x in a normal distribution, they are called z or z scores in the z distribution. A z score is a standard score that tells you how many standard deviations away from the mean an individual value (x) lies:
<ul>
<li>A positive z score means that your x value is greater than the mean.</li>
<li>A negative z score means that your x value is less than the mean.</li>
<li>A z score of zero means that your x value is equal to the mean.</li>
</ul>
</li>
</ul>
<figure>
<img src="assets/img/data-engineering/snd.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong>☞ <a href="https://www.scribbr.com/statistics/normal-distribution/">Scribbr</a></figcaption>
</figure>
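<p>Standardizing amounts to computing \(z = (x - \mu)/\sigma\) for each value. The sketch below applies this to a few hypothetical observations from a normal distribution with an assumed mean of 50 and standard deviation of 10, and confirms that probabilities computed on the original scale match those computed from the z scores on the standard normal scale.</p>

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 50.0, 10.0             # hypothetical normal distribution
x = np.array([35.0, 50.0, 65.0])   # hypothetical observations

# z score: distance from the mean, measured in standard deviations
z = (x - mu) / sigma
print(z)

# P(X <= x) on the original scale equals P(Z <= z) on the standard normal scale
p_original = norm.cdf(x, mu, sigma)
p_standard = norm.cdf(z)
print(p_original, p_standard)
```

<p>The three observations illustrate a negative z score (below the mean), a z score of zero (equal to the mean) and a positive z score (above the mean).</p>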
</section>
<section id="examples">
<h3>Python libraries used for the examples</h3>
<p><strong>Importing the libraries:</strong></p>
<pre><code>
# Standard Dependencies
import os
import numpy as np
import pandas as pd
from math import sqrt
# Visualization
from pylab import *
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics
from statistics import median
from scipy import signal
# from scipy.misc import factorial
import scipy.stats as stats
from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr
from scipy.fftpack import fft, fftshift
# Scikit-learn for Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Seed for reproducibility
seed = 12345
np.random.seed(seed)
</code></pre>
<p><strong>Loading the data files:</strong></p>
<pre><code>
# Read in csv of Toy Dataset
# We will use this dataset throughout the tutorial
toy_df = pd.read_csv('ml-data/toy_dataset.csv')
</code></pre>
<h4>PMF (Probability Mass Function)</h4>
Here we visualize the PMF of a binomial distribution. You can see that the possible values are all integers. For example, no values are between 50 and 51.
The PMF of a binomial distribution in function form:
$$P(X=x)= \binom{N}{x} p^x (1-p)^{N-x}$$
<pre><code>
# PMF Visualization
n = 100
p = 0.5
fig, ax = plt.subplots(1, 1, figsize=(17,5))
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
rv = binom(n, p)
#ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen PMF')
ax.legend(loc='best', frameon=False, fontsize='xx-large')
plt.title('PMF of a binomial distribution (n=100, p=0.5)', fontsize='xx-large')
plt.show()
</code></pre>
<figure>
<img src="assets/img/data-engineering/PMF-bino.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong>Arun Kumar Pandey</figcaption>
</figure>
<h4>PDF (Probability Density Functions)</h4>
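The PDF is the continuous analogue of the PMF. A continuous distribution has an uncountably infinite number of possible values, so the PDF gives a probability density rather than a probability; probabilities are obtained as areas under the curve. Here we visualize a simple normal distribution with a mean of 0 and standard deviation of 1.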
<pre><code>
# Plot normal distribution
mu = 0
variance = 1
sigma = sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.figure(figsize=(16,5))
plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution')
plt.title('Normal Distribution with mean = 0 and std = 1')
plt.legend(fontsize='xx-large')
plt.show()
</code></pre>
<figure>
<img src="assets/img/data-engineering/normal-distri.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="index.html"></a>☞ Arun Kumar Pandey</figcaption>
</figure>
<h4>CDF (Cumulative Distribution Function)</h4>
The CDF gives the probability that a random variable X will take a value less than or equal to x, that is, \(P(X \leq x)\). CDFs can be discrete or continuous. In this section we visualize the continuous case. You can see in the plot that the CDF accumulates all probabilities and is therefore bounded: \(0 \leq F(x) \leq 1\).
<pre><code>
# Data: evaluate an unnormalized bell curve on a grid with step dx
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)
# Normalize so that the area under the curve (a Riemann sum) equals 1
Y = Y / (dx * Y).sum()
# Plot the PDF and its running integral, the CDF
plt.figure(figsize=(15,5))
plt.title('Continuous Normal Distributions', fontsize='xx-large')
plt.plot(X, Y, label='Probability Density Function (PDF)')
plt.plot(X, np.cumsum(Y * dx), 'r', label='Cumulative Distribution Function (CDF)')
plt.legend(fontsize='xx-large')
plt.show()
</code></pre>
<figure>
<img src="assets/img/data-engineering/cdf-pdf.png" alt="" style="max-width: 70%; max-height: 70%;">
<figcaption style="text-align: center;"><strong>Image credit: </strong><a href="index.html"></a>☞ Arun Kumar Pandey</figcaption>
</figure>
<h3>Probability Distributions</h3>
<p>A Probability distribution tells us something about the likelihood of each value of the random variable.
A random variable X is a function that maps events to real numbers. The visualizations in this section are of discrete distributions. Many of these distributions can however also be continuous.</p>
<ul>
<li><strong>Uniform Distribution: </strong>A uniform distribution is straightforward: every value in the range has an equal chance of occurring. Therefore, the distribution consists of random values with no patterns in them. In this example we generate random floating-point numbers between 0 and 1.
<p>The PDF of a Uniform Distribution:</p>
$$
f(x)= \begin{cases}
\frac{1}{b-a}, & a \leq x \leq b \\
0, & \text{otherwise}
\end{cases}
$$
<p>CDF:</p>
$$
CDF =
\begin{cases}
0, & x < a \\
\frac{x-a}{b-a} & a \leq x \leq b \\