\chapter{Basics}
\label{chapter:basics}
\section{How to think}
Data analytics involves two kinds of thinking: thinking like an analyst and thinking like a programmer. These involve very different thought processes, and being aware of the difference can make it easier to solve programming problems while addressing larger analytical questions.
When thinking like an analyst, you will commonly ask questions about the context. For example:
\begin{itemize}
\item What is the main concern of the organisation that this analysis will address?
\item How should insights of this analysis be presented to the decision-makers?
\item What context is important for this analysis?
\end{itemize}
These kinds of questions are high level. They inform decisions about what kind of data, analysis, and visualisation is required, but may not inform decisions about how to implement the analysis.
In contrast, when thinking like a programmer, questions about the code itself are more important. For example:
\begin{itemize}
\item Which is the best structure for this data?
\item What function is best to manipulate this data to achieve the required result?
\item What syntax is required for this section of code?
\item How can I break this task into smaller steps?
\end{itemize}
It is this last question in particular that newcomers find challenging. There is often an expectation that a result can be obtained with a single simple action. But programming always involves taking tasks and breaking them up into small steps, ensuring that each step achieves the expected result.
How far you have to break down a task usually depends on how much others have already simplified it by writing code to do the job. Although almost all data analytics with Python makes use of code written by others (to simplify the analysis task), there will always be a need to think programmatically about the code (or at least until AI is writing high-quality code for us).
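As a small illustration of breaking a task into steps, the following sketch calculates an average in three steps and displays each intermediate result so that it can be checked. The data and the variable names (\code{marks}, \code{total}, \code{count}, \code{average}) are made up for this example.
\begin{pycode}
marks = [72, 85, 90, 65] # hypothetical data for illustration
total = sum(marks) # step 1: add up the marks
count = len(marks) # step 2: count how many marks there are
average = total / count # step 3: divide the total by the count
print(total) # displays 312
print(count) # displays 4
print(average) # displays 78.0
\end{pycode}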
\section{How to write}
If the writing in this section didn't require any order to the words, it is likely that you wouldn't understand it. We understand a language through a knowledge of the words (the vocabulary), as well as the rules for how those words are put together (the syntax). The same applies to programming languages. There are certain concepts that are common across many languages (e.g. variables, loops, conditionals, data structures), but the vocabulary and the syntax tend to be language-specific. It is critical to gain an understanding of both the terms important to the language \textit{AND} the syntax required to use those terms.
In general, programming languages have very specific requirements, and Python is no exception. In the following example, only the first line of code will display \textbf{Hello!} in the output. The other two lines will result in errors, because only the first line uses the \textit{correct} syntax. Note that the words following the hash symbol (\#) are comments and don't affect the running of the code in any way.
\begin{pycode}
print('Hello!') # will work
print['Hello!'] # won't work - will produce an error
print(Hello!) # won't work - will produce an error
\end{pycode}
However, you might find that some things in the language don't matter. For example, with Python, it really doesn't matter if you use single quotes or double quotes for a string.
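The point about quotes can be checked with a short sketch. The variable names \code{single} and \code{double} are chosen only for this example.
\begin{pycode}
single = 'Hello!' # a string written with single quotes
double = "Hello!" # the same string written with double quotes
print(single == double) # displays True
print("It's fine") # double quotes allow an apostrophe inside the string
\end{pycode}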
\subsection{Nesting and indenting}
One very important syntactic feature of Python is the use of indents in the code to tell the computer which lines of code are \textit{nested} inside an enclosing element.
\begin{pycode}
def my_function():
    print("This line of code is inside the function")
print("This line of code is outside the function")
dogs = ["Golden retriever", "Australian Shepherd", "Blue Heeler"] # A list of dogs
for dog in dogs: # This is a loop that loops over the list of dogs
    print("dog:", dog) # This is inside the loop
print("dogs") # This line is outside the loop
\end{pycode}
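Nesting can go more than one level deep, with each level indented further than the one that contains it. The following sketch, which uses a made-up list of numbers, places a conditional inside a loop.
\begin{pycode}
numbers = [1, 2, 3, 4] # hypothetical data for illustration
large_numbers = []
for number in numbers:
    if number > 2: # this line is inside the loop
        large_numbers.append(number) # inside the conditional (and the loop)
print(large_numbers) # outside the loop; displays [3, 4]
\end{pycode}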
\section{Common concepts}
Writing code involves some common concepts no matter which language you are using. Understanding these concepts will help you grasp the language faster, and it will also help you know what to search for if you are looking for help online.
\subsection{Variables and assignment}
\textbf{Variables} allow us to assign data to a name. For example, I could assign the string of characters ``Charles'' to the variable \code{first_name}, and the string ``Peirce'' to the variable \code{last_name}. The advantage of doing this is that I can then operate on the variables (by name) without knowing what data is actually stored in them. For example:
\begin{pycode}
first_name = "Charles"
last_name = "Peirce"
full_name = first_name + " " + last_name # use the variables and assign to new variable
print(full_name) # displays Charles Peirce
\end{pycode}
When the string is \textit{assigned} to the variable name, the name is \textit{defined} as a string type (\code{str}), and the data (e.g. ``Charles'', ``Peirce'') is assigned to that name. When a variable is \textit{referred} to by its name (as in the call to \code{print()}), the computer substitutes the data assigned to the variable name in place of the name itself. It is common to speak of \textit{passing} a variable to a function. In the example above, \code{full_name} was passed to the function \code{print()}.
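The type of the data currently assigned to a name can be inspected with the built-in \code{type()} function. In the sketch below, the variable \code{count} and its values are invented for illustration.
\begin{pycode}
first_name = "Charles"
print(type(first_name)) # displays <class 'str'>
count = 3 # hypothetical value for illustration
print(type(count)) # displays <class 'int'>
count = "three" # a name can be reassigned to different data
print(type(count)) # displays <class 'str'>
\end{pycode}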
\subsection{Functions}
A \textbf{function} is simply a group of lines of code that accepts data as input and returns data as output. Functions are useful for performing the same task repeatedly on different data. In Python, the \code{def} keyword is used to define a function, and the code that does the work is indented beneath it. For example, the following code defines a function \code{make_full_name()}:
\begin{pycode}
def make_full_name(fname, lname):
    fullname = fname + " " + lname
    return fullname
\end{pycode}
This function is \textit{called} by writing its name and \textit{passing} the required arguments between the parentheses. For example, \code{make_full_name("Charles","Peirce")} would result in \code{"Charles Peirce"}.
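A complete sketch of defining the function and then calling it might look like the following. The second call uses a made-up pair of names to show the same function working on different data.
\begin{pycode}
def make_full_name(fname, lname):
    fullname = fname + " " + lname
    return fullname
# Call the function and assign the returned value to a variable
full_name = make_full_name("Charles", "Peirce")
print(full_name) # displays Charles Peirce
# Call it again with different (hypothetical) data
print(make_full_name("Ada", "Lovelace")) # displays Ada Lovelace
\end{pycode}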
\section{A little more advanced\ldots}
\subsection{Classes with properties and methods}
While functions can be created and called standalone, they can also be included in \textbf{classes}. Without going into technical details, a class can have properties and methods, where the methods are functions that usually operate on the properties. When a class is \textit{instantiated}, we call the instance an \textbf{object}. For example, the \textit{object} \code{soccer_ball} could be an \textit{instance} of the \textit{class} \code{ball}. Using our names example above, if we had a \code{Person} class, it may have a \code{first_name} property, a \code{last_name} property, and a \code{full_name()} method. We could define and use such a class as follows:
\begin{pycode}
# Define a class Person with properties first_name and last_name
# and a full_name method
class Person:
    def __init__(self, fname, lname):
        self.first_name = fname
        self.last_name = lname
    def full_name(self):
        # Use the properties set in __init__, not the parameter names
        return self.first_name + " " + self.last_name
# Create an object (instance of the Person class)
cp = Person("Charles", "Peirce")
# Get the full name for cp, by calling the full_name() method
print(cp.full_name()) # displays: Charles Peirce
\end{pycode}
In data analytics, there is rarely a need to create custom classes. However, it is useful to know that you may be creating and using objects and their methods. For example:
\begin{pycode}
# Create a string object
my_string = "This is a string of characters"
# Call the string object's split() method
split_string = my_string.split(' ') # split wherever there is a space
print(split_string)
# Displays a list: ['This', 'is', 'a', 'string', 'of', 'characters']
\end{pycode}
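Lists are objects too, and they have their own methods. The following sketch calls a method on a list and a method on a string. The extra list entry and the variable name \code{shout} are made up for this example.
\begin{pycode}
# Create a list object
dogs = ["Golden retriever", "Australian Shepherd", "Blue Heeler"]
# Call the list object's append() method to add an item to the end
dogs.append("Kelpie") # hypothetical extra entry
print(len(dogs)) # displays 4
# Call a string object's upper() method
shout = "hello".upper()
print(shout) # displays HELLO
\end{pycode}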
\subsection{Libraries or Packages}
Related code (objects and functions) that is used repeatedly can be grouped together into a reusable bundle called a \textbf{Library} or \textbf{Package}. This helps minimise duplication.
In data analytics, it is very common to use \textit{libraries} written by other programmers. Once libraries are installed, they can be \textit{imported} into the current Python file and used by the code in that file. Libraries can be imported and used directly, imported as an alias, or components of a library can be imported. For example:
\begin{pycode}
# Import the mathematics library
import math
# Use the library's log10 function by directly referring to the library
# (math.log() with one argument is the natural logarithm, not base 10)
print(math.log10(100)) # displays 2.0
# Import a function of the mathematics library
from math import sqrt
# Use the imported function
print(sqrt(100)) # displays 10.0
# Import a library as an alias
import pandas as pd
# Create a DataFrame object and assign to variable df
df = pd.DataFrame()
\end{pycode}