```{r, include = FALSE}
# enable python code previews; must use python 3
library(reticulate)
use_python("/usr/bin/python3")
ottrpal::set_knitr_image_path()
```
# VIDEO Introduction to Refactoring with AI
This video discusses how AI can help with refactoring your code.
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/tcrvwyswExo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
You can view and download the Google Slides [here](https://docs.google.com/presentation/d/1RHf5p7504GOhazt_xDshRyMo5mnG_5kH2lOLx18bWjg/edit#slide=id.p).
Code refactoring has historically been done manually by developers. This involves reviewing code and identifying areas that could be improved or optimized, and then making changes to the codebase accordingly. Though important, this process is time-consuming and labor-intensive, as it requires developers to carefully review every line of code to identify potential issues or areas for improvement. Additionally, manual code refactoring is error-prone, as developers can accidentally introduce bugs or errors into the codebase while making changes.
However, AI has significant potential to help with code refactoring. AI can use machine learning algorithms to analyze large amounts of code and identify patterns or areas that could be improved. For example, they can identify sections of code that are redundant, overly complex, or difficult to maintain, and suggest changes that could be made to improve the codebase. Machine learning algorithms can also help to identify potential bugs or security issues in the codebase, which can help to improve the overall quality and stability of the software.
AI refactoring can also be much faster than manual refactoring, although its suggestions still need to be reviewed. This speed is particularly useful for large-scale software projects with massive codebases, where manual code review and refactoring can be an enormous task. In the next sections, we'll take a look at some examples of using AI to refactor code.
# Refactoring Code
## Learning Objectives
- Describe how refactoring code involves optimization for maintainability, efficiency, and reuse
- Explain why refactoring code is important for developers in the long-term
- Recognize the benefits and limitations of using AI tools to refactor code, as well as why AI tools are uniquely poised to be beneficial
- Implement prompt strategies that can be used to assist with refactoring code for correcting syntax, for adopting more consistent styling, for making code more concise, for making code easier to maintain, and for making code more efficient
## Refactoring Basics
[Code refactoring](https://en.wikipedia.org/wiki/Code_refactoring) is the process of improving the quality of underlying code without changing its functionality. In other words, it's a way of cleaning up and optimizing code so that it's easier to maintain and more efficient. This often involves making small changes to the code, such as renaming variables or functions, reorganizing code blocks, or simplifying complex expressions. Refactoring is an essential practice in software development and helps to ensure that the codebase remains manageable and adaptable as requirements and business needs change over time.
Code refactoring helps to reduce [technical debt](https://en.wikipedia.org/wiki/Technical_debt), which is the accumulation of development work that needs to be done in the future as a result of taking shortcuts or using less than optimal solutions. Refactoring can help to identify potential issues and bugs before they cause problems, resulting in more stable software. Finally, refactoring code to be more readable and less verbose can create fewer headaches for you in the future.
To understand why code refactoring is important, imagine trying to cook in a disorganized and cluttered kitchen. You might be a fantastic cook, but you are spending more time searching for utensils and ingredients, and the quality of the meal could suffer as a result. Similarly, code that is disorganized and cluttered can be difficult to work with and can result in lower-quality software. Refactoring is like cleaning and organizing the kitchen - it makes the process of cooking (or coding) smoother and more efficient.
```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Refactoring code is like cleaning up a cluttered, disorganized kitchen. On the left is a jumble of cookbooks, utensils, and fruits. On the right they are nicely organized and extras have been removed."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MCNeSO4aOm1iESWDLOGTcx3aLEbnu8UttV0QGVAeafE/edit#slide=id.g22d6414ee48_0_26")
```
::: warning
The information presented in this course is meant for use with open source code and software. It is unclear what happens to the information fed to AI chatbots as prompts, or how secure the data are. We know data are saved and may be used to further train the AI tools, but the specifics of how data are saved, as well as how sensitive or personally identifiable information is protected, are unknown.
Err on the side of caution when interacting with them. We do **not** recommend using proprietary code or private information for prompts unless you are working with an AI that you or your company built and you know is secure.
:::
## Example One: Correcting Syntax
Programming languages operate using their established set of rules. For example, [Python](https://www.python.org/) uses indentation to define code blocks. However, when you shift between development environments, indentation and line breaks can sometimes be affected, causing major headaches!
Let's say we run the following code:
```{python error = TRUE, python.reticulate = TRUE}
def calculate_sum(a,b):
result = a + b
return result
def calculate_product(a,b):result = a * b
return result
def main():
a=5
b=6
sum_result = calculate_sum(a, b)
print("Sum:", sum_result)
product_result = calculate_product(a,b)
print("Product:", product_result)
if __name__ == '__main__':
main()
```
We get the error above because our indentation is off. Let's ask [ChatGPT](https://chat.openai.com) to clean things up.
:::{.query}
Can you assist me with correcting the formatting and indentation issues in my Python code?
:::
:::{.ai_response}
Of course! Here is the code after correcting the formatting and indentation:
```{python python.reticulate = TRUE}
def calculate_sum(a, b):
    result = a + b
    return result

def calculate_product(a, b):
    result = a * b
    return result

def main():
    a = 5
    b = 6
    sum_result = calculate_sum(a, b)
    print("Sum:", sum_result)
    product_result = calculate_product(a, b)
    print("Product:", product_result)

if __name__ == '__main__':
    main()
```
:::
Much better! This can be quite handy when you can't immediately see what is wrong with the formatting or if you are working with a large codebase.
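If you want to double check a fix like this yourself, one option is to confirm that the corrected code at least parses. Below is a minimal sketch using Python's built-in `ast` module. The helper name `parses_cleanly` and the shortened code strings are our own, for illustration only. Passing this check only means the syntax is valid, not that the code does what you intended.

```python
import ast

def parses_cleanly(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False

# Shortened versions of the code before and after the fix (illustrative only)
broken = "def calculate_sum(a, b):\nresult = a + b\nreturn result\n"
fixed = "def calculate_sum(a, b):\n    result = a + b\n    return result\n"

print(parses_cleanly(broken))  # False
print(parses_cleanly(fixed))   # True
```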
## Example Two: Styling and Convention
Sometimes programming languages have more flexible rules where there are multiple ways to take the same or almost the same action. For example, in the [R programming language](https://www.r-project.org/), there are several ways to perform variable assignment. However, it is generally accepted among R programmers that the `<-` operator should be used. Single and double quotes can be used for character types interchangeably, but it's generally a good idea to stick with one or the other.
### Commonly Accepted Convention
Let's say we have the following code:
```{r}
a <- "apple"
b = 'banana'
"carrot" -> c
c(a, b, c)
```
This code does not produce any errors, but it doesn't follow typical R convention. Let's ask ChatGPT to clean things up.
:::{.query}
Can you help me reformat my R code to follow accepted style conventions?
:::
:::{.ai_response}
Sure, here's the cleaned-up R code:
```{r eval=FALSE}
a <- "apple"
b <- "banana"
c <- "carrot"
```
:::
Much better! This code will now look more familiar to other R programmers.
```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "AI can help clean up code to match commonly followed conventions by users."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MCNeSO4aOm1iESWDLOGTcx3aLEbnu8UttV0QGVAeafE/edit#slide=id.g22d6414ee48_0_10")
```
### Visual Styling
Some languages are flexible about line breaks, indentation, and spacing among code blocks. If you're rapidly developing, things can get messy pretty quickly. Many development environments, for example RStudio and PyCharm, have the ability to format your code for you without changing any of the code itself. However, you can also get AI to do this according to your own preferences.
```{r}
d <- c(rep(a,5),rep(b,5),rep(c,5),paste(a,b,c),paste(c,b,a))
```
This code is pretty hard to read. Let's see if AI can do better.
:::{.query}
Can you space out this R code to make it more readable?
:::
:::{.ai_response}
Sure! Here's the same code with added spacing for improved readability:
```{r}
d <- c(rep(a, 5),
       rep(b, 5),
       rep(c, 5),
       paste(a, b, c),
       paste(c, b, a))
```
:::
Better! It's easier to interpret because only one action is being taken on each step. By spacing out the code and breaking it into multiple lines, it becomes easier for the human eye to scan for structure, logic, errors, and/or inconsistencies.
## Example Three: Staying DRY
The DRY (Don't Repeat Yourself) programming tenet is essential to good software. By avoiding repetition in code, we can reduce the amount of duplicate code we need to write and maintain. When code is repeated, changing the functionality of one piece means making the same change in multiple places, which can be time-consuming and error-prone.
Let's take the following Python code:
```{python python.reticulate = TRUE}
def print_greetings():
    print("Hello, Alice!")
    print("Hello, Bob!")
    print("Hello, Carol!")
    print("Hello, David!")
    print("Hello, Eve!")

print_greetings()
```
Notice how repetitive this code is? `print()` and the string `"Hello, "` appear five times. We can improve this!
:::{.query}
Could you please refactor my Python code to make it more DRY?
:::
:::{.ai_response}
Sure, here's a refactored version of your code:
```{python}
def print_greeting(name):
    print(f"Hello, {name}!")

def print_greetings():
    names = ["Alice", "Bob", "Carol", "David", "Eve"]
    for name in names:
        print_greeting(name)

print_greetings()
```
:::
Now let's imagine our software lead has asked us to change the greeting to Spanish. Notice how we only have to change the second line of code:
```{python eval = TRUE, python.reticulate = TRUE}
def print_greeting(name):
    print(f"¡Hola, {name}!")

def print_greetings():
    names = ["Alice", "Bob", "Carol", "David", "Eve"]
    for name in names:
        print_greeting(name)

print_greetings()
```
Much easier.
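If you expect the wording to change again, one possible next step (not part of the AI response above) is to pass the greeting in as an argument, so that switching languages needs no edits to the function at all. The sketch below assumes a hypothetical `template` parameter and a helper named `format_greeting`; both are our own additions.

```python
def format_greeting(name, template="Hello, {name}!"):
    """Build one greeting by filling `name` into the template."""
    return template.format(name=name)

def print_greetings(names, template="Hello, {name}!"):
    """Print a greeting for every name in the list."""
    for name in names:
        print(format_greeting(name, template))

names = ["Alice", "Bob", "Carol", "David", "Eve"]
print_greetings(names)                              # English (default)
print_greetings(names, template="¡Hola, {name}!")   # Spanish
```

Whether this extra flexibility is worth it depends on how likely the greeting is to change. For a one-off script, the simpler version above may be easier to read.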
## Example Four: Keeping Code Alive
During software development, code is often changing rapidly. It's common to end up with functions or variables you don't need anymore. It's best practice to remove "dead" code from your codebase: it's distracting and can make maintenance more challenging. Programmers should resist the urge to hang on to unused code in the expectation that they might need it one day. Version control tools, such as Git and GitHub, let you keep a record of old code without cluttering your final version.
Let's take a look at some Python code with some dead code:
```{python eval = TRUE, python.reticulate = TRUE}
def default_values():
    a = 5
    b = 6
    return a, b

def calculate_sum(a, b):
    result = a + b
    return result

def calculate_product(x, y):
    return x * y

def simulate_data():
    a, b = default_values()
    dat = calculate_sum(a, b) + calculate_sum(a, b)
    return dat

def main():
    a = 5
    b = 6
    a, b = default_values()
    sum_result = calculate_sum(a, b)
    print("Sum:", sum_result)

if __name__ == '__main__':
    main()
```
Some of these functions are never called from `main()`, which is the conventional entry point of a Python script. Let's ask AI to remove them.
:::{.query}
I need help with eliminating unused functions in my Python script.
:::
:::{.ai_response}
Sure! Here's a refactored version:
```{python eval = TRUE, python.reticulate = TRUE}
def default_values():
    a = 5
    b = 6
    return a, b

def calculate_sum(a, b):
    result = a + b
    return result

def main():
    a = 5
    b = 6
    a, b = default_values()
    sum_result = calculate_sum(a, b)
    print("Sum:", sum_result)

if __name__ == '__main__':
    main()
```
:::
Great! Now our code is much more manageable. It includes only the functions needed to run the `main()` function. We can go a step further!
:::{.query}
Can you help me refactor my python code to make it more concise?
:::
:::{.ai_response}
Certainly! Here's a more concise version of your code:
```{python eval = TRUE, python.reticulate = TRUE}
def calculate_sum(a=5, b=6):
    return a + b

def main():
    sum_result = calculate_sum()
    print("Sum:", sum_result)

if __name__ == '__main__':
    main()
```
:::
In this version, the `calculate_sum` function is simplified to take default arguments for `a` and `b`. This means the `default_values` function is no longer necessary. The `a` and `b` variables are also removed from `main`, as they are no longer needed. Much better and easier to look at overall.
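Before deleting functions that an AI tool says are unused, it can help to check for yourself. The following is a rough sketch, assuming a single-file script: it uses Python's built-in `ast` module to list functions that are defined but never mentioned anywhere else. The helper name `find_unreferenced_functions` and the condensed `script` string are our own. The check is deliberately naive: a function that is only called by other dead code still counts as referenced, and calls made dynamically (for example by name through `getattr`) are not detected.

```python
import ast

def find_unreferenced_functions(source):
    """Return names of functions that are defined but never mentioned elsewhere."""
    tree = ast.parse(source)
    defined = set()
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            referenced.add(node.id)      # plain use, e.g. calculate_sum(...)
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)    # attribute use, e.g. module.calculate_sum(...)
    return defined - referenced

# A condensed version of the script from this example (illustrative only)
script = '''
def default_values():
    return 5, 6

def calculate_sum(a, b):
    return a + b

def calculate_product(x, y):
    return x * y

def simulate_data():
    a, b = default_values()
    return calculate_sum(a, b) + calculate_sum(a, b)

def main():
    a, b = default_values()
    print("Sum:", calculate_sum(a, b))

if __name__ == '__main__':
    main()
'''

print(sorted(find_unreferenced_functions(script)))
# ['calculate_product', 'simulate_data']
```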
## Example Five: Making Code Readable
Usually, our first goal as programmers is to get the response we need from the code. Readability might be the last thing on our minds. Luckily, AI can help with this. Let's imagine we whipped up this code pretty quickly. It is tough to read, and it could definitely be improved.
```{r}
for(term in c("good","bad","ugly")){
print(c(rep(term,3),paste("really",stringr::str_to_upper(term))))
}
```
Let's see if AI can do better.
:::{.query}
Can you help me clean up this R code so it is more readable?
:::
:::{.ai_response}
Sure! Here's a more readable version of the code:
```{r warning=FALSE}
library(stringr)
for(term in c("good", "bad", "ugly")) {
  print(c(rep(term, 3), paste("really", str_to_upper(term))))
}
```
I added whitespace around operators and after commas for better readability. I also loaded the `stringr` package explicitly using `library()` to make the code more self-contained.
:::
This is a good start, but not much has changed. Let's try again. On ChatGPT, we can click "Regenerate Response".
:::{.ai_response}
Here's a cleaned-up version of the R code that is more readable:
```{r warning=FALSE}
library(stringr)
terms <- c("good", "bad", "ugly")
for (term in terms) {
  rep_term <- rep(term, 3)
  upper_term <- str_to_upper(term)
  really_upper_term <- paste("really", upper_term)
  result <- c(rep_term, really_upper_term)
  print(result)
}
```
I've stored the repeated term and the uppercased term in separate variables to make the code more readable. I've also separated the construction of the "really" phrase from the uppercased term to make it clearer what's happening. Finally, I've used whitespace and indentation to make the structure of the code more apparent.
:::
Better! While the code takes up a greater number of lines, it's easier to interpret because only one action is being taken on each step.
```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Use the 'Regenerate response' button in ChatGPT for a variety of refactoring styles. The button is highlighted with a callout that says 'Get more options!'."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MCNeSO4aOm1iESWDLOGTcx3aLEbnu8UttV0QGVAeafE/edit#slide=id.g22d6414ee48_0_111")
```
## Example Six: Keep it Brief
As we learned above, following the DRY principle and removing dead code can make your code easier to interpret, maintain, and debug. However, sometimes the issue is the functions themselves. No one has perfect knowledge of everything in their programming language. AI can help us determine if there are any more efficient sets of functions that do the same thing. Here is some code that plays with the "CO2" dataset in R:
```{r}
library(datasets)
for (type in unique(CO2$Type)) {
  dat_subset <- CO2[CO2$Type == type, ]
  for (plant in unique(dat_subset$Plant)) {
    dat_subset_2 <- dat_subset[dat_subset$Plant == plant, ]
    print(paste(dat_subset_2$Plant[1], round(mean(dat_subset_2$uptake), 1)))
  }
}
```
Let's see if AI has any ideas for making it more concise.
:::{.query}
Can you make this code more concise?
:::
:::{.ai_response}
Here's a more concise version using `dplyr` package in R:
```{r warning=FALSE, message=FALSE}
library(dplyr)
CO2 %>%
  group_by(Type, Plant) %>%
  summarize(mean_uptake = mean(uptake))
```
This code groups the CO2 dataset by Type and Plant, calculates the mean uptake for each group using summarize, and then prints the resulting data frame. The code is shorter and more concise because it uses the `dplyr` package, which has built-in functions for data manipulation and summarization.
:::
Refactoring the code to be more concise using `dplyr` makes it easier to read and understand the steps involved in the data manipulation. It also reduces the number of nested loops and intermediate variables, making the code more efficient and reducing the chance of errors. Ultimately, it is your job as a programmer to use your best judgment on the fine balance between brevity and readability, as the best solution is not always this obvious.
:::{.notice}
Remember that you can regenerate the AI's response if you want more options.
:::
## Code Optimization
Sometimes time and efficiency are of the essence. Loops can be slow and time-consuming in many languages. Take for example the following R code, which uses nested `for` loops to multiply each element of vector x by every element of vector y and add up the products:
```{r}
set.seed(123)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
z <- numeric(n)
# Start timer
start_time <- proc.time()
for(i in 1:n) {
  for(j in 1:n) {
    z[i] <- z[i] + x[i]*y[j]
  }
}
# End timer
proc.time() - start_time
```
:::{.query}
What is a faster, more optimized way of running the following R code?
:::
:::{.ai_response}
```{r}
set.seed(123)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
# Start timer
start_time <- proc.time()
z <- outer(x, y, "*")
z <- rowSums(z)
# End timer
proc.time() - start_time
```
The `outer()` function performs the same calculation as the nested loop in the original code, but more efficiently. It returns a matrix of all possible combinations of x and y values, with each element of the matrix being the product of the corresponding x and y values. The `rowSums()` function is then used to sum the elements of each row of the matrix, which is equivalent to summing the products of x and y for each index `i` in the original loop. This method avoids the need for the nested loop, resulting in a faster and more efficient computation.
:::
So much faster! We can look at another option by regenerating the response:
:::{.ai_response}
```{r}
set.seed(123)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)
z <- numeric(n)
# Start timer
start_time <- proc.time()
z <- rowSums(x %*% t(y))  # sum each row of the n-by-n product matrix so z matches the loop's result
# End timer
proc.time() - start_time
```
One optimized way to perform the same calculation is by using the `%*%` operator to perform matrix multiplication. This can be done by converting x and y to matrices and transposing one of them so that their dimensions align for matrix multiplication. This code should be much faster than the original implementation because it takes advantage of highly optimized matrix multiplication algorithms in R.
:::
While both suggestions are much faster than the nested loops, you will need to consider what aspects of the codebase are most important in each instance. For example, the second suggestion runs quickly, but [the `%*%` operator](https://stat.ethz.ch/R-manual/R-patched/library/base/html/matmult.html) might be unfamiliar to some R programmers. In cases where efficiency is less important, or the data are not large, you might consider maximizing readability.
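It is also worth checking whether the calculation itself can be simplified before reaching for a faster function. In the loop above, each `z[i]` is `x[i]` multiplied by every element of `y` and summed, which is the same as `x[i]` times the sum of `y`. Neither AI response above takes this route. The Python sketch below is our own illustration, with hypothetical function names `nested_loop` and `factored` and a smaller `n` so the slow version finishes quickly.

```python
import math
import random
import time

random.seed(123)
n = 500  # smaller than the R example so the nested loop stays quick
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

def nested_loop(x, y):
    """Direct translation of the nested loops: n * n multiplications."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        for j in range(len(y)):
            z[i] += x[i] * y[j]
    return z

def factored(x, y):
    """Same result using sum(y) once: roughly n additions plus n multiplications."""
    total_y = sum(y)
    return [xi * total_y for xi in x]

start = time.perf_counter()
slow = nested_loop(x, y)
print("nested loop seconds:", time.perf_counter() - start)

start = time.perf_counter()
fast = factored(x, y)
print("factored seconds:", time.perf_counter() - start)

# The two versions agree up to floating-point rounding
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9) for a, b in zip(slow, fast)))
```

The factored version also avoids building an n-by-n matrix in memory, which matters as `n` grows.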
## Challenges and Limitations
Like humans, AI can make mistakes. Be sure to keep the following in mind as you use AI to refactor your code.
### Complexity
Refactoring is often a complex process that requires deep understanding of the code and its context. AI may not always be able to fully understand the complexity of the code and may struggle to identify the best refactoring strategy. Use a modular approach whenever possible.
### Limited Data
AI models require large amounts of data to learn from, but in the case of code refactoring, there is often limited data available. This can make it difficult for AI models to generalize to new codebases and situations, especially if you are using a more niche programming language.
### Quality Control
Automated refactoring tools that use AI may not always produce code that is of the same quality as code produced by human developers. It can be difficult to always ensure that the refactored code is maintainable, efficient, and free of bugs. You need to use your best judgment when copying and pasting AI-produced code into your codebase.
:::{.warning}
**You should always include unit tests in your code.** Tests can help you catch bugs, including those introduced accidentally by AI.
:::
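As a small illustration, here is a sketch of a test that compares a function before and after refactoring, using the `calculate_sum` example from earlier in this chapter. The `_original` and `_refactored` suffixes and the test function name are hypothetical, added here so both versions can sit side by side. In a real project you would more likely keep one function and run the same tests before and after the change, for example with a test framework such as `pytest` or `unittest`.

```python
def calculate_sum_original(a, b):
    """The function as it looked before refactoring."""
    result = a + b
    return result

def calculate_sum_refactored(a=5, b=6):
    """The more concise version suggested by the AI."""
    return a + b

def test_refactor_preserves_behavior():
    # The refactored function should return the same results for the same inputs
    cases = [(5, 6), (0, 0), (-3, 3), (2.5, 1.5)]
    for a, b in cases:
        assert calculate_sum_refactored(a, b) == calculate_sum_original(a, b)
    # The new default arguments should reproduce the old default values (5 and 6)
    assert calculate_sum_refactored() == 11

test_refactor_preserves_behavior()
print("All refactoring checks passed")
```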
Because AI models are created by humans, they can be biased. This means they may not always identify your preferred refactorings or may prioritize certain types of refactorings over others. In some cases, this can lead to suboptimal code quality and may create technical debt over time.
### Security
When using AI to refactor code, the code itself is often sent to an external service or platform for analysis and transformation. This can raise concerns about the security of the code, especially if it contains sensitive information such as trade secrets, proprietary algorithms, or personal data. If your code is sensitive, it's important to carefully vet any third-party AI tools or services used in the refactoring process.
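One precaution you might take before pasting code into a chatbot is to strip out obvious secrets. The sketch below is only a starting point, under the assumption that secrets appear as quoted values assigned to variables whose names contain `key`, `token`, `password`, or `secret`. The pattern and the helper name `redact_secrets` are our own, and this will miss secrets stored in other forms. It is not a substitute for vetting the tool or for keeping sensitive code out of prompts altogether.

```python
import re

# Assumption: secrets look like NAME = "value", where NAME contains one of these words
SECRET_ASSIGNMENT = re.compile(
    r"""(?P<name>\b\w*(?:key|token|password|secret)\w*\b)(?P<op>\s*=\s*)(?P<quote>["'])(?P<value>.*?)(?P=quote)""",
    re.IGNORECASE,
)

def redact_secrets(source):
    """Replace the quoted value of secret-looking assignments with a placeholder."""
    return SECRET_ASSIGNMENT.sub(
        lambda m: f"{m['name']}{m['op']}{m['quote']}<REDACTED>{m['quote']}",
        source,
    )

snippet = 'API_KEY = "abc123"\nname = "Alice"\ndb_password = \'hunter2\'\n'
print(redact_secrets(snippet))
```

Only the values assigned to `API_KEY` and `db_password` are replaced here; the `name` line is left alone because its variable name does not match the pattern.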
# VIDEO Refactoring Code Main Points
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/uKVQAtQ-w0I" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- Code refactoring is the process of improving code quality without changing its functionality. It is crucial in software development to maintain a manageable and adaptable codebase.
- Code refactoring reduces technical debt, improves code stability, and makes it easier to maintain.
- Examples of using AI for code refactoring include correcting syntax, adhering to styling and convention, visual styling, avoiding repetition, removing dead code, and improving both readability and speed of execution (optimization).
- Using AI for code refactoring raises ethical concerns, and AI is not perfect. It is important for developers to consider the security needs of their code and to test the refactored code.
You can view and download the Google Slides [here](https://docs.google.com/presentation/d/12kBb3mQIWOmn44JsjDcXU8e5dhLAf0xAwkoC5FASRrU/edit#slide=id.p).