OUHSC Statistical Computing User Group
Will Beasley, Dept of Pediatrics,
Biomedical and Behavioral Methodology Core (BBMC)
A 'regex' is typically a carefully crafted string that describes a pattern of text. It can:
- Extract components of the text,
- Substitute components of the text, or
- Determine if the pattern simply appears in the text.
It's like the big brother of wildcards you match filenames with
(eg, "*.R"
).
Pattern | Matches |
---|---|
mike |
"mike", "smike", "miked", etc. |
mike4 |
"mike4", "smike4", etc. |
mike\d |
"mike" followed by any single digit (eg "mike8", "smike8") |
mike\d+ |
"mike" followed by one or more digits (eg "mike1234", "smike8") |
^mike$ |
only "mike" |
\b19(?=(1|2))(\d{2})\b
and 20\2
converts years in the 1910s and 1920s to the 2010s and 2020s
(but leaves later years as they are).
-
An online regex tester, regex101 (https://regex101.com/).
-
Example "subject" text in SCUG repo.
(Google "OU scug github".) -
A local text editor, choose one of the following:
- Atom (https://atom.io/).
- Notepad++ (https://notepad-plus-plus.org/).
-
Today's as language agnostic as possible. SAS, R, and Python examples in Part 2.
-
Later, consider RegexBuddy for $40.
- There's no single "regular expression" specification. Each language (eg, Python, R, Java) have slightly different flavors.
- There are two main branches of the specification. We'll concentrate on
- the "Perl" branch (eg,
"\d\w"
) instead of - the "Posix" branch (eg,
"[:digit:][:alnum:]"
)
- the "Perl" branch (eg,
- Don't forget the "g" option in regex101.com
- Identify years in the 1990s, or
- Identify years ending in 7, 8, or 9, or
- Identify lines starting with a year, or
- "Capture" the century of each year, or
- Think of something else.
1916-1918 subscales for a subject
1998-1914 subscales for a subject
subscales for a subject 1998-1920
- Start in a regex tester.
- Start small/simple, then slowly build complexity & generality.
- In the regex.
- In the example/subject text.
- Leave a breadcrumb trail
(ie, progression of simpler regexes, commented out). - Include comments to help others and you later.
- Short-term: use the right-panels of regex101.com.
- Blacklist: Flag the bad numeric values. (Sometimes "bad" is a subjective decision.)
- Whitelist: Permit only the good values.
1234
23
14a
1a3
234
1.39
In the right panel of bottom right panel of regex101.com. Especially these first:
. versus \.
\w and \d and \s (versus \W and \D and \S)
^ and \A
$ and \Z
? and * and + and things like {3,6}
Capturing
Character classes
- Swap the columns.
"CL_ID" = "ClientID"
, "RMSEQ" = "RemovedSequence"
, "RMVL_BEGDT" = "RemovalBeginDate"
, "RMVL_ENDDT" = "RemovalEndDate"
, "END_TYP_CDE" = "EndTypeID"
, "REMOVED_FROM" = "RemovedFromTypeID"
, "CURR_RMVL_TYP" = "RemovalTypeCurrentID"
, "ORIG_RMVL_TYP" = "RemovalTypeOriginalID"
, "FMLY_STRUCTURE" = "FamilyStructureTypeID"
- Anticipate misbehaving subject that you haven't seen yet.
- Try variations of the solutions. There are usually 10 solutions, each with strengths and weaknesses.
- Robustness.
- Readability & maintainability.
- Generality.
- Create very selective regexes that loudly fail when they encounter subject text that you haven't anticipated.
- Long-term: read and rereadRegular Expressions Cookbook, esp Ch 2.
- Erase the "quietly" parameters.
- In a single file
- In many files
requireNamespace("dplyr", quietly=TRUE) #hadley/dplyr
requireNamespace("lubridate")
requireNamespace("OuhscMunge",quietly=T) #OuhscBbmc/OuhscMunge
- Pad single digits with zeros (eg, "4" becomes "04")
9
4
34
3
62
43
1
- Extract the cage ID
- Extract the mouse ID (within the cage)
Time,Gender,Genetype,Treatment,MouseID,OR-Recognition Index,FC-t-F %,FC-b-F %,FC-a-F %
4M,Male,WILD,Control,c9-1,0.32,11.9,0,25.7
4M,Male,WILD,Control,c13-2,0.47,23.7,0,11.
4M,Male,WILD,Prozac,c10-2,0.62,40.7,11.4,51.4
4M,Male,WILD,Prozac,c14-3,0.63,10.2,0,28.6
4M,Male,YFP,Control,c9-2,0.42,42.4,11.4,22.9
4M,Male,YFP,Control,c13-1,0.5,15.3,0,54.1
4M,Male,YFP,Control,c13-nm,1,27.1,0,31.4
4M,Male,YFP,Prozac,c10-1,0.65,20.3,17.1,54.3
(In some cases, you'd have to parse only the cell, not the entire line. But this is good practice.)
- Work by yourself or in pairs.
- After you're done with these 6 exercises,
- Invent new challenges
- Help someone else
- Check the solutions I thought of.
Look for an email invitation to a REDCap survey.
Example 3
(,*\s*)"(\w+)"\s+=\s+"(\w+)" and
$1"$3" = "$2"
Example 4
library\((\w+),\s*quietly=(T|TRUE)\) and
library($1)
Example 5
\b(\d)\b and
0$1
Example 6
,c(\d{1,2})-(\d|nm),