# Messy strings that each hide one date
examples_dates <- c(
  "jlnsdc19/04/2022sjkscd",
  "34kjkbs83n19-04-2022kn3r3jwk",
  "kjnwf34kb7-4-22wkj34",
  "dkfbc19/04-2022fwb3k"
)
# Messy strings that each hide one time
examples_times <- c(
  "nsdvln18:03:58sdlfkjns",
  "sflks18.03:58sdlfjn"
)
Introduction
This tutorial will introduce the basics of regular expressions, through general considerations and then concrete examples of varying difficulty.
Regular expressions
Let’s refer to Wikipedia to see what we are working with in this article:
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
This definition immediately frames the kinds of problems we can handle with regexes. When we face raw text, or a variable in a data frame containing strings from which we only need one specific piece of information for further use, regular expressions are a natural tool.
A regular expression will describe a sequence of characters, pretty much as detailed as one can imagine, that can then be targeted and extracted by the machine. Now let’s look at how to build a regular expression.
Building regex first steps
While this tutorial is in R, regex syntax is largely the same across many languages, so most of the details here carry over. One R-specific point: inside an R string literal every backslash must be doubled, so the regex \d is written "\\d" in code.
Let’s imagine every possible character that can be contained in textual data, and start breaking it into categories.
Categories of characters
The following table presents a non-exhaustive set of character categories:
Category | Symbol | Example | Opposite | Symbol | Example |
---|---|---|---|---|---|
numbers | \d | 12387 | non-numbers | \D | :-dfv* |
letters (lower case) | [a-z] | jsknvs | letters (upper case) | [A-Z] | SDFVM |
letters (lower & upper) | [a-zA-Z] | fvFVDf | - | - | - |
numbers & letters | [[:alnum:]] | 3k4rF4 | - | - | - |
word characters (letters, digits, _) | \w | Hello | non-word characters | \W | ! |
whitespace | \s | " " | non-whitespace | \S | "kj4&n" |
boundaries of words | \b | _hello_ | interior of words | \B | h_e_l_l_o |
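To make the table more concrete, here is a minimal sketch that applies a few of these categories to one string. It uses base R with perl = TRUE rather than stringr so that it runs without extra packages. The sample string and the variable names are made up for illustration.

```r
# Sample string mixing letters, digits, spaces and punctuation (made up).
x <- "Room 42, floor B!"

# In an R string literal the backslash is doubled: the regex \d is written "\\d".
# gregexpr() finds every match, regmatches() extracts the matched text.
digits   <- regmatches(x, gregexpr("\\d", x, perl = TRUE))[[1]]
uppers   <- regmatches(x, gregexpr("[A-Z]", x, perl = TRUE))[[1]]
alnum    <- regmatches(x, gregexpr("[[:alnum:]]+", x, perl = TRUE))[[1]]
n_spaces <- lengths(regmatches(x, gregexpr("\\s", x, perl = TRUE)))

print(digits)    # "4" "2"
print(uppers)    # "R" "B"
print(alnum)     # "Room" "42" "floor" "B"
print(n_spaces)  # 3
```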
As you can see, some categories have a dedicated shorthand, while others are written by placing the characters of interest between square brackets. This lets you build your own character classes for your specific needs. For example, to match any digit or the letter h, write [\dh]. Conversely, to match everything except those characters, place ^ right after the opening bracket: [^\dh].
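The custom class and its negation can be tried out as follows. This is a small sketch in base R (perl = TRUE) on a made-up sample string.

```r
# Made-up sample string.
x <- "h3llo 2 the w0rld"

# [\dh] : any digit or the letter h
in_class <- regmatches(x, gregexpr("[\\dh]", x, perl = TRUE))[[1]]

# [^\dh] : anything that is NOT a digit or h; here we delete those characters
kept <- gsub("[^\\dh]", "", x, perl = TRUE)

print(in_class)  # "h" "3" "2" "h" "0"
print(kept)      # "h32h0"
```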
Other symbols specify position, repetition, optional or arbitrary characters, grouping, and alternatives:
- Positioning
  - ^ : beginning of the sequence, e.g. ^\d
  - $ : end of the sequence, e.g. \dh$
- Repetition
  - * : zero or more times
  - + : one or more times
  - {min,max} : between min and max repetitions
  - {n} : exactly n times
- Any symbol
  - . : matches any single symbol
  - ? : makes the preceding symbol optional (zero or one time)
- Groups
  - () : group elements
- Logical
  - | : logical or
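The anchors and quantifiers listed above can be checked on a handful of short strings. The sketch below uses base R with perl = TRUE. The test strings and variable names are invented for illustration.

```r
# Invented test strings.
x <- c("2022", "a1", "7h", "ab", "12h")

# ^\d : the string starts with a digit
starts_with_digit <- grepl("^\\d", x, perl = TRUE)
# \dh$ : the string ends with a digit followed by h
ends_with_digit_h <- grepl("\\dh$", x, perl = TRUE)
# ^\d{4}$ : the whole string is exactly four digits
four_digits <- grepl("^\\d{4}$", x, perl = TRUE)
# ^\d{1,2}h?$ : one or two digits, optionally followed by h
digits_optional_h <- grepl("^\\d{1,2}h?$", x, perl = TRUE)
# ^a.$ : the letter a followed by exactly one arbitrary symbol
a_then_any <- grepl("^a.$", x, perl = TRUE)

# + needs at least one digit; * also accepts zero digits
plus_match <- regmatches("ab123cd", regexpr("\\d+", "ab123cd", perl = TRUE))
star_strip <- sub("\\d*$", "", c("abc123", "abc"), perl = TRUE)

print(starts_with_digit)  # TRUE FALSE TRUE FALSE TRUE
print(digits_optional_h)  # FALSE FALSE TRUE FALSE TRUE
print(plus_match)         # "123"
print(star_strip)         # "abc" "abc"
```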
It is possible to group several characters by enclosing them in round brackets (). Whatever is inside them is then treated as one single block of characters. Any of the regex symbols mentioned above can appear inside a group, in particular the logical or, which helps distinguish between cases.
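As a small illustration of groups, the sketch below combines a group with the logical or, and then with a repetition count. It uses base R with perl = TRUE, and the file names are hypothetical.

```r
# Hypothetical strings.
x <- c("file.csv", "file.tsv", "file.txt", "ababab", "abab")

# \.(csv|tsv)$ : the string ends with .csv or .tsv
is_table <- grepl("\\.(csv|tsv)$", x, perl = TRUE)

# ^(ab){3}$ : the block "ab", treated as one unit, repeated exactly three times
three_ab <- grepl("^(ab){3}$", x, perl = TRUE)

print(is_table)  # TRUE TRUE FALSE FALSE FALSE
print(three_ab)  # FALSE FALSE FALSE TRUE FALSE
```

Without the brackets, ab{3} would repeat only the b, so it would match "abbb" rather than "ababab".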
Examples
We will be using the stringr package, part of the tidyverse ecosystem of packages. You can also try base R, which is already quite capable at manipulating strings and regexes. Let’s look at a list of examples we can use to test our newly acquired knowledge. Quite often you need to work with dates or times, and they may have been collected in a messy way. Imagine that all of these corner cases occur in your data and you still want to extract dates and times accurately.
Let’s work out the basic structure of the part we are interested in: the day is one or two digits, followed by a separator (either / or -), then the month, again one or two digits, another separator, and finally the year, which is either two or four digits. We also assume that the surrounding characters are not digits, otherwise the match would be ambiguous. Let’s capture that with a regex:
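library(stringr)
# Day and month: 1-2 digits; separator: / or -; year: 4 digits, else 2.
# The 4-digit alternative comes first because alternatives are tried left to
# right: with (\\d{2}|\\d{4}) the match would stop after "20" in "2022".
date_regex <- "(\\d{1,2})[/-](\\d{1,2})[/-](\\d{4}|\\d{2})"
dates <- str_extract(examples_dates, date_regex)
print(dates)
[1] "19/04/2022" "19-04-2022" "7-4-22"     "19/04-2022"
As a first step toward a uniform dd/mm/yyyy format, let’s replace every other separator with /. We use str_replace_all() because str_replace() only changes the first match in each string.
# Replace every - with /, not just the first one in each string
dates <- str_replace_all(dates, "-", "/")
print(dates)
[1] "19/04/2022" "19/04/2022" "7/4/22"     "19/04/2022"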
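Unifying the separators does not by itself pad single-digit days and months or expand two-digit years. The sketch below shows one way capture groups could finish the job, in base R with perl = TRUE. The input vector is hypothetical, and the rule that every two-digit year belongs to the 2000s is an assumption made here, not something stated in the tutorial.

```r
# Hypothetical input: dates whose separators are already "/".
dates_slash <- c("19/04/2022", "7/4/22")

# \b(\d)\b is a single digit with a word boundary on each side.
# "\\1" in the replacement re-inserts the captured digit after the added 0.
padded <- gsub("\\b(\\d)\\b", "0\\1", dates_slash, perl = TRUE)

# Expand a two-digit year at the end of the string.
# ASSUMPTION: every two-digit year belongs to the 2000s.
full_dates <- sub("/(\\d{2})$", "/20\\1", padded, perl = TRUE)

print(full_dates)  # "19/04/2022" "07/04/2022"
```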
With times the problem is very similar. A time is generally written as three pairs of digits, for the hour, minute, and second, separated by either . or :.
# Hour, minute, second: two digits each, separated by . or :
time_regex <- "(\\d{2})[\\.:](\\d{2})[\\.:](\\d{2})"
times <- str_extract(examples_times, time_regex)
print(times)
[1] "18:03:58" "18.03:58"
And again, let’s format all the times as hh:mm:ss.
# Replace every literal . with : (the dot is escaped so it is not "any symbol")
times <- str_replace_all(times, "\\.", ":")
print(times)
[1] "18:03:58" "18:03:58"
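For readers who prefer base R, the time example could be sketched without stringr as follows. The sample strings are the ones defined at the top of this tutorial. The names ending in _base are introduced here for illustration only.

```r
# Same sample strings as in the time example.
examples_times <- c("nsdvln18:03:58sdlfkjns", "sflks18.03:58sdlfjn")
time_regex_base <- "\\d{2}[.:]\\d{2}[.:]\\d{2}"

# regexpr() locates the first match in each string, regmatches() extracts it.
# Unlike str_extract(), which returns NA, regmatches() drops strings without a match.
times_base <- regmatches(
  examples_times,
  regexpr(time_regex_base, examples_times, perl = TRUE)
)

# fixed = TRUE treats "." as a literal dot; gsub() replaces every occurrence.
times_base <- gsub(".", ":", times_base, fixed = TRUE)

print(times_base)  # "18:03:58" "18:03:58"
```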