This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so data analysis and mining often require string manipulation. Base R provides built-in functions for converting case, combining strings, measuring length, and extracting substrings. The stringr package from the tidyverse is a popular alternative: all of its functions begin with str_, which makes them easy to remember, and we will review some of them. Let us start by installing the tidyverse package.
install.packages("tidyverse")
library(tidyverse)
# stringr is attached by library(tidyverse); loading it explicitly is harmless
library(stringr)
To create a string in R, you can use single quotes, double quotes, or character(). Note that character() creates a character vector of a given length (character(0) is an empty vector) whose elements you assign afterwards.
# Double quotes
myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
# Single quotes give the same result
myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'
# Start from an empty character vector and assign its first element
myquote <- character(0)
myquote[1] <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
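As a quick check of the point above, the sketch below shows that single and double quotes produce identical strings, and that one quote style lets you embed the other without escaping. The values and variable names are illustrative and not part of the original guide.

```r
# Illustrative values only; the variable names are made up for this sketch
single <- 'Peace is a process'
double <- "Peace is a process"

# Both quote styles create the same character value
same <- identical(single, double)

# Use one style to embed the other without escaping
apostrophe <- "It's a daily process"
spoken <- 'She said "peace" twice'

# character(n) pre-allocates n empty strings that can be filled in later
slots <- character(2)
slots[1] <- single

print(same)
print(class(apostrophe))
print(slots)
```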
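The same three forms can create empty values that you fill in later. Note the difference: character(0) is a character vector with no elements, while '' and "" are single strings that contain no characters.
# A separate variable is used here so that myquote keeps the quote for the examples below
emptyquote <- character(0)
emptyquote <- ''
emptyquote <- ""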
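The difference between an empty vector and an empty string is easy to miss, so here is a minimal sketch using only base R. The variable names are illustrative.

```r
no_strings <- character(0)  # a character vector with zero elements
empty_string <- ""          # one string that contains zero characters

n_before <- length(no_strings)  # 0: there are no elements
length(empty_string)            # 1: there is one element
nchar(empty_string)             # 0: that element has no characters

# Values can be assigned later; the vector grows as needed
no_strings[1] <- "Peace"
no_strings[2] <- "process"
length(no_strings)              # 2
```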
String length is checked for many purposes, such as comparing two strings or finding the longest or shortest string in a set.
Let us review length() and nchar() from base R, and str_length() from stringr.
> length(myquote)
Output:
[1] 1
Because R stores strings as elements of a character vector, length() returns the number of elements, not the number of characters. Here myquote holds a single string, so the result is 1.
> nchar(myquote)
Output:
[1] 136
nchar() counts the characters in each string, including spaces and punctuation.
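Since nchar() is vectorized, it can serve the comparison tasks mentioned earlier. The sketch below uses a made-up vector of words to find the longest and shortest strings.

```r
words <- c("peace", "gradually", "old", "structures")  # illustrative data

n_chars <- nchar(words)                 # one count per element: 5 9 3 10

longest <- words[which.max(n_chars)]    # "structures"
shortest <- words[which.min(n_chars)]   # "old"

# Compare two strings by length
is_shorter <- nchar("daily") < nchar("monthly")   # TRUE

print(n_chars)
print(longest)
print(shortest)
```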
> str_length(myquote)
Output:
[1] 136
str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
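To illustrate the "not always" case, the sketch below compares two ways of writing ü: as a single precomposed code point, and as the letter u followed by a combining diaeresis. It assumes stringr is installed. The byte counts assume UTF-8 encoding, which is what the \u escapes produce.

```r
library(stringr)

precomposed <- "\u00fc"   # ü as one code point
combined <- "u\u0308"     # u + combining diaeresis: two code points

str_length(precomposed)   # 1
str_length(combined)      # 2, although it displays as one character

# Byte counts differ again, because UTF-8 needs two bytes for each non-ASCII code point here
nchar(precomposed, type = "bytes")  # 2
nchar(combined, type = "bytes")     # 3
```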
At times, you need to add a string to an existing one. For example, the quote stored in myquote does not name its author. Let's try to add the author as a string.
> c(myquote, "-John F. Kennedy")
Output:
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"
c() does not join the text: it returns a character vector of two separate strings, each with its own character count. The result is not assigned here, so myquote still holds only the quote in the examples that follow.
To join the two into a single string, use str_c() from stringr:
> str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)
Output:
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"
You can use the sep argument to specify the separator placed between the strings, and collapse to merge the resulting vector into a single string. Because str_c() is vectorized, a length-one argument is recycled to match the length of the longest argument.
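A minimal sketch of sep, collapse, and recycling, assuming stringr is installed. The inputs are illustrative and not taken from the guide's data.

```r
library(stringr)

# sep is placed between the arguments
joined <- str_c("Peace", "process", sep = " is a ")   # "Peace is a process"

# A length-one argument is recycled across the longer vector
steps <- c("daily", "weekly", "monthly")
labelled <- str_c("a ", steps)                        # "a daily" "a weekly" "a monthly"

# collapse merges the resulting vector into one string
sentence <- str_c(labelled, collapse = ", ")          # "a daily, a weekly, a monthly"

print(joined)
print(labelled)
print(sentence)
```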
### Subset a String
To extract part of a string, use substr() from base R or str_sub() from stringr. This is helpful when, for example, a date and time are stored together as one string and you need only the date. Both functions take the start and end positions of the substring to extract.
> substr(myquote, 17, 45)
Output:
[1] ", a weekly, a monthly process"
> str_sub(myquote, start = 17, end = 45)
Output:
[1] ", a weekly, a monthly process"
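For the date and time case mentioned above, a base R sketch might look like this. The timestamps are made-up values in a fixed-width YYYY-MM-DD HH:MM:SS layout, which is what makes position-based extraction safe here.

```r
stamp <- "2023-05-17 14:30:00"       # illustrative timestamp, not from the guide

date_part <- substr(stamp, 1, 10)    # characters 1 to 10
time_part <- substr(stamp, 12, 19)   # characters 12 to 19

print(date_part)   # "2023-05-17"
print(time_part)   # "14:30:00"

# substr() is vectorized, so it also works on a whole column of timestamps
stamps <- c("2023-05-17 14:30:00", "2024-01-02 08:05:59")
dates <- substr(stamps, 1, 10)       # "2023-05-17" "2024-01-02"
```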
To split a string into substrings wherever a given pattern matches, use strsplit() from base R or str_split() from stringr:
> strsplit(myquote, "slowly")
Output:
[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"
> str_split(myquote, "slowly")
Output:
[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"
In this example, myquote is split at the word "slowly" into a character vector of two elements. Both functions return a list with one entry per input string, which is why the output begins with [[1]].
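Because the split functions return a list, a common follow-up is to pull out the character vector for a single input. A base R sketch with illustrative input:

```r
record <- "daily,weekly,monthly"   # illustrative comma-separated values

pieces <- strsplit(record, ",")    # a list of length 1
parts <- pieces[[1]]               # the character vector inside it

print(parts)          # "daily" "weekly" "monthly"
print(length(parts))  # 3

# The pattern is a regular expression by default; fixed = TRUE treats it literally,
# which matters for characters such as "." that have a special meaning
dotted <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
```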
To find a pattern in a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. They differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().
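To make the differences between these search functions concrete, here is a small base R sketch. The vector is illustrative.

```r
x <- c("old barriers", "new structures", "opinions")  # illustrative data

found <- grepl("r", x)                # TRUE TRUE FALSE: one logical per element
idx <- grep("r", x)                   # 1 2: indices of the matching elements
hits <- grep("r", x, value = TRUE)    # the matching elements themselves

# regexpr() gives the position of the first match in each element (-1 if none)
first_pos <- as.integer(regexpr("r", x))          # 7 7 -1

# gregexpr() gives the positions of every match, as a list with one entry per element
all_pos <- as.integer(gregexpr("r", x[1])[[1]])   # 7 8 11
```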
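In the example below, gsub() replaces every space with "-", and str_replace_all() replaces every "-" with a space. Because the gsub() result is not assigned back to myquote, the second call finds no "-" and returns the original string.
> gsub(" ", "-", myquote)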
Output:
[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"
> str_replace_all(myquote, "-", " ")
Output:
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
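The difference between replacing the first match and replacing all matches can be sketched as follows, using base R and a short illustrative string. The last two lines show a round trip in which the intermediate result is kept in a variable.

```r
phrase <- "a daily, a weekly, a monthly process"   # illustrative input

first_only <- sub("a ", "one ", phrase)    # only the first "a " is replaced
every_one <- gsub("a ", "one ", phrase)    # every "a " is replaced

print(first_only)   # "one daily, a weekly, a monthly process"
print(every_one)    # "one daily, one weekly, one monthly process"

# Keep the intermediate result to undo the replacement afterwards
hyphenated <- gsub(" ", "-", phrase)
restored <- gsub("-", " ", hyphenated)     # identical to phrase again
```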
Now we will discuss formatting. R provides C-style string formatting through sprintf(), a wrapper for the C library function of the same name. Let us see an example in which sprintf() replaces each format specifier with a given string or number. The specifiers used here are %s for a string and %.2f for a fixed-point value with two decimal places. You can find more information in the resources section.
> sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)
Output:
[1] "Your device Thermostat is at 67.70 percent energy efficient"
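A few more format specifiers, sketched in base R. The device names and readings are made up for illustration.

```r
count_msg <- sprintf("%d devices online", 3L)   # %d formats an integer
padded_id <- sprintf("%03d", 7L)                # zero-padded to width 3: "007"
percent <- sprintf("%5.1f%%", 67.72)            # width 5, one decimal, literal %: " 67.7%"

# sprintf() is vectorized over its arguments
devices <- c("Thermostat", "Heater")
readings <- c(67.72, 82.38)
report <- sprintf("%s: %.1f", devices, readings)

print(count_msg)
print(padded_id)
print(percent)
print(report)   # "Thermostat: 67.7" "Heater: 82.4"
```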
Let's review regular expressions, a notation for describing patterns in text. For example, to find all states starting with the letter "A" in the USArrests data set, I can set up a pattern match as below:
# Install rebus, which provides the anchors START and END
install.packages("rebus")
library(rebus)
# Find states starting with the letter A
states <- rownames(USArrests)
str_view(states, pattern = START %R% "A")
Similarly, to find all the states ending with “a”:
> str_view(states, pattern = "a" %R% END)
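The same two searches can be sketched in base R without rebus, writing the regular-expression anchors ^ (start of string) and $ (end of string) directly. This is an alternative for comparison, not the approach the guide itself uses.

```r
states <- rownames(USArrests)

# "^A" matches names that start with A; "a$" matches names that end with a
starts_with_A <- grep("^A", states, value = TRUE)
ends_with_a <- grep("a$", states, value = TRUE)

print(starts_with_A)         # "Alabama" "Alaska" "Arizona" "Arkansas"
print(head(ends_with_a, 3))  # "Alabama" "Alaska" "Arizona"
```

Unlike str_view(), which highlights matches for visual inspection, grep() with value = TRUE returns the matching names as a character vector that you can use in further code.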
To conclude, this guide provides you with basic functions to get started on string manipulations. I have created a list of a few more functions that you can use; refer to the resources section for further explanations.
Check out the table below for base R functions:
Task | Function to use |
---|---|
Convert to uppercase | toupper(x) |
Convert to lowercase | tolower(x) |
Join multiple vectors | paste(..., sep = " ", collapse = NULL) |
Join elements of a vector together | paste(x, collapse = ' ') |
Find regular expression matches in x; returns the indices of the elements that contain the pattern | grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) |
Find regular expression matches in x; returns TRUE for each element in which the pattern is found | grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Replace all matches | gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Convert an object to a character vector | as.character(x) |
Check whether an object is of type character | is.character(x) |
Abbreviate text | abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE) |
Find the positions of all matches, which enables retrieval of the matching substrings | gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Case folding | casefold(x, upper = FALSE) |
Character translation | chartr(old, new, x) |
Find the positions of the first match and its captured groups; returns a list of the same length as text | regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
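A few of the base R functions from the table, sketched on a short illustrative string:

```r
x <- "Peace is a Process"   # illustrative input

upper <- toupper(x)                   # "PEACE IS A PROCESS"
lower <- tolower(x)                   # "peace is a process"
folded <- casefold(x, upper = TRUE)   # same result as toupper(x)

# chartr() translates characters one for one: a -> 4, e -> 3
translated <- chartr("ae", "43", x)   # "P34c3 is 4 Proc3ss"

# paste() joins its arguments; collapse merges one vector into a single string
pair <- paste("old", "barriers")                                # "old barriers"
listing <- paste(c("daily", "weekly", "monthly"), collapse = ", ")

is.character(x)                       # TRUE
number_text <- as.character(136)      # "136"
```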
Check out the table below for stringr functions and their usage:
Task | Function to use |
---|---|
Convert to uppercase | str_to_upper(string, locale = "en") |
Convert to lowercase | str_to_lower(string, locale = "en") |
Convert to title case | str_to_title(string, locale = "en") |
Convert to sentence case | str_to_sentence(string, locale = "en") |
View the matches of a pattern | str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA) |
Duplicate a string | str_dup(string, times) |
Remove whitespace | str_trim(string, side = c("both", "left", "right")) or str_squish(string) |
Wrap text | str_wrap(string, width = 80, indent = 0, exdent = 0) |
Count the matches of a pattern in each string | str_count(string, pattern = "") |
Specify or override the encoding of a string | str_conv(string, encoding) |
Sort a character vector | str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) |
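A short sketch of several stringr functions from the table, assuming stringr is installed. The input strings are illustrative.

```r
library(stringr)

x <- "  peace   is a process  "   # illustrative input with untidy spacing

trimmed <- str_trim(x)            # strips both ends: "peace   is a process"
squished <- str_squish(x)         # also collapses inner runs: "peace is a process"

shouted <- str_to_upper(squished) # "PEACE IS A PROCESS"
titled <- str_to_title(squished)  # "Peace Is A Process"

repeated <- str_dup("ab", 3)      # "ababab"
n_a <- str_count("banana", "a")   # 3
ordered <- str_sort(c("weekly", "daily", "monthly"))  # "daily" "monthly" "weekly"
```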