Jaya Trivedi

Manipulating String Data in R

  • Sep 6, 2019
  • 10 Min read
  • 7,643 Views
Data
R

Introduction

This guide introduces string manipulation in R. Most semi-structured and unstructured data is stored as strings, so string manipulation is a routine part of data analysis and mining. Base R provides built-in functions for case conversion, combining, measuring length, and subsetting strings. The stringr package, part of the tidyverse, is a popular alternative: all of its functions begin with str_, which makes them easy to remember. We will review some functions from both. Let us start by installing the tidyverse package.

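# Install once; the package name must be quoted
install.packages("tidyverse")
library(tidyverse)
# stringr is already attached by library(tidyverse); loading it explicitly is harmless
library(stringr)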

Performing Simple String Operations

Define a String

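To create a string in R, wrap the text in single or double quotes; the two forms are equivalent. You can also call character(), which creates a character vector of a given length (character(0) is an empty vector) whose elements you can fill in later.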

# Double quotes
myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"

# Single quotes give the same string
myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'

# Start from an empty character vector and assign its first element
myquote <- character(0)
myquote[1] <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
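As a small sketch of the point that both quote styles are equivalent, the snippet below compares strings built each way and shows how to embed one kind of quote inside a string. The variable names (a, b, s1, s2) are illustrative and not part of the original example.

```r
# Single and double quotes produce identical strings
a <- "peace"
b <- 'peace'
identical(a, b)          # TRUE

# Use one quote style to embed the other, or escape with a backslash
s1 <- 'She said "hello"'
s2 <- "She said \"hello\""
identical(s1, s2)        # TRUE

# A string is stored as a character vector of length one
class(a)                 # "character"
length(a)                # 1
```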

Create an Empty String

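An empty string is useful as a placeholder whose value you supply later. Note that character(0) creates an empty character vector (length zero), while '' and "" each create a vector holding one empty string.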

myquote <- character(0)  # empty character vector, no elements
myquote <- ''            # one empty string
myquote <- ""            # same as the line above
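The difference between an empty vector and an empty string can be checked directly. This is a minimal sketch; empty_vec and empty_str are hypothetical names chosen for illustration.

```r
empty_vec <- character(0)   # no elements at all
empty_str <- ""             # one element with zero characters

length(empty_vec)   # 0
length(empty_str)   # 1
nchar(empty_str)    # 0

# Fill the placeholder later; assigning to index 1 grows the empty vector
empty_vec[1] <- "filled in later"
length(empty_vec)   # 1

# nzchar() tests whether each string is non-empty
nzchar(c("", "text"))   # FALSE TRUE
```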

Display Length of String

String length is often needed, for example to:

  • Compare two strings
  • Find the longest or shortest string
  • Apply formatting to strings

Let us review length(), nchar(), and str_length() from stringr.

length()

> length(myquote)

Output:

[1] 1

Because R stores a string as an element of a character vector, length() returns the number of elements in the vector (here, 1), not the number of characters in the string.

nchar()

> nchar(myquote)

Output:

[1] 136

nchar() counts the characters in each string, including spaces and punctuation.

str_length()

> str_length(myquote)

Output:

[1] 136

str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
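The difference between the three functions is easier to see on a vector with more than one element, and on a string that contains a combining character. The sketch below uses made-up values; words and u are illustrative names, not part of the original example.

```r
library(stringr)

words <- c("peace", "process")   # illustrative values

length(words)       # 2: number of elements in the vector
nchar(words)        # 5 7: characters in each element
str_length(words)   # 5 7: same result from stringr

# The letter u-umlaut written two ways: as one code point,
# or as "u" followed by a combining diaeresis
u <- c("\u00fc", "u\u0308")
str_length(u)       # 1 2: code points, although both display as one character
```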

Combine Two Strings with c() and str_c()

At times, we need to append one string to another. For example, the quote stored in myquote does not include an attribution. Let's add one as a string.

Add a String Using the c() Combine Function

> c(myquote, "-John F. Kennedy")

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"

c() does not join the text. It returns a character vector with two separate elements, each with its own character count.
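To confirm that c() keeps the strings separate, you can inspect the result with length() and nchar(). The sketch below uses a shortened stand-in for the quote (quote_text is a hypothetical name) and shows base R's paste0() as one way to produce a single joined string instead.

```r
quote_text <- "Peace is a process"      # shortened stand-in for myquote
combined <- c(quote_text, "-John F. Kennedy")

length(combined)   # 2: two separate elements
nchar(combined)    # 18 16: one count per element

# paste0() joins the pieces into a single string
joined <- paste0(quote_text, " ", "-John F. Kennedy")
length(joined)     # 1
joined             # "Peace is a process -John F. Kennedy"
```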

Add a String Using the str_c() Combine Function

> str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"

You can use the sep argument to specify the separator placed between the strings. Because str_c() is vectorized, a single string is recycled to match the length of the longest vector.

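The roles of sep, collapse, and recycling can be sketched on a short vector. The authors vector below is made up for illustration and does not appear in the original example.

```r
library(stringr)

authors <- c("Kennedy", "Gandhi", "Mandela")   # illustrative values

# The single string "- " is recycled across all three elements
str_c("- ", authors)
# "- Kennedy" "- Gandhi" "- Mandela"

# sep is placed between the arguments within each element
str_c("Quote by", authors, sep = " ")
# "Quote by Kennedy" "Quote by Gandhi" "Quote by Mandela"

# collapse joins the resulting elements into one string
str_c(authors, collapse = ", ")
# "Kennedy, Gandhi, Mandela"
```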
Subset a String

To extract part of a string, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you need only the date. Both functions take the start and end positions of the substring to extract.

> substr(myquote, 17, 45)

Output:

[1] ", a weekly, a monthly process"

> str_sub(myquote, start = 17, end = 45)

Output:

[1] ", a weekly, a monthly process"
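The date-and-time case mentioned above might look like the following sketch. The timestamp value and the name stamp are hypothetical; the positions assume a fixed "YYYY-MM-DD HH:MM:SS" layout.

```r
library(stringr)

stamp <- "2019-09-06 14:30:00"   # hypothetical timestamp string

# Characters 1 to 10 hold the date part
substr(stamp, 1, 10)                  # "2019-09-06"
str_sub(stamp, start = 1, end = 10)   # "2019-09-06"

# str_sub() also accepts negative positions, counted from the end
str_sub(stamp, start = -8)            # "14:30:00"
```

Negative positions are convenient when the part you need sits at a fixed distance from the end of the string rather than from the start.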

To split a string into substrings wherever a given pattern matches:

> strsplit(myquote, "slowly")

Output:

[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"

> str_split(myquote, "slowly")

Output:

[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"

In this example, myquote is split at the word “slowly” into a character vector of two elements. Both functions return a list with one entry per input string, which is why the output begins with [[1]].
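Because the result is a list, you usually need one more step to get at the pieces. The sketch below uses a short made-up input (parts is an illustrative name) and base R's strsplit().

```r
parts <- strsplit("daily, weekly, monthly", ", ")   # illustrative input

class(parts)         # "list"
parts[[1]]           # "daily" "weekly" "monthly"
length(parts[[1]])   # 3

# unlist() flattens the list when there is only one input string
pieces <- unlist(parts)

# fixed = TRUE treats the pattern as literal text rather than a regular expression
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
```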

Find and Replace Functions

To find a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. They differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().

In the example below, gsub() replaces all the spaces with “-” and str_replace_all() then replaces all the “-” with spaces again.

> hyphenated <- gsub(" ", "-", myquote)
> hyphenated

Output:

[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"

> str_replace_all(hyphenated, "-", " ")

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
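The find functions and the difference between sub() and gsub() can be compared side by side. This is a sketch on a made-up vector x, using base R only.

```r
x <- c("old barriers", "new structures", "old opinions, old habits")  # illustrative

grep("old", x)                  # 1 3: indices of matching elements
grep("old", x, value = TRUE)    # the matching strings themselves
grepl("old", x)                 # TRUE FALSE TRUE: one logical per element
as.integer(regexpr("old", x))   # 1 -1 1: start of first match, -1 if none

sub("old", "new", x[3])         # "new opinions, old habits" (first match only)
gsub("old", "new", x[3])        # "new opinions, new habits" (all matches)
```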

Formatting Strings

Now we will discuss formatting. R provides C-style string formatting through sprintf(), a wrapper for the C library function of the same name. Let us see an example in which sprintf() substitutes a string and a number into a format template. The format specifiers used here are %s for a string and %.2f for a fixed-point number with two decimal places. You can find more information in the resources section.

> sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)

Output:

[1] "Your device Thermostat is at 67.70 percent energy efficient"
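A few more format specifiers follow the same pattern. The values below are made up for illustration; the specifiers themselves are standard C-style ones that sprintf() accepts.

```r
sprintf("%d items", 3L)        # "3 items"      (%d needs an integer)
sprintf("%5.1f", 3.14159)      # "  3.1"        (width 5, one decimal place)
sprintf("%05d", 42L)           # "00042"        (zero-padded to width 5)
sprintf("%.1f%%", 67.7)        # "67.7%"        (%% prints a literal percent sign)
sprintf("%-10s|", "left")      # "left      |"  (left-aligned in width 10)

# sprintf() is vectorized over its arguments
sprintf("%s: %.2f", c("a", "b"), c(1, 2.5))   # "a: 1.00" "b: 2.50"
```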

Pattern Matching

Let's review regular expressions, a compact way of describing patterns in text. For example, to find all states starting with the letter “A” in the USArrests data set, you can build a pattern as shown below:

# Install rebus to specify the anchors START and END
install.packages("rebus")
library(rebus)
# Find states starting with the letter A
states <- rownames(USArrests)
str_view(states, pattern = START %R% "A")

Similarly, to find all the states ending with “a”:

> str_view(states, pattern = "a" %R% END)
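The same two searches can be sketched without rebus by writing the anchors directly: ^ marks the start of a string and $ marks the end, which is what START and END stand for. The snippet uses base R's grepl() so that it returns the matching names rather than rendering a view; starts_with_A and ends_with_a are illustrative names.

```r
states <- rownames(USArrests)   # USArrests ships with R in the datasets package

# "^" anchors the pattern at the start of the string
starts_with_A <- states[grepl("^A", states)]
starts_with_A          # "Alabama" "Alaska" "Arizona" "Arkansas"

# "$" anchors the pattern at the end of the string
ends_with_a <- states[grepl("a$", states)]
head(ends_with_a, 3)   # "Alabama" "Alaska" "Arizona"
```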

Conclusion

To conclude, this guide covered the basic functions you need to get started with string manipulation. The tables below list a few more functions that you can use; refer to the resources section for further explanations.

Check out the table below for base R functions:

Task | Function to use
---- | ---------------
Convert to uppercase | toupper(x)
Convert to lowercase | tolower(x)
Join multiple vectors | paste(..., sep = " ", collapse = NULL)
Join elements of a vector together | paste(x, collapse = " ")
Find regular expression matches in x; returns the indices of the elements that contain the pattern | grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Find regular expression matches in x; returns TRUE if the pattern is found | grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Replace all matches | gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Convert an object to a character string | as.character(x)
Check for the character data type | is.character(x)
Abbreviate text | abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE)
Find the positions of all matches in each string | gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Case folding | casefold(x, upper = FALSE)
Character translation | chartr(old, new, x)
Find the first match and its capture groups; returns a list of the same length as text | regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
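A few of the base R functions from the table, applied to a short made-up string (x is an illustrative name):

```r
x <- "Peace is a process"   # illustrative value

toupper(x)                  # "PEACE IS A PROCESS"
tolower(x)                  # "peace is a process"
casefold(x, upper = TRUE)   # same result as toupper(x)

# chartr() swaps characters one for one: a -> A, e -> E
chartr("ae", "AE", x)       # "PEAcE is A procEss"

paste("Peace", "process", sep = " & ")    # "Peace & process"
paste(c("a", "b", "c"), collapse = "-")   # "a-b-c"

is.character(x)             # TRUE
as.character(3.5)           # "3.5"
```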

Check out the table below for stringr functions and their usage:

Task | Function to use
---- | ---------------
Convert to uppercase | str_to_upper(string, locale = "en")
Convert to lowercase | str_to_lower(string, locale = "en")
Convert to title case | str_to_title(string, locale = "en")
Convert to sentence case | str_to_sentence(string, locale = "en")
View pattern matches | str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA)
Duplicate a string | str_dup(string, times)
Remove white space | str_trim(string, side = c("both", "left", "right")) or str_squish(string)
Wrap text | str_wrap(string, width = 80, indent = 0, exdent = 0)
Count pattern matches in each string | str_count(string, pattern = "")
Specify the encoding of a string | str_conv(string, encoding)
Sort a character vector | str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)
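A few of the stringr functions from the table, applied to made-up values (x is an illustrative name). This sketch assumes stringr is installed.

```r
library(stringr)

x <- "  peace is   a process  "   # illustrative value with extra spaces

str_trim(x)      # "peace is   a process"  (outer white space removed)
str_squish(x)    # "peace is a process"    (inner runs collapsed as well)

str_to_upper("peace")                 # "PEACE"
str_to_title("peace is a process")    # "Peace Is A Process"

str_dup("ab", 3)                      # "ababab"
str_count("banana", "a")              # 3
str_sort(c("pear", "apple", "fig"))   # "apple" "fig" "pear"
```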

Check out my guides on visualizations with R: