Manipulating String Data in R

Most semi-structured and unstructured data is stored using strings, so you’ll need to deal with string manipulation for data analysis or mining.

By Jaya Trivedi

Sep 6, 2019 • 10 Minute Read

Introduction

This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so you’ll need to deal with string manipulation for data analysis or mining. R provides built-in functions for converting case, combining strings, measuring length, and extracting subsets. The stringr package from the tidyverse is a popular alternative, as all of its string functions begin with str_ and are easy to remember; we will review some of these functions. Let us start by installing the tidyverse package.

      install.packages("tidyverse")
      library(tidyverse)
      library(stringr)

Performing Simple String Operations

Define a String

To create a string in R, you can use single quotes or double quotes; the two are equivalent. You can also call character(), which creates an empty character vector whose elements you can fill in afterwards.

      # Double quotes
      myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"

      # Single quotes
      myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'

      # Empty character vector, filled in afterwards
      myquote <- character(0)
      myquote[1] <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
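As a small illustration of the point about quote styles, the sketch below checks that both kinds of quotes yield the same value and shows two ways to embed a quote character. The variable names are made up for this example.

```r
# Single- and double-quoted literals produce the same character vector.
double_quoted <- "Peace is a daily process"
single_quoted <- 'Peace is a daily process'
identical(double_quoted, single_quoted)  # TRUE

# To embed a quote character, wrap the string in the other quote style
# or escape the quote with a backslash.
with_apostrophe <- "It's a process"
with_escape <- 'It\'s a process'
identical(with_apostrophe, with_escape)  # TRUE

# Either way, the result is a character vector.
class(double_quoted)  # "character"
```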

Create an Empty String

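An empty string is a useful placeholder when the value is not known yet and will be supplied later. Note that '' and "" each create one string containing zero characters, whereas character(0) creates a character vector with no elements at all.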

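      # A separate name keeps myquote intact for the examples that follow
      mystring <- character(0)   # empty character vector
      mystring <- ''             # empty string
      mystring <- ""             # empty string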
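The difference between an empty vector and an empty string is easy to miss, so here is a minimal sketch in base R. The names empty_vector, empty_string, and placeholders are hypothetical.

```r
empty_vector <- character(0)  # a character vector with no elements
empty_string <- ""            # one element that holds zero characters

length(empty_vector)  # 0
length(empty_string)  # 1
nchar(empty_string)   # 0

# nzchar() tests whether each element is a non-empty string.
nzchar(c("", "Peace"))  # FALSE TRUE

# Pre-allocate three empty strings and supply a value later.
placeholders <- character(3)
placeholders[2] <- "filled in later"
placeholders  # "" "filled in later" ""
```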
    

Display Length of String

String length needs to be checked for various purposes, such as:

- Comparing two strings
- Finding the longest or shortest string
- Applying a format to strings

Let us review length(), nchar(), and str_length() from stringr.

length()

      >length(myquote)

Output:

      `[1] 1`

length() returns the number of elements in a vector, not the number of characters. Because myquote is a character vector holding a single string, the result is 1.

nchar()

      >nchar(myquote)

Output:

      `[1] 136`

nchar() counts the characters in each string of the vector.

str_length()

      > str_length(myquote)

Output:

      `[1] 136`

str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
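To make the contrast between length() and nchar() concrete, the following base R sketch uses a vector with several elements and then picks out the longest and shortest strings. The words vector is an illustrative stand-in, not data from this guide.

```r
words <- c("Peace", "is", "a", "daily", "process")

length(words)  # 5: number of elements in the vector
nchar(words)   # 5 2 1 5 7: characters in each element

# Longest and shortest strings (the first one wins in case of a tie).
longest <- words[which.max(nchar(words))]   # "process"
shortest <- words[which.min(nchar(words))]  # "a"

# Compare two strings by length.
nchar("weekly") > nchar("daily")  # TRUE
```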

Combine Two Strings with c() and str_c()

At times, we need to add a string to an existing string. For example, the quote stored in myquote does not include the name of its author. Let’s try to add this as a string.

Add a String Using the c() Combine Function

      > c(myquote, "-John F. Kennedy")

Output:

          `[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"`
    

c() does not join the text; it returns a character vector with two elements, each keeping its own character count.

Add a String Using the str_c() Combine Function

      > str_c(myquote, "-John F. Kennedy", sep= "",collapse =NULL )

Output:

          `[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"`

You can use the sep argument to specify what is placed between the strings being joined. Because str_c() is vectorized, a length-one argument is recycled to match the length of the longest argument.
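The same ideas of a separator, recycling, and collapsing can be tried with base R's paste() and paste0(), which play a comparable role to str_c() in these simple cases. This is only a sketch; quote_text and author are hypothetical names, and quote_text is a shortened stand-in for myquote.

```r
quote_text <- "Peace is a daily process"  # shortened stand-in for myquote
author <- "John F. Kennedy"

# sep controls what goes between the pieces being joined.
with_sep <- paste(quote_text, author, sep = " - ")
with_sep  # "Peace is a daily process - John F. Kennedy"

# paste0() is shorthand for sep = "".
paste0(quote_text, "-", author)

# A length-one argument is recycled across a longer vector.
recycled <- paste0(c("daily", "weekly", "monthly"), "!")
recycled  # "daily!" "weekly!" "monthly!"

# collapse joins the elements of the result into a single string.
collapsed <- paste(c("daily", "weekly", "monthly"), collapse = ", ")
collapsed  # "daily, weekly, monthly"
```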
 
Subset a String
 
To extract part of a string, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you need only the date. Both functions take the start and end positions of the substring to extract.
 
      > substr(myquote, 17, 45)


Output:

      `[1] ", a weekly, a monthly process"`

      > str_sub(myquote,start=17,end=45)

Output:

      `[1] ", a weekly, a monthly process"`
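The date-and-time case mentioned above can be sketched with substr() as follows. The timestamp value and its fixed layout are assumptions made for illustration.

```r
timestamp <- "2019-09-06 10:15:30"  # hypothetical date-time string

# Characters 1 to 10 hold the date, characters 12 to 19 hold the time.
date_part <- substr(timestamp, 1, 10)   # "2019-09-06"
time_part <- substr(timestamp, 12, 19)  # "10:15:30"

# substr() is vectorized over its first argument.
dates <- substr(c("2019-09-06 10:15:30", "2020-01-31 23:59:59"), 1, 10)
dates  # "2019-09-06" "2020-01-31"

# The replacement form overwrites characters in place.
substr(timestamp, 11, 11) <- "T"
timestamp  # "2019-09-06T10:15:30"
```

Position-based extraction like this only works when every string follows the same fixed layout; for variable layouts, splitting on a delimiter is usually the safer choice.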

To split a string into substrings wherever a given pattern matches, use strsplit() or str_split():

      > strsplit(myquote,"slowly")

Output:

          `[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"`
    

      > str_split(myquote,"slowly")

Output:

          `[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"`
    

In this example, myquote is split at the word “slowly” into a character vector of two elements. Both functions return a list that holds one such vector for each input string.
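Because the result of a split is a list, a common next step is to pull out the first element with [[1]]. The sketch below uses base R's strsplit() on a shortened, illustrative sentence.

```r
sentence <- "gradually changing opinions, slowly eroding old barriers"

# The result is a list with one element per input string.
pieces <- strsplit(sentence, ", ")
pieces[[1]]
# "gradually changing opinions" "slowly eroding old barriers"

# Split into words and count them.
words <- strsplit(sentence, " ")[[1]]
length(words)  # 7

# The pattern is a regular expression by default; fixed = TRUE treats
# it as literal text, which matters for characters such as ".".
letters_abc <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
letters_abc  # "a" "b" "c"
```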

Find and Replace Functions

To find a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. These differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().

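In the example below, gsub() replaces every space with “-”, and str_replace_all() replaces every “-” with a space. Because the result of gsub() is not saved, str_replace_all() receives the original quote, which contains no hyphens, and returns it unchanged.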

      > gsub(" ", "-",myquote)

Output:

          `[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"`
    

      > str_replace_all(myquote,"-"," ")

Output:

          `[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"`
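To show how the base R find and replace functions differ in what they return, here is a brief sketch on a small, made-up vector named phrases.

```r
phrases <- c("a daily process", "a weekly process", "new structures")

grep("process", phrases)                # 1 2: indices of matching elements
grep("process", phrases, value = TRUE)  # the matching elements themselves
grepl("process", phrases)               # TRUE TRUE FALSE

# sub() replaces only the first match in each string; gsub() replaces all.
first_only <- sub(" ", "-", phrases[1])    # "a-daily process"
all_matches <- gsub(" ", "-", phrases[1])  # "a-daily-process"

# regexpr() reports where the first match starts in each string
# (-1 means no match).
starts <- as.integer(regexpr("process", phrases))
starts  # 9 10 -1
```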
    

Formatting Strings

Now we will discuss formatting. R provides C-style formatting through sprintf(), a wrapper for the C library function of the same name. It substitutes the given strings or numbers into the placeholders of a format string. The placeholders used here are %s for a string and %.2f for a fixed-point number with two decimal places. You can find more information in the resources section.

          > sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)
    

Output:

      `[1] "Your device Thermostat is at 67.70 percent energy efficient"`
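A few more format specifications are sketched below, including vectorized use, width, and padding. The device names and efficiency values are invented for the example.

```r
devices <- c("Thermostat", "Heat pump")  # hypothetical device names
efficiency <- c(67.7, 82.456)            # hypothetical values

# sprintf() is vectorized: one result per element. "%%" prints a literal %.
report <- sprintf("%s: %.1f%%", devices, efficiency)
report  # "Thermostat: 67.7%" "Heat pump: 82.5%"

# Width and padding: %5d right-aligns in 5 columns, %05d pads with zeros.
sprintf("%5d", 42L)   # "   42"
sprintf("%05d", 42L)  # "00042"

# %-12s left-aligns a string in a 12-character field.
sprintf("%-12s|", "Thermostat")  # "Thermostat  |"
```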

Pattern Matching

Let's review regular expressions, a language for describing patterns in strings. For example, if I want to find all states starting with the letter “A” in the USArrests data set, I can set a pattern match as below:

      # Install rebus to specify the anchors START and END
      install.packages("rebus")
      library(rebus)
      # Find states starting with the letter A
      states <- rownames(USArrests)
      str_view(states, pattern = START %R% "A")
    

Similarly, to find all the states ending with “a”:

      > str_view(states, pattern = "a" %R% END  )
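The rebus anchors START and END correspond to the regular expression anchors ^ and $. As a sketch of the same two searches without extra packages, base R's grep() can return the matching state names directly.

```r
states <- rownames(USArrests)  # USArrests ships with R

# "^" anchors the pattern to the start of the string.
starts_with_A <- grep("^A", states, value = TRUE)
starts_with_A  # "Alabama" "Alaska" "Arizona" "Arkansas"

# "$" anchors the pattern to the end of the string.
ends_with_a <- grep("a$", states, value = TRUE)
head(ends_with_a, 3)  # "Alabama" "Alaska" "Arizona"

# Both anchors together: states that start with "A" and end with "a".
both <- grep("^A.*a$", states, value = TRUE)
both  # "Alabama" "Alaska" "Arizona"
```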

Conclusion

To conclude, this guide provides you with basic functions to get started on string manipulations. I have created a list of a few more functions that you can use; refer to the resources section for further explanations.

Check out the table below for base R functions:

Task	Function to use
Convert to uppercase	toupper(x)
Convert to lowercase	tolower(x)
Join multiple vectors	paste(..., sep = " ", collapse = NULL)
Join elements of a vector together	paste(x, collapse = ' ')
Find regular expression matches in x; returns a vector of the indices of elements that contain the pattern	grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Find regular expression matches in x; returns TRUE for each element in which the pattern is found	grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Replace all matches	gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Convert an object to a character vector	as.character(x)
Check whether an object is of type character	is.character(x)
Abbreviate text	abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE)
Find the positions of all matches in each string	gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Case folding	casefold(x, upper = FALSE)
Character translation	chartr(old, new, x)
Find the first match and its capture groups; returns a list of the same length as text	regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
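
A few of the case and translation functions from the table can be tried on a sample string, as in this short sketch. The title variable is just an example value.

```r
title <- "Manipulating String Data in R"

toupper(title)                 # "MANIPULATING STRING DATA IN R"
tolower(title)                 # "manipulating string data in r"
casefold(title, upper = TRUE)  # same result as toupper(title)

# chartr() translates characters one-for-one: a -> 4, i -> 1.
translated <- chartr("ai", "41", title)
translated  # "M4n1pul4t1ng Str1ng D4t4 1n R"

# as.character() converts other types; is.character() checks the type.
as.character(2019)                # "2019"
is.character(as.character(2019))  # TRUE
```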

Check out the table below for stringr functions and their usage:

Task	Function to use
Convert to uppercase	str_to_upper(string, locale = "en")
Convert to lowercase	str_to_lower(string, locale = "en")
Convert to title case	str_to_title(string, locale = "en")
Convert to sentence case	str_to_sentence(string, locale = "en")
View pattern matches in a string	str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA)
Duplicate a string	str_dup(string, times)
Remove white spaces	str_trim(string, side = c("both", "left", "right")) or str_squish(string)
Wrap text	str_wrap(string, width = 80, indent = 0, exdent = 0)
Count the matches of a pattern in each string	str_count(string, pattern = "")
Specify the encoding of a string	str_conv(string, encoding)
Sort a character vector	str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)

Resources
