Manipulating String Data in R

Most semi-structured and unstructured data is stored using strings, so you’ll need to deal with string manipulation for data analysis or mining.

Sep 6, 2019 • 10 Minute Read

Introduction

This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so you’ll need string manipulation for data analysis or mining. R provides built-in functions for case conversion, combining, measuring length, and subsetting strings. The stringr package from the tidyverse is a popular choice because all of its functions begin with str_ and are easy to remember; we will review some of them. Let us start by installing the tidyverse package.

      install.packages("tidyverse")
      library(tidyverse)  # attaches stringr along with the other core packages
      library(stringr)    # optional: load stringr on its own
    

Performing Simple String Operations

Define a String

To create a string in R, you can use single quotes, double quotes, or character(). Note that character() creates a character vector rather than a single string literal.

      myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"

      myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'

      myquote = character(0)
      myquote[1] = "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
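
As a quick check of the points above, the following sketch compares the two quoting styles and shows what character() returns. The variable names and the shortened quote are illustrative only.

```r
# Single and double quotes create the same string
single <- 'Peace is a daily process'
double <- "Peace is a daily process"
identical(single, double)   # TRUE

# Using one quote style lets you embed the other without escaping
nested <- 'She said "peace" twice'

# character(n) creates a character vector of n empty strings
slots <- character(2)
slots[1] <- "first value"
length(slots)               # 2
```

Double quotes are the more common convention in R code; single quotes are handy mainly when the text itself contains double quotes.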
    

Create an Empty String

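An empty string is useful as a placeholder when the value is not known yet and will be assigned later. Note that character(0) creates an empty character vector (length 0), while '' and "" each create a single string with no characters.

      myquote = character(0)
      myquote <- ''
      myquote <- ""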
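
The difference between an empty vector and an empty string can be checked with a minimal sketch like the one below; the variable names are hypothetical.

```r
empty_vector <- character(0)  # a character vector with no elements
empty_string <- ""            # one element that holds zero characters

length(empty_vector)  # 0
length(empty_string)  # 1
nchar(empty_string)   # 0

# A value can be assigned later; the vector grows as needed
empty_vector[1] <- "Peace is a daily process"
length(empty_vector)  # 1
```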
    

Display Length of String

String length is useful for tasks such as:

  • Comparing two strings
  • Finding the longest or shortest string
  • Applying formatting to strings

Let us review length(), nchar(), and str_length() from stringr.

length()

      > length(myquote)

Output:

      [1] 1

Because R stores a string as an element of a character vector, length() returns the number of elements in the vector (here, 1), not the number of characters in the string.
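
To make the distinction concrete, here is a small sketch on a vector with several elements. It also shows one way to find the longest and shortest strings; the words vector is made up for illustration.

```r
words <- c("daily", "weekly", "monthly")

length(words)                    # 3 elements in the vector
nchar(words)                     # 5 6 7 characters per element

# Pick out the longest and shortest strings
words[which.max(nchar(words))]   # "monthly"
words[which.min(nchar(words))]   # "daily"
```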

nchar()

      > nchar(myquote)

Output:

      [1] 136

nchar() counts the characters in each string.

str_length()

      > str_length(myquote)

Output:

      [1] 136
    

str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
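
One case where code points and visible characters differ is a letter followed by a combining accent. The sketch below uses base R's nchar() to illustrate the idea; the example string is an assumption chosen for illustration.

```r
# "e" followed by U+0301 (combining acute accent) displays as a single
# accented letter but is stored as two code points.
accented <- "e\u0301"

nchar(accented, type = "chars")  # 2 code points
nchar(accented, type = "bytes")  # 3 bytes in UTF-8
```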

Combine Two Strings with c() and str_c()

At times, we need to add a string to an existing string. For example, the quote stored in myquote does not include the author's name. Let’s try to add it as a string.

Add a String Using the c() Combine Function

      > c(myquote, "-John F. Kennedy")

Output:

      [1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
      [2] "-John F. Kennedy"

c() does not join the text. It returns a character vector with two elements, each with its own character count. The result is not assigned here, so myquote remains a single string in the examples that follow.
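
The contrast between combining into a vector and concatenating into one string can be sketched in base R with c() and paste0(). The shortened quote and the variable names are illustrative only.

```r
quote_text <- "Peace is a daily process"
author <- "-John F. Kennedy"

combined <- c(quote_text, author)       # vector of two separate strings
joined   <- paste0(quote_text, author)  # one concatenated string

length(combined)  # 2
length(joined)    # 1
nchar(combined)   # 24 16
nchar(joined)     # 40
```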

Add a String Using the str_c() Combine Function

      > str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)

Output:

      [1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"

You can use the sep argument to specify how the strings are separated. Because str_c() is vectorized, an argument of length one is recycled to the length of the longest argument.

Subset a String

To extract part of a string, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you need only the date. Both functions take the start and end positions of the substring to extract.

      > substr(myquote, 17, 45)
    

Output:

      [1] ", a weekly, a monthly process"

      > str_sub(myquote, start = 17, end = 45)

Output:

      [1] ", a weekly, a monthly process"
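
The date-and-time case mentioned earlier can be sketched as follows. The timestamp value and its fixed-width layout are assumptions made for this example.

```r
timestamp <- "2019-09-06 14:30:00"  # hypothetical value

date_part <- substr(timestamp, 1, 10)   # characters 1 through 10
time_part <- substr(timestamp, 12, 19)  # characters 12 through 19

date_part  # "2019-09-06"
time_part  # "14:30:00"
```

Position-based extraction like this only works when every string follows the same layout; for irregular input, a pattern-based approach is usually safer.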
    

To split a string into substrings wherever a given pattern matches, use strsplit() or str_split():

      > strsplit(myquote, "slowly")

Output:

      [[1]]
      [1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
      [2] " eroding old barriers, quietly building new structures"

      > str_split(myquote, "slowly")

Output:

      [[1]]
      [1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
      [2] " eroding old barriers, quietly building new structures"

In this example, myquote is split at the word “slowly” into two substrings. Both functions return a list with one element per input string; each element is a character vector holding the pieces.
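
Because the result is a list, you usually need one more step to get at the pieces. A minimal sketch with base R, using a shortened sentence for illustration:

```r
sentence <- "gradually changing opinions, slowly eroding old barriers"

pieces <- strsplit(sentence, "slowly")
class(pieces)        # "list"
length(pieces[[1]])  # 2

# Extract the character vector and trim the leftover spaces
parts <- trimws(pieces[[1]])
parts  # "gradually changing opinions," "eroding old barriers"
```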

Find and Replace Functions

To find a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. They differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().

In the example below, gsub() replaces all the spaces with “-” and the result is stored in hyphenated; str_replace_all() then replaces all the “-” with spaces, restoring the original text.

      > hyphenated <- gsub(" ", "-", myquote)
      > hyphenated

Output:

      [1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"

      > str_replace_all(hyphenated, "-", " ")

Output:

      [1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
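
To show how the base R search functions differ in what they return, and how sub() differs from gsub(), here is a short sketch on a made-up vector:

```r
x <- c("old barriers", "new structures", "old and new")

grep("old", x)                 # 1 3  (indices of matching elements)
grep("old", x, value = TRUE)   # "old barriers" "old and new"
grepl("old", x)                # TRUE FALSE TRUE
as.integer(regexpr("new", x))  # -1 1 9  (start of first match, -1 if none)

sub("o", "0", "old and bold")  # "0ld and bold"  (first match only)
gsub("o", "0", "old and bold") # "0ld and b0ld"  (all matches)
```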
    

Formatting Strings

Now we will discuss formatting. R provides C-style formatting through sprintf(), a wrapper for the C library function of the same name. Let us see an example using sprintf(), which substitutes each format specifier with a given string or number. The specifiers used here are %s for a string and %.2f for a fixed-point value with two decimal places. You can find more information in the resources section.

      > sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)
    

Output:

      [1] "Your device Thermostat is at 67.70 percent energy efficient"
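
A few more format specifiers, and the fact that sprintf() works on whole vectors, can be sketched as follows. The device names and scores are invented for illustration.

```r
# %d formats an integer, %5.1f pads a number to width 5 with 1 decimal,
# and %% prints a literal percent sign
sprintf("%d devices average %5.1f%% efficiency", 3L, 67.7)
# "3 devices average  67.7% efficiency"

# sprintf() is vectorized over its arguments
devices <- c("Thermostat", "Heater")
scores <- c(67.7, 82.25)
labels <- sprintf("%s: %.2f", devices, scores)
labels
# "Thermostat: 67.70" "Heater: 82.25"
```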
    

Pattern Matching

Let's review regular expressions, a method of describing patterns. For example, to find all states starting with the letter “A” in the USArrests data set, you can set up a pattern match as below:

      # Install rebus, which provides the START and END anchors
      install.packages("rebus")
      library(rebus)
      # Find states starting with the letter A
      states = rownames(USArrests)
      str_view(states, pattern = START %R% "A")
    

Similarly, to find all the states ending with “a”:

      > str_view(states, pattern = "a" %R% END  )
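
The same matches can be sketched in base R without extra packages, assuming the rebus START and END anchors correspond to the standard regular expression anchors ^ and $:

```r
states <- rownames(USArrests)

# ^ anchors the pattern to the start of the string
starts_with_a <- grep("^A", states, value = TRUE)
starts_with_a  # "Alabama" "Alaska" "Arizona" "Arkansas"

# $ anchors the pattern to the end of the string
ends_with_a <- grep("a$", states, value = TRUE)
head(ends_with_a, 3)  # "Alabama" "Alaska" "Arizona"
```

Unlike str_view(), which highlights matches for visual inspection, grep() with value = TRUE returns the matching strings so they can be used in later steps.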
    

Conclusion

To conclude, this guide covers the basic functions you need to get started with string manipulation. I have listed a few more functions that you can use below; refer to the resources section for further explanations.

Check out the table below for base R functions:

| Task | Function to use |
| --- | --- |
| Convert to uppercase | toupper(x) |
| Convert to lowercase | tolower(x) |
| Join multiple vectors | paste(..., sep = " ", collapse = NULL) |
| Join elements of a vector together | paste(x, collapse = " ") |
| Find regular expression matches in x; returns the indices of the elements that contain the pattern | grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) |
| Find regular expression matches in x; returns TRUE if the pattern is found | grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Replace all matches | gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Convert an object to a character vector | as.character(x) |
| Check whether x is of character type | is.character(x) |
| Abbreviate text | abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE) |
| Find the positions of all matches | gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Case folding | casefold(x, upper = FALSE) |
| Character translation | chartr(old, new, x) |
| Find the first match and any capture groups; returns a list of the same length as text | regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
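
A few of the base R functions from the table, sketched on made-up input:

```r
x <- "Peace is a Daily Process"

toupper(x)                 # "PEACE IS A DAILY PROCESS"
tolower(x)                 # "peace is a daily process"
casefold(x, upper = TRUE)  # same result as toupper(x)

# chartr() maps each character in `old` to the one at the same position in `new`
chartr("ae", "43", "peace")                   # "p34c3"

# sep joins vectors element by element; collapse joins a vector into one string
paste("a", c("daily", "weekly"))              # "a daily" "a weekly"
paste(c("daily", "weekly"), collapse = ", ")  # "daily, weekly"
```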

Check out the table below for stringr functions and their usage:

| Task | Function to use |
| --- | --- |
| Convert to uppercase | str_to_upper(string, locale = "en") |
| Convert to lowercase | str_to_lower(string, locale = "en") |
| Convert to title case | str_to_title(string, locale = "en") |
| Convert to sentence case | str_to_sentence(string, locale = "en") |
| View matches of a pattern | str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA) |
| Duplicate a string | str_dup(string, times) |
| Remove whitespace | str_trim(string, side = c("both", "left", "right")) or str_squish(string) |
| Wrap text | str_wrap(string, width = 80, indent = 0, exdent = 0) |
| Count matches of a pattern in each string | str_count(string, pattern = "") |
| Specify the encoding of a string | str_conv(string, encoding) |
| Sort a character vector | str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) |
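
A few of the stringr functions from the table, sketched on made-up input. This assumes the stringr package is installed.

```r
library(stringr)

x <- "  peace is   a daily process  "

str_trim(x)                  # "peace is   a daily process"
str_squish(x)                # "peace is a daily process"
str_to_upper(str_squish(x))  # "PEACE IS A DAILY PROCESS"
str_to_title(str_squish(x))  # "Peace Is A Daily Process"
str_dup("ab", 3)             # "ababab"
str_count("banana", "a")     # 3
str_sort(c("weekly", "daily", "monthly"))  # "daily" "monthly" "weekly"
```

str_trim() removes only leading and trailing whitespace, while str_squish() also collapses repeated spaces inside the string.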


Resources