Jaya Trivedi

Manipulating String Data in R

  • Sep 6, 2019
  • 10 Min read
  • 75 Views
Data
R

Introduction

This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so you will need string manipulation for data analysis or mining. Base R provides built-in functions for converting case, combining strings, measuring length, and taking subsets. The stringr package from the tidyverse is a popular alternative: all of its functions begin with str_, which makes them easy to remember. We will review some of these functions. Let us start by installing the tidyverse package.

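# The package name must be quoted when installing
install.packages("tidyverse")
library(tidyverse)
# stringr is already attached by library(tidyverse); loading it again is harmless
library(stringr)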

Performing Simple String Operations

Define a String

To create strings in R, you can use single quotes, double quotes, or character(). Note that character() creates a character vector, which you then fill in by index.

# Double quotes
myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"

# Single quotes
myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'

# An empty character vector, filled in by index
myquote = character(0)
myquote[1] = "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
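
As a quick check of the point above, the following minimal sketch (base R only; the variable names are illustrative and do not come from the guide) suggests that single and double quotes create identical strings, and that one quote style can wrap the other.

```r
# Single and double quotes produce the same character value.
single_quoted <- 'Peace is a process'
double_quoted <- "Peace is a process"
same_value <- identical(single_quoted, double_quoted)  # TRUE

# Wrapping in one quote style allows the other style inside the string.
wrapped <- 'He said "peace"'
escaped <- "He said \"peace\""
same_content <- identical(wrapped, escaped)  # TRUE

# Either way, the result is a character vector of length one.
is.character(single_quoted)
length(single_quoted)
```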

Create an Empty String

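An empty string is useful as a placeholder: strings are not fixed, so a value can be assigned later. Note that '' and "" each create a single empty string, while character(0) creates a character vector with no elements.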

myquote = character(0)
myquote <- ''
myquote <- ""
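
The difference between the two kinds of "empty" can be sketched as follows. This is a small base R illustration with made-up variable names, not part of the original guide.

```r
empty_vector <- character(0)  # a character vector with no elements
empty_string <- ""            # a character vector holding one empty string

n_vector <- length(empty_vector)  # 0 elements
n_string <- length(empty_string)  # 1 element
n_chars  <- nchar(empty_string)   # 0 characters in that element

# Assigning by index grows the empty vector, so a value can be supplied later.
empty_vector[1] <- "Peace"
length(empty_vector)  # now 1
```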

Display Length of String

String length needs to be checked for various purposes, such as:

  • Comparing two strings
  • Finding the longest or shortest string
  • Applying formatting to strings

Let us review length(), nchar(), and str_length() from stringr.

length()

> length(myquote)

Output:

[1] 1

Because R stores data as vectors, myquote is a character vector with a single element. length() returns the number of elements in the vector, which is 1 here, and not the number of characters.

nchar()

> nchar(myquote)

Output:

[1] 136

nchar() counts the total number of characters in the string.

str_length()

> str_length(myquote)

Output:

[1] 136

str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
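
To make the distinction concrete, here is a minimal base R sketch with illustrative values that are not from the guide. It contrasts length() with nchar() on a multi-element vector, picks out the longest string, and shows a case where one displayed character is made of two code points. It uses nchar(type = "chars") on the assumption that it counts code points for these UTF-8 strings, as str_length() is described as doing.

```r
words <- c("Peace", "is", "a", "process")

n_elements <- length(words)  # 4 elements in the vector
n_chars    <- nchar(words)   # characters per element: 5 2 1 7

# Finding the longest string by character count.
longest <- words[which.max(n_chars)]  # "process"

# The same visible letter written two ways.
precomposed <- "\u00e9"   # e-acute as a single code point
combining   <- "e\u0301"  # plain e followed by a combining acute accent

nchar(precomposed, type = "chars")  # 1
nchar(combining, type = "chars")    # 2
```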

Combine Two Strings with c() and str_c()

At times, we need to add a string to an existing string. For example, the quote stored in myquote does not include an attribution. Let’s add one as a string.

Add a String Using the c() Combine Function

> c(myquote, "-John F. Kennedy")

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"

The result is a character vector with two elements. The strings are not merged; each element keeps its own character count.
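
The following base R sketch, using a shortened, made-up version of the quote, illustrates the difference between combining strings into a vector with c() and joining them into one string. paste0() is used here as the base R joining function; it is not part of the original example.

```r
quote_text <- "Peace is a daily process"
author     <- "-John F. Kennedy"

# c() keeps the two strings as separate elements.
combined <- c(quote_text, author)
length(combined)  # 2
nchar(combined)   # 24 16

# paste0() joins them into a single string with no separator.
joined <- paste0(quote_text, author)
length(joined)  # 1
nchar(joined)   # 40
```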

Add a String Using the str_c() Combine Function

> str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"
You can use the sep argument to specify the separator placed between the strings, and collapse to join the resulting elements into a single string. str_c() is vectorized: a length-one argument is recycled to match the longest vector.
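
A minimal sketch of recycling and of the collapse argument, assuming the stringr package is installed. The two shortened quotes are illustrative only.

```r
library(stringr)

lines <- c("Peace is a daily process", "Peace is a weekly process")

# The single attribution string is recycled across both elements.
attributed <- str_c(lines, "-John F. Kennedy", sep = " ")
print(attributed)

# collapse joins the elements of the result into one string.
one_string <- str_c(lines, collapse = "; ")
print(one_string)
```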
 
Subset a String

To extract part of a string, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you only need the date. Both functions take the start and end positions of the substring to extract.
 
> substr(myquote, 17, 45)

Output:

[1] ", a weekly, a monthly process"

> str_sub(myquote, start = 17, end = 45)

Output:

[1] ", a weekly, a monthly process"
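
The date-and-time case mentioned above can be sketched in base R like this. The timestamp value and its fixed "YYYY-MM-DD HH:MM:SS" layout are assumptions made for illustration.

```r
# Hypothetical timestamp in a fixed-width layout.
timestamp <- "2019-09-06 14:30:00"

# Characters 1 to 10 hold the date; characters 12 to 19 hold the time.
date_part <- substr(timestamp, 1, 10)
time_part <- substr(timestamp, 12, 19)

# The extracted text can then be converted to a Date object.
as_date <- as.Date(date_part)
print(date_part)
print(time_part)
```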

To split the elements of a string into substrings based on matches to a given pattern:

> strsplit(myquote, "slowly")

Output:

[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"

> str_split(myquote, "slowly")

Output:

[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"

In this example, myquote is split at the word “slowly” into a character vector of two elements. Both functions return a list containing one such vector per input string.
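
Because the result is a list, it usually needs one more step before the pieces can be used. The base R sketch below uses made-up phrases to show the list structure and two common ways to get at the pieces.

```r
phrases <- c("daily, weekly, monthly", "slowly, quietly")

# fixed = TRUE treats the separator as literal text, not a regular expression.
pieces <- strsplit(phrases, ", ", fixed = TRUE)

class(pieces)    # "list": one element per input string
pieces[[1]]      # the pieces of the first string
lengths(pieces)  # number of pieces per string: 3 2

# Flatten everything into a single character vector.
all_pieces <- unlist(pieces)
```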

Find and Replace Functions

To find a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. These differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().

In the example below, gsub() replaces all the spaces with “-”, and str_replace_all() replaces all the “-” with spaces.

> gsub(" ", "-", myquote)

Output:

[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"

> str_replace_all(myquote, "-", " ")

Output:

[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
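
To illustrate how the find functions differ and how sub() compares with gsub(), here is a short base R sketch on a made-up phrase. fixed = TRUE is used so the patterns are matched as literal text.

```r
phrase <- "a daily, a weekly, a monthly process"

# grepl() reports whether the pattern occurs at all.
has_weekly <- grepl("weekly", phrase, fixed = TRUE)

# regexpr() gives the starting position of the first match.
weekly_start <- as.integer(regexpr("weekly", phrase, fixed = TRUE))

# sub() replaces only the first match; gsub() replaces every match.
first_only  <- sub("a ", "one ", phrase, fixed = TRUE)
all_matches <- gsub("a ", "one ", phrase, fixed = TRUE)

print(first_only)   # "one daily, a weekly, a monthly process"
print(all_matches)  # "one daily, one weekly, one monthly process"
```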

Formatting Strings

Now we will discuss formatting. R provides C-style formatting through sprintf(), a wrapper for the C library function of the same name. Let us see an example using sprintf(), which substitutes the given strings or numbers into a format string. The format specifiers used here are %s for a string and %.2f for a fixed-point value with two decimal places. You can find more information in the resources section.

> sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)

Output:

[1] "Your device Thermostat is at 67.70 percent energy efficient"
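
A few more format specifiers, sketched in base R with illustrative values: field width, zero padding, a literal percent sign, and vectorized use.

```r
# Width 5 with one decimal place, followed by a literal percent sign (%%).
padded <- sprintf("%5.1f%%", 67.7)

# Integer padded with leading zeros to width 3.
zero_filled <- sprintf("%03d", 7L)

# sprintf() is vectorized over its arguments.
words <- c("Peace", "process")
described <- sprintf("%s has %d characters", words, nchar(words))

print(padded)
print(zero_filled)
print(described)
```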

Pattern Matching

Let's review regular expressions, a method of describing patterns. For example, if I want to find all states starting with the letter “A” in the USArrests data set, I can set a pattern match as below:

# Install rebus to specify the START and END anchors
install.packages("rebus")
library(rebus)
# Find states starting with the letter A
states = rownames(USArrests)
str_view(states, pattern = START %R% "A")

Similarly, to find all the states ending with “a”:

> str_view(states, pattern = "a" %R% END)
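
For comparison, the same two searches can be sketched in base R with the standard regular expression anchors ^ (start of string) and $ (end of string). This is an alternative to the rebus approach above, not the guide's own method. It returns the matching names and does not highlight them.

```r
states <- rownames(USArrests)

# ^ anchors the pattern at the start of each string.
starts_with_a <- grep("^A", states, value = TRUE)

# $ anchors the pattern at the end of each string.
ends_with_a <- grep("a$", states, value = TRUE)

print(starts_with_a)
print(ends_with_a)
```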

Conclusion

To conclude, this guide provides you with basic functions to get started on string manipulations. I have created a list of a few more functions that you can use; refer to the resources section for further explanations.

Check out the table below for base R functions:

| Task | Function to use |
| --- | --- |
| Convert to uppercase | toupper(x) |
| Convert to lowercase | tolower(x) |
| Join multiple vectors | paste(..., sep = " ", collapse = NULL) |
| Join elements of a vector together | paste(x, collapse = ' ') |
| Find regular expression matches in x; returns a vector of the indices of the elements that contain the pattern | grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) |
| Find regular expression matches in x; returns TRUE if the pattern is found | grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Replace all matches | gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Convert an object to a character vector | as.character(x) |
| Check whether an object is of character type | is.character(x) |
| Abbreviate text | abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE) |
| Find the positions of all matches in each string | gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
| Case folding | casefold(x, upper = FALSE) |
| Character translation | chartr(old, new, x) |
| Find the position of the first match and of its parenthesized sub-expressions | regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
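
A few of the base R functions from the table, applied to made-up inputs as a quick sketch:

```r
word <- "Peace"

upper <- toupper(word)                 # "PEACE"
lower <- tolower(word)                 # "peace"
folded <- casefold(word, upper = TRUE) # same result as toupper()

# chartr() maps each character in `old` to the character at the same
# position in `new`: here a -> 4 and e -> 3.
translated <- chartr("ae", "43", word)

# paste() with collapse joins the elements of one vector.
sentence <- paste(c("Peace", "is", "daily"), collapse = " ")

print(c(upper, lower, folded, translated, sentence))
```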

Check out the table below for stringr functions and their usage:

| Task | Function to use |
| --- | --- |
| Convert to uppercase | str_to_upper(string, locale = "en") |
| Convert to lowercase | str_to_lower(string, locale = "en") |
| Convert to title case | str_to_title(string, locale = "en") |
| Convert to sentence case | str_to_sentence(string, locale = "en") |
| View matches of a pattern | str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA) |
| Duplicate a string | str_dup(string, times) |
| Remove white spaces | str_trim(string, side = c("both", "left", "right")) or str_squish(string) |
| Wrap text | str_wrap(string, width = 80, indent = 0, exdent = 0) |
| Count the matches of a pattern in each string | str_count(string, pattern = "") |
| Specify the encoding of a string | str_conv(string, encoding) |
| Sort a character vector | str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) |
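
A short sketch of some of the stringr functions from the table, assuming the stringr package is installed. The input strings are illustrative only.

```r
library(stringr)

upper <- str_to_upper("peace")               # "PEACE"
title <- str_to_title("peace is a process")  # capitalizes each word

# str_trim() strips whitespace from the chosen side;
# str_squish() also collapses repeated spaces inside the string.
left_trimmed <- str_trim("  peace  ", side = "left")
squished     <- str_squish("  peace   is a   process ")

repeated <- str_dup("ab", 3)          # "ababab"
n_a      <- str_count("banana", "a")  # number of matches of "a"

print(c(upper, title, left_trimmed, squished, repeated))
print(n_a)
```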
