Jaya Trivedi

# Manipulating String Data in R

• Sep 6, 2019
• 3,597 Views
Data
R

## Introduction

This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so you’ll need string manipulation for data analysis or mining. Base R provides built-in functions for converting case, combining strings, measuring length, and extracting subsets. The stringr package from the tidyverse is a popular alternative because all of its functions begin with str_ and are easy to remember; we will review some of these functions. Let us start by installing the tidyverse package.

```r
# Install the tidyverse (which includes stringr), then load it
install.packages("tidyverse")
library(tidyverse)
library(stringr)
```

## Performing Simple String Operations

### Define a String

To create a string in R, you can use single quotes or double quotes. You can also use character(), which creates a vector of type character whose elements you can then fill in.

```r
# Double quotes
myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"

# Single quotes
myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'

# An empty character vector, filled in afterwards
myquote <- character(0)
myquote[1] <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
```

### Create an Empty String

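You can create an empty string, or an empty character vector, as a placeholder and assign a value to it later. Note that character(0) creates a character vector of length zero, whereas empty quotes create a vector holding one empty string.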

```r
myquote <- character(0)  # empty character vector (length 0)
myquote <- ''            # empty string, single quotes
myquote <- ""            # empty string, double quotes
```

### Display Length of String

String length needs to be checked for various purposes, such as:

• Comparing two strings
• Finding the longest or shortest string
• Applying a format to strings

Let us review length(), nchar(), and str_length() from stringr.

#### length()

```r
length(myquote)
```

Output:

```
[1] 1
```

Because R stores data as vectors, length() returns the number of elements in the vector, not the number of characters. Here `myquote` is a vector holding a single string, so the result is 1.

#### nchar()

```r
nchar(myquote)
```

Output:

```
[1] 136
```

nchar() counts the total number of characters in the string.

#### str_length()

```r
str_length(myquote)
```

Output:

```
[1] 136
```

str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
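The difference between length() and nchar() is easier to see on a vector with more than one element. The following is a minimal base R sketch; the vector `words` is a hypothetical example and does not come from the guide.

```r
# Hypothetical example vector with three elements, one of them empty
words <- c("peace", "process", "")

n_elements <- length(words)   # number of elements in the vector
n_chars    <- nchar(words)    # number of characters in each element

n_elements            # 3
n_chars               # 5 7 0
length(character(0))  # 0: an empty vector has no elements
nchar("")             # 0: an empty string has no characters
```

This also shows why character(0) and "" are not interchangeable: the first has length 0, the second has length 1.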

### Combine Two Strings with c() and str_c()

At times, we need to add a string to an existing string. For example, the quote stored in `myquote` above does not contain a name or identifier. Let’s try to add one as a string.

#### Add a String Using the c() Combine Function

```r
myquote <- c(myquote, "-John F. Kennedy")
myquote  # print the result
```

Output:

```
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"
```

c() does not join the text into one string. It returns a character vector with two elements, each with its own character count.

#### Add a String Using the str_c() Combine Function

```r
str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)
```

Output:

```
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"
```

You can use the sep argument to specify the separator placed between the strings. Because str_c() is vectorized, a shorter argument is recycled to the length of the longest one.
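The roles of sep, collapse, and recycling can be sketched with base R's paste(), which takes the same two arguments. This is an illustration under assumptions: `parts` is a hypothetical vector, and the exact recycling rules of str_c() may be stricter than those of paste() depending on your stringr version.

```r
# Hypothetical example vector
parts <- c("daily", "weekly", "monthly")

# The length-1 string "a" is recycled against all three elements
labelled <- paste("a", parts, sep = " ")
labelled     # "a daily" "a weekly" "a monthly"

# collapse joins the elements of one vector into a single string
one_string <- paste(parts, collapse = ", ")
one_string   # "daily, weekly, monthly"

# paste0() is shorthand for paste(..., sep = "")
attributed <- paste0("Peace", "-John F. Kennedy")
attributed   # "Peace-John F. Kennedy"
```

In short, sep goes between the arguments, while collapse goes between the elements of the result.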

### Subset a String

To extract part of a string, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you need only the date. Both functions take the start and end positions of the substring to extract.

```r
substr(myquote, 17, 45)
```

Output:

```
[1] ", a weekly, a monthly process" ""
```

```r
str_sub(myquote, start = 17, end = 45)
```

Output:

```
[1] ", a weekly, a monthly process"
```
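The date-and-time case mentioned above can be sketched with substr(). The timestamp below is a hypothetical value in a fixed-width "YYYY-MM-DD HH:MM:SS" layout, which is an assumption of this example; real data may need a proper date parser instead.

```r
# Hypothetical timestamp stored as a single string
stamp <- "2019-09-06 14:30:00"

date_part <- substr(stamp, 1, 10)   # characters 1 to 10
time_part <- substr(stamp, 12, 19)  # characters 12 to 19, skipping the space

date_part  # "2019-09-06"
time_part  # "14:30:00"
```

Positions are 1-based and the end position is inclusive, so substr(stamp, 1, 10) returns exactly ten characters.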

To split a string into substrings wherever a given pattern matches, use strsplit() or str_split():

```r
strsplit(myquote, "slowly")
```

Output:

```
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"
```

```r
str_split(myquote, "slowly")
```

Output:

```
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"
```

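In this example, the string `myquote` is split at the word “slowly” into a character vector of two elements. Both functions return a list with one entry per input string, so this vector is the first entry of that list.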
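Because strsplit() returns a list, you usually index into it to get at the pieces. Here is a minimal base R sketch; `sentence` is a hypothetical string, and fixed = TRUE is used so the separator is treated as literal text rather than a regular expression.

```r
# Hypothetical input string
sentence <- "daily, weekly, monthly"

pieces <- strsplit(sentence, ", ", fixed = TRUE)
class(pieces)          # "list"

first_pieces <- pieces[[1]]  # the character vector for the first input
first_pieces           # "daily" "weekly" "monthly"

# With several inputs you get one list entry per string;
# unlist() flattens them into a single character vector
flat <- unlist(strsplit(c("a-b", "c-d-e"), "-", fixed = TRUE))
flat                   # "a" "b" "c" "d" "e"
```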

### Find and Replace Functions

To find a pattern in a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. These differ in the format and level of detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().

In the example below, gsub() replaces every space with “-”, and str_replace_all() replaces every “-” with a space.

```r
gsub(" ", "-", myquote)
```

Output:

```
[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"
```

```r
str_replace_all(myquote, "-", " ")
```

Output:

```
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
```
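The differences between grep() and grepl(), and between sub() and gsub(), can be sketched on a small vector. The vector `phrases` is a hypothetical example, not data from the guide.

```r
# Hypothetical example vector
phrases <- c("old barriers", "new structures", "old opinions, old habits")

hit_index <- grep("old", phrases)    # indices of the elements that match
hit_flag  <- grepl("old", phrases)   # one TRUE/FALSE per element
hit_index  # 1 3
hit_flag   # TRUE FALSE TRUE

# Position of the first match in each element; -1 means no match
first_pos <- as.integer(regexpr("old", phrases))
first_pos  # 1 -1 1

first_only <- sub("old", "new", phrases)   # replaces the first match only
all_hits   <- gsub("old", "new", phrases)  # replaces every match
first_only[3]  # "new opinions, old habits"
all_hits[3]    # "new opinions, new habits"
```

None of these functions modify `phrases` itself; each returns a new value that you assign if you want to keep it.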

## Formatting Strings

Now we will discuss formatting. R provides C-style string formatting through sprintf(), a wrapper for the C library function of the same name. It returns a string in which each format specifier is replaced by the corresponding string or number. The specifiers used here are %s for a string and %.2f for a fixed-point value with two decimal places. You can find more information in the resources section.

```r
sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)
```

Output:

```
[1] "Your device Thermostat is at 67.70 percent energy efficient"
```
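A few other common specifiers follow the same pattern. The sketch below is illustrative only; the values and device names are made up for this example.

```r
count_text  <- sprintf("%d items", 3L)      # %d formats an integer
padded_num  <- sprintf("%5.1f", 3.14159)    # width 5, one decimal place
zero_padded <- sprintf("%03d", 7L)          # zero-padded to width 3

count_text   # "3 items"
padded_num   # "  3.1"
zero_padded  # "007"

# sprintf() is vectorized, and %% prints a literal percent sign
devices <- c("Thermostat", "Heater")        # hypothetical names
report  <- sprintf("%s: %.1f%%", devices, c(67.7, 54.25))
report       # "Thermostat: 67.7%" "Heater: 54.2%"
```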

### Pattern Matching

Let's review regular expressions, a method of describing patterns in text. For example, if I want to find all states starting with the letter “A” in the USArrests data set, I can set up a pattern match as below:

```r
# Install rebus to specify the anchors START and END
install.packages("rebus")
library(rebus)

# Find states starting with the letter A
states <- rownames(USArrests)
str_view(states, pattern = START %R% "A")
```

Similarly, to find all the states ending with “a”:

```r
str_view(states, pattern = "a" %R% END)
```
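The same two searches can be sketched in base R without rebus, using the regular expression anchors ^ (start of string) and $ (end of string) that START and END stand for. The variable names here are illustrative; grep() with value = TRUE returns the matching strings rather than their indices.

```r
# USArrests ships with R; its row names are the 50 US states
states <- rownames(USArrests)

# "^A" matches an "A" at the start of the string
starts_with_a <- grep("^A", states, value = TRUE)
starts_with_a   # "Alabama" "Alaska" "Arizona" "Arkansas"

# "a$" matches an "a" at the end of the string
ends_with_a <- grep("a$", states, value = TRUE)
head(ends_with_a, 3)   # "Alabama" "Alaska" "Arizona"
```

Matching is case sensitive by default, which is why "^A" and "a$" use different cases.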

## Conclusion

To conclude, this guide provides you with basic functions to get started on string manipulations. I have created a list of a few more functions that you can use; refer to the resources section for further explanations.

Check out the table below for base R functions:

| Operation | Function |
| --- | --- |
| Convert to uppercase | `toupper(x)` |
| Convert to lowercase | `tolower(x)` |
| Join multiple vectors | `paste(..., sep = " ", collapse = NULL)` |
| Join elements of a vector together | `paste(x, collapse = " ")` |
| Find regular expression matches in x; returns the indices of the elements that contain the pattern | `grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)` |
| Find regular expression matches in x; returns TRUE where the pattern is found | `grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)` |
| Replace all matches | `gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)` |
| Convert an object to a character vector | `as.character(x)` |
| Check whether an object is of type character | `is.character(x)` |
| Abbreviate text | `abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE)` |
| Find the positions of all matches, enabling retrieval of matching substrings | `gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)` |
| Case folding | `casefold(x, upper = FALSE)` |
| Character translation | `chartr(old, new, x)` |
| Find the first match and any parenthesized sub-expressions; returns a list of the same length as text | `regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)` |
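As a quick illustration of the case-conversion and translation entries in the table, here is a minimal sketch with made-up input strings.

```r
upper_text  <- toupper("peace")                  # convert to uppercase
lower_text  <- tolower("PEACE")                  # convert to lowercase
folded_text <- casefold("Peace", upper = TRUE)   # same as toupper() here

upper_text   # "PEACE"
lower_text   # "peace"
folded_text  # "PEACE"

# chartr() maps each character in `old` to the character
# at the same position in `new`
translated <- chartr("abc", "xyz", "aabbcc")
translated   # "xxyyzz"
```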

Check out the table below for stringr functions and their usage:

| Operation | Function |
| --- | --- |
| Convert to uppercase | `str_to_upper(string, locale = "en")` |
| Convert to lowercase | `str_to_lower(string, locale = "en")` |
| Convert to title case | `str_to_title(string, locale = "en")` |
| Convert to sentence case | `str_to_sentence(string, locale = "en")` |
| View pattern matches in a string | `str_view(string, pattern, match = NA)` or `str_view_all(string, pattern, match = NA)` |
| Duplicate a string | `str_dup(string, times)` |
| Remove white space | `str_trim(string, side = c("both", "left", "right"))` or `str_squish(string)` |
| Wrap text | `str_wrap(string, width = 80, indent = 0, exdent = 0)` |
| Count the matches of a pattern in each string | `str_count(string, pattern = "")` |
| Specify the encoding of a string | `str_conv(string, encoding)` |
| Sort a character vector | `str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...)` |

Check out my guides on visualizations with R:
