Working with Formatted Text Files in R

By Dániel Szabó

May 4, 2020 • 8 Minute Read

Introduction

In this guide you will learn about the facilities R provides for working with formatted text files. Working with files is a very common task, especially in R. Most of the time, data scientists keep huge amounts of data on network shares or hard disks. Understanding how to access and process these files is crucial, because large volumes of data mean long script runtimes. The more efficiently you handle files, the more time your code saves. First we will clarify what is meant by a "formatted text file," then work with the interfaces provided in R.

Common Text Files

These formats will be fairly familiar to you:

TXT
CSV
JSON (JavaScript Object Notation)
XML (Extensible Markup Language)

These types are the most common ones for storing unstructured or structured data. When a file in a common format holds unstructured data, you need to make sense of it yourself: you have to understand each piece of information in the file and adjust your app accordingly. The situation is much easier when the data is structured. Unstructured data and regular expressions go hand in hand; regular expressions let you parse out meaningful information while keeping resource consumption relatively low. For more information on regular expressions, check out this resource.
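
As a minimal sketch of parsing unstructured text with regular expressions, the base R snippet below pulls identifiers out of free-text lines. The sample lines and the pattern are assumptions made up for illustration; they are not taken from any real data set.

```r
# Hypothetical free-text lines, invented for illustration only.
notes <- c(
  "Report B16-00694: pistol taken from vehicle on 01/06/2016",
  "Follow-up on B16-01892, residence burglary, 01/15/2016",
  "No incident number recorded"
)

# Assumed identifier shape: one capital letter, two digits, a hyphen, five digits.
pattern <- "[A-Z][0-9]{2}-[0-9]{5}"

# regexpr() locates the first match in each line;
# regmatches() extracts the matched text and skips lines without a match.
hits <- regmatches(notes, regexpr(pattern, notes))
print(hits)

# grepl() reports which lines matched at all.
has_id <- grepl(pattern, notes)
print(has_id)
```

Only the lines that contain a match contribute to hits, so it can be shorter than the input; has_id keeps one value per input line.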

Prerequisite

Base R can already read plain text and CSV files (for example with readLines() and read.csv()), but it has no single function that handles all of the formats above. Its versatile package repository lets you add this functionality, and this guide uses the readtext package. After firing up the R console, issue the following command:

      install.packages("readtext")

This installs the latest stable version on your system. If you would like to experiment with the newest development version and its functionality, you can install it from GitHub. The following commands will do that for you:

      install.packages("devtools")
      devtools::install_github("quanteda/readtext")

To install bleeding-edge packages you first need the devtools package; the second line then does the actual installation. The argument "quanteda/readtext" is appended to https://github.com/, so opening https://github.com/quanteda/readtext in your browser takes you to the source files for the package.
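
The relationship between the repository string and the GitHub URL can be sketched in a few lines of base R. The snippet also checks whether the package is already available; it assumes the package name equals the repository name, which holds here but is not guaranteed for every GitHub repository.

```r
# The "owner/repository" string passed to install_github().
repo <- "quanteda/readtext"

# Appending it to the GitHub base URL gives the page you can open in a browser.
repo_url <- paste0("https://github.com/", repo)
print(repo_url)

# Assumption: the package is named after the repository (the part after the slash).
pkg <- basename(repo)

# requireNamespace() returns TRUE if the package can be loaded, without attaching it.
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is already installed")
} else {
  message(pkg, " is not installed yet")
}
```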

Action

In this section you are going to use publicly available US data about stolen guns. It is enough to download some CSV files and place them in the same folder. Spin up the R console and load the readtext library.

      library(readtext)

First, set the DATA_DIR variable, which is going to be your working location. When you install the readtext package, it comes with some example files that are installed at the package's location.

      DATA_DIR <- system.file("extdata/", package = "readtext")

On a Windows machine you should see a similar output:

      [1] "C:/Program Files/R/R-3.6.3/library/readtext/extdata"

It has several subfolders like the following:

      ├───csv
      ├───json
      ├───pdf
      │   └───UDHR
      ├───tsv
      ├───txt
      │   ├───EU_manifestos
      │   ├───movie_reviews
      │   │   ├───neg
      │   │   └───pos
      │   └───UDHR
      └───word
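
A listing of the top-level folders shown above could be produced from R itself. The sketch below uses only base R functions and guards against the case where readtext is not installed, in which system.file() returns an empty string.

```r
# system.file() returns "" when the package or path cannot be found.
data_dir <- system.file("extdata", package = "readtext")

if (nzchar(data_dir)) {
  # Top-level folders only, shown by name rather than by full path.
  top_level <- list.dirs(data_dir, full.names = FALSE, recursive = FALSE)
  print(top_level)
} else {
  message("readtext does not appear to be installed")
}
```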
    

If you want to load the data from the word folder, the following needs to be done:

      word_data <- readtext(paste0(DATA_DIR, "/word/*"))

Now word_data contains the following information:

      readtext object consisting of 6 documents and 0 docvars.
      # Description: df[,2] [6 x 2]
        doc_id                                 text                 
        <chr>                                  <chr>                
      1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\r\nRés\"..."
      2 21vivant2004.doc                       "\"http://www\"..."  
      3 21VLD2004.doc                          "\"http://www\"..."  
      4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..."  
      5 UK_2015_EccentricParty.docx            "\"The Eccent\"..."  
      6 UK_2015_LoonyParty.docx                "\"The Offici\"..."
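
Before handing a wildcard pattern to readtext, it can help to preview which files it matches. The sketch below does this with base R's Sys.glob() on a throwaway folder; the folder and file names are placeholders, and readtext's own wildcard handling may differ in edge cases.

```r
# Build a throwaway folder that mimics DATA_DIR/word (file names are placeholders).
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(file.path(demo_dir, "word"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "word", c("a.doc", "b.docx")))

# Same pattern shape as in the readtext() call above.
pattern <- paste0(demo_dir, "/word/*")

# Sys.glob() expands the wildcard into the matching paths.
matched <- Sys.glob(pattern)
print(basename(matched))

# file.path() is an alternative to paste0() for building the same pattern.
pattern2 <- file.path(demo_dir, "word", "*")
print(identical(pattern, pattern2))
```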
    

Now let's get back to the stolenguns folder. You can use the same wildcard approach by pointing DATA_DIR at the stolenguns folder, or you can read each file separately.

      gun_data_q1 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-first-quarter-stolen-guns.csv")
      gun_data_q2 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-second-quarter-stolen-guns.csv")
      gun_data_q3 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-third-quarter-stolen-guns.csv")
      gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv")

Each variable will hold something similar.

      readtext object consisting of 7 documents and 9 docvars.
      # Description: df[,11] [7 x 11]
        doc_id           text      Date   Brand   Model   Color Stolen   Stolen.From Status    Incident.number Agency
        <chr>            <chr>     <chr>  <chr>   <chr>   <chr> <chr>    <chr>       <chr>     <chr>           <chr> 
      1 2016-first-quar~ "\"P1382~ 01/06~ HI POI~ "9MM"   "BLK" Stolen ~ Vehicle     Recovere~ B16-00694       BPD   
      2 2016-first-quar~ "\"P1417~ 01/15~ JENNIN~ ""      "COM" Stolen ~ Residence   Not Reco~ B16-01892       BPD   
      3 2016-first-quar~ "\"P1437~ 01/24~ CENTUR~ "M92"   ""    Stolen ~ Residence   Recovere~ B16-03125       BPD   
      4 2016-first-quar~ "\"P1470~ 02/08~ TAURUS  "PT740~ ""    Stolen ~ Residence   Not Reco~ B16-05095       BPD   
      5 2016-first-quar~ "\"P1504~ 02/23~ HIGHPO~ "CARBI~ ""    Stolen ~ Residence   Recovere~ B16-06990       BPD   
      6 2016-first-quar~ "\"P1504~ 02/23~ RUGAR   ""      ""    Stolen ~ Residence   Recovere~ B16-06990       BPD   
      # ... with 1 more row
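
If the goal is one table covering several quarters rather than separate variables, one possible approach is to read every CSV in a folder and stack the results. The sketch below uses base R's read.csv() instead of readtext, and it runs on two tiny CSV files with made-up rows and a reduced set of columns; the helper read_one and the source_file column are hypothetical names introduced for illustration.

```r
# Two tiny CSV files with made-up rows, standing in for the quarterly downloads.
guns_dir <- file.path(tempdir(), "stolenguns_demo")
dir.create(guns_dir, showWarnings = FALSE)
writeLines(c("Date,Brand,Status",
             "01/06/2016,BRAND A,Recovered"),
           file.path(guns_dir, "2016-first-quarter-stolen-guns.csv"))
writeLines(c("Date,Brand,Status",
             "04/02/2016,BRAND B,Not Recovered",
             "05/11/2016,BRAND C,Recovered"),
           file.path(guns_dir, "2016-second-quarter-stolen-guns.csv"))

# Find every CSV file in the folder.
csv_files <- list.files(guns_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and record which file each row came from.
read_one <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df$source_file <- basename(path)
  df
}

# Stack the per-file data frames into a single data frame.
all_guns <- do.call(rbind, lapply(csv_files, read_one))

print(nrow(all_guns))
print(table(all_guns$source_file))
```

Stacking with rbind() requires every file to have the same columns, which is a reasonable expectation for quarterly exports from one source but worth verifying on real data.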
    

You can also customize what is loaded into your data frame through document-level metadata (docvars). Docvars can be taken from the file names, and you can even name them individually. The dvsep argument defines the separator, given as a regular-expression character string, that is used to split the file name.

      gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv", docvarsfrom = "filenames", dvsep = "_", encoding = "ISO-8859-1")

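This should produce the following result. Because the file name contains no underscore, dvsep = "_" leaves it unsplit, and the whole name ends up in a single docvar1 column.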

      doc_id                                text            Date       Brand            Model     Color    Stolen         Stolen.From Status        Incident.number Agency docvar1
      <chr>                                 <chr>           <chr>      <chr>            <chr>     <chr>    <chr>          <chr>       <chr>         <chr>           <chr>  <chr>
      2016-fourth-quarter-stolen-guns.csv.1 "\"P22093\"..." 10/25/2016 SMITH AND WESSON "SD9VE"   ""       Stolen Locally Vehicle     Not Recovered B16-42866       BPD    2016-fourth-quarter-stolen-guns
      2016-fourth-quarter-stolen-guns.csv.2 "\"P22183\"..." 10/27/2016 TAURUS           "PT111G2" "BLACK"  Stolen Locally Residence   Not Recovered B16-43134       BPD    2016-fourth-quarter-stolen-guns
      2016-fourth-quarter-stolen-guns.csv.3 "\"P22497\"..." 11/07/2016 SIG SAUER        "P290"    ""       Stolen Locally Vehicle     Not Recovered B16-44838       BPD    2016-fourth-quarter-stolen-guns
      2016-fourth-quarter-stolen-guns.csv.4 "\"P22910\"..." 11/18/2016 TAURUS           "85UL"    "SILVER" Stolen Locally Residence   Not Recovered B16-46503       BPD    2016-fourth-quarter-stolen-guns
      2016-fourth-quarter-stolen-guns.csv.5 "\"P23536\"..." 12/07/2016 SMITH & WESSON   ""        ""       Stolen Locally Vehicle     Not Recovered B16-48692       BPD    2016-fourth-quarter-stolen-guns
      2016-fourth-quarter-stolen-guns.csv.6 "\"P23657\"..." 12/09/2016 COBRA            ".380"    "BLACK"  Stolen Locally Residence   Not Recovered B16-49060       BPD    2016-fourth-quarter-stolen-guns
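
To see how the choice of separator affects what can be extracted from a file name, the splitting idea can be mimicked in base R. This is only a sketch: it uses strsplit() as a stand-in and makes no claim about how readtext implements dvsep internally.

```r
# File name taken from the example above.
path <- "C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv"

# Drop the folder and the extension.
stem <- tools::file_path_sans_ext(basename(path))

# The name contains no "_", so splitting on it returns the name as one piece.
by_underscore <- strsplit(stem, "_")[[1]]
print(by_underscore)

# Splitting on "-" yields one piece per hyphen-separated part.
by_hyphen <- strsplit(stem, "-")[[1]]
print(by_hyphen)
```

If the hyphen-separated parts are what you want as docvars, passing dvsep = "-" would be the natural thing to try.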
    

How you use the readtext package depends heavily on how your data is actually formatted.

Conclusion

In this guide we have seen what facilities R provides for working with common formatted text files. We looked at the prerequisites that help us on our journey and laid the foundation for moving further. I hope this guide has been informative, and I would like to thank you for reading it.
