In this guide you will learn about the facilities R provides for working with formatted text files. Working with files is a very common task, especially in R. Most of the time, data scientists have huge amounts of data sitting on network shares or hard disks. Understanding how to access and process these files is crucial, because large volumes of data go hand in hand with long script runtimes. The more efficiently you handle files, the more time your optimized code can save. First we will clarify what is meant by a "formatted text file," then work with the interfaces provided in R.
These formats will be fairly familiar to you:
These types are the most common ones for storing unstructured or structured data. When a common file format holds unstructured data, you need to make sense of it yourself: you have to understand each piece of information in those files and adjust your script accordingly. The situation is much easier when the data is structured. Unstructured data and regular expressions go hand in hand; regular expressions let you parse out meaningful information while keeping resource consumption relatively low. For more information on regular expressions, check out this resource.
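To make the regular-expression point concrete, here is a minimal base R sketch. The sample lines and the pattern are invented for illustration; only the incident-number format (for example B16-00694) is borrowed from the data used later in this guide.

```r
# Hypothetical unstructured lines; the wording is made up for this example.
lines <- c(
  "2016-01-06 gun reported stolen from vehicle, case B16-00694",
  "no case number on this line",
  "2016-01-15 gun reported stolen from residence, case B16-01892"
)

# Pattern for an incident number: one capital letter, two digits,
# a dash, then five digits.
pattern <- "[A-Z][0-9]{2}-[0-9]{5}"

# regexpr() locates the first match in each line;
# regmatches() extracts the matched text and skips lines without a match.
incident_numbers <- regmatches(lines, regexpr(pattern, lines))
print(incident_numbers)
```

Only the lines that contain a match contribute to the result, so the vector can be shorter than the input.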
A default R installation ships only basic readers such as read.csv() and readLines(), but its versatile package repository lets you add richer functionality. You need to grab and install the readtext package. After firing up the R console, issue the following command:
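Assuming the package is installed from CRAN in the usual way, a minimal version of that command could look like this. The requireNamespace() guard is an addition for convenience, not something the installation requires.

```r
# Install readtext from CRAN only if it is not already available.
if (!requireNamespace("readtext", quietly = TRUE)) {
  install.packages("readtext")
}
```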
This installs the latest stable version on your system. If you would like to experiment with the newest version and its functionality, you can install it from GitHub. The following commands will do that for you:
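A sketch of those commands, assuming the standard devtools workflow with install_github() and the quanteda/readtext repository path taken from the URL mentioned in this guide:

```r
# devtools provides install_github(); install it from CRAN first if needed.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# "quanteda/readtext" is the user/repository path on GitHub.
devtools::install_github("quanteda/readtext")
```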
To install bleeding-edge packages you need the devtools package; once it is installed, the line after it does the real work. The repository path quanteda/readtext is appended to http://github.com/, so entering http://github.com/quanteda/readtext in your browser takes you to the source files for the package.
In this section you are going to use US data on stolen guns. It is enough to download some CSV files and place them in the same folder. Spin up the R console and load the readtext package with library(readtext).
Next, set the DATA_DIR variable, which is going to be your workplace. The readtext package comes with some examples that are installed at the package's location.
DATA_DIR <- system.file("extdata/", package = "readtext")
On a Windows machine you should see a similar output:
[1] "C:/Program Files/R/R-3.6.3/library/readtext/extdata"
It has several subfolders like the following:
├───csv
├───json
├───pdf
│   └───UDHR
├───tsv
├───txt
│   ├───EU_manifestos
│   ├───movie_reviews
│   │   ├───neg
│   │   └───pos
│   └───UDHR
└───word
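A listing like the one above can also be produced from inside R. The sketch below uses base R's list.dirs() and assumes readtext is installed; the exact folders depend on the installed package version.

```r
# Locate the example data shipped with readtext
# (system.file() returns "" if the package is not installed).
DATA_DIR <- system.file("extdata/", package = "readtext")

# List every subfolder, with paths relative to DATA_DIR.
subfolders <- list.dirs(DATA_DIR, full.names = FALSE, recursive = TRUE)

# list.dirs() reports the top-level folder itself as "", so drop it.
subfolders <- subfolders[subfolders != ""]
print(subfolders)
```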
If you want to load the data from the word folder, the following needs to be done:
word_data <- readtext(paste0(DATA_DIR, "/word/*"))
word_data contains the following information:
readtext object consisting of 6 documents and 0 docvars.
# Description: df[,2] [6 x 2]
  doc_id                                 text
  <chr>                                  <chr>
1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\r\nRés\"..."
2 21vivant2004.doc                       "\"http://www\"..."
3 21VLD2004.doc                          "\"http://www\"..."
4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..."
5 UK_2015_EccentricParty.docx            "\"The Eccent\"..."
6 UK_2015_LoonyParty.docx                "\"The Offici\"..."
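Since the output describes the result as a df with doc_id and text columns, it can be inspected with ordinary data frame tools. The following is a minimal sketch, assuming readtext is installed; it reads the plain-text txt/UDHR folder from the tree above instead of the Word files, and the variable name udhr is chosen here for illustration.

```r
library(readtext)

# Example data shipped with the package (assumes readtext is installed).
DATA_DIR <- system.file("extdata/", package = "readtext")

# Read the plain-text UDHR files; each file becomes one row.
udhr <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# Inspect the result like any other data frame.
print(names(udhr))                  # column names, including doc_id and text
print(nchar(udhr$text))             # number of characters in each document
print(substr(udhr$text[1], 1, 60))  # first 60 characters of the first document
```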
Now let's get back to the stolenguns folder. You can use the above approach and simply set a DIR_PATH variable to the stolenguns folder, or read each file separately.
gun_data_q1 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-first-quarter-stolen-guns.csv")
gun_data_q2 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-second-quarter-stolen-guns.csv")
gun_data_q3 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-third-quarter-stolen-guns.csv")
gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv")
Each variable will hold something similar.
readtext object consisting of 7 documents and 9 docvars.
# Description: df[,11] [7 x 11]
  doc_id           text      Date   Brand   Model   Color Stolen   Stolen.From Status    Incident.number Agency
  <chr>            <chr>     <chr>  <chr>   <chr>   <chr> <chr>    <chr>       <chr>     <chr>           <chr>
1 2016-first-quar~ "\"P1382~ 01/06~ HI POI~ "9MM"   "BLK" Stolen ~ Vehicle     Recovere~ B16-00694       BPD
2 2016-first-quar~ "\"P1417~ 01/15~ JENNIN~ ""      "COM" Stolen ~ Residence   Not Reco~ B16-01892       BPD
3 2016-first-quar~ "\"P1437~ 01/24~ CENTUR~ "M92"   ""    Stolen ~ Residence   Recovere~ B16-03125       BPD
4 2016-first-quar~ "\"P1470~ 02/08~ TAURUS  "PT740~ ""    Stolen ~ Residence   Not Reco~ B16-05095       BPD
5 2016-first-quar~ "\"P1504~ 02/23~ HIGHPO~ "CARBI~ ""    Stolen ~ Residence   Recovere~ B16-06990       BPD
6 2016-first-quar~ "\"P1504~ 02/23~ RUGAR   ""      ""    Stolen ~ Residence   Recovere~ B16-06990       BPD
# ... with 1 more row
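The folder-based alternative can be sketched with a wildcard, in the same way the word folder was loaded. To keep the example self-contained, it writes two tiny CSV files to a temporary folder first; the folder name, file names, columns, and values are all invented for illustration, and readtext is assumed to be installed.

```r
library(readtext)

# Build a throwaway folder with two small CSV files (hypothetical content).
demo_dir <- file.path(tempdir(), "stolenguns_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Id = c("P1", "P2"), Brand = c("TAURUS", "COBRA")),
          file.path(demo_dir, "first-quarter.csv"), row.names = FALSE)
write.csv(data.frame(Id = "P3", Brand = "SIG SAUER"),
          file.path(demo_dir, "second-quarter.csv"), row.names = FALSE)

# A wildcard makes readtext() pick up every matching file in one call.
gun_data_all <- readtext(paste0(demo_dir, "/*.csv"))

# Each CSV record becomes one row; doc_id records the source file.
print(gun_data_all)
```

With real data, demo_dir would simply be replaced by the folder that holds the downloaded quarterly files.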
There is also an option, called document-level metadata, that lets you customize what is loaded into your data frame. You can derive this metadata from the file names, and it even allows you to name the resulting variables individually. The dvsep argument defines the separator, either a plain string or a regular expression, used to split the file name into parts.
gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv", docvarsfrom = "filenames", dvsep = "_", encoding = "ISO-8859-1")
This should produce the following result.
  doc_id                                text            Date       Brand            Model     Color    Stolen         Stolen.From Status        Incident.number Agency docvar1
  <chr>                                 <chr>           <chr>      <chr>            <chr>     <chr>    <chr>          <chr>       <chr>         <chr>           <chr>  <chr>
1 2016-fourth-quarter-stolen-guns.csv.1 "\"P22093\"..." 10/25/2016 SMITH AND WESSON "SD9VE"   ""       Stolen Locally Vehicle     Not Recovered B16-42866       BPD    2016-fourth-quarter-stolen-guns
2 2016-fourth-quarter-stolen-guns.csv.2 "\"P22183\"..." 10/27/2016 TAURUS           "PT111G2" "BLACK"  Stolen Locally Residence   Not Recovered B16-43134       BPD    2016-fourth-quarter-stolen-guns
3 2016-fourth-quarter-stolen-guns.csv.3 "\"P22497\"..." 11/07/2016 SIG SAUER        "P290"    ""       Stolen Locally Vehicle     Not Recovered B16-44838       BPD    2016-fourth-quarter-stolen-guns
4 2016-fourth-quarter-stolen-guns.csv.4 "\"P22910\"..." 11/18/2016 TAURUS           "85UL"    "SILVER" Stolen Locally Residence   Not Recovered B16-46503       BPD    2016-fourth-quarter-stolen-guns
5 2016-fourth-quarter-stolen-guns.csv.5 "\"P23536\"..." 12/07/2016 SMITH & WESSON   ""        ""       Stolen Locally Vehicle     Not Recovered B16-48692       BPD    2016-fourth-quarter-stolen-guns
6 2016-fourth-quarter-stolen-guns.csv.6 "\"P23657\"..." 12/09/2016 COBRA            ".380"    "BLACK"  Stolen Locally Residence   Not Recovered B16-49060       BPD    2016-fourth-quarter-stolen-guns
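The single docvar1 value above is what you would expect, because the file name contains no underscore for dvsep = "_" to split on. The base R sketch below only mimics that splitting step to show the effect of the separator; it is not readtext's own implementation.

```r
# File name used in the example above.
file_name <- "2016-fourth-quarter-stolen-guns.csv"

# Drop the extension, as the docvar1 value above suggests happens.
base_name <- tools::file_path_sans_ext(file_name)

# Splitting on "_" finds no separator, so the name stays in one piece.
parts_underscore <- strsplit(base_name, "_", fixed = TRUE)[[1]]

# Splitting on "-" yields five pieces.
parts_dash <- strsplit(base_name, "-", fixed = TRUE)[[1]]

print(parts_underscore)
print(parts_dash)
```

Passing dvsep = "-" to readtext() would therefore be expected to produce five docvars instead of one, and the package's docvarnames argument could then be used to give them meaningful names.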
How you use the readtext package depends heavily on the actual format of your data.
In this guide we have seen what facilities R provides for working with common formatted text files. We have covered the prerequisites that help us on our journey and laid the foundation for moving further. I hope this guide has been informative, and I would like to thank you for reading it.