Author avatar

Benney Au

Importing Common Data Types In R

Benney Au

  • Sep 24, 2020
  • 11 Min read
  • 37 Views
  • Sep 24, 2020
  • 11 Min read
  • 37 Views
Data
Data Analytics
Languages and Libraries
R

Introduction

In order to begin a data analysis or modelling project, you first need to fetch and import data into your tool of choice.

In this guide, you will learn to use basic functions in R to work with JSON, CSV, and XML data. This will use data from the Bureau of Transportation Statistics to illustrate the use of these functions. A basic understanding of the R programming language is assumed knowledge for this guide. You can watch a course on Programming with R if you need a refresher.

Importing CSV Files

Comma-separated values (CSV) is a common format for sharing data as it is simple and easy to understand. Generally, each row in the data file represents a new observation. Within each row, different attributes are separated by commas.

The read.csv function is built into the utils package. This means no third-party packages are required to read CSV files in R.

Create the following data.csv in your current working directory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
year,efficiency,sales
1980,24.3,8949000
1985,27.6,10979000
1990,28,9303000
1991,28.4,8185000
1992,27.9,8213000
1993,28.4,8518000
1994,28.3,8991000
1995,28.6,8620000
1996,28.5,8479000
1997,28.7,8217000
1998,28.8,8085000
1999,28.3,8638000
2000,28.5,8778000
2001,28.8,8352000
2002,29,8042000
2003,29.5,7556000
2004,29.5,7483000
2005,30.3,7660000
2006,30.1,7762000
2007,31.2,7562000
2008,31.5,6769000
2009,32.9,5402000
2010,33.9,5636000
2011,33.1,6093000
2012,35.3,7245000
2013,36.4,7586000
2014,36.5,7708000
2015,37.2,7517000
2016,37.7,6873000
2017,39.4,6081000
csv

From your R IDE, you can run the following code to import it:

1
2
3
4
csv <- read.csv('data.csv', header = TRUE)
csv

typeof(csv)
r

Output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
> csv <- read.csv('data.csv', header = TRUE)
> csv
   year efficiency    sales
1  1980       24.3  8949000
2  1985       27.6 10979000
3  1990       28.0  9303000
4  1991       28.4  8185000
5  1992       27.9  8213000
6  1993       28.4  8518000
7  1994       28.3  8991000
8  1995       28.6  8620000
9  1996       28.5  8479000
10 1997       28.7  8217000
11 1998       28.8  8085000
12 1999       28.3  8638000
13 2000       28.5  8778000
14 2001       28.8  8352000
15 2002       29.0  8042000
16 2003       29.5  7556000
17 2004       29.5  7483000
18 2005       30.3  7660000
19 2006       30.1  7762000
20 2007       31.2  7562000
21 2008       31.5  6769000
22 2009       32.9  5402000
23 2010       33.9  5636000
24 2011       33.1  6093000
25 2012       35.3  7245000
26 2013       36.4  7586000
27 2014       36.5  7708000
28 2015       37.2  7517000
29 2016       37.7  6873000
30 2017       39.4  6081000
> typeof(csv)
[1] "list"

Importing JSON Data

JSON is another popular format for sharing data. It allows hierarchical data types but is more verbose than CSV.

To import JSON, create the data.json file in your working directory:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
[
    {"year": 1980, "efficiency": 24.3, "sales": 8949000},
    {"year": 1985, "efficiency": 27.6, "sales": 10979000},
    {"year": 1990, "efficiency": 28, "sales": 9303000},
    {"year": 1991, "efficiency": 28.4, "sales": 8185000},
    {"year": 1992, "efficiency": 27.9, "sales": 8213000},
    {"year": 1993, "efficiency": 28.4, "sales": 8518000},
    {"year": 1994, "efficiency": 28.3, "sales": 8991000},
    {"year": 1995, "efficiency": 28.6, "sales": 8620000},
    {"year": 1996, "efficiency": 28.5, "sales": 8479000},
    {"year": 1997, "efficiency": 28.7, "sales": 8217000},
    {"year": 1998, "efficiency": 28.8, "sales": 8085000},
    {"year": 1999, "efficiency": 28.3, "sales": 8638000},
    {"year": 2000, "efficiency": 28.5, "sales": 8778000},
    {"year": 2001, "efficiency": 28.8, "sales": 8352000},
    {"year": 2002, "efficiency": 29, "sales": 8042000},
    {"year": 2003, "efficiency": 29.5, "sales": 7556000},
    {"year": 2004, "efficiency": 29.5, "sales": 7483000},
    {"year": 2005, "efficiency": 30.3, "sales": 7660000},
    {"year": 2006, "efficiency": 30.1, "sales": 7762000},
    {"year": 2007, "efficiency": 31.2, "sales": 7562000},
    {"year": 2008, "efficiency": 31.5, "sales": 6769000},
    {"year": 2009, "efficiency": 32.9, "sales": 5402000},
    {"year": 2010, "efficiency": 33.9, "sales": 5636000},
    {"year": 2011, "efficiency": 33.1, "sales": 6093000},
    {"year": 2012, "efficiency": 35.3, "sales": 7245000},
    {"year": 2013, "efficiency": 36.4, "sales": 7586000},
    {"year": 2014, "efficiency": 36.5, "sales": 7708000},
    {"year": 2015, "efficiency": 37.2, "sales": 7517000},
    {"year": 2016, "efficiency": 37.7, "sales": 6873000},
    {"year": 2017, "efficiency": 39.4, "sales": 6081000}
  ]
json

There is no built-in method for parsing JSON in R. However, you can install jsonlite, a popular library for converting R data types to and from JSON.

The code below demonstrates how to install this package and use it to import a JSON file into your R environment.

1
2
3
4
install.packages('jsonlite')

json <- jsonlite$fromJSON('data.json')
json
r

The output will be the same as the CSV sample, since the input data is the same.

Importing XML Data

Another common data format is XML. This can be more verbose and therefore lead to larger file sizes. However, if you are working with legacy systems, they may output XML and you will have no ability to change it.

Create the following xml file data.xml:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <element>
      <efficiency>24.3</efficiency>
      <sales>8949000</sales>
      <year>1980</year>
   </element>
   <element>
      <efficiency>27.6</efficiency>
      <sales>10979000</sales>
      <year>1985</year>
   </element>
   <element>
      <efficiency>28</efficiency>
      <sales>9303000</sales>
      <year>1990</year>
   </element>
   <element>
      <efficiency>28.4</efficiency>
      <sales>8185000</sales>
      <year>1991</year>
   </element>
   <element>
      <efficiency>27.9</efficiency>
      <sales>8213000</sales>
      <year>1992</year>
   </element>
   <element>
      <efficiency>28.4</efficiency>
      <sales>8518000</sales>
      <year>1993</year>
   </element>
   <element>
      <efficiency>28.3</efficiency>
      <sales>8991000</sales>
      <year>1994</year>
   </element>
   <element>
      <efficiency>28.6</efficiency>
      <sales>8620000</sales>
      <year>1995</year>
   </element>
   <element>
      <efficiency>28.5</efficiency>
      <sales>8479000</sales>
      <year>1996</year>
   </element>
   <element>
      <efficiency>28.7</efficiency>
      <sales>8217000</sales>
      <year>1997</year>
   </element>
   <element>
      <efficiency>28.8</efficiency>
      <sales>8085000</sales>
      <year>1998</year>
   </element>
   <element>
      <efficiency>28.3</efficiency>
      <sales>8638000</sales>
      <year>1999</year>
   </element>
   <element>
      <efficiency>28.5</efficiency>
      <sales>8778000</sales>
      <year>2000</year>
   </element>
   <element>
      <efficiency>28.8</efficiency>
      <sales>8352000</sales>
      <year>2001</year>
   </element>
   <element>
      <efficiency>29</efficiency>
      <sales>8042000</sales>
      <year>2002</year>
   </element>
   <element>
      <efficiency>29.5</efficiency>
      <sales>7556000</sales>
      <year>2003</year>
   </element>
   <element>
      <efficiency>29.5</efficiency>
      <sales>7483000</sales>
      <year>2004</year>
   </element>
   <element>
      <efficiency>30.3</efficiency>
      <sales>7660000</sales>
      <year>2005</year>
   </element>
   <element>
      <efficiency>30.1</efficiency>
      <sales>7762000</sales>
      <year>2006</year>
   </element>
   <element>
      <efficiency>31.2</efficiency>
      <sales>7562000</sales>
      <year>2007</year>
   </element>
   <element>
      <efficiency>31.5</efficiency>
      <sales>6769000</sales>
      <year>2008</year>
   </element>
   <element>
      <efficiency>32.9</efficiency>
      <sales>5402000</sales>
      <year>2009</year>
   </element>
   <element>
      <efficiency>33.9</efficiency>
      <sales>5636000</sales>
      <year>2010</year>
   </element>
   <element>
      <efficiency>33.1</efficiency>
      <sales>6093000</sales>
      <year>2011</year>
   </element>
   <element>
      <efficiency>35.3</efficiency>
      <sales>7245000</sales>
      <year>2012</year>
   </element>
   <element>
      <efficiency>36.4</efficiency>
      <sales>7586000</sales>
      <year>2013</year>
   </element>
   <element>
      <efficiency>36.5</efficiency>
      <sales>7708000</sales>
      <year>2014</year>
   </element>
   <element>
      <efficiency>37.2</efficiency>
      <sales>7517000</sales>
      <year>2015</year>
   </element>
   <element>
      <efficiency>37.7</efficiency>
      <sales>6873000</sales>
      <year>2016</year>
   </element>
   <element>
      <efficiency>39.4</efficiency>
      <sales>6081000</sales>
      <year>2017</year>
   </element>
</root>
xml

There are a few different ways to work with XML. The XML package is built in, but some versions of R have memory leaks. Further, using this package can lead to quite verbose code.

A simpler solution is using XML2, which is more focused on extracting values out of XML documents.

The code below demonstrates how to extract XML values using the XML2 package into a data structure we can analyze in R.

1
2
3
4
5
6
7
8
9
10
11
12
install.packages('xml2')

parse_xml <- function (file) {
  doc <- xml2::read_xml(file)
  efficiency <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/efficiency"))
  sales <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/sales"))
  year <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/year"))
  cbind(efficiency, sales, year)
}

data <- parse_xml('data.xml')
data
r

The code snippet above creates a function to extract the sales, efficiency, and year columns. Then cbind is used to combine all these columns together, producing a table. This code uses xpath queries to search for the given values. If your XML document is complex, it can take a bit of trial and error to query the values you want. For this reason, the XML data format is typically not the preferred way to share data.

This parsing logic is encapsulated in the parse_xml function. This is also helpful for keeping the global environment clean of temporary variables.

Conclusion

Data analysts need to be comfortable importing different data formats to effectively undertake data modelling and analysis projects. This guide has shown you how to import three common formats: XML, JSON and CSV. To further build on these skills, you can watch this course on Querying and Converting Data Types in R.

0