0
In order to begin a data analysis or modelling project, you first need to fetch and import data into your tool of choice.
In this guide, you will learn to use basic functions in R to work with JSON, CSV, and XML data. This will use data from the Bureau of Transportation Statistics to illustrate the use of these functions. A basic understanding of the R programming language is assumed knowledge for this guide. You can watch a course on Programming with R if you need a refresher.
Comma-separated values (CSV) is a common format for sharing data as it is simple and easy to understand. Generally, each row in the data file represents a new observation. Within each row, different attributes are separated by commas.
The read.csv
function is built into the utils
package.
This means no third-party packages are required to read CSV files in R.
Create the following data.csv
in your current working directory.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
year,efficiency,sales 1980,24.3,8949000 1985,27.6,10979000 1990,28,9303000 1991,28.4,8185000 1992,27.9,8213000 1993,28.4,8518000 1994,28.3,8991000 1995,28.6,8620000 1996,28.5,8479000 1997,28.7,8217000 1998,28.8,8085000 1999,28.3,8638000 2000,28.5,8778000 2001,28.8,8352000 2002,29,8042000 2003,29.5,7556000 2004,29.5,7483000 2005,30.3,7660000 2006,30.1,7762000 2007,31.2,7562000 2008,31.5,6769000 2009,32.9,5402000 2010,33.9,5636000 2011,33.1,6093000 2012,35.3,7245000 2013,36.4,7586000 2014,36.5,7708000 2015,37.2,7517000 2016,37.7,6873000 2017,39.4,6081000
From your R IDE, you can run the following code to import it:
1 2 3 4
csv <- read.csv('data.csv', header = TRUE) csv typeof(csv)
Output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
> csv <- read.csv('data.csv', header = TRUE) > csv year efficiency sales 1 1980 24.3 8949000 2 1985 27.6 10979000 3 1990 28.0 9303000 4 1991 28.4 8185000 5 1992 27.9 8213000 6 1993 28.4 8518000 7 1994 28.3 8991000 8 1995 28.6 8620000 9 1996 28.5 8479000 10 1997 28.7 8217000 11 1998 28.8 8085000 12 1999 28.3 8638000 13 2000 28.5 8778000 14 2001 28.8 8352000 15 2002 29.0 8042000 16 2003 29.5 7556000 17 2004 29.5 7483000 18 2005 30.3 7660000 19 2006 30.1 7762000 20 2007 31.2 7562000 21 2008 31.5 6769000 22 2009 32.9 5402000 23 2010 33.9 5636000 24 2011 33.1 6093000 25 2012 35.3 7245000 26 2013 36.4 7586000 27 2014 36.5 7708000 28 2015 37.2 7517000 29 2016 37.7 6873000 30 2017 39.4 6081000 > typeof(csv) [1] "list"
JSON is another popular format for sharing data. It allows hierarchical data types but is more verbose than CSV.
To import JSON, create the data.json
file in your working directory:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[ {"year": 1980, "efficiency": 24.3, "sales": 8949000}, {"year": 1985, "efficiency": 27.6, "sales": 10979000}, {"year": 1990, "efficiency": 28, "sales": 9303000}, {"year": 1991, "efficiency": 28.4, "sales": 8185000}, {"year": 1992, "efficiency": 27.9, "sales": 8213000}, {"year": 1993, "efficiency": 28.4, "sales": 8518000}, {"year": 1994, "efficiency": 28.3, "sales": 8991000}, {"year": 1995, "efficiency": 28.6, "sales": 8620000}, {"year": 1996, "efficiency": 28.5, "sales": 8479000}, {"year": 1997, "efficiency": 28.7, "sales": 8217000}, {"year": 1998, "efficiency": 28.8, "sales": 8085000}, {"year": 1999, "efficiency": 28.3, "sales": 8638000}, {"year": 2000, "efficiency": 28.5, "sales": 8778000}, {"year": 2001, "efficiency": 28.8, "sales": 8352000}, {"year": 2002, "efficiency": 29, "sales": 8042000}, {"year": 2003, "efficiency": 29.5, "sales": 7556000}, {"year": 2004, "efficiency": 29.5, "sales": 7483000}, {"year": 2005, "efficiency": 30.3, "sales": 7660000}, {"year": 2006, "efficiency": 30.1, "sales": 7762000}, {"year": 2007, "efficiency": 31.2, "sales": 7562000}, {"year": 2008, "efficiency": 31.5, "sales": 6769000}, {"year": 2009, "efficiency": 32.9, "sales": 5402000}, {"year": 2010, "efficiency": 33.9, "sales": 5636000}, {"year": 2011, "efficiency": 33.1, "sales": 6093000}, {"year": 2012, "efficiency": 35.3, "sales": 7245000}, {"year": 2013, "efficiency": 36.4, "sales": 7586000}, {"year": 2014, "efficiency": 36.5, "sales": 7708000}, {"year": 2015, "efficiency": 37.2, "sales": 7517000}, {"year": 2016, "efficiency": 37.7, "sales": 6873000}, {"year": 2017, "efficiency": 39.4, "sales": 6081000} ]
There is no built-in method for parsing JSON in R. However, you can install jsonlite, a popular library for converting R data types to and from JSON.
The code below demonstrates how to install this package and use it to import a JSON file into your R environment.
1 2 3 4
install.packages('jsonlite') json <- jsonlite$fromJSON('data.json') json
The output will be the same as the CSV sample, since the input data is the same.
Another common data format is XML. This can be more verbose and therefore lead to larger file sizes. However, if you are working with legacy systems, they may output XML and you will have no ability to change it.
Create the following xml file data.xml
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
<?xml version="1.0" encoding="UTF-8"?> <root> <element> <efficiency>24.3</efficiency> <sales>8949000</sales> <year>1980</year> </element> <element> <efficiency>27.6</efficiency> <sales>10979000</sales> <year>1985</year> </element> <element> <efficiency>28</efficiency> <sales>9303000</sales> <year>1990</year> </element> <element> <efficiency>28.4</efficiency> <sales>8185000</sales> <year>1991</year> </element> <element> <efficiency>27.9</efficiency> <sales>8213000</sales> <year>1992</year> </element> <element> <efficiency>28.4</efficiency> <sales>8518000</sales> <year>1993</year> </element> <element> <efficiency>28.3</efficiency> <sales>8991000</sales> <year>1994</year> </element> <element> <efficiency>28.6</efficiency> <sales>8620000</sales> <year>1995</year> </element> <element> <efficiency>28.5</efficiency> <sales>8479000</sales> <year>1996</year> </element> <element> <efficiency>28.7</efficiency> <sales>8217000</sales> <year>1997</year> </element> <element> <efficiency>28.8</efficiency> <sales>8085000</sales> <year>1998</year> </element> <element> <efficiency>28.3</efficiency> <sales>8638000</sales> <year>1999</year> </element> <element> <efficiency>28.5</efficiency> <sales>8778000</sales> <year>2000</year> </element> <element> <efficiency>28.8</efficiency> <sales>8352000</sales> <year>2001</year> </element> <element> <efficiency>29</efficiency> <sales>8042000</sales> <year>2002</year> </element> <element> <efficiency>29.5</efficiency> <sales>7556000</sales> <year>2003</year> </element> <element> <efficiency>29.5</efficiency> <sales>7483000</sales> <year>2004</year> </element> <element> <efficiency>30.3</efficiency> <sales>7660000</sales> <year>2005</year> </element> <element> <efficiency>30.1</efficiency> <sales>7762000</sales> <year>2006</year> </element> <element> <efficiency>31.2</efficiency> <sales>7562000</sales> <year>2007</year> </element> <element> <efficiency>31.5</efficiency> <sales>6769000</sales> <year>2008</year> </element> <element> <efficiency>32.9</efficiency> <sales>5402000</sales> <year>2009</year> </element> <element> <efficiency>33.9</efficiency> <sales>5636000</sales> <year>2010</year> </element> <element> <efficiency>33.1</efficiency> <sales>6093000</sales> <year>2011</year> </element> <element> <efficiency>35.3</efficiency> <sales>7245000</sales> <year>2012</year> </element> <element> <efficiency>36.4</efficiency> <sales>7586000</sales> <year>2013</year> </element> <element> <efficiency>36.5</efficiency> <sales>7708000</sales> <year>2014</year> </element> <element> <efficiency>37.2</efficiency> <sales>7517000</sales> <year>2015</year> </element> <element> <efficiency>37.7</efficiency> <sales>6873000</sales> <year>2016</year> </element> <element> <efficiency>39.4</efficiency> <sales>6081000</sales> <year>2017</year> </element> </root>
There are a few different ways to work with XML. The XML package is built in, but some versions of R have memory leaks. Further, using this package can lead to quite verbose code.
A simpler solution is using XML2, which is more focused on extracting values out of XML documents.
The code below demonstrates how to extract XML values using the XML2 package into a data structure we can analyze in R.
1 2 3 4 5 6 7 8 9 10 11 12
install.packages('xml2') parse_xml <- function (file) { doc <- xml2::read_xml(file) efficiency <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/efficiency")) sales <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/sales")) year <- xml2::xml_double(xml2::xml_find_all(doc, "//root/element/year")) cbind(efficiency, sales, year) } data <- parse_xml('data.xml') data
The code snippet above creates a function to extract the sales, efficiency, and year columns.
Then cbind
is used to combine all these columns together, producing a table.
This code uses xpath queries to search for the given values. If your XML document is complex, it can take a bit of trial and error to query the values you want.
For this reason, the XML data format is typically not the preferred way to share data.
This parsing logic is encapsulated in the parse_xml
function. This is also helpful for keeping the global environment clean of temporary variables.
Data analysts need to be comfortable importing different data formats to effectively undertake data modelling and analysis projects. This guide has shown you how to import three common formats: XML, JSON and CSV. To further build on these skills, you can watch this course on Querying and Converting Data Types in R.
0