Working with Imported Data - the Base R#

Learning Objectives#

  • Set up the working directory and import data into R.

  • Explore and manipulate data frames to extract insights.

  • Perform basic data cleaning, including handling dates and missing values.

Now you have mastered the basics of R. But we really want to learn how to import data! So let us do that.

We have a CSV file of the Atlantic Hurricane Database (HURDAT2) 1851-2022. The data is released by the National Hurricane Center (NHC) at NOAA. Use this link to download: https://tufts.box.com/shared/static/gdqc9tdv7334622tkco37oyg9t2fyeaf.csv

But how do we import this data so we can use it in R?

Working Directory#

All relative file paths in R are relative to the current working directory, which might or might not be the location of the R script, depending on how RStudio was originally launched. If RStudio was launched by double-clicking on the R script or by right-clicking on the R script and then selecting Open With > RStudio, the default working directory will be the location of the R script. Otherwise the default working directory will either be your Documents directory (Windows) or your home directory (macOS and Linux).

We can use getwd() and dir() to explore the current working directory.

getwd()     # The current location of the working directory.
dir()       # The files in the working directory.
'/home/runner/work/guides/guides/source/source/r'
  1. '00_Getting_started_with_R.md'
  2. '01_Intro_to_RStudio.md'
  3. '02_intro-r_script excution and basics.Rmd'
  4. '03_working_with_imported_data.Rmd'
  5. '04_working_with_imported_data_tidyverse.Rmd'
  6. 'animation.gif'
  7. 'atlantic.csv'
  8. 'Fast_Loops_and_Parallel_Processing.md'
  9. 'index.md'
  10. 'r_data_dataviz'
  11. 'r-data-viz_files'

You can use the %in% operator to ensure that your working directory contains data file (atlantic.csv).

"atlantic.csv" %in% dir()
TRUE

If the statement above returns TRUE, you are all set. But if it returns FALSE, you need to change your working directory. Go to the top menu bar: Session > Set Working Directory > To Source File Location.

We can ensure the working directory is set correctly by either re-running the commands from above or clicking on the Files tab in the lower-right panel and then selecting More > Go To Working Directory and exploring the results.

Importing Data#

The atlantic.csv data file is in comma-separated values (CSV) format. (Note that the specifics of this format are outlined in standard RFC 4180.) This is a very common data format and you can easily import it in R as follows.

hurrdata <- read.csv("atlantic.csv")

The read.csv() function has numerous additional optional arguments that we can use to specify how exactly a data file should be read in and interpreted.

To investigate those, we can use the help() function or the ? operator.

?read.csv

Note that CSV files are different from Excel spreadsheets (XLS or XLSX files) and R does not contain the functionality to import the latter by default. An external community-developed package must be used to import Excel files. Installing and loading external packages is discussed later in this script.

Exploring Data#

We see that a new variable “hurrdata” has been added to the environment.

head(hurrdata)
A data.frame: 6 × 23
IDNameDateTimeEventStatusLatitudeLongitudeMaximum.WindMinimum.PressureRadii.NW.34ktRadii.NE.50ktRadii.SE.50ktRadii.SW.50ktRadii.NW.50ktRadii.NE.64ktRadii.SE.64ktRadii.SW.64ktRadii.NW.64ktMaximum.Wind.Radius
<chr><chr><int><int><chr><chr><chr><chr><int><int><int><int><int><int><int><int><int><int><int><int>
1AL011851UNNAMED18510625 0 HU28.0N94.8W80-999-999-999-999-999-999-999-999-999-999-999
2AL011851UNNAMED18510625 600 HU28.0N95.4W80-999-999-999-999-999-999-999-999-999-999-999
3AL011851UNNAMED185106251200 HU28.0N96.0W80-999-999-999-999-999-999-999-999-999-999-999
4AL011851UNNAMED185106251800 HU28.1N96.5W80-999-999-999-999-999-999-999-999-999-999-999
5AL011851UNNAMED185106252100LHU28.2N96.8W80-999-999-999-999-999-999-999-999-999-999-999
6AL011851UNNAMED18510626 0 HU28.2N97.0W70-999-999-999-999-999-999-999-999-999-999-999

This shows us a preview of the fist couple rows of the file. You can also click on the data table under Environment > Data.

We can use summary() to get descriptive statistics.

summary(hurrdata)
      ID                Name                Date               Time       
 Length:53976       Length:53976       Min.   :18510625   Min.   :   0.0  
 Class :character   Class :character   1st Qu.:19180804   1st Qu.: 600.0  
 Mode  :character   Mode  :character   Median :19640609   Median :1200.0  
                                       Mean   :19554281   Mean   : 910.8  
                                       3rd Qu.:19961020   3rd Qu.:1800.0  
                                       Max.   :20221111   Max.   :2330.0  
    Event              Status            Latitude          Longitude        
 Length:53976       Length:53976       Length:53976       Length:53976      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  Maximum.Wind    Minimum.Pressure Radii.NE.34kt    Radii.SE.34kt   
 Min.   :-99.00   Min.   :-999.0   Min.   :-999.0   Min.   :-999.0  
 1st Qu.: 35.00   1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0  
 Median : 45.00   Median :-999.0   Median :-999.0   Median :-999.0  
 Mean   : 52.57   Mean   :-149.4   Mean   :-795.1   Mean   :-796.4  
 3rd Qu.: 70.00   3rd Qu.: 995.0   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   :165.00   Max.   :1024.0   Max.   : 780.0   Max.   : 720.0  
 Radii.SW.34kt    Radii.NW.34kt  Radii.NE.50kt    Radii.SE.50kt   
 Min.   :-999.0   Min.   :-999   Min.   :-999.0   Min.   :-999.0  
 1st Qu.:-999.0   1st Qu.:-999   1st Qu.:-999.0   1st Qu.:-999.0  
 Median :-999.0   Median :-999   Median :-999.0   Median :-999.0  
 Mean   :-801.4   Mean   :-799   Mean   :-806.1   Mean   :-806.4  
 3rd Qu.:-999.0   3rd Qu.:-999   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   : 660.0   Max.   : 600   Max.   : 360.0   Max.   : 300.0  
 Radii.SW.50kt    Radii.NW.50kt    Radii.NE.64kt    Radii.SE.64kt   
 Min.   :-999.0   Min.   :-999.0   Min.   :-999.0   Min.   :-999.0  
 1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0  
 Median :-999.0   Median :-999.0   Median :-999.0   Median :-999.0  
 Mean   :-807.7   Mean   :-807.1   Mean   :-809.2   Mean   :-809.3  
 3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0  
 Max.   : 360.0   Max.   : 390.0   Max.   : 180.0   Max.   : 250.0  
 Radii.SW.64kt    Radii.NW.64kt    Maximum.Wind.Radius
 Min.   :-999.0   Min.   :-999.0   Min.   :-999.0     
 1st Qu.:-999.0   1st Qu.:-999.0   1st Qu.:-999.0     
 Median :-999.0   Median :-999.0   Median :-999.0     
 Mean   :-809.7   Mean   :-809.5   Mean   :-973.2     
 3rd Qu.:-999.0   3rd Qu.:-999.0   3rd Qu.:-999.0     
 Max.   : 180.0   Max.   : 300.0   Max.   : 400.0     

We might wonder - what type of data is hurrdata? We imported it from csv, but how is it stored?

class(hurrdata)
'data.frame'

We can see that it is a “data.frame”, commonly referred to as a data frame. A data frame is a table where you have observations as rows and variables as columns. Data frames have some great features for working with data and are the go-to for R data storage. You can think of them almost as spreadsheets.

Working with Data Frames#

Let us say we want to access the maximum wind speed of the sixth observation. We can do this in multiple ways. Knowing that “Maximum.Wind” is the ninth column:

hurrdata[[6, 9]]    # [[row, column]]
hurrdata[[9]][6]    # [[column]][row]
70
70

NOTE: When using a data frame, we must use [[]] to access variable values. [] will just give us a subset of the data frame, not the values themselves.

Alternatively, we could use the column name:

hurrdata[["Maximum.Wind"]][6]
hurrdata$Maximum.Wind[6]
70
70

Note the dollar sign ($). This is a special operator that allows us to access data.frame variables (columns) based on their name.

The “$” operator is the preferred way of accessing data.frame variables based on their name, as it removes the complexity of when to use [[]] vs [] and does not require quotation marks … given your column names do not contain spaces.

Note how the read.csv() function automatically replaced spaces with periods in the column names to accommodate this. Underscores are also an acceptable alternative to spaces and other functions might use those instead of periods.

Selecting Data Based on Conditions#

Note how the summary() function from before did not provide much information on the columns containing textual (character/string) data. These columns can be analyzed further using the table() function. Can you guess what it does?

table(hurrdata$Name)

It creates a frequency table of all the unique values in the column! But what if we wanted to analyze hurricanes only with a specific name?

We can select rows based on a condition by combining logical operators with [] to subset all the rows where the logical operator returns TRUE. For example, all the hurricanes named Nicole can be extracted as follows.

hurrdata[hurrdata$Name == "NICOLE", ]

Note that we leave the column index blank to select all the columns.

Alternatively, we could extract values from one column based on the values of another column. For example, the maximum wind speed for all the hurricanes named Nicole could be obtained as follows.

max(hurrdata$Maximum.Wind[hurrdata$Name == "NICOLE"])
120

Data Cleaning: Dates & Strings#

Let us say we want to analyze maximum wind speed by year. Note how the date is stored as a number in YYYYMMDD format. This notation is great for sorting but very inconvenient for analysis.

We can use the “$” operator to easily extract the Data column as follows.

hurrdata$Date

Passing this column to the as.character() function will convert all the numbers into text (character data). Note the quotes around the new values.

as.character(hurrdata$Date)

Let us store these new character (string) representations of the dates in a new variable and use the substr() function to extract the year and the month.

date_strings <- as.character(hurrdata$Date)

Extract the year from the date string (position 1-4).

hurrdata$Year <- substr(date_strings, start = 1, stop = 4)

Extract the month from the date string (position 5-6).

hurrdata$Month <- substr(date_strings, 5, 6)

Convert both new variables to numeric to accommodate further analysis.

hurrdata$Month <- as.numeric(hurrdata$Month)
hurrdata$Year <- as.numeric(hurrdata$Year)

Note that this is a somewhat hack-y way of dealing with dates. It is suitable for simple conversions like this, but it is highly encouraged to use external packages specifically designed to work with dates for more complex tasks.

Data Cleaning: Missing Data#

Why are some wind speeds negative? If we looked into the metadata, we would find that we should have removed these! We can check on the negative data:

min(hurrdata$Maximum.Wind)
-99

We can combine [] with logical operators like before to find all the values less than zero (< 0) and replace them with “NA”, which means “No Data” in R.

hurrdata$Maximum.Wind[hurrdata$Maximum.Wind < 0] <- NA

We could also delete the whole observation, but this is bad practice! Let’s check on the results. It should print “NA”.

min(hurrdata$Maximum.Wind)
<NA>

To get the minimum wind speed excluding the NA values, we must call the min() function with na.rm = TRUE to instruct the function to ignore the NA values.

min(hurrdata$Maximum.Wind, na.rm = TRUE)
10

Sampling the Data#

We can also sample the data.frame to de-clutter the scatter plot. This is a two-step process in base R. First we use the sample() function to randomly select a desired quantity of row indexes/numbers for our data frame.

sample(nrow(hurrdata), 1000)
  1. 29809
  2. 42828
  3. 19469
  4. 44676
  5. 51631
  6. 8816
  7. 33764
  8. 39566
  9. 38865
  10. 6748
  11. 16676
  12. 2747
  13. 37083
  14. 5371
  15. 25897
  16. 10550
  17. 6601
  18. 23425
  19. 48811
  20. 51711
  21. 45674
  22. 37371
  23. 31033
  24. 10797
  25. 39389
  26. 9506
  27. 24126
  28. 25856
  29. 7330
  30. 38521
  31. 6717
  32. 12271
  33. 16017
  34. 11708
  35. 3726
  36. 4355
  37. 45986
  38. 51643
  39. 20018
  40. 24604
  41. 17351
  42. 15587
  43. 18603
  44. 33217
  45. 14021
  46. 5114
  47. 34986
  48. 7506
  49. 21412
  50. 28578
  51. 51166
  52. 51103
  53. 46346
  54. 38946
  55. 378
  56. 43649
  57. 50029
  58. 44458
  59. 15248
  60. 42514
  61. 37853
  62. 8363
  63. 37241
  64. 34348
  65. 29140
  66. 34227
  67. 10806
  68. 34590
  69. 18546
  70. 45639
  71. 53061
  72. 40667
  73. 18595
  74. 41656
  75. 17586
  76. 20846
  77. 42121
  78. 15774
  79. 2013
  80. 8379
  81. 8939
  82. 53638
  83. 37675
  84. 8964
  85. 26473
  86. 21753
  87. 53442
  88. 36196
  89. 39380
  90. 37209
  91. 24007
  92. 25242
  93. 25496
  94. 44024
  95. 22814
  96. 30956
  97. 44600
  98. 34028
  99. 34339
  100. 38687
  101. 30947
  102. 875
  103. 22189
  104. 8220
  105. 41616
  106. 12902
  107. 7367
  108. 48784
  109. 45228
  110. 38227
  111. 19924
  112. 42042
  113. 39645
  114. 52442
  115. 27658
  116. 17793
  117. 50201
  118. 29583
  119. 22887
  120. 49965
  121. 53183
  122. 5632
  123. 9952
  124. 25557
  125. 14816
  126. 15010
  127. 17012
  128. 24533
  129. 27857
  130. 41209
  131. 48213
  132. 45340
  133. 19908
  134. 28236
  135. 53807
  136. 23066
  137. 44758
  138. 21320
  139. 51081
  140. 32426
  141. 30597
  142. 18727
  143. 8380
  144. 34838
  145. 6685
  146. 23567
  147. 5233
  148. 43623
  149. 15677
  150. 45307
  151. 43741
  152. 41967
  153. 47240
  154. 53161
  155. 20214
  156. 13663
  157. 38809
  158. 24120
  159. 30847
  160. 23893
  161. 46365
  162. 31
  163. 7737
  164. 24260
  165. 22135
  166. 21512
  167. 52430
  168. 23900
  169. 24906
  170. 36120
  171. 31786
  172. 16099
  173. 36134
  174. 19860
  175. 50473
  176. 3448
  177. 46584
  178. 24085
  179. 11478
  180. 21241
  181. 21611
  182. 39790
  183. 834
  184. 22234
  185. 17107
  186. 36842
  187. 46955
  188. 53946
  189. 44300
  190. 48936
  191. 6504
  192. 48854
  193. 44857
  194. 52337
  195. 36448
  196. 47572
  197. 4659
  198. 18489
  199. 5591
  200. 19591
  201. 36869
  202. 9834
  203. 50984
  204. 28516
  205. 35385
  206. 11788
  207. 26971
  208. 51581
  209. 20684
  210. 21725
  211. 4161
  212. 50651
  213. 1772
  214. 40827
  215. 10903
  216. 38255
  217. 9018
  218. 17677
  219. 32959
  220. 3140
  221. 29099
  222. 30354
  223. 5329
  224. 34207
  225. 23829
  226. 21752
  227. 25842
  228. 35836
  229. 53901
  230. 52300
  231. 567
  232. 35208
  233. 49713
  234. 40731
  235. 41610
  236. 39582
  237. 27949
  238. 9984
  239. 2276
  240. 38011
  241. 36413
  242. 7100
  243. 28986
  244. 46917
  245. 48959
  246. 35906
  247. 50010
  248. 7896
  249. 33928
  250. 24486
  251. 49113
  252. 8629
  253. 45799
  254. 48738
  255. 15749
  256. 31127
  257. 40690
  258. 41908
  259. 16518
  260. 8049
  261. 48773
  262. 8544
  263. 11861
  264. 47220
  265. 33593
  266. 35885
  267. 9706
  268. 15833
  269. 5396
  270. 51772
  271. 23315
  272. 34032
  273. 23795
  274. 53863
  275. 48897
  276. 11877
  277. 16310
  278. 25714
  279. 40950
  280. 39913
  281. 11635
  282. 9373
  283. 48858
  284. 38789
  285. 4052
  286. 51049
  287. 27471
  288. 53002
  289. 20418
  290. 19170
  291. 7303
  292. 20236
  293. 10559
  294. 42361
  295. 17317
  296. 3131
  297. 50752
  298. 4274
  299. 21971
  300. 27695
  301. 42845
  302. 28781
  303. 2233
  304. 34351
  305. 16568
  306. 24953
  307. 43240
  308. 23172
  309. 12427
  310. 23245
  311. 22619
  312. 23310
  313. 12937
  314. 33389
  315. 12310
  316. 36012
  317. 42451
  318. 24287
  319. 22237
  320. 43807
  321. 42609
  322. 13035
  323. 33620
  324. 31824
  325. 51494
  326. 52453
  327. 35455
  328. 9431
  329. 25295
  330. 5697
  331. 25745
  332. 53023
  333. 1090
  334. 20119
  335. 31913
  336. 45107
  337. 26323
  338. 25062
  339. 8681
  340. 23514
  341. 31922
  342. 10554
  343. 1361
  344. 37394
  345. 16854
  346. 41730
  347. 40373
  348. 14187
  349. 8688
  350. 39165
  351. 18827
  352. 29363
  353. 35036
  354. 45524
  355. 51604
  356. 8717
  357. 2541
  358. 43999
  359. 47211
  360. 44576
  361. 1980
  362. 46078
  363. 8322
  364. 1898
  365. 24598
  366. 15726
  367. 19363
  368. 22817
  369. 13562
  370. 43472
  371. 28326
  372. 14275
  373. 24784
  374. 11501
  375. 19403
  376. 32328
  377. 18384
  378. 51783
  379. 44646
  380. 31445
  381. 40966
  382. 21023
  383. 4359
  384. 9321
  385. 33739
  386. 36293
  387. 20788
  388. 12081
  389. 41613
  390. 25747
  391. 25850
  392. 9976
  393. 31210
  394. 53243
  395. 27531
  396. 6420
  397. 18881
  398. 43749
  399. 50969
  400. 22580

And then we use [] to extract those rows from the data frame.

hurrdata2 <- hurrdata[sample(nrow(hurrdata), 1000), ]

To Get Support: - Please email tts-research@tufts.edu for questions and requests.