This post is a short extension of the previous two articles (Download Stock Data - Part I, Download Stock Data - Part II). It discusses how to obtain gross returns, standard deviations and correlations of multiple shares.

The following packages are used.

library(knitr)
library(lubridate)
library(stringr)
library(reshape2)
library(plyr)
library(dplyr)

The script begins by creating a data folder named in the format data_YYYY-MM-DD.

# create data folder
dataDir <- paste0("data","_",format(Sys.Date(),"%Y-%m-%d"))
if(file.exists(dataDir)) {
  unlink(dataDir, recursive = TRUE)
  dir.create(dataDir)
} else {
  dir.create(dataDir)
}

Given the company codes, URLs and file paths are created. The data files are then downloaded with Map, which is a wrapper of mapply. Note that R's download.file function is wrapped in downloadFile so that the script does not stop when a download fails.

# assumes codes are known beforehand
codes <- c("MSFT", "TCHC")
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:",
               codes,"&output=csv")
paths <- paste0(dataDir,"/",codes,".csv") # forward slashes also work on Windows

# simple error handling in case a file doesn't exist
downloadFile <- function(url, path, ...) {
  # remove file if it already exists
  if(file.exists(path)) file.remove(path)
  # download file
  tryCatch(
    download.file(url, path, ...), error = function(c) {
      # remove file if error
      if(file.exists(path)) file.remove(path)
      # create error message from the file name (i.e. the company code)
      c$message <- paste(basename(path), "failed")
      message(c$message)
    }
  )
}
# Map is a wrapper of mapply
Map(downloadFile, urls, paths)
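Because downloadFile only reports a failure rather than stopping the script, a quick way to see which downloads actually succeeded is to check whether the target files exist afterwards. The snippet below is an illustrative check, not part of the original script, and assumes the codes and paths objects created above.

# check which of the expected files were actually written;
# a code with no matching page would simply show FALSE here
downloaded <- setNames(file.exists(paths), codes)
downloaded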

Once the files are downloaded, they are read back and combined with rbind_all. Some more details about this step are listed below.

  • only the Date, Close and Code columns are kept
  • codes are extracted from the file paths by matching a regular expression (illustrated after the code block below)
  • data is arranged by date, as the raw files are sorted in descending order
  • errors are handled by returning a dummy data frame whose Code value is "NA"
  • individual data files are merged in a long format
    • rows with Code equal to "NA" are filtered out
# read all csv files and merge
files <- dir(dataDir, full.names = TRUE)
dataList <- llply(files, function(file){
  # get code from file path
  pattern <- "/[A-Z][A-Z][A-Z][A-Z]"
  code <- substr(str_extract(file, pattern), 2, nchar(str_extract(file, pattern)))
  tryCatch({
    data <- read.csv(file, stringsAsFactors = FALSE)
    # the first column's name in the raw file is garbled, so rename all columns
    names(data) <- c("Date","Open","High","Low","Close","Volume")
    data$Date <- dmy(data$Date)
    data$Close <- as.numeric(data$Close)
    data$Code <- code
    # optional
    data$Open <- as.numeric(data$Open)
    data$High <- as.numeric(data$High)
    data$Low <- as.numeric(data$Low)
    data$Volume <- as.integer(data$Volume)
    # select only 'Date', 'Close' and 'Code'
    # raw data should be arranged in an ascending order
    arrange(subset(data, select = c(Date, Close, Code)), Date)
  },
  error = function(c){
    c$message <- paste(code,"failed")
    message(c$message)
    # return a dummy data frame so the function does not break
    data <- data.frame(Date=dmy(format(Sys.Date(),"%d%m%Y")), Close=0, Code="NA")
    data
  })
}, .progress = "text")

# data is combined to create a long format
# dummy data frame values are filtered out
data <- filter(rbind_all(dataList), Code != "NA")
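The code extraction step in the block above relies on matching four consecutive capital letters that follow a slash. A minimal illustration, assuming stringr is loaded as above and using a hypothetical file path of the same shape as those produced by the script:

# hypothetical path in the format produced by the script
path <- "data_2013-12-06/MSFT.csv"
pattern <- "/[A-Z][A-Z][A-Z][A-Z]"
str_extract(path, pattern)  # "/MSFT"
# dropping the leading slash leaves the company code
substr(str_extract(path, pattern), 2, nchar(str_extract(path, pattern)))  # "MSFT"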

Some values of the long format data are shown below.

| Date       | Close | Code |
|------------|-------|------|
| 2013-11-29 | 38.13 | MSFT |
| 2013-12-02 | 38.45 | MSFT |
| 2013-12-03 | 38.31 | MSFT |
| 2013-12-04 | 38.94 | MSFT |
| 2013-12-05 | 38.00 | MSFT |
| 2013-12-06 | 38.36 | MSFT |

The data is converted into a wide format where the x and y variables are Date and Code respectively (Date ~ Code), while the value variable is Close (value.var="Close"). Some values of the wide format data are shown below.

# data is converted into a wide format
data <- dcast(data, Date ~ Code, value.var="Close")
kable(head(data))
| Date       |  MSFT |  TCHC |
|------------|-------|-------|
| 2013-11-29 | 38.13 | 13.52 |
| 2013-12-02 | 38.45 | 13.81 |
| 2013-12-03 | 38.31 | 13.48 |
| 2013-12-04 | 38.94 | 13.71 |
| 2013-12-05 | 38.00 | 13.55 |
| 2013-12-06 | 38.36 | 13.95 |

The remaining steps take the log of the close prices, difference them to obtain daily log returns, and then apply sum, sd and cor; a short sanity check of the summed log returns follows the output below.

# drop the Date column
data <- select(data, -Date)

# apply log difference column-wise
dailyRet <- apply(log(data), 2, diff, lag=1)

# obtain gross return, standard deviation and correlation
returns <- apply(dailyRet, 2, sum, na.rm = TRUE)
std <- apply(dailyRet, 2, sd, na.rm = TRUE)
correlation <- cor(dailyRet)

returns
##      MSFT      TCHC 
## 0.2249777 0.6293973
std
##       MSFT       TCHC 
## 0.01167381 0.03203031
correlation
##           MSFT      TCHC
## MSFT 1.0000000 0.1481043
## TCHC 0.1481043 1.0000000
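Note that the summed daily log differences are simply the log of the ratio of the last to the first close price, so exponentiating them recovers the gross price relative. A small sanity check with made-up prices, not the downloaded data:

# made-up close prices for illustration only
p <- c(100, 102, 101, 105)
sum(diff(log(p)))         # total log return; the daily differences telescope
log(p[length(p)] / p[1])  # the same value
exp(sum(diff(log(p))))    # gross return, i.e. final price / initial price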

Finally the data folder is deleted.

# delete data folder
if(file.exists(dataDir)) { unlink(dataDir, recursive = TRUE) }