This post is a slight extension of the previous two articles (Download Stock Data - Part I, Download Stock Data - Part II). Here we discuss how to produce gross returns, standard deviations and correlations of multiple shares.
The following packages are used.
```r
library(knitr)
library(lubridate)
library(stringr)
library(reshape2)
library(plyr)
library(dplyr)
```
The script begins by creating a data folder named in the format data_YYYY-MM-DD.
```r
# create data folder, replacing any existing one
dataDir <- paste0("data", "_", format(Sys.Date(), "%Y-%m-%d"))
if (file.exists(dataDir)) {
  unlink(dataDir, recursive = TRUE)
  dir.create(dataDir)
} else {
  dir.create(dataDir)
}
```
Given the company codes, URLs and file paths are created, and the data files are downloaded with `Map`, which is a wrapper of `mapply`. Note that R's `download.file` function is wrapped by `downloadFile` so that the script does not break when an error occurs.
```r
# assumes codes are known beforehand
codes <- c("MSFT", "TCHC")
urls <- paste0("http://www.google.com/finance/historical?q=NASDAQ:",
               codes, "&output=csv")
paths <- paste0(dataDir, "/", codes, ".csv") # forward slashes also work on Windows

# simple error handling in case a file doesn't exist
downloadFile <- function(url, path, ...) {
  # remove the file if it already exists
  if (file.exists(path)) file.remove(path)
  # download file
  tryCatch(
    download.file(url, path, ...),
    error = function(c) {
      # remove the partial file if an error occurs
      if (file.exists(path)) file.remove(path)
      # create an informative error message
      c$message <- paste(basename(path), "failed")
      message(c$message)
    }
  )
}
# Map is a wrapper of mapply
Map(downloadFile, urls, paths)
```
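The value of the `tryCatch` wrapper is easiest to see in isolation. The sketch below uses a made-up `safeLog` function rather than a real download, but it shows the same pattern: the handler's return value is used for a failing element, so `Map` keeps going instead of aborting the whole loop.

```r
# a failing call is caught and replaced by NA;
# the remaining iterations of Map still run
safeLog <- function(x) {
  tryCatch(log(x), error = function(e) {
    message("failed: ", conditionMessage(e))
    NA
  })
}
results <- Map(safeLog, list(1, exp(1), "oops"))
# the third element errors (log of a string) but the first two succeed
```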
Once the files are downloaded, they are read back and combined using `rbind_all`. Some more details about this step are listed below.
- only the Date, Close and Code columns are kept
- codes are extracted from file paths by matching a regular expression
- data is arranged by date, as the raw files are sorted in descending order
- errors are handled by returning a dummy data frame whose Code value is "NA"
- individual data files are merged in a long format
- rows whose Code is "NA" are filtered out
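The regular-expression step for extracting a code can be checked on its own; the file path below is a made-up example of what the download step produces.

```r
library(stringr)

file <- "data_2013-12-06/MSFT.csv"               # hypothetical file path
m <- str_extract(file, "/[A-Z][A-Z][A-Z][A-Z]")  # matches "/MSFT"
code <- substr(m, 2, nchar(m))                   # drop the leading slash
code
```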
```r
# read all csv files and merge
files <- dir(dataDir, full.names = TRUE)
dataList <- llply(files, function(file) {
  # get code from file path
  pattern <- "/[A-Z][A-Z][A-Z][A-Z]"
  code <- substr(str_extract(file, pattern), 2, nchar(str_extract(file, pattern)))
  tryCatch({
    data <- read.csv(file, stringsAsFactors = FALSE)
    # the first column's name is garbled in the raw file, so rename all columns
    names(data) <- c("Date", "Open", "High", "Low", "Close", "Volume")
    data$Date <- dmy(data$Date)
    data$Close <- as.numeric(data$Close)
    data$Code <- code
    # optional
    data$Open <- as.numeric(data$Open)
    data$High <- as.numeric(data$High)
    data$Low <- as.numeric(data$Low)
    data$Volume <- as.integer(data$Volume)
    # select only 'Date', 'Close' and 'Code'
    # raw data should be arranged in ascending order
    arrange(subset(data, select = c(Date, Close, Code)), Date)
  },
  error = function(c) {
    c$message <- paste(code, "failed")
    message(c$message)
    # return a dummy data frame so the function doesn't break
    data.frame(Date = dmy(format(Sys.Date(), "%d%m%Y")), Close = 0, Code = "NA")
  })
}, .progress = "text")

# data is combined to create a long format
# dummy data frame values are filtered out
data <- filter(rbind_all(dataList), Code != "NA")
```
Some values of this long format data are shown below.
Date | Close | Code |
---|---|---|
2013-11-29 | 38.13 | MSFT |
2013-12-02 | 38.45 | MSFT |
2013-12-03 | 38.31 | MSFT |
2013-12-04 | 38.94 | MSFT |
2013-12-05 | 38.00 | MSFT |
2013-12-06 | 38.36 | MSFT |
The data is converted into a wide format where the x and y variables are Date and Code respectively (`Date ~ Code`), while the value variable is Close (`value.var="Close"`). Some values of the wide format data are shown below.
```r
# data is converted into a wide format
data <- dcast(data, Date ~ Code, value.var = "Close")
kable(head(data))
```
Date | MSFT | TCHC |
---|---|---|
2013-11-29 | 38.13 | 13.52 |
2013-12-02 | 38.45 | 13.81 |
2013-12-03 | 38.31 | 13.48 |
2013-12-04 | 38.94 | 13.71 |
2013-12-05 | 38.00 | 13.55 |
2013-12-06 | 38.36 | 13.95 |
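The same reshape can be verified on a toy long-format frame; the codes and prices below are made up.

```r
library(reshape2)

# minimal long-format frame: one row per (Date, Code) pair
long <- data.frame(
  Date  = rep(as.Date(c("2013-12-02", "2013-12-03")), times = 2),
  Code  = rep(c("AAA", "BBB"), each = 2),
  Close = c(1.0, 1.1, 2.0, 2.2)
)
# one row per Date, one column per Code after casting
wide <- dcast(long, Date ~ Code, value.var = "Close")
```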
The remaining steps just difference the close prices after taking logs, then apply `sum`, `sd` and `cor`.
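Summing the daily log returns gives the gross log return over the whole period, because the differences telescope. A quick check with made-up prices:

```r
p <- c(100, 102, 101, 105)        # hypothetical close prices
dailyRet <- diff(log(p), lag = 1)
# sum(dailyRet) equals log(p[n] / p[1]): the intermediate terms cancel
all.equal(sum(dailyRet), log(p[length(p)] / p[1]))
```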
```r
# drop the Date column
data <- select(data, -Date)

# apply log difference column-wise
dailyRet <- apply(log(data), 2, diff, lag = 1)

# obtain gross returns, standard deviations and correlations
returns <- apply(dailyRet, 2, sum, na.rm = TRUE)
std <- apply(dailyRet, 2, sd, na.rm = TRUE)
correlation <- cor(dailyRet)

returns
##      MSFT      TCHC
## 0.2249777 0.6293973

std
##       MSFT       TCHC
## 0.01167381 0.03203031

correlation
##           MSFT      TCHC
## MSFT 1.0000000 0.1481043
## TCHC 0.1481043 1.0000000
```
Finally, the data folder is deleted.
```r
# delete data folder
if (file.exists(dataDir)) unlink(dataDir, recursive = TRUE)
```