As I don’t use R at work yet, I haven’t had enough chances to learn R by doing. While teaching myself machine learning with R, I thought it would be a good idea to collect examples on the web. Recently I have been visiting a LinkedIn group called The R Project for Statistical Computing and suggesting scripts on data manipulation or programming topics. I expected some of the topics would also help with learning machine learning/statistics but, possibly because I lack understanding of the topics in context, I’ve found only a few, so it may be necessary to look for another source. Anyway, below is a summary of two examples.
Summarise a data frame grouped by a column
This seems to be part of an assignment for a Coursera class, and the original posting is as follows.
- Problem Statement : To find the ‘Best’ hospital in a particular city under one of the three outcomes(heart attack, heart failure, pneumonia). The data set (dat) contains 54 unique states and a total of 4706 observations with 26 variable columns. Column 11 contains Mortality rate for each hospital(4706) due to heart attack. Logic is simple as the best hospital would be one with least mortality. But something is going awry which i’m not able to identify. Please help it out!
As far as I understand, the poster tried to split the data by state into a list with `x <- split(as.character(dat$Hospital.Name), dat$State)`. He was then going to combine the list elements after calculating the minimum mortality rate by state.
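The split-then-combine idea can be sketched in base R as follows. This is only a guess at what the poster was aiming for, not the original script: the four hospitals, their rates and the `Mortality` column name are made up for illustration, while `Hospital.Name` and `State` follow the posting.

```r
# Hypothetical toy data; the real data set has 4706 rows and 26 columns
dat <- data.frame(Hospital.Name = c("Hospital A", "Hospital B", "Hospital C", "Hospital D"),
                  State         = c("TX", "TX", "NY", "NY"),
                  Mortality     = c(14.2, 12.9, 15.1, 13.3),  # assumed column name
                  stringsAsFactors = FALSE)

# Split hospital names and mortality rates into lists keyed by state
x <- split(as.character(dat$Hospital.Name), dat$State)
m <- split(dat$Mortality, dat$State)

# For each state, pick the name at the position of the lowest rate;
# which.min() returns the first position when there is a tie
best <- mapply(function(nm, rate) nm[which.min(rate)], x, m)
print(best)
```

Because `x` and `m` are split by the same column, their elements line up state by state, which is what lets `mapply` walk through them in parallel.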
Focusing on the logic described above, I created a simple data frame and converted it into another data frame that shows the state, the name and the minimum number. An integer vector was assumed rather than mortality rates.
Some notes about the code are

- It would be good to use a library that provides a comprehensive syntax
  - plyr is used
- `ddply` converts a data frame (the first d in ddply) into another data frame (the second d in ddply)
  - `.data` requires a data frame, here `dat`
  - `.(state)` will group by state when applying a function, e.g. minimum by state
  - `name = name[num==min(num, na.rm=TRUE)][1]` selects `name` when `num` is the minimum of `num`
    - the last indexing (`[1]`) is necessary in case there is a tie. For example, it’ll fail without it if `num = c(3, 2, 6, 2, 3)` (same number in state1). The idea is just selecting the first record when there is a tie
```r
library(knitr)
library(plyr)
```

```r
dat <- data.frame(name = c("name1", "name2", "name3", "name4", "name5"),
                  num = c(3, 2, 6, 2, 9),
                  state = c("state1", "state2", "state3", "state3", "state1"))

# summarise dat by state
sumDat <- ddply(.data = dat, .(state), .drop = FALSE, # missing combination kept
                summarise,
                name = name[num == min(num, na.rm = TRUE)][1],
                minNum = min(num, na.rm = TRUE))

# show output
kable(sumDat)
```
state | name | minNum |
---|---|---|
state1 | name1 | 3 |
state2 | name2 | 2 |
state3 | name4 | 2 |
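To see why the trailing `[1]` matters, here is a small base R sketch of the tie case mentioned in the notes, using `num = c(3, 2, 6, 2, 3)` so that name1 and name5 share the minimum in state1. It does not use plyr; it only reproduces the selection expression for a single group.

```r
name  <- c("name1", "name2", "name3", "name4", "name5")
num   <- c(3, 2, 6, 2, 3)   # name1 and name5 tie in state1
state <- c("state1", "state2", "state3", "state3", "state1")

# Restrict to one group, as ddply does for each state
in_state1 <- state == "state1"
grp_name  <- name[in_state1]
grp_num   <- num[in_state1]

# Without [1] the expression returns every name that hits the minimum
candidates <- grp_name[grp_num == min(grp_num, na.rm = TRUE)]
print(candidates)

# With [1] only the first tied record is kept, so each group yields one value
print(candidates[1])
```

Since the summary also contains `minNum`, which is always a single value per group, the `name` column has to be a single value too, and taking the first match is the simplest way to guarantee that.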
A quick way of simulation
This is a quick way of simulating exponential random variables, where each trial draws 8 variables and the rate is fixed at 1/5. A total of 500 trials are performed (`times` in the code below), and the mean and variance of each trial are calculated.
The simulation is done using `mapply`, which is a multivariate version of `sapply`. Then `apply` is used to obtain the sample properties.
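As a small illustration of how `mapply` relates to `sapply`, the sketch below runs both on a tiny case (3 draws per trial, 4 trials, sizes chosen only to keep the output short). With the same seed the two calls should produce the same matrix, with one column per trial. The last call shows what `mapply` adds: it can vary more than one argument at a time.

```r
rate <- 1/5

# mapply: the first vector supplies n for each call, rate is passed by name
set.seed(1)
a <- mapply(rexp, rep(3, times = 4), rate = rate)

# sapply: the same thing, written with an explicit wrapper function
set.seed(1)
b <- sapply(rep(3, times = 4), function(k) rexp(k, rate = rate))

print(dim(a))           # rows are draws, columns are trials
print(all.equal(a, b))

# Varying two arguments in parallel: sample size and rate per call
means <- mapply(function(k, r) mean(rexp(k, rate = r)), c(10, 10), c(1, 1/5))
print(means)
```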
```r
# set local variables
n <- 8
times <- 500
rate <- 1/5

set.seed(12347)
# mapply is a multivariate version of sapply
# it lets rexp be applied over a vector of sample sizes rather than a single value
mat <- mapply(rexp, rep(n, times = times), rate = rate)

# apply to get the mean and variance of each trial (one trial per column)
df <- as.data.frame(cbind(apply(mat, 2, mean), apply(mat, 2, var)))
names(df) <- c("mean", "var")

# show output
kable(head(df))
```
mean | var |
---|---|
5.755293 | 39.998841 |
7.378336 | 64.692667 |
12.036352 | 68.976558 |
3.454449 | 16.025874 |
5.289533 | 11.723384 |
3.641087 | 6.539533 |
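As a rough sanity check, the simulated trial means can be compared with theory. An exponential distribution with rate 1/5 has mean 1/rate = 5 and variance 1/rate^2 = 25, so the mean of 8 draws should have mean 5 and variance 25/8. The sketch below repeats the simulation with the same settings; `sample_means` is a name introduced here for illustration, and `colMeans` is used in place of `apply(mat, 2, mean)`.

```r
n <- 8
times <- 500
rate <- 1/5

set.seed(12347)
mat <- mapply(rexp, rep(n, times = times), rate = rate)

# one mean per trial (column)
sample_means <- colMeans(mat)

# centre of the trial means against the theoretical mean 1/rate
print(c(simulated = mean(sample_means), theoretical = 1 / rate))

# spread of the trial means against the theoretical variance (1/rate)^2 / n
print(c(simulated = var(sample_means), theoretical = (1 / rate)^2 / n))
```

With only 8 draws per trial the individual means vary a lot, as the table above shows, but averaged over all trials they should sit close to 5.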