As mentioned in an earlier post, things that are not easy in R can be relatively simple in other languages. Another example is connecting to Amazon Web Services. For S3, although there are a number of existing packages, many of them seem to be deprecated, immature or platform-dependent. (The cloudyr project looks promising, though.)

If there isn’t a comprehensive R way of doing something yet, it may be necessary to create it from scratch. There are several options for doing so: using the AWS Command Line Interface, calling the AWS REST API, or wrapping the functionality of another language.

This post gives a quick summary of the last option, using Python, by introducing the rs3helper package.

The reasons why I’ve come up with a package are as follows.

  • Firstly, Python is relatively easy to learn and it has quite a comprehensive interface to Amazon Web Services - boto.
  • Secondly, in order to call Python from R, the rPython package may be used, but only if UNIX-like platforms are targeted. For cross-platform functionality, a system command has to be executed instead.
  • Finally, because the Python scripts are run through a system command, it wouldn’t be stable to keep the source files in arbitrary local folders, so it’d be necessary to keep them in a package.

I use Python 2.7 and the boto library can be installed easily using pip by executing pip install boto.

Using RStudio, it is not that complicated to develop a package (see R packages by Hadley Wickham). Even the folder structure and necessary files are generated if the project type is selected as R Package. R script files should be located in the R folder, while Python scripts should be in inst/python.

In the package, the S3-related R functions exist in R/s3utils.R while the corresponding Python scripts are in inst/python - all Python functions are in inst/python/s3helper.py. As the Python function outputs should be passed to R, each function returns a response variable, which is converted into a JSON string. The response variable is a Python list, dictionary or list of dictionaries, and thus it is parsed as an R vector, list or data frame respectively.
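To make the three response shapes concrete, here is a minimal sketch of how each one serializes to a JSON string. The variable names, the to_json() helper and all values are hypothetical illustrations, not taken from rs3helper.

```python
import json

# Three response shapes: a list, a dictionary and a list of dictionaries.
# All values here are hypothetical illustrations.
as_list = ["bucket-a", "bucket-b"]
as_dict = {"bucket": "bucket-a", "is_exist": True, "message": None}
as_records = [
    {"key": "a.csv", "size": 10},
    {"key": "b.csv", "size": 20},
]


def to_json(response):
    """Serialize a response to the JSON string that a script would print."""
    # sort_keys only makes the output deterministic for this demonstration.
    return json.dumps(response, sort_keys=True)


for response in (as_list, as_dict, as_records):
    print(to_json(response))
# ["bucket-a", "bucket-b"]
# {"bucket": "bucket-a", "is_exist": true, "message": null}
# [{"key": "a.csv", "size": 10}, {"key": "b.csv", "size": 20}]
```

Note that Python's None and True become null and true in JSON. On the R side, jsonlite would by default read the first string as a character vector, the second as a named list and the third as a data frame.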

An example of the wrapper functions, which looks up a bucket, is shown below.

Python: connect_to_s3() and lookup_bucket() are imported into inst/python/lookup_bucket.py from inst/python/s3helper.py. The script takes four arguments (three mandatory and one optional) and prints the response after converting it into a JSON string.

## in inst/python/s3helper.py
import boto
import boto.exception
from boto.s3.connection import OrdinaryCallingFormat

def connect_to_s3(access_key_id, secret_access_key, region=None):
    # return an S3 connection, or None if the connection cannot be set up
    try:
        if region is None:
            conn = boto.connect_s3(access_key_id, secret_access_key)
        else:
            conn = boto.s3.connect_to_region(
                region_name=region,
                aws_access_key_id=access_key_id,
                aws_secret_access_key=secret_access_key,
                calling_format=OrdinaryCallingFormat()
            )
    except boto.exception.AWSConnectionError:
        conn = None
    return conn

def lookup_bucket(conn, bucket_name):
    # 'is_exist' is None whenever existence could not be determined
    if conn is not None:
        try:
            # conn.lookup() returns None if the bucket is not found
            bucket = conn.lookup(bucket_name)
            if bucket is not None:
                response = {'bucket': bucket_name, 'is_exist': True, 'message': None}
            else:
                response = {'bucket': bucket_name, 'is_exist': False, 'message': None}
        except boto.exception.S3ResponseError as err:
            # use the status and reason attributes rather than indexing the exception
            response = {'bucket': bucket_name, 'is_exist': None, 'message': 'S3ResponseError = {0} {1}'.format(err.status, err.reason)}
        except Exception:
            response = {'bucket': bucket_name, 'is_exist': None, 'message': 'Unhandled error occurs'}
    else:
        response = {'bucket': bucket_name, 'is_exist': None, 'message': 'connection is not made'}
    return response

## in inst/python/lookup_bucket.py
import json
import argparse

from s3helper import connect_to_s3, lookup_bucket

parser = argparse.ArgumentParser(description='lookup a bucket')
parser.add_argument('--access_key_id', required=True, type=str, help='AWS access key id')
parser.add_argument('--secret_access_key', required=True, type=str, help='AWS secret access key')
parser.add_argument('--bucket_name', required=True, type=str, help='S3 bucket name')
parser.add_argument('--region', required=False, type=str, help='Region info')

args = parser.parse_args()

conn = connect_to_s3(args.access_key_id, args.secret_access_key, args.region)
response = lookup_bucket(conn, args.bucket_name)

print(json.dumps(response))
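
As a quick check of how the mandatory and optional arguments behave, the parser can be exercised with an explicit argument list instead of the real command line. This is only a sketch: build_parser() is a hypothetical helper that repeats the definitions above, and the key, secret and region values are placeholders.

```python
import argparse


def build_parser():
    """Recreate the parser from lookup_bucket.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='lookup a bucket')
    parser.add_argument('--access_key_id', required=True, type=str)
    parser.add_argument('--secret_access_key', required=True, type=str)
    parser.add_argument('--bucket_name', required=True, type=str)
    parser.add_argument('--region', required=False, type=str)
    return parser


parser = build_parser()

# Placeholder credentials; nothing is sent to AWS here.
base = ['--access_key_id', 'KEY', '--secret_access_key', 'SECRET',
        '--bucket_name', 'rs3helper']

without_region = parser.parse_args(base)
with_region = parser.parse_args(base + ['--region', 'ap-southeast-2'])

print(without_region.region)  # None
print(with_region.region)     # ap-southeast-2
```

When --region is omitted, args.region is None, which lines up with the region default of connect_to_s3() and sends it down the boto.connect_s3() branch.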

R: lookup_bucket() finds the path where inst/python/lookup_bucket.py is installed and constructs the command to be executed by system() - the intern argument should be TRUE to grab the printed JSON string. Then it parses the returned JSON string into an R object using the jsonlite package.

lookup_bucket <- function(access_key_id, secret_access_key, bucket_name, region = NULL) {
  if (bucket_name == '') stop('bucket_name: expected one argument')

  # locate the Python script bundled with the installed package
  path <- system.file('python', 'lookup_bucket.py', package = 'rs3helper')
  # quote the path and values so that spaces or special characters survive the shell
  command <- paste('python', shQuote(path),
                   '--access_key_id', shQuote(access_key_id),
                   '--secret_access_key', shQuote(secret_access_key),
                   '--bucket_name', shQuote(bucket_name))
  if (!is.null(region)) command <- paste(command, '--region', shQuote(region))

  # intern = TRUE captures the printed JSON string instead of the exit status
  response <- system(command, intern = TRUE)
  tryCatch({
    fromJSON(response)
  }, error = function(err) {
    # fall back to the raw output if it is not valid JSON
    warning('fails to parse JSON response')
    response
  })
}
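
The same round trip (run a command, capture what it prints, parse it as JSON, fall back to the raw text on failure) can be sketched in Python for comparison. This is an illustration only: run_and_parse() and CHILD_CODE are hypothetical names, and the child process is a stand-in that prints a fixed payload rather than calling AWS.

```python
import json
import subprocess
import sys

# Stand-in for inst/python/lookup_bucket.py: a child process that prints
# one JSON string. The payload is hypothetical; no AWS call is made.
CHILD_CODE = (
    "import json; "
    "print(json.dumps({'bucket': 'rs3helper', 'is_exist': True, 'message': None}))"
)


def run_and_parse(command):
    """Run a command, capture its stdout and parse it as JSON.

    Returns the raw text when parsing fails, similar to the tryCatch()
    fallback in the R function.
    """
    raw = subprocess.check_output(command).decode("utf-8")
    try:
        return json.loads(raw)
    except ValueError:
        return raw


result = run_and_parse([sys.executable, "-c", CHILD_CODE])
print(result["is_exist"])  # True
```

Passing the command as a list of arguments avoids shell quoting issues altogether, which is the concern that pasting a single command string for system() has to deal with.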

A quick example of running this function is shown below.

if (!require("devtools"))
  install.packages("devtools")
devtools::install_github("jaehyeon-kim/rs3helper")

library(rs3helper)
library(jsonlite) # not sure why it is not loaded in the first place

lookup_bucket('access-key-id', 'secret-access-key', 'rs3helper')
## $is_exist
## [1] TRUE
## 
## $message
## NULL
## 
## $bucket
## [1] "rs3helper"