Working with DARWIN Time Series Data in R (MLD-II)

17 January 2018
The Market Bull

This post describes how to prepare and analyse OHLC time series objects in R, from DARWIN datasets available publicly on our GitHub profile.

Unlike the introductory posts in this series (see below) where we focused on environment configuration and fundamentals, from here on all concepts will be presented in a practical manner, with fully functional code examples in R, Python or both.

Note #1: While the material presented herein is designed for people familiar with the R programming language, consulting a good R tutorial like this one may assist those unfamiliar with R in learning from, following and implementing it.

The full code listing for this post is available on GitHub (under Tools -> R).

Note #2: If you missed the first two posts in this blog series, you may wish (especially if you are new to trading strategy development) to read them first via the following links:

  1. Setting up a DARWIN Data Science Environment
  2. Machine Learning on DARWIN Datasets (MLD-I)

 

For your convenience, the remainder of this post is categorized as follows:

  1. Required libraries for manipulating Time Series in R
  2. Importing DARWIN datasets from GitHub
  3. Inspecting Data Structure & Localizing Timezones
  4. Creating OHLC Time Series Objects
  5. Visualizing DARWIN Time Series
  6. Efficient Data Storage Practices
  7. Future posts – what’s to come!

1) Required libraries for manipulating Time Series in R

This post’s implementation will make use of the S3 infrastructure for regular and irregular time series (or Z‘s Ordered Observations), and an extension of the same, literally “eXtensible” Time Series.

These packages are available in R as zoo and xts respectively.

We’ll work with dates and times using the anytime library, manipulate data inside data tables using the data.table library, overlay technical indicators using the quantmod library, and perform some basic visualizations using the plotly library.

Finally, in the “efficient data storage practices” section, we’ll discuss performance considerations in using different I/O methods in R, and benchmark these as we go along. For this we will make use of the microbenchmark library.

The following snippet installs and loads all 7 of these in just a few simple lines of code:

# Disable scientific formatting
options(scipen=9999)

if (!require("pacman")) 
 install.packages("pacman")

libs.vector <- c("anytime",
 "data.table",
 "xts", "zoo", "quantmod",
 "plotly",
 "microbenchmark")

pacman::p_load(char = libs.vector,
 install=TRUE, 
 update=FALSE)

2) Importing DARWIN datasets from GitHub

DARWIN Time Series Datasets @ GitHub

DARWIN Time Series Datasets @ GitHub

Next, head over to our GitHub profile and download datasets you wish to work with.

For this post, we will download, unzip and use the CSV file inside DWC.M1.QUOTES.29.12.2017.csv.zip, available via the link:

https://github.com/darwinex/DarwinexLabs/blob/master/datasets/community-darwins/DWC.M1.QUOTES.29.12.2017.csv.zip

Assuming you’ve loaded the required libraries as above, downloaded and unzipped the CSV (comma separated values) file in a directory of your choice, open an R console and use the setwd(“<full path to directory here>”) command to navigate to it and execute the following line of code:

 

DWC.M1.QUOTES.29.12.2017.dt <- fread("DWC.M1.QUOTES.29.12.2017.csv", colClasses="numeric")

 

As both the columns of data in this CSV file (timestamp and quote) contain integer and floating point values, the colClasses parameter input to data.table::fread() is used to ensure that data types are overridden from character (text) in the CSV file to numeric format for future processing.

We now have DARWIN $DWC’s data available as rows of {timestamp, quote} values inside our DWC.M1.QUOTES.29.12.2017.dt data.table object.


3) Inspecting Data Structure & Localizing Timezones

Inspecting Data Structure & Localizing Timezones

Inspecting Data Structure & Localizing Timezones to UTC

It’s always a good idea to inspect time series data you’re working with, prior to conducting any analysis.

Beyond giving you a general idea of what the data’s structure looks like, this practice helps ensure that data types are as expected and any missing data is identified (and subsequently dealt with).

For now, we’ll run the following commands:

  1. dim() for displaying the data’s dimensions,
  2. class() for displaying the data type of each field,
  3. str() for displaying the complete structure of DWC.M1.QUOTES.dt (what dim() and class() tell us individually).

> dim(DWC.M1.QUOTES.29.12.2017.dt)
 [1] 230630 2

This tells us that DWC.M1.QUOTES.29.12.2017.dt is a 2D object containing 230,630 rows and 2 columns.


> class(DWC.M1.QUOTES.29.12.2017.dt)
[1] "data.table" "data.frame"

> class(DWC.M1.QUOTES.29.12.2017.dt$timestamp)
[1] "numeric"

> class(DWC.M1.QUOTES.29.12.2017.dt$quote)
[1] "numeric"

 

These commands tell us that DWC.M1.QUOTES.29.12.2017.dt is a data.table object, each column of which has a numeric data type.


> str(DWC.M1.QUOTES.29.12.2017.dt)
Classes ‘data.table’ and 'data.frame': 230630 obs. of 2 variables:
 $ timestamp: num 1514584740 1514584680 1514584620 1514584560 1514584500 ...
 $ quote : num 90.5 90.5 90.5 90.5 90.5 ...
 - attr(*, ".internal.selfref")=<externalptr>

 

This command gives us the full picture:

  1. DWC.M1.QUOTES.29.12.2017.dt is a “data.table” object.
  2. It contains 230,630 rows and 2 columns.
  3. The columns contain numeric values.
  4. The columns are called “timestamp” and “quote” respectively.
  5. The timestamp values appear to be UNIX epochs (in seconds).
  6. The quote values appear to have a floating point data type.

Timestamp values in all DARWIN time series datasets available on GitHub are in the UTC timezone.

The anytime library converts any supported input timestamp into POSIXct format, which defaults to your local system timezone. Therefore, it is important that a timezone be specified when working with timestamps in DARWIN datasets, as below:

 

# Localize numeric timestamps to UTC timezone.
DWC.M1.QUOTES.29.12.2017.dt$timestamp <- anytime(DWC.M1.QUOTES.29.12.2017.dt$timestamp, tz="UTC")

> head(DWC.M1.QUOTES.29.12.2017.dt)
             timestamp quote
1: 2017-12-29 21:59:00 90.50
2: 2017-12-29 21:58:00 90.50
3: 2017-12-29 21:57:00 90.50
4: 2017-12-29 21:56:00 90.50
5: 2017-12-29 21:55:00 90.50
6: 2017-12-29 21:54:00 90.53

4) Creating OHLC Time Series Objects

OHLC (Open, High, Low, Close) DARWIN Data

OHLC (Open, High, Low, Close) DARWIN Data

Conventional asset prices (e.g. in stocks, commodities, forex, etc) are usually available in OHLC (Open, High, Low, Close) format, as are DARWIN quotes on the DARWIN Exchange via the candlestick chart view.

Most price-based trading indicators and/or strategies tend to use at least one of these price values.

Hence, to overlay such indicators (or create trading strategies that rely on OHLC data) – on a DARWIN time-frame higher than 1-minute – we will need to first construct an OHLC time series for that desired time-frame.

For this example, we’ll use our 1-minute time series to construct a Daily OHLC time series.

Note: “DARWIN days” are from 21:00 the day before, to 20:59 of the current day.

Therefore, with our 1-minute data now pre-processed to contain POSIXct timestamps in UTC and quotes as numeric (floating point) values, we need to construct a Daily time series of Open, High, Low and Close prices accounting for this block of hours.

One way to achieve this is:

  1. Construct an eXtensible Time Series object from our 1-minute data.
  2. Convert it to an HOURLY OHLC time series first.
  3. Start processing from the first occurrence of hour 21 in the hourly series.
  4. Generate daily endpoints where each new day begins at hour 21.
  5. Generate the final DAILY series with these endpoints.

Note: The 5-step logic above deliberately uses as many steps as possible for every reader’s educational benefit. Please feel free to refine it for reduced iterations -> faster output.

The R code below implements the above.

Note: The function expects a data.table object of {timestamp, quote} values as input. Argument ts.type set to “open” outputs the time series with timestamps representing “Open Time”, any other value resulting in timestamps representing “Close Time”.


# OHLC Helper Function
Convert.toOHLC <- function(x) {
 op <- as.vector(first(x))
 hl <- range(x, na.rm = TRUE)
 cl <- as.vector(last(x))
 
 xts(cbind(Open = op, High = hl[2], Low = hl[1], Close = cl), end(x))
}

# Function to generate DAILY time series from 1-minute data.
Convert.DARWIN.To.D1.OHLC.XTS <- function(darwin.M1.dt,
 start.hour = 21,
 ts.type="open")
{
 # 1) Accept M1 data table and convert to 1-hour periodicity xts object
 ret.xts <- xts(x = as.numeric(.subset2(darwin.M1.dt, "quote")), 
 order.by = anytime(as.numeric(.subset2(darwin.M1.dt, "timestamp")), tz="UTC"))
 
 # Create H1 xts object
 ret.xts <- to.period(x = ret.xts,
 period = 'hours',
 OHLC = FALSE,
 indexAt = 'endof')
 
 ret.H1.zoo <- zoo(coredata(ret.xts), order.by = index(ret.xts))
 
 # Convert to xts
 y <- as.xts(ret.H1.zoo)
 
 # Find first occurence of start.hour
 first.y <- which(hour(index(y)) == start.hour)[1]
 
 # Set first observation to epoch (zero)
 .index(y) <- .index(y) - .index(y)[first.y]
 
 # Get endpoints for y by day
 ep <- endpoints(y, "days")
 
 ret.H1.zoo <- period.apply(ret.H1.zoo, ep, Convert.toOHLC)
 
 if(grepl("open", ts.type)) {
 # Lag the series by 1, making each timestamp the Open Time of the corresponding OHLC record.
 return(na.omit(lag(as.xts(ret.H1.zoo), -1))) 
 } else {
 return(as.xts(ret.H1.zoo)) 
 }
}

# Example output:
> temp.D1 <- Convert.DARWIN.To.D1.OHLC.XTS(DWC.M1.QUOTES.29.12.2017.dt, start.hour = 21, ts.type="open")
> head(temp.D1)
                     Open  High   Low Close
2017-04-03 20:59:00 93.64 94.48 93.64 93.93
2017-04-04 20:59:00 93.90 94.51 93.59 94.51
2017-04-05 20:59:00 94.68 94.74 93.80 94.11
2017-04-06 20:59:00 94.07 94.75 93.92 94.22
2017-04-07 20:59:00 94.23 94.23 93.57 93.84
2017-04-10 20:59:00 93.92 96.77 93.90 96.77

5) Visualizing DARWIN Time Series

With our M1 data converted into a Daily OHLC time series, we can now plot it in a variety of ways and explore the asset further.

For example,

To plot an OHLC candlestick chart using plotly, overlaying Bollinger Bands with default parameter values (20-period Simple Moving Average and 2 Standard Deviations), run the following code in your R console:


# Visualize XTS data as Candlestick Chart with Bollinger Bands
ts.visualize.DARWIN.xts <- function(darwin.D1.xts,
 chart.type="candlestick") {
 
 # chart.type = candlesticks | closes | returns
 
 df <- data.frame(Date=index(darwin.D1.xts),coredata(darwin.D1.xts))
 
 bb <- BBands(df[ , c("High", "Low", "Close")])
 df <- cbind(df, bb[, 1:3])
 
 if(grepl("candlestick", chart.type)) {
 print(df %>%
 plot_ly(name="DWC", x = ~Date, type="candlestick",
 open = ~Open, close = ~Close,
 high = ~High, low = ~Low) %>%
 
 add_lines(y = ~up , name = "Bollinger Bands",
 line = list(color = '#ccc', width = 0.5),
 legendgroup = "Bollinger Bands",
 hoverinfo = "none") %>%
 add_lines(y = ~dn, name = "B Bands",
 line = list(color = '#ccc', width = 0.5),
 legendgroup = "Bollinger Bands",
 showlegend = FALSE, hoverinfo = "none") %>%
 add_lines(y = ~mavg, name = "Mv Avg",
 line = list(color = '#E377C2', width = 0.5),
 hoverinfo = "none") %>%
 
 layout(title = "DARWIN OHLC Candlestick Chart",
 yaxis = list(title="DARWIN Quote"),
 legend = list(orientation = 'h', x = 0.5, y = 1,
 xanchor = 'center', yref = 'paper',
 font = list(size = 10),
 bgcolor = 'transparent'))) 
 
 } else if (grepl("closes", chart.type)) {
 print(df %>%
 plot_ly(x = ~Date, type="scatter", mode="lines",
 y = ~Close) %>%
 
 layout(title = "DARWIN OHLC Line Chart (Close Quotes)",
 yaxis=list(title="Closing Quote")) )
 
 } else if (grepl("returns", chart.type)) {
 
 rets <- (df$Close[2:nrow(df)] / df$Close[1:nrow(df)-1]) - 1
 
 print(
 plot_ly(x = df$Date, type="scatter", mode="lines",
 y = c(0, cumprod(1+rets)-1)*100) %>%
 
 layout(title = "DARWIN OHLC Line Chart (C.Returns)",
 yaxis=list(title="Cumulative Returns (%)")) )
 }
}

# Plot DARWIN $DWC's OHLC Candlestick Chart
ts.visualize.DARWIN.xts(Convert.DARWIN.To.D1.OHLC.XTS(DWC.M1.QUOTES.29.12.2017.dt), chart.type = "candlestick")

This should display the following chart:

DARWIN OHLC Candlestick Chart with Bollinger Bands

DARWIN OHLC Candlestick Chart with Bollinger Bands

 


Or to plot only cumulative returns, run:

ts.visualize.DARWIN.xts(Convert.DARWIN.To.D1.OHLC.XTS(DWC.M1.QUOTES.29.12.2017.dt), chart.type = “returns”)

DARWIN Cumulative Returns Chart

DARWIN Cumulative Returns Chart

 

All code referenced in this post is available in a dedicated code listing on our GitHub profile, so feel free to download and modify as necessary to implement any number of technical indicator overlays, or anything else you wish.


6) Efficient Data Storage Practices

Working with thousands of DARWIN datasets in CSV format can be a fairly slow process for more complex efforts such as algorithmic DARWIN portfolio creation, segmentation, etc.

Though this post’s material is fairly elementary, our choice of data storage formats and I/O tools will make a significant performance difference in future posts where we embark on more sophisticated work involving for example, deep learning, real-time portfolio rebalancing, etc.

It is therefore good practice to store data in an appropriate format for the task at hand, and use efficient I/O tools from the get go.

For example, let’s benchmark the performance of the following R functions:

  1. write.zoo()
  2. saveRDS()
  3. readRDS()
  4. read.table()

To do this, run the following code snippet in your R console:

DWC.D1.QUOTES.OHLC.xts <- Convert.DARWIN.To.D1.OHLC.XTS(DWC.M1.QUOTES.dt)

test.IO.funcs <- function() {
 microbenchmark(
 write.zoo(DWC.D1.QUOTES.OHLC.xts, "DWC.D1.QUOTES.OHLC.xts.csv", sep=","),
 saveRDS(DWC.D1.QUOTES.OHLC.xts, "DWC.D1.QUOTES.OHLC.xts.rds"),
 
 readRDS("DWC.D1.QUOTES.OHLC.xts.rds"),
 read.table("DWC.D1.QUOTES.OHLC.xts.csv", header=TRUE, sep=",")
 ) 
}
test.IO.funcs()

Output:

> test.IO.funcs()
Unit: microseconds
                                                                            expr      min       lq      mean    median        uq       max neval
 write.zoo(DWC.D1.QUOTES.OHLC.xts, "DWC.D1.QUOTES.OHLC.xts.csv",      sep = ",") 6953.374 7353.838 8187.7620 7674.7640 8297.2715 22585.929   100
                   saveRDS(DWC.D1.QUOTES.OHLC.xts, "DWC.D1.QUOTES.OHLC.xts.rds") 1938.971 2120.158 2279.6689 2200.0930 2374.3710  3932.811   100
                                           readRDS("DWC.D1.QUOTES.OHLC.xts.rds")  281.451  333.952  414.5268  403.8215  494.6115   754.746   100
              read.table("DWC.D1.QUOTES.OHLC.xts.csv", header = TRUE, sep = ",") 1738.838 2022.656 2299.9467 2198.3170 2394.5030  5117.430   100

A quick glance at the mean time indicates that:

  1. saveRDS (which serializes R objects for storage/loading) is approximately 4 times as fast as write.zoo()
  2. readRDS (which loads serialized R objects from storage) is approximately 6 times as fast as read.table()

Based on these results, it becomes immediately clear that in the very least, we should consider saving DARWIN time series data in RDS format for future use.

It is also possible that loading all available DARWIN datasets in 1-minute precision, could exhaust available System RAM. In such cases, partial I/O and data chunking will become necessary.

We will address this subject in future posts when we make a case for using the HDF5 data format for DARWINs, which has widespread adoption in the institutional finance community (hedge funds, investment banks, etc), particularly in the high frequency trading (HFT) space.


7) Future posts – what’s to come!

What can you expect in future?

What can you expect in future?

This post described how to acquire and process DARWIN time series data from GitHub, in R.

Future posts will discuss:

  1. Time series modelling in further detail,
  2. Applications of machine learning to the DARWIN asset class,
  3. Advanced infrastructure requirements (e.g. Amazon AWS and/or GPU hardware as part of our stack),
  4. .. and more.

Code listing on GitHub:

File “DLABS-BP-MLD-2.R” under tools -> R


Additional Resource: Setting up a DARWIN Data Science Environment