
Setting up a DARWIN Data Science Environment

This post describes how to set up a data science environment for DARWIN R&D.
Whether you’re a Data Scientist, Quant, Trader, Investor, Researcher, Developer or just someone keen on putting the DARWIN asset class under a scientific microscope, the contents of this post should hopefully give you a sound start.
The tools, libraries and datasets referenced herein are free to download, and employed by the Darwinex Labs team itself in its day to day efforts.

For your convenience, the rest of this post is structured as follows:

  1. Data Science Environment (Requirements & Setup)
  2. Required Data Science Libraries/Packages
  3. DARWIN Datasets (where & how to get them)

Data Science Environment (Requirements & Setup)

At the end of the day, all researchers have their own preferred R&D stack. For the purposes of this post however, we’ve chosen Python, R and C++ as the programming language base for our environment.
Why?

  1. Python -> easy to understand, powerful programming language with a large base of core libraries for machine learning, AI and statistical research.
  2. R -> free, robust alternative to commercial statistical research environments like MATLAB.
  3. C++ -> for enhancing performance, particularly in cases of mathematically intense calculations on large datasets.

Readers are of course most welcome to either extend this or craft a different stack should they so wish.

Requirements

For each language in our data science environment, we need the following:

  1. Python: Anaconda® Distribution – a free package and environment manager for Python developers.
  2. R: Base R v3.3.2 or later, and RStudio Desktop – a fantastic IDE for code editing and visualization in R.
  3. C++: Rtools – for compiling external code modules in C++, for subsequent use in R when necessary.

Setup Instructions

  1. Python: Download and install the Anaconda Distribution, selecting Python v2.7. It ships with the Spyder IDE for code editing and visualization in Python, as well as Jupyter Notebook for compiling and sharing your research with colleagues, academia, etc.
  2. R: First download and install R v3.3.2 or later via the link above. Once installed, download and install RStudio via its link above.
  3. C++: Lastly, download and install Rtools v3.3.x (e.g. v3.3.2 if you downloaded R v3.3.2), for compiling external C++ code for use within R scripts.
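Once installation is complete, a quick way to confirm the Python side of the stack is ready is to check which of the core libraries are importable. The helper below is a convenience sketch (the function name is ours, not part of any official setup step):

```python
import importlib

def check_imports(names):
    """Return {package: version-or-None} for each requested package."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back otherwise
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found

print(check_imports(["numpy", "pandas", "scipy", "matplotlib", "sklearn"]))
```

Any package that prints as None is missing and needs to be (re)installed via Anaconda.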

Required Data Science Libraries / Packages

For Python:

We will initially require the following libraries:

  1. Pandas – for data analysis, processing, restructuring and cleansing.
  2. NumPy – for numerical and scientific computation using high-performance data structures and vectorized mathematics.
  3. SciPy – extends NumPy with additional algorithms for scientific computing, e.g. optimization, integration, interpolation and statistics.
  4. Matplotlib – for 2D and 3D graphics.
  5. scikit-learn – an extremely well-documented, robust and well-supported machine learning library.
Fortunately, all five ship with Anaconda and are installed by default when you install the Anaconda Distribution.
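As a quick smoke test that NumPy and Pandas are working together, the snippet below analyzes a simulated quote series (hypothetical random-walk data, standing in for a DARWIN's quote history):

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Simulate 250 daily quotes as a random walk starting at 100
# (hypothetical data, not real DARWIN quotes)
quotes = pd.Series(100 + np.random.normal(0, 0.5, 250).cumsum())

# Daily percentage returns -- a typical starting point for analysis
returns = quotes.pct_change().dropna()

print("observations:", len(returns))
print("mean daily return:", round(returns.mean(), 6))
```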

For R:

The following packages are essential to a lot of the research you’ll end up doing on DARWIN datasets:
R.utils, plotly, data.table, PerformanceAnalytics, TTR, xts, anytime, pracma, urca, forecast, tseries, stats, PortfolioAnalytics, RCurl, jsonlite, zoo, snow, sm, profr, proftools, MonteCarlo, microbenchmark, astsa, Rcpp, RcppArmadillo, RcppParallel, doParallel, inline, rbenchmark, knitr, plyr, corrplot, network, sna, ggplot2, GGally, xlsx.

Note: A convenient way to download and install all of these in your R data science environment is to run the following code in an Rscript terminal or in the RStudio console:

if (!require("pacman")) install.packages("pacman")

# Define the list of packages required for this project.
package.list <- c("R.utils", "plotly", "data.table", "PerformanceAnalytics",
                  "TTR", "xts", "anytime", "pracma", "urca", "forecast",
                  "tseries", "stats", "PortfolioAnalytics", "RCurl", "jsonlite",
                  "zoo", "snow", "sm", "profr", "proftools", "MonteCarlo",
                  "microbenchmark", "astsa", "Rcpp", "RcppArmadillo",
                  "RcppParallel", "doParallel", "inline", "rbenchmark", "knitr",
                  "plyr", "corrplot", "network", "sna", "ggplot2", "GGally", "xlsx")

# Summon Pacman! Installs anything missing, then loads everything.
pacman::p_load(char = package.list, install = TRUE, update = FALSE)
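For completeness, the same install-if-missing pattern can be expressed on the Python side using only the standard library and pip. This is a sketch (the function name is ours, not from any library), and Anaconda users may prefer conda instead:

```python
import importlib.util
import subprocess
import sys

def ensure_packages(packages):
    """Install via pip any package in `packages` that is not already importable.

    Returns the list of packages that had to be installed.
    """
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing
```

For example, `ensure_packages(["numpy", "pandas"])` would install only whichever of the two is absent.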

DARWIN Datasets (where and how to get them)

Once the steps above are completed successfully, all we need is a DARWIN dataset to begin!
At the present time, Quote data for DARWIN $DWC up to November 29, 2017 is available via the Darwinex Labs GitHub profile, in both Daily (D1) and 1-minute (M1) precision.
We periodically update this dataset on GitHub, so check back every week or so for updates. And yes, we are working on an API that will make accessing data on-demand a lot simpler (watch this space!).

You may download this data directly from GitHub in two ways:

1) Right-click & Save As on this link for 1-minute (M1) data and this link for Daily (D1) data, or
2) Execute the following code in an Rscript terminal or RStudio Console:
library(data.table)
DWC.M1.GitHub <- fread("https://github.com/darwinex/DarwinexLabs/blob/master/datasets/community-darwins/DWC.M1.QUOTES.29.11.2017.csv?raw=true", colClasses = "character")
DWC.D1.GitHub <- fread("https://github.com/darwinex/DarwinexLabs/blob/master/datasets/community-darwins/DWC.D1.QUOTES.29.11.2017.csv?raw=true", colClasses = "character")
Column #1 contains the Timestamp in POSIXct format, and Column #2 contains the Quote in the deepest available decimal precision.
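If you would rather work in Python, the same files can be loaded with Pandas. The helper below is a sketch (the function name is ours) that assumes the two-column layout described above, with timestamps stored as Unix epochs in milliseconds, as they are in the Quotes history:

```python
import pandas as pd

def load_quotes(source):
    """Read a DARWIN quotes CSV (index, timestamp, quote) into a DataFrame.

    Assumes the timestamp column holds Unix epochs in milliseconds (UTC).
    """
    df = pd.read_csv(source, index_col=0)
    df.columns = ["timestamp", "quote"]  # assumed layout, per the description above
    # Millisecond epochs -> timezone-aware UTC datetimes
    df["timestamp"] = pd.to_datetime(df["timestamp"].astype("int64"),
                                     unit="ms", utc=True)
    df["quote"] = df["quote"].astype(float)
    return df

# Example (append ?raw=true so GitHub serves the raw CSV):
# dwc_m1 = load_quotes("https://github.com/darwinex/DarwinexLabs/blob/master/"
#                      "datasets/community-darwins/DWC.M1.QUOTES.29.11.2017.csv?raw=true")
```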

Additional Resource: (Video) Setting up a DARWIN Data Science Environment

4 Comments

  • Martyn Tinsley
    Posted December 2, 2017 at 7:31 am

    Great post. Thank you.

  • Ricardo
    Posted December 19, 2017 at 11:26 pm

    Hi
    great news that Darwins data are available!
    I was trying in
    https://github.com/darwinex/DarwinexLabs/tree/master/datasets/trader-darwins/quotes
    file: OYA.4.2_QUOTES_Latest.csv
"","timestamp","quote"
"2","1314737940000","100.0"
"3","1314824340000","100.0"
    trying to convert the timestamp (Unix EPOCH in milliseconds) to date using R
    I get weird results:
> as.POSIXct(1314737940000, origin="1970-01-01")
[1] "43632-05-21 09:20:00 CEST"
library(anytime)
> anytime(1314824340000)
[1] "43635-02-15 08:20:00 CET"
    what is the correct way to convert the data?
    thanks

    • Post Author
      The Market Bull
      Posted December 22, 2017 at 5:22 pm

      Hi Ricardo!
Both as.POSIXct() and anytime() in R expect numeric timestamp inputs to be in seconds.
The timestamp precision in the Quotes history is in milliseconds (UTC timezone). Therefore, to perform a correct conversion, you will need to divide the values by 1000 and execute either:
as.POSIXct(1314737940000 / 1000, origin="1970-01-01", tz="UTC")
or
anytime(1314737940000 / 1000, tz="UTC")
      Hope this helps!

Leave a comment

