Setting up a DARWIN Data Science Environment
This post describes how to set up a data science environment for DARWIN R&D.
Whether you’re a Data Scientist, Quant, Trader, Investor, Researcher, Developer or just someone keen on putting the DARWIN asset class under a scientific microscope, the contents of this post should hopefully give you a sound start.
The tools, libraries and datasets referenced herein are free to download, and are employed by the Darwinex Labs team itself in its day-to-day efforts.
For your convenience, the rest of this post is structured as follows:
- Data Science Environment (Requirements & Setup)
- Required Data Science Libraries / Packages
- DARWIN Datasets (where and how to get them)
Data Science Environment (Requirements & Setup)
At the end of the day, all researchers have their own preferred R&D stack. For the purposes of this post, however, we’ve chosen Python, R and C++ as the programming language base for our environment.
- Python -> easy to understand, powerful programming language with a large base of core libraries for machine learning, AI and statistical research.
- R -> free, robust alternative to commercial statistical research environments like MATLAB.
- C++ -> for enhancing performance, particularly in cases of mathematically intense calculations on large datasets.
Readers are of course most welcome to either extend this or craft a different stack should they so wish.
For each language in our data science environment, we need the following:
- Python: Anaconda® Distribution – a free package and environment manager for Python developers.
- R: Base R v3.3.2 or later, and RStudio Desktop – a fantastic IDE for code editing and visualization in R.
- C++: Rtools for compiling external code modules in C++, for subsequent use in R when necessary.
To set each of these up:
- Python: Download and install the Anaconda Distribution, selecting Python v2.7. It ships with the Spyder IDE for code editing and visualization in Python, as well as Jupyter Notebook for compiling and sharing your research with colleagues, academia, etc.
- R: First download and install R v3.3.2 or later via the link above. Once installed, download and install RStudio via its link above.
- C++: Lastly, download and install the Rtools release that matches your R version (e.g. Rtools 3.3 if you downloaded R v3.3.2), for compiling external C++ code for use within R scripts.
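Once the three installs finish, a quick check that the command-line tools are reachable can save some debugging later. The sketch below is only an illustration and not part of the original setup. It assumes Python 3.3 or later (`shutil.which` does not exist in Python 2.7), and it assumes the executables are called `conda`, `Rscript` and `g++` and sit on your PATH, which may not hold on every system (on Windows, for instance, the Rtools folder may need to be added to PATH manually).

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
TOOLS = ["conda", "Rscript", "g++"]


def find_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


print("Python version:", sys.version.split()[0])
for tool, path in find_tools(TOOLS).items():
    print("{:<8} {}".format(tool, path if path else "NOT FOUND on PATH"))
```

A tool reported as not found is not necessarily missing; it may simply be installed somewhere that is not on PATH.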
Required Data Science Libraries / Packages
For Python, we will initially require the following libraries:
- Pandas: for data analysis, processing, restructuring and cleansing.
- NumPy: for numerical and scientific computation using high-performance data structures and vectorized mathematics.
- SciPy: extends NumPy with additional functionality and algorithms for data manipulation and visualization.
- Matplotlib: for 2D and 3D graphics.
- scikit-learn: an extremely well-documented, robust and well-supported machine learning library.
Fortunately, all five ship with the Anaconda Distribution and are installed by default.
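To confirm that the libraries are actually importable in your environment, something along these lines can help. It is a minimal sketch rather than an official check; the helper name `check_libraries` is made up for this example, and the import names used are the standard ones (scikit-learn is imported as `sklearn`).

```python
import importlib

# Import names for the five libraries (scikit-learn imports as "sklearn").
LIBRARIES = ["pandas", "numpy", "scipy", "matplotlib", "sklearn"]


def check_libraries(names):
    """Return {name: version string, "unknown", or None if the import fails}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report


for name, version in check_libraries(LIBRARIES).items():
    print("{:<12} {}".format(name, version if version else "MISSING"))
```

Anything reported as MISSING can usually be added with `conda install` followed by the package name.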
For R, the following packages are essential to much of the research you’ll end up doing on DARWIN datasets:
R.utils, plotly, data.table, PerformanceAnalytics, TTR, xts, anytime, pracma, urca, forecast, tseries, stats, PortfolioAnalytics, RCurl, jsonlite, zoo, snow, sm, profr, proftools, MonteCarlo, microbenchmark, astsa, Rcpp, RcppArmadillo, RcppParallel, doParallel, inline, rbenchmark, knitr, plyr, corrplot, network, sna, ggplot2, GGally, xlsx.
Note: A convenient way to download and install all of these in your R data science environment is to run the following code in an Rscript terminal or in the RStudio console:
if (!require("pacman")) install.packages("pacman")
# Define list of packages required for this project.
package.list <- c("R.utils", "plotly", "data.table", "PerformanceAnalytics",
                  "TTR", "xts", "anytime", "pracma", "urca", "forecast", "tseries", "stats", "PortfolioAnalytics",
                  "RCurl", "jsonlite", "zoo", "snow", "sm", "profr", "proftools",
                  "MonteCarlo", "microbenchmark", "astsa", "Rcpp", "RcppArmadillo", "RcppParallel",
                  "doParallel", "inline", "rbenchmark", "knitr", "plyr", "corrplot",
                  "network", "sna", "ggplot2", "GGally", "xlsx")
# Summon Pacman!
pacman::p_load(char=package.list, install=TRUE, update=FALSE)
DARWIN Datasets (where and how to get them)
Once the steps above are completed successfully, all we need is a DARWIN dataset to begin!
We periodically update this dataset on GitHub, so check back every week or so for updates. And yes, we are working on an API that will make accessing data on demand a lot simpler (watch this space!).
You may download this data directly from GitHub in two ways:
2) Execute the following code in an Rscript terminal or RStudio Console:
library(data.table)  # provides fread()
DWC.M1.GitHub <- fread("https://github.com/darwinex/DarwinexLabs/blob/master/datasets/community-darwins/DWC.M1.QUOTES.29.11.2017.csv?raw=true", colClasses="character")
DWC.D1.GitHub <- fread("https://github.com/darwinex/DarwinexLabs/blob/master/datasets/community-darwins/DWC.D1.QUOTES.29.11.2017.csv?raw=true", colClasses="character")
Column #1 contains the Timestamp in POSIXct format, and Column #2 contains the Quote in the deepest available decimal precision.
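For readers working in the Python half of the stack, the same two-column layout can be parsed with the standard library alone. The following is a rough sketch under several assumptions: Python 3, a comma-separated file with a header row, and timestamps stored as epoch seconds. The sample rows and column names are invented purely for illustration and are not real DARWIN quotes. Timestamps are left as strings, and quotes are converted to `Decimal` so that no decimal precision is lost.

```python
import csv
import io
from decimal import Decimal


def parse_quotes(text, has_header=True):
    """Parse two-column CSV text into (timestamp, quote) tuples.

    Timestamps stay as strings (much like colClasses="character" in R);
    quotes become Decimal so that no decimal digits are lost.
    """
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    return [(row[0], Decimal(row[1])) for row in reader if len(row) >= 2]


# Made-up sample in the assumed layout; NOT real DARWIN quotes.
sample = "timestamp,quote\n1511913600,100.123456\n1511913660,100.125001\n"

for timestamp, quote in parse_quotes(sample):
    print(timestamp, quote)
```

To use it on a downloaded file, read the file's contents into a string and pass that to `parse_quotes`, setting `has_header=False` if the file turns out to have no header row.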