Machine Learning on DARWIN Datasets (MLD-I)
Machine learning, in essence, is the research and application of algorithms that help us better understand data.
By leveraging statistical learning techniques, practitioners can draw meaningful inferences from data and turn it into actionable intelligence.
Furthermore, the availability of several open source machine learning tools, platforms and libraries today lowers the barrier to entry into this field, giving anyone access to a plethora of powerful algorithms for discovering exploitable patterns in data and predicting future outcomes.
This development in particular has given rise to a new wave of DIY retail traders, who create sophisticated trading strategies that compete with (and in some cases outperform) others in a space previously dominated by institutional participants.
In this introductory blog post, we will discuss the case for machine learning and its main categories. In doing so, we will lay the foundation for using machine learning techniques to create DARWIN trading strategies in future blog posts in this series.
For your convenience, this post is structured as follows:
1) The Case for Machine Learning
2) Three Main Categories of Machine Learning
3) Setting up Python/R & C++ for Machine Learning on DARWIN Datasets
1) The Case for Machine Learning
We live in an age where both structured and unstructured data are available in abundance. Not only that, people now also have the tools and resources to gather this data for themselves if they so wish (at little to no cost), a reality that did not exist before.
Over time, machine learning has evolved into a robust means of capturing knowledge from, analyzing and building predictive models for large volumes of data, more scalably and efficiently than manual, human-driven practices. In doing so, it has also enabled practitioners to iteratively improve upon existing models and incorporate data-driven decision-making in their pursuits.
Apart from its widespread use in finance, machine learning has also given rise to technologies that many now take for granted:
- Email SPAM filters,
- Video recommendation engines,
- Personalized advertising,
- Internet search engines,
- Industrial robotics (e.g. in the automobile industry),
- ..and even self-driving cars!
The DARWIN dataset (a multivariate time series) can therefore benefit from machine learning-led research, and that’s exactly what this series of blog posts aims to lay the groundwork for.
In fact, there exists an ever-growing number of DARWIN assets on our Exchange that are powered entirely by machine learning-driven trading strategies. We discuss the three main categories of machine learning next.
2) Three Main Categories of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
An argument can indeed be made for a fourth category, Deep Reinforcement Learning, which combines deep neural networks (most familiar from supervised learning) with reinforcement learning practices (more on this in future posts).
We will now discuss the key differences between these three types, and with the help of examples, develop an understanding of their practical applications.
1) Supervised Learning
In supervised learning, our aim is to “learn” a predictive model from “labeled” training data. The learned model is assessed for its ability to generalize well to unseen data, after which it can be used to predict outcomes based on future unseen data.
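One simple way to assess generalization on time series data is to hold out the most recent observations. The sketch below is a minimal illustration, not a prescribed method for DARWIN data: the helper name `chronological_split` and the 80/20 split are assumptions made for the example, and the rows are placeholder integers.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into (train, test) without shuffling.

    Hypothetical helper: the oldest rows are used for training and the
    newest rows are held out as "unseen" data.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten placeholder observations, oldest first (stand-ins for real rows).
rows = list(range(10))
train, test = chronological_split(rows, train_fraction=0.8)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

Keeping the split chronological avoids training on observations that occur after the ones being predicted.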
There are two main sub-categories of supervised learning:
- Regression
- Classification
1.1) Regression
The term regression was coined by the English statistician Sir Francis Galton in 1886, in an article titled Regression Towards Mediocrity in Hereditary Stature. In it, he described how the children of unusually tall or short parents tended to be closer to the population mean in height than their parents were; in other words, their heights regressed towards the mean.
In regression tasks, we first learn a relationship between the following from existing data, and then use it to predict continuous outcomes for unseen data:
- Continuous predictor variables (e.g. historical scores of a DARWIN’s investment attributes and underlying strategy data), and
- A continuous response or target variable (i.e. the corresponding DARWIN Quote).
For example, one possible training set using DARWIN data could have the following structure:
Timestamp | Ex | Mc | Rs | Ra | Os | Cs | R+ | R- | Dc | La | Pf | Cp | uVar | oOrd | dLev | Quote
..where uVar = Underlying Strategy VaR (%), oOrd = Open Orders and dLev = D-Leverage.
In this example, the Quote represents our response or target variable, and the rest our predictor variables. However, there is nothing stopping us from considering any other variable as our target variable.
For example, a study could switch from attempting to predict a DARWIN’s next Quote (for trade entry purposes) to, say, predicting the next La (for forecasting loss aversion). Several possibilities exist depending on the problem one is trying to solve.
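To make the choice of target variable concrete, here is a minimal sketch that separates one record into predictors and a target. The column names come from the structure shown above (Timestamp is left out for brevity); the helper `split_features_target` is hypothetical, and the numbers are placeholders rather than real DARWIN data.

```python
# Column names taken from the example structure above (Timestamp omitted).
COLUMNS = ["Ex", "Mc", "Rs", "Ra", "Os", "Cs", "R+", "R-", "Dc", "La",
           "Pf", "Cp", "uVar", "oOrd", "dLev", "Quote"]


def split_features_target(row, target="Quote"):
    """Return (predictors, target_value) for one record given as a dict."""
    if target not in row:
        raise KeyError(f"unknown target column: {target}")
    predictors = {name: value for name, value in row.items() if name != target}
    return predictors, row[target]


# One made-up record: each column simply holds its own index as a float.
row = {name: float(i) for i, name in enumerate(COLUMNS)}

X_quote, y_quote = split_features_target(row)           # predict the Quote
X_la, y_la = split_features_target(row, target="La")    # predict La instead
print(y_quote, y_la)  # 15.0 9.0
```

Note that when La becomes the target, the Quote moves over to the predictor side.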
In all cases, supervised machine learning attempts to find relationships between the predictor variables that “explain” the data, and the target variable (the output).
The following image illustrates one of the most basic forms of regression tasks, a linear regression.
In this example, a straight line is “fit” to training data containing predictor values (x) and response values (y), such that the distance between the data points and the line is minimized (in ordinary least squares, the sum of squared vertical distances). The resulting slope and intercept of the line can then be used to predict the outputs (y) of future unseen data (x).
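As a rough sketch of the idea, the closed-form ordinary least squares solution for a single predictor can be written in a few lines of plain Python. The function names and the five data points are invented for illustration; in practice a numerical library would typically be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for an unseen x using the fitted line."""
    return slope * x + intercept


# Made-up training data, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))      # 1.97 1.09
print(round(predict(6.0, slope, intercept), 2))  # 12.91
```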
Future blog posts in this series will cover the details, mathematical notation and how to perform regression tasks on DARWIN datasets in Python and R, with sample code.
1.2) Classification
In classification, the second sub-category of supervised machine learning, our task is to predict which discrete group (or “class”) unseen data belongs to.
As in regression analysis, the predictive model is once again “learned” from a training set where predictor variables and their corresponding target variable have already been provided to us. Only in this instance, the target variable is not a continuous numeric value, but one of a fixed set of class labels or groups.
Using the example given above in the discussion on regression analysis applied to DARWIN data, a classification approach could modify the problem from predicting a continuous output (DARWIN Quote), to a binary output (UP or DOWN).
The predictive model in this case would then be used to predict the DARWIN’s next movement (UP or DOWN) as opposed to a numeric value for its next forecast Quote (or any other target variable depending on the problem being attempted).
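One possible way to derive such binary labels from a series of Quotes is sketched below. The function name `direction_labels` is hypothetical, the quotes are made up, and treating an unchanged Quote as DOWN is an arbitrary assumption made only to keep the example binary.

```python
def direction_labels(quotes):
    """Label each step "UP" if the next quote is higher, otherwise "DOWN".

    Assumption: an unchanged quote counts as "DOWN" to keep two classes.
    """
    return ["UP" if nxt > cur else "DOWN"
            for cur, nxt in zip(quotes, quotes[1:])]


# Made-up quotes, oldest first.
quotes = [100.0, 100.4, 100.1, 100.1, 100.9]
labels = direction_labels(quotes)
print(labels)  # ['UP', 'DOWN', 'DOWN', 'UP']
```

A series of n quotes yields n - 1 labels, since the last quote has no successor to compare against.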
However, binary classification is not a must. A predictive model will classify unseen data based on class labels (groups) observed in the training set, thereby also permitting multi-class classification.
If the training set of DARWIN data contained rows of attribute scores for predictors (as in the regression example above) and the class labels UP, DOWN, SIDEWAYS, BREAKOUT and STAY-OUT for targets, a robust predictive model could then “classify” future unseen data as one of these classes. Possible use cases include forecasting direction and volatility, risk management, etc.
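To illustrate multi-class classification, the sketch below uses a 1-nearest-neighbour rule, chosen here only because it is short; it is not a method this series commits to. The two features per row and their values are invented, and the function name is hypothetical.

```python
import math


def nearest_neighbour_predict(train_X, train_y, x):
    """Return the label of the training row closest to x (Euclidean distance)."""
    if not train_X or len(train_X) != len(train_y):
        raise ValueError("training rows and labels must be non-empty and aligned")
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


# Made-up training set: two invented features per row, one class label each.
train_X = [(0.9, 0.1), (-0.8, 0.2), (0.0, 0.1), (0.1, 0.9)]
train_y = ["UP", "DOWN", "SIDEWAYS", "BREAKOUT"]

print(nearest_neighbour_predict(train_X, train_y, (0.7, 0.2)))    # UP
print(nearest_neighbour_predict(train_X, train_y, (-0.1, 0.15)))  # SIDEWAYS
print(nearest_neighbour_predict(train_X, train_y, (0.0, 1.0)))    # BREAKOUT
```

The model can only ever output labels it has seen in the training set, which is why the set of classes is fixed by the training data.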
Future posts in this series will cover the details, mathematical notation and how to perform classification tasks on DARWIN datasets in Python and R, with sample code.
2) Unsupervised Learning
Unlike supervised machine learning, where a training set contains predictors and the true outcomes of a target variable, unsupervised learning works with unlabeled data whose structure is unknown.
Unsupervised learning techniques can be used to study this unknown structure, in an attempt to explore and extract valuable intelligence for a variety of predictive modeling purposes.
There are two main sub-categories of unsupervised learning:
- Clustering
- Dimensionality Reduction
2.1) Clustering
Clustering is an unsupervised learning technique that enables practitioners to take data with unknown structure and assemble it into meaningful classes or clusters.
Unlike supervised classification problems where training data will enable the “learning” of underlying relationships from already available ground truths, clustering algorithms will assemble data of unknown structure into classes without any previous knowledge of underlying relationships.
Each class or cluster arrived at essentially includes a set of observations that are quite similar to each other, but dissimilar to observations found in other clusters. This makes clustering a great approach to extracting meaningful intelligence from input data.
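As one concrete instance of clustering, here is a bare-bones k-means sketch. It assumes the number of clusters k is chosen in advance, and it seeds the centroids with the first k points purely to keep the example deterministic; real implementations typically use randomised initialisation. The points are made up.

```python
import math


def k_means(points, k, iterations=10):
    """Minimal k-means: returns (labels, centroids).

    Simplification: the first k points serve as the initial centroids.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


# Made-up 2D observations forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8),
          (5.2, 4.9), (0.8, 1.1), (4.9, 5.3)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

No labels are supplied anywhere: the grouping emerges only from how close the observations are to one another.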
Some of the many motivations for utilizing unsupervised learning in finance include data cleansing, portfolio selection, de-noising and detecting regime change.
The following image illustrates how clustering algorithms can be deployed on data with unknown structure, and yield finite numbers of clusters based on the similarity of predictor data:
Future posts in this series will explore and implement possible applications of clustering to DARWIN data, such as dynamic portfolio selection, custom filter creation and the determination of seasonality in DARWIN risk profiles, to name a few ideas.
Working code in Python/R/C++ will also be provided alongside any implementations arrived at.
2.2) Dimensionality Reduction
Data of large dimensions can present challenges in terms of storage, computational efficiency (especially in real-time – an important consideration for trading algorithms) and performance.
Combining 12 investment attributes for each DARWIN, across over 2,500 DARWINs (as of 07 December, 2017 12:30 GMT), with the multitude of underlying strategy parameters available, as well as any additional feature engineering can quickly give rise to situations where a dimensionality reduction exercise may be warranted.
Dimensionality reduction is useful for:
- Reducing data from a high-dimensional to a lower-dimensional representation, such that most of the important information in it is retained.
- Visualization exercises where high-dimensional data is projected onto 1D to 3D space for subsequent rendering in standard statistical charts for analysis.
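Principal component analysis (PCA) is one common dimensionality reduction technique. The sketch below, offered only as an illustration, finds the first principal component by power iteration and projects 3D rows onto it. The function name is hypothetical, and the data is made up and perfectly collinear so that the result can be checked by hand.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, 1D projections) for the top principal component.

    Assumes the starting vector is not orthogonal to that component.
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # the data has no variance at all
            break
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, projections


# Made-up rows lying exactly along the direction (1, 2, 2).
rows = [(1.0, 2.0, 2.0), (2.0, 4.0, 4.0), (3.0, 6.0, 6.0), (4.0, 8.0, 8.0)]
direction, projections = first_principal_component(rows)
print([round(x, 4) for x in direction])    # [0.3333, 0.6667, 0.6667]
print([round(p, 4) for p in projections])  # [-4.5, -1.5, 1.5, 4.5]
```

Here three columns collapse into a single coordinate per row with no loss of information, because the rows vary along only one direction; real data would retain most, not all, of its variance.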
The following image illustrates how dimensionality reduction can project a multi-dimensional (>3) dataset to a 2D surface while retaining most of its important information:
Future posts in this series will outline the rationale and implementation of any dimensionality reduction exercises carried out, accompanied by Python/R/C++ source code where appropriate.
3) Reinforcement Learning
This category is related to supervised learning in that the agent learns from feedback, but that feedback is a reward signal rather than a ground-truth label. It involves the development of agents (e.g. systems) that optimize their own performance via interactions with their environment.
An agent observes the current state of its environment, takes an action, and receives a reward signal in return. With repeated interactions using a trial-and-error driven approach, the agent learns what series or assortment of actions leads to maximal reward.
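The trial-and-error loop can be sketched with tabular Q-learning, one well-known reinforcement learning algorithm, on a toy environment. Everything here is an assumption made for illustration: the five-state chain, the single reward at the far end, and the learning parameters have nothing to do with trading.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the rewarding terminal state
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right


def step(state, action_index):
    """Toy environment: move along the chain, reward 1.0 on reaching the end."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap the episode length
            best = max(q[state])
            greedy = [a for a in (0, 1) if q[state][a] == best]
            # Explore with probability epsilon, else act greedily
            # (ties between equally good actions are broken at random).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = rng.choice(greedy)
            next_state, reward, done = step(state, action)
            # Q-learning update: move towards reward + discounted best value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-terminal states.
policy = ["right" if values[1] > values[0] else "left"
          for values in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it infers this from the reward alone, which is what separates this setting from supervised learning.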
Possibly one of the most amazing developments in the field of reinforcement learning is DeepMind’s AlphaGo Zero – in a nutshell, a reinforcement learning algorithm that mastered the game of Go by playing against itself repeatedly!
Reinforcement learning has several applications in trading, including its use in trade entry/exit timing, portfolio rebalancing and determining optimal holding periods, to name a few.
Future posts in this series will assess the suitability of reinforcement learning to DARWIN datasets, present any studies carried out and provide Python/R/C++ source code for the same.
3) Setting up Python/R & C++ for Machine Learning on DARWIN Datasets
In order to follow along with our future publications that include implementations and source code, you’ll need to have a functional DARWIN data science environment set up to support Python, R & C++.
For a detailed set of requirements and configuration instructions, please see our recent blog post Setting up a DARWIN Data Science Environment.
As always, if you have any questions, please feel free to leave them in the comments section at the bottom of this post and we’ll respond as soon as we can!