## LVQ and Machine Learning for Algorithmic Traders – Part 1

Algorithmic traders across the full spectrum of asset classes often face a rather daunting challenge.

### What are the best inputs for an algorithmic trading strategy’s parameter space?

Different algorithmic trading strategies (whether manual or automated) will each have their own unique set of parameters that govern their behaviour.

Granted, Genetic and Walk-Forward Optimization will help algorithmic traders establish which input values (or ranges thereof) in the chosen parameter space have yielded favourable results historically.

They will also help traders identify optimal time periods over which to re-optimize “the currently optimized parameter space”… yes, that could indeed get pretty messy.

While this approach may or may not yield robust parameter inputs, several questions still remain in algorithmic traders’ minds:

1) Should absolutely all parameters be optimized, or just some? If only some, which ones?

2) What is the relevance and unique importance of each parameter in the trading strategy?

### Why is this important for Algorithmic Traders?

Selecting the right parameters in your trading algorithm can be the difference between:

• Average performance with a large number of parameters -> painfully long optimization times,
or,
• Fantastic performance with a smaller number of parameters -> much shorter optimization times.

### What is the solution?

Selecting the most appropriate parameters is a practice known as Feature Selection in the Machine Learning world, a vast and complex area of research and development.

Needless to say it cannot be encapsulated in one single blog post, which therefore implies that there will be more blog posts on this subject in the very near future 🙂


For now, we will focus on estimating “the most important” parameters in a trading strategy, using a bit of machine learning in R.

Specifically, we will make use of the caret (short for Classification and Regression Training) package in R, as it contains excellent modeling functions to assist us with this Feature Selection problem.

Lastly, we will use a small constructed sample of 1,000 id|feature|target records as the dataset, to demonstrate Learning Vector Quantization (LVQ), the technique used here.

### Step 1 – Load the “caret” machine learning library in R

library(caret)

### Step 2 – Prepare the data

Construct a dataset containing 1,000 training data points in CSV form.

Making sure you’re in the directory where the training data resides, type the following commands in your R console:

train.blogpost <- read.csv("data.csv", header=TRUE, nrows=1000)

We need only the “feature” and “target” column values in the dataset. Type the following command in your R console to achieve this:

train.blogpost <- train.blogpost[,grep("feature|target",names(train.blogpost))]
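If you do not have a suitable CSV to hand, the preparation step can be tried on synthetic data. The sketch below is only an illustration: the column layout (id, feature1 to feature21, target), the random values and the temporary file location are all assumptions, not the dataset used in this post.

```r
# Hypothetical stand-in for data.csv: 1,000 rows of id | feature1..feature21 | target.
# The layout and the way target is generated are assumptions for illustration only.
set.seed(42)
n_rows <- 1000
n_features <- 21

features <- as.data.frame(matrix(runif(n_rows * n_features), nrow = n_rows))
names(features) <- paste0("feature", seq_len(n_features))

# Binary target loosely driven by feature2, so that at least one feature matters.
target <- as.integer(features$feature2 + rnorm(n_rows, sd = 0.5) > 0.5)

sample_data <- data.frame(id = seq_len(n_rows), features, target = target)

# Write to a temporary CSV, then read it back the same way as in Step 2.
csv_path <- file.path(tempdir(), "data.csv")
write.csv(sample_data, csv_path, row.names = FALSE)
train.sketch <- read.csv(csv_path, header = TRUE, nrows = 1000)

# Keep only columns whose names contain "feature" or "target" (this drops "id").
train.sketch <- train.sketch[, grep("feature|target", names(train.sketch))]

dim(train.sketch)  # 1000 rows, 22 columns
```

The grep() call matches on column names, so any column whose name contains neither “feature” nor “target” is dropped.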

### Step 3 – Construct an LVQ Model on the data.

model.control <- trainControl(method="repeatedcv", number=10, repeats=3)

model <- train(as.factor(target)~., data=train.blogpost, method="lvq", preProcess="scale", trControl=model.control)
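For some intuition about what an LVQ model does, here is a minimal sketch that uses the class package directly (caret's "lvq" method requires this package). It runs on R's built-in iris data rather than the trading dataset, and the codebook size of 9 and the 70/30 split are arbitrary assumptions chosen for illustration.

```r
library(class)  # recommended package, normally installed alongside R

set.seed(1)

# Built-in iris data, used purely for illustration (not the post's dataset).
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Random 70/30 split into training and held-out rows.
train_idx <- sample(nrow(x), size = 105)

# Initialise a codebook of 9 labelled prototype vectors taken from the training set.
codebook_init <- lvqinit(x[train_idx, ], y[train_idx], size = 9)

# Training moves prototypes towards same-class points and away from other-class points.
codebook <- olvq1(x[train_idx, ], y[train_idx], codebook_init)

# Each held-out row receives the label of its nearest prototype.
predicted <- lvqtest(codebook, x[-train_idx, ])
accuracy <- mean(as.character(predicted) == as.character(y[-train_idx]))
accuracy
```

caret wraps this kind of fit inside its resampling loop, tuning the codebook size and the number of neighbours used during initialisation.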

### Step 4 – Retrieve the “importance” of each “feature” from the computed model.

importance <- varImp(model, scale=FALSE)

print(importance)

loess r-squared variable importance

only 20 most important variables shown (out of 21)

| Feature | Overall |
| --- | --- |
| feature2 | 0.011949 |
| feature18 | 0.010770 |
| feature7 | 0.010556 |
| feature16 | 0.010522 |
| feature5 | 0.010400 |
| feature11 | 0.009825 |
| feature1 | 0.009673 |
| feature14 | 0.009672 |
| feature3 | 0.009663 |
| feature13 | 0.008916 |
| feature21 | 0.008846 |
| feature15 | 0.008737 |
| feature10 | 0.008616 |
| feature17 | 0.008180 |
| feature19 | 0.007864 |
| feature12 | 0.005575 |
| feature9 | 0.005268 |
| feature8 | 0.005124 |
| feature20 | 0.005089 |
| feature4 | 0.005052 |

### Step 5 – Visualize the importance of each feature.

plot(importance)

LVQ Importance Visualization – Machine Learning in R

The plot of “feature importance” above clearly shows that features 12, 9, 8, 20, 4 and 6 have little impact on the outcome (the “target”), compared to the rest of the features.

To put it into context – in a trading strategy, these features may well have been parameters called:

Stop Loss 1, Stop Loss 2, Take Profit 1, Take Profit 2, RSI Top, RSI Bottom.. and so on.
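Once the importance scores are known, shortlisting parameters is a simple filter. The sketch below copies the scores printed in Step 4 into a plain data frame and applies a cut-off of 0.006. That cut-off is a hypothetical choice made for illustration, not a rule from caret. Feature 6 does not appear because the printed output only lists the top 20 of 21 features.

```r
# Importance scores copied from the varImp() output in Step 4.
importance_tbl <- data.frame(
  feature = c("feature2", "feature18", "feature7", "feature16", "feature5",
              "feature11", "feature1", "feature14", "feature3", "feature13",
              "feature21", "feature15", "feature10", "feature17", "feature19",
              "feature12", "feature9", "feature8", "feature20", "feature4"),
  overall = c(0.011949, 0.010770, 0.010556, 0.010522, 0.010400,
              0.009825, 0.009673, 0.009672, 0.009663, 0.008916,
              0.008846, 0.008737, 0.008616, 0.008180, 0.007864,
              0.005575, 0.005268, 0.005124, 0.005089, 0.005052),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off (an assumption for this sketch).
cutoff <- 0.006

keep_features <- importance_tbl$feature[importance_tbl$overall >= cutoff]
drop_features <- importance_tbl$feature[importance_tbl$overall < cutoff]

drop_features  # "feature12" "feature9" "feature8" "feature20" "feature4"
```

In a trading strategy, the parameters behind drop_features would be candidates for fixing at a constant value instead of being optimized.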

### Conclusion

By conducting LVQ analysis on optimization results, algorithmic traders can save time and avoid sacrificing accuracy.

Machine learning techniques of this nature can greatly reduce the time a trader needs to spend on an optimization problem.

By ascertaining the relative importance of parameters in this manner, traders can not only simplify their algorithms, but also potentially make them more robust than they would be with a larger number of parameters.


## Quantitative Modeling for Algorithmic Traders – Primer

Quantitative Modeling techniques enable traders to mathematically identify what makes data “tick” – no pun intended 🙂 .

They rely heavily on the following core attributes of any sample data under study:

1. Expectation – The mean or average value of the sample
2. Variance – The spread of the sample around its mean (the average squared deviation)
3. Standard Deviation – The square root of the variance, expressed in the same units as the sample
4. Covariance – The linear association of two data samples
5. Correlation – Covariance normalized by both standard deviations, which removes its dependence on units

## Why a dedicated primer on Quantitative Modeling?

Understanding how to use the five core attributes listed above in practice will enable you to:

1. Construct diversified DARWIN portfolios using Darwinex’ proprietary Analytical Toolkit.
2. Conduct mean-variance analysis for validating your DARWIN portfolio’s composition.
3. Build a solid foundation for implementing more sophisticated quantitative modeling techniques.
4. Potentially improve the robustness of trading strategies deployed across multiple assets.

Hence, a post dedicated to defining these core attributes, with practical examples in R (statistical computing language) should hopefully serve as good reference material to accompany existing and future posts.

### Why R?

1. It facilitates the analysis of large price datasets in short periods of time.
2. Calculations that would otherwise require multiple lines of code in other languages can often be written more concisely, as R has a mature base of libraries for many quantitative finance applications.

* Sample data (EUR/USD and GBP/USD End-of-Day Adjusted Close Price) used in this post was obtained from Yahoo, where it is freely available to the public.

### Before progressing any further, we need to download EUR/USD and GBP/USD sample data from Yahoo Finance (time period: January 01 to March 31, 2017)

In R, this can be achieved with the following code:

library(quantmod)

getSymbols("EUR=X",src="yahoo",from="2017-01-01", to="2017-03-31")

getSymbols("GBP=X",src="yahoo",from="2017-01-01", to="2017-03-31")

Note: “EUR=X” and “GBP=X” provided by Yahoo are in terms of US Dollars, i.e. the data represents USD/EUR and USD/GBP respectively. Hence, we will need to convert base currencies first.

To achieve this, we will first extract the Adjusted Close Price from each dataset, convert base currency and merge both into a new data frame for use later:

# Extract Adjusted Close. Backticks are required because the object and column names contain "=".
eurAdj = unclass(`EUR=X`$`EUR=X.Adjusted`)

# Convert to EUR/USD
eurAdj = 1/eurAdj

gbpAdj <- unclass(`GBP=X`$`GBP=X.Adjusted`)

# Convert to GBP/USD
gbpAdj <- 1/gbpAdj

# Extract EUR dates for plotting later.
eurDates = index(`EUR=X`)

# Create merged data frame.
eurgbp_merged <- data.frame(eurAdj,gbpAdj)

EUR/USD and GBP/USD (Jan 01 – Mar 31, 2017)

Finally, we merge the prices and dates to form one single dataframe, for use in the remainder of this post:

eurgbp_merged = data.frame(eurDates, eurgbp_merged)

colnames(eurgbp_merged) = c("Dates", "EURUSD", "GBPUSD")
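The base-currency conversion is just a reciprocal. The sketch below shows it on a few made-up quotes. The numbers and the usd_quotes data frame are hypothetical, not Yahoo data.

```r
# Hypothetical USD/EUR and USD/GBP closes (made-up numbers, not Yahoo data).
usd_quotes <- data.frame(
  Dates  = as.Date(c("2017-01-02", "2017-01-03", "2017-01-04")),
  USDEUR = c(0.95, 0.96, 0.94),
  USDGBP = c(0.81, 0.82, 0.80)
)

# Inverting a quote swaps base and quote currency: EUR/USD = 1 / (USD/EUR).
inverted <- data.frame(
  Dates  = usd_quotes$Dates,
  EURUSD = 1 / usd_quotes$USDEUR,
  GBPUSD = 1 / usd_quotes$USDGBP
)

inverted
```

Multiplying each inverted price by its original quote should give 1, which is a quick sanity check after any such conversion.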

### The mean μ of a price series is its average value.

It is calculated by adding all elements of the series, then dividing this sum by the total number of elements in the series.

Mathematically, the mean μ of a price series P, where elements p ∈ P, with n number of elements in P, is expressed as:

$$\mu = E(p) = \frac{1}{n}\sum_{i=1}^{n} p_i = \frac{1}{n}(p_1 + p_2 + p_3 + \dots + p_n)$$

In R, the mean of a sample can be calculated using the mean() function.

For example, to calculate the mean price observed in our sample of EUR/USD data, ranging from January 01 to March 31, 2017, we execute the following code to arrive at mean 1.065407:

mean(eurgbp_merged$EURUSD)

[1] 1.065407

Using the plotly library in R, here’s the mean overlaid graphically on this EUR/USD sample:

library(plotly)

plot_ly(name="EUR/USD Price", x = eurgbp_merged$Dates, y = as.numeric(eurgbp_merged$EURUSD), type="scatter", mode="lines") %>%

add_trace(name="EUR/USD Mean", y=(as.numeric(mean(eurgbp_merged$EURUSD))), mode="lines")

EUR/USD Mean R Plotly Chart (Jan 01 – Mar 31, 2017)
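The definition of the mean can be checked by hand against mean(). The five prices below are hypothetical values chosen for illustration, not the downloaded EUR/USD sample.

```r
# Hypothetical prices (not the downloaded EUR/USD sample).
prices <- c(1.0520, 1.0610, 1.0585, 1.0700, 1.0655)

# Sum all elements, then divide by the number of elements.
manual_mean <- sum(prices) / length(prices)

manual_mean                           # 1.0614
all.equal(manual_mean, mean(prices))  # TRUE
```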

### The variance σ² of a price series is simply the mean, or expectation, of the squared deviation of price from its mean.

It characterises the range of movement around the mean, or “spread” of the price series.

Mathematically, the variance σ² of a price series P, with elements p ∈ P, and mean μ, is expressed as:

$$\sigma^2(p) = E[(p - \mu)^2]$$

Standard Deviation is simply the square root of variance, expressed as σ:

$$\sigma = \sqrt{\sigma^2(p)} = \sqrt{E[(p - \mu)^2]}$$

In R, the standard deviation of a sample can be calculated using the sd() function.

For example, to calculate the standard deviation observed in our sample of EUR/USD data, ranging from January 01 to March 31, 2017, we execute the following code to arrive at s.d. 0.00996836:

sd(eurgbp_merged$EURUSD)

[1] 0.00996836

Using the plotly library in R again, we can overlay one (or more) standard deviations above and below the mean, as follows:

plot_ly(name="EUR/USD Price", x = eurgbp_merged$Dates, y = as.numeric(eurgbp_merged$EURUSD), type="scatter", mode="lines") %>%

add_trace(name="+1 S.D.", y=(as.numeric(mean(eurgbp_merged$EURUSD))+sd(eurgbp_merged$EURUSD)), mode="lines", line=list(dash="dot")) %>%

add_trace(name="-1 S.D.", y=(as.numeric(mean(eurgbp_merged$EURUSD))-sd(eurgbp_merged$EURUSD)), mode="lines", line=list(dash="dot")) %>%

add_trace(name="EUR/USD Mean", y=(as.numeric(mean(eurgbp_merged$EURUSD))), mode="lines")

EUR/USD Mean +/- 1 Standard Deviation R Plotly Chart (Jan 01 – Mar 31, 2017)
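One detail worth knowing: R's var() and sd() compute the sample versions, which divide by n - 1 instead of n. The sketch below compares both forms on hypothetical prices (not the downloaded data).

```r
# Hypothetical prices (not the downloaded EUR/USD sample).
prices <- c(1.0520, 1.0610, 1.0585, 1.0700, 1.0655)
n  <- length(prices)
mu <- mean(prices)

# Population form, E[(p - mu)^2]: divides by n.
pop_var <- mean((prices - mu)^2)

# Sample form: divides by n - 1. This is what var() and sd() use.
sample_var <- sum((prices - mu)^2) / (n - 1)

all.equal(sample_var, var(prices))            # TRUE
all.equal(sqrt(sample_var), sd(prices))       # TRUE
all.equal(pop_var, sample_var * (n - 1) / n)  # TRUE
```

For a sample of around 60 daily closes, the difference between the two denominators is small, but it matters when reconciling results with other tools.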

### The sample covariance of two price series, in this case EUR/USD and GBP/USD, each with its respective sample mean, describes their linear association, i.e. how they move together in time.

Let’s denote EUR/USD by variable ‘e’ and GBP/USD by variable ‘g’.

These price series then have sample means $$\overline{e}$$ and $$\overline{g}$$ respectively.

Mathematically, their sample covariance, Cov(e, g), where both have n number of data points $$(e_i, g_i)$$, can be expressed as:

$$Cov(e,g) = \frac{1}{n-1}\sum_{i=1}^{n}(e_i - \overline{e})(g_i - \overline{g})$$

In R, sample covariance can be calculated easily using the cov() function.

Before we calculate covariance, let’s first use the plotly library to draw a scatter plot of EUR/USD and GBP/USD.

To visualize linear association, we will also perform a linear regression on the two price series, followed by drawing this as a line of best fit on the scatter plot.

This can be achieved in R using the following code:

# Perform linear regression on EUR/USD and GBP/USD
fit <- lm(EURUSD ~ GBPUSD, data=eurgbp_merged)

# Draw scatter plot with line of best fit
plot_ly(name="Scatter Plot", data=eurgbp_merged, y=~EURUSD, x=~GBPUSD, type="scatter", mode="markers") %>%

add_trace(name="Linear Regression", data=eurgbp_merged, x=~GBPUSD, y=fitted(fit), mode="lines")

EUR/USD and GBP/USD Scatter Plot with Linear Regression

Based on this plot, EUR/USD and GBP/USD have a positive linear association.
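The slope of the line of best fit and the covariance are directly linked: for a simple regression of e on g, the slope equals Cov(e, g) / Var(g), so the two always share the same sign. The sketch below checks this on simulated series. The toy data frame and its generating parameters are assumptions for illustration only.

```r
set.seed(7)

# Simulated GBP/USD-like and EUR/USD-like series (hypothetical, not market data).
g <- 1.24 + cumsum(rnorm(60, sd = 0.004))
e <- 0.6 + 0.4 * g + rnorm(60, sd = 0.003)
toy <- data.frame(EURUSD = e, GBPUSD = g)

# Slope from ordinary least squares.
fit_toy  <- lm(EURUSD ~ GBPUSD, data = toy)
slope_lm <- unname(coef(fit_toy)["GBPUSD"])

# Same slope from covariance and variance.
slope_cov <- cov(toy$EURUSD, toy$GBPUSD) / var(toy$GBPUSD)

all.equal(slope_lm, slope_cov)  # TRUE
```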

To calculate the sample covariance of EUR/USD and GBP/USD between January 01 and March 31, 2017, we execute the following code to arrive at covariance 7.629787e-05:

cov(eurgbp_merged$EURUSD, eurgbp_merged$GBPUSD)

[1] 7.629787e-05
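To connect cov() back to the formula above, the sample covariance can be computed by hand. The two short series below are hypothetical values, not the downloaded prices.

```r
# Hypothetical EUR/USD (e) and GBP/USD (g) prices.
e <- c(1.0520, 1.0610, 1.0585, 1.0700, 1.0655)
g <- c(1.2280, 1.2350, 1.2310, 1.2460, 1.2400)
n <- length(e)

# Sum of products of deviations from each mean, divided by n - 1.
manual_cov <- sum((e - mean(e)) * (g - mean(g))) / (n - 1)

all.equal(manual_cov, cov(e, g))  # TRUE
```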

Problem: Covariance is dimensional in nature (its value depends on the units and scale of both series), which makes it difficult to compare price series with significantly different variances.

Solution: Calculate Correlation, which is Covariance normalized by the standard deviations of each price series. This makes it dimensionless and a more interpretable measure of linear association between two price series.

Mathematically, Correlation ρ(e,g) of EUR/USD and GBP/USD, where $$σ_e$$ and $$σ_g$$ are their respective standard deviations, can be expressed as:

$$\rho(e,g) = \frac{Cov(e,g)}{\sigma_e \sigma_g} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(e_i - \overline{e})(g_i - \overline{g})}{\sigma_e \sigma_g}$$

• Correlation = +1 indicates PERFECT positive linear association.
• Correlation = -1 indicates PERFECT negative linear association.
• Correlation = 0 indicates NO linear association.

In R, correlation can be calculated easily using the cor() function.

For example, to calculate the correlation between EUR/USD and GBP/USD, from January 01 to March 31, 2017, we execute the following code to arrive at 0.5169411:

cor(eurgbp_merged$EURUSD, eurgbp_merged$GBPUSD)

[1] 0.5169411

0.5169411 implies a moderate positive correlation between EUR/USD and GBP/USD, which is consistent with what we visualized earlier with our scatter plot and line of best fit.
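The “dimensionless” property is easy to see in a small experiment: rescaling one series changes the covariance by the same factor, but leaves the correlation untouched. The values below are hypothetical, and the factor of 10,000 is just an illustrative rescaling.

```r
# Hypothetical EUR/USD (e) and GBP/USD (g) prices.
e <- c(1.0520, 1.0610, 1.0585, 1.0700, 1.0655)
g <- c(1.2280, 1.2350, 1.2310, 1.2460, 1.2400)

# Correlation is covariance divided by the product of standard deviations.
manual_cor <- cov(e, g) / (sd(e) * sd(g))
all.equal(manual_cor, cor(e, g))        # TRUE

# Rescale one series by 10,000.
e_scaled <- e * 10000

cov(e_scaled, g) / cov(e, g)            # covariance grows by the scale factor
all.equal(cor(e_scaled, g), cor(e, g))  # TRUE: correlation is unchanged
```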

In future blog posts, we will examine how to construct diversified DARWIN Portfolios using the information above in practice.