Feature selection in trading algorithms

Lately I have been looking for a more systematic way to get around overfitting, and in my quest I found it useful to borrow some techniques from the Machine Learning field.

If you think about it, a trading algorithm is just a form of AI applied to price series. This statement, although possibly obvious, puts us in the position to apply a number of Machine Learning techniques to our trading strategies design.

Expanding on what was discussed here (and here), it seems intuitive that the more features a model has, the more, generally speaking, it might be subject to overfitting. This problem is known as the bias-variance trade-off and is usually summarised by the graph on the right.

[Figure: the bias-variance trade-off. As complexity increases, performance on the training set improves while predictive power on new data degrades.]

What’s possibly less intuitive is that the specific features used, in relation to the dynamics being predicted, play a key role in determining whether we are overfitting past data, so that the error behaviour shown in the graph is just a generalization.
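To make the graph above concrete, here is a small numpy sketch (the data and numbers are entirely made up for illustration): we fit polynomials of increasing degree to noisy samples of a smooth curve, and watch the training error fall while the out-of-sample error eventually blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying process (a stand-in for "the dynamics")
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.3, n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

def train_test_mse(degree):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on the training set only
    mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

# Complexity here is simply the polynomial degree (number of features)
results = {d: train_test_mse(d) for d in (1, 3, 15)}
for d, (tr, te) in results.items():
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The training error can only go down as the degree grows, while the test error follows the familiar U-shape of the graph.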

Something particularly interesting is that the use of the very same feature (e.g., in our application, an indicator, a take-profit or stop-loss mechanism, etc.) might or might not cause overfitting according to the dynamics we are trying to fit.

The reason behind this is that some phenomena (or sometimes even variants of the same phenomenon) simply can’t be described by some features.

As an example, imagine you are trying to forecast the future sales of a sportswear store in Australia. A “good” feature to use could be the season of the year, as (say) Aussies are particularly keen on water sports, and so spring and summer tend to show the best sales of the year.
Now imagine trying to forecast the future sales of a similar sportswear store located somewhere in the US. It might be the case that US citizens don’t have a preference for any particular season, as in the summer they practice water sports and in the winter they go skiing. In this new scenario, a model using the season of the year as a feature is more likely to result in an overfitted model because of the different underlying dynamics.
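The store example can be sketched numerically (all figures invented for illustration): simulate quarterly sales for a “store” with a real seasonal effect and one without, fit per-season means as the model, and compare how much the seasonal feature helps in-sample versus on fresh data from the same store.

```python
import numpy as np

rng = np.random.default_rng(1)
seasons = np.tile(np.arange(4), 50)               # 200 quarterly observations

def simulate(effect):
    # base sales + (possibly zero) seasonal component + noise
    return 100 + effect[seasons] + rng.normal(0, 10, seasons.size)

aus_effect = np.array([15.0, -5.0, -15.0, 5.0])   # hypothetical summer peak
us_effect = np.zeros(4)                           # no seasonal preference

def mse_reduction(effect):
    """In- and out-of-sample MSE reduction from adding per-season means."""
    y_fit, y_new = simulate(effect), simulate(effect)
    season_means = np.array([y_fit[seasons == s].mean() for s in range(4)])
    mse = lambda y, pred: np.mean((y - pred) ** 2)
    in_gain = mse(y_fit, y_fit.mean()) - mse(y_fit, season_means[seasons])
    out_gain = mse(y_new, y_fit.mean()) - mse(y_new, season_means[seasons])
    return in_gain, out_gain

aus_in, aus_out = mse_reduction(aus_effect)
us_in, us_out = mse_reduction(us_effect)
print(f"Australia: in-sample gain {aus_in:.1f}, out-of-sample gain {aus_out:.1f}")
print(f"US:        in-sample gain {us_in:.1f}, out-of-sample gain {us_out:.1f}")
```

For the US store the seasonal feature still looks useful in-sample (it always does: fitting more means can only reduce the training error), but the gain evaporates on new data — the feature was only fitting noise.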

Back to financial markets, an example of this could be how a stop-loss mechanism tends to be (generally speaking, and according to my experience) a good feature for trend-following strategies but not for mean-reversion strategies (and vice versa for take-profit orders). A possible explanation could be that trends are well described by the absence of big adverse movements, while their full extension can’t be known beforehand (but this is just me trying to rationalize my empirical findings).

So, how do you understand which features are good candidates?
Luckily for us, there is a whole bunch of techniques developed in the Machine Learning field for feature selection. I recommend the following 2003 paper for an overview of the methods: “An Introduction to Variable and Feature Selection” by Isabelle Guyon. Any Machine Learning text should also cover some of the techniques, as does the excellent Stanford Machine Learning class on Coursera.
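As a taste of the simplest family of techniques covered in that literature — filter methods — here is a toy sketch that ranks features by their absolute correlation with the target (data and the fact that only two features matter are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 10
X = rng.normal(size=(n, p))
# the target depends only on features 0 and 1; the other 8 are pure noise
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, n)

# Filter method: score each feature by |Pearson correlation| with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
ranking = np.argsort(scores)[::-1]
print("features ranked by relevance:", ranking)
```

Filters like this are cheap but univariate: they can miss features that only help in combination with others, which is where the wrapper and embedded methods discussed in Guyon’s paper come in.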
Any other readers’ recommendation (or comment) is of course very welcome.

Andrea


About mathtrading

My name is Andrea La Rosa and I am a quant trader based in the UK. In the past I worked as a quant in the prop desk of an investment bank, before deciding to fully dedicate myself to quantitative trading.
This entry was posted in On backtesting, Trading Strategies Design. Bookmark the permalink.

14 Responses to Feature selection in trading algorithms

  1. Interesting article Andrea! I’m glad you mentioned the bias-variance trade-off. I think it should be part of every quant trader curriculum 🙂

    There are some great books that cover this topic in depth. In particular the relatively new ‘Introduction to Statistical Learning’ (Hastie et al) and ‘Elements of Statistical Learning’ (again, Hastie et al). The latter is actually free to download as a PDF from the book website, although it requires a reasonable degree of mathematical sophistication.

  2. Doug says:

    Overall an interesting post, but there is a conceptual error in your example of overfitting. Overfitting occurs when you sample from a distribution and fit the noise from the sampling rather than the underlying distribution itself. Thus overfitting affects out-of-sample predictions even when applied to a new sample from the very same distribution.

    The difference between Aussie and US sports sales is not due to noise in the sample, but rather to the fact that US sports sales represent a related, but fundamentally different, distribution. The error in your prediction isn’t due to noise in how you sampled Aussie sales, but rather to the fact that you tried to predict one distribution using data gathered from a different one.

    The difference is subtle but important. Namely overfitting will always go away once your sample size becomes large enough. At the asymptote you can never overfit if you train on an infinite number of data points. But in your example it doesn’t matter how many Aussie sports stores I survey, I’ll always make the same error when trying to use it to predict American sports sales.

    Not to say that this issue isn’t also relevant for finance. One issue is that the distribution of financial asset returns changes from day to day, notably in the case of regime change. So that makes two separate arguments against using too many features: 1) it’s prone to overfitting and 2) it’s less robust across regime changes. But it’s important to recognize that these are two different phenomena.
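Doug’s point that overfitting (in this strict sense) washes out as the sample grows can be sketched in a few lines of numpy (synthetic data, illustration only): regress pure noise on 20 irrelevant features and watch the train/test gap shrink with n.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 20  # irrelevant features

def overfit_gap(n):
    # y is pure noise, so any fit to these features is fitting noise
    X, y = rng.normal(size=(n, p)), rng.normal(size=n)
    X_new, y_new = rng.normal(size=(n, p)), rng.normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # plain OLS
    mse = lambda A, b: np.mean((A @ beta - b) ** 2)
    return mse(X_new, y_new) - mse(X, y)                # out-of-sample minus in-sample

gap_small, gap_big = overfit_gap(50), overfit_gap(5000)
print(f"train/test gap: n=50 -> {gap_small:.3f}, n=5000 -> {gap_big:.3f}")
```

With 50 observations the 20 noise features produce a large gap between in-sample and out-of-sample error; with 5000 observations the gap nearly vanishes, exactly as the asymptotic argument predicts.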

    • mathtrading says:

      Hi Doug, thanks for your comment.
      I think you misunderstood what I wrote, or maybe I wasn’t very clear. I wasn’t suggesting that you fit your model to the Australian store and then use it to predict the US sales. I was instead saying that if you take a model having the month of the year as a feature, fitting and deploying it on the Australian data could work fine – but fitting and deploying it on the US data would be more likely to result in an overfitted model (especially if you don’t allow feature elimination), exactly because the dynamics are such that the month of the year is not relevant, and so all you would be fitting are random occurrences, aka noise (provided you don’t have an infinite sample, in which case I agree it should be easier to uncover noise 🙂 ).

      I also don’t agree entirely with your definition of overfitting, as the way I see it goes beyond the distributions of events: distributions are not enough to fully describe what’s happening – for one, they don’t take into account autocorrelations (https://mathtrading.wordpress.com/2013/01/03/defining-a-market-vs-characterizing-a-market/).

      Finally, you could argue that regime changes and overfitting are two sides of the very same coin, depending on your definition of a regime. But I agree that in some cases it’s useful to consider each of them individually.

      Andrea

  3. Doug says:

    On the issue of stops affecting mean reversion, I have a pet theory. In general mean-reversion strategies earn alpha because you’re providing liquidity in the opposite direction of the natural balance of buyers and sellers. When there are more natural buyers (pushing up the price) you sell, and vice versa. There’s an analogy to re-insurance here: there are more natural buyers of hurricane exposure, so if you short hurricane exposure (i.e. write insurance that pays off against hurricanes) you tend to earn premium over time.

    In re-insurance it’s well known that the most profitable times are those immediately following a bad season. A lot of hurricanes in a season means that many re-insurers breach their capital limits and have to exit the business. For those that survive the reduced competition means especially juicy premiums. A stop loss applied to the re-insurance business would be very bad indeed.

    Similarly, with regards to mean-reversion, a bad drawdown leads to a heavy exit of those trading the strategy. As these parties unwind their portfolios, this tends to exacerbate the losses of the remaining traders. However, those that weather the pain are soon rewarded: the same natural liquidity imbalances continue to exist, but reduced competition leads to higher strategy returns.

    Trend-following strategies aren’t subject to the same dynamics. You’re trading in the natural direction of the buyers and sellers. Competitor unwinds don’t lead to high drawdowns, since the strategy is buffered by the natural order flow. In this case stop losses are more likely to shut down broken strategies, rather than stop trading during profitable regimes.

    • mathtrading says:

      I tend to agree with your theory, and thanks for the interesting parallel with the re-insurance business. I like to think along the lines that if you are trading mean-reversion, then the more the price moves away, the more you should think its valuation is wrong and has to revert, so that the use of a stop-loss doesn’t make sense. But I can’t help wondering whether I would have found a rational explanation just as easily if things had worked the other way around!

  4. gregor says:

    The best way to do this is partial least squares.

    • mathtrading says:

      Gregor, would you care to elaborate and motivate your statement? Are you referring to feature selection in general, to trading or to the particular sales example?
      My impression is that which method is “best” is highly dependent on what one is trying to do, and even then it’s not so trivial.

      • gregor says:

        I mean feature selection when the number of features is large. This is a general result for training classifiers (and regression models). Principal components ignore the values of the predicted variable and just look at the features; PLS looks at both.
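Gregor’s distinction can be seen in a tiny numpy sketch (made-up data): one high-variance feature that has nothing to do with y, and one low-variance feature that drives it. The first principal component chases variance and ignores y; the first PLS weight vector (proportional to X'y) points straight at the predictive feature.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000
x_big = rng.normal(0, 10, n)    # high variance, unrelated to the target
x_small = rng.normal(0, 1, n)   # low variance, drives the target
X = np.column_stack([x_big, x_small])
X -= X.mean(axis=0)
y = x_small + rng.normal(0, 0.1, n)
y -= y.mean()

# First principal component: top eigenvector of X'X -- never looks at y
pc1 = np.linalg.eigh(X.T @ X)[1][:, -1]

# First PLS weight vector: w proportional to X'y -- looks at both X and y
w = X.T @ y
w /= np.linalg.norm(w)

print(f"weight on the predictive feature: PCA {abs(pc1[1]):.3f}, PLS {abs(w[1]):.3f}")
```

PCA puts essentially all its weight on the high-variance (but useless) feature, while PLS concentrates on the one that actually predicts y.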

  5. mathtrading says:

    Thanks for the clarification and the pointer, Gregor. I am not too familiar with PLS; I will have a look at it.
    Andrea

  6. Helen K says:

    Great post! When will you be adding more content?

  7. pat says:

    Hi Andrea,

    You have an interesting blog. I just started reading and had a small comment.

    Another possible financial trading analogy of over-fitting, in some sense, would be a model that fits well on some financial instruments while the same parameter selections or set of features do worse on others.
    Some people feel that a more robust trading system should behave well in many markets. From an ML perspective, I think of regularization as sacrificing some of my fitness potential in order to work more robustly across different markets/scenarios.

    I’ve found feature selection to be a very interesting topic. You might have some interest to look into LASSO modeling.
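For anyone curious about Pat’s pointer, here is a minimal self-contained sketch of LASSO-style feature selection via coordinate descent (a toy implementation on made-up data, not production code – in practice one would reach for a library implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)  # only features 0 and 1 matter

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Toy LASSO via coordinate descent; assumes roughly standardized features."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # partial residual with feature j's own contribution added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

beta = lasso(X, y, lam=80.0)
selected = np.flatnonzero(np.abs(beta) > 1e-6)
print("selected features:", selected)
```

The L1 penalty drives the coefficients of the irrelevant features exactly to zero, so the selection falls out of the fit itself – which is why LASSO is classed as an embedded method. Note that, being linear, it only sees interactions between features if they are added as explicit columns.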

    Anyways,
    Hope to read more.
    Pat

    • mathtrading says:

      Hi Pat,

      Thanks for your compliments.

      You raise a very good point about overfitting and applicability to different financial instruments. I agree with you to a certain extent: I believe that different markets share some common characteristics, so that when I try to exploit those I actually use the same features and even the very same parametrizations across different instruments. I like to think about it more as filtering noise that would lead to overfitting rather than giving up fitness potential – as the otherwise achievable increase in fitness is anyway less likely to be meaningful.
      However, at the same time I think each market inevitably has some very unique characteristics, so that it makes sense for some inefficiencies to be markets-specific.

      And thanks for the tip on LASSO – I guess how one applies it is what makes the difference, as with everything else. I have actually used it in some situations as a part of the optimization process. It brings the added difficulty of formulating the trading model in such a way that one can directly apply it, which I found to be not always easily done, or even worth my time for what I’m currently doing. I think it could shine in an environment where one is trying to select among a large number of features (e.g. GP and others), although then you’d face the limitation of the linearity of the approach (i.e. I don’t know how well a LASSO approach would keep track of the benefits coming from the coupling of different features). But to be honest I don’t have massive experience with it to support these statements, so these are just suppositions.

      Hope to hear more from you!
      Andrea
