The two most important aspects of machine learning are selecting features and cleaning your data.
While it may sound impressive to say that you use logistic regression, a random forest, or a Naïve Bayes algorithm to trade GOOG or USD/JPY, the actual algorithm isn’t the most important element of the analysis. The old mantra “garbage in, garbage out” is very relevant in machine learning. You can have the most powerful algorithm in the world, but if your data is garbage, you’ll only discover garbage.
At least at first, you should spend most of your time selecting your features and cleaning your data. These are two of the problems we have addressed with TRAIDE. I will get into how we addressed them towards the end of the article.
In the financial world, features are drawn from a list of fundamental, technical, and sentiment indicators. The examples in this article use two common ones: earnings per share (EPS) and moving averages.
It may not seem like it, but you select features all the time when you trade. For example, you may look at this quarter’s EPS to see whether it beat the expected EPS, or you may plot a 50-period moving average on a chart and go short every time the price retraces to the line. To quantify these observations, you would calculate the difference between the actual EPS and the expected EPS, and the difference between the price of the asset and its 50-period moving average.
The difference in machine learning is that an algorithm, rather than you, analyzes the features and their relationship to an asset. Instead of spending months or years studying the market, you let an algorithm look through thousands of data points to find the patterns for you. Just as you look at this data and learn how certain assets respond as these features change, so does a machine-learning algorithm. Then, when you feed the algorithm new data with the same features, it can make an informed decision about the likely effect on the price of an asset, just as you might.
When you select features, you should have a good understanding of why you selected or created them. For example, if I believe the USD/JPY price doesn’t stray very far from the 100-period MA, I might calculate how far, in standard deviations, the current price is from the 100-period MA at each open price (on whatever timeframe bars I am trading). I would then create a binary variable that flags every instance where the price is more than 2 standard deviations from the 100-period MA: 1 for 2 or more standard deviations and 0 for less than 2. I can create a few features that I believe will unveil information in my data and have a machine-learning algorithm analyze those features together over a specific date range. To follow the standard approach of a trader, I would then test the performance of the algorithm over new, unseen data. If the algorithm performs poorly over this new data, then the features may not contain any valuable information. If the algorithm performs well, I may be onto something.
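As a rough sketch of that binary feature, the snippet below flags each bar whose price sits at least a chosen number of standard deviations from its moving average. The function name `deviation_flags` is hypothetical, and two details are assumptions the text does not pin down: the standard deviation is taken over the same rolling window as the MA, and the window includes the current bar.

```python
from statistics import mean, pstdev


def deviation_flags(prices, window=100, threshold=2.0):
    """Return one flag per bar: 1 if the price is at least `threshold`
    rolling standard deviations away from its `window`-period moving
    average, 0 otherwise. Bars without a full window get None."""
    flags = []
    for i, price in enumerate(prices):
        if i + 1 < window:
            flags.append(None)  # not enough history yet
            continue
        # Assumption: the window includes the current bar's price.
        recent = prices[i + 1 - window : i + 1]
        ma = mean(recent)
        sd = pstdev(recent)  # assumption: population std dev of the window
        if sd == 0:
            flags.append(0)  # flat window, so the price equals the MA
            continue
        distance = abs(price - ma) / sd
        flags.append(1 if distance >= threshold else 0)
    return flags


# Toy example with a short window so the numbers are easy to follow.
opens = [100] * 9 + [110]
print(deviation_flags(opens, window=5, threshold=1.5))
```

In practice you would pass the open prices for the timeframe you trade and keep the default 100-period window. The resulting column of 0s and 1s is one feature among the few you hand to the algorithm.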
Creating features starts to become fun. What are some examples of features you might create using EPS or the 50-period moving average? I would play around with (Actual EPS – Expected EPS) or (Current Quarter EPS – Previous Quarter EPS) or ((The Estimize Community’s EPS Prediction – Wall Street’s EPS Prediction)/(Actual EPS)). For the 50-period moving average example, I might create a feature that is (Current Price – 50-period MA) or (100-period MA – 50-period MA) or (|100-period MA – 50-period MA|/(Current Price)).
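The candidate features above translate almost directly into code. The following is a minimal sketch, not how TRAIDE computes them: the function and key names (`eps_features`, `ma_features`, `eps_surprise`, and so on) are made up for illustration, and the moving average is assumed to be a simple (unweighted) one.

```python
def sma(prices, window):
    """Simple moving average of the most recent `window` prices."""
    if len(prices) < window:
        raise ValueError("not enough prices for this window")
    return sum(prices[-window:]) / window


def eps_features(actual, expected, previous, community_est, street_est):
    """EPS-based features; all inputs are per-share figures."""
    return {
        "eps_surprise": actual - expected,
        "eps_qoq_change": actual - previous,
        # Undefined when actual EPS is exactly zero; handle that case
        # before using this ratio on real data.
        "estimate_gap_ratio": (community_est - street_est) / actual,
    }


def ma_features(prices):
    """Moving-average features computed at the latest bar."""
    price = prices[-1]
    ma50 = sma(prices, 50)
    ma100 = sma(prices, 100)
    return {
        "price_minus_ma50": price - ma50,
        "ma100_minus_ma50": ma100 - ma50,
        "ma_spread_over_price": abs(ma100 - ma50) / price,
    }


# Made-up inputs, purely to show the call shape.
print(eps_features(actual=1.25, expected=1.0, previous=0.75,
                   community_est=1.5, street_est=1.0))
print(ma_features([float(p) for p in range(1, 101)]))
```

Dividing by the current price, as in the last feature, puts the moving-average spread on a comparable scale across assets that trade at very different price levels.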
In TRAIDE, you will have a comprehensive list of technical, fundamental, and sentiment indicators. You won’t know it, but you will be selecting features. TRAIDE takes care of the calculations while you compare moving averages or actual EPS to expectations.
If you studied incorrect information for a test, you would not perform well on the test the next day. The same is true for an algorithm. If you feed the algorithm data that is not adjusted for daylight saving time, it is going to make poor predictions on new data that is adjusted for daylight saving time. If you pull your data from Yahoo Finance, you will find that some data is missing and some is just inaccurate. You need to clean this data by checking it for holes and removing inaccuracies. A reliable, clean data source is difficult to come by cheaply. TRAIDE aggregates and cleans data from disparate data sources so you don’t have to. It would be expensive for any individual to pay for a clean, reliable data source, but we are able to provide all the data you need free with your subscription.
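To give a sense of what checking for holes and inaccuracies can look like, here is a small sketch for daily bars. It is an illustration under stated assumptions, not TRAIDE's cleaning process: the helper names are hypothetical, bars are assumed to be `(date, open, high, low, close)` tuples, and every weekday is treated as a trading day, so exchange holidays will also show up as holes and need to be filtered with a proper calendar.

```python
from datetime import date, timedelta


def find_missing_weekdays(dates):
    """Return weekdays between the first and last date that have no bar."""
    have = set(dates)
    day, last = min(dates), max(dates)
    missing = []
    while day <= last:
        if day.weekday() < 5 and day not in have:  # Monday=0 ... Friday=4
            missing.append(day)
        day += timedelta(days=1)
    return missing


def find_bad_bars(bars):
    """Return dates of bars whose prices are internally inconsistent:
    a non-positive price, a high below the open or close, or a low
    above the open or close."""
    bad = []
    for d, o, h, l, c in bars:
        if min(o, h, l, c) <= 0 or h < max(o, c) or l > min(o, c):
            bad.append(d)
    return bad


# Made-up bars: Wednesday is missing and Thursday's high is below its close.
bars = [
    (date(2024, 1, 1), 100.0, 102.0, 99.0, 101.0),
    (date(2024, 1, 2), 101.0, 103.0, 100.0, 102.0),
    (date(2024, 1, 4), 100.0, 99.0, 98.0, 101.0),
    (date(2024, 1, 5), 101.0, 104.0, 100.0, 103.0),
]
print(find_missing_weekdays([b[0] for b in bars]))
print(find_bad_bars(bars))
```

Checks like these only locate the problems. Whether to drop, repair, or re-source a flagged bar is a separate decision that depends on the data and the strategy.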
As machine learning becomes more and more popular, it is important that it is used correctly. The most important aspects of machine learning, especially when just beginning, are selecting valuable features and cleaning your data. After hours of gathering data and cleaning Excel sheets row by row, I know how tedious and painful it can be. This is why we have taken the liberty of making this process as easy and seamless as possible in TRAIDE.