Data mining is a powerful tool that is becoming more popular and accessible within the financial markets.
Data mining is a subfield of computer science that combines machine learning (itself a subcategory of artificial intelligence) and database systems with statistics. It is the process of discovering information in large data sets, and its goal is to transform a data set into understandable and usable information. There are six common classes of data mining tasks: anomaly detection, association rule learning, clustering, summarization, classification and regression.
Within the world of trading, there are many ways data mining techniques are utilized to discover actionable information. Each data mining technique has inherent limitations and underlying assumptions, making different techniques more suitable for certain applications. Some of the more common applications of data mining in the trading world are detecting insider trading and fraud, portfolio management, and creating trading strategies.
The SEC requires that every director, officer or owner of more than 10% of a company’s stock who purchases or sells shares in their company file a Form 4. The Form 4 is then stored and made accessible in the EDGAR database. The historical Form 4 filings offer an enormous data set that is ripe for data mining: millions of transactions, tens of thousands of companies, and hundreds of thousands of owners.
Most trades by owners are not illegal. For example, an owner might adjust their portfolio to adapt to current economic conditions or for liquidity purposes. An insider trade is generally considered illegal only when it is based on material information that has not been made public. So the goal of data mining, in this case, is to distinguish the everyday legal trades from the irregular trades that imply the owner had nonpublic information before buying or selling their shares.
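To make the idea of separating everyday trades from irregular ones concrete, here is a minimal sketch of one simple anomaly detection technique: flagging trades whose size is a robust outlier relative to the same owner's history. The function name, the threshold and the share counts are all assumptions made up for illustration, and a real surveillance system would look at far more than trade size.

```python
from statistics import median

def flag_irregular_trades(trade_sizes, threshold=3.5):
    """Return the indices of trades whose size is a robust outlier.

    Uses the modified z-score, built from the median and the median
    absolute deviation (MAD), so the outliers themselves distort the
    baseline less than they would with a mean and standard deviation.
    """
    med = median(trade_sizes)
    mad = median(abs(size - med) for size in trade_sizes)
    if mad == 0:
        return []  # no spread in the history, nothing to compare against
    flagged = []
    for i, size in enumerate(trade_sizes):
        modified_z = 0.6745 * (size - med) / mad
        if abs(modified_z) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical share counts from one owner's past filings (made-up numbers).
history = [1000, 1200, 900, 1100, 1050, 950, 25000, 1000]
print(flag_irregular_trades(history))  # → [6]
```

A flag like this is only a starting point for further investigation. It says a trade is unusual, not that it is illegal.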
An example of a data mining application set up to do just this is the Securities Observation, News Analysis and Regulation (SONAR) system. SONAR tags irregular trades for further investigation. It aggregates, processes, and analyzes tens of thousands of news stories and SEC filings every day, turning up hundreds of alerts for analysts to investigate further. SONAR uses natural language processing (NLP) text mining, statistical regression, rule-based inference, uncertainty and fuzzy matching to look for outliers in common patterns.
Other interesting research includes Kirkos et al. (2007), who used classification techniques (decision trees, Bayesian networks and neural networks) to identify firms that issued fraudulent financial statements.
Donoho (2004) found promising results using data mining techniques (decision trees, logistic regression and neural networks) to detect insider options trading.
Other interesting findings include Cheng and Lo (2006), who found that owners who intend to buy shares also tend to release abnormally negative news just before their own purchases. Brockman et al. (2010) found that owners release abnormally positive news before stock option exercises.
An interesting piece of research done by Lakonishok and Lee (2001) found that the market tends to underreact to the signals of owners buying and selling their own company’s shares.
Data mining allows us to uncover consistent patterns in how owners trade their own company’s shares. Accessing and analyzing this information for yourself could give you an information advantage in your own trading.
How do you decide which securities to hold and how much to allocate to each individual asset and to each asset class as a whole? While your goal might be to create a portfolio that is going to minimize the risk for a targeted return, it’s often a very difficult task, especially as your assets grow in size and number.
The Capital Asset Pricing Model (CAPM) and Arbitrage Pricing Theory (APT) are common tools in risk management and portfolio optimization. Neural networks have also been combined with APT: APT is used to determine prices, while the neural network classifies each risk factor going forward.
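As a quick illustration of the CAPM relationship, the sketch below estimates an asset's beta from paired return series and plugs it into the standard formula, expected return = risk-free rate + beta × (expected market return − risk-free rate). The function names and all of the numbers are assumptions for illustration only.

```python
from statistics import mean

def estimate_beta(asset_returns, market_returns):
    """Estimate beta as cov(asset, market) / var(market)."""
    asset_mean = mean(asset_returns)
    market_mean = mean(market_returns)
    covariance = sum(
        (a - asset_mean) * (m - market_mean)
        for a, m in zip(asset_returns, market_returns)
    )
    market_variance = sum((m - market_mean) ** 2 for m in market_returns)
    return covariance / market_variance

def capm_expected_return(risk_free, beta, market_expected):
    """CAPM: risk-free rate plus beta times the market risk premium."""
    return risk_free + beta * (market_expected - risk_free)

# Made-up periodic returns; the asset moves exactly twice as much as the market.
market = [0.01, -0.02, 0.03, 0.00]
asset = [2 * r for r in market]

beta = estimate_beta(asset, market)
print(round(beta, 6))  # → 2.0
print(round(capm_expected_return(0.02, beta, 0.08), 6))  # → 0.14
```

Real return series are noisy, so an estimated beta depends heavily on the sample window and return frequency you choose.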
Different data mining techniques can be applied to different tasks in managing your portfolio. For example, a genetic algorithm could be used to select the assets, a neural network to predict the returns of each asset, and a second genetic algorithm to allocate funds to each asset.
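The allocation step can be sketched with a very small genetic algorithm. This is a toy version under stated assumptions, not a production optimizer: the fitness function (expected return minus a risk penalty), the averaging crossover, the Gaussian mutation, and every input number are choices made up for illustration.

```python
import random

def normalize(weights):
    """Scale weights so they sum to 1 (a fully invested, long-only portfolio)."""
    total = sum(weights)
    return [w / total for w in weights]

def fitness(weights, exp_returns, cov, risk_aversion=3.0):
    """Expected portfolio return minus a penalty on portfolio variance."""
    n = len(weights)
    ret = sum(w * r for w, r in zip(weights, exp_returns))
    var = sum(weights[i] * weights[j] * cov[i][j]
              for i in range(n) for j in range(n))
    return ret - risk_aversion * var

def evolve_allocation(exp_returns, cov, generations=200, pop_size=30, seed=0):
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    n = len(exp_returns)
    pop = [normalize([rng.random() + 1e-9 for _ in range(n)])
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population unchanged.
        pop.sort(key=lambda w: fitness(w, exp_returns, cov), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: average the two parents' weights.
            child = [(x + y) / 2 for x, y in zip(a, b)]
            # Mutation: small random nudge, floored to stay positive.
            child = [max(1e-9, w + rng.gauss(0, 0.05)) for w in child]
            children.append(normalize(child))
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, exp_returns, cov))

# Hypothetical expected returns and covariance matrix for three assets.
exp_returns = [0.08, 0.12, 0.05]
cov = [
    [0.040, 0.006, 0.002],
    [0.006, 0.090, 0.004],
    [0.002, 0.004, 0.010],
]
best = evolve_allocation(exp_returns, cov)
print([round(w, 3) for w in best])
```

Because the fitter half of each generation survives unchanged, the best fitness found never gets worse from one generation to the next. The quality of the answer still depends entirely on how good the return and covariance estimates are.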
The most common approach to predicting individual stocks is to take in fundamental factors such as earnings per share, P/E and PEG ratios, revenues, debt, market share, market cap and volume. Regressions, various types of neural networks, decision trees, or support vector machines analyze these factors over a large set of historical data and classify the direction of tomorrow’s stock price. The performance is typically measured by the accuracy of the model on new data. This is commonly referred to as a “black-box” approach.
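Here is a minimal sketch of that workflow: fit a classifier on historical factor values, then measure its accuracy on data it has not seen. It uses a hand-rolled logistic regression to stay dependency-free. The data is synthetic and its labels are constructed to depend on the two made-up factors, so the resulting accuracy says nothing about real markets.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            pred = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            error = pred - label
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, features):
    """Return 1 (up) or 0 (down) for one row of factor values."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0

def accuracy(weights, bias, X, y):
    hits = sum(predict(weights, bias, f) == label for f, label in zip(X, y))
    return hits / len(y)

# Synthetic data: two standardized, made-up factors per day. The label is
# 1 ("up") when a noisy linear combination of the factors is positive.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if 1.5 * a - 1.0 * b + rng.gauss(0, 0.3) > 0 else 0 for a, b in X]

# Train on the earlier rows and test on the later ones, which mimics
# evaluating on data that arrives after the model was built.
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

weights, bias = train_logistic(X_train, y_train)
test_accuracy = accuracy(weights, bias, X_test, y_test)
print(test_accuracy)
```

The only thing you get back is a prediction and an accuracy number. The learned weights exist, but in larger models such as neural networks they are hard to interpret, which is what makes the approach a black box.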
A more transparent data mining technique is association rule learning. With association rule learning, we analyze the same factors and stock, but rather than the algorithm acting on our behalf, we create our own rules based on what the algorithm uncovered. The benefit is twofold: we know exactly what information was discovered within our data and, at the end of the day, we are the ones buying and selling the stock. Both are very important. Knowing exactly what was uncovered within our data lets us validate it. If a rule doesn’t make sense to you, then your model is probably incorrect and you need to tweak it. Data mining can uncover spurious correlations and patterns, and association rule learning helps us catch them.
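To show what "seeing the rule" looks like, the sketch below scores a single candidate rule with the three standard association rule metrics: support, confidence and lift. The condition names and the eight days of observations are invented for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n                                  # how often the whole rule occurs
    confidence = n_both / n_ante if n_ante else 0.0       # P(consequent | antecedent)
    lift = confidence / (n_cons / n) if n_cons else 0.0   # confidence vs. base rate
    return support, confidence, lift

# Each set holds the conditions observed on one trading day (made-up data).
days = [
    {"pe_low", "eps_growth_high", "next_day_up"},
    {"pe_low", "next_day_up"},
    {"pe_low", "eps_growth_high", "next_day_up"},
    {"eps_growth_high"},
    {"pe_low", "eps_growth_high"},
    {"next_day_up"},
    {"eps_growth_high", "next_day_up"},
    {"pe_low"},
]

support, confidence, lift = rule_metrics(
    days, {"pe_low", "eps_growth_high"}, {"next_day_up"}
)
print(round(support, 3), round(confidence, 3), round(lift, 3))  # → 0.25 0.667 1.067
```

A lift close to 1, as here, means the rule barely improves on the base rate of up days. That is exactly the kind of weak or spurious pattern you can spot and discard once the rule is laid out in front of you.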
In order to see and understand what information was uncovered, you need a good visualization tool. This is why we built TRAIDE. You can select the factors you want to analyze, mine for patterns and information in a particular asset, and then visualize those patterns on an interactive dashboard. You can then tweak and optimize your rules, testing them over your data in real time.
Data mining has many uses in the financial markets. There is a wealth of academic research and a growing number of real-world applications for you to explore and apply to your own systems.