Intro to Forecasting

Forecasting is another technique that uses structured data (often obtained by using techniques from Natural Language Processing and Object Recognition) to inform decision-making. Forecasting techniques predict future outcomes or states.

Why would we want to forecast? Say you’re buying a house. It might be useful to predict what the value of your investment will be in a year. What if you renovated the kitchen? Forecasting techniques can help you determine how much value a kitchen remodel might add. For a business, budgeting is incredibly important. If you can predict demand, customer churn, preventative maintenance costs, and yield, to name a few, you can efficiently deploy resources across your business. If you have an online store and can accurately predict what your customer might buy next, you can surface that item in their search results or in advertisements to increase the probability of a sale.

Forecasting uses large amounts of historical data to create groups that have historically behaved in similar ways. It can then match a new data point to one of these groups and use that group’s historical performance to predict the future. Forecasting differs from Optimization, which you can read more about here, because it doesn’t have to determine which steps to take to reach a defined goal. It simply matches new data to patterns in your historical data. You may notice that the first part of Forecasting sounds a lot like Clustering and Classification, which we discussed earlier here. Forecasting techniques are often built on top of Clustering and Classification algorithms.
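
To make that idea concrete, here is a minimal sketch of the “match new data to similar historical data” approach, using scikit-learn’s k-nearest-neighbors regressor. This is just one simple way to implement the idea, not the only one, and the house data below is made up purely for illustration.

```python
# A minimal sketch of "match a new data point to similar historical data",
# using k-nearest neighbors from scikit-learn. The numbers are made up.
from sklearn.neighbors import KNeighborsRegressor

# Historical data: [age of house in years, number of bedrooms] -> sale price
historical_features = [[30, 2], [5, 4], [12, 3], [40, 2], [8, 3], [25, 4]]
historical_prices = [210_000, 420_000, 310_000, 190_000, 330_000, 350_000]

# "Group" a new house with its 3 most similar historical sales
# and average their prices to produce a prediction.
model = KNeighborsRegressor(n_neighbors=3)
model.fit(historical_features, historical_prices)

new_house = [[10, 3]]  # a 10-year-old house with 3 bedrooms
print(model.predict(new_house))  # predicted price based on similar past sales
```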

Forecasting algorithms use what we call “features” to identify groups that behave similarly. These features are measurable characteristics that are independent of the thing we are trying to predict. Let’s say we’re trying to predict the price of a house like we mentioned earlier. A couple of features we might use are the age of the house and the number of bedrooms. The price of the house, the thing we’re trying to predict, is at least partially determined by its features (in this case, its age and number of bedrooms). When we say the features are independent, we mean that the features themselves, the house’s age and number of bedrooms, will not increase or decrease because the price changes. The price (our dependent variable) will, of course, change as the house gets older or if we decide to add an extra bedroom.
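
Here is a small sketch of that split between independent features and the dependent variable, again with invented numbers and a deliberately simple model (scikit-learn’s linear regression) just to show the features feeding a prediction:

```python
# A made-up illustration of independent features versus the dependent variable.
import pandas as pd
from sklearn.linear_model import LinearRegression

houses = pd.DataFrame({
    "age_years": [30, 5, 12, 40, 8, 25],  # feature: doesn't change when the price changes
    "bedrooms":  [2, 4, 3, 2, 3, 4],      # feature: also independent of the price
    "price": [210_000, 420_000, 310_000, 190_000, 330_000, 350_000],  # dependent variable
})

X = houses[["age_years", "bedrooms"]]  # the features
y = houses["price"]                    # the thing we're trying to predict

# Even a simple linear model captures how price partially depends on the features.
model = LinearRegression().fit(X, y)
print(model.coef_)  # estimated effect of each feature on price
```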

In this simple case, we have two features. If we made a table of historical prices and the age and number of bedrooms, we could do some simple math and probably predict the price of a house without the use of AI. The problem is that a house’s price relies on a lot more than the two features we’re using in this example. How big is the house? Is there a garden? A porch? How many stories does it have? How good are the nearby schools? What is the median income of the area? Which way are interest rates trending? Is there strong employment in the area? The number of features we could consider is almost overwhelming. Housing is not even the most complex problem we tackle in forecasting. Without the use of AI, it would be very difficult to develop a predictive model with any level of accuracy for these more complex problems.

As the number of features increases, so too does the amount of training data needed to train an accurate model. This data is used to try to isolate each feature’s effect on what we’re trying to predict. In our house example from before, if our training data only had old houses with 1 bedroom and new houses with 4 bedrooms, our model wouldn’t be able to isolate age from number of bedrooms when predicting price. We’d want a training set with variations that represent close to all the combinations of age and number of bedrooms. As you can probably guess, when you start adding more features, the amount of training data you need to cover all these variations goes up significantly.
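
To get a rough feel for why this happens, here is a back-of-the-envelope sketch. The way the features are bucketed, and the bucket counts themselves, are entirely invented for illustration:

```python
# A rough illustration of why more features demand more training data:
# the number of feature combinations to cover grows multiplicatively.
feature_buckets = {
    "age": 5,       # e.g. 0-10, 10-20, ... years old
    "bedrooms": 4,  # 1, 2, 3, 4+
}
combinations = 1
for buckets in feature_buckets.values():
    combinations *= buckets
print(combinations)  # 20 combinations to cover with just two features

# Add a few more features and the number of combinations explodes.
feature_buckets.update({"school_rating": 10, "median_income": 8, "stories": 3})
combinations = 1
for buckets in feature_buckets.values():
    combinations *= buckets
print(combinations)  # 4,800 combinations -- far more training examples needed
```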

If we’re trying to solve a problem with hundreds or thousands of features, things become more complex. We obviously need more training data, but we also need more computing power. This increases the amount of time needed to train the model, which means model tweaking takes longer and the cost of computing resources goes up. Data Scientists have developed several techniques to help keep these demands manageable.

Two of these techniques are “bagging” and “boosting”. Instead of training one model on all your features and all your historical data at the same time, these techniques train many smaller models on different slices of the data and then combine their results into a single model. Bagging (short for “bootstrap aggregating”) trains each model on a random sample of the training examples. In the house price example, this would mean that each model in a bagging ensemble only pays attention to some of the home sales. Random Forest, the best-known bagging algorithm, goes a step further and also limits each model to a random subset of the features at each decision point (say, age and number of bedrooms at one point, and area median income and school ranking at another). The predictions of all these models are then averaged together. Boosting (XGBoost is a popular example), in contrast, trains its models one after another, with each new model concentrating on the home sales the previous models predicted poorly, and then adds the models together into a single, stronger predictor. This is only a very cursory description of these two “meta-algorithms”. We’ll dive further in depth in a future article.
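
Here is a hedged sketch of what using these meta-algorithms can look like in practice, with scikit-learn’s Random Forest (a bagging-style ensemble) and its gradient boosting regressor (a boosting ensemble in the same family as XGBoost, used here as a stand-in). The house data is made up for illustration:

```python
# Contrasting a bagging-style ensemble (Random Forest) with a boosting
# ensemble (gradient boosting), using scikit-learn and made-up house data.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Features: [age of house, bedrooms, school rating]; target: sale price.
X = [[30, 2, 6], [5, 4, 8], [12, 3, 7], [40, 2, 5], [8, 3, 9], [25, 4, 6]]
y = [210_000, 420_000, 310_000, 190_000, 330_000, 350_000]

# Bagging: many trees, each trained on a random sample of the home sales
# (a Random Forest also restricts each split to a random subset of features);
# their predictions are averaged.
bagged = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Boosting: trees are trained one after another, each focusing on the sales
# the previous trees predicted poorly; their predictions are combined.
boosted = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

new_house = [[10, 3, 7]]
print(bagged.predict(new_house), boosted.predict(new_house))
```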

An important thing to remember about Forecasting is that it can’t understand the meaning of features. To the algorithm, the age of the house in our previous example could just as well be the age of a car or the number of days since someone’s last birthday. The algorithm can’t understand the meaning behind the problem it is solving and instead just looks for mathematical patterns in the data. This lack of understanding can be a feature in that it can identify patterns that humans wouldn’t normally be able to spot. It can also be a limitation. If the historical data that is being used to train the algorithm has built-in bias, so will the resulting algorithm. Amazon, for example, spent years training an AI system to sort through resumes and predict which candidates were most likely to get hired based on past hiring data. These resumes would then be surfaced to hiring managers who would interview the candidates. The algorithm that they built, however, consistently downgraded female candidates because, in the historical data, men were more likely to get hired. You can read more about Amazon’s AI Recruitment system here.

Another issue that compounds the problem of built-in bias is that the decision-making process of Forecasting algorithms can be opaque. While simpler models like Classification and Regression Trees provide some transparency into how their decision-making process works, more complex models are often unintelligible to humans. If a Forecasting algorithm predicts that someone is more likely to default on a loan, we won’t be able to explain why the algorithm predicted the default other than to say that similar patterns appeared in the training data. This makes it difficult to identify built-in bias in a Forecasting system except by observing its predictions. We will discuss issues with algorithmic transparency and techniques for bias elimination in a future blog article.
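
To show what that transparency looks like for a simpler model, here is a small sketch (scikit-learn, invented numbers) that prints the human-readable rules a shallow regression tree learns; larger ensembles and neural networks offer no comparably readable summary:

```python
# A simple tree model is relatively transparent: its learned rules can be
# printed and read. The data is made up for illustration.
from sklearn.tree import DecisionTreeRegressor, export_text

X = [[30, 2], [5, 4], [12, 3], [40, 2], [8, 3], [25, 4]]  # [age, bedrooms]
y = [210_000, 420_000, 310_000, 190_000, 330_000, 350_000]  # sale prices

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Prints human-readable if/then rules splitting on age and bedrooms.
print(export_text(tree, feature_names=["age", "bedrooms"]))
```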

Forecasting is currently one of the most active areas in AI research. With enough unbiased historical data, AI can make accurate predictions about the future. Whether that means predicting customer churn, estimating valuations, or planning preventative maintenance, accurate Forecasting can be invaluable.