So exactly how do predictive models work?
In my experience, it’s hard to come by an easy-to-understand explanation that isn’t buried in Big Data hype or a vendor’s sales pitch, neither of which helps you understand what these tools actually do.
For me, the easiest way to understand how predictive models work is through an example: a real question with a problem to solve. So let’s look at predictive modeling in a scenario many readers will find familiar and interesting, with you as a homeowner.
Let’s say you live in Broomfield, Colorado and you want to put your home up for sale. You want it to sell quickly, but not at a fire sale price. You’d like to know: What’s the difference between “homes that sell in less than 30 days” and “homes that sell in more than 30 days”? Your goal (and the problem you are solving) is to predict (as accurately as you can) how quickly your home might sell.
Now suppose you have a real estate broker friend with a spreadsheet full of data gathered from five years of buying and selling homes in the Denver—Boulder corridor. The spreadsheet has 100 columns with information about the home, the buyers and the sellers… and 10,000 rows, one for each home bought or sold during that timeframe.
You count and see that 1,000 homes out of 10,000 sold in less than 30 days over the last 5 years.
If you could figure out which 5-10 pieces of data in that spreadsheet, called “predictors,” are common among most of the homes that have already sold in less than 30 days, you could apply that knowledge, or “rule,” to new homes for sale.
Assume the predictors turn out to be: Basement, Kitchen, Furnace, AskingPrice, FinishedSqFeet and ZipCode.
You could then predict, based on the last five years’ worth of data, which kinds of homes (based on the predictors) will be most likely to sell in less than 30 days. And from there, by comparing the predictors for a home that sold quickly with the data about your own home, you can solve your problem and make a more accurate prediction about when your home is likely to sell. The “rule” that combines those predictors is the predictive model.
The rule, or predictive model, is:
Basement=Finished AND Kitchen=Remodeled AND Furnace<5 years old
AND AskingPrice<500,000 AND AskingPrice>200,000 AND FinishedSqFeet>2000
AND (ZipCode=12345 OR ZipCode=12346)
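In code, a rule like this is just a yes/no test over a home’s attributes. Here is a minimal Python sketch of the example rule above; the dictionary keys and the `my_home` listing are hypothetical stand-ins for the spreadsheet’s columns, not real data:

```python
def sells_fast(home):
    """Return True if a home matches the example rule, i.e. it looks like
    most of the past homes that sold in less than 30 days."""
    return (home["Basement"] == "Finished"
            and home["Kitchen"] == "Remodeled"
            and home["FurnaceAgeYears"] < 5
            and 200_000 < home["AskingPrice"] < 500_000
            and home["FinishedSqFeet"] > 2000
            and home["ZipCode"] in ("12345", "12346"))

# A hypothetical listing that matches every predictor in the rule:
my_home = {
    "Basement": "Finished",
    "Kitchen": "Remodeled",
    "FurnaceAgeYears": 3,
    "AskingPrice": 425_000,
    "FinishedSqFeet": 2400,
    "ZipCode": "12345",
}
print(sells_fast(my_home))  # True: the model predicts a fast sale
```

Change any one predictor (say, raise the asking price above 500,000) and the rule no longer matches, which is exactly how applying the model to a new listing works.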
That is the “big idea” behind predictive models: if you can answer a question by looking at past data and figuring out the predictors and the rule that can identify a pattern describing something that has already happened most of the time, you can apply that rule to current and new data to predict what should happen most of the time going forward.
The rule, when applied to those 10,000 rows of past home sale data, can correctly identify 850 of the 1,000 homes that sold in less than 30 days.
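Checking a rule against past data amounts to counting how many of the known fast-sellers it flags. The sketch below illustrates the idea with a tiny made-up dataset and a deliberately simplified two-predictor rule; none of the records or numbers come from the article’s spreadsheet:

```python
# Hypothetical illustration: score a rule against labeled past sales.
# Each record pairs a home's predictor values with whether it actually
# sold in under 30 days. All values here are invented for the sketch.
def matches_rule(home):
    return (home["Basement"] == "Finished"
            and 200_000 < home["AskingPrice"] < 500_000)

past_sales = [
    {"Basement": "Finished",   "AskingPrice": 350_000, "sold_fast": True},
    {"Basement": "Finished",   "AskingPrice": 450_000, "sold_fast": True},
    {"Basement": "Unfinished", "AskingPrice": 350_000, "sold_fast": False},
    {"Basement": "Finished",   "AskingPrice": 550_000, "sold_fast": True},
]

# How many of the homes that really sold fast does the rule catch?
fast_sellers = [h for h in past_sales if h["sold_fast"]]
caught = sum(matches_rule(h) for h in fast_sellers)
print(f"Rule identifies {caught} of {len(fast_sellers)} fast sellers")
```

With the article’s numbers, catching 850 of 1,000 fast sellers would be an 85% hit rate, which is the kind of score a data scientist uses to judge whether a candidate rule is any good.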
Of course, there’s no guarantee that your home will sell in 30 days even if your home shares those 5-10 predictors with most of the homes that have already sold in less than 30 days; that’s not how prediction based on probability works. However, the probability of your home selling in less than 30 days should be better than “50/50” if the predictive model is any good. And that’s an easy way to think about your objective with a predictive model: to make better predictions than you would with a coin flip or following an intuitive hunch if those are the only other “rules” available to you.
Finding what’s in common and creating that rule
How you find those 5-10 predictors in the spreadsheet data, and how you combine them into a rule you can use to make predictions about new homes for sale, is a complex topic I’ll save for another discussion. Using mathematical algorithms and software tools, however, your friendly neighborhood data scientist can find the most potent predictors and figure out the rule for you. As you would expect, the accuracy of the predictive model depends largely on the quality, relevance and quantity of the past data used. And remember that it need not be “big” data; “small” and relevant data can work well, too.
Not all prediction is the same
There are many different kinds of prediction we encounter in our lives, and making predictions using predictive models is one of many ways to do so. But using predictive models as part of your business analytics strategy is not just another way to do something you can already do.
Where past data is available and no good rule for making a prediction exists, predictive models can be built to do the job better than you or anyone else would likely do by guesswork or gut feel. Put another way, why would you guess what makes a home sell in less than 30 days when you can make a better prediction by looking at the data about homes that have already sold in less than 30 days?
If you found this helpful and would like to pass on your feedback, or if you have questions about how predictive modeling might help solve your most critical business problems, please email me at email@example.com.
Let’s “Lunch ‘n’ Learn”
If you’re interested in learning more about how predictive models work and how they can benefit your organization, I’ve got a “lunch and learn” presentation I can deliver via web meeting that might do the trick. You can read more about it here. Let me know!
Gene Connolly is an independent consultant with a passion for data and analytics, and for reading, writing and sharing what he knows with others interested in harnessing the power of prediction for themselves. You can find him on LinkedIn, on Twitter or by email at firstname.lastname@example.org.