Wednesday, January 03, 2007

Holdout Validation: Time-Based or Not?

Happy New Year, B2B marketers and analytics types.

I'm working on a predictive modeling project where we're trying to forecast whether or not a lead will convert into something more valuable. The project has mainly focused, as such projects always do, on data collection and generating business hypotheses. There are a couple of purely analytical issues, though, that I thought I'd surface on the blog.

The first issue is choosing a model type. I've always liked logistic regression because it is 100% mathematical and has reliable diagnostics, and the resulting models tend to be simple. However, it doesn't always work very well. For one of our sub-models, we branched out and tried a C5.0 algorithm. This algorithm--at first blush, anyway--blew away the logistic regression model; the gains charts were like night and day. When we ran the validation data against it, though, the model stopped looking so great. My hypothesis as to why this happens: classification algorithms based on branching or tree logic can keep splitting until they capture spurious patterns in the training data, so they are much more vulnerable to overfitting than logistic regression. They are also something of a "black box," which makes that extra complexity hard to inspect. Anyone have any thoughts on this? We'll continue looking at multiple techniques, but we'll also make sure to always do a thorough validation.
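
Here's a rough sketch of the kind of comparison I mean, in Python with scikit-learn. (scikit-learn doesn't include C5.0, so a plain decision tree stands in for it, and the data below is purely synthetic.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for real lead data.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    va = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    # An unconstrained tree typically looks great on training data and
    # falls off on the holdout -- the gap described in the post.
    print(f"{name}: train AUC {tr:.3f}, validation AUC {va:.3f}")
```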

The second issue is choosing a validation data set. There are two schools of thought on this one. I used to think you should always train a model on the most recent data available and then test it on a holdout from that same period. Say you have 24 months of lead data, for example. You would train the model on all 24 months and then test it on a holdout from the same 24 months (say, 30% of the population).
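
Sketched in Python, with a made-up `leads` table standing in for real lead data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Roughly two years of hypothetical lead records; `converted` is the target.
leads = pd.DataFrame({
    "created": pd.date_range("2005-01-01", periods=2400, freq="7h"),
    "touches": [i % 10 for i in range(2400)],
    "converted": [i % 5 == 0 for i in range(2400)],
})

# Same-period holdout: a random 70/30 split across the full window.
train_df, holdout_df = train_test_split(leads, test_size=0.3, random_state=1)
print(len(train_df), len(holdout_df))  # 1680 720
```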

The other approach is to train on older data and then test on more recent data. The advantage of this is that you are hopefully ensuring the model will hold up over a long period of time rather than working for only a few months. On the flip side, you are potentially ignoring your most valuable set of predictor data--the most recent stuff. I can see the merits of both approaches, so I think I've settled on always using both: building the model with the recent data included and again with it excluded. If the model holds up reasonably well on older data, that's a good sign it is robust. But if the more recent data shows a slightly better fit, why not actually field that model?
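
And the time-based variant, continuing from the same made-up `leads` table:

```python
import pandas as pd

# Train on everything except the last 6 months, then validate on
# those most recent 6 months.
cutoff = leads["created"].max() - pd.DateOffset(months=6)
train_df = leads[leads["created"] <= cutoff]
holdout_df = leads[leads["created"] > cutoff]
print(train_df["created"].max(), holdout_df["created"].min())
```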

4 comments:

Anonymous said...

Andy-

Any chance you have any advice as to how to find a telemarketer? I'm in the NY area and need someone who can speak with high-level execs. Any resources you can think of that could help my search?

Thank you
Pam McClure
pmcclure@thelabconsulting.com

Anonymous said...

If you are talking about internet marketing, then lead and demand generation strategies need to be in place for the business to be profitable, so we need to concentrate more on lead generation and internet marketing.

Anonymous said...

In-market validation and degradation analysis is the only tried-and-true way. Historical time-based holdouts and random holdouts, as you mentioned, are not really valid for real-world validation. I would suggest resampling and in-market degradation analysis.

Dr. H
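
One possible reading of that resampling suggestion, sketched in Python with made-up holdout scores (Dr. H doesn't name a specific method, so this is just a bootstrap of the validation metric):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical holdout labels and model scores.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
scores = 0.4 * y_val + 0.6 * rng.random(500)  # scores loosely tied to labels

# Bootstrap the AUC to see how stable a single-holdout estimate is.
aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y_val), size=len(y_val))  # resample w/ replacement
    aucs.append(roc_auc_score(y_val[idx], scores[idx]))
print(f"AUC {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```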

Amit Gadkari said...

C5.0 doing better than logistic regression on the training dataset, but not on validation:
This is the classic overfitting problem in decision trees, where nodes are split to such a fine level that non-existent patterns are captured. Some commercial software packages, like SAS, limit this with what they call 'pruning.' Regression methods address overfitting by applying clustering or PCA before the regression. In any case, only the results on the validation dataset matter.
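
A quick sketch of the pruning idea: scikit-learn exposes cost-complexity pruning via `ccp_alpha` (C5.0 itself isn't available there, so a plain decision tree stands in, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Larger ccp_alpha prunes more aggressively; holdout accuracy usually
# peaks somewhere between the unpruned tree and a heavily pruned one.
for alpha in [0.0, 0.005, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_tr, y_tr)
    print(alpha, tree.get_n_leaves(), round(tree.score(X_te, y_te), 3))
```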

Time-based holdout: I would recommend using the same timeframe for model building, assuming the objective is to capture the factors internal to the business (and its customers). Seasonality and external macro factors could introduce a bias into a holdout from a different time period. External factors are generally better captured by tweaking rules for the environment as it unfolds, based on domain-expert judgment. That being said, a quick check on time robustness is still worthwhile, to make sure the model is usable over time.
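
One quick way to run that time-robustness check, again on made-up data: score each recent month separately and watch for decay.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical two years of leads with one predictive feature.
rng = np.random.default_rng(0)
n = 2400
leads = pd.DataFrame({"created": pd.date_range("2005-01-01", periods=n, freq="7h"),
                      "touches": rng.poisson(3, n)})
leads["converted"] = (leads["touches"] + rng.normal(0, 1, n) > 3).astype(int)

# Fit on all but the last 6 months, then score each recent month separately;
# a steady month-by-month AUC suggests the model holds up over time.
cutoff = leads["created"].max() - pd.DateOffset(months=6)
train = leads[leads["created"] <= cutoff]
model = LogisticRegression().fit(train[["touches"]], train["converted"])

recent = leads[leads["created"] > cutoff]
for month, grp in recent.groupby(recent["created"].dt.to_period("M")):
    auc = roc_auc_score(grp["converted"],
                        model.predict_proba(grp[["touches"]])[:, 1])
    print(month, round(auc, 3))
```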