
Sampling For Your Analysis – Predictive Analytics Times

December 4, 2019
By: Sam Koslowsky, Senior Analytic Consultant, Harte Hanks

Suppose you are about to conduct a mailing campaign. Your goal is to secure both increased response rates and increased sales volume, so a customer targeting methodology is crafted. Nothing elaborate, but response and sales models will be developed. You have results from a previous program and are prepared to aggregate the data so that it can be mined: 974,232 individuals are tagged as mailed, and 11,418 are flagged as responders.

Response models can be developed in the standard way: we take a sample of mailed households, with their associated responders, and construct a model. Similarly, we retrieve our 11,418 responders (those that made a purchase) and formulate a sales model. While this certainly is a typical approach for marketers, there is a thorny question that must be addressed. How do we use these models? Granted, the response model can be deployed against the new mailing population. But the sales analysis was developed on responders only. How do we take a model constructed on responders only and apply it to the larger universe, the mailed population? Targeted modeling based on non-randomly selected samples can lead to erroneous results.
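To make the setup concrete, here is a minimal sketch of the two models in Python. It assumes a pandas DataFrame of the mailed file with a 0/1 response flag, a sales amount, and a few illustrative predictor columns; all column names are hypothetical, not taken from the campaign described here.

# Minimal sketch of the two-model setup, under the assumptions stated above.
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

PREDICTORS = ["wealth_index", "tenure_months", "household_size"]  # hypothetical names

def build_models(mailed: pd.DataFrame):
    # Response model: fit on the full mailed universe (responders and non-responders).
    response_model = LogisticRegression(max_iter=1000)
    response_model.fit(mailed[PREDICTORS], mailed["responded"])

    # Sales model: fit only on the responders -- the non-random subset that
    # raises the deployment question discussed above.
    responders = mailed[mailed["responded"] == 1]
    sales_model = LinearRegression()
    sales_model.fit(responders[PREDICTORS], responders["sales"])
    return response_model, sales_model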


When we apply machine learning algorithms to our sales prediction problem, we may very well encounter a problem of selection bias. Because spending (sales) is a censored variable (if you didn't respond, you have NO sales), many records have zero values, i.e., no sales. Researchers have observed that if this is not accounted for in the analysis, the model may well produce biased results.
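A toy simulation (my own illustration, not from the campaign data) shows the effect: when an unobserved trait drives both the decision to respond and the amount spent, a sales regression fit on responders alone recovers a distorted coefficient.

# Toy simulation of selection bias: spend is only observed for responders,
# and an unobserved trait influences both response and spend.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
affluence = rng.normal(size=n)          # observed predictor
hidden = rng.normal(size=n)             # unobserved trait
responded = (0.5 * affluence + hidden + rng.normal(size=n)) > 2.0
spend = 100 + 20 * affluence + 15 * hidden + rng.normal(scale=10, size=n)

# Slope of spend on affluence, fit on responders only vs. the full population.
slope_responders = np.polyfit(affluence[responded], spend[responded], 1)[0]
slope_full = np.polyfit(affluence, spend, 1)[0]
print(f"responders only: {slope_responders:.1f}   full population: {slope_full:.1f}")
# The responders-only slope falls well below the true value of 20, because
# selecting on response induces a negative correlation between the observed
# predictor and the hidden trait within the selected group.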

Other, similar sample selection issues are quite prevalent. Take, for example, the bank that uses sophisticated model algorithms to determine who should be approved for a mortgage loan. The analytic process is statistical and 'objective' in nature. Or is it? The models typically build a profile of previous applicants who have been good borrowers, that is, they have repaid their loans according to agreed-upon terms.

But wait a minute. The statistical models employed to assess new applicants are developed from historical loan data. The sample used to construct the model is based on those who have already been approved and are currently the bank's customers! And here's the issue: you cannot develop a model on a bank's customers and deploy those results to the entire world. The bank and the world are different universes.

The above credit issue is addressed through various means, specifically through a process known as reject inference. Much has been written on this topic, and I leave it to the interested reader to further investigate the details.

The details of the recently issued Apple credit card, however, are not as clear. Several sources confirmed claims that men were assigned higher credit lines than women. Does that make the credit grantor biased?

Leo Kelion, technology desk editor at bbc.com, suggests the difference in credit line assignment may trace back to how "the algorithms involved were developed": if "they were trained on a data set in which women indeed posed a greater financial risk than the men," this "could cause the software to spit out lower credit limits for women in general, even if the assumption it is based on is not true for the population at large." Again, we observe a sample bias, one that has caused much commotion in the industry.

Another often-quoted illustration of sample bias involves a problem that, unfortunately, many managers do not consider.

Take the researcher interested in comparing the incomes of those who attended college with those who did not. The straightforward procedure would be to compare the average income of college attendees with the average income of those who did not attend. Simple enough, you think. But wait a minute, again. It could very well be that those attending college possess additional attributes that may not be identified in the non-college population. For example, patience or concentration may be characteristics that affect income whether or not an individual attends college. These additional characteristics may very well distort the evaluation and comparison of the two groups.

Let’s briefly return to the response/sales models referred to earlier, in an attempt to assess the situation. First, a brief recap.

Marketers have developed response models to target the audiences most likely to respond. This has often proved to be the bread and butter of marketing campaigns. But hold on. These response models may be very good, but is the highly responsive segment the most profitable? Are they purchasing the most? The fact is that response models can often locate highly responsive customers who actually spend the least! To minimize the risk of targeting responsive but low-spending individuals, researchers have extended the response model scenario to include response AND spending algorithms. This, as mentioned above, introduces a real selection bias. Models can be developed for spending. That's good. But spending implies response. As we proceed to operationalize the model, we are applying the results to the full contact universe, not only the responders!

There are a number of approaches that have been suggested to address this problem. These tactics may well not fully address the underlying problem, but in practice at least some of them, some of the time, appear to do a credible job of confronting our dilemma.

  1. Use just a response model. Perhaps the response model correlates well with sales, and one analysis is adequate.
  2. Ignore the problem. Even though, theoretically, this is not appropriate, in practice it may meet the marketer's objective. Build the two models on their respective universes, and select the winning approach based on how well it performs on the initial mailing population. Use expected value as the criterion: expected value is nothing more than the result achieved by multiplying the response score by the sales score. Sort by this value and select your mailing population (see the sketch after this list).
  3. Construct only a sales model. Skip the response effort. Perhaps response will correlate with the ten sales deciles.
  4. Fill in the non-responders with a sales value of '0'. After all, they didn't spend!
  5. Fill in the non-responders with random, very low sales values.
  6. Use what is referred to as the Heckman correction, a tool developed to address sample selection bias.
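As referenced in item 2, the expected-value ranking might look like the following sketch; it assumes the hypothetical fitted models and predictor list from the earlier code.

# Sketch of the expected-value ranking from item 2, using the hypothetical
# response_model, sales_model and predictor list from the earlier sketch.
import pandas as pd

def rank_by_expected_value(prospects: pd.DataFrame, response_model, sales_model,
                           predictors, mail_quantity: int) -> pd.DataFrame:
    scored = prospects.copy()
    scored["p_response"] = response_model.predict_proba(scored[predictors])[:, 1]
    scored["pred_sales"] = sales_model.predict(scored[predictors])
    # Expected value = response score x sales score, as described above.
    scored["expected_value"] = scored["p_response"] * scored["pred_sales"]
    # Sort descending and keep the top of the file as the mailing population.
    return scored.sort_values("expected_value", ascending=False).head(mail_quantity)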

Let me share some quick thoughts on Items 1 and 6 above.

  1. Use just response analysis

The marketer’s response model consisted of typical predictors, including wealth indicators, the retailer’s current performance across geographies, and family composition. The final algorithm, generated through a logistic regression, produced the following results.

DECILE   Response Rate
1        5.86%
2        2.60%
3        1.52%
4        0.86%
5        0.48%
6        0.26%
7        0.10%
8        0.04%
9        0.01%
10       0.00%
Total    1.17%

Results were better than expected. Let's now add sales to the decile table above. Remember, non-responders have no sales, so their records are NOT included in the sales figures of the performance report below.

DECILE   Response Rate   Sales
1        5.86%           $136.33
2        2.60%           $120.64
3        1.52%           $114.96
4        0.86%           $112.11
5        0.48%           $106.33
6        0.26%           $171.82
7        0.10%           $175.40
8        0.04%           $81.57
9        0.01%           $74.39
10       0.00%           $80.15
Total    1.17%           $123.83

While there certainly appears to be a relationship between response and sales, it is also evident that there isn't much distinction in sales volume across the first six or so deciles. While this approach is not really satisfying, it nevertheless does provide increased response rates and 'better' than average sales estimates. Beware, though: this is not always the case. I have seen performance reports showing fairly constant spending throughout the ten deciles.
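For reference, a decile performance report of this kind can be assembled roughly as follows; the column names again follow the hypothetical sketches above, and decile 1 holds the highest-scoring ten percent of the mailed file.

# Rough sketch of a decile performance report like the tables above,
# using the hypothetical mailed-file columns from the earlier sketches.
import pandas as pd

def decile_report(mailed: pd.DataFrame, response_model, predictors) -> pd.DataFrame:
    scored = mailed.copy()
    scored["score"] = response_model.predict_proba(scored[predictors])[:, 1]
    # Decile 1 = top-scoring 10% of the mailed file, decile 10 = bottom 10%.
    scored["decile"] = pd.qcut(scored["score"].rank(method="first"), 10,
                               labels=list(range(10, 0, -1))).astype(int)
    response_rate = scored.groupby("decile")["responded"].mean()
    # Average sales computed over responders only, as in the table above.
    avg_sales = scored[scored["responded"] == 1].groupby("decile")["sales"].mean()
    return pd.DataFrame({"response_rate": response_rate,
                         "avg_sales": avg_sales}).sort_index()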

In any event, this procedure does not directly address the sample selection bias.

James Heckman, in a landmark study (Heckman, J. (1979). "Sample Selection Bias as a Specification Error". Econometrica 47 (1): 153–61), proposed a two-stage estimation procedure to tackle the selection bias problem. In the first step, a regression analysis (typically a probit) models the response decision. With a bit more statistical juggling, the output of this first-step regression is then incorporated as an additional explanatory variable in the spending regression model. This tactic, popularly referred to as the Heckman correction, should not be considered the ultimate solution. It is not magical, and it doesn't always produce the results you may be looking for. So be wary.
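To make the two-step logic concrete, here is a compact manual illustration in Python using statsmodels, under the same hypothetical column names as before. It is a sketch of the idea, not a substitute for the packaged routines listed below, which also handle details such as corrected standard errors.

# Compact manual sketch of the Heckman two-step idea, using the hypothetical
# mailed-file columns from the earlier sketches.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(mailed: pd.DataFrame, selection_cols, outcome_cols):
    # Step 1: probit for the response (selection) decision on the full mailed file.
    Z = sm.add_constant(mailed[selection_cols])
    probit = sm.Probit(mailed["responded"], Z).fit(disp=0)
    linpred = np.asarray(probit.fittedvalues)           # linear predictor z'gamma
    # Inverse Mills ratio: phi(z'gamma) / Phi(z'gamma).
    imr = pd.Series(norm.pdf(linpred) / norm.cdf(linpred), index=mailed.index)

    # Step 2: OLS for spending on responders only, with the inverse Mills ratio
    # added as the extra explanatory variable described in the text.
    resp = mailed["responded"] == 1
    X = sm.add_constant(mailed.loc[resp, outcome_cols].assign(inv_mills=imr[resp]))
    ols = sm.OLS(mailed.loc[resp, "sales"], X).fit()
    return probit, ols

# In practice, the selection equation should include at least one predictor
# excluded from the spending equation (an exclusion restriction).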

While there are numerous software packages designed to perform a Heckman analysis, three that I am aware of include:

  • The sampleSelection package in R
  • The heckman command in Stata
  • The QLIM procedure (PROC QLIM) in SAS

Constructing carefully developed samples is a necessary ingredient in building predictive models and performing objective analyses. Analysts must be vigilant and careful to construct those subsets without bias. The samples must characterize the universe as a whole if inferences gleaned from the sample are to be correctly deployed to records outside the sample. While there are techniques that may be used to mitigate the issues, none are really foolproof. Careful and deliberate construction of data for model building, and communication about the potential issues with those data, can make the job of the analyst and the marketer somewhat easier, though still complex.

About the Author

Sam Koslowsky serves as Senior Analytic Consultant for Harte Hanks. Sam’s responsibilities include developing quantitative and analytic solutions for a wide variety of firms. Sam is a frequent speaker at industry conferences, a contributor to many analytics-related publications, and has taught at Columbia and New York Universities. He has an undergraduate degree in mathematics and an MBA in finance from New York University, and has completed post-graduate work in statistics and operations research.

Harte Hanks is a global marketing services firm specializing in multichannel marketing solutions that connect our clients with their customers in powerful ways. Experts in defining, executing and optimizing the customer journey, Harte Hanks offers end-to-end marketing services including consulting, strategic assessment, data, analytics, digital, social, mobile, print, direct mail and contact center. From visionary thinking to tactical execution, Harte Hanks delivers smarter customer interactions for some of the world’s leading brands. Harte Hanks’ 5,000+ employees are located in North America, Asia-Pacific and Europe.



