Tuesday, December 11, 2018

Stratified Sampling, Over-Sampling and Under-Sampling

Stratification can be applied to obtain a subset with a specific distribution.



Real-life data is almost always unbalanced; for example, there may be an overwhelming majority of negative cases and just a handful of positives. This makes it difficult to train a classifier able to tell the classes apart.

Behavioral Analytics is a field of information security that applies data science to detect which ones, out of the millions of daily user actions in your company's networks, are the act of a hacker. Multiple techniques apply here, of which the most powerful is a classifier that, given a representation of behavioral data, labels it as either benign or malicious.

The most challenging aspect of this technique is that, unless the classifier is extremely precise, it will generate so many false positives that it will cause "alert fatigue" (it will cry wolf hundreds of times a day, so you will ignore it). This problem is the very nature of the business: finding needles in haystacks.

In addition, unbalanced data makes the training process extremely inefficient: most of the training cycles are spent looking at examples of the vast negative class, and very little time is spent on examples of the relatively small positive class.

Let's look at another example. Suppose you are a data scientist working for a search engine company that shows ads to users based on the words they search for, e.g. if a user searches for the words "Harry Potter", your classifier may choose to show ads for quidditch balls or potion cauldrons.

Now, you have one year of recorded history of what your search engine users did. For each search, you recorded the following 3 pieces of information (a.k.a. 'attributes'): (1) the words searched (a.k.a. 'keywords'), (2) the product that you decided to advertise, and (3) whether the user clicked on the ad (represented by a Y or an N). This data looks as follows:


Search Keywords    Product Advertised    User Click
---------------    ------------------    ----------
Harry Potter       Potion Cauldron       N
Harry Potter       Quidditch Balls       Y
Harry Potter       Grammar book          N
Harry Potter       Algebra book          N
Algebra tutor      Algebra book          Y

The large majority of users didn't click on the ads your engine chose to display, so your dataset contains 99% negative cases (N) and 1% positive cases (Y).

If you had to look at a search that a user just typed in and decide whether to show the Algebra ad or not, what data would you base your decision on?

Your best strategy would be to look at your history and find out which kinds of keywords were typed in those searches that culminated in the user clicking on the Algebra book ad, i.e. the positive cases.

Now, if you train a machine learning classifier with your historical dataset, the algorithm will process all the cases, and therefore it will spend 99% of its time learning to recognize keywords typed in searches that didn't culminate in a clicked ad (a vast, unbounded universe of keywords), instead of focusing on the keywords typed in searches that ended with a click on the Algebra ad, which is what you need it to learn.




I hope the above discussion gives you an intuition of why training a classifier with an unbalanced dataset is inefficient.

To help the classifier spend its precious learning time efficiently, we want to train it with roughly the same number of positive and negative cases (50% of each class). And this is what the techniques of Over-Sampling and Under-Sampling achieve, by picking the same number of examples from each class.

Going back to the dataset from the example, you could keep the whole positive class, sample 1% of the negative class, and ignore the other 98% of your data (composed entirely of negative cases); this would render a balanced dataset that is 2% of the original in size. This is called Under-Sampling. Alternatively, you could sample the positive cases "with replacement" to enlarge the positive set by a factor of 99; this would give you a balanced dataset that is almost twice the size of the original. This is called Over-Sampling.
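
A quick back-of-the-envelope calculation makes this concrete (a sketch, assuming a hypothetical dataset of 1,000,000 searches with the 1%/99% split above):

n_total = 1_000_000
n_pos = n_total // 100      # 10,000 positives (1%)
n_neg = n_total - n_pos     # 990,000 negatives (99%)

# Under-sampling: keep all positives, sample an equal number of negatives
under_size = n_pos + n_pos  # 20,000 rows, i.e. 2% of the original

# Over-sampling: keep all negatives, resample positives with replacement
# until they match the negatives (a factor of 99)
over_size = n_neg + n_neg   # 1,980,000 rows, almost twice the original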


In the following code we show how to create a perfectly balanced dataset, of any size we need, using over- and under-sampling:

import numpy as np

def balanced_sampling(X, y, sample_size):

    # find the distinct classes
    classes = np.unique(y)

    # pick an equal-size sample from each class (stratum)
    class_sample_size = sample_size//len(classes)
    class_samples = []
    for c in classes:
        # indexes of all the examples of class c
        idx_class = np.asarray([i for i in range(len(y)) if y[i] == c])
        # over-sample (replace=True) only if the class is too small
        idx_class_sample = np.random.choice(idx_class,
                                            size=class_sample_size,
                                            replace=(class_sample_size>len(idx_class)))
        class_samples.append(idx_class_sample)

    # mix all strata together
    idx_balanced_sample = np.concatenate(class_samples)
    np.random.shuffle(idx_balanced_sample)

    # return the sample
    return (X[idx_balanced_sample], y[idx_balanced_sample])
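
As a quick sanity check (a hypothetical usage sketch with synthetic data, not from the original notebook), calling the function on an unbalanced dataset yields a 50/50 sample:

# synthetic unbalanced dataset: 990 negatives, 10 positives
X = np.arange(1000).reshape(-1, 1)
y = np.array(['N'] * 990 + ['Y'] * 10)

X_bal, y_bal = balanced_sampling(X, y, sample_size=200)
print(np.unique(y_bal, return_counts=True))
# (array(['N', 'Y'], ...), array([100, 100]))

Note how the 'N' class (990 examples) is under-sampled without replacement down to 100, while the 'Y' class (10 examples) is over-sampled with replacement up to 100.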


Here are the salient points of the code:

First, we decide the number of examples of each class that we need in order to return a balanced sample; this is achieved in this line:
    class_sample_size = sample_size//len(classes)

Next, we loop over the classes and, for each class, find all its examples:
    for c in classes:

        idx_class = np.asarray([i for i in range(len(y)) if y[i] == c]) 

Once we have all the elements of the class, we can use the numpy.random.choice() function to randomly pick the subset of examples that we need from this class. Note that class_sample_size is passed as the sample size, and that the third parameter 'replace' (for 'sampling with replacement') is true only when we need to sample more examples than the class contains:
        idx_class_sample = np.random.choice(idx_class, 
                                            size=class_sample_size, 
                                            replace=(class_sample_size>len(idx_class)))
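
To see why the replace flag matters, here is a small illustrative sketch (mine, not from the original post): without replacement numpy refuses to draw more items than the population contains, while with replacement it can repeat items:

idx = np.array([3, 7, 42])                     # a class with only 3 examples
np.random.choice(idx, size=5, replace=True)    # OK, repeats allowed, e.g. [7, 3, 7, 42, 3]
# np.random.choice(idx, size=5, replace=False) # raises ValueError: cannot take a larger
                                               # sample than population when replace=False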

Finally, we can concatenate the indexes sampled for each class and shuffle them so the sample is returned in random order:
    idx_balanced_sample = np.concatenate(class_samples)

    np.random.shuffle(idx_balanced_sample)



You can find the whole notebook on GitHub.

In this blog post we discussed the need for a balanced dataset to train machine learning models, and over/under-sampling, a technique to extract a class-balanced sample from an unbalanced dataset.

There is a related concept, called Stratified Sampling, which has the exact opposite goal of over/under-sampling. Stratified sampling is applied when we need the sample distribution to match exactly the class distribution of the original dataset; in this case we would make sure that the sample contains exactly 1% positive cases (no more, no less). You may think that sampling uniformly at random would achieve that, but it is not guaranteed; that is why you sample from the positive class until you get the exact number of examples needed to make up 1% of the sample size.
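
Here is a minimal sketch of stratified sampling in the same style as the function above (my illustrative code, not from the original notebook; it assumes y is a numpy array and sample_size is no larger than the dataset):

def stratified_sampling(X, y, sample_size):
    # pick from each class a number of examples proportional
    # to that class' share of the original dataset
    class_samples = []
    for c in np.unique(y):
        idx_class = np.where(y == c)[0]
        # e.g. with 1% positives, the sample keeps 1% positives (up to rounding)
        n_c = int(round(sample_size * len(idx_class) / len(y)))
        class_samples.append(np.random.choice(idx_class, size=n_c, replace=False))
    idx_sample = np.concatenate(class_samples)
    np.random.shuffle(idx_sample)
    return (X[idx_sample], y[idx_sample])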

So far we have only discussed over/under/stratified sampling with respect to the target variable (click/no-click). You may also want to do stratified sampling with regard to a variable that is not the target; this is commonly applied for prediction (not for training) to match some user demographics. For example, if you are sampling data for an app that is used mostly by teenagers, you may want to extract a sample that matches the age distribution of the app's user population, as sketched below.
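
The same sketch generalizes: to stratify on an attribute other than the target, pass that attribute in place of the labels (age_group here is a hypothetical array, aligned with X, holding each user's age bracket):

# match the app's age distribution instead of the class distribution
X_sample, age_sample = stratified_sampling(X, age_group, sample_size=1000)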


I hope you enjoyed the article; well, being realistic, I'll settle for making you a bit more comfortable thinking about this topic, which tends to come up constantly in the data scientist job, for example:
  1. Having to train a classifier when you have very few labels
  2. Preparing datasets
  3. In a data science interview
  4. In your sleep!
In the next article in this series we will see how to use this dataset to train a classifier, and we will also compare over/under-sampling with other techniques for unbalanced data, like class weights and asymmetric loss functions.
