Stratification can be applied to obtain a subset with a specific distribution.
Real-life data is almost always unbalanced: for example, there may be an overwhelming majority of negative cases and just a handful of positives, which makes it difficult to train a classifier able to tell the classes apart.
Behavioral Analytics is a field of information security that applies data science to detect which ones, out of the millions of daily user actions in your company networks, are the work of a hacker. Multiple techniques apply here; among the most powerful is a classifier that, given a representation of behavioral data, labels it as either benign or malicious.
The most challenging aspect of this technique is that, unless the classifier is extremely precise, it will generate so many false positives that it will cause "alert fatigue" (it will cry wolf hundreds of times a day, so you will ignore it). This problem is the very nature of the business: finding needles in haystacks.
In addition, unbalanced data makes the training process extremely inefficient: most of the training cycles are spent looking at examples of the vast negative class, and very little time at examples of the relatively small positive class.
Let's look at another example. Suppose you are a data scientist working for a search engine company that shows ads to users based on the words they search for, e.g. if a user searches for "Harry Potter", your classifier may choose to show ads for quidditch balls or potion cauldrons.
Now, you have one year of recorded history of what your search engine users did. For each search, you recorded the following three pieces of information (a.k.a. 'attributes'): (1) the words searched (a.k.a. 'keywords'), (2) the product that you decided to advertise, and (3) whether the user clicked on the ad (represented by a Y or an N). This data looks as follows:
| Search Keywords | Product Advertised | User Click |
|-----------------|--------------------|------------|
| Harry Potter    | Potion Cauldron    | N          |
| Harry Potter    | Quidditch Balls    | Y          |
| Harry Potter    | Grammar book       | N          |
| Harry Potter    | Algebra book       | N          |
| Algebra tutor   | Algebra book       | Y          |
If you had to look at a search that a user just typed in and decide whether to show the Algebra ad or not, what data would you base your decision on?
Your best strategy would be to look at your history and find out which kinds of keywords were typed in those searches that culminated in the user clicking on the Algebra book ad, i.e. the positive cases.
Now, if you train a machine learning classifier with your historic dataset, the algorithm will process all the cases, and therefore it will spend 99% of its time learning to recognize keywords typed in searches that didn't culminate in a clicked ad (a vast unbounded universe of keywords), instead of focusing on learning the keywords typed in searches that ended with a click on the Algebra ad, which is what you need it to learn.
I hope the above discussion offers you an intuition of why the training of a classifier with an unbalanced dataset is inefficient.
To help the classifier spend its precious learning time efficiently, we want to train it with roughly the same number of positive and negative cases (50% of each class). And this is what the techniques of Over-Sampling and Under-Sampling achieve, by picking the same number of examples from each class.
Going back to the dataset from the example, you could keep the whole positive class, sample 1% of the negative class, and ignore the other 98% of your data, which is composed entirely of negative cases; this would render a balanced dataset that is 2% of the original in size. This is called Under-Sampling. Alternatively, you could sample the positive cases "with replacement" to enlarge the positive set by up to a factor of 99; this would give you a balanced dataset that is almost twice the size of the original. This is called Over-Sampling.
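As a quick sanity check on those sizes, assuming a hypothetical dataset of 100,000 searches with the 1%/99% split from the example:

```python
# 1,000 positive cases (1%), 99,000 negative cases (99%)
n_pos, n_neg = 1_000, 99_000

# Under-sampling: keep all positives, sample an equal number of negatives
under_size = n_pos + n_pos   # 2,000 rows = 2% of the original 100,000

# Over-sampling: enlarge the positives (with replacement) to match the negatives
over_size = n_neg + n_neg    # 198,000 rows, almost twice the original

print(under_size, over_size)  # -> 2000 198000
```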
In the following code, we show how to create a perfectly balanced dataset, of any size we need, using over- and under-sampling:
```python
import numpy as np

def balanced_sampling(X, y, sample_size):
    # count the classes
    classes = np.unique(y)
    # pick an equal-size sample from each class (stratum)
    class_sample_size = sample_size // len(classes)
    class_samples = []
    for c in classes:
        # indexes of all the examples of class c
        idx_class = np.asarray([i for i in range(len(y)) if y[i] == c])
        # sample with replacement only if the class is smaller than the sample
        idx_class_sample = np.random.choice(idx_class,
                                            size=class_sample_size,
                                            replace=(class_sample_size > len(idx_class)))
        class_samples.append(idx_class_sample)
    # mix all strata together
    idx_balanced_sample = np.concatenate(class_samples)
    np.random.shuffle(idx_balanced_sample)
    # return the sample
    return (X[idx_balanced_sample], y[idx_balanced_sample])
```
Here are the salient points of the code:
First, we decide the number of examples of each class that we need in order to return a balanced sample; this is achieved in this line:
```python
class_sample_size = sample_size // len(classes)
```
Next, we loop over the classes and, for each class, find all of its examples:
```python
for c in classes:
    idx_class = np.asarray([i for i in range(len(y)) if y[i] == c])
```
Once we have all the elements of the class, we can use the numpy.random.choice() function to randomly pick the subset of examples that we need from this class. Note that class_sample_size is passed as the sample size, and that the third parameter, replace (for 'sampling with replacement'), is True only when we want to sample more examples than the class contains:
```python
idx_class_sample = np.random.choice(idx_class,
                                    size=class_sample_size,
                                    replace=(class_sample_size > len(idx_class)))
```
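As a quick illustration of how the replace flag behaves (the index values here are arbitrary, not from the example):

```python
import numpy as np

idx = np.array([10, 20, 30])

# replace=True lets us draw more samples than the array contains
enlarged = np.random.choice(idx, size=9, replace=True)
print(len(enlarged))  # -> 9, with repeated values

# with replace=False, asking for size > len(idx) would raise a ValueError
```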
Finally, we can concatenate the indexes sampled for each class and shuffle them so the sample is returned in random order:
```python
idx_balanced_sample = np.concatenate(class_samples)
np.random.shuffle(idx_balanced_sample)
```
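To see the function end to end, here is a self-contained sketch on a hypothetical toy dataset of 95 negatives and 5 positives (the function body is repeated so the snippet runs on its own):

```python
import numpy as np

def balanced_sampling(X, y, sample_size):
    classes = np.unique(y)
    class_sample_size = sample_size // len(classes)
    class_samples = []
    for c in classes:
        idx_class = np.asarray([i for i in range(len(y)) if y[i] == c])
        idx_class_sample = np.random.choice(idx_class,
                                            size=class_sample_size,
                                            replace=(class_sample_size > len(idx_class)))
        class_samples.append(idx_class_sample)
    idx_balanced_sample = np.concatenate(class_samples)
    np.random.shuffle(idx_balanced_sample)
    return (X[idx_balanced_sample], y[idx_balanced_sample])

# toy unbalanced dataset: 95 negatives (0) and 5 positives (1)
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 95 + [1] * 5)

# ask for 40 examples: 20 per class; the 5 positives get over-sampled
X_bal, y_bal = balanced_sampling(X, y, sample_size=40)
print(np.bincount(y_bal))  # -> [20 20]
```

Note that the 20 positive indexes necessarily contain repeats, since the positive class only has 5 members; that is the over-sampling at work.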
You can find the whole notebook on GitHub.
In this blog we discussed the need for a balanced dataset to train machine learning models, and over/under sampling, a technique to extract a class-balanced sample from an unbalanced dataset.
There is a related concept called Stratified Sampling, which has the exact opposite goal of over/under-sampling. Stratified sampling is applied when we need the sample's class distribution to match exactly that of the original dataset; in our example, we would make sure that the sample contains exactly 1% positive cases (no more, no less). You might think that sampling uniformly at random would achieve this, but that is not guaranteed, which is why you sample from the positive class until you get exactly the number of examples needed to make up 1% of the sample size.
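A minimal sketch of stratified sampling, mirroring the function above (the name stratified_sampling and the toy data are my own, for illustration):

```python
import numpy as np

def stratified_sampling(X, y, sample_size):
    # sample each class in proportion to its share of the full dataset
    classes, counts = np.unique(y, return_counts=True)
    idx_sample = []
    for c, n in zip(classes, counts):
        k = round(sample_size * n / len(y))  # this class's share of the sample
        idx_class = np.flatnonzero(y == c)
        idx_sample.append(np.random.choice(idx_class, size=k, replace=False))
    idx_sample = np.concatenate(idx_sample)
    np.random.shuffle(idx_sample)
    return X[idx_sample], y[idx_sample]

# 1% positives, as in the example: 990 negatives, 10 positives
X = np.arange(1000).reshape(1000, 1)
y = np.array([0] * 990 + [1] * 10)

X_s, y_s = stratified_sampling(X, y, sample_size=100)
print(np.bincount(y_s))  # exactly 99 negatives and 1 positive
```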
So far we have only discussed over/under/stratified sampling with respect to the target variable (click/no-click). You may also want to do stratified sampling with regard to a variable other than the target; this is commonly applied for prediction (not for training) to match some user demographics. For example, if you are sampling data for an app used mostly by teenagers, you may want to extract a sample that matches the age distribution of the app's user population.
I hope you enjoyed the article; realistically, I'll settle for making you a bit more comfortable thinking about a topic that tends to come up constantly in the data scientist job, for example:
- Having to train a classifier when you have very few labels
- Preparing datasets
- In a data science interview
- In your sleep!