Lessons from a Data Scientist: Onboarding your Machine (Learning Program)

These days, ‘machine learning’ is a buzzword you can’t avoid while reading about pretty much any industry.

Its ability to “outthink” humans is touted as a magical ROI booster that can drastically increase productivity while minimizing resource expenditure. The cybersecurity industry is no different. With internet-scale cyber attack campaigns overwhelming cybersecurity teams that struggle to process alerts quickly enough amid oceans of data, machine learning was supposed to be the silver bullet for any modern cybersecurity problem, one that human employees could simply set and forget. However, with great hype often comes great disappointment, and we’re now experiencing the blowback from a growing number of people who believe it has not lived up to expectations at all.

The truth is, machine learning is no silver bullet. However, that doesn’t mean it’s not immensely helpful to cybersecurity programs and crucial to the future of cybersecurity—people just need to reconsider the way they use it. Rather than treating it as an all-powerful robot overlord, the secret to unlocking its potential is treating it as a very junior employee.

Machine Learning isn’t the Smartest One in the Room

Machine learning models are fast, tireless, retentive, and completely without common sense. Just as with any intern on their first day, you wouldn’t assume a model knows how your organization works or, necessarily, the concepts you hope it will eventually master. When you start a machine learning project, think of it as an onboarding process: in the beginning, you need to check in on your models frequently and spend a lot of time getting them pointed in the right direction. At first, the models, which you hope will drive your business to new heights by processing terabytes of data at awesome speed, won’t even understand the task you’re asking of them.

Machine learning’s inability to think critically is a major source of disappointment for those new to the field, and it is why humans will continue to have a very prominent role in the machine-learning age of cybersecurity. Because your models are low-level (but diligent!) workers that can’t see the big picture, you need to continually spoon-feed them instructions. Over time, they’ll see patterns based on your feedback and begin to get the hang of what you want them to look for.

As your models learn, you’ll need to check in on them less and less, but they cannot, and should not, ever be completely autonomous. Be aware of the common cognitive bias of anthropomorphizing apparently intelligent systems. They don’t see things the way you see them and don’t follow a thought process anything like our own. They can quickly stray from the task at hand, sending your entire program into disarray. Below, look at the way deep learning models “understand” the MNIST dataset. None of those images are recognizable to us as numbers.

FIG-1 Visualization of the MNIST data set from a machine learning perspective

Here’s how to make the most of your machine learning program so that it can live up to the hype:

Implement safety nets and monitoring: Once you think your model is performing well, you need a few things to make sure it doesn’t go off the rails. Before you go and build a pipeline, make sure you have the proper safety nets in place. The first of these safety nets is what we call a ‘tripwire.’ If your model exceeds your expectation of the number of instances it will classify within a certain period, your tripwire will automatically disable it. This measure is critical to prevent your model from running out of control.
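
A tripwire like the one described above can be sketched in a few lines of Python. This is only an illustration, not RiskIQ’s implementation: the `Tripwire` class, its parameter names, and the choice of a sliding time window are all assumptions made for the example.

```python
from collections import deque


class Tripwire:
    """Disable a model that flags more instances than expected in a time window.

    Hypothetical sketch: names and the sliding-window design are assumptions.
    """

    def __init__(self, max_flags, window_seconds):
        self.max_flags = max_flags
        self.window_seconds = window_seconds
        self.flag_times = deque()
        self.tripped = False

    def record_flag(self, timestamp):
        """Record one classification; return True while the model may keep running."""
        if self.tripped:
            return False
        self.flag_times.append(timestamp)
        # Drop classifications that have aged out of the window.
        while timestamp - self.flag_times[0] >= self.window_seconds:
            self.flag_times.popleft()
        if len(self.flag_times) > self.max_flags:
            self.tripped = True  # stays disabled until a human resets it
        return not self.tripped

    def reset(self):
        """Re-enable the model after someone has reviewed what went wrong."""
        self.flag_times.clear()
        self.tripped = False
```

Requiring a manual `reset()` is a deliberate choice in this sketch: a model that tripped once is likely to trip again until someone has looked at why.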

Going rogue is extremely common for models when they’re first released because, although you’ve provided your initial model a pristine, hand-curated data set from which to learn, the real world is really dirty, and dirty in ways you could never anticipate. Just like a fresh college graduate, your model will encounter things that didn’t appear in its textbook, causing it to default to biases formed through its training data.

For example, if your training data only contains cats and dogs, when you provide the model with a fish, it will try to classify it as either a cat or a dog. To see what I mean, check out Shutterstock’s Deep Dream Generator, which uses AI to create images based on other images. When asked to create a unique image from the only two it’s ever seen, it’s unable to bring anything genuinely new to the table, and the result is an abstract combination of the two. Unlike a human with common sense, your model will need to be corrected, learn from its mistakes, and try again. The algorithm used to train your model also has inherent biases. Just like people, every model creates its own view of the problem. At first, it makes assumptions that oversimplify the solution (we’ll get into this later).

The next safety net is a whitelist: a list of items your models should ignore. In a perfect world, you wouldn’t need whitelists because you would invest the time in engineering better features and retraining your model until it gets a specific example right. However, when you need to act now, you will be thankful you have them. While not ideal, whitelists not only prevent your current model from classifying an instance incorrectly but also help all of your future models.
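
One way to picture why a whitelist helps future models too is to keep it outside the model entirely. The sketch below is an assumption-laden illustration: `with_whitelist`, the two toy models, and the example domain are hypothetical and exist only to show the wrapping pattern.

```python
def with_whitelist(model, whitelist):
    """Wrap any model so whitelisted items are ignored before the model sees them."""
    ignored = {entry.strip().lower() for entry in whitelist}

    def classify(item):
        if item.strip().lower() in ignored:
            return "ignored"
        return model(item)

    return classify


# Two hypothetical models: the current one and a future replacement.
def current_model(domain):
    return "threat" if "login" in domain else "benign"


def future_model(domain):
    return "threat" if "login" in domain or domain.count("-") >= 2 else "benign"


# The same whitelist protects both models without retraining either one.
WHITELIST = ["login.example.com"]
guarded_current = with_whitelist(current_model, WHITELIST)
guarded_future = with_whitelist(future_model, WHITELIST)
```

Because the list lives in the wrapper rather than in the training data, swapping `current_model` for `future_model` keeps every correction already made.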

Prevent Degradation: Your model may work at first, but without proper feedback, its performance will degrade over time: the precision during the first week will be better than during the tenth week. How long it takes the model to degrade to an unacceptable level depends on your tolerance and the model’s ability to generalize to the problem. For example, a signature-based detection model (e.g., a regular expression) may start degrading almost immediately and need to be updated weekly or even daily to keep up with an evolving cyber threat. Meanwhile, a decision tree may be able to generalize to an entire class of cyber threats given the right features and provide good performance for a longer period.
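
To notice degradation, you have to measure it. A minimal sketch, assuming analysts record for each reviewed instance the week, what the model predicted, and what the truth turned out to be; the function names and the record layout are illustrative, not taken from any particular system.

```python
from collections import defaultdict


def precision_by_week(feedback):
    """Return {week: precision} from (week, predicted_threat, actually_threat) records."""
    true_positives = defaultdict(int)
    flagged = defaultdict(int)
    for week, predicted_threat, actually_threat in feedback:
        if not predicted_threat:
            continue  # precision only considers what the model flagged
        flagged[week] += 1
        if actually_threat:
            true_positives[week] += 1
    return {week: true_positives[week] / flagged[week] for week in sorted(flagged)}


def weeks_below(precisions, floor):
    """List the weeks whose precision fell below the tolerance you chose."""
    return [week for week, precision in precisions.items() if precision < floor]
```

The `floor` argument is where your own tolerance comes in: a team that can absorb many false positives will set it lower than one that cannot.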

The world changes all the time, and it’s important that your model changes with it. If you need your model to keep up with current trends, selecting an instance-based model or a model that can learn incrementally is critical. Just as providing frequent feedback helps an employee learn and grow, your model needs the same kind of feedback.
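
As a toy illustration of an instance-based model that learns incrementally, consider a one-nearest-neighbor classifier: every piece of feedback is simply stored and influences the very next prediction, with no retraining step. The class and its feature vectors are hypothetical, and a production system would need far more than this (feature scaling, indexing, forgetting stale examples).

```python
import math


class NearestNeighbor:
    """One-nearest-neighbor classifier that learns one labeled instance at a time."""

    def __init__(self):
        self.examples = []  # list of (features, label) pairs

    def learn(self, features, label):
        """Store a labeled instance; it takes effect immediately."""
        self.examples.append((tuple(features), label))

    def predict(self, features):
        """Return the label of the closest stored instance (Euclidean distance)."""
        if not self.examples:
            raise ValueError("model has not seen any examples yet")
        nearest = min(self.examples, key=lambda example: math.dist(example[0], features))
        return nearest[1]
```

For example, after learning `(0, 0)` as benign and `(10, 10)` as a threat, the point `(5, 4)` is predicted benign; once an analyst labels `(5, 4)` itself as a threat, the same query returns the corrected answer.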

Active Learning: Active learning places an expert in the loop. When the model is unsure how to categorize a certain instance, having the ability to ask for help is critical. Models typically provide a probability or score with their prediction, which gets turned into a binary decision (e.g., cyber threat or not a cyber threat) based on some threshold you’ve provided.

But with no guidance, things get problematic quickly. Imagine a junior cybersecurity researcher who doesn’t know how to assess a certain cyber threat. They think something might be malicious, but they aren’t quite sure. They fire off an email to you requesting help, but that email doesn’t get answered for a month or more.

Left to their own devices, the employee may make an incorrect assumption, and so may your model. If the instance was just below the cutoff but the cyber threat was real, the model will continue to ignore it, resulting in a potentially serious false negative. If it errs the other way and acts, the model will continue to flag benign instances, generating a flood of false positives. Developing a feedback mechanism that provides your model with the ability to identify and surface questionable items is critical to the success of your model.
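
A simple way to let a model “ask for help” is to treat scores close to the threshold as questions rather than answers. The sketch below assumes scores between 0 and 1; the `margin` parameter, the function names, and the label strings are illustrative choices, not a description of any specific product.

```python
def triage(score, threshold=0.5, margin=0.1):
    """Turn a model score into a decision, or ask for help near the threshold."""
    if abs(score - threshold) < margin:
        return "ask_expert"
    return "threat" if score >= threshold else "not_threat"


def route(scored_items, threshold=0.5, margin=0.1):
    """Split (item, score) pairs into automatic decisions and a review queue."""
    decisions, review_queue = {}, []
    for item, score in scored_items:
        outcome = triage(score, threshold, margin)
        if outcome == "ask_expert":
            review_queue.append((item, score))
        else:
            decisions[item] = outcome
    # Surface the least certain items first so experts see them soonest.
    review_queue.sort(key=lambda pair: abs(pair[1] - threshold))
    return decisions, review_queue
```

Widening `margin` sends more items to humans and fewer to automatic decisions, so it is effectively a dial for how much expert time you are willing to spend.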

At RiskIQ, we know how critical it is to keep our models up to date. We have worked hard to make it easy for anyone in the company to provide our models with feedback. This feedback is delivered to our models immediately.

Blending and Co-training: Everyone knows collaboration and diversity help organizations grow. When the CEO surrounds herself with “yes-men” or a lone wolf decides they can do better by themselves, ideas stagnate. Machine learning models are no different. Every data scientist has a “go-to” algorithm they use to train their models. It is important not only to try other algorithms but also to try them together. At RiskIQ, we use blended (also known as stacked) models where the base models marry two or more different perspectives.

We also use a technique inspired by co-training, a semi-supervised technique where two or more very good supervised models are used together to classify unlabeled examples. When the models disagree on the classifications of these examples, the disagreements are escalated to the active learning system described above.
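
The disagreement step can be sketched as follows. This is a simplified illustration of routing by agreement, not full co-training and not RiskIQ’s system: the function name and the two toy models (one keyword-based, one length-based) are hypothetical stand-ins for models trained on different views of the data.

```python
def split_by_agreement(unlabeled, model_a, model_b):
    """Separate examples both models agree on from those needing expert review."""
    agreed, escalated = [], []
    for example in unlabeled:
        label_a, label_b = model_a(example), model_b(example)
        if label_a == label_b:
            agreed.append((example, label_a))  # candidate training label
        else:
            escalated.append(example)  # hand off to the active learning queue
    return agreed, escalated


# Two toy models that look at the same URL from different perspectives.
def keyword_model(url):
    return "threat" if "login" in url else "benign"


def length_model(url):
    return "threat" if len(url) > 30 else "benign"
```

The approach only pays off when the models are genuinely different; two copies of the same model would always agree and never escalate anything.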

Machine Learning is Here to Stay

We live in a data-driven society in which humans really can’t go it alone. With some work, machine learning can be used to leverage your employees’ knowledge and abilities to fill a necessary gap in the talent pool. However, machine learning models are not something you can set and forget. They need frequent feedback and monitoring to provide you with the best performance. Do yourself a favor and make providing that feedback easy. The time you invest in it will pay dividends.
