In the age of internet-scale threats, automation has been the weapon of choice for cybercriminals, who are pulling off data breaches at an unprecedented rate. But in the hands of the good guys, who are tirelessly collecting internet data they use to train the machine-learning models that uncover these digital threats, automation is the only thing that can shift the balance of power from the criminals back to the infosec community.
Today, cyber attackers are targeting businesses on a completely different front from the ones they're used to defending, a place where many of them have little to no visibility: the Internet. This wild, lawless expanse that sits beyond the frontier of corporate firewalls, proxies, and other traditional cyber defenses is not only outside the purview of network security, it's also incomprehensibly big. Combined, the lack of visibility and vast openness create a broad and inviting attack surface for hackers to exploit brands, consumers, and employees with relative impunity.
Automation gives rise to threat factories
Much as automation multiplied manufacturing output in factories, it has done the same for cybercrime. To take advantage of the sheer amount of space they have to work with, threat actors overwhelm victims by mass-producing infrastructure for their campaigns. They spin up thousands of domains, IP addresses, SSL certificates, and pieces of content in extremely short periods. They then use this infrastructure to deploy scaled phishing attacks, flood the web with infringing domains, and commit brand fraud for monetary gain.
An automated threat feeding frenzy
Meanwhile, businesses are helpless as these threats intercept them and their customers across the near-infinite expanses of the internet, entirely out of view of security teams, who don't have a clue until it's far too late. Before these teams can even address the threat, their website or server has been compromised, their customers' credentials stolen, or their prospects siphoned off to a dangerous or fraudulent site. Often, these victimized businesses first hear about these breaches from a customer complaint, or worse, the news.
Automation is the perfect camouflage
In response, security researchers turned to signature-based detection, in which machine-learning models are taught to associate certain patterns in code with malicious activity and alert researchers to the presence of malware. Not to be outdone, threat actors began using automation of their own. The result is an arms race between ever more sophisticated machine learning and ever more clever obfuscation techniques.
The automation arms race
The arms race began in earnest once threat actors began automating the production of randomly generated malware that appears different every time it's deployed. This 'high-entropy' code creates so much extra noise that it can confuse and defeat the static approach of signature-based detection. Automation also helped threat actors turn the tables on their infosec pursuers with 'fingerprinting' techniques. These techniques can determine if a user visiting one of their infected websites is an actual victim or a researcher or virtual user (bot) deployed by researchers trying to analyze the malware.
Some of these fingerprinting techniques identify the browser or IP address from which a user is interacting with a site. Others are time-based and measure how quickly things on a web page are executed: for example, a bot may move far faster than a typical user, while a researcher analyzing the page may move considerably slower. If anything does arouse suspicion, these fingerprinting techniques prevent the malware from firing, thus ensuring that the threat actors make a successful getaway.
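To make the 'high-entropy' idea mentioned above concrete, here is a minimal Python sketch of one common heuristic: measuring the Shannon entropy of a payload. This is an illustration rather than a method described in this article; the function names and the 5.5-bit threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum -p * log2(p) over the observed byte values.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_high_entropy(data: bytes, threshold: float = 5.5) -> bool:
    """Flag payloads whose entropy exceeds an illustrative, assumed threshold."""
    return shannon_entropy(data) > threshold
```

Repetitive content scores near zero, while content that uses every byte value equally often scores the maximum of 8 bits per byte. On its own, a score like this is only one weak signal, which is part of why static approaches struggle against randomly generated code.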
So, with more and more automated digital threat infrastructure spun up every day and obfuscation techniques evolving, how can infosec pros better use machine learning to respond?
Fighting fire with fire
Like any problem with a machine-learning solution, it all starts with big data. In machine learning, more data usually means smarter models. And with detecting digital threats on the internet about as easy as finding needles in a cyberspace-sized haystack, intelligent machine learning models are the only way to identify and mitigate these digital threats before they strike.
Luckily, efforts are underway by the infosec community to collect as much internet data as possible, not only about digital threats themselves but also about the internet as a whole. By analyzing all this data, machine-learning models can see the internet not as the infinite, chaotic place it appears to humans and in which threat actors can hide so easily, but as the tidy graph of highly connected data points that it really is, in which threat actors have nowhere to hide.
About those models:
There is no silver bullet for taking down cybercrime with machine learning. The solution lies in a partnership between human and machine, and a blend of different types of models, each of which brings something unique to the table.
Machine learning: building the automated cyber warrior
Machine learning models are broad, fast, and tireless. However, to get them started, you always need a human to write the rules, beginning with a simple decision tree. As the human analyzes more and more attacks, they can build out a larger and larger decision tree, creating stepping stones to more nuanced threat detection. Eventually, the machine learning algorithms will have enough data to need less and less human intervention. At that point, they are ready to start detecting digital threats at internet scale.
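As a rough picture of what such a starting point might look like, here is a small hand-written decision tree in Python. Everything in it is hypothetical: the feature names, the 30-day cutoff, and the output labels are assumptions made for illustration, not rules taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class PageObservation:
    """Hypothetical features an analyst might record for a web page."""
    domain_age_days: int
    mentions_brand: bool
    has_login_form: bool
    registered_by_brand: bool


def classify(page: PageObservation) -> str:
    """Walk a tiny, hand-written decision tree and return a label."""
    if page.registered_by_brand:
        return "benign"
    if page.mentions_brand:
        if page.has_login_form:
            return "likely_phishing"
        if page.domain_age_days < 30:  # assumed cutoff for a "new" domain
            return "suspected_infringing_domain"
    # No rule matched: a human analyst decides, and may add a new branch.
    return "needs_review"
```

Each case that falls through to "needs_review" is a chance for the analyst to add a branch, and each labeled result becomes training data for the models that eventually take over.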
But how can these algorithms learn to battle obfuscation techniques like fingerprinting? Here are three ways:
Active Learning: Active learning places an expert in the loop. When the model is unsure how to categorize a particular instance, having the ability to ask for help is critical. Models typically provide a probability or score with their prediction, which gets turned into a binary decision (i.e., threat or not a threat) based on some threshold you’ve provided.
But with no guidance, things get problematic, and quickly. Imagine a junior security researcher who doesn’t know how to assess a specific digital threat. They think something might be malicious, but they aren’t quite sure. They fire off an email to you requesting help, but that email doesn’t get answered for a month or more.
Left to their own devices, the employee may make an incorrect assumption. A model is in the same position. If an instance scored just below the cutoff but the digital threat was real, the model will continue to ignore it, resulting in a potentially severe false negative. If it instead errs toward acting, the model will continue to flag benign instances, generating a flood of false positives. Developing a feedback mechanism that provides your model with the ability to identify and surface questionable items is critical to the success of your model.
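One simple way to picture such a feedback mechanism is an "uncertainty band" around the threshold: scores that land too close to the cutoff are sent to an expert instead of being decided automatically. The Python sketch below assumes a 0.5 threshold and a 0.1 margin, both of which are illustrative values, and the function names are made up for this example.

```python
from typing import Dict, List, Tuple


def triage(score: float, threshold: float = 0.5, margin: float = 0.1) -> str:
    """Turn a model score into a decision, or into a request for expert help."""
    if abs(score - threshold) < margin:
        return "ask_expert"
    return "threat" if score >= threshold else "not_threat"


def split_for_review(
    scores: Dict[str, float], threshold: float = 0.5, margin: float = 0.1
) -> Tuple[Dict[str, str], List[str]]:
    """Return automatic decisions plus a review queue, most uncertain first."""
    decisions: Dict[str, str] = {}
    review_queue: List[str] = []
    for item, score in scores.items():
        outcome = triage(score, threshold, margin)
        if outcome == "ask_expert":
            review_queue.append(item)
        else:
            decisions[item] = outcome
    # Items closest to the threshold are the ones the model is least sure of.
    review_queue.sort(key=lambda item: abs(scores[item] - threshold))
    return decisions, review_queue
```

The expert's answers on the queued items can then be fed back as new labeled examples, so the model gets its "email" answered promptly rather than guessing.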
Blending and Co-training: Everyone knows collaboration and diversity help organizations grow. When the CEO surrounds herself with “yes-men” or a lone wolf decides they can do better by themselves, ideas stagnate. Machine learning models are no different. Every data scientist has their “go-to” algorithm they use to train their models. It is essential not only to try other algorithms but also to try them together. At RiskIQ, we use blended (also known as stacked) models where the base models marry two or more different perspectives.
We also use a technique inspired by co-training, a semi-supervised method where two or more excellent supervised models are used together to classify unlabeled examples. When the models disagree on the classifications of these examples, the disagreements are escalated to the active learning system described above.
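The escalation step described above can be sketched in a few lines of Python. This is not RiskIQ's implementation, and it covers only the disagreement routing rather than full co-training; the two stand-in "models" and their feature names (`url_score`, `content_score`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, float]
Model = Callable[[Example], bool]  # returns True when it predicts "threat"


def route_unlabeled(
    examples: List[Example], model_a: Model, model_b: Model
) -> Tuple[List[Tuple[Example, bool]], List[Example]]:
    """Keep labels both models agree on; escalate the disagreements."""
    agreed: List[Tuple[Example, bool]] = []
    escalated: List[Example] = []
    for example in examples:
        label_a, label_b = model_a(example), model_b(example)
        if label_a == label_b:
            agreed.append((example, label_a))
        else:
            escalated.append(example)  # hand off to the expert-in-the-loop queue
    return agreed, escalated


# Hypothetical stand-ins for two trained models with different perspectives.
def url_model(example: Example) -> bool:
    return example["url_score"] >= 0.5


def content_model(example: Example) -> bool:
    return example["content_score"] >= 0.5
```

The value of the approach comes from the models looking at the problem differently: if both models made the same mistakes, they would rarely disagree and little would be escalated.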
Preventing Degradation: Your model may work at first, but without proper feedback, its performance will degrade over time (the precision during the first week will be better than on the tenth week). How long it takes the model to deteriorate to an unacceptable level depends on your tolerance and its ability to generalize to the problem.
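To notice this kind of decay, you need to measure it. The Python sketch below computes precision for each week of reviewed predictions and reports the weeks that fall below your tolerance. The data layout (pairs of predicted and actual labels) and the function names are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Each reviewed prediction is (predicted_threat, actually_threat).
Reviewed = Tuple[bool, bool]


def precision(batch: Sequence[Reviewed]) -> Optional[float]:
    """Fraction of flagged items that were real threats; None if nothing was flagged."""
    flagged = [actual for predicted, actual in batch if predicted]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


def weeks_below_tolerance(
    weekly_batches: Sequence[Sequence[Reviewed]], tolerance: float
) -> List[int]:
    """Return the 1-based week numbers whose precision fell below `tolerance`."""
    degraded: List[int] = []
    for week_number, batch in enumerate(weekly_batches, start=1):
        value = precision(batch)
        if value is not None and value < tolerance:
            degraded.append(week_number)
    return degraded
```

For example, a model that flags ten items a week and gets nine right early on, but only six right later, would show up in the second list as soon as your tolerance is set anywhere above 0.6.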
The world changes all the time, and it’s important that your model changes with it. If you need your model to keep up with current trends, selecting an instance-based model or a model that can learn incrementally is critical. Just as providing frequent feedback helps an employee learn and grow, your model needs the same kind of feedback.
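As one illustration of incremental learning, here is a tiny online logistic regression in plain Python that updates its weights one labeled example at a time instead of being retrained from scratch. It is a teaching sketch under assumed settings (the class name, learning rate, and two-feature input are all made up here), not a production model.

```python
import math
from typing import List


class OnlineLogisticModel:
    """Minimal logistic regression trained by stochastic gradient descent."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: List[float]) -> float:
        """Estimated probability that `x` is a threat."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x: List[float], is_threat: bool) -> None:
        """Nudge the weights toward the correct answer for one example."""
        error = self.predict_proba(x) - (1.0 if is_threat else 0.0)
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
```

Each piece of expert feedback becomes a single call to `learn_one`, so the model keeps adjusting as new kinds of threats appear, without waiting for a full retraining cycle.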
Machine learning is here to stay
The arms race between cybercriminals and the infosec community won't stop any time soon. However, as attackers get increasingly sophisticated, machine learning in the hands of the infosec community will continue to adapt to identify digital threats on an Internet scale. Attack surfaces are more expansive than ever, but as machine learning algorithms get smarter, the bad guys will run out of places to hide.
CTO and Chief Data Scientist
As Chief Data Scientist, Adam leads the data science, data engineering, and research teams at RiskIQ. Adam pioneers research automating detection of adversarial attacks across disparate digital channels, including email, web, mobile, and social media. Adam has also received patents for identifying new external threats using machine learning. Adam received his Ph.D. in experimental particle physics from Princeton University. As an award-winning member of the CMS collaboration at the Large Hadron Collider, he played an integral part in developing the online and offline analysis systems that led to the discovery of the Higgs Boson.