Since releasing our host attribute datasets (pairs, components, trackers), we’ve gotten a lot of great feedback from our community. Users are reporting faster investigation times, more substantial connections, and new research leads they wouldn’t have found otherwise. While these datasets are valuable, they are only a fraction of the data RiskIQ stores on a daily basis. What makes RiskIQ’s web crawling technology powerful is that it’s not just a simulation; it’s a fully instrumented browser. To understand what this means, we thought it would be useful to go behind the scenes and into a sample crawl response from the RiskIQ toolset.
Understanding a web crawl is a fairly straightforward process. Similar to how you digest data from pages you browse online, RiskIQ’s web crawlers largely do the same, only faster, automated, and made to store the entire chain of events. When web crawlers process web pages, they take note of links, images, dependent content, and other details to construct a sequence of events and relationships. Web crawls are powered by an extensive set of configuration parameters that could dictate an exact URL starting point or something more complex like a search engine query.
For most web crawls, once they have a starting point, they will perform the initial crawl, take note of all the links from within the page, and then crawl those follow-on pages, performing the same process over again. To avoid crawling forever, most web crawls have a depth limit that stops after 25 or so links beyond the initial starting point. RiskIQ’s configuration allows for different parameters to be set that dictate the operation of the web crawl.
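The depth-limited crawl described above can be sketched as a breadth-first traversal. This is a minimal illustration, not RiskIQ's implementation; to keep it self-contained it walks a hypothetical in-memory link graph (the `get_links` callback) instead of fetching live pages.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=25):
    """Breadth-first crawl from start_url, following links returned by
    get_links(url) until max_depth levels past the starting point."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: record the page, follow no links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph standing in for live pages:
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
order = crawl("a", lambda u: graph.get(u, []), max_depth=1)
```

With `max_depth=1` the crawler records the start page and its direct links but never follows links found on those follow-on pages.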
Knowing that malicious actors are always trying to avoid detection, RiskIQ has invested a great deal of time and effort into its proxy infrastructure. This includes a combination of standard servers and mobile cell providers being used as egress points deployed all over the world. Having the ability to simulate, say, a mobile phone in the region where it’s being targeted means the RiskIQ crawlers have a higher likelihood of observing the full exploitation chain.
RiskIQ provides a web front-end for the raw data collected from a web crawl and slices it up into the following sections. It’s worth noting that most, if not all, of the data presented in the interface is searchable via database queries, some of which power the PassiveTotal host attribute datasets.
If you ever view a web page in Google Chrome and fire up the developer tools, you might notice a few messages in the console. Sometimes these are leftover debugging statements from the site author and other times they are errors that were encountered when rendering the page. These messages are stored with the web crawl details and presented in the raw form. While not always helpful, it’s possible that unique messages could be buried inside of the messages pane that may show signs of a common author or malicious tactic.
Dependent requests are a great place to start when investigating a suspicious website. Seeing all the requests made to create the page and how it was loaded can sometimes reveal things like malvertising through injected script tags or iframes. Additionally, being able to see the resolving IP address means we can use that as another reference point for our investigation. Passive DNS data may reveal additional host names that have already been actioned by the team.
Just like a standard browsing experience, RiskIQ web crawlers understand and support the concept of cookies. These are stored with the requests and include things like the original names and values, supported domains, and whether or not the cookie has been flagged as secure. Often associated with legitimate web behavior, cookies are also used by malicious actors to keep track of infected victims or store data to be used later. Using the unique names or values of a cookie allows an analyst to begin making correlations that would otherwise be lost in datasets like passive DNS or web scraping.
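Correlating crawls by cookie attributes might look like the following sketch, which uses Python's standard `http.cookies` parser. The `Set-Cookie` values and the `victim_id` cookie name are hypothetical examples, not real observed data.

```python
from http.cookies import SimpleCookie

def cookie_fingerprint(set_cookie_header):
    """Parse a Set-Cookie header into (name, value, secure) tuples
    that can be used to correlate crawls across hostnames."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    return [(name, m.value, bool(m["secure"])) for name, m in jar.items()]

# Hypothetical headers observed on crawls of two different hosts:
a = cookie_fingerprint("victim_id=abc123; Secure; Path=/")
b = cookie_fingerprint("victim_id=zzz999; Path=/")

# A shared cookie name across unrelated hosts is a correlation lead:
shared = {name for name, _, _ in a} & {name for name, _, _ in b}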
To thoroughly crawl a web page, RiskIQ crawlers need to understand the HTML DOM to extract additional links. Each A element shows the original href and text properties while also preserving the XPath location of where it was found within the DOM. While a bit abstract, having the XPath location allows an analyst to begin thinking about the structure of the web page in a different way. Longer path locations imply deeper nesting of link elements, while shorter locations could be top-level references.
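Recording an anchor together with its location in the DOM can be sketched with Python's standard `html.parser`. The class name `AnchorIndexer` and the simplified path format (tag names only, no sibling indexes) are illustrative assumptions, not RiskIQ's actual representation.

```python
from html.parser import HTMLParser

class AnchorIndexer(HTMLParser):
    """Record each <a> tag's href together with a simple XPath-style
    location built from the stack of currently open elements."""
    def __init__(self):
        super().__init__()
        self.stack, self.anchors = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href", "")
            self.anchors.append(("/" + "/".join(self.stack), href))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to the matching open tag (tolerates unclosed tags)
            while self.stack.pop() != tag:
                pass

p = AnchorIndexer()
p.feed("<html><body><div><a href='/x'>x</a></div>"
       "<a href='/y'>y</a></body></html>")
```

The first link yields the longer path `/html/body/div/a` (deeper nesting) while the second yields `/html/body/a` (a top-level reference), matching the intuition above.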
Headers make it possible for the modern web to work properly. They dictate the rules of engagement and describe what the client is requesting and how the server responds. RiskIQ keeps both ends of the headers and not only preserves the keys and values but also the order in which they were observed. This process captures both standard and custom headers which create the opportunity for unique fingerprinting of specific servers or services.
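One way to turn ordered headers into a server fingerprint is to hash the header names in the order they were observed, ignoring the values (which vary per response). This is a minimal sketch of the idea, not RiskIQ's fingerprinting method.

```python
import hashlib

def header_fingerprint(headers):
    """Hash the ordered header names (not values), so two servers that
    emit the same headers in the same order share a fingerprint."""
    ordered_names = "\n".join(name.lower() for name, _ in headers)
    return hashlib.sha256(ordered_names.encode()).hexdigest()[:16]

# Hypothetical response headers captured from two crawls; note the
# custom X-Custom-Id header, which makes the ordering distinctive:
srv1 = [("Server", "nginx"), ("X-Custom-Id", "a1"), ("Content-Type", "text/html")]
srv2 = [("Server", "nginx"), ("X-Custom-Id", "b2"), ("Content-Type", "text/html")]
```

The two responses share a fingerprint despite different values, while a server that emits the same headers in a different order would not.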
Today’s web experience is largely built on dynamic content. Web pages are fluid and may change hundreds of times after their initial load. In fact, some web pages are simply shells that only become populated after a user has requested the page. If the RiskIQ web crawlers only downloaded the initial pages, many of them would appear blank or lack any substantial content. Because the crawlers act as a full browser, they can observe changes made to the structure of the page and log them in a running list. This log becomes extremely powerful when trying to determine what happened inside the browser after the page loaded. What might start off as a benign shell of a page, may eventually become a platform for exploit delivery or the start of a long redirection sequence to a maze of subsequent pages.
If you think of a web crawl as a tree structure, you may have an initial trunk with multiple branches, where each additional branch could have its own set of branches. The cause section of RiskIQ content subscribes to this idea of a tree and shows all of the web requests in the order in which they were called. Viewing the relationships to multiple web pages this way allows us to gain an understanding of the role of each page. The above example shows a long-running chain of redirection sequences starting off with bit.ly, then Google, and eventually ending at Dropbox.
Data from the cause tree is what we use to create the host pair dataset inside of PassiveTotal. By seeing the full chain of events, we can be flexible in our detection methods. For example, if we had the final page of an exploit kit, we could use the cause chain to walk back up to the initial page that caused it. Along the way, this could reveal more malicious infrastructure or maybe even point to something like a compromised site.
As you can see from the above screenshots and explanations, RiskIQ has a lot of data that can be used for analysis purposes. Our goal at PassiveTotal is to introduce this data in a format that is easy to use and understand. So far, we have released host pairs, trackers and web components which are all derived from the web crawling.
In the “cause” section of the crawl, it was easy to see how all the links related to each other and formed a tree structure. By keeping this structure, it’s possible for us to derive a set of host pairs based on the redirection sequences. For example, one of the bitly links in the screenshot redirects to an Amazon hosted page using an HTTP 302. This would create the host pair of the parent being the bitly link and child being the Amazon link.
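Deriving host pairs from an ordered redirect chain can be sketched in a few lines. The chain below is a hypothetical example echoing the bit.ly-to-Dropbox sequence described earlier; the exact URLs are made up.

```python
from urllib.parse import urlparse

def host_pairs(redirect_chain):
    """Turn an ordered chain of redirecting URLs into
    (parent_host, child_host) pairs."""
    hosts = [urlparse(u).hostname for u in redirect_chain]
    return [(p, c) for p, c in zip(hosts, hosts[1:]) if p != c]

# Hypothetical redirection sequence observed during a crawl:
chain = [
    "https://bit.ly/short-link",
    "https://example.amazonaws.com/landing",
    "https://www.dropbox.com/s/payload",
]
pairs = host_pairs(chain)
```

Each pair captures one parent-child relationship, which is the shape of the host pairs dataset: the shortener is the parent of the hosted page, which in turn is the parent of the final destination.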
Since RiskIQ stores all of the page content from the web crawls, it’s easy to go back in time to extract content from the DOM. Trackers are generated from both inline processing during the crawl and post-processing of data based on extraction patterns. Using regular expressions or code-defined logic, we can extract codes from the web page content, pull the timestamp of the occurrence, and note which hostname it was associated with.
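A regex-based tracker extractor might look like the sketch below, using Google Analytics `UA-` IDs as one well-known tracker format. The function name and the `(code, hostname, timestamp)` row shape are assumptions for illustration, not the actual extraction pipeline.

```python
import re
from datetime import datetime, timezone

# Pattern for classic Google Analytics tracking IDs, one common tracker type:
UA_PATTERN = re.compile(r"UA-\d{4,10}-\d{1,4}")

def extract_trackers(hostname, page_content):
    """Pull tracker codes out of stored page content and tag each one
    with the hostname and an observation timestamp."""
    ts = datetime.now(timezone.utc).isoformat()
    return [(code, hostname, ts) for code in UA_PATTERN.findall(page_content)]

# Hypothetical stored DOM snippet from a crawl:
rows = extract_trackers("example.com", "ga('create', 'UA-12345-6', 'auto');")
```

Two unrelated hostnames sharing the same tracker code is a strong hint they are run by the same operator.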
Creating the web components dataset is a combination of header and DOM content parsing similar to that of the trackers dataset. Regular expressions and code-defined logic aid in extracting details about the server infrastructure and other details like which web libraries were associated with a page.
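Combining header and DOM parsing into component detections can be sketched as a small rule table. The two rules below (an `nginx` version from the `Server` header, a jQuery version from a script filename) are hypothetical examples of such patterns.

```python
import re

# Hypothetical detection rules: (where to look, pattern, component name)
RULES = [
    ("header", re.compile(r"nginx/([\d.]+)"), "nginx"),
    ("body", re.compile(r"jquery-([\d.]+?)(?:\.min)?\.js"), "jQuery"),
]

def detect_components(headers, body):
    """Match response headers and DOM content against simple rules to
    label the web components a page was built with."""
    found = []
    header_blob = " ".join(f"{k}: {v}" for k, v in headers)
    for where, pattern, name in RULES:
        m = pattern.search(header_blob if where == "header" else body)
        if m:
            found.append((name, m.group(1)))  # (component, version)
    return found

comps = detect_components(
    [("Server", "nginx/1.18.0")],
    "<script src='/js/jquery-3.6.0.min.js'></script>",
)
```

A sudden change in a site's detected components (say, an unfamiliar script library appearing) is the kind of signal this dataset surfaces.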
We see the current host attributes dataset as just the beginning of what’s possible. In the coming months, we are going to start exploring how we can offer more data from the web crawls to our users in our web interface, API, and third-party integrations. Additionally, we are thinking of ways to expose our crawling capability directly to our community of users. In our next post, we will explore a crimeware campaign used to infect users with malicious Chrome Extensions through an extensive web process.