Derived Host Pairs from Web Crawling - RiskIQ

April 19, 2016, Steve Ginty

Did you realize that in loading this blog post, your web browser made over 50 network requests for the resources needed to construct it? The modern web is a complex graph of dependent requests made up of images, code libraries, page content and other references. Every day, RiskIQ's crawling technology makes nearly 2 billion HTTP requests online and saves the contents of each session in a database. Using years of this data, engineers at RiskIQ put together our latest public dataset, host pairs.

Simply put, host pairs are two domains (a parent and a child) that shared a connection observed during a RiskIQ crawl. The connection could range from a top-level redirect (HTTP 302) to something more complex like an iframe or script source reference. What makes this new dataset powerful is the ability to understand relationships between hosts based on details from visiting the actual page. Unlike our other datasets, host pairs relies on knowing the website content, so it's likely to surface connections that other sources like passive DNS and SSL certificates could miss.
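To make the shape of the data concrete, a host pair can be thought of as a small record: a parent, a child, and the cause of the connection. The sketch below is illustrative only. The HostPair class, its field names, and the two helper functions are assumptions made for this example, not the schema RiskIQ uses; the cause strings are the ones that appear in the listings in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPair:
    """One observed connection between two hosts (hypothetical record layout)."""
    parent: str  # host that referenced or redirected to the child
    child: str   # host that was referenced or redirected to
    cause: str   # how they were connected, e.g. "topLevelRedirect", "iframe.src"


def children_of(pairs, host):
    """Return the pairs in which `host` is the parent."""
    return [p for p in pairs if p.parent == host]


def parents_of(pairs, host):
    """Return the pairs in which `host` is the child."""
    return [p for p in pairs if p.child == host]


# A handful of pairs taken from the listings in this post.
pairs = [
    HostPair("antivirus.safetynote.xyz", "wsj.om", "topLevelRedirect"),
    HostPair("antivirus.safetynote.xyz", "areasnap.com", "iframe.src"),
    HostPair("av.securealert.xyz", "wsj.om", "topLevelRedirect"),
]

for pair in parents_of(pairs, "wsj.om"):
    print(f"{pair.parent} => {pair.child} ({pair.cause})")
```

Querying in both directions is what lets an analyst pivot: start from one host, list its children, then ask who else points at those children.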

Chain of Phish

To illustrate how an analyst could use this data, take the domain antivirus.safetynote[.]xyz as an example. This domain was observed attempting to phish users and was placed on the RiskIQ blacklist.

PassiveTotal provides us with a bunch of infrastructure we could explore, but let’s see what a simple query to host pairs could give us. Running a query for all children of antivirus.safetynote.xyz reveals over 30 different domains:

  • boxfoo.com (topLevelRedirect)
  • live.om (topLevelRedirect)
  • track.trk15.info (topLevelRedirect)
  • tumblr.om (topLevelRedirect)
  • wsj.om (topLevelRedirect)
  • adobe.om (topLevelRedirect)
  • trkur4.com (topLevelRedirect)
  • whatsapp.om (topLevelRedirect)
  • www.facebook.com (img.src)
  • ign.om (topLevelRedirect)
  • lpcb9.voluumtrk.com (topLevelRedirect)
  • trkur3.com (topLevelRedirect)
  • areasnap.com (iframe.src)
  • trkur.com (topLevelRedirect)
  • www.giantnet.info (topLevelRedirect)
  • wegotmedia.com (topLevelRedirect)
  • united.om (topLevelRedirect)
  • blogspot.om (topLevelRedirect)
  • delta.om (topLevelRedirect)
  • tmall.om (topLevelRedirect)
  • oron.com (topLevelRedirect)
  • pof.om (topLevelRedirect)
  • trackssummit.info (topLevelRedirect)
  • pogo.om (topLevelRedirect)
  • realtor.om (topLevelRedirect)
  • hadie.persianfbpages.com (topLevelRedirect)
  • rediff.om (topLevelRedirect)
  • spotify.om (topLevelRedirect)
  • sohu.om (topLevelRedirect)
  • toysrus.om (topLevelRedirect)
  • dropbox.om (topLevelRedirect)
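A listing like this is easier to triage once it is tallied by cause and checked for look-alike names. The sketch below uses a few of the children above; the function names are made up for this example, and the .om check is only a rough heuristic (.om is the country-code TLD for Oman, so a .om host is not malicious by definition).

```python
from collections import Counter

# A few of the children listed above, as (host, cause) tuples.
children = [
    ("boxfoo.com", "topLevelRedirect"),
    ("live.om", "topLevelRedirect"),
    ("wsj.om", "topLevelRedirect"),
    ("www.facebook.com", "img.src"),
    ("areasnap.com", "iframe.src"),
    ("dropbox.om", "topLevelRedirect"),
]


def count_causes(rows):
    """Tally how many children were reached through each cause."""
    return Counter(cause for _, cause in rows)


def om_lookalikes(rows):
    """Map each .om host to the .com name it resembles (heuristic only)."""
    suffix = ".om"
    return {
        host: host[: -len(suffix)] + ".com"
        for host, _ in rows
        if host.endswith(suffix)
    }


print(count_causes(children))
for host, resembles in om_lookalikes(children).items():
    print(f"{host} looks like {resembles}")
```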

Without knowing much about these domains, it's clear that most appear to be typo-squatted versions of legitimate brands. In parentheses, we can also see that the connection between our original query and its children was mostly through top-level redirects. From here, we can pick one of these domains, say wsj.om, and look at all the parent domains.

  • antivirus.safetynote.xyz => wsj.om (topLevelRedirect)
  • best.freegiveaways-cm.xyz => wsj.om (topLevelRedirect)
  • de.yahoo.com => wsj.om (topLevelRedirect)
  • loto.com-selected-winners.club => wsj.om (topLevelRedirect)
  • srv2trking.com => wsj.om (topLevelRedirect)
  • www.billigsucher.com => wsj.om (topLevelRedirect)
  • jmozz.reward-zone.0315.pics => wsj.om (topLevelRedirect)
  • mobilex-club.xyz => wsj.om (topLevelRedirect)
  • www.aeriagames.com => wsj.om (topLevelRedirect)
  • d.adsmatic.com => wsj.om (topLevelRedirect)
  • www.retesicura.vodafone.it => wsj.om (topLevelRedirect)
  • av.securealert.xyz => wsj.om (topLevelRedirect)
  • google.com-members.online => wsj.om (topLevelRedirect)
  • 6g0zz.reward-zone.0380.pics => wsj.om (topLevelRedirect)
  • www.liveadexchanger.com => wsj.om (topLevelRedirect)
  • mosaic5.com => wsj.om (topLevelRedirect)
  • nkqgy.voluumtrk.com => wsj.om (topLevelRedirect)
  • offer.freegiveaways-cm.xyz => wsj.om (topLevelRedirect)
  • www.askgamblers.com => wsj.om (topLevelRedirect)
  • tfezz.prize-o-rama.0203.pics => wsj.om (topLevelRedirect)
  • profittradingbot.com => wsj.om (topLevelRedirect)
  • www.phantomproject.co.uk => wsj.om (topLevelRedirect)
  • offer.safertrk.com => wsj.om (topLevelRedirect)
  • winmseole.xyz => wsj.om (topLevelRedirect)
  • longtube.net => wsj.om (topLevelRedirect)

Running some of the other .xyz domains in PassiveTotal reveals that they too are blacklisted and were used as phishing sites. From here, we could continue exploring other domains' parent and child relationships to build a larger graph and find more suspect items. In this particular example, performing those queries eventually leads to a bunch of shady ad providers and other blacklisted infrastructure. It’s worth noting that without host pairs, most of these sites would have remained undiscovered since there was no immediate infrastructure connection.
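The repeated pivoting described here amounts to a breadth-first walk over host pair relationships. The following is a minimal sketch of that idea, not PassiveTotal's implementation: the expand function is hypothetical, and the neighbors callable stands in for whatever lookup returns a host's parents and children. A depth limit keeps the walk from wandering into unrelated ad infrastructure.

```python
from collections import deque


def expand(seed, neighbors, max_depth=2):
    """Breadth-first walk over host pair relationships.

    `neighbors(host)` is a caller-supplied (hypothetical) function returning
    the hosts seen as parents or children of `host`.
    Returns a dict mapping each discovered host to its distance from the seed.
    """
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        host = queue.popleft()
        depth = seen[host]
        if depth >= max_depth:
            continue  # do not pivot any further from this host
        for other in neighbors(host):
            if other not in seen:
                seen[other] = depth + 1
                queue.append(other)
    return seen


# Toy (parent, child) edges taken from the listings in this post.
EDGES = [
    ("antivirus.safetynote.xyz", "wsj.om"),
    ("antivirus.safetynote.xyz", "trkur.com"),
    ("av.securealert.xyz", "wsj.om"),
    ("offer.safertrk.com", "wsj.om"),
]


def toy_neighbors(host):
    """Return children and parents of `host` from the toy edge list."""
    linked = [child for parent, child in EDGES if parent == host]
    linked += [parent for parent, child in EDGES if child == host]
    return linked


found = expand("antivirus.safetynote.xyz", toy_neighbors, max_depth=2)
for host, depth in found.items():
    print(depth, host)
```

With real data, each pivot multiplies the number of hosts, so a small max_depth and a review of each layer before going deeper is usually the safer choice.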

Directly in the Web

Starting today, users can access the host pair data directly from the PassiveTotal web interface. When available, parent and child hosts will be shown in a tab, sorted by the last time they were observed in a crawl. Each listing will also include the cause of the relationship, which users can use to help prioritize their investigations.

Graph Views

In the example above, we quickly went from one domain to over 50 based on just two host pair relationships. As you could imagine, showing that data in a text-based format can quickly become unwieldy. Fortunately, PassiveTotal has transforms for Maltego, and we have added the host pairs data to them.

Sticking with the same example above, we can explore this data at scale without having to worry about lists of domains coming back. Instead, we see several hub-spoke connections detailing the relationship and the context as to where it was observed.

What makes Maltego nice is that we can then use the Get Enrichment transform to identify any entities with tags suggesting they are malicious. Additionally, this helps cluster some of the activity by tag value, which makes it easier to identify exactly what we may want to block.

Similar to the text-based data, graphs too can become large, so we recommend going slowly and keeping your work saved to avoid adding too many connections that would clutter up your work.

Surface Tagged Hosts

If you'd like to start playing around with host pairs data in your own application, you can access it directly using our API. An example script can be found in our Python library examples folder and highlights the use case of showing domains from a host pair relationship that have at least one tag associated with them.

This script is simple, but quickly lets you perform some of the same work we did in this post without having to leave your terminal. Using the PassiveTotal libraries, it’s easy to generate your own logic or use cases.
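The library's example script is not reproduced here. As a rough illustration of the same use case, the sketch below filters host pairs down to hosts that carry at least one tag. Everything in it is an assumption made for this example: the surface_tagged_hosts function, the "parent" and "child" dictionary keys, and the get_tags callable, which in practice would wrap a tag or enrichment lookup. The sample tags are invented for the demo and are not real PassiveTotal results.

```python
def surface_tagged_hosts(pairs, get_tags):
    """Return {host: sorted tags} for every parent or child with at least one tag.

    pairs:    iterable of dicts with "parent" and "child" keys (assumed layout)
    get_tags: callable mapping a host to a list of tag strings (hypothetical)
    """
    tagged = {}
    checked = set()
    for pair in pairs:
        for host in (pair["parent"], pair["child"]):
            if host in checked:
                continue  # look each host up only once
            checked.add(host)
            tags = get_tags(host)
            if tags:
                tagged[host] = sorted(tags)
    return tagged


# Sample pairs from this post; the tags below are made up for the demo.
sample_pairs = [
    {"parent": "antivirus.safetynote.xyz", "child": "wsj.om"},
    {"parent": "av.securealert.xyz", "child": "wsj.om"},
    {"parent": "de.yahoo.com", "child": "wsj.om"},
]
SAMPLE_TAGS = {
    "antivirus.safetynote.xyz": ["phishing", "blacklist"],
    "av.securealert.xyz": ["blacklist"],
}

result = surface_tagged_hosts(sample_pairs, lambda host: SAMPLE_TAGS.get(host, []))
for host, tags in result.items():
    print(host, ", ".join(tags))
```

Passing the lookup in as a callable keeps the filtering logic separate from the API client, so the same function works against live queries or saved results.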

Future of Host Pairs

The addition of host pairs inside of the PassiveTotal platform is a massive step in improving infrastructure connections and providing detailed context to queries. Having full session details, cookies, web page content and more has allowed us to move beyond the traditional static datasets and offer insight into more dynamic processes.

Over time, we’d like to begin merging more of the RiskIQ blacklist/threat data directly into this data source so users could begin filtering out host pairs based on known malicious ones instead of seeing them all. For now, we would love to hear any feedback from you no matter how you use the data!
