Web spiders, or crawlers, and search engines have contributed greatly to the shape of the internet we use today. The year 1993 witnessed the birth of Wandex, the first web-spider-based search engine, which at the time indexed just 130 websites. In the following years, many search engines emerged, including AltaVista, Lycos, Excite, and Google. The success and ease of use of these search engines greatly fueled the massive growth of the world wide web during the mid-1990s.
Nowadays, a large percentage of web pages feature dynamic content, which is difficult for web spiders to crawl. Moreover, a significant portion of the internet is invisible to web spiders, including websites that prevent crawling via a robots.txt file and web pages that can only be accessed via password authentication. Like the surface web, the Tor network also includes a significant invisible portion (password-protected or dynamic content). Because onion domains are non-mnemonic, invisible content represents a significant proportion of web pages on the Tor network.
However, the Tor network also includes websites that actively seek to be visible. Even operators of illegal Tor marketplaces, which facilitate the trading of drugs, stolen data, and weapons, want their websites to be visible in order to attract a large number of users. As such, it makes sense to develop an efficient web spider and search engine for the visible parts of the Tor network, especially as recent research studies show that a significant portion of the Tor network is actually visible and well connected via hyperlinks, which makes crawling via web spiders possible.
A recently published research study presents a novel dark web spider for crawling and classifying content on the Tor network. This onion spider is open source and addresses challenges that do not arise when indexing surface web pages. Throughout this article, we will take a look at this novel onion web spider.
Structure of the onion spider:
The basic framework of the onion spider is shown in figure (1). The main goal of this spider is to harvest and identify a significant portion of the Tor network by recursively examining any hyperlinks it detects in the downloaded pages. The spider differentiates between a Tor hidden service, i.e. a given .onion domain address, and a path, which refers to the URL of a website hosted on a given Tor hidden service. The spider stores the downloaded content, the harvested URIs, and the link structure, in addition to special forms of metadata including data types and timestamps of the crawled content. The onion spider was seeded with a group of 20,000 Tor hidden service onion addresses obtained from previous research studies, which yielded a high initial crawling speed. Out of the seeded 20,000 Tor hidden services, only around 1,500 were still accessible during the testing period.
Figure (1): The framework of the onion spider
The onion spider uses the following five modules:
– Conductor: Manages the download, interpretation, and recording of content via controlling other components.
– Network module: Downloads content and manages issues and exceptions associated with the Tor network.
– Parser: Parses the downloaded content and harvests URIs which will be recursively used to download further darknet content.
– Database: Records the downloaded page content in addition to the network’s structure and URIs.
– Tor proxy: Enables the spider to access the Tor network.
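To illustrate how these modules fit together, here is a minimal sketch of a conductor-style crawl loop in Node.js. All module internals (the `network`, `parser`, and `database` objects) are illustrative stubs of our own, not the spider's actual implementation.

```javascript
// Stub network module: "downloads" a page for a given task.
// A real deployment would route the request through the Tor proxy.
const network = {
  download(task) {
    return { task, status: 200, body: '<a href="http://exampleabcdefg22.onion/">link</a>' };
  },
};

// Stub parser: harvests v2-style (16-character base32) .onion domains.
const parser = {
  extractUris(body) {
    return body.match(/[a-z2-7]{16}\.onion/g) || [];
  },
};

// Stub database: records crawled pages and their outgoing URIs.
const database = {
  pages: [],
  save(result, uris) {
    this.pages.push({ id: result.task.id, uris });
  },
};

// Conductor loop: download -> parse -> store, then re-enqueue any
// newly discovered hidden services (the recursive crawl).
function conduct(seedTasks) {
  const queue = [...seedTasks];
  const seen = new Set(seedTasks.map((t) => t.domain));
  while (queue.length > 0) {
    const task = queue.shift();
    const result = network.download(task);
    if (result.status >= 400) continue; // skip HTTP errors
    const uris = parser.extractUris(result.body);
    database.save(result, uris);
    for (const domain of uris) {
      if (!seen.has(domain)) {
        seen.add(domain);
        queue.push({ id: seen.size, domain });
      }
    }
  }
}
```

The `seen` set is what turns an infinite link graph into a terminating crawl: each hidden service is enqueued at most once, no matter how many pages link to it.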
The onion spider is published under the MIT license so that it can be further improved by other researchers. It deploys a full data analysis chain, from harvesting to data preprocessing and categorization. The data collected not only includes the content of the Tor hidden services but also the link structure, network topology information, status information, and can be customized to track Tor network changes over time.
The network module handles any network-associated events throughout the crawling procedure. It takes a download task as input; a download task is composed of a primary URL, a path, and a unique ID. After acquiring a download task, a request is routed to the Tor proxy. When the network module receives the initial response data, it begins monitoring and filtering the crawled content in order to identify illegal media. If the crawled content passes the filtration process, i.e. is identified as legal, and the HTTP status code is not associated with any errors, the download continues and the content is added to the result object. Thereafter, the result object is routed to the conductor for further processing.
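The flow above can be sketched as follows. The task shape, the media filter, and the error handling are simplified assumptions of ours, not the spider's actual code; in particular, a real filter would inspect the content itself rather than just the `Content-Type` header.

```javascript
// A download task: a primary URL (the hidden service), a path, and a
// unique ID.
function makeTask(id, domain, path) {
  return { id, domain, path, url: `http://${domain}${path}` };
}

// Placeholder media filter (assumption): accept only HTML responses.
// A real deployment would detect and reject illegal media here.
function passesFilter(response) {
  return (response.headers["content-type"] || "").startsWith("text/html");
}

// Turn an initial response into a result object for the conductor, or
// reject it on HTTP errors / filtered content.
function handleResponse(task, response) {
  if (response.status >= 400) {
    return { task, ok: false, reason: `HTTP ${response.status}` };
  }
  if (!passesFilter(response)) {
    return { task, ok: false, reason: "filtered" };
  }
  return { task, ok: true, content: response.body };
}
```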
Extraction of URIs:
After downloading the content of a Tor hidden service, the URIs in the downloaded page must be extracted in order to identify where it links to. The spider's initial approach used regular expressions to match .onion URIs within a string. However, URIs can be extremely complex, which can render a regular expression filter ineffective. Moreover, some pathological instances were found to require several seconds to extract URIs via regular expressions.
As some URIs have to be extracted quickly enough to keep feeding the spider new URIs, a faster method was needed: instead of scanning the whole string, links were extracted from the tags of the HTML document via cheerio, an HTML parsing library. To guarantee that only links to Tor hidden services were collected, simple regular expressions were used to avoid following clearnet (surface web) links. This approach was found to be 10-100 times faster than the regular-expression-based implementation operating on the full input string. Nevertheless, it has a downside: it only harvests links that appear inside a tag within the HTML string. Because some pages return irrelevant textual content or do not properly link to other web pages, another component was added that executes the slower regular expression technique. This hybrid process made it possible to keep the spider actively running at all times by feeding it new URIs from the cheerio-based HTML extractor, while still obtaining the complete extraction results of the regular-expression-based extractor.
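The two strategies can be sketched as below. The regex pattern is our own assumption, covering v2 (16-character) and v3 (56-character) base32 onion domains, and the tag-based variant mimics the cheerio approach with a plain regex over `<a>` tags so the example needs no external dependency.

```javascript
// Matches v3 (56-char) or v2 (16-char) base32 onion domains.
const ONION_RE = /\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b/g;

// Thorough but slower: scan the entire string for .onion URIs.
function extractByRegex(text) {
  return [...new Set(text.match(ONION_RE) || [])];
}

// Faster but narrower: only inspect href attributes of <a> tags, so
// onion addresses in plain text are missed.
function extractFromAnchors(html) {
  const hrefs = [...html.matchAll(/<a\s[^>]*href="([^"]+)"/g)].map((m) => m[1]);
  return [...new Set(hrefs.flatMap((h) => h.match(ONION_RE) || []))];
}
```

Running both over the same page shows the trade-off: the anchor-based extractor misses an onion address mentioned in plain text that the full-string regex catches, which is exactly why the hybrid design keeps both.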
Results of testing the onion spider:
Figure (2) illustrates a Venn diagram that depicts the difference between two onion spider runs which were a week apart. The first run produced 10,957 online Tor hidden services, while the second one produced 9,127. The union of the two runs included 11,743 Tor hidden services, while the intersection included 8,341.
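These figures are mutually consistent, which can be verified with the inclusion-exclusion principle:

```javascript
// Sanity check of the Venn-diagram figures via inclusion-exclusion:
// |A ∪ B| = |A| + |B| - |A ∩ B|.
const firstRun = 10957;    // online hidden services, first run
const secondRun = 9127;    // online hidden services, second run
const intersection = 8341; // seen in both runs
const union = firstRun + secondRun - intersection; // = 11743, as reported
```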
Figure (2): Difference between two runs of the onion spider, a week apart
Throughout the crawling process, 34,714 Tor hidden service addresses were identified, of which around 10,000 were responsive to HTTP(S) requests. Of the identified Tor hidden services, around 7,500 returned relevant content, i.e. neither an empty HTML page nor an HTTP error status code. Collectively, 7,355,773 paths and 67,296,302 links were harvested. Of those paths, 1,561,590 were downloaded. The remaining roughly 5.8 million paths were not downloaded because they were identified as black hole paths, i.e. very unlikely to include further URIs leading to unknown hidden services.
The onion spider harvested data over a testing period of a single week. According to the Tor Project's metrics, the mean total number of visible and invisible Tor hidden services during this period was 70,640. Having identified 34,714 Tor hidden services, one could assume that around 50% of Tor hidden services are visible. Nevertheless, this only represents an upper bound, as only approximately 33% of these were responsive via ports 80 and 443 over HTTP(S): either a non-responsive Tor hidden service was really not accessible anymore and a stale hyperlink was erroneously followed, or the Tor hidden service is configured to ignore the onion spider. As 10,975 visible Tor hidden services were responsive, this indicates that around 15% of all Tor hidden services are visible. There are many reasons why a percentage of these Tor hidden services were not accessible: they possibly accept connections on other ports or via different network protocols, or they might filter out connection requests by other means.
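The two visibility estimates are simple ratios against the Tor Project's reported mean for that week:

```javascript
// Upper and lower bounds on Tor hidden service visibility, derived
// from the figures reported above.
const totalHiddenServices = 70640; // Tor Project mean for the test week
const identified = 34714;          // addresses discovered by the spider
const responsive = 10975;          // of those, responsive over HTTP(S)

const upperBound = identified / totalHiddenServices; // ~0.49: at most ~50% visible
const lowerBound = responsive / totalHiddenServices; // ~0.155: ~15% confirmed visible
```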