Tor is by far the most widely used internet censorship circumvention solution. The Tor browser relies on a unique obfuscation technology known as “Meek” to promote users’ privacy. Meek obfuscates Tor network traffic so that it seems like ordinary forms of internet traffic. However, hidden Markov models can be used to identify Tor network traffic. A recently published research paper presents a novel approach for detection of Meek obfuscated Tor traffic which relies on a mixture of Gaussians based Hidden Markov Model (MGHMM). Experiments conducted by the authors of the paper prove that the proposed model is highly effective in the identification of Tor network traffic. Throughout this article, we will take a look at this novel Tor traffic analysis model.
The adversary model and capturing of Tor traffic:
The proposed approach relies on a relatively weak threat model via which an adversary can access network traffic traveling between the user and the entry Tor relay node. The adversary does not have to generate, delay, delete, or modify network traffic, which renders the attack highly stealthy. Apart from the adversary’s own Tor relay node, no other node across the network is compromised, which renders the attack easy to launch successfully on a live network setting.
Traffic analysis begins with the interception of network traffic data packets. Only the targeted user and the gateway have the capability of capturing data packets travelling between them. Practically speaking, multiple entities can access network traffic data packets. The LAN’s administrator can access all traffic flowing across all of their network’s endpoints. The ISP can monitor all traffic of its subscribers. After obtaining consent from the ISP, law enforcement agencies can monitor and record all network traffic data packets of any internet user. Moreover, a malicious adversary on the LAN can launch a man in the middle (MITM) attack between the target user and the gateway and route all traffic to their server. The MITM attack can be launched easily via means of ARP spoofing or poisoning.
Assuming an adversary can intercept the Tor network’s data packets via one of the previously mentioned means, the process of capturing of Tor traffic passes through three stages as illustrated in figure (1). Initially, traffic’s data packets are dumped in a file via a simple application such as tcpdump. Thereafter, the data packets are filtered to keep data packets related to Tor network traffic.
Figure (1): Stages of capturing of Tor’s traffic
The traffic is then grouped into streams. As the list of Tor relay nodes is publically available, it is used to filter Tor network traffic out from the remainder of captured traffic. The list of Tor relay nodes can be obtained from Tor’s status files, especially the cached consensus file, which can be manually downloaded via one of the aforementioned authorities; they can also be directly accessed via the local folder of the Tor status. As the adversary is also a Tor user, they will use their very own status file to obtain a list of all Tor relay nodes.
Thereafter, the packets have to be classified into streams at the level of Transmission Control Protocol (TCP). A stream is established via means of a three way TCP handshake (SYN, SYN-ACK, and ACK). Then, the stream is tracked via means of the ports, IP addresses, Sequence/Acknowledgement numbers, and TCP flag bits. Following termination of a TCP connection, the stream is closed.
Circuit construction and analysis of Tor traffic:
Figure (2) illustrates the sequence of data packets required to create a typical circuit composed of three hops. Since the adversary will be only eavesdropping between the first hop and the target Tor user, they will be monitoring only the six data packets shown on the first column of figure (2). Moreover, all the data packets that they monitor are encrypted with a fixed size of 512 bytes, which renders the task of identifying them extremely difficult.
Figure (2): The process of construction of Tor circuit
The only piece of information that the adversary can exploit is the IPT, which refers to the delay period between two successive data packets. The IPTs of the circuit construction’s six data packets exhibit a unique pattern. Actually, the IPT values are based on two elements: the roundtrip time (RTT) of given data packets within the network and the Tor client’s processing time. More precisely, the following can be monitored:
– The IPT between the initial (CREATE) and the following (CREATED) data packets equals the time required to receive CREATED cell following successful sending of CREATE cell.
– The IPT between the 2nd (CREATED) and the 3rd (EXTEND) data packets equals the time required to process the CREATED cell and then create and send the EXTEND cell.
– The IPT between the 3rd (EXTEND) and the 4th (EXTENDED) data packets equals the time required for the EXTEND data packet to arrive at the first hop, in addition to the time required to extend the Tor circuit by adding the 2nd hop and the time required to receive the EXTENDED data packet. This must be proportionately longer than the IPT between the sequence’s 1st and 2nd data packets.
– The IPT between the 4th (EXTENDED) and the 5th (EXTEND) data packets equals the time required to compute the EXTENDED cell and to create and then send the EXTEND cell data packet.
– The final IPT between the 5th (EXTEND) and the 6th (EXTENDED) data packets equals the sum of the following times:
- The time required for the EXTEND data packet to arrive at the first hop.
- The time required for the first hop to compute the received EXTEND data packet, create its very own EXTEND data packet, and then send it.
- The time required for the first hop’s EXTEND data packet to arrive at the second hop.
- The time required for the second hop to compute the received packet, create a CREATE cell data packet, and then send it over to the third hop.
- The time required for the third hop to compute the CREATE cell data packet and so on.
The Mixture of Gaussians based Hidden Markov Model (MGHMM):
The Hidden Markov Model (HMM) is a special system that can be defined at any given time as being one of a group N of states. The system regularly changes its state according to a group of probabilities linked to the state. Every system’s state always creates a single observation out of a group of M observations. As such, if an attacker observes the system for a number of time units n, he will make a number of observations n.
The sequence of exchanged data packets between the Tor user and the first hop, in order to form a three hop Tor circuit, can be represented by an HMM. The group of HMM’s states can be represented by the group of six steps needed to construct a Tor circuit. Because the space of potential IPT values or observations is typically continuous, the circuit construction process is represented by a Continuous HMM throughout which every system state reflects a continuous probability distribution.
The proposed MGHMM is comprised of two main elements:
1- Mixture of Gaussians (MOG) which is used to define the density distribution of Packet Size (PS) and the distribution of Inter-Packet Time (IPT).
2- A special HMM which is used to process the probability of a sequence of Tor network traffic observation and to detect Meek based Tor network traffic via means of two dimensional traffic observations composed by PS and IPT.
The proposed model was tested in the wild, i.e. on real world network traffic. Experiments prove that the proposed MGHMM is capable of identifying Meek based Tor traffic with high efficiency.
MGHMM can identify Meek based Tor traffic even with an adversary with who can only detect sequences of Tor circuit construction. This information is essential when considering the privacy of Tor users, especially in countries where internet censorship is exercised. More research is needed to identify if the proposed method can further classify the types of network traffic transmitted via the Tor network.