Unfortunately, we discovered that we will not be able to use TCP timestamps. Aaron ran some preliminary measurements in a controlled test environment with a few back-end instances, a load balancer, and a client. After graphing the results, we found a discrepancy between the clock skew calculated on the back-end instances and the clock skew calculated from the client. While each back-end instance had its own unique clock skew, the skews measured from our client all fell within a single range. Amazon’s ELB (Elastic Load Balancer) terminates TCP connections and relays traffic between the client and the back-end servers over separate connections when it is configured for HTTP/HTTPS or for TCP with SSL. Only for pure TCP (without SSL) does the ELB leave the header unmodified.
On another track, while writing the script to diff (the Linux command-line tool for comparing files) websites that use Amazon’s ELB, I discovered that Reddit uses sticky session cookies with the ELB. During small-scale QA testing of the scripts, I noticed that we consistently received the same instance identifier for a Reddit page over time, unlike Netflix’s server ID, which changes on every request. Examining the Reddit page with Chrome’s developer tools, I found the AWSELB and JSESSIONID cookies, which indicate a sticky session. This is another possible path for exploration that we hope to look into.
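The cookie check I did by hand in the developer tools could also be scripted. Below is a minimal sketch, assuming the Set-Cookie header values of a response have already been collected into a list; the function name and the input format are hypothetical, and the cookie names are simply the two observed on the Reddit page.

```python
# Cookie names observed on the Reddit page; treat this set as an assumption
# to be extended, not as a complete list of sticky-session markers.
STICKY_COOKIE_NAMES = {"AWSELB", "JSESSIONID"}


def sticky_cookies(set_cookie_headers):
    """Return the sorted sticky-session cookie names found in the headers.

    set_cookie_headers: list of raw Set-Cookie header values,
    e.g. "AWSELB=abc123; PATH=/".
    """
    # The cookie name is everything before the first "=" in each header value.
    names = {header.split("=", 1)[0].strip() for header in set_cookie_headers}
    return sorted(names & STICKY_COOKIE_NAMES)
```

An empty result does not prove the absence of sticky sessions, since a site may configure application-controlled stickiness with a cookie name of its own.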
To find alternative techniques, we searched for papers that cite Bellovin’s Counting NATs paper. We found and read a few promising papers (see the bottom of this post) that use TCP timestamps to identify active hosts behind a NAT device. The basic idea behind this technique is that each machine has a unique clock skew, calculated as a function of the TCP timestamp, the system time, and the clock frequency (and a few other variables, of course). The papers that use this technique, or slight variations of it, reported relative success in counting hosts.
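To make the idea concrete, here is a rough sketch of a skew estimate. It is not the method from the cited papers: an ordinary least-squares fit is swapped in for simplicity, and the papers may use more robust fitting. The function name, the sample format, and the timestamp frequency passed in as hz are assumptions, and wraparound of the 32-bit timestamp value is ignored.

```python
def estimate_skew_ppm(samples, hz):
    """Estimate a remote clock's skew in parts per million.

    samples: list of (recv_time_seconds, tsval_ticks) pairs, where
             recv_time_seconds is our local receive time and tsval_ticks is
             the TCP timestamp value carried by the packet.
    hz:      assumed frequency of the remote timestamp clock (ticks/second).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    t0, ts0 = samples[0]
    # x: local time elapsed since the first packet.
    xs = [t - t0 for t, _ in samples]
    # y: remote elapsed time minus local elapsed time (the observed offset).
    ys = [(ts - ts0) / hz - x for (_, ts), x in zip(samples, xs)]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must span a non-zero time interval")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The slope of offset against local time is the skew; scale to ppm.
    return sxy / sxx * 1e6
```

Grouping packets by their estimated skew is what would let an observer tell hosts apart, provided the timestamps it sees really come from those hosts.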
We decided we would need to run a few controlled experiments to test whether this technique applies to our setting. One potential concern is that the load balancer may overwrite the TCP timestamp in packets from the server, so that our client would only be able to measure the clock skew of the load balancer and would therefore be unable to identify the back-end hosts.
In addition to the controlled experiments, I will also write a script to fetch and diff pages from sites known to be using load balancing (identified in a previous paper).
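A first cut of such a script might look like the sketch below. It is only an outline under assumptions: the function names are hypothetical, pages are fetched with Python's standard urllib, and the comparison uses difflib rather than the diff command itself.

```python
import difflib
import urllib.request


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def changed_lines(first, second):
    """Return the lines that differ between two fetches of the same page.

    Removed lines are prefixed with "-" and added lines with "+".
    """
    diff = list(difflib.unified_diff(
        first.splitlines(), second.splitlines(), lineterm="", n=0))
    # The first two entries are the "---" / "+++" file headers; skip them,
    # then keep only the added and removed lines (not the "@@" hunk markers).
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Fetching the same URL twice and passing both bodies to changed_lines would surface any per-request values, such as a server identifier embedded in the page, although dynamic content like timestamps will show up as well.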
1 – Approximating the Number of Active Nodes Behind a NAT Device
2 – IP Agnostic Real-Time Traffic Filtering and Host Identification Using TCP Time stamps
3 – Remote Physical Device Fingerprinting
Last week, after pulling the data set from Netflix, I wrote a Python script to aggregate the data and generate some statistics. From the results I was able to get a rough approximation of how many front-end servers Netflix has in the EC2 regions us-east-1 and us-west-2, as well as a general idea of the distribution of the load across the servers. The distribution was fairly even across all servers, but it is interesting to speculate about what sort of conditions would cause uneven distributions. That, however, would require capturing real-time Netflix packets, which is outside the scope of our current project. One interesting trend I observed is that us-west-2 has fewer servers than us-east-1. We discussed this finding and theorized that it could be related to the relative size of each region versus its population density, or that Netflix may have started out with servers in the us-east region, with those in us-west being more recent additions.
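The aggregation step is simple enough to outline. The sketch below is not the actual script: the record format, a list of (region, server_id) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def summarize(observations):
    """Summarize how responses are spread across servers in each region.

    observations: iterable of (region, server_id) pairs, one per response.
    Returns a dict mapping each region to its distinct server count, total
    number of responses, and the largest and smallest share of responses
    handled by a single server.
    """
    per_region = defaultdict(Counter)
    for region, server_id in observations:
        per_region[region][server_id] += 1

    summary = {}
    for region, counts in per_region.items():
        total = sum(counts.values())
        summary[region] = {
            "servers": len(counts),
            "requests": total,
            "max_share": max(counts.values()) / total,
            "min_share": min(counts.values()) / total,
        }
    return summary
```

The distinct server count is only a lower bound, since servers that never answered during the sampling window are not counted. A max_share close to min_share corresponds to the even distribution described above.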
Presently, I am focusing on gathering more data to increase the accuracy of our results. We continue to research alternative methods of identifying these servers.