Week 10 – New Techniques

To find alternative techniques, we searched for papers that cite Bellovin’s Counting NATs paper. We found and read a few promising papers (see the bottom of this post) that use TCP timestamps to identify active hosts behind a NAT device. The basic idea behind this technique is that each machine has a unique clock skew, which can be calculated as a function of the TCP timestamp, the system time, and the clock frequency (and a few other variables, of course). The papers that use this technique, or slight variations of it, reported relative success in counting hosts.
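As a rough illustration of the clock skew idea, here is a minimal Python sketch. It assumes we already have (local receive time, remote TCP timestamp value) pairs for one host and that we know or can guess the frequency of the remote timestamp clock. The function name and parameters are hypothetical, and it uses a plain least-squares slope, which is a simplification of the fitting methods used in the papers.

```python
def estimate_skew_ppm(samples, hz):
    """Estimate a remote clock's skew in parts per million (ppm).

    samples: list of (local_time_seconds, tsval) pairs from one host,
             in arrival order (assumed input format).
    hz:      assumed frequency of the remote TCP timestamp clock.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t0, ts0 = samples[0]
    # x: elapsed local time since the first sample
    xs = [t - t0 for t, _ in samples]
    # y: how far the remote clock has drifted from the local clock;
    # the modulo handles wraparound of the 32-bit timestamp field
    ys = [((ts - ts0) % 2**32) / hz - x for (_, ts), x in zip(samples, xs)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span a nonzero time interval")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The slope of drift versus elapsed time is the skew
    return cov / var_x * 1e6
```

If responses from the same address fall into clearly separate skew values, that would hint at more than one machine behind it, provided the timestamps are not rewritten along the way.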

We decided we would need to run a few controlled experiments to see the applicability of this technique. One potential concern is that the load-balancer may overwrite the TCP timestamp in packets from the server such that our client will only be able to measure the clock skew of the load balancer and would therefore be unable to identify the back-end hosts.

In addition to the controlled experiments, I will also write a script to fetch and diff pages from sites known to be using load balancing (identified in a previous paper).
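A fetch-and-diff script could look roughly like the sketch below. This is only an illustration using Python's standard library; the function names are hypothetical and no specific site is assumed.

```python
import difflib
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body of url as text (undecodable bytes are replaced)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def diff_pages(a, b):
    """Return only the lines that differ between two page bodies."""
    lines = list(difflib.unified_diff(a.splitlines(), b.splitlines(),
                                      lineterm="", n=0))
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then keep removed ('-') and added ('+') lines, dropping '@@' markers.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]
```

Calling `diff_pages(fetch(url), fetch(url))` twice in a row on the same URL would then show whatever changes between responses, such as a server name embedded in the page.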

 

 

1 – Approximating the Number of Active Nodes Behind a NAT Device
2 – IP Agnostic Real-Time Traffic Filtering and Host Identification Using TCP Time stamps
3 – Remote Physical Device Fingerprinting

Week 9 – Generating Statistics

Last week, after having pulled the data set from Netflix, I wrote a Python script to aggregate the data and generate some statistics. From the results I was able to get a rough approximation of how many front-end servers for Netflix exist in EC2 regions us-east-1 and us-west-2, as well as a general idea of the distribution of the load across the servers. The distribution was pretty even across all servers, but it is interesting to speculate what sort of conditions would cause uneven distributions. That, however, would require capturing real-time Netflix packets, which is outside the scope of our current project. One interesting trend I was able to observe is that us-west-2 has fewer servers than us-east-1. We discussed this finding and theorized it could be related to the relative size of each region versus its population density, or perhaps that Netflix started out with servers in the us-east region and those in us-west are more recent additions.
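To give a sense of the aggregation step, here is a minimal sketch (not the actual script). It assumes each polled page has already been reduced to a (region, server id) pair; the function name and the record format are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Count distinct servers and load share per region.

    records: iterable of (region, server_id) pairs, one per polled page
             (assumed input format).
    """
    per_region = defaultdict(Counter)
    for region, server_id in records:
        per_region[region][server_id] += 1

    summary = {}
    for region, counts in per_region.items():
        total = sum(counts.values())
        summary[region] = {
            "servers": len(counts),   # distinct server ids seen
            "requests": total,        # pages polled in this region
            "share": {sid: n / total for sid, n in counts.items()},
        }
    return summary
```

A roughly even load shows up as similar values in each region's `share` dictionary.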

Presently, I am focusing on gathering more data to increase the accuracy of our results. We continue to research alternative methods of identifying these servers.

Week 8 – Polling Netflix

The previous week I wrote a Python script that successfully pulled the server id from the HTML of Netflix’s landing page. I ran the script in a loop to build up a data set from which I was able to identify 4-5 distinct servers, although I did not observe any easily identifiable patterns by naive examination. Next, I decided to set up VMs in different regions of EC2 and have them poll the Netflix site in order to build up a larger data set. While I was able to successfully pull data with VMs set up in several US regions, I ran into a few errors when attempting to poll Netflix from EC2 regions outside the US. I am still looking into the issue to see what is happening differently outside of the country.
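For illustration, extracting the server id from a page body might look something like the sketch below. The footer format matched here (a region name followed by an instance-style id) is purely an assumption made for the example; the real markup is not reproduced in this post, so the pattern would need to be adapted.

```python
import re

# Hypothetical footer format, e.g. "us-east-1 i-0abc123f".
# This is an assumed pattern for the example, not the real page markup.
SERVER_RE = re.compile(
    r"(?P<region>[a-z]{2}-[a-z]+-\d)\s+(?P<server>i-[0-9a-f]+)"
)


def extract_server(html, pattern=SERVER_RE):
    """Return (region, server_id) from a page body, or None if not found."""
    match = pattern.search(html)
    if match is None:
        return None
    return match.group("region"), match.group("server")
```

Running this over every page pulled in the loop gives one (region, server id) record per request.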

On a side note, I also attempted to see if any other software services advertised on AWS as using EC2 happen to reveal load balancing information or server ids, but so far only Netflix appears to display this information.

This week I will continue to pull data as well as examine the intermediate data sets I have collected.

Week 7 – Investigating Netflix

As we continue to explore alternative methods to identify back-end instances behind front-end VMs, I also began to investigate Netflix as a candidate for study. Netflix is built atop Amazon Web Services and utilizes their services for streaming and content delivery. It is also a strong option for investigation due to the numerous resources and documentation surrounding its infrastructure, including the Netflix blog and its numerous open source projects available on GitHub.

In my initial research I discovered that Netflix has moved away from relying solely on Amazon’s Elastic Load Balancer and built Eureka for middle-tier load balancing and discovery, Zuul for front-end load balancing, and other services. For example, Eureka offers round-robin load balancing as well as more advanced load-balancing algorithms. An interesting fact is that if one examines the Netflix landing page, one can see the server id and region listed at the bottom of the page. An initial line of inquiry is to make simultaneous requests to Netflix’s landing page, pull the server id off the returned pages, and examine the results to see if we can identify some sort of pattern.

For this and next week I will be working on writing a Python script to scrape the server id information as well as reading more to understand Netflix’s cloud infrastructure setup.

Week 6 – Grace Hopper Conference

Sorry for the delay! This post actually refers to last week when I attended the Grace Hopper Celebration 2014 in Phoenix, AZ. And what fun it was indeed! I only wish it lasted longer…..

So if you don’t know, GHC is the largest technical conference for women in computing and technology. GHC includes everything from technical talks from women in the industry to networking luncheons, career advancement workshops, recognition and awarding of women whose work has made a significant impact in the field and in the world, research poster presentations and more.

First of all, I was sponsored by UW-Madison’s Computer Science department along with several graduate women in our department. Traveling to and attending GHC together really gave us women in the department a chance to meet and connect with one another based on our common interests. I was able to meet so many interesting women in different fields of study, from data analysis to databases to HCI!

GHC is also committed to recognizing and awarding women for their academic, professional, educational, and social contributions to the field. Among the many esteemed women was Barbara Birungi, the founder of WITU (Women in Technology Uganda), who was awarded the 2014 GHC Change Agent ABIE award at the conference. It was especially relevant to me since I, too, via my parents, originate from Uganda, and it was inspiring to know that technology is improving the lives of women all over the world. I ran into Barbara later at the conference and chatted briefly with her about her work.

The technical talks given by women currently in academia and industry are the other great aspect of GHC. I sat through many lightning talks where women discussed the technical problems and solutions involved in running large software platforms and services, including Facebook, Pinterest, and others. I learned about A/B testing, the development life cycle of a software release, various automation tools, and many other industry practices.

I was also able to attend the Women of Color Networking Luncheon and even sat at the same table as Lynn Almoro, the Vice President of Global Risk Capabilities at American Express, who was also one of the keynote speakers for the event. Her speech imparted several beneficial pieces of advice based on her experiences navigating the professional world and her various roles as a friend, a mentor, and a leader.

While I have barely touched on all the great things GHC has to offer, I would definitely recommend it to everyone. It was a blast!

Week 5 – Analyzing Data

This week I reviewed the IPids gathered from the captured TCP packets in our closed experiment and discovered results that did not match our initial expectations. In both the packets sent from our client machine and those received from the instances behind the load balancer, the IPids did not reveal a sequential pattern as Bellovin’s paper suggested.

To make sure that the test environment was not somehow altering the results, that is, to ensure that the NAT and load balancers were not overwriting IPid fields, we ran one additional test between a machine directly connected to the Internet (no NAT) and an instance in EC2 (no load balancer), but still observed the same results as before.

We then looked into the Linux kernel to see exactly how the IPid of TCP packets is initialized. This revealed that earlier versions of the Linux kernel did in fact use a global counter to set the IPid, but from version 2.4.0 onward the IPid is maintained per socket connection. Most likely the results in Bellovin’s paper were from one of those earlier versions of the Linux kernel.

In conclusion, we have determined that counting the number of IPid sequences is no longer a feasible technique to discover hosts behind a NAT or load balancer. Next week, we will continue to brainstorm and search for new techniques.

Week 4 – Capturing Packets

This week was spent setting up and running the PoC (Proof of Concept) experiment in Amazon EC2. I created a load balancer connected to two virtual machines acting as simple web servers. When I made a request to the load balancer’s DNS address, I would receive a simple HTML page from one instance or the other.

The goal was to examine the packets returning to my client machine to see if there is identifying information in the packets that would distinguish the two VMs. To capture the packets, I used the Linux command-line tools ‘curl’ to make HTTP requests to the load balancer and ‘tcpdump’ to capture incoming and outgoing packets and write them to a .pcap file. Then we can use ‘wireshark’ to examine the packets individually.

Next week will be spent examining the packets, particularly the IPids, to identify patterns, if any, that would distinguish the two VMs.

Week 3 – Information Gathering

Last meeting, we decided to focus on identifying deployments behind load balancers in the cloud. I read a couple of papers to familiarize myself with existing research and glean ideas for techniques on measuring web deployments. “WhoWas: A Platform for Measuring Web Deployments on IaaS Clouds” describes the WhoWas platform, which uses active probing to perform network measurements and provide the history of an IP address over time. “A Technique for Counting NATted Hosts” by Steve Bellovin describes his process of using the unique sequences of IPids to attempt to identify the number of hosts behind a NAT (Network Address Translator).

We decided to explore the applicability of Bellovin’s IPid technique. This week we will work on setting up a small trial experiment in Amazon’s EC2 with a few instances and a load balancer to see if we can detect unique hosts based on the IPid.
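To make the IPid idea concrete, here is a minimal sketch of sequence counting. It is a greedy simplification, not the algorithm from the paper: it assumes each host increments a single 16-bit counter, and the `max_gap` threshold is an arbitrary value chosen for the example.

```python
def count_ipid_sequences(ipids, max_gap=64):
    """Greedily group 16-bit IP IDs (in arrival order) into increasing runs.

    Each run is assumed to come from one host with a global counter.
    max_gap is a hypothetical tolerance for packets we did not observe.
    """
    last_values = []  # last IP ID seen in each sequence found so far
    for ipid in ipids:
        best, best_gap = None, None
        for i, last in enumerate(last_values):
            gap = (ipid - last) % 65536  # forward distance, with wraparound
            if 0 < gap <= max_gap and (best_gap is None or gap < best_gap):
                best, best_gap = i, gap
        if best is None:
            last_values.append(ipid)   # no close predecessor: new sequence
        else:
            last_values[best] = ipid   # extend the closest sequence
    return len(last_values)
```

For example, the interleaved IDs `[100, 5000, 101, 5001, 102]` would be grouped into two sequences, suggesting two hosts. IDs with no sequential pattern each start their own sequence, so the count stops being meaningful.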

 

 

Welcome to Mai Blog!

Hello and Welcome!

My name is Maimuna Lubega (Mai, pronounced “My”, for short). On this blog I will be documenting my experience as an Undergrad Researcher as part of CRA-W’s 2014 Collaborative Research Experience for Undergraduates program. I will be conducting research on Public Infrastructure-as-a-Service clouds under UW-Madison Professor Aditya Akella and Graduate Researcher Aaron Gember-Jacobsen. Here is an abstract summary of our research project:

Cloud services are a popular web hosting and data storage option for several companies and organizations. This study aims to infer the back-end configuration behind these web services deployed in Public IaaS clouds, for instance how many back-end servers support a hosted front-end service, the geographic distribution of these back-end resources, and if and how web services are using load balancing and content distribution networks (CDNs). First, we will explore identifying back-ends behind load balancers or VMs by examining metadata in HTTP headers and other techniques to try to infer the configuration and number of these back-end servers. Secondly, we will explore how to determine if web services that utilize CDNs are also hosting their content in the cloud or elsewhere. Examining the DNS lookups of a client to infer the content source is one possible technique we will investigate.

More to follow!