Week 8 – Polling Netflix

The previous week I wrote a Python script that successfully pulled the server ID from the HTML of Netflix’s landing page. I ran the script in a loop to build up a data set, from which I was able to identify 4-5 distinct servers, although naive examination did not reveal any easily identifiable patterns. Next, I decided to set up VMs in different regions of EC2 and have them poll the Netflix site in order to build up a larger data set. While I was able to pull data with VMs set up in several US regions, I ran into a few errors when attempting to poll Netflix from EC2 regions outside the US. I am still looking into the issue to see what is happening differently outside of the country.
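As a rough illustration of the tallying step, here is a minimal sketch that extracts a server ID from each polled page and counts how often each one shows up. The `server-id:` marker, the regular expression, and the sample pages are all assumptions made up for this example. The actual footer markup is not reproduced in this post, and this is not the script I ran.

```python
import re
from collections import Counter

# Hypothetical marker: the real footer markup is not shown in this post,
# so this pattern is an assumption made for illustration only.
SERVER_ID_PATTERN = re.compile(r"server-id:\s*([\w.-]+)")


def extract_server_id(html):
    """Return the first server ID found in the page, or None if absent."""
    match = SERVER_ID_PATTERN.search(html)
    return match.group(1) if match else None


def tally_server_ids(pages):
    """Count how often each server ID appears across the polled pages."""
    counts = Counter()
    for html in pages:
        server_id = extract_server_id(html)
        if server_id is not None:
            counts[server_id] += 1
    return counts


# Made-up pages standing in for repeated polls of the landing page.
sample_pages = [
    "<footer>server-id: i-aaa111</footer>",
    "<footer>server-id: i-bbb222</footer>",
    "<footer>server-id: i-aaa111</footer>",
    "<footer>no id here</footer>",
]
counts = tally_server_ids(sample_pages)
print(counts.most_common())
```

Keeping the extraction separate from the counting means the same tally can be run over pages saved by VMs in different regions, which makes the per-region data sets easy to compare.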

On a side note, I also checked whether any other software services advertised on AWS as using EC2 happen to reveal load balancing information or server IDs, but so far only Netflix appears to display this information.

This week I will continue to pull data as well as examine the intermediate data sets I have collected.

Week 7 – Investigating Netflix

As we continue to explore alternative methods to identify back-end instances behind front-end VMs, I also began to investigate Netflix as a candidate for study. Netflix is built atop Amazon Web Services and uses its services for streaming and content delivery. It is also a strong option for investigation because of the many resources and documentation surrounding its infrastructure, including the Netflix blog and its numerous open source projects available on GitHub.

In my initial research I discovered that Netflix has moved away from relying solely on Amazon’s Elastic Load Balancer and has built Eureka for middle-tier load balancing and discovery, Zuul for front-end load balancing, and other services. For example, Eureka offers both round-robin load balancing and more advanced load-balancing algorithms. An interesting fact is that if you examine the Netflix landing page, you can see the server ID and region listed at the bottom of the page. An initial experiment is to make simultaneous requests to Netflix’s landing page, pull the server ID off each response, and examine the results to see if we can identify some sort of pattern.
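One simple pattern to look for in such results is plain round-robin. The sketch below checks whether a sequence of observed server IDs cycles through its distinct values in a fixed order. The function name and the sample sequences are hypothetical. A real client would rarely see a clean cycle, since requests from other users are interleaved with its own, so a negative result here would not rule out round-robin on the server side.

```python
def looks_round_robin(observed):
    """Return True if the IDs repeat in one fixed cyclic order.

    The cycle is taken to be the IDs seen before the first repeat.
    A sequence with a single distinct ID trivially passes.
    """
    order = []
    for server_id in observed:
        if server_id in order:
            break
        order.append(server_id)
    period = len(order)
    if period == 0:
        return False
    return all(
        server_id == order[i % period]
        for i, server_id in enumerate(observed)
    )


# Made-up observation sequences for illustration.
clean_cycle = ["a", "b", "c", "a", "b", "c", "a"]
no_cycle = ["a", "b", "a", "c", "b"]
print(looks_round_robin(clean_cycle), looks_round_robin(no_cycle))
```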

This week and next I will be writing a Python script to scrape the server ID information, as well as reading more to understand Netflix’s cloud infrastructure setup.

Week 6 – Grace Hopper Conference

Sorry for the delay! This post actually refers to last week, when I attended the Grace Hopper Celebration 2014 in Phoenix, AZ. And what fun it was indeed! I only wish it had lasted longer…

So if you don’t know, GHC is the largest technical conference for women in computing and technology. GHC includes everything from technical talks from women in the industry to networking luncheons, career advancement workshops, recognition and awarding of women whose work has made a significant impact in the field and in the world, research poster presentations and more.

First of all, I was sponsored by UW-Madison’s Computer Science department along with several graduate women in our department. Traveling to and attending GHC together gave us women in the department a chance to meet and connect with one another based on our common interests. I was able to meet so many interesting women in different fields of study, from data analysis to databases to HCI!

GHC is also committed to recognizing and awarding women for their academic, professional, educational, and social contributions to the field. One of the many esteemed women was Barbara Birungi, the founder of WITU-Women in Technology Uganda, who was awarded the 2014 GHC Change Agent ABIE award at the conference. It was especially relevant to me since I, too, via my parents, originate from Uganda, and it was inspiring to know that technology is improving the lives of women all over the world. I ran into Barbara later at the conference and chatted briefly about her work.

The technical talks given by women currently in academia and industry are the other great aspect of GHC. I sat through many lightning talks where women discussed the technical problems and solutions involved in running large software platforms and services, including Facebook, Pinterest, and others. I learned about A/B testing, the development life cycle of a software release, various automation tools, and many other industry practices.

I was also able to attend the Women of Color Networking Luncheon and even sat at the same table as Lynn Almoro, the Vice President of Global Risk Capabilities at American Express, who was also one of the keynote speakers for the event. Her speech imparted several beneficial pieces of advice based on her experiences navigating the professional world and her various roles as a friend, a mentor, and a leader.

While I have barely touched on all the great things GHC has to offer, I would definitely recommend it to everyone. It was a blast!

Week 5 – Analyzing Data

This week I reviewed the IPids gathered from the captured TCP packets in our closed experiment and discovered results that did not match our initial expectations. In both the packets sent from our client machine and those received from the instances behind the load balancer, the IPids did not reveal a sequential pattern as Bellovin’s paper suggested.
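To make the expectation concrete, here is a minimal sketch of the sequence-counting idea: each observed IPid either extends an existing run (if it is a small step ahead of that run’s last value, modulo 2^16) or starts a new one, and the number of runs is the estimate of the number of hosts. This is a simplified greedy version written for illustration, not the algorithm from the paper and not our analysis code. The `max_gap` threshold and all the sample values are assumptions.

```python
def count_ipid_sequences(ipids, max_gap=64):
    """Greedily group 16-bit IP IDs into increasing runs; return the run count."""
    last_seen = []  # last IP ID of each run found so far
    for ipid in ipids:
        best_idx = None
        best_gap = None
        for idx, last in enumerate(last_seen):
            gap = (ipid - last) % 65536  # forward distance with wraparound
            if 0 < gap <= max_gap and (best_gap is None or gap < best_gap):
                best_idx, best_gap = idx, gap
        if best_idx is None:
            last_seen.append(ipid)      # nothing close enough: new run
        else:
            last_seen[best_idx] = ipid  # extend the closest run
    return len(last_seen)


# Made-up capture: two hosts with host-wide counters, packets interleaved.
two_hosts = [100, 40000, 101, 102, 40001, 103, 40002]
# Made-up capture: IDs with no sequential relationship to one another.
scattered = [5, 30000, 12000, 61000, 45000]
print(count_ipid_sequences(two_hosts), count_ipid_sequences(scattered))
```

When the IDs are not sequential per host, nearly every packet opens its own run, so the count tracks the number of packets rather than the number of hosts.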

To make sure that the test environment was not somehow altering the results, that is, to ensure that the NAT and load balancers were not overwriting the IPid fields, we ran one additional test between a machine directly connected to the Internet (no NAT) and an instance in EC2 (no load balancer), but still observed the same results as before.

We then looked into the Linux kernel to see exactly how the IPid of TCP packets is initialized. This revealed that earlier versions of the Linux kernel did in fact use a global counter to set the IPid, but from version 2.4.0 onward the IPid is tracked per socket connection rather than host-wide. Most likely the results in Bellovin’s paper were from one of those earlier versions of the Linux kernel.
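The difference between the two schemes can be seen in a toy simulation. The sketch below is a simplified model of the behavior described above, not the kernel’s actual code: the random starting values, the round-robin send order across connections, and the function names are all assumptions made for illustration.

```python
import random


def simulate_host(num_connections, packets_each, per_connection, seed=0):
    """Return the IP IDs one host emits while cycling over its connections."""
    rng = random.Random(seed)
    global_counter = rng.randrange(65536)
    conn_counters = [rng.randrange(65536) for _ in range(num_connections)]
    ids = []
    for _ in range(packets_each):
        for conn in range(num_connections):
            if per_connection:
                # Each connection advances its own private counter.
                ids.append(conn_counters[conn])
                conn_counters[conn] = (conn_counters[conn] + 1) % 65536
            else:
                # Every packet takes the next value of one host-wide counter.
                ids.append(global_counter)
                global_counter = (global_counter + 1) % 65536
    return ids


def sequential_fraction(ids):
    """Fraction of adjacent packets whose IDs differ by exactly 1 (mod 2**16)."""
    pairs = list(zip(ids, ids[1:]))
    hits = sum(1 for a, b in pairs if (b - a) % 65536 == 1)
    return hits / len(pairs)


global_ids = simulate_host(4, 5, per_connection=False)
socket_ids = simulate_host(4, 5, per_connection=True)
print(sequential_fraction(global_ids), sequential_fraction(socket_ids))
```

With a host-wide counter the whole stream from a host is one unbroken run. With per-connection counters, a single host talking over several connections looks like several unrelated runs.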

In conclusion, we have determined that counting the number of IPid sequences is no longer a feasible technique for discovering hosts behind a NAT or load balancer. Next week, we will continue to brainstorm and search for new techniques.