After modifying the script to fetch cookies from our list of 10,000 Alexa websites, I was able to identify more sites using Amazon’s ELB sticky sessions (roughly 144) than our diff results had turned up. In my preliminary research before writing the script, I noted that some sites had the AWSELB cookie in the browser even though it did not show up in a cURL command-line request. On closer examination, the AWSELB cookie belonged to another domain. For example, when I examined the Heroku landing site with Chrome developer tools, I found the AWSELB session cookie, but it belonged to the pixel.prfct.co domain. When I checked, pixel.prfct.co did not have a web server.
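One plausible reading of the cURL discrepancy is that the cookie is set by a third-party resource embedded in the page, which a single cURL request for the landing page never loads. A minimal sketch of how a cookie's domain could be compared with the page's host follows. The function name and the host `www.heroku.com` are assumptions made for illustration, and the check only mirrors the usual cookie domain-matching rule (same host, or the host ends with a dot plus the cookie domain).

```python
def is_first_party(cookie_domain, page_host):
    """Return True if page_host domain-matches cookie_domain.

    Hypothetical helper: a cookie counts as first-party when the page host
    equals the cookie domain or is a subdomain of it.
    """
    cookie_domain = cookie_domain.lstrip(".").lower()
    page_host = page_host.lower()
    return page_host == cookie_domain or page_host.endswith("." + cookie_domain)


# Assumed example values: the page host is a guess, the cookie domain is
# the one reported by Chrome developer tools.
page_host = "www.heroku.com"
cookie_domain = "pixel.prfct.co"

if is_first_party(cookie_domain, page_host):
    print("AWSELB cookie was set by the site itself")
else:
    print("AWSELB cookie belongs to a third-party domain")
```

Under this check, an AWSELB cookie scoped to pixel.prfct.co would not count as evidence that the page's own host sits behind an ELB.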
I have several questions about this behavior that I hope to investigate, including whether Heroku actually uses the ELB itself or whether the cookie comes from some other content on the page. We will also discuss the results of this search to determine whether they are relevant to our original goal or at least provide some new insights.
After discussing the results of the search, we have decided to wrap up this portion of the experiment. From what we were able to find, the current method of inquiry has proved insufficient to determine the number of back-ends behind a load balancer. Understandably, most websites we surveyed appear to have taken steps to hide this information on the front-end, as it is a potential security risk.
One last search we plan to try is a check for Amazon ELB sticky session cookies. Amazon ELB sticky sessions involve passing a cookie to the client that routes it back to the same back-end server for a period of time. I plan to modify our bash script to pick out the webpages that set a cookie named AWSELB, which indicates a sticky server session. If we again see no significant results, we will turn our attention to our second research goal: identifying the percentage of webpage content hosted inside versus outside cloud-based CDNs.
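As a rough illustration of the cookie check described above, here is a minimal Python sketch (not our actual bash script). It assumes the raw response headers have already been captured as text, for example from a header-only request; the function names and the sample header values are made up for the example.

```python
def set_cookie_values(raw_headers):
    """Collect the values of all Set-Cookie lines in raw response headers."""
    values = []
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "set-cookie":
            values.append(value.strip())
    return values


def cookie_names(set_cookie_headers):
    """Extract the cookie name (text before the first '=') from each header."""
    names = []
    for header in set_cookie_headers:
        first_pair = header.split(";", 1)[0]
        name = first_pair.split("=", 1)[0].strip()
        if name:
            names.append(name)
    return names


def uses_elb_sticky_session(raw_headers):
    """True if the response sets a cookie named exactly AWSELB."""
    return "AWSELB" in cookie_names(set_cookie_values(raw_headers))


# Made-up sample response headers, for illustration only.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: AWSELB=EXAMPLE123;PATH=/;MAX-AGE=600\r\n"
    "\r\n"
)
print(uses_elb_sticky_session(sample))
```

Matching on the exact cookie name, rather than searching for the substring anywhere in the response, avoids false positives from pages that merely mention AWSELB in their content.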
I was able to run those scripts and obtain the list of websites that were using Amazon’s ELB. I then used grep to look for strings that indicated Amazon EC2 instances: pattern matching IP addresses, searching for terms such as server, aws, and ami, and pattern matching instance names with regular expressions. Out of the subset of websites examined (roughly 8,500), we found that only a handful identified the back-end instance in the HTML of the landing page.
These identifying strings usually took the form of a generic name-number string, such as “aws1qatweb3” or “aws-web02”, although one site did reveal the internal IPs of the instance in question. Still, in total there were fewer than 7 unique domains positively identified.
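To make the kind of pattern matching concrete, here is a small Python sketch of the idea (we actually used grep). The two regular expressions are my own assumptions, modeled only on the example names quoted above; they are not the exact patterns from our scripts and would miss naming schemes that do not start with "aws".

```python
import ipaddress
import re

# Assumed pattern: a token starting with "aws" and ending in a digit,
# e.g. "aws1qatweb3" or "aws-web02".
INSTANCE_NAME = re.compile(r"\baws[a-z0-9-]*\d\b", re.IGNORECASE)
# Anything shaped like a dotted-quad IPv4 address; validated below.
IPV4_LIKE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def find_backend_hints(html):
    """Return (instance-like names, private IPv4 addresses) found in html."""
    names = sorted(set(m.group(0) for m in INSTANCE_NAME.finditer(html)))
    private_ips = []
    for candidate in sorted(set(IPV4_LIKE.findall(html))):
        try:
            # is_private covers non-public ranges such as 10.0.0.0/8,
            # 172.16.0.0/12 and 192.168.0.0/16 (and loopback).
            if ipaddress.ip_address(candidate).is_private:
                private_ips.append(candidate)
        except ValueError:
            # Not a valid address (e.g. an octet above 255); skip it.
            continue
    return names, private_ips


# Made-up page fragment for illustration.
page = "<!-- served by aws-web02 --> <span>10.0.1.25</span>"
print(find_backend_hints(page))
```

Every hit from patterns this loose still needs a manual look, since version numbers and unrelated identifiers can match as well.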
I spent most of these two weeks developing scripts that send multiple requests to domains that use Amazon ELB and appear on Alexa’s Top Websites list, and then diff the returned HTML pages. Our goal was to examine the diff’d results to see whether those pages contain any indicators of the origin server, like those of Netflix.
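The diffing step can be sketched in a few lines of Python using the standard difflib module. This is an illustration of the idea rather than our script: it assumes the HTML of two responses from the same URL is already in memory, and the function name and sample pages are hypothetical.

```python
import difflib


def changed_lines(first_html, second_html):
    """Return only the lines that differ between two responses.

    Lines removed from the first response are prefixed with "- " and lines
    added in the second with "+ ", following difflib.ndiff's convention.
    """
    diff = difflib.ndiff(first_html.splitlines(), second_html.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Made-up responses that differ only in a host comment.
first = "<html>\n<!-- host: aws-web01 -->\n<body>hi</body>\n</html>"
second = "<html>\n<!-- host: aws-web02 -->\n<body>hi</body>\n</html>"
for line in changed_lines(first, second):
    print(line)
```

In practice most differences between two fetches come from timestamps, session tokens, and rotating ads, so the changed lines are only candidates to inspect for origin-server indicators.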
Halfway through, I chose to switch from Python to Bash scripting, since most of the processing relied on native shell commands. Because the list of websites is quite big, we looked for ways to speed up the processing. To that end, we partitioned the file into equal-sized chunks and used the GNU Parallel tool to launch multiple jobs, each processing one chunk concurrently.
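For readers who prefer Python, the same chunk-and-fan-out idea can be approximated with a thread pool. This is an analogue of the GNU Parallel setup, not what we ran; `process_chunk` is a placeholder for the real per-domain work, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(domains):
    """Placeholder job: the real one would fetch and inspect each domain."""
    return [d.strip().lower() for d in domains]


def process_all(domains, chunk_size, workers=4):
    """Process every chunk concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the chunks.
        results = pool.map(process_chunk, chunked(domains, chunk_size))
        return [item for chunk in results for item in chunk]


# Made-up domain list for illustration.
print(process_all(["Example.com", "example.org ", "EXAMPLE.net"], chunk_size=2))
```

Threads suit this workload because the jobs spend most of their time waiting on network responses rather than on computation.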
P.S. Happy Belated Thanksgiving Weekend!