Trackography: Methodology

Through Trackography we are examining which companies track us when we access media websites, as well as where our data travels to when we read the news online. Our methodology includes the following:

Stage 1: Mapping the media

Stage 2: Running our software

Stage 3: Making sense of collected data

Stage 4: Examining results

Stage 5: Examining the main tracking companies

Stage 6: Making the privacy policies machine-readable

Stage 7: Visualising results

Read details below:

Stage 1: Mapping the media

We started the project by exploring online tracking through media websites. Unlike other types of websites, which are more content-specific, online news is read by most of us every day, regardless of our background, which means that potentially anyone can be tracked and profiled.

We initially compiled lists of media websites for various countries around the world, which can be accessed through our repository on GitHub. Additionally, we collaborated with individuals around the world who reviewed our media lists and supported us with the following:

1. Adding missing websites to our lists which cover the news, are of public interest and which are regularly accessed by most individuals on a national or local level in each country

2. Deleting websites from our lists which are not regularly updated, do not necessarily cover the news and are not regularly accessed by most individuals on a national or local level in each country

3. Separating the following in our lists:

National media websites
Regional media websites
Blogs covering the news

Reviewed media lists can be found in the "verified" section of our repository on GitHub and can be changed and updated by the online community.

Stage 2: Running our software

Once our lists of media websites were reviewed, we ran our data collection software on these lists from various countries around the world. Details about how to run our software can be found in our repository on GitHub.

Our software is designed to emulate a browser and to connect to the websites included in the media lists. It not only allows us to view a user's traceroute to the server of a specific website every time they access it, but also to collect all the third-party URLs included in those websites.

As such, our software enables us to detect:

the servers hosting the websites we access and the countries they are located in
the servers of the companies tracking us through these websites and the countries they are located in
the countries which host the network infrastructure required to reach the servers of the websites we access
the countries which host the network infrastructure required to reach the servers of the companies tracking us through these websites
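To illustrate what collecting third-party URLs can mean in practice, here is a minimal sketch in Python. It is not the project's own code: it parses static HTML rather than emulating a browser, so it would miss resources loaded dynamically by scripts, and it treats any host that differs from the page's own host as a third party, which is a simplification (a site's own subdomains would be counted too). The names `ResourceCollector` and `third_party_hosts` and the example domains are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collect the URLs of resources a page asks the browser to load."""

    # Illustrative subset of tags and the attribute carrying the resource URL.
    RESOURCE_ATTRS = {"script": "src", "img": "src", "iframe": "src", "link": "href"}

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative and protocol-relative URLs against the page URL.
                self.urls.append(urljoin(self.page_url, value))


def third_party_hosts(page_url, html):
    """Return the sorted hosts referenced by the page that differ from its own host."""
    collector = ResourceCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).hostname
    hosts = {urlparse(url).hostname for url in collector.urls}
    return sorted(host for host in hosts if host and host != own_host)


# Made-up page using reserved example domains.
sample = """
<html><head>
<script src="//tracker.example.net/t.js"></script>
<link rel="stylesheet" href="/style.css">
</head><body>
<img src="https://ads.example.org/pixel.gif">
<img src="/logo.png">
</body></html>
"""

print(third_party_hosts("http://news.example.com/", sample))
```

Each host found this way could then be resolved to an IP address and used as the target of a traceroute, which is how the server and network locations listed above would be obtained.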

Note that our software provides a snapshot of online tracking at a specific moment in time. Results might change depending on variables such as the time, the ISP and international carrier routing.

We have refrained from running the software over the following:

Tor: Our software performs traceroutes, which cannot run over Tor. If the software were run over Tor, the web connection would appear to come from a different network point than the traceroutes, which would lead to inaccurate results.
Networks filtering ICMP packets: A traceroute relies on receiving ICMP Time Exceeded packets, so the results will be incomplete if the software is run over a network that filters ICMP packets. The software displays when this happens.
Internet lines with a lot of packet loss (for example, WiFi/WiMax far from the access point): A traceroute is based on a protocol which does not support retransmission, so if the software is run over Internet lines with a lot of packet loss, the likelihood of incomplete results is high.
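As a rough illustration of how incomplete traceroute results can be recognised, the sketch below parses text in the style of `traceroute -n` output on Linux and reports which hops returned no reply (shown as `* * *`). The output format is an assumption, the function names are hypothetical, and this is not how the project's software itself reports the problem. The addresses come from ranges reserved for documentation.

```python
import re

# A hop line starts with the hop number; the header line starts with "traceroute".
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def parse_hops(output):
    """Return a list of (hop_number, ip_or_None) from `traceroute -n` style text."""
    hops = []
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header or blank line
        number, rest = int(match.group(1)), match.group(2)
        ip = IPV4.search(rest)
        # A hop that only shows "* * *" has no address: no ICMP reply arrived.
        hops.append((number, ip.group(0) if ip else None))
    return hops


def silent_fraction(hops):
    """Fraction of hops for which no reply was received."""
    if not hops:
        return 0.0
    return sum(1 for _, ip in hops if ip is None) / len(hops)


sample_output = """\
traceroute to 203.0.113.9 (203.0.113.9), 30 hops max, 60 byte packets
 1  192.0.2.1  1.2 ms  1.1 ms  1.0 ms
 2  * * *
 3  198.51.100.7  10.4 ms  10.1 ms  9.9 ms
 4  203.0.113.9  20.3 ms  20.1 ms  20.0 ms
"""

hops = parse_hops(sample_output)
print(hops)
print(silent_fraction(hops))
```

A high fraction of silent hops would suggest that the network filters ICMP or loses many packets, which are the conditions under which we avoided running the software.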

We then collected the results from the runs of our software in various countries around the world.

Stage 3: Making sense of collected data

Once we ran our software and collected results about who tracks us and where our data travels to when we access media websites, we tried to make sense of the collected data. As such, we:

validated the results through a data integrity check
looked up every IP address from the traceroutes and tracking servers in a GeoIP database
stored the collected data in a database which is published via a RESTful API
performed a statistical analysis of all the collected data and results, which are published via a RESTful API
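The lookup and counting steps can be sketched as follows. This is only an illustration under stated assumptions: the small `GEO_TABLE` is a hypothetical stand-in for a real GeoIP database, the networks are ranges reserved for documentation, and the country codes `XA`, `XB` and `XC` are placeholders rather than real countries. The function names are also hypothetical.

```python
import ipaddress
from collections import Counter

# Hypothetical stand-in for a GeoIP database: documentation-only networks
# mapped to placeholder country codes.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "XA"),
    (ipaddress.ip_network("198.51.100.0/24"), "XB"),
    (ipaddress.ip_network("203.0.113.0/24"), "XC"),
]


def country_of(ip):
    """Return the placeholder country code for an IP, or None if unknown."""
    address = ipaddress.ip_address(ip)
    for network, country in GEO_TABLE:
        if address in network:
            return country
    return None


def countries_on_path(hop_ips):
    """Countries a traceroute passes through, in order of first appearance."""
    seen = []
    for ip in hop_ips:
        country = country_of(ip)
        if country is not None and country not in seen:
            seen.append(country)
    return seen


def count_tracker_countries(tracker_ips):
    """Count how many tracking servers are located in each country."""
    return Counter(c for c in map(country_of, tracker_ips) if c is not None)


print(countries_on_path(["192.0.2.1", "10.0.0.1", "198.51.100.7", "192.0.2.9"]))
print(count_tracker_countries(["203.0.113.5", "203.0.113.6", "198.51.100.1"]))
```

Hops with addresses that are missing from the table, such as private addresses, are skipped here. A real analysis would need to decide how to report such unresolved hops.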

Stage 4: Examining results

Once we made sense of the collected results, we examined them based on the following:

The prevailing companies that track users around the world
The legal privacy framework of some of the countries that host the servers of websites and tracking companies
How tracking differs in various countries around the world
The geopolitics of data

Stage 5: Examining the main tracking companies

We identified some of the main trackers: the companies which are prevalent in tracking users based on the results we have collected so far.

We then collected the following fields of data through the website of each of the main tracking companies we examined:

Headquarters
Parent company
When they were founded
Their services (advertising, profiling, web analytics, market research, web crawling)
Their clients

The above data can be viewed through our repository on GitHub.

Stage 6: Making the privacy policies machine-readable

Based on the privacy policies of some of the main tracking companies, we collected the following information on whether:

they collect personally identifiable information (PII)
they collect non-personally identifiable information (non-PII)
they collect technical data
they provide safeguards to prevent the full identification of users' IP addresses
users can opt-out from being tracked
they support Do Not Track (DNT)
users can access data collected about them
data is being collected by other third parties, in addition to the main trackers
they use browser cookies
they use Flash cookies
they use web beacons/web bugs
they disclose users' data to third parties
they prohibit third parties from using disclosed data for unspecified purposes
they retain data and if so, for how long
they comply with the U.S.-EU Safe Harbour Framework
they are certified by TRUSTe
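To show what a machine-readable form of these answers might look like, here is a minimal sketch of one record serialised to JSON. The field names, the `PolicyRecord` class and the example company are hypothetical and cover only a subset of the questions above, so this is not the schema we published. A value of `None` is used when a policy does not address the question.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PolicyRecord:
    """One tracking company's privacy policy, reduced to yes/no/unknown answers."""

    company: str
    collects_pii: Optional[bool] = None
    opt_out_available: Optional[bool] = None
    supports_dnt: Optional[bool] = None
    uses_browser_cookies: Optional[bool] = None
    discloses_to_third_parties: Optional[bool] = None
    retention_period: Optional[str] = None  # free text, e.g. as stated in the policy


# Entirely fictional company and answers, for illustration only.
record = PolicyRecord(
    company="Example Tracker (fictional)",
    collects_pii=False,
    supports_dnt=True,
    retention_period="18 months",
)

encoded = json.dumps(asdict(record), sort_keys=True, indent=2)
decoded = json.loads(encoded)
print(encoded)
```

Keeping "unknown" distinct from "no" matters here, because a policy that says nothing about a practice is different from one that rules it out.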

We also collected data about correlations between these tracking companies and intelligence agencies.

The above data has been published via a RESTful API and can be viewed through our repository on GitHub.

Stage 7: Visualising results

We created a map which visualises most of our research results.

We encourage you to play with our map and to view the following:

which companies track you when you read the news online
which countries your data passes through when you read the news online
which countries are hosting the servers of the companies tracking you
which countries are hosting the servers of the media websites you access
how tracking companies handle your data

Please send feedback to trackmap@tacticaltech.org

pub 3200R/0x94E7EF47 2014-08-05 [expires: 2015-08-30]
Key fingerprint = ABC2 7639 5EE3 3245 A0A1 3973 40E2 6C25 94E7 EF47
uid TrackMap project <trackmap@tacticaltech.org>
sub 3200R/0x504DEBDF 2014-08-05 [expires: 2015-08-30]