author | Nick Mathewson <nickm@torproject.org> | 2008-05-08 04:13:36 +0000 |
---|---|---|
committer | Nick Mathewson <nickm@torproject.org> | 2008-05-08 04:13:36 +0000 |
commit | 32065813ac34437971cb9c8a95a1923557d0557d (patch) | |
tree | 2fe16f2f91ea0d16de7e2cca2a1673cdd88d21c6 /doc/spec/proposals/ideas | |
parent | 2238d8008d6c1e71e23fa52fbf51dc8773966abe (diff) | |
Add proposed methodology for tracking national usage trends.
svn:r14578
Diffstat (limited to 'doc/spec/proposals/ideas')
-rw-r--r-- | doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt | 88 |
1 files changed, 88 insertions, 0 deletions
diff --git a/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt b/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
new file mode 100644
index 000000000..08612aa46
--- /dev/null
+++ b/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
@@ -0,0 +1,88 @@
+
+
+Abstract
+
+  This document explains how to tell about how many Tor users there
+  are, and how many there are in which country.  Statistics are
+  involved.
+
+Motivation
+
+  There are a few reasons we need to keep track of which countries
+  Tor users (in aggregate) are coming from:
+
+    - Resource allocation.  Knowing about underserved countries with
+      lots of users can let us know about where we need to direct
+      translation and outreach efforts.
+
+    - Anticensorship.  Sudden drops in usage on a national basis can
+      indicate the arrival of a censorious firewall.
+
+    - Sponsor outreach and self-evaluation.  Many people and
+      organizations who are interested in funding The Tor Project's
+      work want to know that we're successfully serving parts of the
+      world they're interested in, and that efforts to expand our
+      userbase are actually succeeding.  So, when you come right
+      down to it, do we.
+
+Goals
+
+  We want to know about how many Tor users there are, and which
+  countries they're in, even in the presence of a hypothetical
+  "directory guard" feature.  Some uncertainty is okay, but we'd like
+  to be able to put a bound on the uncertainty.
+
+  We need to make sure this information isn't exposed in a way that
+  helps an adversary.
+
+Methods:
+
+  Every client downloads network status documents.  There are
+  currently three methods (one hypothetical) for clients to get them.
+    - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
+      document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
+      minutes].
+
+    - 0.2.0.x clients fetch a v3 networkstatus consensus document
+      at a random interval between when their current document is no
+      longer freshest, and when their current document is about to
+      expire.
+
+    [In both of the above cases, clients choose a directory cache at
+    random with odds roughly proportional to its bandwidth.]
+
+    - In some future version, clients will choose directory caches
+      to serve as their "directory guards" to avoid profiling
+      attacks, similarly to how clients currently start all their
+      circuits at guard nodes.
+
+  We assume that a directory cache can tell which of these three
+  categories a client is in by the format of its status request.
+
+  A directory cache can be made to count distinct client IP
+  addresses that make a certain request of it in a given timeframe.
+  For the first two cases, a cache can get a picture of the overall
+  number and countries of users in the network by dividing the IP
+  count by the probability with which they (as a cache) would be
+  chosen.  Assuming that our listed bandwidth is such that we expect
+  to be chosen with probability P for any given request, and we've
+  been counting IPs for long enough that we expect the average
+  client to have made N requests, they will have visited us at least
+  once with probability P' = 1-(1-P)^N, and so we divide the IP
+  counts we've seen by P' for our estimate.
+
+  If directory guards are in use, directory guards get a picture of
+  all those users who chose them as a guard when they were listed
+  as a good choice for a guard, and who are also on the network
+  now.  The cleanest data here will come from nodes that were listed
+  as good new-guard choices for a while, and have not been so for a
+  while longer (to study decay rates); nodes that have been listed
+  as good new-guard choices consistently for a long time (to get a
+  sample of the network); and nodes that have been listed as good
+  new-guard choices only recently (to get a sample of new users and
+  users whose guards have died out).
+
+  Note that these measurements *shouldn't* be taken at directory
+  authorities: their picture of the network is too skewed by the
+  special cases in which clients fetch from them directly.
+
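
The estimate in the Methods section (divide the distinct-IP counts by P' = 1-(1-P)^N) can be illustrated with a short sketch. This is not code from the commit or from Tor itself: the function names, the per-country dictionary, and the sample numbers are all assumptions made for illustration, and the proposal does not say how P and N would be obtained in practice.

```python
# Illustrative sketch only; names and inputs are hypothetical, not Tor APIs.

def visit_probability(p: float, n: float) -> float:
    """Chance that a client picked this cache at least once in n requests.

    p is the assumed per-request probability of choosing this cache,
    n the assumed average number of requests per client in the window.
    Implements P' = 1 - (1 - P)^N from the proposal.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - p) ** n


def estimate_users(ip_counts: dict, p: float, n: float) -> dict:
    """Scale per-country distinct-IP counts up to a network-wide estimate.

    ip_counts maps a country code to the number of distinct client IPs
    this cache saw from that country (hypothetical input format).
    """
    p_prime = visit_probability(p, n)
    return {country: count / p_prime for country, count in ip_counts.items()}


if __name__ == "__main__":
    # Made-up figures: a cache chosen for 1% of requests, observed over
    # a window in which the average client made 48 requests.
    seen = {"de": 1200, "us": 3100, "cn": 450}
    for country, users in sorted(estimate_users(seen, p=0.01, n=48).items()):
        print(f"{country}: about {users:.0f} users")
```

The sketch treats every client as making exactly N requests and choosing caches independently each time, which is the simplification the proposal itself makes; it gives a point estimate and does not bound the uncertainty the Goals section asks for.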