aboutsummaryrefslogtreecommitdiff
path: root/doc/spec/proposals/151-path-selection-improvements.txt
blob: e3c8f35451f248ca04a84beacfee6af162b8c20d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
Filename: 151-path-selection-improvements.txt
Title: Improving Tor Path Selection
Version:
Last-Modified:
Author: Fallon Chen, Mike Perry
Created: 5-Jul-2008
Status: Draft

Overview

  The performance of paths selected can be improved by adjusting the
  CircuitBuildTimeout and avoiding failing guard nodes. This proposal
  describes a method of tracking buildtime statistics at the client, and 
  using those statistics to adjust the CircuitBuildTimeout.

Motivation

  Tor's performance can be improved by excluding those circuits that
  have long buildtimes (and by extension, high latency). For those Tor
  users who require better performance and have lower requirements for
  anonymity, this would be a very useful option to have.

Implementation

  Storing Build Times

    Circuit build times will be stored in the circular array
    'circuit_build_times' consisting of uint16_t elements as milliseconds.
    The total size of this array will be based on the number of circuits
    it takes to converge on a good fit of the long term distribution of
    the circuit builds for a fixed link. We do not want this value to be
    too large, because it will make it difficult for clients to adapt to
    moving between different links.

    From our initial observations, this value appears to be on the order 
    of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
    The exact value for this #define will be determined by performing
    goodness of fit tests using measurments obtained from the shufflebt.py
    script from TorFlow.
 
  Long Term Storage

    The long-term storage representation will be implemented by storing a 
    histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when 
    writing out the statistics to disk. The format of this histogram on disk 
    is yet to be finalized, but it will likely be of the format 
    'CircuitBuildTime <bin> <count>', with the total specified as 
    'TotalBuildTimes <total>'
    Example:

    TotalBuildTimes 100
    CircuitBuildTimeBin 1 50
    CircuitBuildTimeBin 2 25
    CircuitBuildTimeBin 3 13
    ...

    Reading the histogram in will entail multiplying each bin by the 
    BUILDTIME_BIN_WIDTH and then inserting <count> values into the 
    circuit_build_times array each with the value of
    <bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the 
    values in the circular array, a form of index skipping must
    be employed. Values from bin #N with bin count C and total T
    will occupy indexes specified by N+((T/C)*k)-1, where k is the
    set of integers ranging from 0 to C-1.

    For example, this would mean that the values from bin 1 would
    occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
    The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
    will be inserted at the first empty position in the array greater 
    than the selected index (which may requiring looping around the 
    array back to index 0).

  Learning the CircuitBuildTimeout

    Based on studies of build times, we found that the distribution of
    circuit buildtimes appears to be a Pareto distribution. 

    We will calculate the parameters for a Pareto distribution
    fitting the data using the estimators at
    http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.

    The timeout itself will be calculated by solving the CDF for the 
    a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
    represents the percentage of paths the Tor client will accept out of
    the total number of paths. We have not yet determined a good
    cutoff for this mathematically, but 85% seems a good choice for now.

    From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
    the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm. 

  Testing

    After circuit build times, storage, and learning are implemented,
    the resulting histogram should be checked for consistency by
    verifying it persists across successive Tor invocations where 
    no circuits are built. In addition, we can also use the existing
    buildtime scripts to record build times, and verify that the histogram 
    the python produces matches that which is output to the state file in Tor,
    and verify that the Pareto parameters and cutoff points also match.
  
  Soft timeout vs Hard Timeout
   
    At some point, it may be desirable to change the cutoff from a 
    single hard cutoff that destroys the circuit to a soft cutoff and
    a hard cutoff, where the soft cutoff merely triggers the building
    of a new circuit, and the hard cutoff triggers destruction of the 
    circuit.

    Good values for hard and soft cutoffs seem to be 85% and 65% 
    respectively, but we should eventually justify this with observation.

  When to Begin Calculation

    The number of circuits to observe (NCIRCUITS_TO_CUTOFF) before 
    changing the CircuitBuildTimeout will be tunable via a #define. From 
    our measurements, a good value for NCIRCUITS_TO_CUTOFF appears to be 
    on the order of 100.

  Dealing with Timeouts

    Timeouts should be counted as the expectation of the region of 
    of the Pareto distribution beyond the cutoff. The proposal will
    be updated with this value soon.

    Also, in the event of network failure, the observation mechanism 
    should stop collecting timeout data.

  Client Hints

    Some research still needs to be done to provide initial values
    for CircuitBuildTimeout based on values learned from modem
    users, DSL users, Cable Modem users, and dedicated links. A 
    radiobutton in Vidalia should eventually be provided that
    sets CircuitBuildTimeout to one of these values and also 
    provide the option of purging all learned data, should any exist.

    These values can either be published in the directory, or
    shipped hardcoded for a particular Tor version.
    
Issues

  Impact on anonymity

    Since this follows a Pareto distribution, large reductions on the
    timeout can be achieved without cutting off a great number of the
    total paths. This will eliminate a great deal of the performance
    variation of Tor usage.