Implementation
There are some details in this implementation that differ from the paper as well as from the original R code.
To compare the graphs shown in the paper with the graphs generated by the
implementations, the "training" data is chosen from the first date the metrics
web site provides (2011-03-06) to the approximate date when the paper was
written, i.e. 2018-04-01.
This data, along with the intermediate matrices used to find the anomalies, is used in the tests.
The original code generates 3 rankings and 4 graphs for countries with anomalous behaviour in the intervals:
- last day
- last week
- last month
- all the data available.
Instead, this code generates a single ranking and graph for a given date interval (1 day by default).
Graphs
The paper includes this graph:
source: 2017-04-2018-04-top_10_countries.svg
Graphs generated with the original code:
source: 2018-03-31-01day.svg
source: 2018-03-31-07day.svg
source: 2018-03-31-30day.svg
source: 2012-2018-all_countries.svg
Generated by this code:
source: 2018-04-01-plot.svg
Data preparation
Starting from Tor’s per-country usage data, we initially remove all
countries whose usage never rises above 100 users, to avoid the
unacceptably high variance in such data.
This code instead removes countries whose usage never rises above 1000 users.
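The filtering step above can be sketched with pandas, assuming the usage data is a DataFrame with one column per country (the country codes and counts here are made up for illustration):

```python
import pandas as pd

# Hypothetical example data: daily user counts per country code.
usage = pd.DataFrame({
    "us": [1500, 1800, 1700],
    "mc": [40, 55, 60],       # never rises above 1000: dropped
    "de": [900, 1200, 1100],  # peaks above 1000: kept
})

# Keep only countries whose usage rises above 1000 users at least once.
filtered = usage.loc[:, usage.max() > 1000]
print(list(filtered.columns))  # ['us', 'de']
```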
When removing seasonality, Python's STL and R's stl do not produce exactly the same values, though they are close approximations.
The same happens when removing countries with constant values.
Principal Components Analysis
Python's PCA and R's prcomp do not produce exactly the same values, though they are close approximations.
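A minimal sketch of the PCA step, assuming scikit-learn's PCA (the matrix shape is illustrative: rows are days, columns are countries). Like R's prcomp, scikit-learn centers the data before decomposing, but the sign of each component is arbitrary, which is one reason the two outputs are only approximately equal:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical usage matrix: 180 days x 5 countries.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 5))

pca = PCA()                      # keep all components, like prcomp
scores = pca.fit_transform(X)    # analogous to prcomp's x (the scores)
loadings = pca.components_       # analogous to prcomp's rotation, transposed
print(scores.shape, loadings.shape)  # (180, 5) (5, 5)
```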
Reconstruction
For each 180-day period in the dataset we apply a principal
component analysis over the usage time series for all countries,
resulting in a set of components for that time window. Taking
the true observed usage for each country for the final day of each
window, we calculate the approximated value from the first 12
principal components
The code takes the first day of the first 180-day window and the last day of the remaining windows (?).
Based on this heuristic,
our experimental results suggest twelve principal components as
broadly optimal across the dataset.
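The windowed reconstruction described above can be sketched as follows, assuming scikit-learn's PCA and synthetic data; this follows the paper's description (last day of each window, first 12 components), not necessarily the exact code:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical usage matrix: 400 days x 20 countries.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))

window, n_components = 180, 12
approximations = []
for start in range(X.shape[0] - window + 1):
    block = X[start:start + window]
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(block)
    # Approximate the last day of the window from the first 12 components.
    last_day = pca.inverse_transform(scores[-1:])[0]
    approximations.append(last_day)

approximations = np.array(approximations)
print(approximations.shape)  # (221, 20): one approximated day per window
```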
To reconstruct the original data matrix, all the principal components are used (instead of just 12).
There is no need to combine the different outputs of Python's PCA by hand to
reconstruct the matrix, since it provides an inverse transform that already
does that.
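Assuming scikit-learn's PCA, the point above can be shown in a few lines: with all components retained, `inverse_transform` undoes the projection (including the centering) exactly, so the matrix never has to be rebuilt from scores and loadings manually:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical usage matrix: 180 days x 5 countries.
rng = np.random.default_rng(1)
X = rng.normal(size=(180, 5))

pca = PCA()  # n_components=None keeps every component
reconstructed = pca.inverse_transform(pca.fit_transform(X))
print(np.allclose(X, reconstructed))  # True
```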