Implementation

Some details of this implementation differ from the paper and from the original R code.

To compare the graphs shown in the paper with those generated by the implementations, the "training" data runs from the first date that the metrics website provides (2011-03-06) to the approximate date when the paper was written, i.e. 2018-04-01.

This data, together with the intermediate matrices computed to find the anomalies, is used in the tests.

The original code generates 3 rankings and 4 graphs for countries with anomalous behaviour in the intervals:

  • last day
  • last week
  • last month
  • all the data available.

Instead, this code generates a single ranking and a single graph for a given date interval (1 day by default).

Graphs

The paper includes this graph:

2017-04-2018-04-top_10_countries source: 2017-04-2018-04-top_10_countries.svg

Graphs generated with the original code:

2018-03-31-01day source: 2018-03-31-01day.svg

2018-03-31-07day source: 2018-03-31-07day.svg

2018-03-31-30day source: 2018-03-31-30day.svg

2012-2018-all_countries source: 2012-2018-all_countries.svg

Generated by this code:

2018-04-01-plot source: 2018-04-01-plot.svg

Data preparation

Starting from Tor’s per-country usage data, we initially remove all
countries whose usage never rises above 100 users, to avoid the
unacceptably high variance in such data.

The code removes countries whose usage never rises above 1000 users.
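The filtering step can be sketched as follows. This is a minimal illustration, not the actual implementation: the helper name `drop_low_usage`, the dictionary layout and the sample numbers are all assumptions made up for the example. It only shows how the threshold (100 in the paper, 1000 in the code) changes which countries survive.

```python
# Hypothetical helper: `usage` maps a country code to its list of daily user counts.
def drop_low_usage(usage, threshold=1000):
    """Keep only countries whose usage rises above `threshold` at least once."""
    return {
        country: series
        for country, series in usage.items()
        if max(series) > threshold
    }

# Made-up sample data, for illustration only.
usage = {
    "aa": [50, 80, 120],       # peaks above 100 but never above 1000
    "bb": [900, 1500, 1200],   # peaks above 1000
    "cc": [10, 20, 30],        # stays below both thresholds
}

kept_paper = drop_low_usage(usage, threshold=100)   # threshold quoted from the paper
kept_code = drop_low_usage(usage, threshold=1000)   # threshold used by the code

print(sorted(kept_paper))  # ['aa', 'bb']
print(sorted(kept_code))   # ['bb']
```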

When removing seasonality, Python's STL and R's stl do not produce identical values, although the results are close.

The same happens when removing countries with constant values.

Principal Components Analysis

Python's PCA and R's prcomp do not produce identical values, although the results are close.
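Because the two implementations agree only approximately, intermediate matrices have to be compared with a tolerance rather than with strict equality. The sketch below shows one way to do that with the standard library. The function name `matrices_close`, the tolerances and the sample numbers are assumptions chosen for illustration; they are not real outputs of either library.

```python
import math

def matrices_close(a, b, rel_tol=1e-3, abs_tol=1e-6):
    """Return True if two equally shaped nested lists agree within tolerance."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True

# Made-up numbers for illustration only, not real outputs of R or Python.
from_r = [[0.5012, -1.2040], [2.0001, 0.3333]]
from_python = [[0.5013, -1.2041], [2.0002, 0.3334]]

exact_match = from_r == from_python
approx_match = matrices_close(from_r, from_python)

print(exact_match)   # False
print(approx_match)  # True
```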

Reconstruction

For each 180-day period in the dataset we apply a principal
component analysis over the usage time series for all countries,
resulting in a set of components for that time window. Taking
the true observed usage for each country for the final day of each
window, we calculate the approximated value from the first 12
principal components

The code takes the first day of the first 180-day window and the last day of each remaining window (?).
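The windowed reconstruction described in the paper can be sketched with NumPy as shown below. This follows the paper's wording (last day of every window, first 12 components) and is not taken from either code base. The function names, the SVD-based PCA and the synthetic data are assumptions for illustration.

```python
import numpy as np

def reconstruct_last_day(window, n_components=12):
    """Approximate the final row of `window` (days x countries) from the
    first `n_components` principal components of that window."""
    mean = window.mean(axis=0)
    centered = window - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    scores = centered[-1] @ axes.T   # project the last day onto the axes
    return scores @ axes + mean      # map back to country space

def rolling_reconstruction(data, window_size=180, n_components=12):
    """Return (observed, approximated) arrays for the last day of every window."""
    observed, approximated = [], []
    for end in range(window_size, data.shape[0] + 1):
        window = data[end - window_size:end]
        observed.append(window[-1])
        approximated.append(reconstruct_last_day(window, n_components))
    return np.array(observed), np.array(approximated)

# Synthetic data for illustration: 200 days, 20 countries.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 20))

observed, approximated = rolling_reconstruction(data)
residuals = observed - approximated
print(residuals.shape)  # (21, 20)
```

With 200 days and a 180-day window there are 21 windows, so one residual row is produced per window.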

Based on this heuristic,
our experimental results suggest twelve principal components as
broadly optimal across the dataset.

To reconstruct the original data matrix, all the principal components are used (instead of just 12).

There is no need to combine the separate outputs of Python's PCA to reconstruct the matrix, since it provides an inverse function that already does that.
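What such an inverse function computes can be sketched as follows. The sketch assumes that "Python's PCA" refers to scikit-learn's `PCA`, whose `inverse_transform` method (without whitening) multiplies the scores by the component axes and adds the mean back. To stay self-contained, the example reproduces that arithmetic with NumPy on synthetic data; the local `inverse_transform` function is a hypothetical stand-in, not the library call.

```python
import numpy as np

# Synthetic data for illustration: 50 days, 5 countries.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))

mean = X.mean(axis=0)
centered = X - mean
# Rows of vt are the principal axes, ordered by decreasing variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

def inverse_transform(scores, axes, mean):
    """Map component scores back to the original feature space."""
    return scores @ axes + mean

# All components: the round trip recovers X up to floating-point error.
full = inverse_transform(centered @ vt.T, vt, mean)
# Truncated to 2 components: only an approximation of X.
partial = inverse_transform(centered @ vt[:2].T, vt[:2], mean)

print(np.allclose(full, X))     # True
print(np.allclose(partial, X))  # False
```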