How fast the Internet is!
Extend the ETL pipeline!
Imagine being able to finally know how fast the Internet is.
You will think… but of course I know! I connect through my 1 Gbps provider…
But what about the rest of cyberspace…?
Not with your neighbor, but with the REAL Internet, the network of networks!!!
We ask around… does anyone actually know? … search, search, search…

By chance, at a conference of GARR (the Italian National Research and Education Network), we found ourselves in front of a data scientist who told us about this wonderful dataset[1].

And here it is: a project born in 2009 from an idea by Vinton Cerf.
And the data is public, released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License…!!!

Since 2009, M-Lab has been collecting speed tests run by users around the world, or by ISPs. Among other things, the tests tell me the upload and download speed, give me transmission parameters such as RTT and retransmission rate, diagnose possible bottlenecks, geolocate the clients and servers involved (not very precisely, but still useful), tell me which Autonomous System the client and the server belong to, and do all of this in off-net mode.

“…M-Lab’s measurements are always conducted off-net. This way, M-Lab is able to measure performance from testers’ computers to locations where popular Internet content is often hosted. By having inter-network connections included in the test, test users get a real sense of the performance they could expect when using the Internet…”[2].
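To give a concrete idea of what the dataset exposes, here is a minimal sketch of a query against the public BigQuery tables using the google-cloud-bigquery client. The table and column names follow M-Lab's unified NDT download view as we understand it and should be checked against the current schema; the date range and country filter are purely illustrative.

```python
# A minimal sketch (assumptions, not the exact query we used): pulling a few
# fields from M-Lab's public NDT data with the google-cloud-bigquery client.
# The table `measurement-lab.ndt.unified_downloads` and the column names
# reflect the unified views and should be checked against the current schema.
from google.cloud import bigquery

bq = bigquery.Client()  # uses your own GCP project and credentials

QUERY = """
SELECT
  date,
  client.Geo.CountryCode AS country,
  client.Network.ASNumber AS client_asn,
  a.MeanThroughputMbps   AS download_mbps,
  a.MinRTT               AS min_rtt_ms
FROM `measurement-lab.ndt.unified_downloads`
WHERE date BETWEEN DATE '2019-01-01' AND DATE '2019-01-31'
  AND client.Geo.CountryCode = 'IT'
LIMIT 100
"""

for row in bq.query(QUERY).result():
    print(row.date, row.country, row.client_asn,
          row.download_mbps, row.min_rtt_ms)
```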
But there’s a problem!
The data stored on Google can be queried and browsed with BigQuery, but I cannot extract it and transfer it to my own server/PC beyond roughly 16,000 records at a time.
https://cloud.google.com/bigquery/quotas
How can I extract more records?
And now?
My friends and I thought about it a bit …
There are many possible solutions out there; ours came to us during a Master's in Cybersecurity run by CNR and the University of Pisa, in a Cyber Intelligence course[3][4]:
extend the ETL pipeline!!!

But not with the usual tools (long live open source!).
We built a new chain for data extraction.
But first we set up a machine: lots of RAM, computing power and fast storage[5].
We then relied on a Python script to read the data.
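The script itself is not reproduced here, so what follows is only a minimal sketch of the kind of extractor we mean, under a couple of assumptions: that the raw test archives sit in M-Lab's public Google Cloud Storage bucket (the bucket name and prefix layout below are illustrative) and that NiFi drives the script through something like its ExecuteStreamCommand processor, consuming whatever the script writes to standard output.

```python
# A minimal sketch (not the actual script) of a Python extractor NiFi could
# drive, e.g. via its ExecuteStreamCommand processor: it lists the raw test
# archives under a prefix of M-Lab's public Google Cloud Storage bucket and
# writes one line per object to standard output for the downstream flow.
# Bucket name and prefix layout are assumptions to be checked.
import sys
from google.cloud import storage

BUCKET = "archive-measurement-lab"   # assumed public M-Lab archive bucket
PREFIX = "ndt/2019/01/01/"           # assumed layout: <test>/<year>/<month>/<day>/

def main() -> None:
    gcs = storage.Client.create_anonymous_client()  # public data, no credentials
    for blob in gcs.list_blobs(BUCKET, prefix=PREFIX):
        # Emit object name and size; a later step fetches and unpacks each archive.
        sys.stdout.write(f"{blob.name}\t{blob.size}\n")

if __name__ == "__main__":
    main()
```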

The script is driven by Apache NiFi, a wonderful tool for automating data flows between software systems, based on the NiagaraFiles software developed by the NSA and open-sourced in 2014.
Behind a web-based interface, Apache NiFi runs on a cluster and provides real-time control that simplifies the management of data flows between sources and destinations. It supports the most disparate data formats, schemas and protocols and can handle messages of arbitrary size.
For this purpose it offers over 200 processors (including processors for Kafka and Flume), as well as configurable back-pressure thresholds on connections.
The data read and extracted from Google Cloud Storage are stored locally, uploaded to Elasticsearch and displayed with Kibana.
And here is the data we wanted: stored in JSON format in Elasticsearch and displayed graphically, so that we can view, compare and analyze the measurements gathered across the Internet by the speed tests collected by the M-Lab project.
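In the pipeline this last hop is handled by NiFi itself, but to make the step concrete, here is a standalone sketch of the equivalent operation in Python: reading the locally stored newline-delimited JSON and bulk-indexing it into Elasticsearch. The host, file and index names are illustrative assumptions.

```python
# A standalone sketch of the final hop (in the real flow NiFi handles this):
# bulk-indexing locally stored newline-delimited JSON into Elasticsearch so
# Kibana can chart it. Host, file and index names are illustrative.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ES_HOST = "http://localhost:9200"    # assumed local Elasticsearch node
INDEX = "mlab-ndt"                   # illustrative index name
NDJSON_FILE = "ndt_results.ndjson"   # file produced by the extraction step

def actions():
    """Yield one bulk action per JSON line in the local file."""
    with open(NDJSON_FILE, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield {"_index": INDEX, "_source": json.loads(line)}

es = Elasticsearch(ES_HOST)
ok, errors = bulk(es, actions())
print(f"indexed {ok} documents, {len(errors)} errors")
```

An index pattern on that index is then all Kibana needs to build the comparison dashboards.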

We wanted to talk to you about how quickly human rights are carried onto the Internet, but unfortunately we had to start with bits and bytes…
Until the next article from the cyberattack collective[6]
[1] https://www.eventi.garr.it/it/conf19/programma/29-conferenza-garr-2019/lightning-talk/438-stefania-del-prete
[2] https://www.measurementlab.net/faq/
[3] https://www.iit.cnr.it/en/maurizio.tesconi
[4] https://www.iit.cnr.it/en/tiziano.fagni
[5] Andrea Casalegno — CTO — Top-IX, Andrea Rivetti — Program Manager — Top-IX
[6] https://it.linkedin.com/in/nadiaspitilli, https://www.linkedin.com/in/matteo-chesi-927071180, https://www.linkedin.com/in/rodolfo-boraso-22a8311/