APIBackuper: a command-line tool to archive/backup data API calls

Ivan Begtin
4 min read · Apr 29, 2022


A lot of data is published as APIs that are iterable by page or skip parameters. Sometimes this is documented, sometimes not, but quite often it is impossible to get the whole dataset without iterating over all pages/data chunks.

Alongside other command-line tools for extracting and processing data, several years ago I created an open-source, configuration-file-based tool called APIBackuper. The idea was to have a command-line tool that could extract and export data as JSON lines from any API endpoint.

I will use the undocumented API from the Berlin airport website as an example.

This API is undocumented, but you can find it using Chrome developer tools (or similar tools in other browsers) on the page https://ber.berlin-airport.de/en/flying/departures-arrivals.html

There you will find the API endpoint https://ber.berlin-airport.de/api.flights.json returning JSON data, for example:

https://ber.berlin-airport.de/api.flights.json?arrivalDeparture=A&dateFrom=2022-04-28T00:00:00&dateUntil=2022-04-29&search=&lang=en&page=1&terminal=

You will notice that all data is in the “data.items” array and the total number of items is given as “data.total_items”. Each record also has a unique identifier, id, and the default page size is 20.

This information is reused when you define the API extraction configuration.

APIBackuper configuration file

APIBackuper uses a configuration file, apibackuper.cfg, that defines the API endpoint and how to extract data from it efficiently.

For example, the configuration file for the Berlin airport API looks like this:

apibackuper.cfg
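The embedded config file is not reproduced here, so below is an illustrative sketch of what it could contain, assembled from the options described in the sections that follow. Section and key names match the option list below; the project name is an assumption.

```ini
; Illustrative apibackuper.cfg for the Berlin airport flights API.
; Key names follow the options documented below; values are taken
; from the example API (page size 20, page-based iteration).
[project]
name = berlin-airport   ; assumed project name
url = https://ber.berlin-airport.de/api.flights.json
http_mode = GET
iterate_by = page

[params]
start_page = 1
page_size = 20
page_number_param = page

[data]
total_number_key = data.total_items
data_key = data.items
item_key = id
```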

I will explain the options in each key section.

project

  • url — full URL of the API endpoint, without any parameters (parameters are defined in the params.json file)
  • http_mode — GET or POST. This example uses GET requests
  • iterate_by — two modes are supported: iteration by page or by skip number. This API uses page-based iteration

params

  • start_page — start page, default is 1
  • page_size_param — if set, names the URL parameter that defines the page size. Not required for this API
  • page_size — size of a single page, 20 for this API
  • page_number_param — URL parameter holding the page number. This API uses the parameter “page”

data

  • total_number_key — key in the server response with the total number of records; here it is “data.total_items”
  • data_key — key with the data items, a path to the JSON array. It is “data.items” for this API
  • item_key — unique identifier of a record, if one exists. Here it is id
  • change_key — key used to detect whether a record has changed. Not used right now, reserved for future use
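The dotted keys above are paths into the nested JSON response. A minimal sketch of how “data.items” and “data.total_items” resolve against a response (the helper and the trimmed response below are illustrative, not part of APIBackuper; field values are assumed):

```python
def resolve(obj, dotted_key):
    """Walk a nested dict following a dot-separated key path."""
    for part in dotted_key.split("."):
        obj = obj[part]
    return obj

# A trimmed response in the shape the Berlin airport API returns;
# the total and ids here are made-up example values.
response = {
    "data": {
        "total_items": 233,
        "items": [{"id": "fl-1"}, {"id": "fl-2"}],
    }
}

print(resolve(response, "data.total_items"))       # 233
print(resolve(response, "data.items")[0]["id"])    # fl-1
```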

To request data from the server we need to define the API endpoint parameters. Parameters are defined in a params.json file in the same directory as apibackuper.cfg.

params.json
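The embedded params.json is not reproduced here; a sketch of what it could look like, with keys and values taken directly from the query string of the example URL above:

```json
{
  "arrivalDeparture": "A",
  "dateFrom": "2022-04-28T00:00:00",
  "dateUntil": "2022-04-29",
  "search": "",
  "lang": "en",
  "page": 1,
  "terminal": ""
}
```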

The page parameter will be replaced on every request, since it is defined as page_number_param in the apibackuper.cfg file.

Running project

Use the command apibackuper run to start data collection; run it in the directory containing the apibackuper.cfg file. The tool logs all requests to the console and to the file apibackuper.log in the same directory.

Console log of APIBackuper

The results of each request are stored in the storage sub-directory and the file storage.zip.

Each request's response is stored as a separate JSON file: page_1.json, page_2.json, …, page_12.json.
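The page count follows from the total item count and the page size of 20. The exact total is not shown in the post, so the value below is an assumed example consistent with the 12 pages collected here:

```python
import math

PAGE_SIZE = 20       # default page size of this API
total_items = 233    # hypothetical value read from data.total_items

# Ceiling division: any partial last page still needs one request.
pages = math.ceil(total_items / PAGE_SIZE)
print(pages)  # 12
```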

Now we need to create a dataset from the collected data.

Export data

After the data is collected, APIBackuper supports exporting it. Use the command apibackuper export jsonl data.jsonl: it will export data from the storage.zip file, writing each record as an individual line.
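Conceptually, the export step reads each page_N.json from storage.zip, pulls the records out of the configured data_key (“data.items”), and writes one JSON object per line. A minimal hand-rolled sketch of that logic, assuming the file layout described above (this is illustrative, not APIBackuper's actual implementation):

```python
import json
import zipfile

def export_jsonl(archive_path, out_path, data_key="data.items"):
    """Flatten page_N.json files inside a zip archive into JSON lines."""
    with zipfile.ZipFile(archive_path) as zf, \
            open(out_path, "w", encoding="utf-8") as out:
        # Note: lexical ordering (page_10 before page_2); the real tool
        # would order pages numerically.
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue
            payload = json.loads(zf.read(name))
            items = payload
            for part in data_key.split("."):
                items = items[part]
            for record in items:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
```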

data.jsonl

All code and data are available on GitHub: https://github.com/ruarxive/apibackuper-berlin-airport-example

More features

APIBackuper supports many more features. You can use it to extract each record individually if the API provides an item-based endpoint, i.e. when you can get a list of items and then fetch each item separately.

APIBackuper can also download all files referenced by URLs in the JSON data, using predefined patterns.

APIBackuper can estimate the time data extraction will take, and much more.

I use it regularly to collect data from hundreds of undocumented API endpoints. Configuration files simplify the data extraction process, and the tool can be used for data engineering and data preservation tasks.

Please try it (https://github.com/ruarxive/apibackuper) and share your feedback.


Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, the Modern Data Stack and Open Government. Join my Telegram channel: https://t.me/begtin