“ATK Data” and data import (ingestion)

An a developer sooner or later you’ll will have to add bulk data import into your app.

What may seem like a trivial task could hide some complexities. As a maintainer of ATK Data project, I wanted to talk about some of the challenges and share some ideas on how database abstraction layer could help you importing bulk data.

Raw import vs Model data import

The purpose of ATK Data (and data abstraction in general) is to separate specific data vendor implementation from the rest of the application.

ATK Data takes one step ahead (compared to ORMs) by introducing “DataSets”. This way your application may be working with a sub-set of rows within table/collection which meet certain conditions (such as “records belonging to certain user” or “excluding deleted rows”).

That means — when you are importing bulk data into DataSet it must conform to conditions you have defined:

$client = new Client($db);
$client->loadBy('name', 'John');
$orders = $user->ref('Orders');
$orders->addCondition('status', 'draft');
$orders->import($raw_order_data);

In this scenario, regardless of the contents of $raw_order_data it’s guaranteed that all newly added records will belong to client “John” and will have “status”=”draft”. Obviously when importing data like this, you also benefit from data persistence abstraction — after all client data may be stored inside RestAPI or NoSQL.

In contrast if you use SQL specific-command such as LOAD DATA INFILE, then you would have to verify condition of newly added data yourself with a specific implementation for your specific database.

Depending on your project guidelines you can make a choice to go for raw import vs model data import.

In this article, I wanted to discuss some ideas what other benefits you can developer gain if you use Model Data Import. Please give me your feedback below, I’ll review it and reply.

Field Value Lookups

Quite often your import data will contain de-normalised data: e.g. could be name of the country or clients name. Your physical database won’t be storing “country_name” for each imported data row, instead it should be replaced by corresponding “id” and placed in “country_id” field.

One approach is to fetch all countries from the database and create a map between country names and IDs. This may significantly slow down your process (especially if the lookup table is large) and will increase memory consumption.

ATK Data already uses a different approach. Assuming the following code:

// Model Client, init() method:
$this->hasOne('country_id', new Country())
->addTitle();
// Import statement:
$cl = new Client($db);
$cl->import([
['name'=>'Janis', 'country'=>'Latvia'],
['name'=>'John', 'country'=>'United Kingdom'],
]);

..the value of country_id in SQL is replaced by expression: (select id from country where name=':a'). Additionally if related “Country” model has any conditions inside inside init() code, those will also be honoured. This is specifically useful to avoid cross-account record referencing (if country is user-specific and country.name=’Latvia’ is owned multiple users)

Specifying Lookup Fields

We are adding new ways to lookup related records by using other fields:

// Model Client, init() method:
$this->hasOne('country_id', new Country())
->withTitle()
->addField('country_code', 'code');
// Import statement
$cl = new Client($db);
$cl->import([
['name'=>'Janis', 'country_code'=>'LV'],
['name'=>'John', 'country_code'=>'GB'],
]);

This approach can lookup by any field and even use multiple fields in the lookup. If the corresponding record is not found, then null will be used and if multiple records match, then the first match will be used (so make sure you lookup using unique fields).

Importing hasMany relationships

NoSQL databases are often used to store hierarchical data. For example when you import Invoice records, you may want to specify “Lines” along with totals. This works well in NoSQL but is not trivial in SQL. ATK Data offers several ways around this.

Map hasMany relationship into sub-table

Consider this definition:

// Invoice Model. init() method.
$this->addField('ref');
$this->hasMany('Lines', new InvoiceLine())
->addField('total', ['aggregate'=>'sum'];
// InvoiceLine Model. init() method.
$this->hasOne('invoice_id', new Invoice());
$this->addField('price', ['type'=>'money']);
$this->addField('qty', ['type'=>'integer']);
$this->addExpression('total', '[price]*[qty]');
// Import code:
$inv = new Invoice($db);
$inv->import([
['ref'=>'INV1', 'Lines'=>[
['price'=>15.99, 'qty'=>2],
['price'=>7.25, 'qty'=>1]
],
['ref'=>'INV2', 'Lines'=>[
['price'=>25.99, 'qty'=>5],
]
]);

ATK Data will automatically populate Invoice and InvoiceLine records and will link them together. This is quite handy if you are use API or import JSON objects.

I should probably note, that if you use beforeSave / beforeInsert (or similar other hooks), those will be properly executed making sure your business logic is not circumvented.

Serialised Data Store

Another approach is to encode Invoice Line data and store it as a JSON object inside your SQL field. ATK Data supports this approach also. The only modification from above code should be:

$this->hasMany('Lines', [new InvoiceLine(), 'serialize'=>'json']);
// ->addField('total', ['aggregate'=>'sum'];
// can't aggregate non-sql data. yet.

The same import statements will work. Also you can write code like this:

$invoice->ref('Lines')->insert(['qty'=>3, 'price'=>4.99]);

This will, in fact update the serialised field. To keep the multiplication logic, you can amend ‘total’ expression into this:

$this->addExpression('total', function($m) { 
return $m['price']*$m['qty'];
});

This will carry out multiplication on the PHP side.

Many to Many relationship handling

The most trouble you normally get from implementation of many-to-many relationship. Imagine a scenario where you work with “Ticket” and user can assign one or multiple “Label” to a ticket (like in the Github.com).

First we define our model and references. This time I’ll use a full code snippet:

class Ticket extends \atk4\data\Model {
public $table = 'ticket';
public $title = 'title';
  function init() {
parent::init();
    $this->addField('title')
$this->hasMany('Labels', new TicketLabel())
->addField('labels', ['concat'=>',']);
}
}
class Label extends \atk4\data\Model {
public $table = 'label';

function init() {
parent::init();
    $this->addField('name');
}
}
class TicketLabel extends \atk4\data\Model {
public $table = 'ticket_label';
public $title = 'label';
  function init() {
parent::init();
    $this->hasOne('ticket_id', new Ticket())
->addTitle();
    $this->hasOne('label_id', new Label())
->addTitle();
}
}

Few points to note. First — I don’t really need a reference from Label to anywhere, so I’m not including it. In TicketLabel model, I’m importing two title fields — “ticket” and “label”. The “label” is also used as a title field for TicketLabel and will further be used for “Issue” field “labels”.

With that you can fetch easily get list of labels, when you’re listing Tickets:

foreach ($tickets as $data) {
echo $ticket['title']." ".$ticket['labels']."\n";
}

The definition above can also be used for either importing data or even to modify existing data:

$issue = new Issue($db);
$issue->import([
['title'=>'Broken build', 'labels'=>'build,bug'],
['title'=>'No Spanish translation', 'labels'=>'translation']
]);

Bonus — many-to-many relation in JSON

You may prefer to store many-to-many relations in a JSON field. ATK Data now allows you to do that, simply use:

$this->hasMany('Labels', [
new TicketLabel(),
'serialize'=>'json',
'actual'=>'label_json'
]);

What do you think?

Some (but not all) features are already implemented in https://github.com/atk4/data. I wanted to get more feedback and ideas from my readers before committing more time into implementing and documenting the rest.