# Data Load Design

Key point: this is classic ETL, so let's reuse existing ETL patterns and tooling.

## Logic

```mermaid
graph LR

usercsv["User has CSV, XLS, etc."]
userdr[User has Tabular Data Resource]
dr[Tabular Data Resource]

usercsv -- "1. prepare (ET)" --> dr
userdr -. direct .-> dr
dr -- "2. load" --> datastore[DataStore]
```

In more detail, dividing E and T (extract and transform) from L (load):

```mermaid
graph TD

subgraph "Prepare (ET)"
  rawcsv[Raw CSV] --> tidy[Tidy]
  tidy --> infer[Infer types]
end

infer --> tdr{{Tabular Data Resource<br/>csv/json + table schema}}
tdr --> dsdelete

subgraph "Loader (L)"
  datastorecreate[DataStore Create]
  dsdelete[DataStore Delete]
  load[Load to CKAN via DataStore API or direct copy]

  dsdelete --> datastorecreate
  datastorecreate --> load
end
```
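
As an illustration of the "infer types" step, here is a minimal sketch using the tableschema library from the Frictionless ecosystem; the file name is illustrative:

```python
# Minimal sketch of the "infer types" step, assuming the tableschema library
# (pip install tableschema); data.csv is an illustrative file name.
from tableschema import Table

table = Table("data.csv")
table.infer()  # samples rows and infers a Table Schema

# The inferred schema becomes part of the Tabular Data Resource descriptor,
# e.g. {'fields': [{'name': 'a', 'type': 'integer'}, ...]}
print(table.schema.descriptor)
```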

The load step in even more detail:

```mermaid
graph TD

tdr["Tabular Data Resource on disk<br/>(from the resource's CSV in the FileStore)"]
loadtdr[Load Tabular Data Resource metadata]

dscreate[Create table in DS if not exists]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG COPY]
done[Data in DataStore]

tdr --> loadtdr
loadtdr --> dscreate
dscreate --> cleartable

cleartable --> pushdatacopy

pushdatacopy --> done

logstore[LogStore]

cleartable -. log .-> logstore
pushdatacopy -. log .-> logstore
```
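
A minimal sketch of this load step in Python, assuming direct Postgres access with psycopg2; the function name is illustrative, and real column types would come from the Table Schema:

```python
# Sketch of the load step: create table if not exists, clear it, bulk-load
# via PG COPY. Assumes psycopg2 and direct access to the DataStore database.
import psycopg2
from psycopg2 import sql

def load_csv_to_datastore(csv_path, resource_id, columns, dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # 1. Create table in DS if not exists (all columns as text here
            #    for brevity; real types would come from the Table Schema)
            cur.execute(sql.SQL("CREATE TABLE IF NOT EXISTS {} ({})").format(
                sql.Identifier(resource_id),
                sql.SQL(", ").join(
                    sql.SQL("{} text").format(sql.Identifier(c)) for c in columns
                ),
            ))
            # 2. Clear DS table if existing content
            cur.execute(sql.SQL("TRUNCATE {}").format(sql.Identifier(resource_id)))
            # 3. Load to DS via PG COPY (the fast bulk path)
            copy_stmt = sql.SQL(
                "COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)"
            ).format(sql.Identifier(resource_id)).as_string(cur)
            with open(csv_path) as f:
                cur.copy_expert(copy_stmt, f)
```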

## Runner

We will use Apache Airflow.
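
A minimal sketch of what the runner DAG could look like, assuming Airflow 2.x; the DAG id, task names, and callables are illustrative, not a settled interface:

```python
# Illustrative Airflow DAG wiring the Prepare (ET) and Load (L) stages.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def prepare(**context):
    # Tidy the raw CSV and infer types, producing a Tabular Data Resource.
    ...

def load(**context):
    # Push the Tabular Data Resource into the CKAN DataStore.
    ...

with DAG(
    dag_id="ckan_data_load",
    start_date=datetime(2019, 9, 1),
    schedule_interval=None,  # triggered per resource, not on a schedule
) as dag:
    prepare_task = PythonOperator(task_id="prepare", python_callable=prepare)
    load_task = PythonOperator(task_id="load", python_callable=load)
    prepare_task >> load_task
```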

## Research

### What is a Tabular Data Resource?

See the Frictionless Specs. For our purposes, a Tabular Data Resource is the data file itself (e.g. a CSV) plus a descriptor that carries a Table Schema and encoding information.
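
For illustration, a minimal descriptor might look like this (the resource name, path, and field names/types are illustrative):

```json
{
  "name": "my-resource",
  "path": "data.csv",
  "profile": "tabular-data-resource",
  "encoding": "utf-8",
  "schema": {
    "fields": [
      {"name": "a", "type": "string"},
      {"name": "b", "type": "integer"}
    ]
  }
}
```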

NB: even if you take the direct loading route (a la XLoader) and ignore types, you still need encoding etc. sorted out -- and it still fits in the diagram above (the Table Schema is just trivial: every field is a string).

### What is the DataStore and how to create a DataStore entry

https://github.com/ckan/ckan/tree/master/ckanext/datastore

To create an entry:

```bash
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
    "resource": {"package_id": "{PACKAGE-ID}"},
    "fields": [{"id": "a"}, {"id": "b"}]
  }'
```

https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
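
The same call from Python, assuming the requests library; the URL, API key, and package id are placeholders, as in the curl example:

```python
import requests

# Call datastore_create through CKAN's action API.
response = requests.post(
    "http://127.0.0.1:5000/api/3/action/datastore_create",
    headers={"Authorization": "{YOUR-API-KEY}"},
    json={
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [{"id": "a"}, {"id": "b"}],
    },
)
response.raise_for_status()
print(response.json()["result"]["resource_id"])
```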

### Options for Loading

There are 3 different paths we could take:

```mermaid
graph TD

pyloadstr[Load in Python in streaming mode]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG COPY]
pushdatads[Load to DS via DataStore API]
pushdatasql[Load to DS via SQL over the PG API]
done[Data in DataStore]
dataflows[DataFlows SQL loader]

cleartable -- 1 --> pyloadstr
pyloadstr --> pushdatads
pyloadstr --> pushdatasql
cleartable -- 2 --> dataflows
dataflows --> pushdatasql
cleartable -- 3 --> pushdatacopy

pushdatasql --> done
pushdatacopy --> done
pushdatads --> done
```

### Pros and Cons of the different approaches

| Criteria | DataStore Write API | PG COPY | DataFlows |
| --- | --- | --- | --- |
| Speed | Low | High | ??? |
| Error reporting | Yes | Yes | No (?) |
| Ease of implementation | Yes | No (?) | Yes |
| Works with big data | No | Yes | Yes (?) |
| Works well in parallel | No | Yes (?) | Yes (?) |

### DataFlows

https://github.com/datahq/dataflows

DataFlows is a framework for loading, processing, and manipulating data.
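
A sketch of option 2 above, assuming the dataflows library's load and dump_to_sql processors; the file name, table name, and connection string are illustrative:

```python
from dataflows import Flow, load, dump_to_sql

Flow(
    # Read and parse the source file into a stream of typed rows.
    load("data.csv", name="res"),
    # Write the stream into a Postgres table via SQL.
    dump_to_sql(
        {"my_datastore_table": {"resource-name": "res"}},
        engine="postgresql://user:pass@localhost/datastore",
    ),
).process()
```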

### Notes and Q&A (Sep 2019)

Rough timings (on 1M rows):

- Raw INSERTs: ~15 min
- INSERTs wrapped in a single BEGIN/COMMIT: ~5 min
- COPY: ~82 s (though it may be limited by bandwidth -- and what happens if the pipe breaks?)

Q: Is it better to put everything into the DB as strings and cast later, or to cast first and then insert into the DB? A: Probably cast first and insert after.

Q: Why do we rush to insert the data into the DB? We will have to wait until it is cast anyway before we can use it. A: It is much faster to do operations in the DB than outside it.