# Data Load Design

Key point: this is classic ETL, so we should reuse established ETL patterns and tooling.

## Logic

```mermaid
graph LR

usercsv[User has CSV, XLS, etc.]
userdr[User has Tabular Data Resource]
dr[Tabular Data Resource]

usercsv -- 1. some steps --> dr
userdr -. direct .-> dr
dr -- 2. load --> datastore[DataStore]
```

In more detail, dividing the ET (extract/transform) steps from the L (load) step:

```mermaid
graph TD

subgraph "Prepare (ET)"
rawcsv[Raw CSV] --> tidy[Tidy]
tidy --> infer[Infer types]
end

infer --> tdr{{Tabular Data Resource<br/>csv/json + table schema}}
tdr --> dsdelete

subgraph "Loader (L)"
dsdelete[DataStore Delete]
datastorecreate[DataStore Create]
load[Load to CKAN via DataStore API or direct copy]

dsdelete --> datastorecreate
datastorecreate --> load
end
```

### Load step in even more detail

```mermaid
graph TD

tdr[Tabular Data Resource on disk<br/>from the CSV in the FileStore of a resource]
loadtdr[Load Tabular Data Resource metadata]

dscreate[Create table in DS if not exists]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG COPY]
done[Data in DataStore]
logstore[LogStore]

tdr --> loadtdr
loadtdr --> dscreate
dscreate --> cleartable
cleartable --> pushdatacopy
pushdatacopy --> done

cleartable -. log .-> logstore
pushdatacopy -. log .-> logstore
```
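
A minimal sketch of this load step in Python with psycopg2. The table name, field list, DSN and CSV layout are illustrative assumptions, not CKAN internals:

```python
# Sketch of the Loader (L) step: create the table if missing, clear it,
# then bulk-load via PG COPY. All names and the DSN are placeholders.
import psycopg2

def load_tdr(csv_path, table, fields, dsn="postgresql:///datastore"):
    """fields: list of (column_name, postgres_type) pairs from the Table Schema."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cols = ", ".join('"%s" %s' % (name, pgtype) for name, pgtype in fields)
            # Create table in DS if not exists
            cur.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, cols))
            # Clear DS table if existing content
            cur.execute('TRUNCATE "%s"' % table)
            # Load to DS via PG COPY
            with open(csv_path) as f:
                cur.copy_expert(
                    'COPY "%s" FROM STDIN WITH (FORMAT CSV, HEADER true)' % table, f
                )
        conn.commit()  # data in DataStore
    finally:
        conn.close()
```

Real code would quote identifiers properly (e.g. via `psycopg2.sql`) rather than interpolating strings as above.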

## Runner

We will use Airflow.
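
For illustration, the prepare/load split above could map onto a two-task DAG roughly like this (Airflow 1.10-era API; the DAG id, schedule and task bodies are placeholders):

```python
# Hypothetical two-task DAG: prepare (ET) then load (L).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def prepare():
    """Tidy the raw CSV, infer types, emit a Tabular Data Resource."""

def load():
    """Push the Tabular Data Resource into the DataStore."""

dag = DAG(
    dag_id="datastore_load",          # placeholder name
    start_date=datetime(2019, 9, 1),
    schedule_interval=None,           # triggered per resource, not on a schedule
)

prepare_task = PythonOperator(task_id="prepare", python_callable=prepare, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

prepare_task >> load_task
```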

## Research

### What is a Tabular Data Resource?

See the Frictionless Specs. For our purposes it is:

* A "good" CSV file: valid CSV with a single header row, no blank headers, etc.
* Encoding worked out -- usually we should already have converted to UTF-8
* Dialect: https://frictionlessdata.io/specs/csv-dialect/
* Table Schema: https://frictionlessdata.io/specs/table-schema

NB: even if you take the direct loading route (a la XLoader) and ignore types, you still need the encoding etc. sorted out -- and it still fits the diagram above (the Table Schema is just trivial: every field is a string).
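
For concreteness, a minimal Tabular Data Resource descriptor covering the points above might look like this (shown as a Python dict; the name and fields are invented):

```python
# A minimal (hypothetical) Tabular Data Resource descriptor: a "good" CSV
# plus its encoding, dialect and Table Schema.
descriptor = {
    "profile": "tabular-data-resource",
    "name": "my-resource",
    "path": "data.csv",
    "encoding": "utf-8",
    "dialect": {"delimiter": ",", "header": True},
    "schema": {
        "fields": [
            {"name": "date", "type": "date"},
            {"name": "price", "type": "number"},
        ]
    },
}
```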

### What is the DataStore and how to create a DataStore entry

https://github.com/ckan/ckan/tree/master/ckanext/datastore

* Provides an ad hoc database for storage of structured data from CKAN resources
* Connection with DataPusher: https://docs.ckan.org/en/2.8/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore
* DataStore API: https://docs.ckan.org/en/2.8/maintaining/datastore.html#the-datastore-api
* Making DataStore API requests: https://docs.ckan.org/en/2.8/maintaining/datastore.html#making-a-datastore-api-request

#### Create an entry

```
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
    "resource": {"package_id": "{PACKAGE-ID}"},
    "fields": [ {"id": "a"}, {"id": "b"} ]
  }'
```

https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
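
The same call from Python, as a sketch using requests (URL, API key and fields are placeholders, as above):

```python
# Sketch: datastore_create via the CKAN action API from Python.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/3/action/datastore_create",
    headers={"Authorization": "{YOUR-API-KEY}"},
    json={
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [{"id": "a"}, {"id": "b"}],
    },
)
resp.raise_for_status()
print(resp.json()["result"])  # the created datastore resource, incl. resource_id
```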

### Options for Loading

There are three different paths we could take:

```mermaid
graph TD

cleartable[Clear DS table if existing content]
pyloadstr[Load in Python in streaming mode]
dataflows[DataFlows SQL loader]
pushdatads[Load to DS via DataStore API]
pushdatasql[Load to DS via SQL over PG API]
pushdatacopy[Load to DS via PG COPY]
done[Data in DataStore]

cleartable -- 1 --> pyloadstr
pyloadstr --> pushdatads
pyloadstr --> pushdatasql
cleartable -- 2 --> dataflows
dataflows --> pushdatasql
cleartable -- 3 --> pushdatacopy

pushdatads --> done
pushdatasql --> done
pushdatacopy --> done
```

#### Pros and cons of the different approaches

| Criteria | DataStore Write API | PG COPY | Dataflows |
|----------|---------------------|---------|-----------|
| Speed | Low | High | ??? |
| Error reporting | Yes | Yes | No(?) |
| Ease of implementation | Yes | No(?) | Yes |
| Works with big data | No | Yes | Yes(?) |
| Works well in parallel | No | Yes(?) | Yes(?) |

### DataFlows

https://github.com/datahq/dataflows

DataFlows is a framework for loading, processing and manipulating data.

* Loader (loading from an external source or disk): https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py
* Load to an SQL db (dump processed data): https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/to_sql.py -- see the sketch below
* What is the error reporting? What is the runner system? Does it have a UI? Does it have a queue system?
  * Think datapackage-pipelines takes care of all of these: https://github.com/frictionlessdata/datapackage-pipelines
  * DPP itself is also an ETL framework, just much heavier and a bit more complicated.
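
As referenced in the list above, a minimal Flow chaining the loader and the SQL dumper might look like this (file name, resource name and engine URL are placeholders):

```python
# Sketch of option 2: DataFlows load -> dump_to_sql.
from dataflows import Flow, dump_to_sql, load

Flow(
    load("data.csv", name="my_resource"),  # streams the CSV and infers a Table Schema
    dump_to_sql(
        {"my_table": {"resource-name": "my_resource"}},  # target table <- resource
        engine="postgresql://user:pass@localhost/datastore",
    ),
).process()
```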

### Notes on QA (Sep 2019)

* Note: the TDR needs info on the CKAN Resource source so we can create the right DataStore entry.
* No need to validate, as we assume the input is good...
  * We might still want to do that.
* Pros and cons:
  * Speed
  * Error reporting
    * What happens with COPY if you hit an error (e.g. a bad cast)? See the sketch below.
    * https://infinum.co/the-capsized-eight/superfast-csv-imports-using-postgresqls-copy
    * https://wiki.postgresql.org/wiki/Error_logging_in_COPY
  * Ease of implementation
  * Good with inserting big data
    * Create as strings and cast later...?
    * xloader implementation with the COPY command: https://github.com/ckan/ckanext-xloader/blob/fb17763fc7726084f67f6ebd640809ecc055b3a2/ckanext/xloader/loader.py#L40
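
On the COPY-on-error question above: Postgres COPY is all-or-nothing -- the first bad row (e.g. a failed cast) aborts the statement and the surrounding transaction, with no built-in per-row error logging. A sketch of handling that, reusing the psycopg2 approach from earlier (the DSN, table and LogStore hook are placeholders):

```python
# COPY aborts on the first bad row; catch the error, roll back, and log it.
import psycopg2

conn = psycopg2.connect("postgresql:///datastore")  # placeholder DSN
try:
    with conn.cursor() as cur, open("data.csv") as f:
        cur.copy_expert('COPY "my_table" FROM STDIN WITH (FORMAT CSV, HEADER true)', f)
    conn.commit()
except psycopg2.DataError as err:
    conn.rollback()  # nothing was loaded; no partial state survives
    print("COPY failed:", err)  # in real code, send this to the LogStore
```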

Rough benchmarks (1m rows):

* Raw insert: ~15m
* Insert with begin/commit: ~5m
* COPY: ~82s (though it may be limited by bandwidth) -- and what happens if the pipe breaks?

Q: Is it better to put everything into the DB as strings and cast later, or to cast first and then insert?

A: Probably cast first and insert after.
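
A sketch of the cast-first approach with the tableschema library (the schema and rows are invented examples):

```python
# Cast rows against the Table Schema in Python, so type errors surface
# before anything touches Postgres.
from tableschema import Schema

schema = Schema({"fields": [{"name": "date", "type": "date"},
                            {"name": "price", "type": "number"}]})

for row in [["2019-09-01", "3.5"], ["2019-09-02", "oops"]]:
    try:
        print(schema.cast_row(row))  # typed values, ready to insert
    except Exception as err:  # tableschema raises a CastError here
        print("bad row:", row, "->", err)
```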

Q: Why do we rush to insert the data into the DB? We will have to wait until it's cast anyway before use.

A: It's much faster to do operations in the DB than outside it.