[#796,site,docs][xl]: copy /docs/dms to portaljs
site/content/docs/dms/load/design.md (new file, +183 lines)

# Data Load Design

Key point: this is classic ETL, so let's reuse those patterns and tooling.

## Logic

```mermaid
graph LR

usercsv[User has CSV, XLS, etc.]
userdr[User has Tabular Data Resource]
dr[Tabular Data Resource]

usercsv --1. some steps--> dr
userdr -. direct .-> dr
dr --2. load--> datastore[DataStore]
```

In more detail, dividing E(xtract) and T(ransform) from L(oad):

```mermaid
graph TD

subgraph "Prepare (ET)"
rawcsv[Raw CSV] --> tidy[Tidy]
tidy --> infer[Infer types]
end

infer --> tdr{{Tabular Data Resource<br/>csv/json + table schema}}
tdr --> dsdelete

subgraph "Loader (L)"
datastorecreate[DataStore Create]
dsdelete[DataStore Delete]
load[Load to CKAN via DataStore API or direct copy]

dsdelete --> datastorecreate
datastorecreate --> load
end
```

### Load step in even more detail

```mermaid
graph TD

tdr[Tabular Data Resource on disk, from CSV in FileStore of a resource]
loadtdr[Load Tabular Data Resource metadata]

dscreate[Create table in DS if not exists]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG copy]
done[Data in DataStore]

tdr --> loadtdr
loadtdr --> dscreate
dscreate --> cleartable
cleartable --> pushdatacopy
pushdatacopy --> done

logstore[LogStore]

cleartable -. log .-> logstore
pushdatacopy -. log .-> logstore
```
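
To make that sequence concrete, here is a minimal sketch of the loader in Python, assuming direct PostgreSQL access to the DataStore database; the connection string, table name, and column types are illustrative, and plain `logging` stands in for the LogStore.

```python
# Sketch only: connection string, table name and columns are hypothetical.
import logging

import psycopg2

log = logging.getLogger("loader")  # stand-in for the LogStore

conn = psycopg2.connect("postgresql://user:pass@localhost/datastore")
with conn, conn.cursor() as cur:
    # Create table in DS if not exists, using types from the Table Schema.
    cur.execute('CREATE TABLE IF NOT EXISTS "my_table" (a text, b integer)')

    # Clear DS table if existing content.
    cur.execute('TRUNCATE "my_table"')
    log.info("cleared my_table")

    # Load to DS via PG COPY.
    with open("data.csv") as f:
        cur.copy_expert(
            'COPY "my_table" FROM STDIN WITH (FORMAT csv, HEADER true)', f
        )
    log.info("loaded data.csv into my_table")
conn.close()
```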

## Runner

We will use Apache Airflow.
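
As a rough illustration, the prepare (ET) and load (L) stages above could be wired up as a two-task DAG; the DAG id, task ids, and the two callables below are placeholders, not decided names.

```python
# Sketch only: dag id, task ids and the two callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def prepare(**context):
    """ET: tidy the raw CSV and infer types, producing a Tabular Data Resource."""


def load(**context):
    """L: (re)create the DataStore table and push the data into it."""


dag = DAG(
    dag_id="datastore_load",
    start_date=datetime(2019, 9, 1),
    schedule_interval=None,  # triggered per resource, not on a schedule
)

prepare_task = PythonOperator(task_id="prepare", python_callable=prepare, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

prepare_task >> load_task
```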

## Research

### What is a Tabular Data Resource?

See the Frictionless specs. For our purposes:

* A "Good" CSV file: Valid CSV - with one header row, No blank header etc...
|
||||
* Encoding worked out -- usually we should have already converted to utf-8
|
||||
* Dialect - https://frictionlessdata.io/specs/csv-dialect/
|
||||
* Table Schema https://frictionlessdata.io/specs/table-schema
|
||||
|
||||
NB: even if you want to go direct loading route (a la XLoader) and forget types you still need encoding etc sorted -- and it still fits in diagram above (Table Schema is just trivial -- everything is strings).
|
||||
|
||||
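
For illustration, a minimal Tabular Data Resource descriptor along these lines might look as follows; the resource name, path, and fields are made up.

```python
# A made-up, minimal Tabular Data Resource descriptor:
# a CSV file plus encoding, dialect, and Table Schema.
resource = {
    "name": "my-resource",
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "dialect": {"delimiter": ",", "header": True},
    "schema": {
        "fields": [
            {"name": "a", "type": "string"},
            {"name": "b", "type": "integer"},
        ]
    },
}
```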

### What is the DataStore and how to create a DataStore entry

https://github.com/ckan/ckan/tree/master/ckanext/datastore

* Provides an ad hoc database for storage of structured data from CKAN resources
* Connection with DataPusher: https://docs.ckan.org/en/2.8/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore
* DataStore API: https://docs.ckan.org/en/2.8/maintaining/datastore.html#the-datastore-api
* Making DataStore API requests: https://docs.ckan.org/en/2.8/maintaining/datastore.html#making-a-datastore-api-request

#### Create an entry

```
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
    "resource": {"package_id": "{PACKAGE-ID}"},
    "fields": [ {"id": "a"}, {"id": "b"} ]
  }'
```

https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
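
The same call from Python (e.g. inside a loader task) using `requests`; the URL, API key, and package id are placeholders as above.

```python
# Same datastore_create call via the CKAN action API from Python.
# URL, API key and package id are placeholders.
import requests

response = requests.post(
    "http://127.0.0.1:5000/api/3/action/datastore_create",
    headers={"Authorization": "{YOUR-API-KEY}"},
    json={
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [{"id": "a"}, {"id": "b"}],
    },
)
response.raise_for_status()
print(response.json()["result"]["resource_id"])
```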

### Options for Loading

There are 3 different paths we could take:

```mermaid
graph TD

pyloadstr[Load in Python in streaming mode]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG copy]
pushdatads[Load to DS via DataStore API]
pushdatasql[Load to DS via SQL over PG API]
done[Data in DataStore]
dataflows[DataFlows SQL loader]

cleartable -- 1 --> pyloadstr
pyloadstr --> pushdatads
pyloadstr --> pushdatasql
cleartable -- 2 --> dataflows
dataflows --> pushdatasql
cleartable -- 3 --> pushdatacopy

pushdatasql --> done
pushdatacopy --> done
pushdatads --> done
```
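
Path 1, pushed via the DataStore API, would use the `datastore_upsert` action; a minimal sketch, where the URL, API key, resource id, and records are placeholders:

```python
# Sketch of the DataStore Write API path: push a batch of records with
# datastore_upsert. URL, API key, resource id and records are placeholders.
import requests

requests.post(
    "http://127.0.0.1:5000/api/3/action/datastore_upsert",
    headers={"Authorization": "{YOUR-API-KEY}"},
    json={
        "resource_id": "{RESOURCE-ID}",
        "method": "insert",  # plain inserts, no unique-key matching
        "records": [{"a": "x", "b": 1}, {"a": "y", "b": 2}],
    },
).raise_for_status()
```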

#### Pros and Cons of the different approaches

| Criteria                | Datastore Write API | PG Copy | Dataflows |
|-------------------------|---------------------|---------|-----------|
| Speed                   | Low                 | High    | ???       |
| Error reporting         | Yes                 | Yes     | No(?)     |
| Ease of implementation  | Yes                 | No(?)   | Yes       |
| Works with big data     | No                  | Yes     | Yes(?)    |
| Works well in parallel  | No                  | Yes(?)  | Yes(?)    |

### DataFlows

https://github.com/datahq/dataflows

DataFlows is a framework for loading, processing, and manipulating data.

* Loader (loading from an external source or disk): https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py
* Load to an SQL DB (dump processed data): https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/to_sql.py
* What is its error reporting, what is its runner system, does it have a UI, does it have a queue system?
  * We think Data Package Pipelines takes care of all of these: https://github.com/frictionlessdata/datapackage-pipelines
  * DPP itself is also an ETL framework, just much heavier and a bit more complicated.
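
For reference, path 2 above (the DataFlows SQL loader) looks roughly like this; the file, resource and table names and the DB URL are illustrative.

```python
# Sketch of the DataFlows SQL-loading path; names and DB URL are made up.
from dataflows import Flow, dump_to_sql, load

Flow(
    load("data.csv", name="my-resource"),  # infers a Table Schema on load
    dump_to_sql(
        {"my_table": {"resource-name": "my-resource"}},
        engine="postgresql://user:pass@localhost/datastore",
    ),
).process()
```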

### Notes and Q&A (Sep 2019)

* Note: the TDR needs info on the source CKAN resource so that we can create the right DataStore entry ...
* No need to validate, as we assume the data is good ...
  * We might still want to do that ...
* Pros and Cons
  * Speed
  * Error reporting ...
    * What happens with COPY if you hit an error (e.g. a bad cast)?
      * https://infinum.co/the-capsized-eight/superfast-csv-imports-using-postgresqls-copy
      * https://wiki.postgresql.org/wiki/Error_logging_in_COPY
  * Ease of implementation
  * Good with inserting big data
* Create as strings and cast later ... ?
  * xloader implementation with the COPY command: https://github.com/ckan/ckanext-xloader/blob/fb17763fc7726084f67f6ebd640809ecc055b3a2/ckanext/xloader/loader.py#L40

Rough benchmarks (1M rows):

* Raw insert: ~15 min
* Insert with BEGIN / COMMIT: ~5 min
* COPY: ~82 s (though it may be limited by bandwidth) -- and what happens if the pipe breaks?
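
Roughly what the three timed variants look like with psycopg2; everything here (connection string, table, rows) is illustrative.

```python
# Rough shape of the three timed variants; all names are made up.
import psycopg2

rows = [("a", 1), ("b", 2)]  # stand-in for 1M rows
conn = psycopg2.connect("postgresql://user:pass@localhost/datastore")
cur = conn.cursor()

# 1. Raw inserts: autocommit means one transaction per row (slowest).
conn.autocommit = True
for row in rows:
    cur.execute("INSERT INTO my_table VALUES (%s, %s)", row)

# 2. All inserts inside one BEGIN/COMMIT: far fewer commits.
conn.autocommit = False
for row in rows:
    cur.execute("INSERT INTO my_table VALUES (%s, %s)", row)
conn.commit()

# 3. COPY from a file-like object: fastest.
with open("data.csv") as f:
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()
```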

Q: Is it better to put everything in the DB as strings and cast later, or to cast first and then insert?

A: Probably cast first and insert after.

Q: Why do we rush to insert the data into the DB? We will have to wait until it is cast anyway before use.

A: It is much faster to do operations in the DB than outside it.