# Data Factory Design

Our Data Factory system is called AirCan. A Data Factory is a set of services/components to process and integrate data (coming from different sources), plus patterns/methods for integrating with CKAN and the DataHub.

## Components

```mermaid
graph LR

subgraph Orchestration
  airflow[AirFlow]
  airflowservice[AirFlow service]
end

subgraph CKAN integration
  ckanhooks[CKAN extension to trigger and report on factory activity]
  ckanapi[API for triggering DAGs etc]
  ckanui[UI integration - display info on factory activity]
end

subgraph Processors and Flows
  ckandatastoreload[CKAN Loader lib]
  ckanharvesters[CKAN Harvesters]
  validation[Validation Lib]
end
```

## DataStore Load job story

### Reporting Integration

When I upload a file to CKAN and it is getting loaded to the DataStore (automatically), I want to know if that succeeded or failed so that I can share with my users that the new data is available (or do something about the error).

For a remote Airflow instance (let's say on Google Cloud Composer), the DAG tasks and the process are as follows:

* File upload on CKAN triggers the ckanext-aircan connector
* which makes an API request to Airflow on GCP and triggers a DAG with the following parameters:
  * a f11s resource object including
    * the remote location of the CSV file and the resource ID
  * the target resource id
  * an API key to use when loading to the CKAN DataStore
  * [a callback url]
* The DAG:
  * deletes the datastore table if it exists
  * creates a new datastore table
  * loads the CSV from the specified location (information available in the DAG parameters)
  * converts the CSV to JSON. The output of the converted JSON file will be in a bucket on GCP.
  * upserts the JSON data row by row into the CKAN DataStore via CKAN's DataStore API
    * This is what we have now: invoke `/api/3/action/datastore_create`, passing the contents of the JSON file
    * OR use upsert with inserts (faster). NB: datapusher just pushes the whole thing into `datastore_create`, so stick with that.
    * OR: if we are doing postgres COPY we need direct access to the postgres DB
  * ... [tbd] notifies the CKAN instance of this (?)

Error handling and other topics to consider:

* How can we let CKAN know something went wrong? Shall we create a way to notify a certain endpoint on the ckanext-aircan connector?
* Shall we also implement a timeout on CKAN?
* What are we going to display in case of an error?
* The "tmp" bucket on GCP will eventually get full of files; shall we flush it? How do we know when it's safe to delete a file?
* Lots of ways up this mountain.
* What do we do for large files?

## AirCan API

AirCan is built on AirFlow, so we have the same basic API (TODO: insert link).

However, we have standard message formats to pass to DAGs, following these principles:

* All dataset and data resource objects should follow the Frictionless specs

Pseudo-code showing how we call the API:

```python
airflow.dag_run({
    "conf": {
        "resource": json.dumps({  # f11s resource object
            "resource_id": ...,
            "path": ...,
            "schema": ...,
        }),
        "ckan_api_key": ...,
        "ckan_api_endpoint": "demo.ckan.org/api/",
    }
})
```

See for the latest, up to date version: https://github.com/datopian/ckanext-aircan/blob/master/ckanext/aircan_connector/action.py#L68

## CKAN integration API

There is a new API as follows:

`http://ckan:5000/api/3/action/aircan_submit?dag_id=...&dataset=...&resource=...`

Also DAGs can get triggered on events ...

TODO: go look at Github Actions and learn from it ...
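As a rough illustration of how a client might call this endpoint, here is a minimal sketch using `requests`. The parameter names mirror the URL above; the HTTP method, authentication header, and response shape are assumptions, so check the ckanext-aircan source for the real interface.

```python
# Sketch: trigger an AirCan DAG via the CKAN aircan_submit action.
# Parameter names mirror the URL above; method, auth and response are assumptions.
import requests

CKAN_URL = "http://ckan:5000"
CKAN_API_KEY = "..."  # API key of the user triggering the run

resp = requests.get(
    f"{CKAN_URL}/api/3/action/aircan_submit",
    headers={"Authorization": CKAN_API_KEY},
    params={
        "dag_id": "ckan_api_load_gcp",   # DAG to trigger (name taken from the appendix notes)
        "dataset": "my-dataset",         # dataset the resource belongs to
        "resource": "my-resource-id",    # resource to load into the DataStore
    },
)
resp.raise_for_status()
print(resp.json())
```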
## Architecture

Other principles of architecture:

* AirFlow tasks and DAGs should do very little themselves and should hand off to separate libraries. Why? To have better separation of concerns and **testability**. AirCan is reasonably cumbersome to test but an SDK is much more testable.
  * Thus AirFlow tasks are often just going to pass through arguments. TODO: expand this with an example ...
* AirFlow DAG will have incoming data and config set in "global" config for the DAG and so available to every task ...
* Tasks should be as decoupled as possible. Obviously there *is* some data and metadata passing between tasks and that should be done by writing those to a storage bucket. Metadata MUST be stored in f11s format.
* See this interesting blog post (not scientific) about why the previous approach, with side effects, is not very resilient in the long run of a project: https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
* don't pass data explicitly between tasks (rather it is passed implicitly via an expectation of where the data is stored ...)
* tasks and flows should be re-runnable ... (no side effects principle)

Each task can write to this location:

```
bucket/dagid/runid/taskid/resource.json
bucket/dagid/runid/taskid/dataset.json
bucket/dagid/runid/taskid/...  # data files
```

## UI in DMS

URL structure on a dataset:

```
# xxx is a dataset
/@myorg/xxx/actions/
/@myorg/xxx/actions/runs/{id}
```

Main question: to display to the user we need some way to log what jobs are associated with what datasets (and users), and perhaps their status.

* we want to keep the factory relatively dumb (it does not know about datasets etc)
* in terms of capabilities we need a way to pass permissions into the data factory (you hand over the keys to your car)

Simplest approach (a rough sketch of this record appears at the end of this section):

* MetaStore (CKAN metadata DB) has a Jobs table with the structure `| id | factory_id | job_type | created | updated | dataset | resource | user | status | info |` (where `info` is a JSON blob)
  * `status` = one of `WAITING | RUNNING | DONE | FAILED | CANCELLED`. If failed we should have stuff in `info` about that.
  * `job_type` = one of `HARVEST | LOAD | VALIDATE ...`; it is there so we could have several different factory jobs in one DB
  * `info`: likely stuff
    * run time
    * error information (on failure)
    * success information: what was the outcome, where are the outputs if any, etc
* On creating a job in the factory, the factory returns a factory id. The MetaStore stores the factory id in a new job object along with dataset and user info ...
  * Qu: why have `id` and `factory_id` separate? Is there any situation where you have a job w/o a factory id?
* Then on loading a job page in the frontend you can poll the factory for info and status (if status is WAITING or RUNNING)
  * => do we need the `info` column on the job (it's just a cache of this info)?
    * Ans: useful for jobs which are complete so we don't keep polling the factory (esp. if the factory deletes stuff)
* Can list all jobs for a given dataset (or resource) with info about them

Qus:

* For the Data Factory, what do I do with Runs that are stale etc? How do I know who they are associated with? Can I store metadata on my Runs, like who requested them, etc?
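As a purely illustrative sketch of the Jobs table proposed above, the record could be modelled roughly as follows. All names are assumptions taken from the column list and enums in this section, not an existing CKAN or AirCan schema.

```python
# Sketch of the proposed Jobs table record, assuming the columns listed above.
# Names are illustrative; this is not existing AirCan/CKAN code.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    WAITING = "WAITING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


class JobType(Enum):
    HARVEST = "HARVEST"
    LOAD = "LOAD"
    VALIDATE = "VALIDATE"


@dataclass
class Job:
    id: str                      # MetaStore's own id for the job
    factory_id: str              # id returned by the Data Factory (e.g. an Airflow DagRun id)
    job_type: JobType
    created: datetime
    updated: datetime
    dataset: str                 # dataset this job relates to
    resource: Optional[str]      # resource id, if the job targets a single resource
    user: str                    # user who triggered the job
    status: JobStatus = JobStatus.WAITING
    info: dict = field(default_factory=dict)  # JSON blob: run time, errors, outputs, ...
```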
### UI Design

Example from Github:

![](https://i.imgur.com/xnTRq5T.png)

## Appendix

### Notes re AirCan API

https://medium.com/@ptariche/interact-with-apache-airflows-experimental-api-3eba195f2947

(A small polling sketch is included at the end of this appendix.)

```
# response from POST .../dag_runs (triggering a DAG run)
{"message":"Created "}

GET /api/experimental/dags/<dag_id>/dag_runs/<execution_date>
GET /api/experimental/dags/ckan_api_load_gcp/dag_runs/2020-07-14 13:04:43+00:00

https://b011229e45c662be6p-tp.appspot.com/api/experimental/dags/ckan_api_load_gcp/dag_runs/2020-07-14T13:04:43+00:00

Resp: {"state":"failed"}
```

### Google Cloud Composer

Google Cloud Composer is a hosted version of AirFlow on Google Cloud.

#### How Google Cloud Composer differs from local AirFlow

* File handling: on GCP, all the file handling must become interaction with a bucket.
  * ~rufus: what about from a url online (but not a bucket)?
  * Specifying the CSV resource location (on a local Airflow) must become sending a resource to a bucket (or just parsing it from the JSON body). When converting it to a JSON file, it must become an action of creating a file on a bucket.
* Authentication: TODO

### AirFlow Best Practices

* Should you, and how do you, pass information between tasks?
  * https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
  * https://towardsdatascience.com/airflow-sharing-data-between-tasks-7bbaa27eeb1

### What terminology should we use?

ANS: we use AirFlow terminology:

* Task
* DAG
* DagRun

For internals, what are the options?

* Task or Processor or ...
* DAG or Flow or Pipeline?

TODO: table summarizing options in AirFlow, Luigi, Apache Beam etc.

#### UI Terminology

* Actions
* Workflows

Terminology options:

* Gitlab
  * Pipelines: you have
    * Jobs (runs of those pipelines)
    * Schedules
* Github
  * Workflows
  * Runs
  * (Schedules - not explicit)
* Airflow
  * DAGs
  * Tasks
  * DAG Runs
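### Example: polling a DAG run (sketch)

To make the "Notes re AirCan API" above concrete, here is a minimal sketch of polling a DAG run's state via Airflow's experimental REST API. The endpoint, base URL, DAG id and execution date are taken from those notes; authentication and error handling are omitted, so treat this as illustrative only.

```python
# Sketch: poll the state of a DAG run via Airflow's experimental REST API.
# Base URL, dag_id and execution date follow the appendix notes above.
import requests

AIRFLOW_URL = "https://b011229e45c662be6p-tp.appspot.com"  # Cloud Composer webserver
DAG_ID = "ckan_api_load_gcp"
EXECUTION_DATE = "2020-07-14T13:04:43+00:00"

resp = requests.get(
    f"{AIRFLOW_URL}/api/experimental/dags/{DAG_ID}/dag_runs/{EXECUTION_DATE}"
)
resp.raise_for_status()
print(resp.json())  # e.g. {"state": "failed"} or {"state": "success"}
```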