Compare commits

...

59 Commits

Author SHA1 Message Date
47b4ffece8 Update packages/remark-wiki-link/src/lib/remarkWikiLink.ts
Some checks failed
Release / Release (push) Failing after 1h6m8s
2025-04-08 18:12:43 +00:00
Rufus Pollock
816db6858c [site/content/docs/dms][s]: remove this directory as duplicate of content on tech.datopian.com/ especially tech.datopian.com/dms. 2025-02-09 22:59:31 +00:00
github-actions[bot]
0381f2fccf Version Packages 2025-01-22 16:37:23 +01:00
Ola Rubaj
62dbc35d3b fix(LineChart): skip lines at invalid/missing data points (don't force connect) 2025-01-22 16:23:17 +01:00
Lucas Morais Bispo
12f0d0d732
Merge pull request #1347 from datopian/feature/update-links
[md][datahub] updated links from portaljs.org to portaljs.com
2025-01-13 08:52:50 -03:00
muhammad-hassan11
d80d1f5012 removed logs 2025-01-13 16:22:01 +05:00
Anuar Ustayev (aka Anu)
af5b6b7a29
Rename/rebrand from datahub to portaljs.
DataHub.io is becoming something different, e.g., hub for data OR data market[place] while PortalJS.com is a cloud platform for creating managed data portals.
2024-12-24 10:43:26 +05:00
muhammad-hassan11
8487175f01 [md][datahub] updated links from portaljs.org to portaljs.com 2024-12-23 21:57:09 +05:00
Anuar Ustayev (aka Anu)
6551576700
Change back to PortalJS name for data portals. 2024-12-23 11:10:39 +05:00
Lucas Morais Bispo
4fccb2945f
Merge pull request #1346 from datopian/fix/dotorgmerging
[site][WIP] Seo - update title, canonical
2024-12-05 19:35:07 -03:00
lucasmbispo
a9025e5cbe [site]:seo - update title, canonical 2024-12-05 08:14:18 -03:00
github-actions[bot]
ad5a176e85 Version Packages 2024-11-11 15:52:06 +01:00
Ola Rubaj
eeb480e8cf [fix][xs]: allow yearmonth TimeUnit in LineChart 2024-11-11 15:40:07 +01:00
github-actions[bot]
30fcb256b2 Version Packages 2024-10-24 08:53:23 +02:00
Ola Rubaj
a4f8c0ed76 [chore][xs]: update package-lock 2024-10-24 08:46:51 +02:00
Ola Rubaj
829f3b1f13 [chore][xs]: fix formatting 2024-10-24 08:46:27 +02:00
Ola Rubaj
836b143a31 [fix][xs]: make tileLayerName in Map optional 2024-10-24 08:45:51 +02:00
github-actions[bot]
be38086794 Version Packages 2024-10-23 18:08:18 +02:00
Ola Rubaj
63d9e3b754
[feat,LineChart][s]: support for multiple series 2024-10-23 18:03:07 +02:00
Anuar Ustayev (aka Anu)
f86f0541eb
Merge pull request #1332 from datopian/site/fix-showcases
[portaljs site][showcases][s] Merge examples into Showcases tab
2024-10-11 09:36:16 +05:00
Lucas Morais Bispo
64bc212384
Update README.md 2024-10-09 11:46:02 -03:00
Lucas Morais Bispo
1e7daf353d
Add files via upload 2024-10-09 11:28:42 -03:00
lucasmbispo
cc69dabf80 [site][showcases] update examples 2024-10-03 21:04:06 -03:00
lucasmbispo
a5d87712e0 [site][showcases][s] Merge examples into Showcases tab 2024-10-01 11:07:33 -03:00
Rufus Pollock
86834fd1a6
Merge pull request #1317 from loleg/patch-1
Fix link to Next.js in README.md
2024-09-20 13:30:02 +02:00
Oleg Lavrovsky
8a661b1617
Fix link to Next.js in README.md 2024-09-20 11:23:06 +02:00
Rufus Pollock
1baebc3f3c
Merge pull request #1200 from rzmk/patch-1
[#1181, examples/ckan-ssg][xs]: update example generation command
2024-07-05 19:13:43 +02:00
João Demenech
bbac4954f5
Merge pull request #1202 from datopian/changeset-release/main
Version Packages
2024-06-24 17:58:02 -03:00
github-actions[bot]
be6b184884 Version Packages 2024-06-24 20:47:23 +00:00
João Demenech
64103d6488
Merge pull request #1122 from datopian/feature/custom-tile-layer
Custom Tile Layer for Map Component
2024-06-24 17:44:19 -03:00
Demenech
8e3496782c version: add changeset 2024-06-24 17:42:49 -03:00
Mueez Khan
e034503399
[examples/ckan-ssg][xs]: update command to create project 2024-06-22 00:17:49 -04:00
William Lima
93ae498ec2 Code cleanup 2024-06-19 10:10:56 -01:00
William Lima
97e43fdcba add mapbox as default basemap 2024-06-18 22:37:20 -01:00
William Lima
32f29024f8 attr replace fix 2024-06-18 22:05:41 -01:00
William Lima
134f72948c Add TileLayer Presets configuration 2024-06-18 22:01:59 -01:00
Rufus Pollock
c1f2c526a8 [#1181,site][xs]: change portaljs to datahub in github repo references. 2024-06-10 19:31:43 +02:00
João Demenech
8feb87739d
Merge pull request #1173 from datopian/changeset-release/main
Version Packages
2024-06-09 08:06:43 -03:00
github-actions[bot]
3a07267e44 Version Packages 2024-06-09 09:25:23 +00:00
Rufus Pollock
3f19ca16ed
[#1118,docs/portaljs][s 2024-06-09 11:22:25 +02:00
João Demenech
5deabac5fe
Merge pull request #1170 from datopian/fix/iframe-height
[components][iFrame] Change default height
2024-06-04 14:57:24 -03:00
lucasmbispo
96901150c6 [changesets] change major to patch 2024-06-04 09:38:47 -03:00
lucasmbispo
9ff25ed7c4 [components][iFrame] Change iFrame height 2024-06-04 09:38:12 -03:00
lucasmbispo
8f884fceab [components][iFrame] Change default height 2024-06-04 09:26:30 -03:00
Anuar Ustayev (aka Anu)
7094eded50
Merge pull request #1167 from datopian/fix/map-geojson
Fix: autoZoomConfiguration not working properly when the geojson parameter is passed
2024-06-04 14:06:45 +05:00
Rufus Pollock
30e7c6379f
Merge pull request #1069 from marcchehab/patch-2 - Add SiteToc to MobileNav.
This PR adds the SiteToc to the MobileNav. It also fixes double type declarations in MobileNav by importing the interfaces from Nav. Adding SiteToc was then just a matter of uncommenting code that was there already.
2024-05-31 17:16:42 +02:00
Ronaldo Campos
feada58932 Fix: autoZoomConfiguration not working properly when the geojson parameter is passed 2024-05-31 11:37:01 -03:00
William Lima
31406d48e3 Update Map.tsx 2024-05-31 10:29:15 -01:00
Daniellappv
d6bf344ca3
Update CONTRIBUTING.md 2024-05-31 10:55:58 +03:00
William Lima
d1a5138c6e include configs on .env vars or pass through props 2024-05-22 11:48:20 -01:00
William Lima
a6047a9341 Implements Custom Tile Layer
#1121 adds default tile layer and allows user to pass a tile object to map
2024-05-13 12:51:28 -01:00
Ola Rubaj
a4e60540ae
Merge pull request #1119 from datopian/remark-wiki-link-cleanup
## Changes

- remove unneeded tests
- do not remove "index" from the end of tile path in `getPermalinks` function
2024-05-09 02:20:45 +02:00
Ola Rubaj
e4c456c237 rm changeset file 2024-05-09 02:19:54 +02:00
Ola Rubaj
ce9ebbf41e add changeset file 2024-05-09 02:16:05 +02:00
Ola Rubaj
a8fb176bcc rm test for custom permarlink converter (irrelevant) 2024-05-09 02:12:44 +02:00
Ola Rubaj
2ac82367c5 do not remove "index" from the end of file
- should be treated as a regular file name
- it's up to the app how to interpret those paths/files later
2024-05-09 02:12:38 +02:00
Ola Rubaj
85de6f7878 replace inex.md with README.md in test fixtures 2024-05-09 02:09:52 +02:00
luzmediach
1a8e7ac06e NavMobile to use Nav interfaces and add SiteToc to sidebar 2024-01-21 12:48:10 +01:00
marcchehab
4355efe0c4
Update Nav.tsx 2024-01-21 12:36:46 +01:00
125 changed files with 1881 additions and 12501 deletions

View File

@ -4,7 +4,7 @@ title: Developer docs for contributors
## Our repository ## Our repository
https://github.com/datopian/portaljs https://github.com/datopian/datahub
Structure: Structure:
@ -17,7 +17,7 @@ Structure:
## How to contribute ## How to contribute
You can start by checking our [issues board](https://github.com/datopian/portaljs/issues). You can start by checking our [issues board](https://github.com/datopian/datahub/issues).
If you'd like to work on one of the issues you can: If you'd like to work on one of the issues you can:
@ -35,7 +35,7 @@ If you'd like to work on one of the issues you can:
If you have an idea for improvement, and it doesn't have a corresponding issue yet, simply submit a new one. If you have an idea for improvement, and it doesn't have a corresponding issue yet, simply submit a new one.
> [!note] > [!note]
> Join our [Discord channel](https://discord.gg/rTxfCutu) do discuss existing issues and to ask for help. > Join our [Discord channel](https://discord.gg/KZSf3FG4EZ) do discuss existing issues and to ask for help.
## Nx ## Nx

View File

@ -1,31 +1,25 @@
<h1 align="center">
<a href="https://datahub.io/">
<img alt="datahub" src="http://datahub.io/datahub-cube.svg" width="146">
</a>
</h1>
<p align="center"> <p align="center">
Bugs, issues and suggestions re DataHub Cloud ☁️ and DataHub OpenSource 🌀 Bugs, issues and suggestions re PortalJS framework
<br /> <br />
<br /><a href="https://discord.gg/xfFDMPU9dC"><img src="https://dcbadge.vercel.app/api/server/xfFDMPU9dC" /></a> <br /><a href="https://discord.gg/xfFDMPU9dC"><img src="https://dcbadge.vercel.app/api/server/xfFDMPU9dC" /></a>
</p> </p>
## DataHub ## PortalJS framework
This repo and issue tracker are for This repo and issue tracker are for
- PortalJS 🌀 - https://www.portaljs.com/
- DataHub Cloud ☁️ - https://datahub.io/ - DataHub Cloud ☁️ - https://datahub.io/
- DataHub 🌀 - https://datahub.io/opensource
### Issues ### Issues
Found a bug: 👉 https://github.com/datopian/datahub/issues/new Found a bug: 👉 https://github.com/datopian/portaljs/issues/new
### Discussions ### Discussions
Got a suggestion, a question, want some support or just want to shoot the breeze 🙂 Got a suggestion, a question, want some support or just want to shoot the breeze 🙂
Head to the discussion forum: 👉 https://github.com/datopian/datahub/discussions Head to the discussion forum: 👉 https://github.com/datopian/portaljs/discussions
### Chat on Discord ### Chat on Discord
@ -35,13 +29,14 @@ If you would prefer to get help via live chat check out our discord 👉
### Docs ### Docs
https://datahub.io/docs - For PortalJS go to https://www.portaljs.com/opensource
- For DataHub Cloud https://datahub.io/docs
## DataHub OpenSource 🌀 ## PortalJS Cloud 🌀
DataHub 🌀 is a platform for rapidly creating rich data portal and publishing systems using a modern frontend approach. Datahub can be used to publish a single dataset or build a full-scale data catalog/portal. PortalJS Cloud 🌀 is a platform for rapidly creating rich data portal and publishing systems using a modern frontend approach. PortalJS Cloud can be used to publish a single dataset or build a full-scale data catalog/portal.
DataHub is built in JavaScript and React on top of the popular [Next.js](https://nextjs.com/) framework. DataHub assumes a "decoupled" approach where the frontend is a separate service from the backend and interacts with backend(s) via an API. It can be used with any backend and has out of the box support for [CKAN](https://ckan.org/), GitHub, Frictionless Data Packages and more. PortalJS Cloud is built in JavaScript and React on top of the popular [Next.js](https://nextjs.org) framework. PortalJS Cloud assumes a "decoupled" approach where the frontend is a separate service from the backend and interacts with backend(s) via an API. It can be used with any backend and has out of the box support for [CKAN](https://ckan.org/), GitHub, Frictionless Data Packages and more.
### Features ### Features

View File

@ -2,7 +2,7 @@
**🚩 UPDATE April 2023: This example is now deprecated - though still works!. Please use the [new CKAN examples](https://github.com/datopian/portaljs/tree/main/examples)** **🚩 UPDATE April 2023: This example is now deprecated - though still works!. Please use the [new CKAN examples](https://github.com/datopian/portaljs/tree/main/examples)**
This example shows how you can build a full data portal using a CKAN Backend with a Next.JS Frontend powered by Apollo, a full fledged guide is available as a [blog post](https://portaljs.org/blog/example-ckan-2021) This example shows how you can build a full data portal using a CKAN Backend with a Next.JS Frontend powered by Apollo, a full fledged guide is available as a [blog post](https://portaljs.com/blog/example-ckan-2021)
## Developers ## Developers

View File

@ -1,7 +1,7 @@
This is a repo intended to serve as an example of a data catalog that get its data from a CKAN Instance. This is a repo intended to serve as an example of a data catalog that get its data from a CKAN Instance.
``` ```
npx create-next-app <app-name> --example https://github.com/datopian/portaljs/tree/main/examples/ckan-example npx create-next-app <app-name> --example https://github.com/datopian/datahub/tree/main/examples/ckan-ssg
cd <app-name> cd <app-name>
``` ```
@ -19,7 +19,7 @@ npm run dev
Congratulations, you now have something similar to this running on `http://localhost:4200` Congratulations, you now have something similar to this running on `http://localhost:4200`
![](https://media.discordapp.net/attachments/1069718983604977754/1098252297726865408/image.png?width=853&height=461) ![](https://media.discordapp.net/attachments/1069718983604977754/1098252297726865408/image.png?width=853&height=461)
If yo go to any one of those pages by clicking on `More info` you will see something similar to this If you go to any one of those pages by clicking on `More info` you will see something similar to this
![](https://media.discordapp.net/attachments/1069718983604977754/1098252298074988595/image.png?width=853&height=461) ![](https://media.discordapp.net/attachments/1069718983604977754/1098252298074988595/image.png?width=853&height=461)
## Deployment ## Deployment

View File

@ -1,6 +1,6 @@
This example creates a portal/showcase for a single dataset. The dataset should be a [Frictionless dataset (data package)][fd] i.e. there should be a `datapackage.json`. This example creates a portal/showcase for a single dataset. The dataset should be a [Frictionless dataset (data package)][fd] i.e. there should be a `datapackage.json`.
[fd]: https://frictionlessdata.io/data-packages/ [fd]: https://specs.frictionlessdata.io/data-package/
## How to use ## How to use

View File

@ -59,7 +59,7 @@ export default function Layout({ children }: { children: React.ReactNode }) {
<div className="md:flex items-center gap-x-3 text-[#3c3c3c] -mb-1 hidden"> <div className="md:flex items-center gap-x-3 text-[#3c3c3c] -mb-1 hidden">
<a <a
className="hover:opacity-75 transition" className="hover:opacity-75 transition"
href="https://portaljs.org" href="https://portaljs.com"
> >
Built with 🌀PortalJS Built with 🌀PortalJS
</a> </a>
@ -77,7 +77,7 @@ export default function Layout({ children }: { children: React.ReactNode }) {
<li> <li>
<a <a
className="hover:opacity-75 transition" className="hover:opacity-75 transition"
href="https://portaljs.org" href="https://portaljs.com"
> >
PortalJS PortalJS
</a> </a>

View File

@ -6,7 +6,7 @@ A `datasets.json` file is used to specify which datasets are going to be part of
The application contains an index page, which lists all the datasets specified in the `datasets.json` file, and users can see more information about each dataset, such as the list of data files in it and the README, by clicking the "info" button on the list. The application contains an index page, which lists all the datasets specified in the `datasets.json` file, and users can see more information about each dataset, such as the list of data files in it and the README, by clicking the "info" button on the list.
You can read more about it on the [Data catalog with data on GitHub](https://portaljs.org/docs/examples/github-backed-catalog) blog post. You can read more about it on the [Data catalog with data on GitHub](https://portaljs.com/docs/examples/github-backed-catalog) blog post.
## Demo ## Demo

View File

@ -40,7 +40,7 @@ export function Datasets({ projects }) {
<Link <Link
target="_blank" target="_blank"
className="underline" className="underline"
href="https://portaljs.org/" href="https://portaljs.com/"
> >
🌀 PortalJS 🌀 PortalJS
</Link> </Link>

View File

@ -1 +1 @@
PortalJS Learn Example - https://portaljs.org/docs PortalJS Learn Example - https://portaljs.com/docs

View File

@ -6,7 +6,7 @@ A `datasets.json` file is used to specify which datasets are going to be part of
The application contains an index page, which lists all the datasets specified in the `datasets.json` file, and users can see more information about each dataset, such as the list of data files in it and the README, by clicking the "info" button on the list. The application contains an index page, which lists all the datasets specified in the `datasets.json` file, and users can see more information about each dataset, such as the list of data files in it and the README, by clicking the "info" button on the list.
You can read more about it on the [Data catalog with data on GitHub](https://portaljs.org/docs/examples/github-backed-catalog) blog post. You can read more about it on the [Data catalog with data on GitHub](https://portaljs.com/docs/examples/github-backed-catalog) blog post.
## Demo ## Demo

View File

@ -17,7 +17,7 @@ export default function Footer() {
</a> </a>
</div> </div>
<div className="flex gap-x-2 items-center mx-auto h-20"> <div className="flex gap-x-2 items-center mx-auto h-20">
<p className="mt-8 text-base text-slate-500 md:mt-0">Built with <a href="https://portaljs.org" target="_blank" className='text-xl font-medium'>🌀 PortalJS</a></p> <p className="mt-8 text-base text-slate-500 md:mt-0">Built with <a href="https://portaljs.com" target="_blank" className='text-xl font-medium'>🌀 PortalJS</a></p>
</div> </div>
</div> </div>
</footer> </footer>

View File

@ -127,4 +127,4 @@ Based on the bar chart above we can conclude that the following 3 countries have
2. Poland - EUR ~68b. 2. Poland - EUR ~68b.
3. Italy - EUR ~35b. 3. Italy - EUR ~35b.
_This data story was created by using Datopian's PortalJS framework. You can learn more about the framework by visiting https://portaljs.org/_ _This data story was created by using Datopian's PortalJS framework. You can learn more about the framework by visiting https://portaljs.com/_

View File

@ -1,6 +1,6 @@
This demo data portal is designed for https://hatespeechdata.com. It catalogs datasets annotated for hate speech, online abuse, and offensive language which are useful for training a natural language processing system to detect this online abuse. This demo data portal is designed for https://hatespeechdata.com. It catalogs datasets annotated for hate speech, online abuse, and offensive language which are useful for training a natural language processing system to detect this online abuse.
The site is built on top of [PortalJS](https://portaljs.org/). It catalogs datasets and lists of offensive keywords. It also includes static pages. All of these are stored as markdown files inside the `content` folder. The site is built on top of [PortalJS](https://portaljs.com/). It catalogs datasets and lists of offensive keywords. It also includes static pages. All of these are stored as markdown files inside the `content` folder.
- .md files inside `content/datasets/` will appear on the dataset list section of the homepage and be searchable as well as having a individual page in `datasets/<file name>` - .md files inside `content/datasets/` will appear on the dataset list section of the homepage and be searchable as well as having a individual page in `datasets/<file name>`
- .md files inside `content/keywords/` will appear on the list of offensive keywords section of the homepage as well as having a individual page in `keywords/<file name>` - .md files inside `content/keywords/` will appear on the list of offensive keywords section of the homepage as well as having a individual page in `keywords/<file name>`

View File

@ -21,7 +21,7 @@ export function Footer() {
<Container.Inner> <Container.Inner>
<div className="flex flex-col items-center justify-between gap-6 sm:flex-row"> <div className="flex flex-col items-center justify-between gap-6 sm:flex-row">
<p className="text-sm font-medium text-zinc-800 dark:text-zinc-200"> <p className="text-sm font-medium text-zinc-800 dark:text-zinc-200">
Built with <a href='https://portaljs.org'>PortalJS 🌀</a> Built with <a href='https://portaljs.com'>PortalJS 🌀</a>
</p> </p>
<p className="text-sm text-zinc-400 dark:text-zinc-500"> <p className="text-sm text-zinc-400 dark:text-zinc-500">
&copy; {new Date().getFullYear()} Leon Derczynski. All rights &copy; {new Date().getFullYear()} Leon Derczynski. All rights

package-lock.json (generated file, 4 changes)
View File

@ -49897,7 +49897,7 @@
}, },
"packages/components": { "packages/components": {
"name": "@portaljs/components", "name": "@portaljs/components",
"version": "0.6.0", "version": "1.2.0",
"dependencies": { "dependencies": {
"@githubocto/flat-ui": "^0.14.1", "@githubocto/flat-ui": "^0.14.1",
"@heroicons/react": "^2.0.17", "@heroicons/react": "^2.0.17",
@ -50383,7 +50383,7 @@
}, },
"packages/remark-wiki-link": { "packages/remark-wiki-link": {
"name": "@portaljs/remark-wiki-link", "name": "@portaljs/remark-wiki-link",
"version": "1.1.3", "version": "1.2.0",
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"mdast-util-to-markdown": "^1.5.0", "mdast-util-to-markdown": "^1.5.0",

View File

@ -2,7 +2,7 @@
"name": "@portaljs/ckan", "name": "@portaljs/ckan",
"version": "0.1.0", "version": "0.1.0",
"type": "module", "type": "module",
"description": "https://portaljs.org", "description": "https://portaljs.com",
"keywords": [ "keywords": [
"data portal", "data portal",
"data catalog", "data catalog",

View File

@ -1,9 +1,16 @@
import 'tailwindcss/tailwind.css' import 'tailwindcss/tailwind.css'
import '../src/index.css' import '../src/index.css'
import type { Preview } from '@storybook/react'; import type { Preview } from '@storybook/react';
window.process = {
...window.process,
env:{
...window.process?.env,
}
};
const preview: Preview = { const preview: Preview = {
parameters: { parameters: {
actions: { argTypesRegex: '^on[A-Z].*' }, actions: { argTypesRegex: '^on[A-Z].*' },

View File

@ -1,5 +1,41 @@
# @portaljs/components # @portaljs/components
## 1.2.3
### Patch Changes
- [`62dbc35d`](https://github.com/datopian/portaljs/commit/62dbc35d3b39ea7409949340214ca83a448ee999) Thanks [@olayway](https://github.com/olayway)! - LineChart: break lines at invalid / missing values (don't connect if there are gaps in values).
## 1.2.2
### Patch Changes
- [`eeb480e8`](https://github.com/datopian/datahub/commit/eeb480e8cff2d11072ace55ad683a65f54f5d07a) Thanks [@olayway](https://github.com/olayway)! - Adjust `xAxisTimeUnit` property in LineChart to allow for passing `yearmonth`.
## 1.2.1
### Patch Changes
- [`836b143a`](https://github.com/datopian/datahub/commit/836b143a3178b893b1aae3fb511d795dd3a63545) Thanks [@olayway](https://github.com/olayway)! - Fix: make tileLayerName in Map optional.
## 1.2.0
### Minor Changes
- [#1338](https://github.com/datopian/datahub/pull/1338) [`63d9e3b7`](https://github.com/datopian/datahub/commit/63d9e3b7543c38154e6989ef1cc1d694ae9fc4f8) Thanks [@olayway](https://github.com/olayway)! - Support for plotting multiple series in LineChart component.
## 1.1.0
### Minor Changes
- [#1122](https://github.com/datopian/datahub/pull/1122) [`8e349678`](https://github.com/datopian/datahub/commit/8e3496782c022b0653e07f217c6b315ba84e0e61) Thanks [@willy1989cv](https://github.com/willy1989cv)! - Map: allow users to choose a base layer setting
## 1.0.1
### Patch Changes
- [#1170](https://github.com/datopian/datahub/pull/1170) [`9ff25ed7`](https://github.com/datopian/datahub/commit/9ff25ed7c47c8c02cc078c64f76ae35d6754c508) Thanks [@lucasmbispo](https://github.com/lucasmbispo)! - iFrame component: change height
## 1.0.0 ## 1.0.0
### Major Changes ### Major Changes

View File

@ -1,7 +1,7 @@
# PortalJS React Components # PortalJS React Components
**Storybook:** https://storybook.portaljs.org **Storybook:** https://storybook.portaljs.org
**Docs**: https://portaljs.org/docs **Docs**: https://portaljs.com/opensource
## Usage ## Usage

View File

@ -1,8 +1,8 @@
{ {
"name": "@portaljs/components", "name": "@portaljs/components",
"version": "1.0.0", "version": "1.2.3",
"type": "module", "type": "module",
"description": "https://portaljs.org", "description": "https://portaljs.com",
"keywords": [ "keywords": [
"data portal", "data portal",
"data catalog", "data catalog",

View File

@ -11,7 +11,7 @@ export function Iframe({ data, style }: IframeProps) {
return ( return (
<iframe <iframe
src={url} src={url}
style={style ?? { width: `100%`, height: `100%` }} style={style ?? { width: `100%`, height: `600px` }}
></iframe> ></iframe>
); );
} }
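
For context, a minimal usage sketch of the component changed above (the import specifier and URL are illustrative assumptions); an explicit `style` still takes precedence over the new `600px` default height:

```tsx
import { Iframe } from '@portaljs/components'; // assumed export path

export function EmbeddedReport() {
  return (
    <Iframe
      // `style` overrides the default of { width: '100%', height: '600px' }
      data={{ url: 'https://example.com/embed' }}
      style={{ width: '100%', height: '400px' }}
    />
  );
}
```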

View File

@ -5,7 +5,7 @@ import loadData from '../lib/loadData';
import { Data } from '../types/properties'; import { Data } from '../types/properties';
type AxisType = 'quantitative' | 'temporal'; type AxisType = 'quantitative' | 'temporal';
type TimeUnit = 'year' | undefined; // or ... type TimeUnit = 'year' | 'yearmonth' | undefined; // or ...
export type LineChartProps = { export type LineChartProps = {
data: Omit<Data, 'csv'>; data: Omit<Data, 'csv'>;
@ -13,9 +13,10 @@ export type LineChartProps = {
xAxis: string; xAxis: string;
xAxisType?: AxisType; xAxisType?: AxisType;
xAxisTimeUnit?: TimeUnit; xAxisTimeUnit?: TimeUnit;
yAxis: string; yAxis: string | string[];
yAxisType?: AxisType; yAxisType?: AxisType;
fullWidth?: boolean; fullWidth?: boolean;
symbol?: string;
}; };
export function LineChart({ export function LineChart({
@ -26,6 +27,7 @@ export function LineChart({
xAxisTimeUnit = 'year', // TODO: defaults to undefined would probably work better... keeping it as it's for compatibility purposes xAxisTimeUnit = 'year', // TODO: defaults to undefined would probably work better... keeping it as it's for compatibility purposes
yAxis, yAxis,
yAxisType = 'quantitative', yAxisType = 'quantitative',
symbol,
}: LineChartProps) { }: LineChartProps) {
const url = data.url; const url = data.url;
const values = data.values; const values = data.values;
@ -33,6 +35,7 @@ export function LineChart({
// By default, assumes data is an Array... // By default, assumes data is an Array...
const [specData, setSpecData] = useState<any>({ name: 'table' }); const [specData, setSpecData] = useState<any>({ name: 'table' });
const isMultiYAxis = Array.isArray(yAxis);
const spec = { const spec = {
$schema: 'https://vega.github.io/schema/vega-lite/v5.json', $schema: 'https://vega.github.io/schema/vega-lite/v5.json',
@ -44,8 +47,17 @@ export function LineChart({
color: 'black', color: 'black',
strokeWidth: 1, strokeWidth: 1,
tooltip: true, tooltip: true,
invalid: "break-paths"
}, },
data: specData, data: specData,
...(isMultiYAxis
? {
transform: [
{ fold: yAxis, as: ['key', 'value'] },
{ filter: 'datum.value != null && datum.value != ""' }
],
}
: {}),
selection: { selection: {
grid: { grid: {
type: 'interval', type: 'interval',
@ -59,9 +71,25 @@ export function LineChart({
type: xAxisType, type: xAxisType,
}, },
y: { y: {
field: yAxis, field: isMultiYAxis ? 'value' : yAxis,
type: yAxisType, type: yAxisType,
}, },
...(symbol
? {
color: {
field: symbol,
type: 'nominal',
},
}
: {}),
...(isMultiYAxis
? {
color: {
field: 'key',
type: 'nominal',
},
}
: {}),
}, },
} as any; } as any;
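
To make the two additions above concrete: `invalid: "break-paths"` asks Vega-Lite not to connect points across missing or invalid values, and when `yAxis` is an array the `fold` transform reshapes each wide-format row into one `{ key, value }` pair per listed column, with the `color` encoding then drawing one line per `key`. A minimal sketch of that reshaping (data values are illustrative):

```ts
// One wide-format input row with two y-columns...
const wideRow = { year: '1851', A: -0.23, B: -0.3 };

// ...is reshaped by `fold: ['A', 'B']` into two long-format rows:
const folded = Object.entries({ A: wideRow.A, B: wideRow.B }).map(
  ([key, value]) => ({ year: wideRow.year, key, value })
);

// The y encoding reads `value`, the color encoding reads `key`, and the
// filter step drops folded rows whose value is null or '', so a gap in one
// column breaks that series' line instead of being force-connected.
console.log(folded);
// [{ year: '1851', key: 'A', value: -0.23 }, { year: '1851', key: 'B', value: -0.3 }]
```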

View File

@ -12,8 +12,32 @@ import {
import 'leaflet/dist/leaflet.css'; import 'leaflet/dist/leaflet.css';
import * as L from 'leaflet'; import * as L from 'leaflet';
import providers from '../lib/tileLayerPresets';
type VariantKeys<T> = T extends { variants: infer V }
? {
[K in keyof V]: K extends string
? `${K}` | `${K}.${VariantKeys<V[K]>}`
: never;
}[keyof V]
: never;
type ProviderVariantKeys<T> = {
[K in keyof T]: K extends string
? `${K}` | `${K}.${VariantKeys<T[K]>}`
: never;
}[keyof T];
type TileLayerPreset = ProviderVariantKeys<typeof providers> | 'custom';
interface TileLayerSettings extends L.TileLayerOptions {
url?: string;
variant?: string | any;
}
export type MapProps = { export type MapProps = {
tileLayerName?: TileLayerPreset;
tileLayerOptions?: TileLayerSettings | undefined;
layers: { layers: {
data: GeospatialData; data: GeospatialData;
name: string; name: string;
@ -36,7 +60,19 @@ export type MapProps = {
}; };
}; };
const tileLayerDefaultName = process?.env
.NEXT_PUBLIC_MAP_TILE_LAYER_NAME as TileLayerPreset;
const tileLayerDefaultOptions = Object.keys(process?.env)
.filter((key) => key.startsWith('NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_'))
.reduce((obj, key) => {
obj[key.split('NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_')[1]] = process.env[key];
return obj;
}, {}) as TileLayerSettings;
export function Map({ export function Map({
tileLayerName = tileLayerDefaultName || 'OpenStreetMap',
tileLayerOptions,
layers = [ layers = [
{ {
data: null, data: null,
@ -54,6 +90,95 @@ export function Map({
const [isLoading, setIsLoading] = useState<boolean>(false); const [isLoading, setIsLoading] = useState<boolean>(false);
const [layersData, setLayersData] = useState<any>([]); const [layersData, setLayersData] = useState<any>([]);
/*
tileLayerDefaultOptions
extract all environment variables thats starts with NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_.
the variables names are the same as the TileLayer object properties:
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_url:
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_attribution
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_accessToken
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_id
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_ext
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_bounds
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_maxZoom
- NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_minZoom
see TileLayerOptions inteface
*/
//tileLayerData prioritizes properties passed through component over those passed through .env variables
tileLayerOptions = Object.assign(tileLayerDefaultOptions, tileLayerOptions);
let provider = {
url: tileLayerOptions.url,
options: tileLayerOptions,
};
if (tileLayerName != 'custom') {
var parts = tileLayerName.split('.');
var providerName = parts[0];
var variantName: string = parts[1];
//make sure to declare a variant if url depends on a variant: assume first
if (providers[providerName].url?.includes('{variant}') && !variantName)
variantName = Object.keys(providers[providerName].variants)[0];
if (!providers[providerName]) {
throw 'No such provider (' + providerName + ')';
}
provider = {
url: providers[providerName].url,
options: providers[providerName].options,
};
// overwrite values in provider from variant.
if (variantName && 'variants' in providers[providerName]) {
if (!(variantName in providers[providerName].variants)) {
throw 'No such variant of ' + providerName + ' (' + variantName + ')';
}
var variant = providers[providerName].variants[variantName];
var variantOptions;
if (typeof variant === 'string') {
variantOptions = {
variant: variant,
};
} else {
variantOptions = variant.options;
}
provider = {
url: variant.url || provider.url,
options: L.Util.extend({}, provider.options, variantOptions),
};
}
var attributionReplacer = function (attr) {
if (attr.indexOf('{attribution.') === -1) {
return attr;
}
return attr.replace(
/\{attribution.(\w*)\}/g,
function (match: any, attributionName: string) {
match;
return attributionReplacer(
providers[attributionName].options.attribution
);
}
);
};
provider.options.attribution = attributionReplacer(
provider.options.attribution
);
}
var tileLayerData = L.Util.extend(
{
url: provider.url,
},
provider.options,
tileLayerOptions
);
useEffect(() => { useEffect(() => {
const loadDataPromises = layers.map(async (layer) => { const loadDataPromises = layers.map(async (layer) => {
const url = layer.data.url; const url = layer.data.url;
@ -100,6 +225,7 @@ export function Map({
</div> </div>
) : ( ) : (
<MapContainer <MapContainer
key={layersData}
center={[center.latitude, center.longitude]} center={[center.latitude, center.longitude]}
zoom={zoom} zoom={zoom}
scrollWheelZoom={false} scrollWheelZoom={false}
@ -144,10 +270,8 @@ export function Map({
map.target.fitBounds(layerToZoomBounds); map.target.fitBounds(layerToZoomBounds);
}} }}
> >
<TileLayer {tileLayerData.url && <TileLayer {...tileLayerData} />}
attribution='&copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors'
url="https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png"
/>
<LayersControl position="bottomright"> <LayersControl position="bottomright">
{layers.map((layer) => { {layers.map((layer) => {
const data = layersData.find( const data = layersData.find(
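
For reference, a minimal sketch of the two ways the tile layer introduced above can be configured (the import specifier, preset name, token, and data URL are illustrative assumptions; other layer options are omitted): via the new `tileLayerName`/`tileLayerOptions` props, or via `NEXT_PUBLIC_MAP_TILE_LAYER_*` environment variables picked up as defaults, with props taking precedence.

```tsx
import { Map } from '@portaljs/components'; // assumed export path

export function ExampleMap() {
  return (
    <Map
      // A "Provider" or "Provider.Variant" preset name, or 'custom' to use
      // tileLayerOptions.url directly.
      tileLayerName="MapBox"
      tileLayerOptions={{ accessToken: '<your-mapbox-token>' }}
      layers={[
        { data: { url: 'https://example.com/cities.geojson' }, name: 'Cities' },
      ]}
    />
  );
}

// Equivalent defaults via environment variables (names listed in the
// comment block of the diff above):
//   NEXT_PUBLIC_MAP_TILE_LAYER_NAME=MapBox
//   NEXT_PUBLIC_MAP_TILE_LAYER_OPTION_accessToken=<your-mapbox-token>
```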

File diff suppressed because it is too large.

View File

@ -8,7 +8,7 @@
type URL = string; // Just in case we want to transform it into an object with configurations type URL = string; // Just in case we want to transform it into an object with configurations
export interface Data { export interface Data {
url?: URL; url?: URL;
values?: { [key: string]: number | string }[]; values?: { [key: string]: number | string | null | undefined }[];
csv?: string; csv?: string;
} }

View File

@ -28,6 +28,6 @@ export const Normal: Story = {
data: { data: {
url: 'https://app.powerbi.com/view?r=eyJrIjoiYzBmN2Q2MzYtYzE3MS00ODkxLWE5OWMtZTQ2MjBlMDljMDk4IiwidCI6Ijk1M2IwZjgzLTFjZTYtNDVjMy04MmM5LTFkODQ3ZTM3MjMzOSIsImMiOjh9', url: 'https://app.powerbi.com/view?r=eyJrIjoiYzBmN2Q2MzYtYzE3MS00ODkxLWE5OWMtZTQ2MjBlMDljMDk4IiwidCI6Ijk1M2IwZjgzLTFjZTYtNDVjMy04MmM5LTFkODQ3ZTM3MjMzOSIsImMiOjh9',
}, },
style: { width: `100%`, height: `100%` }, style: { width: `100%`, height: `600px` },
}, },
}; };

View File

@ -4,6 +4,6 @@ import { Meta } from '@storybook/blocks';
# Welcome to the PortalJS components guide # Welcome to the PortalJS components guide
**Official Website:** [portaljs.org](https://portaljs.org) **Official Website:** [portaljs.com](https://portaljs.com)
**Docs:** [portaljs.org/docs](https://portaljs.org/docs) **Docs:** [portaljs.com/opensource](https://portaljs.com/opensource)
**GitHub:** [github.com/datopian/portaljs](https://github.com/datopian/portaljs) **GitHub:** [github.com/datopian/portaljs](https://github.com/datopian/portaljs)

View File

@ -30,11 +30,15 @@ Must be an object with one of the following properties: `url` or `values` \n\n \
}, },
yAxis: { yAxis: {
description: description:
'Name of the column header or object property that represents the Y-axis on the data.', 'Name of the column headers or object properties that represent the Y-axis on the data.',
}, },
yAxisType: { yAxisType: {
description: 'Type of the Y-axis', description: 'Type of the Y-axis',
}, },
symbol: {
description:
'Name of the column header or object property that represents a series for multiple series.',
},
}, },
}; };
@ -60,6 +64,51 @@ export const FromDataPoints: Story = {
}, },
}; };
export const MultiSeries: Story = {
name: 'Line chart with multiple series (specifying symbol)',
args: {
data: {
values: [
{ year: '1850', value: -0.41765878, z: 'A' },
{ year: '1851', value: -0.2333498, z: 'A' },
{ year: '1852', value: -0.22939907, z: 'A' },
{ year: '1853', value: -0.27035445, z: 'A' },
{ year: '1854', value: -0.29163003, z: 'A' },
{ year: '1850', value: -0.42993882, z: 'B' },
{ year: '1851', value: -0.30365549, z: 'B' },
{ year: '1852', value: -0.27905189, z: 'B' },
{ year: '1853', value: -0.22939704, z: 'B' },
{ year: '1854', value: -0.25688013, z: 'B' },
{ year: '1850', value: -0.4757164, z: 'C' },
{ year: '1851', value: -0.41971018, z: 'C' },
{ year: '1852', value: -0.40724799, z: 'C' },
{ year: '1853', value: -0.45049156, z: 'C' },
{ year: '1854', value: -0.41896583, z: 'C' },
],
},
xAxis: 'year',
yAxis: 'value',
symbol: 'z',
},
};
export const MultiColumns: Story = {
name: 'Line chart with multiple series (with multiple columns)',
args: {
data: {
values: [
{ year: '1850', A: -0.41765878, B: -0.42993882, C: -0.4757164 },
{ year: '1851', A: -0.2333498, B: -0.30365549, C: -0.41971018 },
{ year: '1852', A: -0.22939907, B: -0.27905189, C: -0.40724799 },
{ year: '1853', A: -0.27035445, B: -0.22939704, C: -0.45049156 },
{ year: '1854', A: -0.29163003, B: -0.25688013, C: -0.41896583 },
],
},
xAxis: 'year',
yAxis: ['A', 'B', 'C'],
},
};
export const FromURL: Story = { export const FromURL: Story = {
name: 'Line chart from URL', name: 'Line chart from URL',
args: { args: {
@ -71,3 +120,42 @@ export const FromURL: Story = {
yAxis: 'Price', yAxis: 'Price',
}, },
}; };
// export const FromURLMulti: Story = {
// name: 'Line chart from URL Multi Column',
// args: {
// data: {
// url: 'https://raw.githubusercontent.com/datasets/sea-level-rise/refs/heads/main/data/epa-sea-level.csv',
// },
// title: 'Sea Level Rise (1880-2023)',
// xAxis: 'Year',
// yAxis: ["CSIRO Adjusted Sea Level", "NOAA Adjusted Sea Level"],
// xAxisType: 'temporal',
// xAxisTimeUnit: 'year',
// yAxisType: 'quantitative'
// },
// };
// export const MultipleSeriesMissingValues: Story = {
// name: 'Line chart with missing values',
// args: {
// data: {
// values: [
// { year: '2020', seriesA: 10, seriesB: 15 },
// { year: '2021', seriesA: 20 }, // seriesB missing
// { year: '2022', seriesA: 15 }, // seriesB missing
// { year: '2023', seriesB: 30 }, // seriesA missing
// { year: '2024', seriesA: 25, seriesB: 35 },
// { year: '2024', seriesA: 20, seriesB: 40 },
// { year: '2024', seriesB: 45 },
// ],
// },
// title: 'Handling Missing Data Points',
// xAxis: 'year',
// yAxis: ['seriesA', 'seriesB'],
// xAxisType: 'temporal',
// xAxisTimeUnit: 'year',
// yAxisType: 'quantitative'
// },
// };

View File

@ -43,6 +43,10 @@ type Story = StoryObj<MapProps>;
export const GeoJSONPolygons: Story = { export const GeoJSONPolygons: Story = {
name: 'GeoJSON polygons map', name: 'GeoJSON polygons map',
args: { args: {
tileLayerName:'MapBox',
tileLayerOptions:{
accessToken : 'pk.eyJ1Ijoid2lsbHktcGFsbWFyZWpvIiwiYSI6ImNqNzk5NmRpNDFzb2cyeG9sc2luMHNjajUifQ.lkoVRFSI8hOLH4uJeOzwXw',
},
layers: [ layers: [
{ {
data: { data: {

View File

@ -53,7 +53,7 @@ export const Nav: React.FC<Props> = ({
<nav className="flex justify-between"> <nav className="flex justify-between">
{/* Mobile navigation */} {/* Mobile navigation */}
<div className="mr-2 sm:mr-4 flex lg:hidden"> <div className="mr-2 sm:mr-4 flex lg:hidden">
<NavMobile links={links}>{children}</NavMobile> <NavMobile {...{title, links, social, search, defaultTheme, themeToggleIcon}}>{children}</NavMobile>
</div> </div>
{/* Non-mobile navigation */} {/* Non-mobile navigation */}
<div className="flex flex-none items-center"> <div className="flex flex-none items-center">

View File

@ -4,20 +4,16 @@ import { useRouter } from "next/router.js";
import { useEffect, useState } from "react"; import { useEffect, useState } from "react";
import { SearchContext, SearchField } from "../Search"; import { SearchContext, SearchField } from "../Search";
import { MenuIcon, CloseIcon } from "../Icons"; import { MenuIcon, CloseIcon } from "../Icons";
import { NavLink, SearchProviderConfig } from "../types"; import type { NavConfig, ThemeConfig } from "./Nav";
interface Props extends React.PropsWithChildren { interface Props extends NavConfig, ThemeConfig, React.PropsWithChildren {}
author?: string;
links?: Array<NavLink>;
search?: SearchProviderConfig;
}
// TODO why mobile navigation only accepts author and regular nav accepts different things like title, logo, version // TODO: Search doesn't appear
export const NavMobile: React.FC<Props> = ({ export const NavMobile: React.FC<Props> = ({
children, children,
title,
links, links,
search, search,
author,
}) => { }) => {
const router = useRouter(); const router = useRouter();
const [isOpen, setIsOpen] = useState(false); const [isOpen, setIsOpen] = useState(false);
@ -77,8 +73,8 @@ export const NavMobile: React.FC<Props> = ({
legacyBehavior legacyBehavior
> >
{/* <Logomark className="h-9 w-9" /> */} {/* <Logomark className="h-9 w-9" /> */}
<div className="font-extrabold text-primary dark:text-primary-dark text-2xl ml-6"> <div className="font-extrabold text-primary dark:text-primary-dark text-lg ml-6">
{author} {title}
</div> </div>
</Link> </Link>
</div> </div>
@ -106,9 +102,7 @@ export const NavMobile: React.FC<Props> = ({
))} ))}
</ul> </ul>
)} )}
{/* <div className="pt-6 border border-t-2"> <div className="pt-6">{children}</div>
{children}
</div> */}
</Dialog.Panel> </Dialog.Panel>
</Dialog> </Dialog>
</> </>

View File

@ -1,42 +1,203 @@
import { toMarkdown } from "mdast-util-wiki-link"; import { isSupportedFileFormat } from './isSupportedFileFormat';
import { syntax, SyntaxOptions } from "./syntax";
import { fromMarkdown, FromMarkdownOptions } from "./fromMarkdown";
let warningIssued = false; const defaultWikiLinkResolver = (target: string) => {
// for [[#heading]] links
if (!target) {
return [];
}
let permalink = target.replace(/\/index$/, '');
// TODO what to do with [[index]] link?
if (permalink.length === 0) {
permalink = '/';
}
return [permalink];
};
type RemarkWikiLinkOptions = FromMarkdownOptions & SyntaxOptions; export interface FromMarkdownOptions {
pathFormat?:
| 'raw' // default; use for regular relative or absolute paths
| 'obsidian-absolute' // use for Obsidian-style absolute paths (with no leading slash)
| 'obsidian-short'; // use for Obsidian-style shortened paths (shortest path possible)
permalinks?: string[]; // list of permalinks to match possible permalinks of a wiki link against
wikiLinkResolver?: (name: string) => string[]; // function to resolve wiki links to an array of possible permalinks
newClassName?: string; // class name to add to links that don't have a matching permalink
wikiLinkClassName?: string; // class name to add to all wiki links
hrefTemplate?: (permalink: string) => string; // function to generate the href attribute of a link
}
function remarkWikiLink(opts: RemarkWikiLinkOptions = {}) { export function getImageSize(size: string) {
const data = this.data(); // this is a reference to the processor // eslint-disable-next-line prefer-const
let [width, height] = size.split('x');
function add(field, value) { if (!height) height = width;
if (data[field]) data[field].push(value);
else data[field] = [value]; return { width, height };
}
// mdas-util-from-markdown extension
// https://github.com/syntax-tree/mdast-util-from-markdown#extension
function fromMarkdown(opts: FromMarkdownOptions = {}) {
const pathFormat = opts.pathFormat || 'raw';
const permalinks = opts.permalinks || [];
const wikiLinkResolver = opts.wikiLinkResolver || defaultWikiLinkResolver;
const newClassName = opts.newClassName || 'new';
const wikiLinkClassName = opts.wikiLinkClassName || 'internal';
const defaultHrefTemplate = (permalink: string) => permalink;
const hrefTemplate = opts.hrefTemplate || defaultHrefTemplate;
function top(stack) {
return stack[stack.length - 1];
} }
if ( function enterWikiLink(token) {
!warningIssued && this.enter(
((this.Parser && {
this.Parser.prototype && type: 'wikiLink',
this.Parser.prototype.blockTokenizers) || data: {
(this.Compiler && isEmbed: token.isType === 'embed',
this.Compiler.prototype && target: null, // the target of the link, e.g. "Foo Bar#Heading" in "[[Foo Bar#Heading]]"
this.Compiler.prototype.visitors)) alias: null, // the alias of the link, e.g. "Foo" in "[[Foo Bar|Foo]]"
) { permalink: null, // TODO shouldn't this be named just "link"?
warningIssued = true; exists: null, // TODO is this even needed here?
console.warn( // fields for mdast-util-to-hast (used e.g. by remark-rehype)
"[remark-wiki-link] Warning: please upgrade to remark 13 to use this plugin" hName: null,
hProperties: null,
hChildren: null,
},
},
token
); );
} }
// add extensions to packages used by remark-parse function exitWikiLinkTarget(token) {
// micromark extensions const target = this.sliceSerialize(token);
add("micromarkExtensions", syntax(opts)); const current = top(this.stack);
// mdast-util-from-markdown extensions current.data.target = target;
add("fromMarkdownExtensions", fromMarkdown(opts)); }
// mdast-util-to-markdown extensions
add("toMarkdownExtensions", toMarkdown(opts)); function exitWikiLinkAlias(token) {
const alias = this.sliceSerialize(token);
const current = top(this.stack);
current.data.alias = alias;
}
function exitWikiLink(token) {
const wikiLink = top(this.stack)
const {
data: {isEmbed, target, alias},
} = wikiLink;
this.exit(token);
// eslint-disable-next-line no-useless-escape
const wikiLinkWithHeadingPattern = /^(.*?)(#.*)?$/u;
const [, path, heading = ''] = target.match(wikiLinkWithHeadingPattern);
const possibleWikiLinkPermalinks = wikiLinkResolver(path);
const matchingPermalink = permalinks.find((e) => {
return possibleWikiLinkPermalinks.find((p) => {
if (pathFormat === 'obsidian-short') {
if (e === p || e.endsWith(p)) {
return true;
}
} else if (pathFormat === 'obsidian-absolute') {
if (e === '/' + p) {
return true;
}
} else {
if (e === p) {
return true;
}
}
return false;
});
});
// TODO this is ugly
const link =
matchingPermalink ||
(pathFormat === 'obsidian-absolute'
? '/' + possibleWikiLinkPermalinks[0]
: possibleWikiLinkPermalinks[0]) ||
'';
wikiLink.data.exists = !!matchingPermalink;
wikiLink.data.permalink = link;
// remove leading # if the target is a heading on the same page
const displayName = alias || target.replace(/^#/, '');
const headingId = heading.replace(/\s+/g, '-').toLowerCase();
let classNames = wikiLinkClassName;
if (!matchingPermalink) {
classNames += ' ' + newClassName;
}
if (isEmbed) {
const [isSupportedFormat, format] = isSupportedFileFormat(target);
if (!isSupportedFormat) {
// Temporarily render note transclusion as a regular wiki link
if (!format) {
wikiLink.data.hName = 'a';
wikiLink.data.hProperties = {
className: classNames + ' ' + 'transclusion',
href: hrefTemplate(link) + headingId,
};
wikiLink.data.hChildren = [{ type: 'text', value: displayName }];
} else {
wikiLink.data.hName = 'p';
wikiLink.data.hChildren = [
{
type: 'text',
value: `![[${target}]]`,
},
];
}
} else if (format === 'pdf') {
wikiLink.data.hName = 'iframe';
wikiLink.data.hProperties = {
className: classNames,
width: '100%',
src: `${hrefTemplate(link)}#toolbar=0`,
};
} else {
const hasDimensions = alias && /^\d+(x\d+)?$/.test(alias);
// Take the target as alt text except if alt name was provided [[target|alt text]]
const altText = hasDimensions || !alias ? target : alias;
wikiLink.data.hName = 'img';
wikiLink.data.hProperties = {
className: classNames,
src: hrefTemplate(link),
alt: altText
};
if (hasDimensions) {
const { width, height } = getImageSize(alias as string);
Object.assign(wikiLink.data.hProperties, {
width,
height,
});
}
}
} else {
wikiLink.data.hName = 'a';
wikiLink.data.hProperties = {
className: classNames,
href: hrefTemplate(link) + headingId,
};
wikiLink.data.hChildren = [{ type: 'text', value: displayName }];
}
}
return {
enter: {
wikiLink: enterWikiLink,
},
exit: {
wikiLinkTarget: exitWikiLinkTarget,
wikiLinkAlias: exitWikiLinkAlias,
wikiLink: exitWikiLink,
},
};
} }
export default remarkWikiLink; export { fromMarkdown };
export { remarkWikiLink };
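
To tie the options documented above together, a minimal usage sketch in the style of the package's tests (the import specifiers and option values are illustrative assumptions): `permalinks` lists known pages, `pathFormat` selects how wiki-link targets are interpreted, and `hrefTemplate` maps the resolved permalink to the final `href`.

```ts
import { unified } from 'unified';
import markdown from 'remark-parse';
import wikiLinkPlugin from '@portaljs/remark-wiki-link'; // assumed specifier

const processor = unified().use(markdown).use(wikiLinkPlugin, {
  pathFormat: 'obsidian-short',     // or 'raw' (default) / 'obsidian-absolute'
  permalinks: ['/blog/first-post'], // known pages; targets with no match get the "new" class
  wikiLinkClassName: 'internal',
  newClassName: 'new',
  hrefTemplate: (permalink: string) => permalink,
});

// "[[first-post|Read this]]" resolves against the permalink list, so the
// resulting wikiLink node renders as:
//   <a class="internal" href="/blog/first-post">Read this</a>
processor.runSync(processor.parse('[[first-post|Read this]]'));
```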

View File

@ -38,6 +38,5 @@ const defaultPathToPermalinkFunc = (
.replace(markdownFolder, "") // make the permalink relative to the markdown folder .replace(markdownFolder, "") // make the permalink relative to the markdown folder
.replace(/\.(mdx|md)/, "") .replace(/\.(mdx|md)/, "")
.replace(/\\/g, "/") // replace windows backslash with forward slash .replace(/\\/g, "/") // replace windows backslash with forward slash
.replace(/\/index$/, ""); // remove index from the end of the permalink
return permalink.length > 0 ? permalink : "/"; // for home page return permalink.length > 0 ? permalink : "/"; // for home page
}; };
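
A minimal sketch of what this removal changes in the default path-to-permalink conversion (folder and file names are illustrative): a trailing `index` segment is now kept as a regular file name, and it is up to the consuming app to decide how to interpret such paths.

```ts
// Mirrors the default converter shown in the diff above, after the change.
const pathToPermalink = (filePath: string, markdownFolder: string) => {
  const permalink = filePath
    .replace(markdownFolder, '') // make the permalink relative to the markdown folder
    .replace(/\.(mdx|md)/, '')
    .replace(/\\/g, '/');        // replace windows backslashes with forward slashes
  return permalink.length > 0 ? permalink : '/'; // for the home page
};

// Before this change: 'content/blog/index.md' -> '/blog'
// After this change:  'content/blog/index.md' -> '/blog/index'
console.log(pathToPermalink('content/blog/index.md', 'content'));
```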

View File

@ -1,9 +1,6 @@
import * as path from "path"; import * as path from "path";
// import * as url from "url";
import { getPermalinks } from "../src/utils"; import { getPermalinks } from "../src/utils";
// const __dirname = url.fileURLToPath(new URL(".", import.meta.url));
// const markdownFolder = path.join(__dirname, "/fixtures/content");
const markdownFolder = path.join( const markdownFolder = path.join(
".", ".",
"test/fixtures/content" "test/fixtures/content"
@ -12,12 +9,12 @@ const markdownFolder = path.join(
describe("getPermalinks", () => { describe("getPermalinks", () => {
test("should return an array of permalinks", () => { test("should return an array of permalinks", () => {
const expectedPermalinks = [ const expectedPermalinks = [
"/", // /index.md "/README",
"/abc", "/abc",
"/blog/first-post", "/blog/first-post",
"/blog/Second Post", "/blog/Second Post",
"/blog/third-post", "/blog/third-post",
"/blog", // /blog/index.md "/blog/README",
"/blog/tutorials/first-tutorial", "/blog/tutorials/first-tutorial",
"/assets/Pasted Image 123.png", "/assets/Pasted Image 123.png",
]; ];
@ -28,35 +25,4 @@ describe("getPermalinks", () => {
expect(expectedPermalinks).toContain(permalink); expect(expectedPermalinks).toContain(permalink);
}); });
}); });
test("should return an array of permalinks with custom path -> permalink converter function", () => {
const expectedPermalinks = [
"/", // /index.md
"/abc",
"/blog/first-post",
"/blog/second-post",
"/blog/third-post",
"/blog", // /blog/index.md
"/blog/tutorials/first-tutorial",
"/assets/pasted-image-123.png",
];
const func = (filePath: string, markdownFolder: string) => {
const permalink = filePath
.replace(markdownFolder, "") // make the permalink relative to the markdown folder
.replace(/\.(mdx|md)/, "")
.replace(/\\/g, "/") // replace windows backslash with forward slash
.replace(/\/index$/, "") // remove index from the end of the permalink
.replace(/ /g, "-") // replace spaces with hyphens
.toLowerCase(); // convert to lowercase
return permalink.length > 0 ? permalink : "/"; // for home page
};
const permalinks = getPermalinks(markdownFolder, [/\.DS_Store/], func);
expect(permalinks).toHaveLength(expectedPermalinks.length);
permalinks.forEach((permalink) => {
expect(expectedPermalinks).toContain(permalink);
});
});
}); });

View File

@ -286,56 +286,6 @@ describe("micromark-extension-wiki-link", () => {
}); });
}); });
test("parses wiki links to index files", () => {
const serialized = micromark("[[/some/folder/index]]", "ascii", {
extensions: [syntax()],
htmlExtensions: [html() as any], // TODO type fix
});
expect(serialized).toBe(
'<p><a href="/some/folder" class="internal new">/some/folder/index</a></p>'
);
});
describe("other", () => {
test("parses a wiki link to some index page in a folder with no matching permalink", () => {
const serialized = micromark("[[/some/folder/index]]", "ascii", {
extensions: [syntax()],
htmlExtensions: [html() as any], // TODO type fix
});
expect(serialized).toBe(
'<p><a href="/some/folder" class="internal new">/some/folder/index</a></p>'
);
});
test("parses a wiki link to some index page in a folder with a matching permalink", () => {
const serialized = micromark("[[/some/folder/index]]", "ascii", {
extensions: [syntax()],
htmlExtensions: [html({ permalinks: ["/some/folder"] }) as any], // TODO type fix
});
expect(serialized).toBe(
'<p><a href="/some/folder" class="internal">/some/folder/index</a></p>'
);
});
test("parses a wiki link to home index page with no matching permalink", () => {
const serialized = micromark("[[/index]]", "ascii", {
extensions: [syntax()],
htmlExtensions: [html() as any], // TODO type fix
});
expect(serialized).toBe(
'<p><a href="/" class="internal new">/index</a></p>'
);
});
test("parses a wiki link to home index page with a matching permalink", () => {
const serialized = micromark("[[/index]]", "ascii", {
extensions: [syntax()],
htmlExtensions: [html({ permalinks: ["/"] }) as any], // TODO type fix
});
expect(serialized).toBe('<p><a href="/" class="internal">/index</a></p>');
});
});
describe("transclusions", () => { describe("transclusions", () => {
test("parsers a transclusion as a regular wiki link", () => { test("parsers a transclusion as a regular wiki link", () => {
const serialized = micromark("![[Some Page]]", "ascii", { const serialized = micromark("![[Some Page]]", "ascii", {

View File

@ -485,109 +485,6 @@ describe("remark-wiki-link", () => {
}); });
}); });
test("parses wiki links to index files", () => {
const processor = unified().use(markdown).use(wikiLinkPlugin);
let ast = processor.parse("[[/some/folder/index]]");
ast = processor.runSync(ast);
expect(select("wikiLink", ast)).not.toEqual(null);
visit(ast, "wikiLink", (node: Node) => {
expect(node.data?.exists).toEqual(false);
expect(node.data?.permalink).toEqual("/some/folder");
expect(node.data?.alias).toEqual(null);
expect(node.data?.hName).toEqual("a");
expect((node.data?.hProperties as any).className).toEqual("internal new");
expect((node.data?.hProperties as any).href).toEqual("/some/folder");
expect((node.data?.hChildren as any)[0].value).toEqual(
"/some/folder/index"
);
});
});
describe("other", () => {
test("parses a wiki link to some index page in a folder with no matching permalink", () => {
const processor = unified().use(markdown).use(wikiLinkPlugin);
let ast = processor.parse("[[/some/folder/index]]");
ast = processor.runSync(ast);
visit(ast, "wikiLink", (node: Node) => {
expect(node.data?.exists).toEqual(false);
expect(node.data?.permalink).toEqual("/some/folder");
expect(node.data?.alias).toEqual(null);
expect(node.data?.hName).toEqual("a");
expect((node.data?.hProperties as any).className).toEqual(
"internal new"
);
expect((node.data?.hProperties as any).href).toEqual("/some/folder");
expect((node.data?.hChildren as any)[0].value).toEqual(
"/some/folder/index"
);
});
});
test("parses a wiki link to some index page in a folder with a matching permalink", () => {
const processor = unified()
.use(markdown)
.use(wikiLinkPlugin, { permalinks: ["/some/folder"] });
let ast = processor.parse("[[/some/folder/index]]");
ast = processor.runSync(ast);
visit(ast, "wikiLink", (node: Node) => {
expect(node.data?.exists).toEqual(true);
expect(node.data?.permalink).toEqual("/some/folder");
expect(node.data?.alias).toEqual(null);
expect(node.data?.hName).toEqual("a");
expect((node.data?.hProperties as any).className).toEqual("internal");
expect((node.data?.hProperties as any).href).toEqual("/some/folder");
expect((node.data?.hChildren as any)[0].value).toEqual(
"/some/folder/index"
);
});
});
test("parses a wiki link to home index page with no matching permalink", () => {
const processor = unified().use(markdown).use(wikiLinkPlugin);
let ast = processor.parse("[[/index]]");
ast = processor.runSync(ast);
visit(ast, "wikiLink", (node: Node) => {
expect(node.data?.exists).toEqual(false);
expect(node.data?.permalink).toEqual("/");
expect(node.data?.alias).toEqual(null);
expect(node.data?.hName).toEqual("a");
expect((node.data?.hProperties as any).className).toEqual(
"internal new"
);
expect((node.data?.hProperties as any).href).toEqual("/");
expect((node.data?.hChildren as any)[0].value).toEqual("/index");
});
});
test("parses a wiki link to home index page with a matching permalink", () => {
const processor = unified()
.use(markdown)
.use(wikiLinkPlugin, { permalinks: ["/"] });
let ast = processor.parse("[[/index]]");
ast = processor.runSync(ast);
visit(ast, "wikiLink", (node: Node) => {
expect(node.data?.exists).toEqual(true);
expect(node.data?.permalink).toEqual("/");
expect(node.data?.alias).toEqual(null);
expect(node.data?.hName).toEqual("a");
expect((node.data?.hProperties as any).className).toEqual("internal");
expect((node.data?.hProperties as any).href).toEqual("/");
expect((node.data?.hChildren as any)[0].value).toEqual("/index");
});
});
});
describe("transclusions", () => { describe("transclusions", () => {
test("replaces a transclusion with a regular wiki link", () => { test("replaces a transclusion with a regular wiki link", () => {
const processor = unified().use(markdown).use(wikiLinkPlugin); const processor = unified().use(markdown).use(wikiLinkPlugin);

View File

@ -12,7 +12,7 @@ export default function JSONLD({
return <></>; return <></>;
} }
const baseUrl = process.env.NEXT_PUBLIC_SITE_URL || 'https://portaljs.org'; const baseUrl = process.env.NEXT_PUBLIC_SITE_URL || 'https://portaljs.com';
const pageUrl = `${baseUrl}/${meta.urlPath}`; const pageUrl = `${baseUrl}/${meta.urlPath}`;
const imageMatches = source.match( const imageMatches = source.match(

View File

@ -81,7 +81,6 @@ export default function Layout({
} }
return section.children.findIndex(isActive) > -1; return section.children.findIndex(isActive) > -1;
} }
return ( return (
<> <>
{title && <NextSeo title={title} description={description} />} {title && <NextSeo title={title} description={description} />}

View File

@@ -22,11 +22,41 @@ const items = [
    sourceUrl: 'https://github.com/FCSCOpendata/frontend',
  },
  {
-    title: 'Datahub Open Data',
-    href: 'https://opendata.datahub.io/',
-    image: '/images/showcases/datahub.webp',
-    description: 'Demo Data Portal by DataHub',
+    title: 'Frictionless Data',
+    href: 'https://datahub.io/core/co2-ppm',
+    repository: 'https://github.com/datopian/datahub/tree/main/examples/dataset-frictionless',
+    image: '/images/showcases/frictionless-capture.png',
+    description: 'Progressive open-source framework for building data infrastructure - data management, data integration, data flows, etc. It includes various data standards and provides software to work with data.',
  },
+  {
+    title: "OpenSpending",
+    image: "/images/showcases/openspending.png",
+    href: "https://www.openspending.org",
+    repository: 'https://github.com/datopian/datahub/tree/main/examples/openspending',
+    description: "OpenSpending is a free, open and global platform to search, visualise and analyse fiscal data in the public sphere."
+  },
+  {
+    title: "FiveThirtyEight",
+    image: "/images/showcases/fivethirtyeight.png",
+    href: "https://fivethirtyeight.portaljs.org/",
+    repository: 'https://github.com/datopian/datahub/tree/main/examples/fivethirtyeight',
+    description: "This is a replica of data.fivethirtyeight.com using PortalJS."
+  },
+  {
+    title: "Github Datasets",
+    image: "/images/showcases/github-datasets.png",
+    href: "https://example.portaljs.org/",
+    repository: 'https://github.com/datopian/datahub/tree/main/examples/github-backed-catalog',
+    description: "A simple data catalog that get its data from a list of GitHub repos that serve as datasets."
+  },
+  {
+    title: "Hatespeech Data",
+    image: "/images/showcases/turing.png",
+    href: "https://hatespeechdata.com/",
+    repository: 'https://github.com/datopian/datahub/tree/main/examples/turing',
+    description: "Datasets annotated for hate speech, online abuse, and offensive language which are useful for training a natural language processing system to detect this online abuse."
+  },
];
export default function Showcases() {


@@ -1,10 +1,6 @@
export default function ShowcasesItem({ item }) {
  return (
-    <a
-      className="rounded overflow-hidden group relative border-1 shadow-lg"
-      target="_blank"
-      href={item.href}
-    >
+    <div className="rounded overflow-hidden group relative border-1 shadow-lg">
      <div
        className="bg-cover bg-no-repeat bg-top aspect-video w-full group-hover:blur-sm group-hover:scale-105 transition-all duration-200"
        style={{ backgroundImage: `url(${item.image})` }}
@@ -16,9 +12,48 @@ export default function ShowcasesItem({ item }) {
        <div className="text-center text-primary-dark">
          <span className="text-xl font-semibold">{item.title}</span>
          <p className="text-base font-medium">{item.description}</p>
-        </div>
-      </div>
-    </div>
-  </a>
+          <div className="flex justify-center mt-2 gap-2 ">
+            {item.href && (
+              <a
+                target="_blank"
+                className=" text-white w-8 h-8 p-1 bg-primary rounded-full hover:scale-110 transition cursor-pointer z-50"
+                rel="noreferrer"
+                href={item.href}
+              >
+                <svg
+                  xmlns="http://www.w3.org/2000/svg"
+                  viewBox="0 0 420 420"
+                  stroke="white"
+                  fill="none"
+                >
+                  <path stroke-width="26" d="M209,15a195,195 0 1,0 2,0z" />
+                  <path
+                    stroke-width="18"
+                    d="m210,15v390m195-195H15M59,90a260,260 0 0,0 302,0 m0,240 a260,260 0 0,0-302,0M195,20a250,250 0 0,0 0,382 m30,0 a250,250 0 0,0 0-382"
+                  />
+                </svg>
+              </a>
+            )}
+            {item.repository && (
+              <a
+                target="_blank"
+                rel="noreferrer"
+                className="w-8 h-8 bg-black rounded-full p-1 hover:scale-110 transition cursor-pointer z-50"
+                href={item.repository}
+              >
+                <svg
+                  aria-hidden="true"
+                  viewBox="0 0 16 16"
+                  fill="currentColor"
+                >
+                  <path d="M8 0C3.58 0 0 3.58 0 8C0 11.54 2.29 14.53 5.47 15.59C5.87 15.66 6.02 15.42 6.02 15.21C6.02 15.02 6.01 14.39 6.01 13.72C4 14.09 3.48 13.23 3.32 12.78C3.23 12.55 2.84 11.84 2.5 11.65C2.22 11.5 1.82 11.13 2.49 11.12C3.12 11.11 3.57 11.7 3.72 11.94C4.44 13.15 5.59 12.81 6.05 12.6C6.12 12.08 6.33 11.73 6.56 11.53C4.78 11.33 2.92 10.64 2.92 7.58C2.92 6.71 3.23 5.99 3.74 5.43C3.66 5.23 3.38 4.41 3.82 3.31C3.82 3.31 4.49 3.1 6.02 4.13C6.66 3.95 7.34 3.86 8.02 3.86C8.7 3.86 9.38 3.95 10.02 4.13C11.55 3.09 12.22 3.31 12.22 3.31C12.66 4.41 12.38 5.23 12.3 5.43C12.81 5.99 13.12 6.7 13.12 7.58C13.12 10.65 11.25 11.33 9.47 11.53C9.76 11.78 10.01 12.26 10.01 13.01C10.01 14.08 10 14.94 10 15.21C10 15.42 10.15 15.67 10.55 15.59C13.71 14.53 16 11.53 16 8C16 3.58 12.42 0 8 0Z" />
+                </svg>
+              </a>
+            )}
+          </div>
+        </div>
+      </div>
+    </div>
+  </div>
  );
}


@@ -7,17 +7,17 @@ filetype: 'blog'
This post walks you though adding maps and geospatial visualizations to PortalJS.
-Are you interested in building rich and interactive data portals? Do you find value in the power and flexibility of JavaScript, Nextjs, and React? If so, [PortalJS](https://portaljs.org/) is for you. It's a state-of-the-art framework leveraging these technologies to help you build rich data portals.
+Are you interested in building rich and interactive data portals? Do you find value in the power and flexibility of JavaScript, Nextjs, and React? If so, [PortalJS](https://portaljs.com/) is for you. It's a state-of-the-art framework leveraging these technologies to help you build rich data portals.
-Effective data visualization lies in the use of various data components. Within [PortalJS](https://portaljs.org/), we take data visualization a step further. It's not just about displaying data - it's about telling a story through combining a variety of data components.
+Effective data visualization lies in the use of various data components. Within [PortalJS](https://portaljs.com/), we take data visualization a step further. It's not just about displaying data - it's about telling a story through combining a variety of data components.
In this post we will share our latest enhancement to PortalJS: maps, a powerful tool for visualizing geospatial data. In this post, we will to take you on a tour of our experiments and progress in enhancing map functionalities on PortalJS. The journey is still in its early stages, with new facets being unveiled and refined as we perfect our API.
## Exploring Map Formats
-Maps play a crucial role in geospatial data visualization. Several formats exist for storing and sharing this type of data, with GeoJSON, KML, and shapefiles being among the most popular. As a prominent figure in the field of open-source data portal platforms, [PortalJS](https://portaljs.org/) strives to support as many map formats as possible.
+Maps play a crucial role in geospatial data visualization. Several formats exist for storing and sharing this type of data, with GeoJSON, KML, and shapefiles being among the most popular. As a prominent figure in the field of open-source data portal platforms, [PortalJS](https://portaljs.com/) strives to support as many map formats as possible.
-Taking inspiration from the ckanext-geoview extension, we currently support KML and GeoJSON formats in [PortalJS](https://portaljs.org/). This remarkable extension is a plugin for CKAN, the worlds leading open source data management system, that enables users to visualize geospatial data in diverse formats on an interactive map. Apart from KML and GeoJSON formats support, our roadmap entails extending compatibility to encompass all other formats supported by ckanext-geoview. Rest assured, we are committed to empowering users with a wide array of map format options in the future.
+Taking inspiration from the ckanext-geoview extension, we currently support KML and GeoJSON formats in [PortalJS](https://portaljs.com/). This remarkable extension is a plugin for CKAN, the worlds leading open source data management system, that enables users to visualize geospatial data in diverse formats on an interactive map. Apart from KML and GeoJSON formats support, our roadmap entails extending compatibility to encompass all other formats supported by ckanext-geoview. Rest assured, we are committed to empowering users with a wide array of map format options in the future.
So, what makes these formats special?
@@ -27,7 +27,7 @@ So, what makes these formats special?
## Unveiling the Power of Leaflet and OpenLayers
-To display maps in [PortalJS](https://portaljs.org/), we utilize two powerful JavaScript libraries for creating interactive maps based on different layers: Leaflet and OpenLayers. Each offers distinct advantages (and disadvantages), inspiring us to integrate both and give users the flexibility to choose.
+To display maps in [PortalJS](https://portaljs.com/), we utilize two powerful JavaScript libraries for creating interactive maps based on different layers: Leaflet and OpenLayers. Each offers distinct advantages (and disadvantages), inspiring us to integrate both and give users the flexibility to choose.
Leaflet is the leading open-source JavaScript library known for its mobile-friendly, interactive maps. With its compact size (just 42 KB of JS), it provides all the map features most developers need. Leaflet is designed with simplicity, performance and usability in mind. It works efficiently across all major desktop and mobile platforms.
@@ -59,8 +59,8 @@ Users can also choose a region of focus, which will depend on the data, by setti
Through our ongoing enhancements to the [PortalJS library](https://storybook.portaljs.org/), we aim to empower users to create engaging and informative data portals featuring diverse map formats and data components.
-Why not give [PortalJS](https://portaljs.org/) a try today and discover the possibilities for your own data portals? To get started, check out our comprehensive documentation here: [PortalJS Documentation](https://portaljs.org/docs).
+Why not give [PortalJS](https://portaljs.com/) a try today and discover the possibilities for your own data portals? To get started, check out our comprehensive documentation here: [PortalJS Documentation](https://portaljs.com/opensource).
-Have questions or comments about using [PortalJS](https://portaljs.org/) for your data portals? Feel free to share your thoughts on our [Discord channel](https://discord.com/invite/EeyfGrGu4U). We're here to help you make the most of your data.
+Have questions or comments about using [PortalJS](https://portaljs.com/) for your data portals? Feel free to share your thoughts on our [Discord channel](https://discord.com/invite/EeyfGrGu4U). We're here to help you make the most of your data.
-Stay tuned for more exciting developments as we continue to enhance [PortalJS](https://portaljs.org/)!
+Stay tuned for more exciting developments as we continue to enhance [PortalJS](https://portaljs.com/)!


@@ -4,7 +4,7 @@ authors: ['Luccas Mateus']
date: 2021-04-20
---
-We have created a full data portal demo using PortalJS all backed by a CKAN instance storing data and metadata, you can see below a screenshot of the homepage and of an individual dataset page.
+We have created a full data portal demo using DataHub PortalJS all backed by a CKAN instance storing data and metadata, you can see below a screenshot of the homepage and of an individual dataset page.
![](https://i.imgur.com/ai0VLS4.png)
![](https://i.imgur.com/3RhXOW4.png)
@@ -14,7 +14,7 @@ We have created a full data portal demo using PortalJS all backed by a CKAN inst
To create a Portal app, run the following command in your terminal:
```console
-npx create-next-app -e https://github.com/datopian/portaljs/tree/main/examples/ckan
+npx create-next-app -e https://github.com/datopian/datahub/tree/main/examples/ckan
```
> NB: Under the hood, this uses the tool called create-next-app, which bootstraps an app for you based on our CKAN example.


@@ -30,12 +30,12 @@ https://github.com/datopian/markdowndb
## 📚 The Guide
-https://portaljs.org/guide
+https://portaljs.com/opensource
I've sketched overviews for two upcoming tutorials:
-1. **Collaborating with others on your website**: Learn how to make your website projects a team effort. [See it here](https://portaljs.org/guide#tutorial-3-collaborating-with-others-on-your-website-project)
+1. **Collaborating with others on your website**: Learn how to make your website projects a team effort. [See it here](https://portaljs.com/guide#tutorial-3-collaborating-with-others-on-your-website-project)
-2. **Customising your website and previewing your changes locally**: Customize and preview your site changes locally, without headaches. [See it here](https://portaljs.org/guide#tutorial-4-customising-your-website-locally-and-previewing-your-changes-locally)
+2. **Customising your website and previewing your changes locally**: Customize and preview your site changes locally, without headaches. [See it here](https://portaljs.com/guide#tutorial-4-customising-your-website-locally-and-previewing-your-changes-locally)
## 🌐 LifeItself.org


@@ -11,7 +11,7 @@ In our last article, we explored [the Open Spending revamp](https://www.datopian
## The Core: PortalJS
-At the core of the revamped OpenSpending website is [PortalJS](https://portaljs.org), a JavaScript library that's a game-changer in building powerful data portals with data visualizations. What makes it so special? Well, it's packed with reusable React components that make our lives - and yours - a whole lot easier. Take, for example, our sleek CSV previews; they're brought to life by PortalJS' [FlatUI Component](https://storybook.portaljs.org/?path=/story/components-flatuitable--from-url). It helps transform raw numbers into visuals that you can easily understand and use. Curious to know more? Check out the [official PortalJS website](https://portaljs.org).
+At the core of the revamped OpenSpending website is [PortalJS](https://portaljs.com), a JavaScript library that's a game-changer in building powerful data portals with data visualizations. What makes it so special? Well, it's packed with reusable React components that make our lives - and yours - a whole lot easier. Take, for example, our sleek CSV previews; they're brought to life by PortalJS' [FlatUI Component](https://storybook.portaljs.org/?path=/story/components-flatuitable--from-url). It helps transform raw numbers into visuals that you can easily understand and use. Curious to know more? Check out the [official PortalJS website](https://portaljs.com).
![Data visualization](/assets/blog/2023-10-13-the-open-spending-revamp-behind-the-scenes/data-visualization.png)


@@ -11,19 +11,18 @@ const config = {
  authorUrl: 'https://datopian.com/',
  navbarTitle: {
    // logo: "/images/logo.svg",
-    text: '🌀 PortalJS',
+    text: '🌀 DataHub PortalJS',
    // version: "Alpha",
  },
  navLinks: [
    { name: 'Docs', href: '/docs' },
    // { name: "Components", href: "/docs/components" },
    { name: 'Blog', href: '/blog' },
-    { name: 'Showcases', href: '/#showcases' },
    { name: 'Howtos', href: '/howtos' },
    { name: 'Guide', href: '/guide' },
    {
-      name: 'Examples',
-      href: '/examples/'
+      name: 'Showcases',
+      href: '/showcases/'
    },
    {
      name: 'Components',
@@ -45,6 +44,7 @@ const config = {
    { rel: 'icon', href: '/favicon.ico' },
    { rel: 'apple-touch-icon', href: '/icon.png', sizes: '120x120' },
  ],
+  canonical: 'https://portaljs.com/',
  openGraph: {
    type: 'website',
    title:
@@ -68,8 +68,8 @@ const config = {
    cardType: 'summary_large_image',
    },
  },
-  github: 'https://github.com/datopian/portaljs',
-  discord: 'https://discord.gg/xfFDMPU9dC',
+  github: 'https://github.com/datopian/datahub',
+  discord: 'https://discord.gg/KrRzMKU',
  tableOfContents: true,
  analytics: 'G-96GWZHMH57',
  // editLinkShow: true,


@@ -1,249 +0,0 @@
# Authentication
## Introduction
The core function of authentication is to **Identify** Users of the Portal (in a federated way) so we can base access on their identity.
There are 3 major conceptual components: Identity, Accounts and Sessions which come together in the following stages:
* **Root Identity Determination:** determine identity, often via delegation
* **Sessions:** persist the identity in the web application in a secure way, without a new identity determination on each request (I don't want to have to log in via a third-party service every time)
* **Account (aka profile):** store related account/profile information in our application (not in the third-party identity), e.g. email, name and other preferences
  * This will usually get auto-created at first identification
  * In the limited case this can be seen as a cache of info from the identity system (e.g. your email)
  * However, it is often richer, app-specific information that is generated over time (relevant for personalization)
### Root Identity Determination options :key:
The identity determination can be done in multiple ways. In this article we consider the following three options, which we believe are widely used:
- Password authentication - traditional username and password pair
- Single Sign-on (SSO) via protocols such as OAuth, SAML, OpenID Connect
- One-time password (OTP) via email or SMS (aka passwordless connection)
#### Password authentication
The traditional way of authenticating users. When signing up, the user provides at least a username and password pair, which is then stored in a database for future authentication. Normally, additional information such as an email address, full name, etc. is also requested at registration.
Examples of password authentication in popular services:
- GitHub - https://github.com/join
- GitLab - https://gitlab.com/users/sign_up
- NPM - https://www.npmjs.com/signup
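For illustration only, a minimal sketch of the storage side of password authentication using just the Python standard library; the function names and flow are hypothetical, not taken from CKAN:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted hash suitable for storing in a users table (never store the raw password)."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Re-derive the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

# Sign-up: store (username, email, salt, digest) in the database.
salt, digest = hash_password("correct horse battery staple")
# Login: look up the user's salt and digest, then verify the submitted password.
assert verify_password("correct horse battery staple", salt, digest)
```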
#### Single Sign-on (SSO)
A way of delegating the identity determination process to a third-party service. Normally, popular social network services are used, e.g. Google, Facebook, Twitter, etc. SSO implementations can be built on the OAuth or SAML protocols. In addition, there is the OpenID Connect protocol, which is an extension of OAuth 2.0.
- OAuth
- JWT based
- JSON based
- 'webby'
- SAML
- XML based
- SOAP based
- 'enterprisey'
List of OAuth providers:
https://en.wikipedia.org/wiki/List_of_OAuth_providers
Examples of SSO in popular projects:
- https://datahub.io/login
- https://vercel.com/signup
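To make the delegation concrete, here is a rough sketch of the OAuth 2.0 authorization code flow from the portal's side; the endpoints, client credentials and redirect URI below are placeholders, not a real provider configuration:

```python
import secrets
from urllib.parse import urlencode

import requests  # any HTTP client would do

# Placeholder configuration for a hypothetical provider.
AUTHORIZE_URL = "https://provider.example.com/oauth/authorize"
TOKEN_URL = "https://provider.example.com/oauth/token"
CLIENT_ID = "portal-client-id"
CLIENT_SECRET = "portal-client-secret"
REDIRECT_URI = "https://portal.example.com/oauth2/callback"

def build_login_url():
    """Step 1: send the user to the provider with a random anti-CSRF state."""
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": "openid email profile",
        "state": state,
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}", state

def exchange_code(code):
    """Step 2: the provider redirects back with ?code=...; exchange it for tokens."""
    response = requests.post(TOKEN_URL, data={
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    })
    response.raise_for_status()
    return response.json()  # typically access_token (and id_token for OpenID Connect)
```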
#### One-time password (OTP)
Also known as a dynamic password, OTP addresses limitations of the traditional password authentication method. Usually, the one-time passwords are delivered via email or SMS.
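A minimal sketch of issuing and checking such a code; storage and delivery are stubbed out and all names are illustrative:

```python
import secrets
import time

# In practice this would live in a database or cache, keyed by email address or phone number.
_pending_codes = {}

def issue_otp(identifier, ttl_seconds=300):
    """Generate a 6-digit code and remember it with an expiry; hand it to the email/SMS gateway."""
    code = f"{secrets.randbelow(10**6):06d}"
    _pending_codes[identifier] = (code, time.time() + ttl_seconds)
    return code

def verify_otp(identifier, submitted):
    """Accept the code once, and only before it expires."""
    code, expires_at = _pending_codes.pop(identifier, (None, 0))
    return code is not None and time.time() < expires_at and secrets.compare_digest(code, submitted)
```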
### Account (aka profile)
- Storage of user profile information (email, fullname, gravatar etc.)
- Retrieving user profile information via API
- Updating profile
- Deleting profile
### Sessions
- Log out: de-persisting the session
- Invalidating all sessions: e.g. in case of a security issue
- Sessions outside of browsers (see the sketch below)
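A minimal sketch of the session side, assuming a signed token (PyJWT) carries the identity between requests; the claim names and secret handling are illustrative only:

```python
import time

import jwt  # PyJWT

SECRET = "replace-with-a-real-signing-key"

def create_session(user_id, ttl_seconds=3600):
    """Issue a signed token once identity has been determined (e.g. after SSO)."""
    now = int(time.time())
    return jwt.encode({"sub": user_id, "iat": now, "exp": now + ttl_seconds}, SECRET, algorithm="HS256")

def read_session(token):
    """Validate the token on each request instead of re-running identity determination."""
    try:
        return jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return None  # treat as logged out

# "Log out" then amounts to discarding the cookie carrying the token; invalidating
# *all* sessions can be done by rotating SECRET or keeping a token blacklist.
```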
## Key Job Stories
When a user signs in, I want to know her/his identity so that I can limit access and editing based on who she/he is.
When a user visits the data portal for the first time, I want to provide him/her with a way to register easily and quickly so that more people use the data portal.
When I visit the data portal for the first time, I want to sign up using my existing social network account so that I don't need to remember yet another set of credentials.
When I'm using the CLI app (or anything else outside browser), I want to be able to login so that I can work from the terminal (e.g., have write access: editing datasets etc.).
[More job stories](#more-job-stories).
## CKAN 2 (CKAN Classic)
### Basic CKAN authentication
In the classic system, we have basic CKAN authentication. Below is how the registration page looks:
![CKAN Classic register page](/static/img/docs/dms/ckan-register.png)
Registration flow in CKAN Classic:
```mermaid
sequenceDiagram
user->>ckan: fill in the form and submit
ckan->>ckan: check access (if user can create user)
ckan->>ckan: parse params
ckan->>ckan: check recaptcha
ckan->>ckan: call 'user_create' action
ckan->>ckan.model: add a new user into db
ckan->>ckan: create an activity
ckan->>ckan: log the user
ckan->>user: redirect to dashboard
```
We can extend basic CKAN authentication with:
- LDAP
- https://extensions.ckan.org/extension/ldap/
- https://github.com/NaturalHistoryMuseum/ckanext-ldap
- OAuth - see below
- SAML - https://extensions.ckan.org/extension/saml2/
### CKAN Classic as OAuth client
CKAN Classic can also be used as OAuth client:
- https://github.com/conwetlab/ckanext-oauth2 - this is the only one that's maintained.
- https://github.com/etalab/ckanext-oauth2 - outdated, the one above is based on this.
- https://github.com/okfn/ckanext-oauth - last commit 9 years ago.
- https://github.com/ckan/ckanext-oauth2waad - Windows Azure Active Directory specific and outdated.
How it works:
```mermaid
sequenceDiagram
user->>ckan: request for login via OAuth provider
ckan->>ckan.oauth: raise 401 and call `challenge` function
ckan.oauth->>user: redirect the user to the 3rd party log in page
user->>3rdparty: perform login
3rdparty->>ckan.oauth: redirect to /oauth2/callback with token
ckan.oauth->>3rdparty: call `authenticate` with token
3rdparty->>ckan.oauth: return user info
ckan.oauth->>ckan: if doesn't exist save that info in db or update it
ckan.oauth->>ckan.oauth: add cookies
ckan.oauth->>user: redirect to dashboard
```
## CKAN 3 (Next Gen)
We have considered some of the popular and/or modern solutions for identity management that we can implement in CKAN 3:
https://docs.google.com/spreadsheets/d/1qXZyzAbA2NtpnoSZRJ2K_EbaWJnvxkrKVzQ_2rD5eQw/edit#gid=0
Shortlist based on scores from the spreadsheet above:
- Auth0
- AuthN
- Ory/Kratos
Recommendation:
All projects from the shortlist can be considered for a project. It is worth giving each of them a try to find out what works best for your project's needs. Testing out Auth0 should be straightforward and take less than an hour. AuthN and Ory/Kratos would require building Docker images and running them locally, but overall it should not be time-consuming.
### Existing work
In datahub.io we have implemented SSO via Google/GitHub. Below is a sequence diagram showing the auth flow with datopian/auth + a frontend Express app (similar to the CKAN 3 frontend):
```mermaid
sequenceDiagram
frontend.login->>auth.authenticate: authenticate(jwt=None,next=/success/...)
auth.authenticate->>frontend.login: failed + here are urls for logging on 3rd party including success
frontend.login->>user: login form with login urls to 3rd party including next url in state
user->>3rdparty: login
3rdparty->>auth.oauth_response: success
auth.oauth_response->>frontend.success: redirect to next url
frontend.success->>auth.authenticate: with valid jwt
auth.authenticate->>frontend.success: valid + here is profile
frontend.success->>frontend.success: decode jwt, check it, then see localstorage
frontend.success->>frontend.dashboard: redirect to dashboard
```
## CKAN 2 to CKAN 3 (aka Next Gen)
How does this conceptual framework map to an evolution of CKAN 2 to CKAN 3?
```mermaid
graph TD
subgraph "CKAN Classic"
Signup["Classic signup, e.g., self-service or by sysadmin"]
Login["Classic login if you're using the classic UI"]
OAuth["OAuth2(ORY/Hydra)"]
end
subgraph "Authentication service (ORY/Kratos)"
SSO["Social Sign-On: Github, Google, Facebook"]
CC["CKAN Classic"]
Admins["Sysadmin users"]
Curators["Data curators"]
Users["Regular users"]
end
subgraph "Frontend v3"
SignupFront["Signup via Kratos"]
LoginFront["Login via Kratos"]
end
SignupFront --"Regular user"--> SSO
LoginFront --"Regular user"--> SSO
LoginFront --"Data curator"--> CC
CC --> Admins
CC --> Curators
SSO --> Users
CC --"Redirect"--> OAuth
OAuth --> Login
```
Sequence diagram of login process:
[![](https://mermaid.ink/img/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG5cdEJyb3dzZXItPj5Gcm9udGVuZDogUmVxdWVzdCB0byBgL2F1dGgvbG9naW5gXG4gIEZyb250ZW5kLT4-S3JhdG9zOiBBdXRoIHJlcXVlc3RcbiAgS3JhdG9zLT4-QnJvd3NlcjogUmVkaXJlY3QgdG8gYC9hdXRoL2xvZ2luP3JlcXVlc3Q9e2lkfWAgcGFyYW1cbiAgQnJvd3Nlci0-PkZyb250ZW5kOiBHZXQgYC9hdXRoL2xvZ2luP3JlcXVlc3Q9e2lkfWBcbiAgRnJvbnRlbmQtPj5LcmF0b3M6IEZldGNoIGRhdGEgZm9yIHJlbmRlcmluZyB0aGUgZm9ybVxuICBLcmF0b3MtPj5Gcm9udGVuZDogTG9naW4gb3B0aW9uc1xuICBGcm9udGVuZC0-PkJyb3dzZXI6IFJlbmRlciB0aGUgbG9naW4gZm9ybSB3aXRoIGF2YWlsYWJsZSBvcHRpb25zXG4gIEJyb3dzZXItPj5Gcm9udGVuZDogU3VwcGx5IGZvcm0gZGF0YVxuICBGcm9udGVuZC0-PktyYXRvczogVmFsaWRhdGUgYW5kIGxvZ2luXG4gIEtyYXRvcy0-PkZyb250ZW5kOiBTZXQgc2Vzc2lvblxuICBGcm9udGVuZC0-PkJyb3dzZXI6IFJlZGlyZWN0IHRvIC9kYXNoYm9hcmRcblxuXG5cdFx0XHRcdFx0IiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifSwidXBkYXRlRWRpdG9yIjpmYWxzZX0)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG5cdEJyb3dzZXItPj5Gcm9udGVuZDogUmVxdWVzdCB0byBgL2F1dGgvbG9naW5gXG4gIEZyb250ZW5kLT4-S3JhdG9zOiBBdXRoIHJlcXVlc3RcbiAgS3JhdG9zLT4-QnJvd3NlcjogUmVkaXJlY3QgdG8gYC9hdXRoL2xvZ2luP3JlcXVlc3Q9e2lkfWAgcGFyYW1cbiAgQnJvd3Nlci0-PkZyb250ZW5kOiBHZXQgYC9hdXRoL2xvZ2luP3JlcXVlc3Q9e2lkfWBcbiAgRnJvbnRlbmQtPj5LcmF0b3M6IEZldGNoIGRhdGEgZm9yIHJlbmRlcmluZyB0aGUgZm9ybVxuICBLcmF0b3MtPj5Gcm9udGVuZDogTG9naW4gb3B0aW9uc1xuICBGcm9udGVuZC0-PkJyb3dzZXI6IFJlbmRlciB0aGUgbG9naW4gZm9ybSB3aXRoIGF2YWlsYWJsZSBvcHRpb25zXG4gIEJyb3dzZXItPj5Gcm9udGVuZDogU3VwcGx5IGZvcm0gZGF0YVxuICBGcm9udGVuZC0-PktyYXRvczogVmFsaWRhdGUgYW5kIGxvZ2luXG4gIEtyYXRvcy0-PkZyb250ZW5kOiBTZXQgc2Vzc2lvblxuICBGcm9udGVuZC0-PkJyb3dzZXI6IFJlZGlyZWN0IHRvIC9kYXNoYm9hcmRcblxuXG5cdFx0XHRcdFx0IiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifSwidXBkYXRlRWRpdG9yIjpmYWxzZX0)
From ORY/Kratos:
[![](https://mermaid.ink/img/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gIHBhcnRpY2lwYW50IEIgYXMgQnJvd3NlclxuICBwYXJ0aWNpcGFudCBLIGFzIE9SWSBLcmF0b3NcbiAgcGFydGljaXBhbnQgQSBhcyBZb3VyIEFwcGxpY2F0aW9uXG5cblxuICBCLT4-SzogSW5pdGlhdGUgTG9naW5cbiAgSy0-PkI6IFJlZGlyZWN0cyB0byB5b3VyIEFwcGxpY2F0aW9uJ3MgL2xvZ2luIGVuZHBvaW50XG4gIEItPj5BOiBDYWxscyAvbG9naW5cbiAgQS0tPj5LOiBGZXRjaGVzIGRhdGEgdG8gcmVuZGVyIGZvcm1zIGV0Y1xuICBCLS0-PkE6IEZpbGxzIG91dCBmb3JtcywgY2xpY2tzIGUuZy4gXCJTdWJtaXQgTG9naW5cIlxuICBCLT4-SzogUE9TVHMgZGF0YSB0b1xuICBLLS0-Pks6IFByb2Nlc3NlcyBMb2dpbiBJbmZvXG5cbiAgYWx0IExvZ2luIGRhdGEgdmFsaWRcbiAgICBLLS0-PkI6IFNldHMgc2Vzc2lvbiBjb29raWVcbiAgICBLLT4-QjogUmVkaXJlY3RzIHRvIGUuZy4gRGFzaGJvYXJkXG4gIGVsc2UgTG9naW4gZGF0YSBpbnZhbGlkXG4gICAgSy0tPj5COiBSZWRpcmVjdHMgdG8geW91ciBBcHBsaWNhaXRvbidzIC9sb2dpbiBlbmRwb2ludFxuICAgIEItPj5BOiBDYWxscyAvbG9naW5cbiAgICBBLS0-Pks6IEZldGNoZXMgZGF0YSB0byByZW5kZXIgZm9ybSBmaWVsZHMgYW5kIGVycm9yc1xuICAgIEItLT4-QTogRmlsbHMgb3V0IGZvcm1zIGFnYWluLCBjb3JyZWN0cyBlcnJvcnNcbiAgICBCLT4-SzogUE9TVHMgZGF0YSBhZ2FpbiAtIGFuZCBzbyBvbi4uLlxuICBlbmRcbiIsIm1lcm1haWQiOnsidGhlbWUiOiJuZXV0cmFsIiwic2VxdWVuY2VEaWFncmFtIjp7ImRpYWdyYW1NYXJnaW5YIjoxNSwiZGlhZ3JhbU1hcmdpblkiOjE1LCJib3hUZXh0TWFyZ2luIjowLCJub3RlTWFyZ2luIjoxNSwibWVzc2FnZU1hcmdpbiI6NDUsIm1pcnJvckFjdG9ycyI6dHJ1ZX19fQ)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gIHBhcnRpY2lwYW50IEIgYXMgQnJvd3NlclxuICBwYXJ0aWNpcGFudCBLIGFzIE9SWSBLcmF0b3NcbiAgcGFydGljaXBhbnQgQSBhcyBZb3VyIEFwcGxpY2F0aW9uXG5cblxuICBCLT4-SzogSW5pdGlhdGUgTG9naW5cbiAgSy0-PkI6IFJlZGlyZWN0cyB0byB5b3VyIEFwcGxpY2F0aW9uJ3MgL2xvZ2luIGVuZHBvaW50XG4gIEItPj5BOiBDYWxscyAvbG9naW5cbiAgQS0tPj5LOiBGZXRjaGVzIGRhdGEgdG8gcmVuZGVyIGZvcm1zIGV0Y1xuICBCLS0-PkE6IEZpbGxzIG91dCBmb3JtcywgY2xpY2tzIGUuZy4gXCJTdWJtaXQgTG9naW5cIlxuICBCLT4-SzogUE9TVHMgZGF0YSB0b1xuICBLLS0-Pks6IFByb2Nlc3NlcyBMb2dpbiBJbmZvXG5cbiAgYWx0IExvZ2luIGRhdGEgdmFsaWRcbiAgICBLLS0-PkI6IFNldHMgc2Vzc2lvbiBjb29raWVcbiAgICBLLT4-QjogUmVkaXJlY3RzIHRvIGUuZy4gRGFzaGJvYXJkXG4gIGVsc2UgTG9naW4gZGF0YSBpbnZhbGlkXG4gICAgSy0tPj5COiBSZWRpcmVjdHMgdG8geW91ciBBcHBsaWNhaXRvbidzIC9sb2dpbiBlbmRwb2ludFxuICAgIEItPj5BOiBDYWxscyAvbG9naW5cbiAgICBBLS0-Pks6IEZldGNoZXMgZGF0YSB0byByZW5kZXIgZm9ybSBmaWVsZHMgYW5kIGVycm9yc1xuICAgIEItLT4-QTogRmlsbHMgb3V0IGZvcm1zIGFnYWluLCBjb3JyZWN0cyBlcnJvcnNcbiAgICBCLT4-SzogUE9TVHMgZGF0YSBhZ2FpbiAtIGFuZCBzbyBvbi4uLlxuICBlbmRcbiIsIm1lcm1haWQiOnsidGhlbWUiOiJuZXV0cmFsIiwic2VxdWVuY2VEaWFncmFtIjp7ImRpYWdyYW1NYXJnaW5YIjoxNSwiZGlhZ3JhbU1hcmdpblkiOjE1LCJib3hUZXh0TWFyZ2luIjowLCJub3RlTWFyZ2luIjoxNSwibWVzc2FnZU1hcmdpbiI6NDUsIm1pcnJvckFjdG9ycyI6dHJ1ZX19fQ)
Kratos to Hydra in CKAN Classic:
WIP
Questions
* Does CKAN Classic allow us to store arbitrary account information (are there "extras")?
* How would we avoid having to support identity persistence, delegation etc. in both the NG frontend and the Classic Admin UI?
* Can we share cookies (e.g. via subdomains)?
* How is login, identity determination etc. done, at least for the frontend, in DataHub.io?
* Should account UI really be in the NG frontend vs the Classic Admin UI?
* How can we handle "invite a user" to my org set up ... (it's basically post-processing after sign up ...)
## Appendix
### More job stories
When a user visits the data portal, I want to provide multiple options for him/her to sign up so that I have more users registered and using the data portal.
When a user needs to change his/her profile info, I want to make sure it is possible, so that I have the up-to-date information about users.
When my personal info (email etc.) is changed, I want to edit it in my profile so that I provide up-to-date information about me and I receive messages (eg, notifications) properly.
When I decide to stop using the data portal, I want to be able to delete my account, so that my personal details aren't stored in the service that I don't need anymore.


@@ -1,215 +0,0 @@
# Blob Storage
## Introduction
DMS and data portals often need to *store* data as well as metadata. As such, they require a system for doing this. This page focuses on Blob Storage aka Bulk or Raw storage (see [storage](/docs/dms/storage) page for an overview of all types of storage).
Blob storage is for storing "blobs" of data, that is a raw stream of bytes like files on a filesystem. For blob storage think local filesystem or cloud storage like S3, GCS, etc.
Blob Storage in a DMS can be provided via:
* Local file system: storing on disk or storage directly connected to the instance
* Cloud storage like S3, Google Cloud Storage, Azure storage etc
Today, cloud storage would be the default in most cases.
### Features
* Storage: Persistent, cost-efficient storage
* Download: Fast, reliable download (possibly even with support for edge distribution)
* Upload: reliable and rapid upload
* Direct upload to (cloud) storage by clients, i.e. without going via the DMS. Why? Because cloud storage has many features that would be costly to replicate (e.g. multipart, resumable uploads etc.), plus excellent performance and reliability for upload. It also cuts out the middleman of the DMS backend, thereby saving bandwidth, reducing load on the DMS backend and improving performance
* Upload UI: having an excellent UI for doing upload. NB: this UI is considered part of the [publish feature](/docs/dms/publish)
* Cloud: integrate with cloud storage
* Permissions: restricting access to data stored in blob storage based on the permissions of the DMS. For example, if Joe does not have access to a dataset on the DMS he should not be able to access associated blob data in the storage system
## Flows
### Direct to Cloud Upload
Want: Direct upload to cloud storage ... But you need to authorize that ... So give them a token from your app
A sequence diagram illustrating the process for a direct to cloud upload:
```mermaid
sequenceDiagram
participant Browser as Client (Browser / Code)
participant Authz as Authz Server
participant BitStore as Storage Access Token Service
participant Storage as Cloud Storage
Browser->>Authz: Give me a BitStore access token
Authz->>Browser: Token
Browser->>BitStore: Get a signed upload URL (access token, file metadata)
BitStore->>Browser: Signed URL
Browser->>Storage: Upload file (signed URL)
Storage->>Browser: OK (storage metadata)
```
Here's a more elaborate version showing storage of metadata into the MetaStore afterwards (and skipping the Authz service):
```mermaid
sequenceDiagram
participant browser as Client (Browser / Code)
participant vfts as MetaStore
participant bitstore as Storage Access Token Service
participant storage as Cloud Storage
browser->>browser: Select files to upload
browser->>browser: calculate file hashes (if doing content addressable)
browser->>bitstore: get signed URLs (file1.csv URL, file2.csv URL, auth info)
bitstore->>browser: signed URLs
browser->>storage: upload file1.csv
storage->>browser: OK
browser->>storage: upload file2.csv
storage->>browser: OK
browser->>browser: Compose datapackage.json
browser->>vfts: create dataset(datapackage.json, file1.csv pointer, file2.csv pointer, jwt token, ...)
vfts->>browser: OK
```
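To make the "Storage Access Token Service" step concrete, here is a rough sketch for an S3-compatible backend, assuming boto3; the bucket name and key scheme are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def issue_upload_url(bucket, key, expires_in=3600):
    """Return a signed URL the browser can PUT the file to, without routing data through the DMS backend."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# The client then uploads directly, e.g.:
#   requests.put(signed_url, data=open("file1.csv", "rb"))
# and afterwards registers the resulting pointer in the MetaStore (datapackage.json).
```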
## CKAN 2 (Classic)
Blob Storage is known as the FileStore in CKAN v2 and below. The default is local disk storage.
There is support for cloud storage via a variety of extensions the most prominent of which is `ckanext-cloudstorage`: https://github.com/TkTech/ckanext-cloudstorage
There are a variety of issues:
* Cloud storage is not a first-class citizen in CKAN: CKAN defaults to local file storage, but cloud storage is the norm today and has much better scalability and performance, as well as better integration with cloud deployments
* The FileStore interface definition has a poor separation of concerns (for example, blob storage file paths are set in the FileStore component, not in core CKAN), which makes it hard/hacky to extend and use for key use cases, e.g. versioning.
* `ckanext-cloudstorage` (the default cloud storage extension) is OK but has many issues, e.g.
  * No direct-to-cloud upload: it uses the CKAN backend as a middleman, so all data must go via the CKAN backend
  * Implements its own (sometimes unreliable) version of multipart upload (which means additional code that isn't as reliable as the cloud storage providers' interfaces)
  * No access to advanced features such as resumability etc.
Generally, we at Datopian have seen a lot of issues around multipart / large file upload stability with clients and are still seeing issues when a lot of large files are uploaded via scripts. Fixing and refactoring code related to storage is very costly, and tends to result in client specific "hacks".
## CKAN v3
An approach to blob storage that leverages cloud blob storage directly (i.e. without having to upload and serve all files via the CKAN web server), unlocking the performance characteristics of the storage backend. It is designed with a microservice approach and supports direct-to-cloud uploads and downloads. The key components are listed in the next section. You can read more about the overall design approach in the [design section below](#Design).
It is backwards compatible with CKAN v2 and has been successfully deployed with CKAN v2.8 and v2.9.
**Status: Production.**
### Components
* [ckanext-blob-storage](https://github.com/datopian/ckanext-blob-storage) (formerly known as ckanext-external-storage)
* Hooking CKAN to Giftless replacing resource storage
* Depends on giftless-client and ckanext-authz-service
* Doesn't implement IUploader - completely overrides upload / download routes for resources
* [Giftless](https://github.com/datopian/giftless) - Git LFS compatible implementation for storage with some extras on top. This hands out access tokens to store data in cloud storage.
* Docs at https://giftless.datopian.com
* Backends for Azure, Google Cloud Storage and local
* Multipart support (on top of standard LFS protocol)
* Accepts JWT tokens for authentication and authorization
* [ckanext-authz-service](https://github.com/datopian/ckanext-authz-service/) - This extension uses CKAN's built-in authentication and authorization capabilities to: a) Generate JWT tokens and provide them via CKAN's Web API to clients and b) Validate JWT tokens.
* Allows hooking CKAN's authentication and authorization capabilities to generate signed JWT tokens, to integrate with external systems
* Not specific for Giftless, but this is what it was built for
* [ckanext-asset-storage](https://github.com/datopian/ckanext-asset-storage) - this takes care of storing non-data assets e.g. organization images etc.
* CKAN IUploader for assets (not resources!)
* Pluggable backends - currently local and Azure
* Much cleaner than older implementations (ckanext-cloudstorage etc.)
Clients:
* [giftless-client-py](https://github.com/datopian/giftless-client) - Python client for Git LFS and Giftless-specific features
* Used by ckanext-blob-storage and other tools
* [giftless-client-js](https://github.com/datopian/giftless-client-js) - Javascript client for Git LFS and Giftless-specific features
* Used by ckanext-blob-storage and other tools for creating uploaders in the UI
## Design
### Purpose
The goal of this project is to create a more **_flexible_** system for storing **_data files_** (AKA “resources”) for **_CKAN_ and _other implementations_** of a data portal, so that CKAN can support versioning, large file uploads (and a great file upload UX), plug easily into cloud and local file storage backends and, in general, be easy to customize, both at the storage layer and in the CKAN client code for that layer.
### Features
* Do one thing and do it well: provide an API to store and retrieve files from storage, in a way that is pluggable into a micro-services based application and to existing CKAN (2.8 / 2.9)
* Does not force, and in fact is not aware of, a specific file naming logic (i.e. resource file names could be based on a user given name, a content hash, a revision ID or any mixture of these - it is up to the using system to decide)
* Does not force a specific storage backend; Should support Amazon S3, Azure Storage and local file storage in some way initially but in general backend should be pluggable
* Does not force a specific authentication scheme; Expects a signed JWT token, does not care who signed it and how the user got authenticated
* Does not force a complex authorization scheme; leaves it to an external system to do complex authorization if needed
* By default, the system can work in an “admin party” mode where all authenticated users have full access to all files. This will be “good enough” for many DMS implementations including CKAN.
* Potentially, allow plugging in a more complex authorization logic that relies on JWT claims to perform granular authorization checks
### For Data Files (i.e. Blobs)
This system is about storing and providing access to blobs, or streams of bytes; it is not about providing access to the data stored within (i.e. it is not meant to replace CKAN's datastore).
### For CKAN whilst not necessarily CKAN Specific
While the system's design should not be CKAN-specific in any way, our current client needs require us to provide a CKAN extension that integrates with this system.
CKAN's current IUploader interface has been identified as too narrow to provide the functionality required by complex projects (resource versioning, direct cloud uploads and downloads, large file support and multipart support). While some of these needs could be, and have been, “hacked” through the IUploader interface, the implementations have been overly complex and hard to debug.
Our goal should be to provide a CKAN extension that provides the following functionality directly:
* Uploading and downloading resource files directly from the client if supported by the storage backend
* Multipart upload support if supported by storage backend
* Handling of signed URLs for uploads and private downloads
* Client side code for handling multipart uploads
* TBD: If storage backend does not support direct uploads / downloads, fall back to …
In addition, this extension should provide an API for other extensions to do things like:
* Set the file naming scheme (We need this for ckanext-versions)
* Lower level file access, e.g. move and delete files. We may need this in the future to optimize storage and deduplicate files as proposed for ckanext-versions
In addition, this extension must “play nice” with common CKAN features such as the datastore extension and related datapusher / xloader extensions.
### Usable For other DMS implementations
There should be nothing in this system, except for the CKAN extension described above, that is specific to CKAN. That will allow to re-use and re-integrate this system as a micro-service in other DMS implementations such as ckan-ng and others.
In fact, the core part of this system should be a generic, abstract storage service with a light authorization layer. This could make it useful in a host of situations where storage micro-service is needed.
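As an illustration of that generic, abstract storage service idea, a sketch of what a pluggable backend interface could look like; the class and method names are hypothetical, not taken from any of the components above:

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Minimal contract a pluggable blob-storage backend could implement."""

    @abstractmethod
    def upload_url(self, path: str, expires_in: int) -> str:
        """Signed URL for a direct client upload."""

    @abstractmethod
    def download_url(self, path: str, expires_in: int) -> str:
        """Signed URL for a (possibly private) download."""

    @abstractmethod
    def delete(self, path: str) -> None:
        """Remove an object, e.g. when a resource is purged."""


class LocalBackend(StorageBackend):
    """Filesystem-backed stand-in for development; the 'URLs' are just local paths."""

    def __init__(self, root: str):
        self.root = root

    def upload_url(self, path: str, expires_in: int) -> str:
        return f"file://{self.root}/{path}"

    def download_url(self, path: str, expires_in: int) -> str:
        return f"file://{self.root}/{path}"

    def delete(self, path: str) -> None:
        import os
        os.remove(f"{self.root}/{path}")
```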
### High Level Principles
Common Principles
* Uploads and downloads directly from cloud providers to the browser
* Signed uploads / downloads - for private / authorized only data access
* Support for AWS, Azure and potentially GCP storage
* Support for local (non cloud) storage, potentially through a system like [https://min.io/](https://min.io/)
* Multipart / large file upload support (a few GB in size should be supported for Gates)
* Not opinionated about file naming / paths; allow users to set file locations under some pre-defined paths / buckets
* Client side support - browser widgets / code for uploading and downloading files / multipart uploads directly to different backends
* Well-documented flow for using from API (not browser)
* Provided API for deleting and moving files
* Provided API for accessing storage-level metadata (e.g. file MD5) (do we need this? It could be useful for processes that do things like deduplicating storage)
* Provided API for managing storage-level object level settings (e.g. “Content-disposition” / “Content-type” headers, etc.)
* Authorization based on some kind of portable scheme (JWT)
CKAN integration specific (implemented as a CKAN extension)
* JWT generation based on current CKAN user permissions
* Client widgets integration (or CKAN specific widgets) in right places in CKAN templates
* Hook into resource upload / download / deletion controllers in CKAN
* API to allow other extensions to control storage level object metadata (headers, path)
* API to allow other extensions to hook into lifecycle events - upload completion, download request, deletion etc.
### Components
The Decoupled Storage solution should be split into several parts, with some parts being independent of others:
* [External] Cloud Storage service (or a similar API if using the local file system), e.g. S3, GCS, Azure Storage, Min.io (for the local file system)
* Cloud Storage Access Service
* [External] Permissions Service for granting general permission tokens that give access to Cloud Storage Access Service
* JWT tokens can be generated by any party that has the right signing key. Thus, we can initially do without this if JWT signing is implemented as part of the CKAN extension
* Browser based Client for Cloud Storage (compatible with #1 and with different cloud vendors)
* CKAN extension that wraps the two parts above to provide a storage solution for CKAN
### Questions
* What is file structure in cloud ... i.e. What is the file path for uploaded files? Options:
* Client chooses a name/path
* Content addressable, i.e. the name is given by the content? How? Use a hash (see the sketch below).
* Beauty of that: standard way to name things. The same thing has the same name (modulo collisions)
* Goes with versioning => same file = same name, diff file = diff name
* And do you enforce that from your app
* Request for token needs to include the destination file path
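As a small illustration of the content-addressable option referenced above, a sketch that derives the storage path from a hash of the file, so identical content always maps to the same name (the path prefix is a placeholder):

```python
import hashlib

def content_address(path, prefix="lfs/objects"):
    """Name the blob after its SHA-256 digest, sharded by the first two hex chars."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    h = digest.hexdigest()
    return f"{prefix}/{h[:2]}/{h[2:]}"

# Two uploads of the same file resolve to the same key, which is what makes
# deduplication and versioning (same file = same name, diff file = diff name) fall out naturally.
```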


@@ -1,503 +0,0 @@
# CKAN Client Guide
Guide to interacting with [CKAN](/docs/dms/ckan) for power users such as data scientists, data engineers and data wranglers.
This guide is about adding and managing data in CKAN programmatically and it assumes:
* You are familiar with key concepts like metadata, data, etc.
* You are working programmatically with a programming language such as Python, JavaScript or R (_coming soon_).
## Frictionless Formats
Clients use [Frictionless formats](https://specs.frictionlessdata.io/) by default for describing dataset and resource objects passed to client methods. Internally, we then use a *CKAN <=> Frictionless Mapper* (both [in JavaScript](https://github.com/datopian/frictionless-ckan-mapper-js) and [in Python](https://github.com/frictionlessdata/frictionless-ckan-mapper)) to convert objects to CKAN formats before calling the API. **Thus, you can use _Frictionless Formats_ by default with the client**.
>[!tip]As CKAN moves to Frictionless as the default, this will gradually become unnecessary.
## Quick start
Most of this guide has the Python programming language in mind, including its [convention regarding the use of _snake case_ for instance and method names](https://www.python.org/dev/peps/pep-0008/#descriptive-naming-styles).
If needed, you can adapt the instructions to JavaScript and R (coming soon) by using _camel case_ instead — for example, if in the Python code we have `client.push_blob(…)`, in JavaScript it would be `client.pushBlob(…)`.
### Prerequisites
Install the client for your language of choice:
* Python: https://github.com/datopian/ckan-client-py#install
* JavaScript: https://github.com/datopian/ckan-client-js#install
* R: _coming soon_
### Create a client
#### Python
```python
from ckanclient import Client
api_key = '771a05ad-af90-4a70-beea-cbb050059e14'
api_url = 'http://localhost:5000'
organization = 'datopian'
dataset = 'dailyprices'
lfs_url = 'http://localhost:9419'
client = Client(api_url, organization, dataset, lfs_url)
```
#### JavaScript
```javascript
const { Client } = require('ckanClient')
apiKey = '771a05ad-af90-4a70-beea-cbb050059e14'
apiUrl = 'http://localhost:5000'
organization = 'datopian'
dataset = 'dailyprices'
const client = Client(apiKey, organization, dataset, apiUrl)
```
### Upload a resource
That is to say, upload a file, implicitly creating a new dataset.
#### Python
```python
from frictionless import describe
resource = describe('my-data.csv')
client.push_blob(resource)
```
### Create a new empty Dataset with metadata
#### Python
```python
client.create('my-data')
client.push(resource)
```
### Adding a resource to an existing Dataset
>[!note]Not implemented yet.
```python
client.create('my-data')
client.push_resource(resource)
```
### Edit a Dataset's metadata
>[!note]Not implemented yet.
```python
dataset = client.retrieve('sample-dataset')
client.update_metadata(
    dataset,
    metadata={'maintainer_email': 'sample@datopian.com'}
)
```
For details of metadata see the [metadata reference below](#metadata-reference).
## API - Porcelain
### `Client.create`
Expects as a single argument: a _string_, or a _dict_ (in Python), or an _object_ (in JavaScript). This argument is either a valid dataset name or dictionary with metadata for the dataset in Frictionless format.
### `Client.push`
Expects a single argument: a _dict_ (in Python) or an _object_ (in JavaScript) with a dataset metadata in Frictionless format.
### `Client.retrieve`
Expects a single argument: a string with a dataset name or unique ID. Returns a Frictionless resource as a _dict_ (in Python) or as a _Promise&lt;object&gt;_ (in JavaScript).
### `Client.push_blob`
Expects a single argument: a _dict_ (in Python) or an _object_ (in JavaScript) with a Frictionless resource.
## API - Plumbing
### `Client.action`
This method bridges access to the CKAN API _action endpoint_.
#### In Python
Arguments:
| Name | Type | Default | Description |
| -------------------- | ---------- | ---------- | ------------------------------------------------------------ |
| `name` | `str` | (required) | The action name, for example, `site_read`, `package_show`… |
| `payload` | `dict` | (required) | The payload being sent to CKAN. When a payload is provided to a GET request, it will be converted to URL parameters and each key will be converted to snake case. |
| `http_get` | `bool` | `False` | Optional, if `True` will make `GET` request, otherwise `POST`. |
| `transform_payload` | `function` | `None` | Function to mutate the `payload` before making the request (useful to convert to and from CKAN and Frictionless formats). |
| `transform_response` | `function` | `None`     | Function to mutate the response data before returning it (useful to convert to and from CKAN and Frictionless formats). |
>[!note]The CKAN API uses the CKAN dataset and resource formats (rather than Frictionless formats).
In other words, to stick to Frictionless formats, you can pass `frictionless_ckan_mapper.frictionless_to_ckan` as `transform_payload`, and `frictionless_ckan_mapper.ckan_to_frictionless` as `transform_response`.
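As an illustration of the two converters named above (following this guide's own naming; the exact callables exposed by the mapper package may differ), and assuming the `client` created in the Quick start:

```python
import frictionless_ckan_mapper

# Read: fetch a dataset via the raw action API and convert CKAN's format back to Frictionless.
dataset = client.action(
    "package_show",
    {"id": "dailyprices"},
    http_get=True,
    transform_response=frictionless_ckan_mapper.ckan_to_frictionless,
)

# Write: convert the Frictionless descriptor to CKAN's format on the way out.
client.action(
    "package_update",
    dataset,
    transform_payload=frictionless_ckan_mapper.frictionless_to_ckan,
)
```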
#### In JavaScript
Arguments:
| Name | Type | Default | Description |
| ------------ | ------------------- | ------------------ | ------------------------------------------------------------ |
| `actionName` | <code>string</code> | (required) | The action name, for example, `site_read`, `package_show`… |
| `payload` | <code>object</code> | (required) | The payload being sent to CKAN. When a payload is provided to a GET request, it will be converted to URL parameters and each key will be converted to snake case. |
| `useHttpGet` | <code>boolean</code> | <code>false</code> | Optional, if `true` will make a `GET` request, otherwise `POST`. |
>[!note]The JavaScript implementation uses the CKAN dataset and resource formats (rather than Frictionless formats).
In other words, to stick to Frictionless formats, you need to convert from Frictionless to CKAN before calling `action` , and from CKAN to Frictionless after calling `action`.
## Metadata reference
>[!info]Your site may have custom metadata that differs from the example set below.
### Profile
**(`string`)** Defaults to _data-resource_.
The profile of this descriptor.
Every Package and Resource descriptor has a profile. The default profile, if none is declared, is `data-package` for Package and `data-resource` for Resource.
#### Examples
- `{"profile":"tabular-data-package"}`
- `{"profile":"http://example.com/my-profiles-json-schema.json"}`
### Name
**(`string`)**
An identifier string. Lower case characters with `.`, `_`, `-` and `/` are allowed.
This is ideally a url-usable and human-readable name. Name `SHOULD` be invariant, meaning it `SHOULD NOT` change when its parent descriptor is updated.
#### Example
- `{"name":"my-nice-name"}`
### Path
A reference to the data for this resource, as either a path as a string, or an array of paths as strings, of valid URIs.
The dereferenced value of each referenced data source in `path` `MUST` be commensurate with a native, dereferenced representation of the data the resource describes. For example, in a *Tabular* Data Resource, this means that the dereferenced value of `path` `MUST` be an array.
#### Validation
##### It must satisfy one of these conditions
###### Path
**(`string`)**
A fully qualified URL, or a POSIX file path.
Implementations need to negotiate the type of path provided, and dereference the data accordingly.
**Examples**
- `{"path":"file.csv"}`
- `{"path":"http://example.com/file.csv"}`
**(`array`)**
**Examples**
- `["file.csv"]`
- `["http://example.com/file.csv"]`
#### Examples
- `{"path":["file.csv","file2.csv"]}`
- `{"path":["http://example.com/file.csv","http://example.com/file2.csv"]}`
- `{"path":"http://example.com/file.csv"}`
### Data
Inline data for this resource.
### Schema
**(`object`)**
A schema for this resource.
### Title
**(`string`)**
A human-readable title.
#### Example
- `{"title":"My Package Title"}`
### Description
**(`string`)**
A text description. Markdown is encouraged.
#### Example
- `{"description":"# My Package description\nAll about my package."}`
### Home Page
**(`string`)**
The home on the web that is related to this data package.
#### Example
- `{"homepage":"http://example.com/"}`
### Sources
**(`array`)**
The raw sources for this resource.
#### Example
- `{"sources":[{"title":"World Bank and OECD","path":"http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"}]}`
### Licenses
**(`array`)**
The license(s) under which the resource is published.
This property is not legally binding and does not guarantee that the package is licensed under the terms defined herein.
#### Example
- `{"licenses":[{"name":"odc-pddl-1.0","path":"http://opendatacommons.org/licenses/pddl/","title":"Open Data Commons Public Domain Dedication and License v1.0"}]}`
### Format
**(`string`)**
The file format of this resource.
`csv`, `xls`, `json` are examples of common formats.
#### Example
- `{"format":"xls"}`
### Media Type
**(`string`)**
The media type of this resource. Can be any valid media type listed with [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml).
#### Example
- `{"mediatype":"text/csv"}`
### Encoding
**(`string`)** Defaults to _utf-8_.
The file encoding of this resource.
#### Example
- `{"encoding":"utf-8"}`
### Bytes
**(`integer`)**
The size of this resource in bytes.
#### Example
- `{"bytes":2082}`
### Hash
**(`string`)**
The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.
#### Examples
- `{"hash":"d25c9c77f588f5dc32059d2da1136c02"}`
- `{"hash":"SHA256:5262f12512590031bbcc9a430452bfd75c2791ad6771320bb4b5728bfb78c4d0"}`
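For example, the two accepted forms of the `hash` value can be computed with the Python standard library:

```python
import hashlib

with open("my-data.csv", "rb") as f:
    data = f.read()

md5_hash = hashlib.md5(data).hexdigest()                     # default form: bare MD5 digest
sha256_hash = "SHA256:" + hashlib.sha256(data).hexdigest()   # other algorithms use {algorithm}:{hash}

resource_metadata = {"hash": md5_hash}
# or: resource_metadata = {"hash": sha256_hash}
```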
## Generating templates
You can use [`jsv`](https://github.com/datopian/jsv) to generate a template script in Python, JavaScript, and R.
To install it:
```
$ npm install -g git+https://github.com/datopian/jsv.git
```
### Python
```
$ jsv data-resource.json --output py
```
**Output**
```python
dataset_metadata = {
"profile": "data-resource", # The profile of this descriptor.
# [example] "profile": "tabular-data-package"
# [example] "profile": "http://example.com/my-profiles-json-schema.json"
"name": "my-nice-name", # An identifier string. Lower case characters with `.`, `_`, `-` and `/` are allowed.
"path": ["file.csv","file2.csv"], # A reference to the data for this resource, as either a path as a string, or an array of paths as strings. of valid URIs.
# [example] "path": ["http://example.com/file.csv","http://example.com/file2.csv"]
# [example] "path": "http://example.com/file.csv"
"data": None, # Inline data for this resource.
"schema": None, # A schema for this resource.
"title": "My Package Title", # A human-readable title.
"description": "# My Package description\nAll about my package.", # A text description. Markdown is encouraged.
"homepage": "http://example.com/", # The home on the web that is related to this data package.
"sources": [{"title":"World Bank and OECD","path":"http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"}], # The raw sources for this resource.
"licenses": [{"name":"odc-pddl-1.0","path":"http://opendatacommons.org/licenses/pddl/","title":"Open Data Commons Public Domain Dedication and License v1.0"}], # The license(s) under which the resource is published.
"format": "xls", # The file format of this resource.
"mediatype": "text/csv", # The media type of this resource. Can be any valid media type listed with [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml).
"encoding": "utf-8", # The file encoding of this resource.
# [example] "encoding": "utf-8"
"bytes": 2082, # The size of this resource in bytes.
"hash": "d25c9c77f588f5dc32059d2da1136c02", # The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.
# [example] "hash": "SHA256:5262f12512590031bbcc9a430452bfd75c2791ad6771320bb4b5728bfb78c4d0"
}
```
### JavaScript
```
$ jsv data-resource.json --output js
```
**Output**
```javascript
const datasetMetadata = {
// The profile of this descriptor.
profile: "data-resource",
// [example] profile: "tabular-data-package"
// [example] profile: "http://example.com/my-profiles-json-schema.json"
// An identifier string. Lower case characters with `.`, `_`, `-` and `/` are allowed.
name: "my-nice-name",
// A reference to the data for this resource, as either a path as a string, or an array of paths as strings. of valid URIs.
path: ["file.csv", "file2.csv"],
// [example] path: ["http://example.com/file.csv","http://example.com/file2.csv"]
// [example] path: "http://example.com/file.csv"
// Inline data for this resource.
data: null,
// A schema for this resource.
schema: null,
// A human-readable title.
title: "My Package Title",
// A text description. Markdown is encouraged.
description: "# My Package description\nAll about my package.",
// The home on the web that is related to this data package.
homepage: "http://example.com/",
// The raw sources for this resource.
sources: [
{
title: "World Bank and OECD",
path: "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
},
],
// The license(s) under which the resource is published.
licenses: [
{
name: "odc-pddl-1.0",
path: "http://opendatacommons.org/licenses/pddl/",
title: "Open Data Commons Public Domain Dedication and License v1.0",
},
],
// The file format of this resource.
format: "xls",
// The media type of this resource. Can be any valid media type listed with [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml).
mediatype: "text/csv",
// The file encoding of this resource.
encoding: "utf-8",
// [example] encoding: "utf-8"
// The size of this resource in bytes.
bytes: 2082,
// The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.
hash: "d25c9c77f588f5dc32059d2da1136c02",
// [example] hash: "SHA256:5262f12512590031bbcc9a430452bfd75c2791ad6771320bb4b5728bfb78c4d0"
};
```
### R
```
$ jsv data-resource.json --output r
```
**Output**
```r
# The profile of this descriptor.
profile <- "data-resource"
# [example] profile <- "tabular-data-package"
# [example] profile <- "http://example.com/my-profiles-json-schema.json"
# An identifier string. Lower case characters with `.`, `_`, `-` and `/` are allowed.
name <- "my-nice-name"
# A reference to the data for this resource: either a path as a string, or an array of paths as strings. Paths must be valid URIs.
path <- c("file.csv", "file2.csv")
# [example] path <- c("http://example.com/file.csv", "http://example.com/file2.csv")
# [example] path <- "http://example.com/file.csv"
# Inline data for this resource.
data <- NA
# A schema for this resource.
schema <- NA
# A human-readable title.
title <- "My Package Title"
# A text description. Markdown is encouraged.
description <- "# My Package description\nAll about my package."
# The home on the web that is related to this data package.
homepage <- "http://example.com/"
# The raw sources for this resource.
sources <- list(list(title = "World Bank and OECD", path = "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"))
# The license(s) under which the resource is published.
licenses <- list(list(name = "odc-pddl-1.0", path = "http://opendatacommons.org/licenses/pddl/", title = "Open Data Commons Public Domain Dedication and License v1.0"))
# The file format of this resource.
format <- "xls"
# The media type of this resource. Can be any valid media type listed with [IANA](https://www.iana.org/assignments/media-types/media-types.xhtml).
mediatype <- "text/csv"
# The file encoding of this resource.
encoding <- "utf-8"
# [example] encoding <- "utf-8"
# The size of this resource in bytes.
bytes <- 2082L
# The MD5 hash of this resource. Indicate other hashing algorithms with the {algorithm}:{hash} format.
hash <- "d25c9c77f588f5dc32059d2da1136c02"
# [example] hash <- "SHA256:5262f12512590031bbcc9a430452bfd75c2791ad6771320bb4b5728bfb78c4d0"
```
## Design Principles
The client **should** use Frictionless formats by default for describing dataset and resource objects passed to client methods.
In addition, where more than metadata is needed (e.g., we need to access the data stream or get the schema), we expect the _Dataset_ and _Resource_ objects to follow the [Frictionless Data Lib pattern](https://github.com/frictionlessdata/project/blob/master/rfcs/0004-frictionless-data-lib-pattern.md).
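As a purely hypothetical sketch (no real client API is implied), "Frictionless by default" means a client call can accept a plain data-resource descriptor like the ones generated above:
```python
# Hypothetical sketch only: `DemoClient` and `resource_create` are stand-ins,
# not a real PortalJS/CKAN client. The point is that the `resource` argument
# is a plain Frictionless data-resource descriptor.
from typing import Any, Dict


class DemoClient:
    def resource_create(self, dataset_id: str, resource: Dict[str, Any]) -> None:
        # A real implementation would POST this descriptor to a portal API.
        print(f"Would create resource {resource['name']!r} in dataset {dataset_id!r}")


resource_descriptor = {
    "profile": "data-resource",
    "name": "my-nice-name",
    "path": "http://example.com/file.csv",
    "format": "csv",
    "mediatype": "text/csv",
    "encoding": "utf-8",
}

DemoClient().resource_create("my-first-dataset", resource_descriptor)
```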


@ -1,108 +0,0 @@
# CKAN Enterprise
## Introduction
CKAN Enterprise is our name for what we plan will become our standard "base" distribution of CKAN going forward:
* It is a CKAN standard code base with micro-services.
* An enterprise-grade data catalog and portal targeted at government (open data portals) and enterprise (data catalogs and more).
* It is also known as [Datopian DMS](https://www.datopian.com/datopian-dms/).
## Roadmap 2021 and beyond
| | Current | CKAN Enterprise |
|-------------------|--------------------------------------------------------------------------------------------|-----------------------------------------------------------------|
| Raw storage | Filestore | Giftless |
| Data Loader (db) | DataPusher extension | Aircan |
| Data Storage (db) | Postgres | Any database engine. By default, Postgres |
| Data API (read) | Built-in DataStore extension's API including SQL endpoint | GraphQL based standalone micro-service |
| Frontend (public) | Built-in frontend in the CKAN Classic Python app (some projects use a Node.js app) | PortalJS or Node.js app |
| Data Explorer | ReclineJS (some projects that use a Node.js frontend have a React-based Data Explorer) | GraphQL-based Data Explorer |
| Auth | Traditional login/password + extendable with CKAN Classic extensions | SSO with default Google, GitHub, Facebook and Microsoft options |
| Permissions | CKAN Classic based permissions | Existing permissions exposed via JWT based authz API |
## Timeline 2021
To develop a base distribution of CKAN Enterprise, we want to build a demo project with the features from the roadmap. This way we can:
* understand its advantages/limitations;
* compare against other instances of CKAN;
* demonstrate it to potential clients.
High level overview of the planned features with ETA:
| Name | Description | Effort | ETA |
| ----------------------------- | ------------------------------------ | ------ | --- |
| [Init](#Init) | Select CKAN version and deploy to DX | xs | Q2 |
| [Blobstore](#Blobstore) | Integrate Giftless for raw storage | s | Q2 |
| [Versioning](#Versioning) | Develop/integrate new versioning sys | l | Q3 |
| [DataLoader](#DataLoader) | Develop/integrate Aircan | xl | Q3 |
| [Data API](#Data-API) | Integrate new Data API (read) | m | Q2 |
| [Frontend](#Frontend) | Build a theme using PortalJS | s | Q2 |
| [DataExplorer](#DataExplorer) | Integrate into PortalJS | s | Q2 |
| [Permissions](#Permissions) | Develop permissions in read frontend | m | Q4 |
| [Auth](#Auth) | Integrate | s | Q4 |
### Init
Initialize a new project for development of CKAN Enterprise.
Tasks:
* Boot project in Datopian-DX cluster
* Use CKAN v2.8.x (latest patch) or 2.9.x
* Don't set up DataPusher
* Namespace: `ckan-enterprise`
* Domain: `enterprise.ckan.datopian.com`
### Blobstore
See [blob storage](/docs/dms/blob-storage#ckan-v3)
### Versioning
See [versioning](/docs/dms/versioning#ckan-v3)
### DataLoader
See [DataLoader](/docs/dms/load)
### Data API
* Install new [Data API service](https://github.com/datopian/data-api) in the project
* Install Hasura service in the project
* Set it up to work with DB of CKAN Enterprise
* Read more about Data API [here](/docs/dms/data-api#read-api-3)
Notes:
* We could experiment and use various features of Hasura, e.g. (see the query sketch after this list):
* Setting up row/column limits per user role (permissions)
* Subscriptions to auto load new data rows
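As a hedged sketch (the endpoint URL, table name, and secret are hypothetical placeholders), reading rows through Hasura's GraphQL endpoint with plain HTTP looks roughly like this:
```python
# Hypothetical sketch: read a few rows from a resource table via Hasura's
# GraphQL endpoint. URL, table name, and admin secret are placeholders.
import requests

query = """
query {
  my_resource_table(limit: 5) {
    id
    value
  }
}
"""

resp = requests.post(
    "http://hasura:8080/v1/graphql",               # placeholder service URL
    json={"query": query},
    headers={"x-hasura-admin-secret": "<secret>"}, # or a role-scoped JWT
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["data"]["my_resource_table"])
```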
### Frontend
PortalJS for the read frontend of CKAN Enterprise. [Read more](/docs/dms/frontend/#frontend).
### DataExplorer
A new Data Explorer based on GraphQL API: https://github.com/datopian/data-explorer-graphql
### Permissions
See [permissions](/docs/dms/permissions#permissions-authorization).
### Auth
Next-generation, Kratos-based authentication (mostly SSO, with no traditional login by default) with the following options out of the box:
* GitHub
* Google
* Facebook
* Microsoft
Easy to add:
* Discord
* GitLab
* Slack


@ -1,365 +0,0 @@
# CKAN v3
## Introduction
This document describes the architectures of CKAN v2 ("CKAN Classic"), CKAN v3 (also known as "CKAN Next Gen" for Next Generation), and CKAN v3 hybrid. The latter is an intermediate approach towards v3, where we still use CKAN v2 and common extensions, and only create microservices for new features.
You will also find out how to do common tasks such as theming or testing, in each of the architectures.
*Note: this blog post has an overview of the more decoupled, microservices approach at the core of v3: https://www.datopian.com/2021/05/17/a-more-decoupled-ckan/*
## CKAN v2, CKAN v3 and Why v3
In yellow, you see one single Python process:
```mermaid
graph TB
subgraph ckanclassic["CKAN Classic"]
ckancore["Core"]
end
```
When you want to extend core functionality of CKAN v2 (Classic), you write a Python package that must be installed in CKAN. This way, the extension will also run in the same process as the core functionality. This is known as a monolithic architecture.
```mermaid
graph TB
subgraph ckanclassic["CKAN Classic"]
ckancore["Core"] --> ckanext["CKAN Extension 1"]
end
```
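For example, a minimal extension is just a plugin class that CKAN core discovers and loads in-process (the same pattern appears in full in the extension creation tutorial later in this document):
```python
# plugin.py inside a CKAN extension package; it runs inside the CKAN core process.
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit


class ExampleExtensionPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IConfigurer)

    # IConfigurer hook: called by core at startup so the plugin can register
    # its own template and static asset directories.
    def update_config(self, config_):
        toolkit.add_template_directory(config_, 'templates')
        toolkit.add_public_directory(config_, 'public')
```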
When you start to add multiple features through extensions, what you get is one single Python process running many unrelated functionalities.
```mermaid
graph TB
subgraph ckanclassic["CKAN Classic"]
ckancore["Core"] --> ckanext["CKAN Extension 1"]
ckancore --> ckanext2["CKAN Extension 2"]
ckancore --> ckanext3["CKAN Extension 3"]
ckancore --> ckanext4["CKAN Extension 4"]
ckancore --> ckanext5["CKAN Extension 5"]
end
```
This monolithic approach has advantages in terms of simplicity of development and deployment, especially when the system is small. However, as it grows in scale and scope, there are an increasing number of issues.
In this approach, an optional extension has the ability to crash the whole CKAN instance. Every new feature must be written in the same language and framework (e.g. Python, leveraging Flask or Django). And, perhaps most fundamentally, the overall system is highly coupled, making it complex and hard to understand, debug, extend, and evolve.
### Microservices and CKAN v3
The main way to address these problems while gaining extra benefits is to move to a microservices-based architecture.
Thus, we recommend building the next version of CKAN, CKAN v3, on a microservices approach.
>[!tip]CKAN v3 is sometimes also referred to as CKAN Next Gen(eration).
With microservices, each piece of functionality runs in its own service and process.
```mermaid
graph TB
subgraph ckanapi3["CKAN API 3"]
ckanapi31["API 3"]
end
subgraph ckanapi2["CKAN API 2"]
ckanapi21["API 2"]
end
subgraph ckanapi1["CKAN API 1"]
ckanapi11["API 1"]
end
subgraph ckanfrontend["CKAN frontend"]
ckanfrontend1["Frontend"]
end
ckanfrontend1 --> ckanapi11
ckanfrontend1 --> ckanapi21
ckanfrontend1 --> ckanapi31
```
### Incremental Evolution Hybrid v3
One of the other advantages of the microservices approach is that it can also be used to extend and evolve current CKAN v2 solutions in an incremental way. We term these kinds of solutions "Hybrid v3," as they are a mix of v2 and v3 together.
For example, a Hybrid v3 data portal could use a new microservice written in Node for the frontend, and combine that with CKAN v2 (with v2 extensions).
```mermaid
graph TB
subgraph ckanapi3["CKAN API 3"]
ckanapi31["API 3"]
end
subgraph ckanapi2["CKAN API 2"]
ckanapi21["API 2"]
end
subgraph ckanapi1["CKAN API 1"]
ckanapi11["API 1"]
end
subgraph ckanfrontend["CKAN frontend"]
ckanfrontend1["Frontend"]
end
subgraph ckanclassic["CKAN Classic"]
ckancore["Core"] --> ckanext["CKAN Extension 1"]
ckancore --> ckanext2["CKAN Extension 2"]
end
ckanfrontend1 --> ckancore
ckanfrontend1 --> ckanapi11
ckanfrontend1 --> ckanapi21
ckanfrontend1 --> ckanapi31
```
The hybrid approach means we can evolve CKAN v2 "Classic" to CKAN v3 "Next Gen" incrementally. In particular, it allows people to keep using their existing v2 extensions, and upgrade them to new microservices gradually.
### Comparison of Approaches
| | CKAN v2 (Classic) | CKAN v3 (Next Gen) | CKAN v3 Hybrid |
| ------------ | ------------------| -------------------| ---------------|
| Architecture | Monolithic | Microservice | Microservice with v2 core |
| Language | Python | You can write services in any language you like.<br/><br/>Frontend default: JS.<br/>Backend default: Python | Python and any language you like for microservices. |
| Frontend (and theming) | Python with Python CKAN extension | Flexible. Default is modern JS/NodeJS based | Can use old frontend but default to new JS-based frontend. |
| Data Packages | Add-on, no integration | Default internal and external format | Data Packages with converter to old CKAN format. |
| Extension | Extensions are libraries that are added to the core runtime. They must therefore be built in Python and are loaded into the core process at build time. "Template/inheritance" model where hooks are in core and it is core that loads and calls plugins. This means that if a hook does not exist in core then the extension is stymied. | Extensions are microservices and can be written in any language. They are loaded into the URL space via the Kubernetes routing manager. Extensions hook into "core" via APIs (rather than in code). Follows a "composition" model rather than an inheritance model. | Can use old-style extensions or microservices. |
| Resource Scaling | You have a single application so scaling is of the core application. | You can scale individual microservices as needed. | Mix of v2 and v3 |
## Why v3: Long Version
What are the problems with CKAN v2's monolithic architecture in relation to microservices v3?
* **Poor Developer Experience (DX), innovability, and scalability due to coupling**. Monolithic means "one big system" => Coupling & Complexity => hard to understand, change and extend. Changes in one area can unexpectedly affect other areas.
* DX to develop a small new API requires wiring into CKAN core via an extension. Extensions can interact in unexpected ways.
* The core of people who fully understand CKAN has stayed small for a reason: there's a lot to understand.
* https://github.com/ckan/ckan/issues/5333 is an example of a small bug that's hard to track down due to various paths involved.
* Harder to make incremental changes due to coupling (e.g. Python 3 upgrade requires *everything* to be fixed at once - can't do rolling releases).
* **Stability**. One bad extension crashes or slows down the whole system
* **One language => Less developer flexibility (Poor DX)**. Have to write *everything* in Python, including the frontend. This is an issue especially for the frontend: almost all modern frontend development is heavily Javascript-based and theme is the #1 thing people want to customize in CKAN. At the moment, that requires installing *all* of CKAN core (using Docker) plus some familiarity with Python and Jinja templating. This is a big ask.
* **Extension stability and testing**. Testing of extensions is painful (at least without careful factoring into a separate mini library), so extensions are often not tested; they don't have Continuous Integration (CI) or Continuous Deployment (CD). As an example, a highly experienced Python developer at Datopian was still struggling to get extension tests working 6 months into their CKAN work.
* **DX is poor, especially when getting started**. Getting CKAN up and running requires multiple external services (database, Solr, Redis, etc.), making Docker the only viable way of bootstrapping a local development environment. This makes getting started with CKAN daunting and painful.
* **Vertical scalability is poor**. Scaling the system is costly as you have to replicate the whole core process in every machine.
* **System is highly coupled.** Because extensions run in-process, they tend to end up significantly coupled to core, which makes them brittle (this has improved with plugins.toolkit).
* Upgrading core to Python 3 requires upgrading *all* extensions because they run in the same process.
* Search Index is not a separate API, but in Core. So replacing Solr is hard.
The top 2 customizations of CKAN are slow and painful and require deep knowledge of CKAN:
* Theming a site.
* Customizing the metadata.
## Architectures
### CKAN v2 (Classic)
This diagram is based on the file `docker-compose.yml` of [github.com/okfn/docker-ckan](https://github.com/okfn/docker-ckan) (`docker-compose.dev.yml` has the same components, but different configuration).
One difference between this diagram and the file is that we are not including DataPusher, as it is not a required dependency.
>[!tip]Databases may run as Docker containers, or rely on third-party services such as Amazon Relational Database Service (RDS).
```mermaid
graph LR
CKAN[CKAN web app]
CKAN --> DB[(Database)]
CKAN --> Solr[(Solr)]
CKAN --> Redis[(Redis)]
subgraph Docker container
CKAN
end
```
Same setup showing some of the key extensions explicitly:
```mermaid
graph LR
core[CKAN Core] --> DB[(Database)]
datastore --> DB2[(Database - DataStore)]
core --> Solr[(Solr)]
core --> Redis[(Redis)]
subgraph Docker container
core
datastore
datapusher
imageview
...
end
```
CKAN ships with several built-in core extensions. Below, together with the main components, we list a couple of them:
Name | Type | Repository | Description
-----|------|------------|------------
CKAN | Application (API + Worker) | [Link](https://github.com/ckan/ckan) | Data management system (DMS) for powering data hubs and data portals. It's a monolithic web application that includes several built-in extensions and dependencies, such as a job queue service. In theory, it's possible to run it without any extensions.
datapusher | CKAN Extension | [Link](https://github.com/ckan/ckan/tree/master/ckanext/datapusher) | It could also be called "datapusher-connect." It's glue code to connect with a separate microservice called DataPusher, which performs actions when new data arrives.
datastore | CKAN Extension | [Link](https://github.com/ckan/ckan/tree/master/ckanext/datastore) | The interface between CKAN and the structured database, the one receiving datasets and resources (CSVs). It includes an API for the database and an administrative UI.
imageview | CKAN Extension | [Link](https://github.com/ckan/ckan/tree/master/ckanext/imageview) | It provides an interface for creating HTML templates for image resources.
multilingual | CKAN Extension | [Link](https://github.com/ckan/ckan/tree/master/ckanext/multilingual) | It provides an interface for translation and localization.
Database | Database | | People tend to use a single PostgreSQL instance for this. Split across multiple databases, it's where CKAN stores its own information (sometimes referred to as "MetaStore" and "HubStore"), rows of resources (StructuredStore or DataStore), and raw datasets and resources ("BlobStore" or "FileStore"). The latter may store data in the local filesystem or with cloud providers, via extensions.
Solr | Database | | It provides indexing and full-text search for CKAN.
Redis | Database | | Lightweight key-value store, used for caching and job queues.
### CKAN v3 (Next Gen)
CKAN Next Gen is still a DMS, like CKAN Classic; but rather than a monolithic architecture, it follows the microservices approach. CKAN Classic is no longer a dependency, as we have smaller services providing functionality that we may or may not choose to include. This description is based on [Datopian's Technical Documentation](/docs/dms/ckan-v3/next-gen/#roadmap).
```mermaid
graph LR
subgraph api3["..."]
api31["API"]
end
subgraph api2["Administration"]
api21["API"]
end
subgraph api1["Authentication"]
api11["API"]
end
subgraph frontend["Frontend"]
frontendapi["API"]
end
subgraph storage["Raw Resources Storage"]
storageapi["API"]
end
storageapi --> cloudstorage[(Cloud Storage)]
frontendapi --> storageapi
frontendapi --> api11
frontendapi --> api21
frontendapi --> api31
```
At this moment, many important features are only available through CKAN extensions, so that brings us to the hybrid approach.
### CKAN Hybrid v3 (Next Gen)
We may sometimes make an explicit distinction between CKAN v3 "hybrid" and "pure". The reason is that we are not there yet: there are still many opportunities to extract features out of CKAN and CKAN extensions.
In this approach, we still rely on CKAN Classic and all its extensions. Many of them already have tests and years of bug fixes behind them, so we can deliver more when we are not forced to rewrite everything from scratch.
```mermaid
graph TB
subgraph ckanapi3["CKAN API 3"]
ckanapi31["API 3"]
end
subgraph ckanapi2["CKAN API 2"]
ckanapi21["API 2"]
end
subgraph ckanapi1["CKAN API 1"]
ckanapi11["API 1"]
end
subgraph ckanfrontend["Frontend"]
ckanfrontend1["Frontend v2"]
theme["[Project-specific theme]"]
end
subgraph ckanclassic["CKAN Classic"]
ckancore["Core"] --> ckanext["CKAN Extension 1"]
ckancore --> ckanext2["[Project-specific extension]"]
end
ckanfrontend1 --> ckancore
ckanfrontend1 --> ckanapi11
ckanfrontend1 --> ckanapi21
ckanfrontend1 --> ckanapi31
```
Name | Type | Repository | Description
-----|------|------------|------------
Frontend v2 | Application | [Link](https://github.com/datopian/frontend-v2) | Node application for data portals. It communicates with a CKAN Classic instance, through its API, to get data and render HTML. It is written to be extensible, e.g. for connecting to other applications and for theming.
[Project-specific theme] | Frontend Theme | e.g., [Link](https://github.com/datopian/frontend-oddk) | Extension to Frontend v2 where you can personalize the interface, create different pages, and connect with other APIs.
[API 1] | Application | e.g., [Link](https://github.com/datopian/data-subscriptions) | Any application with an API to communicate with the user-facing Frontend v2 or to run tasks in the background. Given the current architecture, this API is usually designed to work with CKAN interfaces. Over time, we may choose to make it more generic, and even replace CKAN Core with other applications.
## Job Stories
In this spreadsheet, you will find a list of common job stories in CKAN projects, along with how you can accomplish them in CKAN v2, v3, and Hybrid v3.
https://docs.google.com/spreadsheets/d/1cLK8xylprmVsoQIbdphqz9-ccSpdDABQExvKdvNJqaQ/edit#gid=757361856
## Glossary
### API
An HTTP API, usually following the REST style.
### Application
A Python package, an API, a worker... It may have other applications as dependencies.
### CKAN Extension
A Python package following specification from [CKAN Extending guide](https://docs.ckan.org/en/2.8/extensions/index.html).
### Database
An organized collection of data.
### Dataset
A group of resources made to be distributed together.
### Frontend Theme
A Node project specializing behavior present in [Frontend v2](https://github.com/datopian/frontend-v2).
### Resource
A data blob. Common formats are CSV, JSON, and PDF.
### System
A group of applications and databases that work together to accomplish a set of tasks.
### Worker
An application that runs tasks in the background. Tasks may run recurrently according to a given schedule, or as soon as they are requested by another application.
## Appendix
### Architecture - CKAN v2 with DataPusher
```mermaid
graph TB
subgraph DataPusher
datapusherapi["DataPusher API"]
datapusherworker["CKAN Service Provider"]
SQLite[(SQLite)]
end
subgraph CKAN
core
datapusher[datapusher ext]
datastore
...
end
core[CKAN Core] --> datastore
datastore --> DB[(Database)]
datapusherapi --> core
datapusher --> datapusherapi
```
Name | Type | Repository | Description
-----|------|------------|------------
DataPusher | System | [Link](https://github.com/ckan/datapusher) | Microservice that parses data files and uploads them to the datastore.
DataPusher API | API | [Link](https://github.com/ckan/datapusher) | HTTP API written in Flask. It is called from the built-in `datapusher` CKAN extension whenever a resource is created (and has the right type).
CKAN Service Provider | Worker | [Link](https://github.com/ckan/ckan-service-provider) | Library for making web services that make functions available as synchronous or asynchronous jobs.
SQLite | Database | | Unknown use. Possibly a worker dependency.
### Old Next Gen Page
Prior to this page, we had one called "Next Gen." It overlaps with this article, although it focuses more on the benefits of microservices. For the time being, that page still exists at [/ckan-v3/next-gen](/docs/dms/ckan-v3/next-gen), although it may get merged with this one in the future.


@ -1,203 +0,0 @@
# Next Gen
“Next Gen” (NG) is our name for the evolution of CKAN from its current state as “CKAN Classic”.
Next Gen has a decoupled, microservice architecture in contrast to CKAN Classic's monolithic architecture. It is also built from the ground up on the Frictionless Data principles and specifications which provide a simple, well-defined and widely adopted set of core interfaces and tooling for managing data.
## Classic to Next Gen
CKAN Classic: monolithic architecture -- everything is one big Python application. Extension is done at the code level and "compiled in" at compile/run-time (i.e., you end up with one big Docker image).
CKAN Next Gen: decoupled, service-oriented -- services connected by network calls. Extension is done by adding new services.
```mermaid
graph LR
subgraph "CKAN Classic"
plugins
end
subgraph "CKAN Next Gen"
microservices
end
plugins --> microservices
```
You can read more about monolithic vs microservice architectures in the [Appendix below](#appendix-monolithic-vs-microservice-architecture).
## Next Gen lays the foundation for the future and brings major immediate benefits
Next Gen's new approach is important in several major ways.
### Microservices are the Future
First, decoupled microservices have become *the* way to design and deploy (web) applications, after first being pioneered by the likes of Amazon in the early 2000s. The last five to ten years have brought microservices "for the masses", with the relevant tooling and technology standardized, open-sourced and widely deployed -- not only containerization such as Docker and Kubernetes, but also programming languages like (server-side) JavaScript and Golang.
By adopting a microservice approach, CKAN can reap the benefits of what is becoming a mature and standard way to design and build (web) applications. This includes the immediate advantages of being aligned with the prevailing technical paradigm, such as shared tooling and developer familiarity.
### Microservices bring Scalability, Reliability, Extensibility and Flexibility
In addition, and even more importantly, the microservices approach brings major benefits in:
1. **Scalability**: dramatically easier and cheaper to scale up -- and down -- in size *and* complexity. Size-wise this is because you can replicate individual services rather than the whole application. Complexity-wise this is because monolithic architectures tend to become "big", whereas a service-oriented approach encourages smaller, lightweight components with cleaner interfaces. This means you can have a much smaller core, making it easier to install, set up and extend. It also means you can use only what you need, making solutions easier to maintain and upgrade.
2. **Reliability**: easier (and cheaper) to build highly reliable, high availability solutions because microservices make isolation and replication easier. For example, in a microservice architecture a problem in CKAN's harvester won't impact your main portal because they run in separate containers. Similarly, you can scale the harvester system separately from the web frontend.
3. **Extensibility**: much easier to create and maintain extensions because they are a decoupled service and interfaces are leaner and cleaner.
4. **Flexibility** aka "Bring your own tech": services can be written in any language, so, for example, you can write your frontend in JavaScript and your backend in Python. In a monolithic architecture all parts must be written in the same language because everything is compiled together. This flexibility makes it easier to use the best tool for the job. It also makes it much easier for teams to collaborate and cooperate, with fewer bottlenecks in development.
ASIDE: decoupled microservices reflect the "unix" way of building networked applications. As with the "unix way" in general, whilst this approach is better -- and simpler -- in the long run, in the short run it often needs substantial foundational work (those Unix authors were legends!). It may also be, at least initially, more resource intensive and more complex infrastructurally. Thus, whilst this approach is "better", it was not surprising that it was initially used for complex and/or high-end applications, e.g. Amazon. This also explains why it took a while for this approach to gain adoption -- it is only in the last few years that we have had robust, lightweight, easy-to-use tooling and patterns for microservices -- "microservices for the masses" if you like.
In summary, the Next Gen approach provides an essential foundation for the continuing growth and evolution of CKAN as a platform for building world-class data portal and data management solutions.
## Evolution not Revolution: Next Gen Components Work with CKAN Classic
*Gradual evolution from CKAN classic (keep what is working, keep your investments, incremental change)*
Next Gen components are specifically designed to work with CKAN "Classic" in its current form. This means existing CKAN users can immediately benefit from Next Gen components and features whilst retaining the value of their existing investment. New (or existing) CKAN-based solutions can adopt a "hybrid" approach using components from both Classic and Next Gen. It also means that the owner of a CKAN-based solution can incrementally evolve from "Classic" to "Next Gen" by replacing components one at a time, gaining new functionality without sacrificing existing work.
ASIDE: we're fortunate that CKAN Classic itself was ahead of its time in its level of "service-orientation". From the start, it had a very rich and robust API and it has continued to develop this, with almost all functionality exposed via the API. It is this rich API and well-factored design that makes it relatively straightforward to evolve CKAN in its current "Classic" form towards Next Gen.
## New Features plus Existing Functionality Improved
In addition to its architecture, Next Gen provides a variety of improvements and extensions to CKAN Classic's functionality. For example:
* Theming and Frontend Customization: theming and customizing CKAN's frontend has become radically easier and quicker. See [Frontend section &raquo;][frontend]
* DMS + CMS unified: integrate the full power of a modern CMS into your data portal and have one unified interface for data and content. See [Frontend section &raquo;][frontend]
* Data Explorer: the existing CKAN data preview/explorer has been completely rewritten in modern React-based Javascript (ReclineJS is now 7y old!). See [Data Explorer section &raquo;][explorer]
* Dashboards: build rich data-driven dashboards and integrate. See [Dashboards section &raquo;][dashboards]
* Harvesting: simpler, more powerful harvesting built on modern ETL. See [Harvesting section &raquo;][harvesting]
And each of these features is easily deployed into an existing CKAN solution!
[frontend]: /docs/dms/frontend
[explorer]: /docs/dms/data-explorer
[dashboards]: /docs/dms/dashboards
[harvesting]: /docs/dms/harvesting
## Roadmap
The journey to Next Gen from Classic can proceed step by step -- it does not need to be a big bang. Like refurbishing and extending a house, we can add a room here or renovate a room there whilst continuing to live happily in the building (and benefitting as our new bathroom comes online, or we get a new conservatory!).
Here's an overview of the journey to Next Gen and current implementation status. More granular information on particular features may sometimes be found on the individual feature page, for example for [Harvesting here](/docs/dms/harvesting#design).
```mermaid
graph LR
start[Start]
themefe[Read Frontend]
authfe[Authentication in FE]
authzfe[Authorization in FE]
previews[Previews]
explorer[Explorer]
permsserv[Permissions Service]
orgs[Organizations]
subgraph Start
start
end
subgraph Frontend
start --> themefe
themefe --> authfe
authfe --> authzfe
themefe --> revisioningfe[Revision UI]
end
subgraph Harvesting
start --> harvestetl[Harvesting ETL + Runner]
harvestetl --> harvestui[Harvest UI]
end
subgraph "Admin UI"
managedataset[Manage Dataset]
manageorg[Manage Organization]
manageuser[Manage Users]
manageconfig[Manage Config]
start --> managedataset
start --> manageorg
managedataset --> manageconfig
end
subgraph "Backend (API)"
start --> permsserv
start --> revision[Backend Revisioning]
end
datastore[DataStore]
subgraph DataStore
start --> datastore
datastore --> dataload[Data Load]
end
subgraph Explorer
themefe --> previews
previews --> explorer
end
subgraph Organizations
start --> orgs
end
subgraph Key
done[Done]
nearlydone[Nearly Done]
inprogress[In Progress]
next[Next Up]
end
classDef done fill:#21bf73,stroke:#333,stroke-width:3px;
classDef nearlydone fill:lightgreen,stroke:#333,stroke-width:3px;
classDef inprogress fill:orange,stroke:#333,stroke-width:2px;
classDef next fill:pink,stroke:#333,stroke-width:1px;
class done,themefe,previews,explorer,harvestetl done;
class nearlydone,authfe,harvestui nearlydone;
class inprogress,dataload inprogress;
class next,permsserv next;
```
## Appendix: Monolithic vs Microservice architecture
Monolithic: Libraries or modules communicate via function calls (inside one big application)
Microservices: Services communicate over a network
The best introduction and definition of microservices comes from Martin Fowler https://martinfowler.com/microservices/
> Microservice architectures will use libraries, but their primary way of componentizing their own software is by breaking down into services. We define libraries as components that are linked into a program and called using in-memory function calls, while services are out-of-process components who communicate with a mechanism such as a web service request, or remote procedure call. https://martinfowler.com/articles/microservices.html
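To make the contrast concrete, here is a tiny, purely illustrative Python sketch: the first call stays inside one process; the second crosses a network boundary (the function and URL are made up):
```python
# Illustrative only: contrast an in-memory call with a network call.
import requests


def render_dataset_page(dataset_id: str) -> str:
    """Stands in for library/extension code linked into the same process."""
    return f"<h1>Dataset {dataset_id}</h1>"


# Monolithic style: an in-memory function call.
html_monolith = render_dataset_page("gdp")

# Microservice style: the same capability behind an HTTP API (made-up URL).
html_service = requests.get("http://frontend:4000/dataset/gdp", timeout=10).text
```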
### Monolithic
```mermaid
graph TD
subgraph "Monolithic - all inside"
a
b
c
end
a --in-memory function call--> b
a --in-memory function call--> c
```
### Microservice
```mermaid
graph TD
subgraph "A Container"
a
end
subgraph "B Container"
b
end
subgraph "C Container"
c
end
a -.network call.-> b
a -.network call.-> c
```


@ -1,23 +0,0 @@
---
sidebar: auto
---
# CKAN Classic
CKAN (Classic) already has great documentation at: https://docs.ckan.org/
This material complements those docs and adds details of our particular setup. Here, among other things, you'll learn how to:
* [Get Started with CKAN for Development -- install and run CKAN on your local machine](/docs/dms/ckan/getting-started)
* [Play around with a CKAN instance including importing and visualising data](/docs/dms/ckan/play-around)
* [Install Extensions](/docs/dms/ckan/install-extension)
* [Create Your Own Extension](/docs/dms/ckan/create-extension)
* [Client Guide](/docs/dms/ckan-client-guide)
* [FAQ](/docs/dms/ckan/faq)


@ -1,162 +0,0 @@
---
sidebar: auto
---
# Introduction
A CKAN extension is a Python package that modifies or extends CKAN. Each extension contains one or more plugins that must be added to your CKAN config file to activate the extension's features.
## Creating and Installing extensions
1. Boot up your Docker Compose setup:
```
docker-compose -f docker-compose.dev.yml up
```
2. To create an extension template using this Docker Compose setup, execute:
```
docker-compose -f docker-compose.dev.yml exec ckan-dev /bin/bash -c "paster --plugin=ckan create -t ckanext ckanext-example_extension -o /srv/app/src_extensions"
```
This command will create an extension template in your local `./src` folder, which is mounted inside the containers at the `/srv/app/src_extensions` directory. Any extension cloned into the `src` folder will be installed in the CKAN container when booting up Docker Compose (`docker-compose up`). This includes installing any requirements listed in a `requirements.txt` (or `pip-requirements.txt`) file and running `python setup.py develop`.
3. Add the plugin to the `CKAN__PLUGINS` setting in your `.env` file.
```
CKAN__PLUGINS=stats text_view recline_view example_extension
```
4. Restart your docker-compose:
```
# Shut down your instance with Ctrl+C and then run it again with:
docker-compose -f docker-compose.dev.yml up
```
> [!tip]CKAN will be started on the paster development server with the '--reload' option to watch for changes in the extension files.
You should see the following output in the console:
```
...
ckan-dev_1 | Installed /srv/app/src_extensions/ckanext-example_extension
...
```
## Edit the extension
Let's edit a template to change the way CKAN is displayed to the user!
1. First you will need write permissions to the extension folder since it was created by the user running docker. Replace `your_username` and execute the following command:
> [!tip]You can find out your current username by typing 'echo $USER' in the terminal.
```
sudo chown -R <your_username>:<your_username> src/ckanext-example_extension
```
2. The `paster` command from earlier created all the files and folder structure needed for our extension. Open `src/ckanext-example_extension/ckanext/example_extension/plugin.py` to see the main file of our extension, which we will edit to add custom functionality:
```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit
class Example_ExtensionPlugin(plugins.SingletonPlugin):
plugins.implements(plugins.IConfigurer)
# IConfigurer
def update_config(self, config_):
toolkit.add_template_directory(config_, 'templates')
toolkit.add_public_directory(config_, 'public')
toolkit.add_resource('fanstatic', 'example_extension')
```
3. We will create a custom Flask Blueprint to extend our CKAN instance with more endpoints. In order to create a new blueprint and add an endpoint we need to:
- Import Blueprint and render_template from the flask module.
- Create the functions that will be used as endpoints
- Implement the IBlueprint interface in our plugin and add the new endpoint.
4. Import `Blueprint` and `render_template` from the `flask` module:
```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit
from flask import Blueprint, render_template
class Example_ExtensionPlugin(plugins.SingletonPlugin):
plugins.implements(plugins.IConfigurer)
# IConfigurer
def update_config(self, config_):
toolkit.add_template_directory(config_, 'templates')
toolkit.add_public_directory(config_, 'public')
toolkit.add_resource('fanstatic', 'example_extension')
```
5. Create a new function, `hello_plugin`:
```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit
from flask import Blueprint, render_template
def hello_plugin():
u'''A simple view function'''
return u'Hello World, this is served from an extension'
class Example_ExtensionPlugin(plugins.SingletonPlugin):
plugins.implements(plugins.IConfigurer)
# IConfigurer
def update_config(self, config_):
toolkit.add_template_directory(config_, 'templates')
toolkit.add_public_directory(config_, 'public')
toolkit.add_resource('fanstatic', 'example_extension')
```
6. Implement the IBlueprint interface in our plugin and add the new endpoint.
```python
import ckan.plugins as plugins
import ckan.plugins.toolkit as toolkit
from flask import Blueprint, render_template
def hello_plugin():
u'''A simple view function'''
return u'Hello World, this is served from an extension'
class Example_ExtensionPlugin(plugins.SingletonPlugin):
plugins.implements(plugins.IConfigurer)
plugins.implements(plugins.IBlueprint)
# IConfigurer
def update_config(self, config_):
toolkit.add_template_directory(config_, 'templates')
toolkit.add_public_directory(config_, 'public')
toolkit.add_resource('fanstatic', 'example_extension')
# IBlueprint
def get_blueprint(self):
u'''Return a Flask Blueprint object to be registered by the app.'''
# Create Blueprint for plugin
blueprint = Blueprint(self.name, self.__module__)
blueprint.template_folder = u'templates'
# Add plugin url rules to Blueprint object
blueprint.add_url_rule('/hello_plugin', '/hello_plugin', hello_plugin)
return blueprint
```
7. Go back to the browser and navigate to http://ckan:5000/hello_plugin. You should see the value returned by our view!
![New Blueprint output](https://i.imgur.com/AZjTDbN.png)
Now that you have added a new view and endpoint to your plugin you are ready for the next step of the tutorial! You can also check the complete code of this plugin in the [ckan repository](https://github.com/ckan/ckan/tree/master/ckanext/example_flask_iblueprint).


@ -1,110 +0,0 @@
---
sidebar: auto
---
# FAQ
This page provides answers to some frequently asked questions.
## How to create an extension template on my local machine
You can use the `paster` command in the same way as a source install. To create an extension execute the following command:
```
docker-compose -f docker-compose.dev.yml exec ckan-dev /bin/bash -c "paster --plugin=ckan create -t ckanext ckanext-myext -o /srv/app/src_extensions"
```
This will create an extension template inside the container's folder `/srv/app/src_extensions` which is mapped to your local `src/` folder.
Now you can navigate to your local folder `src/` and see the extension created by the previous command and open the project in your favorite IDE.
## How to move that extension into its own git repository so I can install it independently in other instances
The crucial thing to understand is that each extension gets its own repository on GitHub (or elsewhere). You can either create the repository first and then clone it into `src/`, or do the opposite, as follows:
* Create the Extension, for example: `ckanext-myext`.
```
docker-compose -f docker-compose.dev.yml exec ckan-dev /bin/bash -c "paster --plugin=ckan create -t ckanext ckanext-myext -o /srv/app/src_extensions"
```
* Initialize a new git repository in the extension folder `src/ckanext-myext`:
```
cd src/ckanext-myext
git init
```
* Configure remote/origin
```
git remote add origin <remote_repository_url>
```
* Add your files and push the first commit
```
git add .
git commit -m 'Initial Commit'
git push
```
**Note:** The `src/` folder is gitignored in the `okfn/docker-ckan` repository, so initializing new git repositories inside it is fine.
## How to quickly refresh changes to my extension in the dockerized environment so I can get quick feedback on my changes
This docker-compose setup for the dev environment is already configured to set `debug=True` in the configuration file and to auto-reload on Python- and template-related changes. You do not have to reload when making changes to HTML, JavaScript or configuration files - you just need to refresh the page in the browser.
See the CKAN images section of the [repository documentation](https://github.com/okfn/docker-ckan#ckan-images) for more details.
## How to run tests for my extension in the dockerized environment so I can have a quick test-development cycle
We write and store unit tests inside the `ckanext/myext/tests` directory. To run unit tests you need to be running the `ckan-dev` service of this docker-compose setup.
* Once running, in another terminal window run the test command:
```
docker-compose -f docker-compose.dev.yml exec ckan-dev nosetests --ckan-dev --nologcapture --reset-db -s -v --with-pylons=/srv/app/src_extensions/ckanext-myext/test.ini /srv/app/src_extensions/ckanext-myext/
```
You can also pass nosetests arguments for debugging:
```
--ipdb --ipdb-failure
```
**Note:** Right now all tests will be run, it is not possible to choose a specific file or test.
## How to debug my methods in the dockerized environment so I can have a better understanding of what's going on in my logic
To run a container and be able to add a breakpoint with `pdb`, run the `ckan-dev` container with the `--service-ports` option:
```
docker-compose -f docker-compose.dev.yml run --service-ports ckan-dev
```
This will start a new container, displaying the standard output in your terminal. If you add a breakpoint in a source file in the `src` folder (`import pdb; pdb.set_trace()`) you will be able to inspect it in this terminal next time the code is executed.
## How to debug core CKAN code
Currently, this docker-compose setup doesn't allow us to debug core CKAN code, since it lives inside the container. However, with a few tweaks we can make the container use a local clone of CKAN core hosted on our machine. To do so:
- Create a new folder called `ckan_src` in this `docker-ckan` folder, at the same level as `src/`
- Clone CKAN and check out the version you want to debug/edit:
```
git clone https://github.com/ckan/ckan/ ckan_src
cd ckan_src
git checkout ckan-2.8.3
```
- Edit `docker-compose.dev.yml` and add an entry to ckan-dev's and ckan-worker-dev's volumes. This will allow the Docker container to access the CKAN code hosted on our machine.
```
- ./ckan_src:/srv/app/ckan_src
```
- Create a script in `ckan/docker-entrypoint.d/z_install_ckan.sh` to install CKAN inside the container from the cloned repository (instead of the one installed in the Dockerfile)
```
#!/bin/bash
echo "*********************************************"
echo "overriding with ckan installation with ckan_src"
pip install -e /srv/app/ckan_src
echo "*********************************************"
```
That's it. This will install CKAN inside the container in development mode, from the shared folder. Now you can open the `ckan_src/` folder from your favorite IDE and start working on CKAN.


@ -1,77 +0,0 @@
# CKAN: Getting Started for Development
## Prerequisites
CKAN has a rich tech stack so we have opted to standardize our instructions with Docker Compose, which will help you spin up every service in a few commands.
If you already have Docker Compose, you are ready to go!
If not, please, follow instructions on [how to install docker-compose](https://docs.docker.com/compose/install/).
On Ubuntu you can run:
```
sudo apt-get update
sudo apt-get install docker-compose
```
## Cloning the repo
```
git clone https://github.com/okfn/docker-ckan
# or git clone git@github.com:okfn/docker-ckan.git
cd docker-ckan
```
## Booting CKAN
Create a local environment file:
```
cp .env.example .env
```
Build and Run the instances:
> [!tip]'docker-compose' must be run with 'sudo'. If you want to change this, you can follow the steps below. NOTE: The 'docker' group grants privileges equivalent to the 'root' user.
Create the `docker` group: `sudo groupadd docker`
Add your user to the `docker` group: `sudo usermod -aG docker $USER`
Change the storage directory ownership from `root` to `ckan` by adding the commands below to `ckan/Dockerfile.dev`:
```
RUN mkdir -p /var/lib/ckan/storage/uploads
RUN chown -R ckan:ckan /var/lib/ckan/storage
```
At this point, you can log out and log back in for these changes to apply. You can also use the command `newgrp docker` to temporarily enable the new group for the current terminal session.
```
docker-compose -f docker-compose.dev.yml up --build
```
When you see this log message:
![](https://i.imgur.com/WUIiNRt.png)
You can navigate to `http://localhost:5000`
![CKAN Home Page](https://i.imgur.com/T5LWo8A.png)
and log in with the credentials that the docker-compose setup created for you (user: `ckan_admin`, password: `test1234`).
>[!tip]To learn key concepts about CKAN, including what it is and how it works, you can read the User Guide.
[CKAN User Guide](https://docs.ckan.org/en/2.8/user-guide.html).
## Next Steps
[Play around with CKAN portal](/docs/dms/ckan/play-around).
## Troubleshooting
Login / Logout button breaks the experience:
- Change the URL from `http://ckan:5000` to `http://localhost:5000`. A complete fix is described in the [Play around with CKAN portal](/docs/dms/ckan/play-around). (Your next step. ;))


@ -1,76 +0,0 @@
---
sidebar: auto
---
# Installing extensions
A CKAN extension is a Python package that modifies or extends CKAN. Each extension contains one or more plugins that must be added to your CKAN config file to activate the extension's features.
In this section we will only cover how to install existing extensions. See the [next steps](/docs/dms/ckan/create-extension) in case you need to create or modify extensions.
## Add new extension
Let's install [Hello World](https://github.com/rclark/ckanext-helloworld) on the portal. For that we need to do two things:
1. Install the extension when building the Docker image
2. Add the new extension to the CKAN plugins list
### Install extension on docker build
For this we need to modify the Dockerfile for the ckan service. Let's edit it:
```
vi ckan/Dockerfile.dev
# Add following
RUN pip install -e git+https://github.com/rclark/ckanext-helloworld.git#egg=ckanext-helloworld
```
*Note:* In this example we use the vi editor, but you can use any editor of your choice.
### Add new extension to plugins
We need to modify the `.env` file for that. Search for `CKAN__PLUGINS` and add the new extension to the existing list:
```
vi .env
CKAN__PLUGINS=helloworld envvars image_view text_view recline_view datastore datapusher
```
## Check extension is installed
After modifying the configuration files you will need to restart the portal. If your CKAN portal is up and running, bring it down and restart it:
```
docker-compose -f docker-compose.dev.yml stop
docker-compose -f docker-compose.dev.yml up --build
```
### Check what extensions you already have:
http://ckan:5000/api/3/action/status_show
The response should include a list of all extensions, with `helloworld` among them:
```
"extensions": [
"envvars",
"helloworld",
"image_view",
"text_view",
"recline_view",
"datastore",
"datapusher"
]
```
### Check the extension is actually working
This extension simply adds a new route `/hello/world/name` to base CKAN and says hello:
http://ckan:5000/hello/world/John-Doe
## Next steps
[Create your own extension](/docs/dms/ckan/create-extension)


@ -1,285 +0,0 @@
---
sidebar: auto
---
# How to play around with CKAN
In this section, we are going to show some basic functionality of CKAN focused on the API.
## Prerequisites
- We assume you've already completed the [Getting Started Guide](/docs/dms/ckan/getting-started).
- You have a basic understanding of Key data portal concepts:
CKAN is a tool for making data portals to manage and publish datasets. You can read about the key concepts such as Datasets and Organizations in the User Guide -- or you can just dive in and play around!
https://docs.ckan.org/en/2.9/user-guide.html
>[!tip]
Install a [JSON formatter plugin for Chrome](https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa?hl=en) or browser of your choice.
If you are familiar with the command line tool `curl`, you can use that.
In this tutorial, we will be using `curl`, but for most of the commands, you can paste a link in your browser. For POST commands, you can use [Postman](https://www.getpostman.com/) or [Google Chrome Plugin](https://chrome.google.com/webstore/detail/postman/fhbjgbiflinjbdggehcddcbncdddomop).
## First steps
>[!tip]
By default the portal is accessible on http://localhost:5000. Let's update your `/etc/hosts` to access it on http://ckan:5000:
```
vi /etc/hosts # You can use the editor of your choice
# add following
127.0.0.1 ckan
```
At this point, you should be able to access the portal on http://ckan:5000.
![CKAN Home Page](https://i.imgur.com/T5LWo8A.png)
Let's add some fixtures to it. In software, a fixture is a known set of data used consistently for testing or demos (in this case, data for you to play around with). Run the following from another terminal (do NOT stop the docker-compose process launched earlier, as this command depends on it):
```sh
docker-compose -f docker-compose.dev.yml exec ckan-dev ckan seed basic
```
Optionally you can `exec` into a running container using
```sh
docker exec -it [name of container] sh
```
and run the `ckan` command there
```sh
ckan seed basic
```
You should be able to see 2 new datasets on the home page:
![CKAN with data](https://i.imgur.com/BiSifyb.png)
To get more details on ckan commands please visit [CKAN Commands Reference](https://docs.ckan.org/en/2.9/maintaining/cli.html#ckan-commands-reference).
### Check CKAN API
This tutorial focuses on the CKAN API as that is central to development work and requires more guidance. We also invite you to explore the user interface which you can do directly yourself by visiting http://ckan:5000/.
#### Let's check the portal status
Go to http://ckan:5000/api/3/action/status_show.
You should see something like this:
```json
{
"help": "https://ckan:5000/api/3/action/help_show?name=status_show",
"success": true,
"result": {
"ckan_version": "2.9.x",
"site_url": "https://ckan:5000",
"site_description": "Testing",
"site_title": "CKAN Demo",
"error_emails_to": null,
"locale_default": "en",
"extensions": [
"envvars",
...
"demo"
]
}
}
```
This means everything is OK: the CKAN portal is up and running, the API is working as expected. In case you see an internal server error, please check the logs in your terminal.
### A Few useful API endpoints to start with
CKAN's Action API is a powerful, RPC-style API that exposes all of CKAN's core features to API clients. All of a CKAN website's core functionality (everything you can do with the web interface and more) can be used by external code that calls the CKAN API.
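If you prefer calling the Action API from code instead of the browser or `curl`, a minimal Python sketch using the endpoints listed below looks like this:
```python
# Minimal sketch: call the Action API with Python instead of the browser/curl.
import requests

resp = requests.get("http://ckan:5000/api/3/action/package_list", timeout=10)
resp.raise_for_status()
payload = resp.json()

if payload["success"]:
    print(payload["result"])  # e.g. ["annakarenina", "warandpeace"]
```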
#### Get a list of all datasets on the portal
http://ckan:5000/api/3/action/package_list
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=package_list",
"success": true,
"result": ["annakarenina", "warandpeace"]
}
```
#### Search for a dataset
http://ckan:5000/api/3/action/package_search?q=russian
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=package_search",
"success": true,
"result": {
"count": 2,
...
}
}
```
#### Get dataset details
http://ckan:5000/api/3/action/package_show?id=annakarenina
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=package_show",
"success": true,
"result": {
"license_title": "Other (Open)",
...
}
}
```
#### Search for a resource
http://ckan:5000/api/3/action/resource_search?query=format:plain%20text
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=resource_search",
"success": true,
"result": {
"count": 1,
"results": [
{
"mimetype": null,
...
}
]
}
}
```
#### Get resource details
http://ckan:5000/api/3/action/resource_show?id=288455e8-c09c-4360-b73a-8b55378c474a
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=resource_show",
"success": true,
"result": {
"mimetype": null,
...
}
}
```
*Note:* These are only a few examples. You can find a full list of API actions in the [CKAN API guide](https://docs.ckan.org/en/2.9/api/#action-api-reference).
### Create Organizations, Datasets and Resources
There are 4 steps:
- Get an API key;
- Create an organization;
- Create a dataset inside an organization (you can't create a dataset without a parent organization);
- And add resources to the dataset.
#### Get a Sysadmin Key
To create your first dataset, you need an API key.
You can see sysadmin credentials in the file `.env`. By default, they should be
- Username: `ckan_admin`
- Password: `test1234`
1. Navigate to http://ckan:5000/user/login and login.
2. Click on your username (`ckan_admin`) in the upper-right corner.
3. Scroll down until you see `API Key` on the left side of the screen and copy its value. It should look similar to `c7325sd4-7sj3-543a-90df-kfifsdk335`.
#### Create Organization
You can create an organization from the browser easily, but let's use [CKAN API](https://docs.ckan.org/en/2.9/api/#ckan.logic.action.create.organization_create) to do so.
```sh
curl -X POST http://ckan:5000/api/3/action/organization_create -H "Authorization: 9c04a69d-79f4-4b4b-b4e1-f2ac31ed961c" -d '{
"name": "demo-organization",
"title": "Demo Organization",
"description": "This is my awesome organization"
}'
```
Response:
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=organization_create",
"success": true,
"result": {"users": [
{
"email_hash":
...
}
]}
}
```
#### Create Dataset
Now, we are ready to create our first dataset.
```sh
curl -X POST http://ckan:5000/api/3/action/package_create -H "Authorization: 9c04a69d-79f4-4b4b-b4e1-f2ac31ed961c" -d '{
"name": "my-first-dataset",
"title": "My First Dataset",
"description": "This is my first dataset!",
"owner_org": "demo-organization"
}'
```
Response:
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=package_create",
"success": true,
"result": {
"license_title": null,
...
}
}
```
This will create an empty (draft) dataset.
#### Add a resource to it
```sh
curl -X POST http://ckan:5000/api/3/action/resource_create -H "Authorization: 9c04a69d-79f4-4b4b-b4e1-f2ac31ed961c" -d '{
"package_id": "my-first-dataset",
"url": "https://raw.githubusercontent.com/frictionlessdata/test-data/master/files/csv/100kb.csv",
"description": "This is the best resource ever!" ,
"name": "brand-new-resource"
}'
```
Response:
```json
{
"help": "http://ckan:5000/api/3/action/help_show?name=resource_create",
"success": true,
"result": {
"cache_last_updated": null,
...
}
}
```
That's it! Now you should be able to see your dataset on the portal at http://ckan:5000/dataset/my-first-dataset.
## Next steps
* [Install Extensions](/docs/dms/ckan/install-extension).


@ -1,81 +0,0 @@
---
sidebar: auto
---
# Content Management System (CMS) for Data Portals
## Summary
When selecting a CMS solution for Data Portals, we always recommend using a headless CMS, as it provides full flexibility when building your system. Headless CMS means that only content (no HTML, CSS, JS) is created in the CMS backend and delivered to the frontend via an API.
> The traditional CMS approach to managing content put everything in one big bucket — content, images, HTML, CSS. This made it impossible to reuse the content because it was commingled with code. Read more - https://www.contentful.com/r/knowledgebase/what-is-headless-cms/.
## Features
Core features:
* Create and manage blog posts (or news), e.g., `/news/abcd`
* Create and manage static pages, e.g., `/about`, `/privacy` etc.
Important features:
* User management, e.g., ability to manage editors so that multiple users can edit content.
* User roles, e.g., ability to assign different roles for users so that we can have admins, editors, reviewers.
* Draft content, e.g., useful when working on content development with a review/feedback loop. However, this is not essential if you have multiple environments.
* A syntax for writing content with text formatting, multi-level headings, links, images, videos, bullet points. For example, markdown.
* User-friendly interface (text editor) to write content.
```mermaid
graph LR
CMS -.-> Blog["Blog or news section"]
CMS -.-> IndBlog["Individual blog post"]
CMS -.-> About["About page content"]
CMS -.-> TC["Terms and conditions page content"]
CMS -.-> Privacy["Privacy policy"]
CMS -.-> Other["Other static pages"]
```
## Options
Headless CMS options:
* WordPress (headless option)
* Drupal (headless option)
* TinaCMS - https://tina.io/
* Git-based CMS - a custom solution based on a Git repository.
* Strapi - https://docs.strapi.io/developer-docs/latest/getting-started/introduction.html
* Ghost - https://ghost.org/docs/
* CKAN Pages (built-in CMS option) - https://github.com/ckan/ckanext-pages
*Note: there are many CMSs available, both open-source and proprietary. We only consider a few of them in this article, and our requirement is that content can be fetched via an API (headless CMS). Readers are welcome to add more options to the list.*
Comparison criteria:
* Self-hosting (note this isn't a criterion for most projects, and managed hosting is sometimes the better option)
* Free and open source
* Multi-language posts (unnecessary if your portal is single-language)
Comparison:
| Options | Hosting | Free | Multi language |
| -------- | -------- | -------- | -------------- |
| Drupal | Tedious | Yes | Not straightforward |
| WordPress| Tedious | Yes | Terrible UX |
| TinaCMS | Medium | Yes | Limited |
| Git-based| Easy | Yes | Custom |
| Strapi | Medium | Yes | Simple |
| Ghost | Medium | Yes | Simple |
| CKAN Pages| Easy | Yes | ? |
## Conclusion and recommendation
The final decision should be made based on the following:
* How often will editors create content? E.g., daily, weekly, monthly, occasionally.
* How much content do you already have that needs to be migrated?
* How many content editors are you planning to have, and what is their technical expertise?
* Are there any specific requirements, e.g., must you host in your own cloud?
By default, we would recommend considering options such as Strapi, TinaCMS and a Git-based CMS. You can even start with CKAN's simple built-in Pages and only move to a more sophisticated CMS once it is required.


@ -1,163 +0,0 @@
# Dashboards
## What can you do?
* Describe visualizations in JSON and create interactive widgets
* Customize dashboard layout using well-known HTML
* Style dashboard design with TailwindCSS utility classes
* Rapidly create basic charts using "simple" graphing specification
* Create advanced widgets by utilizing "vega" visualization grammar
## How?
To create a dashboard you need to have some basic knowledge of:
* git
* JSON
* HTML
Before proceeding further, make sure you have forked the dashboards repository - https://github.com/datopian/dashboards.
### Create a directory for your dashboard
In the root of the project, create a directory for your dashboard. The name of this directory is the name of your dashboard, so make it short and meaningful. Here are some good examples:
* population
* environment
* housing
Your dashboard will then be available at https://domain.com/dashboards/your-dashboard-name.
Note that your dashboard directory will contain 2 files (see the example below):
* `index.html` - [HTML template](#Set-up-your-layout)
* `config.json` - [configurations for widgets](#Configure-vizualizations)
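For example, using one of the names suggested above:
```sh
# Create the dashboard directory with its two required files.
mkdir population
touch population/index.html population/config.json
```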
### Set up your layout
You need to prepare an HTML template for your dashboard. There's no need to create an entire HTML page, only the snippet needed to inject the widgets:
```html
<h1>My example dashboard</h1>
<div id="widget1"></div>
<div id="widget2"></div>
```
In the example above, we've created 2 div elements that we can reference by id when configuring visualizations.
Note that you can add any HTML tags and make your layout stand out. In the next section we'll explain how to do some styling.
### Style it
This step is optional, but if you have a dashboard with lots of widgets and metadata, you might want to style it so it displays nicely:
* Use TailwindCSS utility classes **(recommended)**
* Official docs - https://tailwindcss.com/
* Cheat sheet - https://nerdcave.com/tailwind-cheat-sheet
* Add inline CSS
Example of using TailwindCSS utility classes:
```html
<h1 class="text-gray-700 text-lg">My example dashboard</h1>
<div class="inline-block bg-gray-200 m-10" id="widget1"></div>
<div class="inline-block bg-gray-200 m-10" id="widget2"></div>
```
### Configure visualizations
In your config file `config.json`, you describe your dashboard in the following way:
```json
{
"widgets": [],
"datasets": []
}
```
* `widgets` - a list of objects where each object contains information about where a widget should be injected and what it should look like (see below for examples).
* `datasets` - a list of dataset URLs.
Example of a minimal widget object:
```json
{
"elementId": "widget1",
"view": {
"resources": [
{
"datasetId": "",
"name": ""
}
],
"specType": "",
"spec": {}
}
}
```
where:
* `elementId` - the `id` of the HTML element you want to use as the container for your widget. See [how we defined it here](#Set-up-your-layout).
* `view` - a descriptor of a visualization (widget).
* `resources` - a list of resources needed for a widget and required manipulations (transformations).
* `datasetId` - the id (name) of the dataset from which the resource is extracted.
* `name` - name of the resource.
* `transform` - transformations required for a resource (optional). If you want to learn more about transforms:
* Filtering data and applying formula: https://datahub.io/examples/transform-examples-on-co2-fossil-global#readme
* Sampling: https://datahub.io/examples/example-sample-transform-on-currency-codes#readme
* Aggregating data: https://datahub.io/examples/transform-example-gdp-uk#readme
* `specType` - type of a widget, e.g., `simple`, `vega` or `figure`.
* `spec` - specification for selected widget type. See below for examples.
* `title`, `legend`, `footer` - optional metadata for a widget. All must be strings. A complete example is sketched below.
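Putting this together, here is a sketch of a complete `config.json` (the dataset URL, resource name and column names are illustrative; the `simple` spec follows the simple graph spec linked in the next section):
```sh
# Write a minimal config.json: one line chart bound to the #widget1 container.
cat > config.json <<'EOF'
{
  "widgets": [
    {
      "elementId": "widget1",
      "view": {
        "resources": [
          { "datasetId": "population", "name": "population-figures" }
        ],
        "specType": "simple",
        "spec": { "type": "line", "group": "year", "series": ["population"] }
      }
    }
  ],
  "datasets": ["https://datahub.io/core/population/datapackage.json"]
}
EOF
```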
#### Basic charts
The simple graph spec is the easiest and quickest way to specify a visualization. Using it, you can generate line and bar charts:
https://frictionlessdata.io/specs/views/#simple-graph-spec
#### Advanced visualizations
Please check these instructions to create advanced graphs via the Vega specification:
https://frictionlessdata.io/specs/views/#vega-spec
#### Figure widget
The figure widget is used to display a single value from a dataset. For example, you might want to show the latest unemployment rate in your dashboard so that it indicates the current status of your city's economy. See the left-hand side widgets here - https://london.datahub.io/.
A specification for the figure widget would have the following structure:
```json
{
"fieldName": "",
"suffix": "",
"prefix": ""
}
```
where the "fieldName" attribute is used to extract a specific value from a row. The "suffix" and "prefix" attributes are optional strings used to surround the figure, e.g., you can add a percent sign to indicate that the value is a percentage.
Note that the first row of the data is used, which means you may need to transform the data to surface the relevant value. See this example for details - https://github.com/datopian/dashboard-js/blob/master/example/script.js#L12-L22.
#### Example
Check out the carbon emissions per capita dashboard as an example of creating advanced visualizations:
https://github.com/datopian/dashboards/tree/master/co2-emission-by-nation
## Share it with the world!
To make your dashboard live on the data portal, you need to:
1. Create a pull request (see the example commands below).
2. Implement any changes requested during review.
3. Wait until your work is reviewed and merged into the "master" branch.
4. Done! Your dashboard is now available at https://domain.com/dashboards/your-dashboard-name
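For example, assuming you work on a branch in your fork (the branch and dashboard names are illustrative):
```sh
# Commit your dashboard and push it to your fork, then open a pull request
# against the datopian/dashboards repository.
git checkout -b add-population-dashboard
git add population/
git commit -m "Add population dashboard"
git push origin add-population-dashboard
```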
## Research
* http://dashing.io/ - no longer maintained as of 2016
* Replaced by https://smashing.github.io/


@ -1,358 +0,0 @@
# HDX Technical Architecture for Quick Dashboards
Notes from analysis and discussion in 2018.
# Concepts
* Bite (View): a description of an individual chart / map / fact and its data (source)
* bite (for Simon): title, desc, data (compiled), uniqueid, map join info
* view (Data Package views): title, desc, data sources (on parent data package), transforms, ...
* compiled view: title, desc, data (compiled)
* Data source:
* Single HXL file (Currently, Simon's approach requires that all the data is in a single table so there is always a single data source.)
* Data Package(s)
* Creator / Editor: creating and editing the dashboard (given the source datasets)
* Renderer: given dashboard config render the dashboard
# Dashboard Creator
```mermaid
graph LR
datahxl[data+hxl]
layouts[Layout options]
dashboard["Dashboard (config)<br/><br/>(Layout, Data Sources, Selected Bites)"]
editor[Editor]
bites[Bites<br /><em>potential charts, maps etc</em>]
datahxl --suggester--> bites
bites --> editor
layouts --> editor
editor --save--> dashboard
```
## Bite generation
```mermaid
graph LR
data[data with hxl] --> inferbites(("Iterate Recipes<br/>and see what<br/>matches"))
inferbites --> possmatches[List of potential bites]
possmatches --no map info--> done[Bite finished]
possmatches --lat+lon--> done
possmatches --geo info--> maplink(("Check pcodes<br/> and link<br/>map server url"))
maplink -.-> fuzzy((Fuzzy Matcher))
fuzzy --> done
maplink --> done
maplink --error--> nobite[No Bite]
```
## Extending to non-HXL data
It is easy to extend this to non-HXL data by using base HXL types and inference e.g.
```
date => #date
geo => #geo+lon
geo => #geo+lat
string/category => #indicator
```
```mermaid
graph LR
data[data + syntax]
datahxl[data+hxl]
layouts[layout options]
dashboard["Dashboard (config)"]
editor[Editor]
bites[Bites<br /><em>potential charts, maps etc</em>]
data --infer--> datahxl
datahxl --suggester--> bites
bites --> editor
layouts --> editor
editor --save--> dashboard
```
# Dashboard Renderer
Rendering the dashboard involves:
```mermaid
graph LR
bites[Compiled Bites/Views]
renderer["Renderer<br/>(Layout + charting / mapping libs)"]
data[Data]
subgraph Dashboard Config
bitesconf[Bites/Views Config]
layoutconf[Layout Config]
end
bitecompiler[Bite/View Compiler]
bitecompiler --> bites
bitesconf --> bitecompiler
data --> bitecompiler
layoutconf --> renderer
bites --> renderer
renderer --> dashboard[HTML Dashboard]
```
## Compiled View generation
See https://docs.datahub.io/developers/views/
----
# Architecture Proposal
* data loader library
* File: rows, fields (rows, columns)
* type inference (?)
* syntax: table schema infer
* semantics (not now)
* data transform library (include hxl support)
* suggester library
* renderer library
Interfaces / Objects
* File
* (Dataset)
* Transform
* Algorithm / Recipe
* Bite / View
* Ordered Set of Bites
* Dashboard
## File (and Dataset)
http://okfnlabs.org/blog/2018/02/15/design-pattern-for-a-core-data-library.html
https://github.com/datahq/data.js
File
rows
descriptor
schema
schema
## Recipe
```json=
{
'id':'chart0001',
'type':'chart',
'subType':'row',
'ingredients':[{'name':'what','tags':['#activity-code-id','#sector']}],
'criteria':['what > 4', 'what < 11'],
'variables': ['what', 'count()'],
'chart':'',
'title':'Count of {1}',
'priority': 8,
}
```
## Bite / Compiled View
```json=
{
bite: array [...data for chart...],
id: string "...chart bite ID...",
priority: number,
subtype: string "...bite subtype - row, pie...",
title: string "...title of bite...",
type: string "...bite type...",
uniqueID: string "...unique ID combining bite and data structure",
}
```
=>
## Dashboard
```json=
{
"title":"",
"subtext":"",
"filtersOn":true,
"filters":[],
"headlinefigures":0,
"headlinefigurecharts":[
],
"grid":"grid5",
"charts":[
{
"data":"https://proxy.hxlstandard.org/data.json?filter01=append&append-dataset01-01=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1FLLwP6nxERjo1xLygV7dn7DVQwQf0_5tIdzrX31HjBA%2Fedit%23gid%3D0&filter02=select&select-query02-01=%23status%3DFunctional&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1R9zfMTk7SQB8VoEp4XK0xAWtlsQcHgEvYiswZsj9YA4%2Fedit%23gid%3D0",
"chartID":""
},
{
"data":"https://proxy.hxlstandard.org/data.json?filter01=append&append-dataset01-01=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1FLLwP6nxERjo1xLygV7dn7DVQwQf0_5tIdzrX31HjBA%2Fedit%23gid%3D0&filter02=select&select-query02-01=%23status%3DFunctional&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1R9zfMTk7SQB8VoEp4XK0xAWtlsQcHgEvYiswZsj9YA4%2Fedit%23gid%3D0",
"chartID":""
}
]
}
```
```
var config = {
layout: 2x2 // in city-indicators dashboard is handcrafted in layout
widgets: [
{
elementId / data-id: ...
view: {
metadata: { title, sources: "World Bank"}
resources: rule for creating compiled list of resources. [ { datasetId: ..., resourceId: ..., transform: ...} ]
specType:
viewspec:
}
},
{
},
]
datasets: [
list of data package urls ...
]
}
```
Simon's example
https://simonbjohnson.github.io/hdx-iom-dtm/
```javascript=
{
// metadata for dashboard
"title":"IOM DTM Example",
"subtext":" ....",
"headlinefigures": 3,
"grid": "grid5", // user chosen layout for dashboard. Choice of 10 grids
"headlinefigurecharts": [ //widgets - headline widget
{
"data": "https://beta.proxy.hxlstandard.org/data/1d0a79/download/africa-dtm-baseline-assessments-topline.csv",
"chartID": "text0013/#country+name/1" // bite Id
// elementId: ... // implicit from order in grid ...
},
{
"data": "https://beta.proxy.hxlstandard.org/data/1d0a79/download/africa-dtm-baseline-assessments-topline.csv",
"chartID": "text0012/#affected+hh+idps/5"
},
{
"data": "https://beta.proxy.hxlstandard.org/data/1d0a79/download/africa-dtm-baseline-assessments-topline.csv",
"chartID":"text0012/#affected+idps+ind/6"
}
],
"charts": [ // chart widgets
{
"data": "https://beta.proxy.hxlstandard.org/data/1d0a79/download/africa-dtm-baseline-assessments-topline.csv",
"chartID": "map0002/#adm1+code/4/#affected+idps+ind/6",
"scale":"log" // chart config ...
},
{
"data": "https://beta.proxy.hxlstandard.org/data/1d0a79/download/africa-dtm-baseline-assessments-topline.csv",
"chartID": "chart0009/#country+name/1/#affected+idps+ind/6",
"sort":"descending"
}
]
}
```
Algorithm
1. Extract the data references to a common list of datasets and fetch them
2. Generate compiled data via hxl.js plus your own code, transforming it into the final data for charting etc.
```
function transformChart(rawSourceData (csv parsed), bite) => [ [ ...], [...]] - data for chart
hxl.js
custom code
function transformMap
function transformText ...
```
https://github.com/SimonbJohnson/hxlbites.js
https://github.com/SimonbJohnson/hxlbites.js/blob/master/hxlBites.js#L957
```
hb.reverse(bite) => compiled bite (see above) (data, chartConfig)
```
3. Generate the dashboard HTML and compute the element ids in the actual page (element ids are computed from the grid setup)
4. Now have a final dashboard config
```
widgets: [
{
data: [ [...], [...]]
widgetType: text, chart, map ...
elementId: // element to bind to ...
}
]
```
5. Now use specific renderer libraries e.g. leaflet, plotly/chartist etc to render out into page
https://github.com/SimonbJohnson/hxldash/blob/master/js/site.js#L294
### Notes
"Source" version of dashboard with data uncompiled.
Compiled version of dashboard with final data inline ...
hxl.js takes an array of arrays ... and outputs array of arrays ...
```
{
schema: [...]
data: [...]
}
```
# Renderer
* Renderer for the dashboard
* Renderer for each widget
```
function createChart(bite, elementId) => svg in bite
```
## Charts
* Data Package View => svg/png etc
* plotly
* vega (d3)
* https://github.com/frictionlessdata/datapackage-render-js
* chartist
* react-charts
## Map
* Leaflet
* react-leaflet
## Tables
...


@ -1,270 +0,0 @@
# Data APIs (and the DataStore)
## Introduction
A Data API provides *API* access to data stored in a [DMS][]. APIs provide granular, per record access to datasets and their component data files. They offer rich querying functionality to select the records you want, and, potentially, other functionality such as aggregation. Data APIs can also provide write access, though this has traditionally been rarer.[^rarer]
Furthermore, much of the richer functionality of a DMS or Data Portal such as data visualization and exploration require API data access rather than bulk download.
[DMS]: /docs/dms/dms
[^rarer]: It is rarer because write access usually means a) the data for this dataset is stored as a structured database rather than a data file (which is normally more expensive), and b) the Data Portal has now become the primary (or semi-primary) home of this dataset rather than simply being the host of a dataset whose home and maintenance is elsewhere.
### API vs Bulk Access
Direct download of a whole data file is the default method of access for data in a DMS. API access complements this direct, "bulk" download approach. In some situations API access may be the primary access option (so-called "API first"). In other cases, structured storage and API read/write may be the *only* way the data is stored and there is no bulk storage -- for example, this would be a natural approach for time series data that is rapidly updated, e.g. every minute.
*Fig 1: Contrasting Download and API based access*
```bash
# simple direct file access. You download
https://my-data-portal.org/my-dataset/my-csv-file.csv
# API based access. Find the first 5 records with 'awesome'
https://my-data-portal.org/data-api/my-dataset/my-csv-file-identifier?q=awesome&limit=5
```
In addition to differing in the volume of access, APIs often differ from bulk download in their data format: following web conventions, data APIs usually return data in a standard format such as JSON (and can also provide various other formats, e.g. XML). By contrast, direct data access necessarily supplies the data in whatever format it was created in.
### Limitations of APIs
Whilst Data APIs are in many ways more flexible than direct download, they have disadvantages:
* APIs are much more costly and complex to create and maintain than direct download
* API queries are slow and limited in size because they run in real time in memory. Thus, for bulk access, e.g. of the entire dataset, direct download is much faster and more efficient (downloading a 1GB CSV directly is easy and takes seconds, but attempting to do so via the API may be very slow or even crash the server).
{/*
TODO: do more to compare and contrast download vs API access (e.g. what each is good for, formats, etc)
*/}
### Why Data APIs?
Data APIs underpin the following valuable functionality on the "read" side:
* **Data (pre)viewing**: reliably and richly (e.g. with querying, mapping etc). This makes the data much more accessible to non-technical users.
* **Visualization and analytics**: rich visualization and analytics may need a data API (because they need easily to query and aggregate parts of dataset).
* **Rich Data Exploration**: when exploring the data you will want to move through a dataset quickly, pulling only parts of the data and drilling down further as needed.
* **(Thin) Client applications**: with a data API third party users of the portal can build apps on top of the portal data easily and quickly (and without having to host the data themselves)
Corresponding job stories would be like:
* When building a visualization I want to select only some part of a dataset that I need for my visualization so that I can load the data quickly and efficiently.
* When building a Data Explorer or Data Driven app I want to slice/dice/aggregate my data (without downloading it myself) so that I can display that in my explorer / app.
On the write side they provide support for:
* **Rapidly updating data e.g. timeseries**: if you are updating a dataset every minute or every second you want an append operation and don't want to store the whole file every update just to add a single record
* **Datasets stored as structured data by default** and which can therefore be updated in part, a few records at a time, rather than all at once (as with blob storage)
## Domain Model
The functionality associated with the Data APIs can be divided into six areas:
* **Descriptor**: metadata describing and specifying the API e.g. general metadata e.g. name, title, description, schema, and permissions
* **Manager** for creating and editing APIs.
* API: for creating and editing Data API's descriptors (which triggers creation of storage and service endpoint)
* UI: for doing this manually
* **Service** (read): web API for accessing structured data (i.e. per record) with querying etc. *When we simply say "Data API" this is usually what we are talking about*
* Custom API & Complex functions: e.g. aggregations, join
* Tracking & Analytics: rate-limiting etc
* Write API: usually secondary because of its limited performance vs bulk loading
* Bulk export of query results, especially large ones (or even export of the whole dataset in the case where the data is stored directly in the DataStore rather than the FileStore). This is an increasingly important feature; it is a lower priority, but if required it is a substantive feature to implement.
* **Data Loader**: bulk loading data into the system that powers the data API. **This is covered in a [separate Data Load page](/docs/dms/load/).**
* Bulk Load: bulk import of individual data files
* Maybe includes some ETL => this takes us more into data factory
* **Storage (Structured)**: the underlying structured store for the data (and its layout), for example Postgres and its table structure. This could be considered a separate component that the Data API uses or as part of the Data API -- in some cases the store and API are completely wrapped together, e.g. ElasticSearch is both a store and a rich Web API.
>[!tip]Visualization is not part of the API but the demands of visualization are important in designing the system.
## Job Stories
### Read API
When I'm building a client application or extracting data I want to get data quickly and reliably via an API so that I can focus on building the app rather than managing the data
* Performance: Querying data is **quick**
* Filtering: I want to filter data easily so that I can get the slice of data that I need.
* ❗ unlimited query size for downloading eg, can download filtered data with millions of rows
* can get results in 3 formats: CSV, JSON and Excel.
* API formats
* "Restful" API (?)
* SQL API (?)
* ❗ GraphQL API (?)
* ❗ custom views/cubes (including pivoting)
* Query UI
:exclamation: = something not present atm
#### Retrieve records via an API with filtering (per resource) (if tabular?)
When I am building a web app, a rich visualization, a data display, etc. I want to have an API to the data (returning e.g. JSON, CSV) [in a resource] so that I can get precise chunks of data to use without having to download and store the whole dataset myself
* I want examples
* I want a playground interface …
#### Bulk Export
When I have a query with a large amount of results I want to be able to download all of those results so that I can analyse them with my own tools
#### Multiple Formats
When querying data via the API I want to be able to get the results in different formats (e.g. JSON, CSV, XML (?), ...) so that I can get it in a format most suitable for my client application or tool
#### Aggregate data (perform ops) via an API …
When querying data to use in a client application I want to be able to perform aggregations such as sum, group by etc so that I can get back summary data directly and efficiently (and don't have to compute myself or wait for large amounts of data)
#### SQL API
When querying the API as a Power User I want to use SQL so that I can do complex queries and operations and reuse my existing SQL knowledge
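In CKAN, for example, this is covered by the DataStore's `datastore_search_sql` action (see the CKAN v2 section below). A minimal sketch, with a placeholder resource id and an illustrative column name:
```sh
# Run an SQL query against a DataStore table (the table name is the resource id).
curl -G http://127.0.0.1:5000/api/3/action/datastore_search_sql \
     --data-urlencode 'sql=SELECT * FROM "{RESOURCE-ID}" WHERE "price" > 100 LIMIT 10'
```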
#### GeoData API
When querying a dataset with geo attributes such as location I want to be able use geo-oriented functionality e.g. find all items near X so that I can find the records I want by location
#### Free Text Query (Google Style / ElasticSearch Style)
When querying I want to do a google style search in data e.g. query for "brown" and find all rows with brown in them or do `brown road_name:*jfk*` and get all results with brown in them and whose field `road_name` has `jfk` in it so that I can provide a flexible query interface to my users
#### Custom Data API
As a Data Curator I want to create a custom API for one or more resources so that users can access my data in convenient ways …
* E.g. query by dataset or resource name rather than id ...
#### Search through all data (that is searchable) / Get Summary Info
As a Consumer I want to search across all the data in the Data Portal at once so that I can find the value I want quickly and easily … (??)
#### Search for variables used in datasets
As a Consumer (researcher/student …) I want to look for datasets with particular variables in them so that I can quickly locate the data I want for my work
* Search across the column names so that ??
#### Track Usage of my Data API
As a DataSet Owner I want to know how much my Data API is being used so that I can report that to stakeholders / be proud of that
#### Limit Usage of my Data API (and/or charge for it)
As a Sysadmin I want to limit usage of my Data API per user (and maybe charge above a certain level) so that I don't spend too much money
#### Restrict Access to my Data API
As a Publisher I want to only allow specific people to access data via the data API so that …
* Want this to mirror the same restrictions I have on the dataset / resources elsewhere (?)
### UI for Exploring Data
>[!warning]This probably is not a Data API epic -- rather it would come under the Data Explorer.
* I want an interface to “sql style” query data
* I want a filter interface into data
* I want to download filtered data
* ...
### Write API
When adding data I want to write new rows via the data API so that the new data is available via the API
* ? do we also want a way to do bulk additions?
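In CKAN, for example, this is provided by the DataStore's `datastore_upsert` action. A minimal sketch (the resource id, API key and column names are placeholders):
```sh
# Append new rows to an existing DataStore table.
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_upsert \
     -H "Authorization: {YOUR-API-KEY}" \
     -d '{ "resource_id": "{RESOURCE-ID}", "method": "insert", "records": [ {"a": 1, "b": "xyz"} ] }'
```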
### DataStore
When creating a Data API I want a structured data store (e.g. relational database) so that I can power the Data API and have it be fast, efficient and reliable.
## CKAN v2
In CKAN 2 the bulk of this functionality is in the core extension `ckanext-datastore`:
* https://docs.ckan.org/en/2.8/maintaining/datastore.html
* https://github.com/ckan/ckan/tree/master/ckanext/datastore
In summary: the underlying storage is provided by a Postgres database. A dataset resource is mapped to a table in Postgres. There are no relations between tables (no foreign keys). A read and write API is provided by a thin Python wrapper around Postgres. Bulk data loading is provided in separate extensions.
### Implementing the 4 Components
Here's how CKAN 2 implements the four components described above:
* Read API: provided by an API wrapper around Postgres, written as a CKAN extension in Python that runs in-process in the CKAN instance.
* Offers both classic Web API query and SQL queries.
* Full text, cross field search is provided via Postgres and creating an index concatenating across fields.
* Also includes a write API and functions to create tables
* DataStore: a dedicated Postgres database (separate to the main CKAN database) with one table per resource.
* Data Load: provided by either DataPusher (default) or XLoader. More details below.
* Utilize the CKAN jobs system to load data out of band
* Some reporting integrated into UI
* Supports tabular data (CSV or Excel) : this converts CSV or Excel into data that can be loaded into the Postgres DB.
* Bulk Export: you can bulk download via the extension using the dump functionality https://docs.ckan.org/en/2.8/maintaining/datastore.html#downloading-resources
* Note, however, that this can have problems with large resources, either timing out or hanging the server
### Read API
The CKAN DataStore extension provides an ad-hoc database for storage of structured data from CKAN resources.
See the DataStore extension: https://github.com/ckan/ckan/tree/master/ckanext/datastore
[Datastore API](https://docs.ckan.org/en/2.8/maintaining/datastore.html#the-datastore-api)
[Making Datastore API requests](https://docs.ckan.org/en/2.8/maintaining/datastore.html#making-a-datastore-api-request)
[Example: Create a DataStore table](https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create)
```sh
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
-H "Authorization: {YOUR-API-KEY}" \
-d '{ "resource": {"package_id": "{PACKAGE-ID}"}, "fields": [ {"id": "a"}, {"id": "b"} ] }'
```
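For completeness, the table created above can then be queried with `datastore_search` (a sketch; the resource id is a placeholder, and `q`/`limit` are just two of the supported parameters):
```sh
# Full-text search the resource and return at most 5 matching rows.
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_search \
     -H "Authorization: {YOUR-API-KEY}" \
     -d '{ "resource_id": "{RESOURCE-ID}", "q": "jones", "limit": 5 }'
```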
### Data Load
See [Load page](/docs/dms/load#ckan-v2).
### DataStore
Implemented as a separate Postgres Database.
https://docs.ckan.org/en/2.8/maintaining/datastore.html#setting-up-the-datastore
### What Issues are there?
Sharp Edges
* the connection between the MetaStore (main CKAN objects DB) and the DataStore is not always well maintained, e.g., if I call the "purge_dataset" action it will remove records from the MetaStore but it won't delete the corresponding table from the DataStore. This does not break the UX, but your DataStore DB grows in size and you might have junk tables with lots of data.
DataStore (Data API)
* One table per resource and no way to join across resources
* Indexes are auto-created and no way to customize per resource. This can lead to issues on loading large datasets.
* No API gateway (i.e. no way to control DDOSing, to do rate limiting etc)
* SQL queries do not work (with private datasets)
## CKAN v3
Following the general [next gen microservices approach][ng], the Data API is separated into distinct microservices.
[ng]: /docs/dms/ckan-v3/next-gen
### Read API
Approach: Refactor current DataStore API into a standalone microservice. Key point would be to break out permissioning. Either via a call out to separate permissioning service or a simple JWT approach where capability is baked in.
Status: In Progress (RFC) - see https://github.com/datopian/data-api
### Data Load
Implemented via AirCan. See [Load page](/docs/dms/load).
### Storage
Back onto Postgres by default just like CKAN 2. May also explore using other backends esp from Cloud Providers e.g. BigQuery or AWS RedShift etc.
* See Data API service https://github.com/datopian/data-api
* BigQuery: https://github.com/datopian/ckanext-datastore-bigquery


@ -1,282 +0,0 @@
---
sidebar: auto
---
# Data Explorer
The Datopian Data Explorer is a React single page application and framework for creating and displaying rich data explorers (think Tableau-lite). Use stand-alone or with CKAN. For CKAN it is a drop-in replacement for ReclineJS in CKAN Classic.
![Data Explorer](/static/img/docs/dms/data-explorer/data-explorer.png)
> [Data Explorer for the City of Montreal](http://montreal.ckan.io/ville-de-montreal/geobase-double#resource-G%C3%83%C2%A9obase%20double)
## Features / Highlights
"Data Explorer" is an embeddable React/Redux application that allows users to:
* Explore tabular, map, PDF, and other types of data
* Create map views of tabular data using the [Map Builder](#map-builder)
* Create charts and graphs of tabular data using [Chart Builder](#chart-builder)
* Easily build SQL queries for the DataStore API using the graphical interface of the [Datastore Query Builder](#datastore-query-builder)
## Components
The Data Explorer application acts as a coordinating layer and state management solution -- via [Redux](https://redux.js.org/) -- for several libraries, also maintained by Datopian.
### [Datapackage Views](https://github.com/datopian/datapackage-views-js)
![Datapackage Views](/static/img/docs/dms/data-explorer/datapackage-views.png)
Datapackage View is the rendering engine for the main window of the Data Explorer.
The image above shows a table rendered in the `Table` tab, but note that Datapackage Views renders _all_ data visualizations: tables, charts, maps, and others.
### [Datastore Query Builder](https://github.com/datopian/datastore-query-builder)
<img alt="Datastore Query Builder" src="/static/img/docs/dms/data-explorer/query-builder.png" width="250px" />
The Datastore Query Builder interfaces with the Datastore API to allow users to search data resources using an SQL like interface. See the docs for this module here - [Datastore Query Builder docs](/docs/dms/data-explorer/datastore-query-builder/).
### [Map Builder](https://github.com/datopian/map-builder)
<img alt="Map Builder" src="/static/img/docs/dms/data-explorer/map-builder.png" width="250px" />
Map Builder allows users to build maps based on geo-data contained in tabular resources.
Supported geo formats:
* lon / lat (separate columns)
### [Chart Builder](https://github.com/datopian/chart-builder)
<img alt="Chart Builder" src="/static/img/docs/dms/data-explorer/chart-builder.png" width="250px" />
Chart Builder allows users to create charts and graphs from tabular data.
## Quick-start (Sandbox)
* Clone the data explorer
```bash
$ git clone git@gitlab.com:datopian/data-explorer.git
```
* Use yarn to install the project dependencies
```bash
$ cd data-explorer
$ yarn
```
* To see the Data Explorer running in a sandbox environment run [Cosmos](https://github.com/react-cosmos/react-cosmos)
```bash
$ yarn cosmos
```
## Configuration
The [`data-datapackage` attribute](#add-data-explorer-tags-to-the-page-markup) influences how the element is displayed. It can be created from a [datapackage descriptor](https://frictionlessdata.io/specs/data-package/).
### Fixtures
Until we have better documentation on Data Explorer settings, use the [Cosmos fixtures](https://gitlab.com/datopian/data-explorer/blob/master/__fixtures__/with_widgets/geojson_simple.js) as an example of how to instantiate / configure the Data Explorer.
### Serialized state
`store->serializedState` is a representation of the application state _without fetched data_.
A Data Explorer can be "hydrated" from the serialized state: it will refetch the data and render in the same state it was exported in.
### Share links
Share links can be added in the `datapackage.resources[0].api` attribute.
There is a common limit of around 2000 characters on URL strings. Our share links contain the entire application store tree, which is often larger than 2000 characters; in that case the application state cannot be shared via URL. Them's the breaks.
## Translations
### Add a Translation To Data Explorer
To add a translation to a new language to the data explorer you need to:
1. clone the repository you need to update
```bash
git clone git@gitlab.com:datopian/data-explorer.git
```
2. go to `src/i18n/locales/` folder
3. add a new sub-folder with locale name and the new language json file (e.g. `src/i18n/locales/ru/translation.json`)
4. add the new file to resources settings in `i18n.js`:
`src/i18n/i18n.js`:
```javascript
import en from './locales/en/translation.json'
import da from './locales/da/translation.json'
import ru from './locales/ru/translation.json'
...
ru: {
translation: {
...require('./locales/ru/translation.json'),
...
}
},
...
```
5. create a merge request with the changes
### Add a translation To a Component
Some strings may come from a component; adding a translation for them requires some extra steps, e.g. for datapackage-views-js:
1. clone the repository
```bash
git clone https://github.com/datopian/datapackage-views-js.git
```
2. go to `src/i18n/locales/` folder
3. add a new sub-folder with locale name and the new language json file (e.g. `src/i18n/locales/ru/translation.json`)
4. add the new file to resources settings in `i18n.js`:
`src/i18n/i18n.js`:
```javascript
...
import ru from './locales/ru/translation.json'
...
resources: {
...
ru: {translation: ru},
},
...
```
5. create a pull request for datapackage-views-js
6. get the new datapackage-views-js version after merging (e.g. 1.3.0)
7. clone data-explorer
8. upgrade the data-explorer's datapackage-views-js dependency with the new version:
a. update package.json
b. run `yarn install`
9. add the component's translations path to Data Explorer:
```javascript
import en from './locales/en/translation.json'
import da from './locales/da/translation.json'
import ru from './locales/ru/translation.json'
...
ru: {
translation: {
...require('./locales/ru/translation.json'),
...require('datapackage-views-js/src/i18n/locales/ru/translation.json'),
}
},
...
```
10. create a merge request for data-explorer
### Testing a Newly Added Language
To see your language changes in Data Explorer you can run `yarn start` and change the language cookie of the page (`defaultLocale`):
![i18n Cookie](/static/img/docs/dms/data-explorer/i18n-cookie.png)
### Language detection
Language detection rules are determined by `detection` option in `src/i18n/i18n.js` file. Please edit with care, as other projects may already depend on them.
## Embedding in CKAN NG Theme
### Copy bundle files to theme's `public` directory
```bash
$ cp data-explorer/build/static/js/*.js frontend-v2/themes/your_theme/public/js
$ cp data-explorer/build/static/js/*.map frontend-v2/themes/your_theme/public/js
$ cp data-explorer/build/static/css/* frontend-v2/themes/your_theme/public/css
```
#### Note on app bundles
The bundled resources have a hash in the filename, for example `2.a3e71132.chunk.js`
During development it may be preferable to remove the hash from the file name to avoid having to update the script tag during iteration, for example
```bash
$ mv 2.a3e71132.chunk.js 2.chunk.js
```
A couple caveats:
* The `.map` file names should remain the same so that they are loaded properly
* Browser cache may need to be invalidated manually to ensure that the latest script is loaded
### Require Data Explorer resources in NG theme template
In `/themes/your-theme/views/your-template-with-explorer.html`:
```html
<!-- Everything before the content block goes here -->
{% block content %}
<!-- Data Explorer CSS -->
<link rel="stylesheet" type="text/css" href="/static/css/main.chunk.css">
<link rel="stylesheet" type="text/css" href="/static/css/2.chunk.css">
<!-- End Data Explorer CSS -->
```
### Configure datapackage
```html
<!-- where the datapackage is defined -->
<script>
  const datapackage = {
    resources: [{resource}], // single resource for this view
    views: [...], // can be 3 views aka widgets
    controls: {
      showChartBuilder: true,
      showMapBuilder: true
    }
  }
</script>
```
### Add data-explorer tags to the page markup
Each Data Explorer instance needs a corresponding `<div>` in the DOM. For example:
```html
{% for resource in dataset.resources %}
<div class="data-explorer" id="data-explorer-{{ loop.index - 1 }}" data-datapackage='{{ dataset.dataExplorers[loop.index - 1] | safe}}'></div>
{% endfor %}
```
Note that each container div needs the following attributes:
* `class="data-explorer"` (All explorer divs should have this class)
* `id="data-explorer-0"` (1, 2, etc...)
* `data-datapackage='{JSON CONFIG}'` (a valid JSON configuration)
### Add data explorer scripts to your template
```html
<script type="text/javascript" src="/static/js/runtime~main.js"></script>
<script type="text/javascript" src="/static/js/2.chunk.js"></script>
<script type="text/javascript" src="/static/js/main.chunk.js"></script>
```
*NOTE*: the scripts should be loaded _after_ the container divs are in the DOM, typically by placing the `<script>` tags at the bottom of the footer.
See [a real-world example here](https://gitlab.com/datopian/clients/ckan-montreal/blob/master/views/showcase.html)
## New builds
In order to build files for production, run `npm run build` or `yarn build`.
You need **Node version >= 12** in order to build the files; otherwise a 'heap out of memory' error is thrown.
### Component changes
If the changes involve component updates that live in separate repositories make sure to upgrade them too before building:
1. Prepare the component with a dist version (e.g. run `yarn build:package` in the component repo, see [this](/docs/dms/data-explorer/datastore-query-builder#release) for an example)
2. run `yarn add <package>` to get latest changes, e.g. `yarn add @datopian/datastore-query-builder` (do not use `yarn upgrade`, see here on why https://github.com/datopian/data-explorer/issues/28#issuecomment-700792966)
3. you can verify changes in `yarn.lock` - there should be the latest component commit id
4. `yarn build` in data-explorer
### Testing not yet released component changes
If there are changes to test that are not yet ready to be released in a component, the best option is to use Cosmos directly in the component repository. If that is not enough, you can temporarily add the dependency from a branch:
```
yarn add https://github.com/datopian/datastore-query-builder.git#<branch name>
```
## Appendix: Design
See [Data Explorer Design page &raquo;](/docs/dms/data-explorer/design/)


@ -1,109 +0,0 @@
---
sidebar: auto
---
# Datastore Query Builder
This project was bootstrapped with [Create React App](https://github.com/facebook/create-react-app).
The code repository is located at github - https://github.com/datopian/datastore-query-builder.
## Usage
Install it:
```
yarn add @datopian/datastore-query-builder
```
Basic usage in a React app:
```JavaScript
import React from 'react'
import { QueryBuilder } from 'datastore-query-builder'
export const MyComponent = ({ resource, action }) => {
// `resource` is a resource descriptor that must have 'name', 'id' and
// 'schema' properties.
// `action` - this should be a Redux action that expects back the resource
// descriptor with updated 'api' property. It is up to your app to fetch data.
return (
<QueryBuilder resource={resource} filterBuilderAction={action} />
)
}
```
Note that this component doesn't fetch any data - it only builds the API URI based on the user's selection.
It's easiest to learn from the examples provided in the `/__fixtures__/` directory.
## Features
* Date Picker - if the resource descriptor has a field with `date` type it will be displayed as a date picker element:
![Date Picker](/static/img/docs/dms/data-explorer/date-picker.png)
## Available Scripts
In the project directory, you can run:
### `yarn cosmos` or `npm run cosmos`
Runs dev server with the fixtures from `__fixtures__` directory. Learn more about `cosmos` - https://github.com/react-cosmos/react-cosmos
### `yarn start` or `npm start`
Runs the app in the development mode.<br/>
Open [http://localhost:3000](http://localhost:3000) to view it in the browser.
The page will reload if you make edits.<br/>
You will also see any lint errors in the console.
### `yarn test` or `npm test`
Launches the test runner in the interactive watch mode.<br/>
See the section about [running tests](https://facebook.github.io/create-react-app/docs/running-tests) for more information.
### `yarn build:package` or `npm run build:package`
Run this to compile your code so it is installable via yarn/npm.
### `yarn build` or `npm run build`
Builds the app for production to the `build` folder.<br/>
It correctly bundles React in production mode and optimizes the build for the best performance.
The build is minified and the filenames include the hashes.<br/>
Your app is ready to be deployed!
See the section about [deployment](https://facebook.github.io/create-react-app/docs/deployment) for more information.
## Release
When releasing a new version of this module, please make sure you've built the compiled version of the files:
```bash
yarn build:package
# Since this a release, you need to change version number in package.json file.
# E.g., this is a patch release so my `0.3.6` will become `0.3.7`.
# Now commit the changes
git add dist/ package.json
git commit -m "[v0.3.7]: your commit message."
```
Next, you need to tag your commit and add some descriptive message about the release:
```bash
git tag -a v0.3.7 -m "Your release message."
```
Now you can push your commits and tags:
```bash
git push origin branch && git push origin branch --tags
```
The tag will initiate a Github action that will publish the release to NPM.


@ -1,145 +0,0 @@
# Data Explorer Design
>[!note]
Design sketches from Aug 2019. This remains a work in progress though a good part was implemented in the new [Data Explorer](/docs/dms/data-explorer).
## Job Stories
[Preview] As a Data Consumer I want to have a sense of what data there is in a dataset's resources before I download it (or download an individual resource) so that I don't waste my time and get interested
[Preview] As a Data Consumer I want to view (the most important contents of) a resource without downloading it and opening it so I save time (and don't have to get specialist tools)
[Preview - with tweaks] As a Data Consumer I want to be able to display tabular data with geo info on a map so that I can see it in an easily comprehensible way
[Explorer] As a Viewer I want to explore (filter, facet?) a dataset so I can find the data I'm looking for ...
[Explorer - map] As a Viewer I want to filter down the data I display on the map so that I can see the data I want
[Map / Dash Creator] As a Publisher I want to create a custom map or dashboard so that I can display my data to viewers powerfully
[View the data] As a User, I want to see my city related data (eg, crime, road accidents) on the map so that:
* I can easily understand which area is safe for me.
* I can evaluate different neighbourhoods when planning a move.
As a User from city council, I want to see my city related data (eg, traffic) on the map so that I can take better actions to improve the city (make it safe for citizens).
> is this self-service created, a custom map made by publisher, an auto-generated map (e.g. preview)
[Data Explorer] As a Power User I want to do SQL queries on the datastore so that I can display / download the results and get insight without having to download into my own tool and do that wrangling
## Architecture
```mermaid
graph LR
subgraph "Filter UI"
simpleselectui[Filter by columns explicitly]
sqlfilterui[SQL UI]
richselectui[Filter and Group by etc in a UI]
end
subgraph Renderers
tableview[Table Renderer]
chartview[Chart Renderer]
mapview[Map Renderer]
end
subgraph Builders
datasetselector[Select datasets to use, potentially with combination]
chartbuilder[Chart Builder - UI to create a chart]
mapbuilder[Map Builder]
end
subgraph APIs
queryapi[Abstract Query API implemented by others]
datastoreapijs[DataStore API wrapper - returns a Data Package with cached data and query as url?]
datajs[Data Package - Data in Memory: Dataset and Table objects]
datajsquery[Query Wrapper Around Dataset with cached data in memory]
end
classDef todo fill:#f9f,stroke:#333,stroke-width:1px
classDef working fill:#00ff00,stroke:#333,stroke-width:1px
class chartbuilder todo;
class chartview,tableview,mapview,simpleselectui working;
```
Filter UI updates Redux Store using a one-way data binding as the ONLY way to modify application state or component state (except internal state of components as needed):
```mermaid
graph TD
FilterUI_Update --> ReduxACTION:UpdateFilters
ReduxACTION:UpdateFilters --> RefetchData
ReduxACTION:UpdateFilters --> updateUIState
RefetchData --store.workingData--> UpdateStore
updateUIState --store.uiState--> UpdateStore
UpdateStore --> RerenderApp
```
## Interfaces to define
```
dataset => data package
query[Query - source data package + cached data + filter state]
workingdataset[Working Dataset in Memory]
chartconfig[]
mapconfig[]
```
### Redux store / top level state
```javascript=
queryuistate: {
// url / data package rarely changes during lifetime of explorer usually
url: datastore url / or an original data package
filters: ...
sqlstatement:
}
// list of datasets / resources we are working with ...
datasets/resources: [
]
layout: [
// this is the switcher layout where you only see one widget at a time
layouttype: chooser; // chooserr aka singleton, stacked, custom ...
views: [list of views in their order]
]
views: [
{
type:
resource:
char
}
]
```
## Research
### Our One
![](https://i.imgur.com/XAdHq26.jpg)
### Redash
![](https://i.imgur.com/6JssnLA.png)
### Metabase
https://github.com/metabase/metabase
![](https://i.imgur.com/bOjIKdE.png)
### CKAN Classic
![](https://i.imgur.com/tGdupkz.png)
![](https://i.imgur.com/fDtjGSk.png)
### Rufus' Data Explorer (2014)
![](https://i.imgur.com/XJMHRes.png)


@ -1,18 +0,0 @@
# Data Lake
A data lake is a repository -- typically a large one -- for storing data of many types. Data lakes are more flexible (less structured) than their predecessors, Data Warehouses. At their crudest they are little more than raw storage with an organizational structure plus, maybe, a catalog. At their most sophisticated they can become an entire data management infrastructure.
The flexibility of the data lake concept is both its advantage and its limitation: almost any data architecture that includes collecting organizational data together could be described as a data lake.
At a practical level, the flexibility can become a limitation in that **data lakes become data swamps**: the lack of structure often limits the usability of the lake: data cannot be found or is not of adequate quality. As ThoughtWorks notes: "Many enterprises failed to generate a return on their investment because they had quality issues with the data in their lakes or had invested significant sums in creating their lakes before identifying use cases."[^1]
[^1]: https://www.thoughtworks.com/decoder/data-lake
## Schematic overview of a Data Lake Architecture
<img src="https://docs.google.com/drawings/d/e/2PACX-1vThZmi5ok8VNaM03Vj5RQHJRQiZJIkrxaU08vpG_T_kcElFQDCO7bZVO1FJzcpR2X8wfKZVWdWXpLUz/pub?w=1159&amp;h=484" />
## References
* https://www.thoughtworks.com/decoder/data-lake
* https://martinfowler.com/articles/data-monolith-to-mesh.html


@ -1,287 +0,0 @@
# Data Portals
> *Data Portals have become essential tools in unlocking the value of data for organizations and enterprises ranging from the US government to Fortune 500 pharma companies, from non-profits to startups. They provide a convenient point of truth for discovery and use of an organization's data assets. Read on to find out more.*
## Introduction: Data Portals are Gateways to Data
A Data Portal is a gateway to data. That gateway can be big or small, open or restricted. For example, data.gov is open to everyone, whilst an enterprise "intra" data portal is restricted to that enterprise (and perhaps even to certain people within it).
A Data Portal's core purpose is to enable the rapid discovery and use of data. However, as a flexible, central point of truth on an organization's data assets, a Data Portal can become essential data infrastructure and be extended or integrated to provide many additional features:
* Data storage and APIs
* Data visualization and exploration
* Data validation and schemas
* Orchestration and integration of data
* Data Lake coordination and organization
The rise of Data Portals reflects the rapid growth in the volume and variety of data that organizations hold and use. With so much data available internally (and externally), it is hard for users to discover and access the data they need. And with so many potential users and use-cases it is hard to anticipate what data will be needed, when.
Concretely: how does Jane in the new data science team know that Geoff in accounting has the spreadsheet she needs for her analysis for the COO? Moreover, it is not enough just to have a dataset's location: if users are to easily discover and access data, it has to be suitably organized and presented.
Data portals answer this need: by making it easy to find and access data, a data portal helps solve these problems. As a result, data portals have become essential tools for organizations to bring order to the "data swamp" and unlock the value of their data assets.[^1]
[^1]: The nature of the problem that Data Portals solve (i.e. bringing order to diverse, distributed data assets) explains why data portals first arose in Government and as *open* data portals. Government had lots of useful data, much of it shareable, but poorly organized and strewn all over the place. In addition, much of the value of that data lay in unexpected or unforeseen uses. Thus, Data Portals in their modern form started in Government in the mid-late 2000s. They then spread into large companies and then with the spread of data into all kinds of organizations big and small.
## Why Data Portals?
### Data Variety and Volume have Grown Enormously
The volume and variety of data available has grown enormously. Today, even small organizations have dozens of data assets ranging from spreadsheets in their cloud drive to web analytics. Meanwhile, large organizations can have an enormous -- and bewildering -- amount and variety of data ranging from Hadoop clusters and data warehouses to CRM systems plus, of course, plenty of internal spreadsheets, databases etc.
In addition to this diversity of *supply* there has been a huge growth in the potential and diversity of *demand* in the form of users and use cases. Self-service business intelligence, machine learning and even tools like google spreadsheets have democratized and expanded the range of users. Meanwhile, data is no longer limited to a single purpose: much of the *new* value for data for enterprises comes from unexpected or unplanned uses and from combining data across divisions and systems.
### This Creates a Challenge: Getting Lost in Data
As organizations seek to reap the benefits of this data cornucopia they face a problem: with so much data around it's easy to get lost -- or simply not know that the data even exists. And, as supply and demand have expanded and diversified, it has become both harder and more important to match them up.
The addition of data integration and data engineering can actually make this problem even worse -- do we need to create this new dataset X from Y and Z, or do we already have that somewhere? And how can people find X once we have created it? Is X a finished dataset that people can rely on, or is it a one-off? Even if it is a one-off, do we want to record that we created this kind of dataset so we can create it again in the future if we need it?[^lakes]
### Data Portals are a Platform that Connect Supply and Demand for Data
By making it easy to find and access data, a data portal helps address all these problems. As a platform it connects creators and users of data in a single place. As a single source of base metadata it provides essential infrastructure for data integration. By acting as a central repository of data it enables new forms of publication and sharing. Data Portals therefore play a central role in unlocking the value of data for organizations.
[^lakes]: Ditto for data lakes: the growth of data lakes has made data portals (and metadata management) even more important, because without them your data lake quickly turns into a data swamp where data is hard to locate and, even if found, lacks essential metadata and structure that would make it usable.
### Data Portals as the First Step in an Agile Data Strategy
Data portals also play an initial, concrete step in data engineering / data strategy. Suppose you are a newly arrived CDO.
The first questions you will be asking are things like: what data do we have, where is it, what state is it in? (And secondarily, what data use cases do we have? Who has them? Do they match against the data we have?).
This immediately leads to the need to do a data inventory. And for a data inventory you need a tool to hold (and structure) the results = a data portal.
```mermaid
graph TD
cdo[Chief Data Officer]
what[What data do we have?]
inventory["We need a data inventory"]
portal[We need a data portal / catalog]
cdo --> what
what --> inventory
inventory --> portal
```
Even in more sophisticated situations, a data portal is a great place to start. Suppose you are a newly arrived CDO at an organization with an existing data lake and a rich set of data integration and data analytics workflows.
There is a good chance your data lake is rapidly becoming a data swamp and there is nothing to track dependencies and connections in those data and analytics pipelines. Again a simple data portal is a great place to start in bringing some order to this: lightweight, vendor-independent and (if you choose CKAN) open source infrastructure that gives you a simple solution for collecting and tracking metadata across datasets and data workflows.
### Summary: Data Portals make Data Discoverable and Accessible and provide a Foundation for Integration
In summary, Data Portals deliver value in three distinct, interlocking and incremental ways by:
* Making data discoverable: from Excel files to Hadoop clusters. The portal does this by providing the infrastructure *and* process for reliable, cross-system metadata management and access (by humans *and* machines)
* Making data accessible: whether it's an Excel file or a database cluster, the portal's combination of common metadata, data showcases and data APIs makes data easily and quickly accessible to technical and non-technical users. Data can now be previewed and even explored in one location prior to use.
* Making data reliable and integrable: as the central store of metadata and data access points, the data portal is a natural starting point for enriching data with data dictionaries (what does column `custid` mean?), data mappings (this column in this data file is a customer ID and the customer master data lives in this other dataset), and data validation (does this column of dates contain valid dates, and are any of them out of range? -- see the sketch below)
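To make the data-validation point concrete, here is a minimal sketch (not tied to any particular portal product) of the kind of check a portal could run on a dates column; the file and column names are invented for illustration:

```python
import pandas as pd

# Load a (hypothetical) resource that the catalog points at
df = pd.read_csv("orders.csv")

# Which values in the `order_date` column are not parseable as dates?
dates = pd.to_datetime(df["order_date"], errors="coerce")
unparseable = df[dates.isna()]

# Which parsed dates fall outside the expected range?
out_of_range = df[(dates < "2000-01-01") | (dates > pd.Timestamp.today())]

print(f"{len(unparseable)} unparseable dates, {len(out_of_range)} out-of-range dates")
```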
In addition, in terms of proper data infrastructure and data engineering, a Data Portal provides an initial starting point, simple scaffolding and a solid foundation. It is an organizing point and Rosetta Stone for data discovery and metadata.
* TODO: this really links into the story of how to start doing data engineering / building a data lake / doing agile data etc etc
* For example, suppose you want to do some non-trivial business intelligence. You'll need a list of the datasets you'll need -- maybe sales, plus analytics, plus some public economic data. Where are you going to track those datasets? Where are you going to track the resulting datasets you produce?
* For example, suppose your data engineering team are building out data pipelines. These pipelines pull in a variety of datasets, integrate and transform them and then save the results. How are they going to track what datasets they are using and what they have produced? They are going to need a catalog. Rather than inventing their own (the classic "JSON file in git or spreadsheet in Google Docs"), you want them to use a proper catalog (or integrate with your existing one).
* Using an open source, service-oriented data portal framework like CKAN you can rapidly integrate and scale out your data orchestration. It provides a "small pieces, loosely joined" approach to developing your data infrastructure, starting from the basics: what datasets do you have, and what datasets do you want to create? (See the sketch below.)
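As a sketch of what "use a proper catalog" looks like in practice, the snippet below queries a CKAN catalog via its action API (`package_search` is a standard CKAN action); the instance URL and query string are placeholders:

```python
import requests

CKAN_URL = "https://demo.ckan.org"  # placeholder: your CKAN instance

# Free-text search of the catalog for candidate datasets
resp = requests.get(
    f"{CKAN_URL}/api/3/action/package_search",
    params={"q": "sales", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} matching datasets")
for dataset in result["results"]:
    print("-", dataset["name"], ":", dataset.get("title"))
```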
## What does a Data Portal do?
### A Data Portal provides a Catalog
In its most basic essence, a Data Portal is a catalog of datasets. Even here there are degrees: at its simplest a catalog is just a list of dataset names and links; whilst more sophisticated catalogs will have elaborate metadata on each dataset.
### And Much More ...
Along with the essential basic catalog features, modern portals now incorporate an extensive range of functionality for organizing, structuring and presenting data including:
* **Publication workflow and metadata management**: rich editing interfaces and workflows (for example, approval steps), bulk editing of metadata etc
* **Showcasing and presentation of datasets**: extending to interactive exploration. For example, if a dataset contains an Excel file, then in addition to linking to that file the portal will also display the contents in a table, allow users to create visualizations, and even let them search and explore the data
* **Data storage and APIs**: as well as cataloging metadata and linking to data stored elsewhere, data portals can also store data. Building off this, data portals can provide "data APIs" to the underlying data to complement direct access and download. These APIs make it much quicker and easier for users to build their own rich applications and analytics workflows (see the sketch after this list).
* **Permissions**: fine-grained permissions to control access to data and related materials.
* **Data ingest and transformation:** ETL style functionality e.g. for bulk harvesting metadata, or preparing or excerpting data for presentation (for example, loading data to power data APIs)
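To illustrate the "data API" bullet above: CKAN, for example, exposes a `datastore_search` action that returns rows of a tabular resource as JSON, so users can pull data straight into scripts without downloading files. A minimal sketch (the instance URL and resource id are placeholders):

```python
import requests

CKAN_URL = "https://demo.ckan.org"                      # placeholder instance
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"    # placeholder resource id

# Fetch the first five rows of a tabular resource via the Data API
resp = requests.get(
    f"{CKAN_URL}/api/3/action/datastore_search",
    params={"resource_id": RESOURCE_ID, "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["result"]["records"]:
    print(row)
```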
Moreover, as a flexible, central point of truth on an organization's data assets, a Data Portal can become the foundation for broader data infrastructure and data management, for example:
* Orchestration of data integration: as a central repository of metadata, data portals are perfectly placed to integrate with wider data engineering and ETL workflows
* Data quality and provenance tracking
* Data Lake coordination and organization
## What are the main features of a Data Portal?
Here we focus on "functional" features rather than purely technical ones.
Each functional feature may require one or more of these technical components:
* Storage
* API
* Frontend
* Admin UI (WebUI, possibly CLI, Mobile etc)
### High Level Overview
```mermaid
graph LR
perdataset --> storage[Store Data]
perdataset --> metadata[Store Metadata]
perdataset --> versioning
perdataset --> events[Notifications of Life Cycle events]
perdataset --> basic[Basic Access Control]
permissions --> auth[Identify]
permissions --> authz[Authorization]
permissions --> permintr[Permissions Integration]
hub --> showcase["(Pre)Viewing the Dataset"]
hub --> discovery[Discovery]
hub --> orgs[Users, Teams and Ownership]
hub --> tags[Tags, Themes]
hub --> audit[Audit and Notifications]
integration[Data Integration] --> pipelines
integration --> harvesting
```
### Coggle Detailed Overview
https://coggle.it/diagram/Xiw2ZmYss-ddJVuK/t/data-portal-feature-breakdown
<iframe width='853' height='480' src='https://embed.coggle.it/diagram/Xiw2ZmYss-ddJVuK/b24d6f959c3718688fed2a5883f47d33f9bcff1478a0f3faf9e36961ac0b862f' frameborder='0' allowfullscreen></iframe>
### Detailed Feature Breakdown
```mermaid
graph LR
dms[Basics]
dmsplus["Plus"]
cms[CMS]
theming[Theming]
permissions[Permissions]
datastore[Data API]
monitoring[Monitoring]
usage[Usage Analytics]
harvesting[Harvesting]
etl[ETL]
blog[Blog]
contact[Contact Page]
help[Support]
newsletter[Newsletter]
metadata[Metadata]
showcase[Showcase]
activity[Activity Streams]
search[Data Search]
catalogsearch[Catalog Search]
multi[Multi-language metadata]
resource[Resource previews]
xloader[xLoader]
datapusher[Data Pusher]
revision[Revisioning]
explorer[Data Explorer]
datavalidation[Data Validation]
filestore[FileStore]
siteadmin[Site Admin]
dms --> metadata
dms --> activity
dms --> catalogsearch
dms --> showcase
dms --> resource
dms --> multi
dms --> filestore
dms --> theming
dms --> i18n
dms --> siteadmin
dmsplus --> permissions
dmsplus --> revision
dmsplus --> datastore
dmsplus --> monitoring
dmsplus --> usage
dmsplus --> search
dmsplus --> explorer
cms --> blog
cms --> contact
cms --> help
cms --> newsletter
etl --> datapusher
etl --> xloader
etl --> harvesting
etl --> datavalidation
```
* Theming - customizing the look and feel of the portal
* i18n
* CMS - e.g., news/ideas/about/docs. Learn about CMS options - [CMS](/docs/dms/data-portals/cms).
* Blog
* Contact page?
* Help / Support / Chat
* Newsletter
* DMS Basic - Catalog: manage/catalog multiple formats of data
* Activity Streams
* Data Showcase (aka Dataset view page) -
* Resource previews
* Metadata creation and management
* Multi-language metadata
* Data import and storage
* Storing data
* Data Catalog searching
* Data searching
* Multiple Formats of data
* Tagging and Grouping of Datasets
* Organization as publishers and teams
* DMS Plus
* Permissions: identity, authentication, accounts and authorization (including "teams/groups/orgs")
* Revisioning of data and metadata
* DataStore and Data API: ....
* Monitoring: who is doing what, audit log etc
* Usage Analytics: e.g. number of views, amount of downloads, recent activity
* ETL: automated metadata and data import and processing (e.g. to data store), data transformation ...
* Harvesting: metadata and data harvesting
* DataPusher
* xLoader
* (Data) Explorer: Visualizations and Dashboards
* Data Validation
* DevOps
  * CKAN Cloud: multi-instance deployment and management
  * Backups / Disaster recovery
Not sure these merit an item ...
* Cross Platform
* Data Sharing: A place to store data, with a permanent link to share to the public.
* Discussions
* RSS
* Multi-file download
## CKAN the Data Portal Software
CKAN is the leading data portal software.
It is both usable out of the box and can also be utilized as a powerful framework for creating tailored solutions.
CKAN's combination of open source codebase and enterprise support makes it uniquely attractive for organizations looking to build customized, enterprise-grade solutions.
## Appendix
TODO: From Data Portal to DataHub (or Data Management System).
### Is a Data Catalog the same as a Data Portal? (Yes)
Is a data catalog the same as a data portal? Yes. Data Portals are the evolution of data catalogs.
Data Portals were originally called a variety of names, including "Data Catalog". As catalogs grew in features they evolved into full portals.
### Open Data Portals and Internal Data Portals
Many initial data portals were "open" or public: that is, anyone could access them -- and the data they listed. This reflected the fact that these data portals were set up by governments seeking to maximize the value of their data by sharing it as widely as possible.
However, there is no reason a data portal need be "open". In fact, data portals internal to an enterprise are usually restricted to the organization or even specific teams within the enterprise.

View File

@ -1,158 +0,0 @@
# DataFrame
Designing a dataframe.js - and understanding data libs and data models in general.
TODO: integrate https://github.com/datopian/dataframe.js - my initial review from ~ 2015 onwards.
## Introduction
Conceptually a data library consists of:
* A data model i.e. a set of classes for holding / describing data e.g. Series (Vector/1d array), DataFrame (Table/2d array) (and possibly higher dim arrays)
* Tooling
* Operations e.g. group by, query, pivot etc etc
* Import / Export: load from csv, sql, stata etc etc
## Our need
We need to build tools for wrangling and presenting data ... that are ...
* focused on smallish data
* run in the browser and/or are lightweight/easy to install
Why? Because ...
* We want to build easy-to-use / easy-to-install applications for non-developers (so they aren't going to use pandas or a Jupyter notebook, PLUS they want a UI, PLUS it's probably not big data (or if it is we can work with a sample))
* We're often using these tools in web applications (or in e.g. desktop app like electron)
Discussion
* Could we not have the browser act as a thin client and push code to some backend ...? Yes, we could, but that means a whole other service ...
What we want: something like OpenRefine but running in the browser ...
### Why not just use R / Pandas
Context: R, Pandas are already awesome. In fact, super-awesome. And they have huge existing communities and ecosystems.
Furthermore, not only do they do data analysis (which is what all the data science folks are using them for) but they are also pretty good for data wrangling (especially pandas).
So, we'd heavily recommend these (especially pandas) if you are a developer (and doing work on your local machine).
However, ...
* if you're not a developer they can be daunting (even wrapped up in a Jupyter notebook).
* if you are a developer and actually doing data engineering there are some issues:
  * pandas is a "kitchen-sink" of a library and depends on numpy. This makes it a heavyweight dependency and harder to put into data pipelines and flows
  * their monolithic nature makes them hard to componentize ...
## Pandas
### Series
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#series
> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
>
> ``>>> s = pd.Series(data, index=index)``
* Series is a 1-d array with the convenience of labelling each cell in the array with the index (which defaults to 0...n if not specified).
* This allows you to treat Series as an array *and* a dictionary
* You can give it a name: "Series can also have a name attribute: `s = pd.Series(np.random.randn(5), name='something')`" (see the example below)
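The example below (standard pandas, added here for illustration) shows the array-plus-dictionary behaviour and the name attribute described above:

```python
import pandas as pd

# A labelled 1-d array: values plus an index
s = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], name="something")

print(s.iloc[0])  # positional access, like an array  -> 1.5
print(s["b"])     # label access, like a dictionary   -> 2.0
print(s.name)     # the optional name attribute       -> 'something'
```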
### DataFrame
https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html
> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:
### Higher dimensional arrays
Not supported. See xarray.
## XArray
Comment: mature and well thought out. Exists to generalize pandas to higher levels.
http://xarray.pydata.org/en/stable/ => multidimensional arrays in pandas
> xarray has two core data structures, which build upon and extend the core strengths of NumPy and pandas. Both data structures are fundamentally N-dimensional:
>
> DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez's datarray project, which prototyped a similar data structure.
>
> Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
(Personally not sure about the analogy: Dataset is like a collection of series *or* DataFrames)
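A minimal illustration of the two structures (standard xarray usage; the dimension names and values are arbitrary):

```python
import numpy as np
import xarray as xr

# DataArray: a labelled N-d array (here 2-d), generalizing pandas.Series
temps = xr.DataArray(
    np.random.randn(3, 2),
    dims=("time", "city"),
    coords={"time": [0, 1, 2], "city": ["Berlin", "London"]},
)

# Dataset: a dict-like container of DataArrays sharing dimensions
ds = xr.Dataset({"temperature": temps})
print(ds["temperature"].sel(city="London"))
```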
## NTS
* Pandas 2: https://dev.pandas.io/pandas2/ - https://github.com/pandas-dev/pandas2 (from 2017 for pandas2)
* Pandas 2 never happened https://github.com/pandas-dev/pandas2 (stalled in 2017 ...). May happen in 2021 according to this milestone for it https://github.com/pandas-dev/pandas/milestone/42
## Inbox
* Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second. 3.2 ⭐
* https://ray.io/ - distributed computing in python (??)
* Seems to be an alternative / competitor to dask but more general (dask is very oriented to scaling pandas-style stuff)
* https://modin.readthedocs.io/en/latest/ - a way to convert pandas to run in parallel "Scale your pandas workflow by changing a single line of code"
* https://github.com/reubano/meza - meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types.
* Quite a few similarities to frictionless data type stuff
* Mainly active 2015-2017 afaict and last commit in 2018
* https://github.com/atlanhq/camelot PDF Table Extraction for Humans
* http://blaze.pydata.org/ - seems inactive since 2016 (according to blog) and github repos look quiet since ~ 2016
* Datashape - https://datashape.readthedocs.io/en/latest/overview.html - Datashape is a data layout language for array programming. It is designed to describe in-situ structured data without requiring transformation into a canonical form.
* Dask: https://dask.org - "Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love". Was part of Blaze and now split out as a separate project. this is still *very* active (in fact main maintainer formed a consulting company for this in 2020)
* https://github.com/dask/dask "Parallel computing with task scheduling"
* odo - https://odo.readthedocs.io/en/latest/ - https://github.com/blaze/odo - Odo: Shapeshifting for your data
> odo takes two arguments, a source and a target for a data transfer.
>
> ```
> >>> from odo import odo
> >>> odo(source, target) # load source into target
> ```
>
> It efficiently migrates data from the source to the target through a network of conversions.
### Blaze
The Blaze ecosystem is a set of libraries that help users store, describe, query and process data. It is composed of the following core projects:
* Blaze: An interface to query data on different storage systems
* Dask: Parallel computing through task scheduling and blocked algorithms
* Datashape: A data description language
* DyND: A C++ library for dynamic, multidimensional arrays
* Odo: Data migration between different storage systems
## Appendix: JS "DataFrame" Libraries
A list of existing libraries.
*Note: when we started research on this in 2015 there were none that we could find, so it is a good sign that these are now appearing.*
* https://github.com/opensource9ja/danfojs 274⭐ - ACTIVE Last update Aug 2020
* https://github.com/StratoDem/pandas-js 280⭐ - INACTIVE last update sep 2017
* https://github.com/fredrick/gauss 428⭐ - INACTIVE last update 2015 - JavaScript statistics, analytics, and data library - Node.js and web browser ready
* https://github.com/Gmousse/dataframe-js 283⭐ - INACTIVE? started in 2016 and largely inactive since 2018 (though minor update in early 2019)
* dataframe-js provides another way to work with data in javascript (browser or server side) by using DataFrame, a data structure already used in some languages (Spark, Python, R, ...).
* Comment: support browser and node etc. Pretty well structured. A long way from Pandas still.
* https://github.com/osdat/jsdataframe 26⭐ - INACTIVE started in 2016 and not much activity since 2017. Seems fairly R oriented (e.g. melt)
* Jsdataframe is a JavaScript data wrangling library inspired by data frame functionality in R and Python Pandas. Vector and data frame constructs are implemented to provide vectorized operations, summarization methods, subset selection/modification, sorting, grouped split-apply-combine operations, database-style joins, reshaping/pivoting, JSON serialization, and more. It is hoped that users of R and Python Pandas will find the concepts in jsdataframe quite familiar.
* https://github.com/maxto/ubique 91⭐ - ABANDONED last update in 2015 and stated as discontinued. A mathematical and quantitative library for Javascript and Node.js
* https://github.com/misoproject/dataset 1.2k⭐️ - now abandonware as no development since 2014, site is down (and maintainers seem unresponsive) (was a nice project!)
Other ones (not very active or without much info):
* https://github.com/walnutgeek/wdf - 1⭐ "web data frame" last commit in 2014 http://walnutgeek.github.io/wdf/DataFrame.html
* https://github.com/cjroth/dataframes
* https://github.com/jpoles1/dataframe.js
* https://github.com/danrobinson/dataframes
### References
* https://stackoverflow.com/questions/30610675/python-pandas-equivalent-in-javascript/43825646 (has a community wiki section)
* https://www.man.com/maninstitute/short-review-of-dataframes-in-javascript (2018) - pretty good review in June 2018. As it points out, there is no clear solution.

View File

@ -1,11 +0,0 @@
# DataHub Documentation
Welcome to the DataHub documentation.
DataHub is a platform for *people* to **store, share and publish** their data, **collect, inspect and process** it with **powerful tools**, and **discover and use** data shared by others.
Our focus is on data wranglers and data scientists: those who automate their work with data using code and command-line tools rather than editing it by hand (as, for example, many analysts do in Excel). Think people who use Python vs people who use Excel for data work.
Our goal is to provide simplicity *and* power.
[Developer Docs &raquo;](/docs/dms/datahub/developers) <3 Python, JavaScript and data pipelines? Start here!

View File

@ -1,99 +0,0 @@
# Developers
This section of the DataHub documentation is for developers. Here you can learn about the design of the platform and how to get DataHub running locally or on your own servers, and the process for contributing enhancements and bug fixes to the code.
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)
## Internal docs
* [API](/docs/dms/datahub/developers/api)
* [Deploy](/docs/dms/datahub/developers/deploy)
* [Platform](/docs/dms/datahub/developers/platform)
* [Publish](/docs/dms/datahub/developers/publish)
* [User Stories](/docs/dms/datahub/developers/user-stories)
* [Views Research](/docs/dms/datahub/developers/views-research)
* [Views](/docs/dms/datahub/developers/views)
## Repositories
We use the following GitHub repositories for the DataHub platform:
* [DEPLOY][deploy] - Automated deployment
* [FRONTEND][frontend] - Frontend application in node.js
* [ASSEMBLER][assembler] - Data assembly line
* [AUTH][auth] - A generic OAuth2 authentication service and user permission manager.
* [SPECSTORE][specstore] - API server for managing a Source Spec Registry
* [BITSTORE][bitstore] - A microservice for storing blobs i.e. files.
* [RESOLVER][resolver] - A microservice for resolving datapackage URLs into more human readable ones
* [DOCS][docs] - Documentation
[deploy]: https://github.com/datahq/deploy
[frontend]: https://github.com/datahq/frontend
[assembler]: https://github.com/datahq/assembler
[auth]: https://github.com/datahq/auth
[specstore]: https://github.com/datahq/specstore
[bitstore]: https://github.com/datahq/bitstore
[resolver]: https://github.com/datahq/resolver
[docs]: https://github.com/datahq/docs
```mermaid
graph TD
subgraph Repos
frontend[Frontend]
assembler[Assembler]
auth[Auth]
specstore[Specstore]
bitstore[Bitstore]
resolver[Resolver]
docs[Docs]
end
subgraph Sites
dhio[datahub.io]
dhdocs[docs.datahub.io]
docs --> dhdocs
end
deploy((DEPLOY))
deploy --> dhio
frontend --> deploy
assembler --> deploy
auth --> deploy
specstore --> deploy
bitstore --> deploy
resolver --> deploy
```
## Install
We use several different services to run our platform; please follow the installation instructions here:
* [Install Assembler](https://github.com/datahq/assembler#assembler)
* [Install Auth](https://github.com/datahq/auth#datahq-auth-service)
* [Install Specstore](https://github.com/datahq/specstore#datahq-spec-store)
* [Install Bitstore](https://github.com/datahq/bitstore#quick-start)
* [Install DataHub-CLI](https://github.com/datahq/datahub-cli#usage)
* [Install Resolver](https://github.com/datahq/resolver#quick-start)
## Deploy
For deployment of the application in a production environment, please see [the deploy page][deploydocs].
[deploydocs]: /docs/dms/deploy
## DataHub CLI
The DataHub CLI is a Node.js library and command-line interface for interacting with a DataHub instance.
[CLI code](https://github.com/datahq/datahub-cli)

View File

@ -1,34 +0,0 @@
# DataHub API
The DataHub API provides a range of endpoints to interact with the platform. All endpoints live under the URL `https://api.datahub.io`, and the API is divided into the following sections: **auth, rawstore, sources, metastore, resolver**.
## Auth
A generic OAuth2 authentication service and user permission manager.
https://github.com/datahq/auth#api
## Rawstore
DataHub microservice for storing blobs i.e. files. It is a lightweight auth wrapper for an S3-compatible object store that integrates with the rest of the DataHub stack and especially the auth service.
https://github.com/datahq/bitstore#api
## Sources
An API server for managing a Source Spec Registry.
https://github.com/datahq/specstore#api
## Metastore
A search service for DataHub.
https://github.com/datahq/metastore#api
## Resolver
DataHub microservice for resolving datapackage URLs into more human readable ones.
https://github.com/datahq/resolver#api

View File

@ -1,91 +0,0 @@
## DevOps - Production Deployment
We use various cloud services for the platform, for example AWS S3 for storing data and metadata, and the application runs on Docker Cloud.
We have fully automated the deployment of the platform, including the setup of all necessary services, so that deployment is a single command. Code and instructions here:
https://github.com/datahq/deploy
Below we provide a conceptual outline.
### Outline - Conceptually
```mermaid
graph TD
user[fa:fa-user User] --> frontend[Frontend]
frontend --> apiproxy[API Proxy]
frontend --> bits[BitStore - S3]
```
### New Structure
This diagram shows the current deployment architecture.
```mermaid
graph LR
cloudflare --> haproxy
haproxy --> frontend
subgraph auth
postgres
authapp
end
subgraph rawstore
rawobjstore
rawapp
end
subgraph pkgstore
pkgobjstore
pkgapp
end
subgraph metastore
elasticsearch
metastore
end
haproxy --/auth--> authapp
haproxy --/rawstore--> rawapp
haproxy --> pkgapp
haproxy --/metastore--> metastore
```
### Old Structures
#### Heroku
```mermaid
graph TD
user[fa:fa-user User]
bits[BitStore]
cloudflare[Cloudflare]
user --> cloudflare
cloudflare --> heroku
cloudflare --> bits
heroku[Heroku - Flask] --> rds[RDS Database]
heroku --> bits
```
#### AWS Lambda - Flask via Zappa
We are no longer using AWS and Heroku in this way. However, we have kept this for historical purposes and in case we return to any of them.
```mermaid
graph TD
user[fa:fa-user User] --> cloudfront[Cloudfront]
cloudfront --> apigateway[API Gateway]
apigateway --> lambda[AWS Lambda - Flask via Zappa]
cloudfront --> s3assets[S3 Assets]
lambda --> rds[RDS Database]
lambda --> bits[BitStore]
cloudfront --> bits
```

View File

@ -1,209 +0,0 @@
# Platform
The DataHub platform follows a service oriented architecture. It is built from a set of loosely coupled components, each performing distinct functions related to the platform as a whole.
## Architecture
<p style={{textAlign: "center"}}>Fig 1: Data Flow through the system</p>
```mermaid
graph TD
cli((CLI fa:fa-user))
auth[Auth Service]
cli --login--> auth
cli --store--> raw[Raw Store API<br>+ Storage]
cli --package-info--> pipeline-store
raw --data resource--> pipeline-runner
pipeline-store -.generate.-> pipeline-runner
pipeline-runner --> package[Package Storage]
package --api--> frontend[Frontend]
frontend --> user[User fa:fa-user]
package -.publish.->metastore[MetaStore]
pipeline-store -.publish.-> metastore[MetaStore]
metastore[MetaStore] --api--> frontend
```
<p style={{"textAlign": "center"}}>Fig 2: Components Perspective - from the Frontend</p>
```mermaid
graph TD
subgraph Web Frontend
frontend[Frontend Webapp]
browse[Browse & Search]
login[Login & Signup]
view[Views Renderer]
frontend --> browse
frontend --> login
end
subgraph Users and Permissions
user[User]
permissions[Permissions]
authapi[Auth API]
authzapi[Authorization API]
login --> authapi
authapi --> user
authzapi --> permissions
end
subgraph PkgStore
bitstore["PkgStore (S3)"]
bitstoreapi[PkgStore API<br/>put,get]
bitstoreapi --> bitstore
browse --> bitstoreapi
end
subgraph MetaStore
metastore["MetaStore (ElasticSearch)"]
metaapi[MetaStore API<br/>read,search,import]
metaapi --> metastore
browse --> metaapi
end
subgraph CLI
cli[CLI]
end
```
## Information Architecture
```
datahub.io # frontend
api.datahub.io # API - see API page for structure
rawstore.datahub.io # rawstore - raw bitstore
pkgstore.datahub.io # pkgstore - package bitstore
```
## Components
### Frontend Web Application
Core part of platform - Login & Sign-Up and Browse & Search Datasets
https://github.com/datahq/frontend
#### Views and Renderer
JS Library responsible for visualization and views.
See [views][] section for more about Views.
### Assembler
TODO
### Raw Storage
We first save all raw files before sending to pipeline-runner.
**Pipeline-runner** is a service that runs the data package pipelines. It is used to normalise and modify the data before it is displayed publicly
- We use AWS S3 instance for storing data
### Package Storage
We store files after passing pipeline-runner
- We use AWS S3 instance for storing data
### BitStore
We are preserving the data byte by byte.
- We use AWS S3 instance for storing data
### MetaStore
The MetaStore provides an integrated, searchable view over key metadata for end user services and users. Initially this metadata will just be metadata on datasets in the Package Store. In future it may expand to provide a unified view that includes other related metadata such as pipelines. It also includes summary metadata (or the ability to compute summary data), e.g. the total size of all your packages.
#### Service architecture
```mermaid
graph TD
subgraph MetaStore
metaapi[MetaStore API]
metadb[MetaStore DB fa:fa-database]
end
metadb --> metaapi
assembler[Assembler] --should this by via api or direct to DB??--> metadb
metaapi --> frontend[Frontend fa:fa-user]
metaapi --> cli[CLI fa:fa-user]
frontend -.no dp stuff only access.-> metaapi
```
### Command Line Interface
The command line interface.
https://github.com/datahq/datahub-cli
[views]: /docs/dms/views
[web-app]: http://datahub.io/
## Domain model
There are two main concepts to understand in the DataHub domain model - [Profile](#profile) and [Package](#data-package)
```mermaid
graph TD
pkg[Data Package]
resource[Resource]
file[File]
version[Version]
user[User]
publisher[Publisher]
subgraph Package
pkg --0..*--> resource
resource --1..*--> file
pkg --> version
end
subgraph Profile
publisher --1..*--> user
publisher --0..*--> pkg
end
```
### Profile
A set of authenticated and authorized entities such as publishers and users. They are responsible for publishing, deleting or maintaining data on the platform.
**Important:** Users do not have Data Packages, Publishers do. Users are *members* of Publishers.
#### Publisher
A Publisher is an organization which "owns" Data Packages. A Publisher may have zero or more Data Packages, and may also have one or more users.
#### User
A User is an authenticated entity that is a member of a Publisher organization and can read, edit, create or delete data packages depending on their permissions.
#### Package
A Data Package is a simple way of “packaging” up and describing data so that it can be easily shared and used. You can imagine it as a collection of data and its metadata ([datapackage.json][datapackage.json]), usually covering some concrete topic, e.g. *"Gold Prices"* or *"Population Growth Rate in My Country"*.
Each Data Package may have zero or more resources and one or more versions.
**Resources** - think of them like "tables". Each can map to one or more physical files (usually just one) - think of a data table split into multiple CSV files on disk.
**Version of a Data Package** - similar to git commits and tags. People can mean different things by a "Version":
* Tag - Same as label or version - a nice human usable label e.g. *"v0.3"*, *"master"*, *"2013"*
* Commit/Hash - Corresponds to the hash of datapackage.json, with that datapackage.json including all hashes of all data files
We interpret Version as the *"Tag"* concept. *"Commit/Hash"* is not supported.
[datapackage.json]: http://frictionlessdata.io/guides/data-package/#datapackagejson
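For reference, a minimal sketch of the kind of descriptor this implies; the field names follow the [datapackage.json][datapackage.json] spec, while the concrete values are invented:

```python
import json

descriptor = {
    "name": "gold-prices",              # invented example package
    "title": "Gold Prices",
    "version": "2013",                  # a human-friendly tag (see Version above)
    "resources": [
        {
            "name": "annual",
            "path": "data/annual.csv",  # one physical file backing this resource
            "format": "csv",
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```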

View File

@ -1,107 +0,0 @@
# Publish
An explanation of the DataHub publishing flow from the client and back-end perspectives.
```mermaid
graph TD
cli((CLI fa:fa-user))
auth[Auth Service]
cli --login--> auth
cli --store--> raw[Raw Store API<br>+ Storage]
cli --package-info--> pipeline-store
raw --data resource--> pipeline-runner
pipeline-store -.generate.-> pipeline-runner
pipeline-runner --> package[Package Storage]
package --api--> frontend[Frontend]
frontend --> user[User fa:fa-user]
package -.publish.->metastore[MetaStore]
pipeline-store -.publish.-> metastore[MetaStore]
metastore[MetaStore] --api--> frontend
```
## Diagram for upload process
```mermaid
graph TD
CLI --jwt--> rawstore[RawStore API]
rawstore --signed urls--> CLI
CLI --upload using signed url--> s3[S3 bucket]
s3 --success message--> CLI
CLI --metadata--> pipe[Pipe Source]
```
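A rough sketch of the client side of this upload flow; the endpoint path, header and response field names are illustrative assumptions rather than the documented DataHub API:

```python
import requests

JWT = "..."                                    # token obtained from the auth service at login
RAWSTORE = "https://api.datahub.io/rawstore"   # assumed base path for the RawStore API

# 1. Ask the RawStore API for a signed upload URL (hypothetical endpoint and fields)
resp = requests.post(
    f"{RAWSTORE}/authorize",
    headers={"Authorization": f"Bearer {JWT}"},
    json={"filename": "data.csv"},
    timeout=30,
)
signed_url = resp.json()["upload_url"]         # assumed response field

# 2. Upload the file directly to the S3 bucket using the signed URL
with open("data.csv", "rb") as f:
    requests.put(signed_url, data=f, timeout=300)

# 3. Send the package metadata on to the pipeline store (separate call, not shown)
```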
## Identity Pipeline
**Context: where this pipeline fits in the system**
```mermaid
graph LR
specstore --shared db--> assembler
assembler --identity pipeline--> pkgstore
pkgstore --> frontend
```
**Detailed steps**
```mermaid
graph LR
load[Load from RawStore] --> encoding[Encoding Check<br>Add encoding info]
encoding --> csvkind[CSV kind check]
csvkind --> validate[Validate data]
validate --> dump[Dump S3]
dump --> pkgstore[Pkg Store fa:fa-database]
load -.-> dump
validate --> checkoutput[Validation<br>Reports]
```
## Client Perspective
The publishing flow takes the following steps and processes to communicate with the DataHub API:
```mermaid
sequenceDiagram
Upload Agent CLI->>Upload Agent CLI: Check Data Package valid
Upload Agent CLI-->>Auth(SSO): login
Auth(SSO)-->>Upload Agent CLI: JWT token
Upload Agent CLI->>RawStore API: upload using signed url
RawStore API->>Auth(SSO): Check key / token
Auth(SSO)->>RawStore API: OK / Not OK
RawStore API->>Upload Agent CLI: success message
Upload Agent CLI->>pipeline store: package info
pipeline store->>Upload Agent CLI: OK / Not OK
pipeline store->>pipeline runner: generate
RawStore API->>pipeline runner: data resource
pipeline runner->>Package Storage: generated
Package Storage->>Metadata Storage API: publish
pipeline store->>Metadata Storage API: publish
Metadata Storage API->>Upload Agent CLI: OK / Not OK
```
<br/>
* Upload API - see `POST /source/upload` in *source* section of [API][api]
* Authentication API - see `GET /auth/check` in *auth* section of [API][api].
* Authorization API - see `GET /auth/authorize` in *auth* section of [API][api].
See example [code snippet in DataHub CLI][publish-code]
[api]: /docs/dms/datahub/developers/api
[publish-code]: https://github.com/datahq/datahub-cli/blob/b869d38073248903a944029cf93eddf3ef50001a/bin/data-push.js#L34

View File

@ -1,811 +0,0 @@
# User Stories
DataHub is the place where *people* can **store, share and publish** their data, **collect, inspect and process** it with **powerful tools**, and **discover and use** data shared by others. [order matters]
People = data wranglers = those who use machines (e.g. code, command line tools) to work with their data rather than editing it by hand (as, for example, many analysts do in Excel). (Think people who use Python vs people who use Excel for data work)
* Data is not chaotic and is in some sense neat
* Can present your data with various visualization tools (graphs, charts, tables etc.)
* Easy to publish
* Specific data (power) tools and integrations
* Can validate your data before publishing
* Data API
* Data Conversion / Bundling: zip the data, provide sqlite
* Generate a node package of your data
* (Versioning)
## Personas
* **[Geek] Publisher**. Knows how to use a command line or other automated tooling. Wants to publish their data package in order to satisfy their team's requirements to publish data.
* Non-Geek Publisher. Tbc …
* **Consumer**: A person or organization looking to use data packages (or data in general)
* Data Analyst
* Coder (of data driven applications)
* …
* **Admin**: A person or organization who runs an instance of a DataHub
## Stories v2
### Publishing data
As a Publisher I want to publish a file/dataset and view/share just with a few people (or even just myself)
* ~~"Private" link: {'/{username}/{uuid}'}~~
* I want JSON as well as CSV versions of my data
* I want a preview
* I want to be notified clearly if something went wrong and what I can do to fix it.
As a Publisher I want to publish a file/dataset and share publicly with everyone
* Viewable on my profile
* Public link: nice URLs `/{username}/{dataset-name}`
For the pipeline =>
**Context: where this pipeline fits in the system**
```mermaid
graph LR
specstore --shared db--> assembler
assembler --identity pipeline--> pkgstore
pkgstore --> frontend
```
**Detailed steps**
```mermaid
graph LR
load[Load from RawStore] --> encoding[Encoding Check<br>Add encoding info]
encoding --> csvkind[CSV kind check]
csvkind --> validate[Validate data]
validate --> dump[Dump S3]
dump --> pkgstore[Pkg Store fa:fa-database]
load -.-> dump
validate --> checkoutput[Validation<br>Reports]
```
### Push Package
#### Diagram for upload process
```mermaid
graph TD
CLI --jwt--> rawstore[RawStore API]
rawstore --signed urls--> CLI
CLI --upload using signed url--> s3[S3 bucket]
s3 --success message--> CLI
CLI --metadata--> pipe[Pipe Source]
```
### Push File
Levels:
0. Already have Data Package (?)
1. Good CSV
2. Good Excel
3. Bad data (i.e. has ...)
4. Something else
```
data push {file-or-directory}
```
How does data push work?
```
# you are pushing the raw file
# and the extraction to get one or more data tables ...
# in the background we are creating a data package + pipeline
data push {file}
Algorithm:
1. Detect type / format
2. Choose the data (e.g. sheet from excel)
3. Review the headers
4. Infer data-types and review
5. [Add constraints]
6. Data validation
7. Upload
8. Get back a link - view page (or the raw url) e.g. http://datapackaged.com/core/finance-vix
* You can view, share, publish, [fork]
1. Detect file type
=> file extension
1. Offer guess
2. Probable guess (options?)
3. Unknown - tell us
1B. Detect encoding (for CSV)
2. Choose the data
1. 1 sheet => ok
2. Multiple sheets guess and offer
3. Multiple sheets - ask them (which to include)
2B: bad data case - e.g. selecting within table
3. Review the headers
* Here is what we found
* More than one option for headers - try to reconcile
*
### Upload:
* raw file with name a function of the md5 hash
* Pros: efficient on space (e.g. same file stored once but means you need to worry about garbage collection?)
* the pipeline description: description of data and everything else we did [into database]
Then pipeline runs e.g. load into a database or into a data package
* stores output somewhere ...
Viewable online ...
Note:
data push url # does not store file
data push file # store in rawstore
### BitStore
/rawstore/ - content addressed storage (md5 or sha hashed)
/packages/{owner}/{name}/{tag-or-pipeline}
```
Try this for a CSV file
```
data push mydata.csv
# review headers
# data types ...
Upload
* csv file gets stored as blob md5 ...
* output of the pipeline stored ...
* canonical CSV gets generated ...
```
Data Push directory
```
data push {directory}
# could just do data push file for each file but ...
# that could be tedious
# once I've mapped one file you try reusing that mapping for others ...
# .data directory that stores the pipeline and the datapackage.json
```
## Stories
### 1. Get Started
#### 1. Sign in / Sign up [DONE]
As a Geek Publisher I want to sign up for an account so that I can publish my data package to the registry and to have a publisher account to publish my data package under.
*Generally want this to be as minimal, easy and quick as possible*
* Sign in with a Google account
* (?) what about other social accounts?
* Essential profile information (after sign in we prompt for this)
* email address
* Name
* (?) - future. Credit card details for payment - can we integrate with payment system (?)
* They need to choose a username, which is a URL-friendly, unique, human-readable name for our app. It can be used in sign-in and in many other places.
* WHY? Where would we need this? For url on site & for publisher
* Same as publisher names (needed for URLs): [a-z-_.]
* Explain: they cannot change this later e.g. "Choose wisely! Once you set this it cannot be changed later!"
* Send the user an email confirming their account is set up and suggesting next steps
Automatically:
* Auto-create a publisher for them
* Same name as their user name but a publisher
* That way they can start publishing straight away …
**TODO: (??) should we do *all* of this via the command line client (a la npmjs) **
#### Sign up via github (and/or google) [DONE]
As a Visitor I want to sign up via GitHub or Google so that I don't have to enter lots of information or remember my password for yet another website
* How do we deal with username conflicts? What about publisher name conflicts?
* This does not arise in a simple username system because we have only one pool of usernames
#### Next Step after Sign Up
As a Geek Publisher I want to know what do next after signing up so that I can get going quickly.
Things to do:
* Edit your profile
* Download a client / Configure your client (if you have one already)
* Instructions on getting relevant auth credentials
* Note they will *need* to have set a username / password in their profile
* Join a Publisher (understand what a publisher is!)
#### Invite User to Join Platform
As an Admin (or existing Registered User?) I want to invite someone to join the platform so that they can start contributing or using data
* Get an invitation email with a sign up link
* *Some commonality with Publisher invite member below*
### 2. Publish Data Packages
#### Publish with a Client [DONE]
As a Geek Publisher I want to import (publish) my data package into the registry so my data has a permanent online home so that I and others can have access
On command line looks like:
```
$ cd my/data/package
$ data publish
> … working …
>
> SUCCESS
```
Notes
* Permissions: must be a member of the Publisher
* Internally: DataPackageCreate or DataPackageUpdate capability
* Handle conflicts: if the data package already exists, return 409. The client should report that the package already exists and suggest "--force" or similar to overwrite
* API endpoint behind the scenes: `POST {api}/package/`
* TODO: private data packages
* And payment!
##### Configure Client [DONE]
As a Geek Publisher I want to configure my client so I can start publishing data packages.
Locally, in $HOME, store something like:
```
.dpm/credentials # stores your API key and user name
.dpm/config # stores info like your default publisher
```
#### Update a Data Package [DONE]
As a Geek Publisher I want to use a publish command to update a data package that is already in the registry so it appears there
* Old version will be lost (!)
#### Delete a Data Package
As a Geek Publisher I want to unpublish (delete) a data package so it is no longer visible to anyone
#### Purge a Data Package
As a Geek Publisher I want to permanently delete (purge) a data package so that it no longer takes up storage space
#### Validate Data in Data Package
##### Validate in CLI [DONE]
As a Publisher [owner/member] I want to validate the data I am about to publish to the registry so that I publish “good” data and know that I am doing so and do not have to manually check that the published data looks ok (e.g. rendering charts properly) (and if wrong I have to re-upload)
```
data datavalidate [file-path]
```
* [file-path] - run this against a given file. Look in the resources to see if this file is there and if so use the schema. Otherwise just do goodtables table …
* If no file provided run validate against each resource in turn in the datapackage
* Output to stdout.
* Default: human-readable - nice version of output from goodtables.
* Option for JSON e.g. --json to put machine readable output
* check goodtables command line tool and follow if possible. Can probably reuse code
* Auto-run this before publish unless explicit suppression (e.g. --skip-datavalidate)
* Use goodtables (?) - see the sketch below
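As referenced above, a rough sketch of what this step might do under the hood with goodtables-py (the report's exact keys can vary between versions):

```python
from goodtables import validate  # the library the notes suggest reusing

# Validate a single tabular file; goodtables infers a schema if none is given
report = validate("mydata.csv")

if report["valid"]:
    print("Data looks good - safe to publish")
else:
    # The report also carries per-table error details for the human-readable output
    print("Validation failed:", report.get("error-count", "see report"), "errors")
```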
##### Validate on Server
As a Publisher [owner] I want my data to be validated when I publish it so that I know immediately if I have accidentally “broken” my data or introduced bugs, and can take action to correct it.
As a Consumer I want to know that the data I am downloading is “good” and can be relied on so that I don't have to check it myself or run into annoying bugs later on
* Implies showing something in the UI e.g. “Data Valid” (like build passing)
**Implementation notes to self**
* Need a new table to store results of validation and a concept of a “run”
* Store details of the run [e.g. time to complete, ]
* How to automate doing validation (using goodtables we assume) - do we reuse a separate service (goodtables.io in some way) or run ourselves in a process like ECS ???
* Display this in frontend
#### Cache Data Package Resource data (on the server)
As a Publisher I want to publish a data package where its resource data is stored on my servers but the registry caches a copy of that data so that if my data is lost or gets broken I still have a copy people can use
As a Consumer I want to be able to get the data for a data package even if the original data has been moved or removed so that I can still use it and my app or analysis keeps working
* TODO: what does this mean for the UI or command line tools. How does the CLI know about this, how does it use it?
#### Publish with Web Interface
As a Publisher I want to publish a data package in the UI so that it is available and published
* Publish => they already have datapackage.json and all the data. They just want to be able to upload and store this.
As a Publisher I want to create a data package in the UI so that it is available and published
* Create => no datapackage.json - just data files. Need to add key descriptors information, upload data files and have schemas created etc etc.
#### Undelete data package
[cli] As a Publisher I want to be able to restore a deleted data package via the CLI, so that it is visible again and available to view, download (and search)
```
dpmpy undelete
```
[webui] As a Publisher I want to undelete deleted data packages, so that they are visible again.
#### Render (views) in data package in CLI before upload
As a Publisher, I want to be able to preview the views (graphs and table (?)) of the current data package using the CLI prior to publishing, so that I can refine the JSON declarations of the datapackage views section to achieve a great-looking result.
### 3. Find and View Data Packages
#### View a Data Package Online [DONE]
**EPIC: As a Consumer I want to view a data package online so I can get a sense of whether this is the dataset I want**
* *Obsess here about “whether this is the dataset I want”*
* *Publishers want this too … *
* *Also important for SEO if we have good info here*
Features
* Visualize data in charts - gives one an immediate sense of what this is
* One graph section at top of page after README excerpt
* One graph for each entry in the “views”
* Interactive table - allows me to see what is in the table
* One table for each resource
This user story can be viewed from two perspectives:
* From a publisher point of view
* From a consumer point of view
As a **publisher** I want to show the world what my published data looks like so that it immediately catches consumers' attention (and so I know it looks right - e.g. the graph is OK)
As a **consumer** I want to view the data package so that I can get a sense of whether I want this dataset or not.
Acceptance criteria - what does done mean!
* A table for each resource
* Simple graph spec works => converts to plotly
* Multiple time series
* Plotly spec graphs work
* All core graphs work (not sure how to check every one but representative ones)
* Recline graphs specs (are handled - temporary basis)
* Loading spinners whilst data is loading so users know what is happening
Bonus:
* Complex examples e.g. time series with a log scale … (e.g. hard drive data …)
Features: *DP view status*
* Different options to view data as graph.
* Recline
* Vega-lite
* Vega
* [Plotly]
* General Functionality
* Multiple views [wrongly done. We iterate over resource not views]
* Table as a view
* Interactive table so that consumer can do
* Filter
* Join
#### (Pre)View a not-yet-published Data Package Online
As a (potential) Publisher I want to preview a datapackage I have prepared so that I can check it works and share the results (if there is something wrong with others)
* Be able to supply a URL to my datapackage (e.g. on github) and have it previewed as it would look on DPR
* Be able to upload a datapackage and have it previewed
*Rufus: this was a very common use case for me (and others) when using data.okfn.org. Possibly less relevant if the command line tool can do previewing but still relevant IMO (some people may not have command line tool, and it is useful to be able to share a link e.g. when doing core datasets curation and there is something wrong with a datapackage).*
*Rufus: also need for an online validation tool*
#### See How Much a Data Package is Used (Downloaded) {2d}
As a Consumer I want to see how much the data has been downloaded so that I can choose the most popular option (=> probably the most reliable and complete) when there are several alternatives for my use case (maybe from different publishers)
#### Browse Data Packages [DONE]
As a potential Publisher, unaware of datapackages, I want to see real examples of published packages (with the contents of datapackage.json), so that I can understand how useful and simple the datapackage format and the registry itself are.
As a Consumer I want to see some example data packages quickly so I get a sense of what is on this site and if it is useful to look further
* Browse based on what properties? Most recent, most downloaded?
* Most downloaded
* Start with: we could just go with core data packages
#### Search for Data Packages [DONE]
As a Consumer I want to search data packages so that I can find the ones I want
* Essential question: what is it you want?
* Rufus: in my view generic search is actually *not* important to start with. People do not want to randomly search. More useful is to go via a publisher at the beginning.
* Search results should provide enough information to help a user decide whether to dig further e.g. title, short description
* For future when we have it: [number of downloads], stars etc
* Minimum viable search (based on implementation questions) - see the sketch after this list
* Filter by publisher
* Free text search against title
* Description could be added if we start doing actual scoring as easy to add additional fields
* Scoring would be nice but not essential
* Implementation questions:
* Search:
* Should search perform ranking (that requires scoring support)
* Free text queries should search against which fields (with what weighting)?
* Filtering: On what individual properties of the data package should be able to filter?
* Themes and profiles:
* Searching for a given profile: not possible atm.
* Themes: Should we tag data packages by themes like finance, education and let user find data package by that?
* Maybe but not now - maybe in the future
* If we follow the "go via a publisher" approach at the beginning, should we list the most popular publishers on the home page (for both logged-in and not-logged-in users)?
* If most popular publisher, then by what measure?
* Sort by Most published?
* Sort by Most followers?
* Sort by most downloads?
* Or all show top5 in each facet?
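As a toy sketch of the "minimum viable search" above - filter by publisher plus free-text match on title, no ranking - entirely in memory (a real implementation would sit on the MetaStore / ElasticSearch):

```python
from dataclasses import dataclass

@dataclass
class PackageMeta:
    name: str
    title: str
    publisher: str

def search(packages, query=None, publisher=None):
    """Minimum viable search: optional publisher filter, free-text match on title."""
    results = packages
    if publisher:
        results = [p for p in results if p.publisher == publisher]
    if query:
        q = query.lower()
        results = [p for p in results if q in p.title.lower()]
    return results

catalog = [
    PackageMeta("finance-vix", "CBOE Volatility Index (VIX)", "core"),
    PackageMeta("gold-prices", "Gold Prices", "core"),
]
print([p.name for p in search(catalog, query="gold")])  # -> ['gold-prices']
```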
Sub user stories:
> *[WONTFIX?] As a Consumer I want to find data packages by profile (e.g. spending) so that I can find the kind of data I want quickly and easily and in one big list*
>
> *As a Consumer I want to search based on the description of a data package, so that I can find packages related to some keywords*
#### Download Data Package Descriptor
As a Consumer I want to download the data package descriptor (datapackage.json) on its own so that …
*Rufus: I can't understand why anyone would want to do this*
#### Download Data Package in One File (e.g. zip)
As a Consumer I want to download the data package in one file so that I don't have to download the descriptor and each resource by hand
*Only useful if no cli tool and no install command*
### 4. Get a Data Package (locally)
Let's move the discussion to GitHub: https://github.com/frictionlessdata/dpm-py/issues/30
*TODO add these details from the requirement doc*
* *Local “Data Package” cache storage (`.datapackages` or similar)*
* *Stores copies of packages from Registry*
* *Stores new Data Packages the user has created*
* *This **Ruby lib** implements something similar*
#### Use DataPackage in Node (package auto-generated)
As a NodeJS developer I want to use data package as a node lib in my project so that I can depend on it using my normal dependency framework
* See this *real-world example* of this request for country-list
* => auto-building node package and publishing to npm (not that hard to do …)
* Convert CSV data to JSON (that's what you probably want from node?)
* Generate package.json
* Push to npm (register the dataset users)
* Rufus: My guess is that to implement this we want something a bit like GitHub integrations - specific additional hooks which also get some configuration (or do it like Travis - GitHub integration plus a tiny config file - in our case rather than a .travis.yml we have a .node.yml or whatever)
* Is it configurable for the user whether to push to npm or not?
* Yes. Since we need to push to a specific npm user (for each publisher) this will need to be configured (along with authorization - where does that go?)
* Is this something done for *all* data packages or does user need to turn something on? Probably want them to turn this on …
Questions:
* From where should we push the data package to the npm repo?
* Is it from dpmpy or from the server? Obviously from a server - this needs to be automated. But you can use dpmpy if you want (though I'm not sure we do want to …)
* What to do with multiple resources? Ans: include all resources
* Do we include datapackage.json into the node package? Yes, include it so they get all the original metadata.
*Generic version is:*
*As a Web Developer I want to download a DataPackage (like currency codes or country names) so that I can use it in the web service I am building [...]*
#### Import DataPackage into R [DONE?]
As a Consumer [R user] I want to load a Data Package from R so that I can immediately start playing with it
* Should we try and publish to CRAN?
* Probably not? Why? We think it can be quite painful getting permission to publish to CRAN, and it is very easy to load from the registry
* On the CRAN website I can't find a way to automate publishing. It seems possible by filling in a web form, but to know the status we have to wait for and parse an email.
* Using this library: https://github.com/ropenscilabs/datapkg
* Where can i know about this?
* On each data package view page …
*Generic version:*
*As a Data Analyst I want to download a data package, so that I can study it and wrangle with it to infer new data or generate new insights.*
*As a Data Analyst, I want to update previously downloaded data package, so that I can work with the most recent data.*
#### Import DataPackage into Pandas [DONE?]
TODO - like R
#### SQL / SQLite database
As a Consumer I want to download a DataPackages data one coherent SQLite database so that I can get it easily in one form
Question:
* Why do we need to store datapackage data in SQLite? Isn't it better to store it in a file structure?
We can store the datapackage like this:
```
~/.datapackage/<publisher>/<package>/<version>/*
```
This is the way maven/gradle/ivy cache jar locally.
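One answer to the question above is that SQLite gives consumers a single-file, queryable artifact. A minimal sketch (the resource names and paths are invented) of turning a package's CSV resources into one SQLite database:

```python
import sqlite3
import pandas as pd

# Map each (hypothetical) CSV resource of a local data package to a table name
resources = {"annual": "data/annual.csv", "monthly": "data/monthly.csv"}

conn = sqlite3.connect("datapackage.sqlite")
for table_name, csv_path in resources.items():
    df = pd.read_csv(csv_path)
    df.to_sql(table_name, conn, if_exists="replace", index=False)
conn.close()
```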
#### See changes between versions
As a Data Analyst I want to compare different versions of some datapackage locally, so that I can see schema changes clearly and adjust my analytics code to the desired schema version.
#### Low Priority
As a Web Developer of multiple projects, I want to be able to install multiple versions of the same datapackage separately so that all my projects could be developed independently and deployed locally. (virtualenv-like)
As a Developer I want to list all DataPackage requirements for my project in a file and pin the exact versions of any DataPackage that my project depends on, so that the project can be deterministically deployed locally and won't break because of DataPackage schema changes. (requirements.txt-like)
### 5. Versioning and Changes in Data Packages
When we talk about versioning we can mean two things:
* Explicit versioning: this is like the versioning of releases, “v1.0” etc. This is conscious and explicit. Main purpose:
* to support other systems depending on this one (they want the data at a known stable state)
* easy access to major staging points in the evolution (e.g. i want to see how things were at v1)
* Implicit versioning or “revisioning”: this is like the commits in git or the autosave of a word or google doc. It happens frequently, either with minimum effort or even automatically. Main purpose:
* Undelete and recovery (you save at every point and can recover if you accidentally overwrite or delete something)
* Collaboration and merging of changes (in revision control)
* Activity logging
#### Explicit Versioning - Publisher
As a Publisher I want to tag a version of my data on the command line so that … [see the “so that”s below]
dpmpy tag {tag-name}
=> tag current “latest” on the server as {tag-name}
* Do we restrict {tag-name} to semver? I don't think so atm.
* As a {Publisher} I want to tag a data package to create a snapshot of the data on the registry server, so that consumers can refer to it
* As a {Publisher} I want to be warned that a tag exists, when I try to overwrite it, so that I don't accidentally overwrite stable tagged data, which is relied on by consumers.
* As a {Publisher} I want to be able to overwrite the previously tagged datapackage, so that I can fix it if I mess up.
* The versioning here happens server side
* Is this confusing for users? I.e. they are doing something local.
Background “so that” user story epics:
* As a {Publisher} I want to version my Data Package and keep multiple versions around including older versions so that I do not break consumer systems when I change my Data Package (whether schema or data) [It is not just the publisher who wants this, it is a consumer - see below]
* As a {Publisher} I want to be able to get access to a previous version I tagged so that I can return to it and review it (and use it)
* so that I can recover old data if I delete it myself or compare how things changed over time
#### Explicit Versioning - Consumer
As a {Consumer} (of a Data Package) I want to know full details of when and how the data package schema has changed so that I can adjust my scripts to handle it.
Important info to know for each schema change:
* time when published
* for any ***changed*** field - name, what was changed (type, format, …?),
> +maybe everything else that was not changed (full field descriptor)
* for any ***deleted*** field - name,
> +maybe everything else (full field descriptor)
* for any ***added*** field - all data (full field descriptor)
*A change in schema would correspond to a major version change in software (see http://semver.org/).*
***Concerns about explicit versioning**: we all have experience with consuming data from e.g. government publishers where the publishers change the data schema, breaking client code. I am constantly looking for a policy/mechanism to guide publishers to develop stable schema versioning for the data they produce, and help consumers to get some stability guarantees.*
***Automated versioning / automated tracking**: Explicit versioning relies on the publisher, and humans can forget or not care enough about others. So to help consumers my suggestion would be to always track schema changes of uploaded packages on the server, and allow users to review those changes on the website. (We might even want to implement auto-tagging or not allowing users to upload a package with the same version but a different schema without forcing)*
As a {Consumer} I want to get a sense of how outdated the data package I downloaded before is, so that I can decide whether I should update or not.
* I want to preview a DataPackage changelog (a list of all available versions/tags with brief info) online, sorted by creation time, so that I can get a sense of how the data or schema has changed since some time in the past. Important brief info:
* Time when published
* How many rows added/deleted for each resource data
* What fields (column names) changed, were added or deleted for each resource.
As a {Consumer} I want to view a Datapackage at a particular version online, so that I can present/discuss the particular data timeslice of interest with other people.
As a {Consumer} I want to download a Data package at a particular version so that I know it is compatible with my scripts and system
* Online: I want to pick the version I want from the list, and download it (as zip for ex.)
* CLI: I want to specify tag or version when using the `install` command.
##### Know when a package has changed re caching
Excerpted from: https://github.com/okfn/data.okfn.org-new/issues/7
_From @trickvi on June 20, 2013 12:37_
I would like to be able to use data.okfn.org as an intermediary between my software and the data packages it uses and be able to quickly check whether there's a new version available of the data (e.g. if I've cached the package on a local machine).
There are ways to do it with the current setup:
1. Download the datapackage.json descriptor file, parse it and get the version there and check it against my local version. Problems:
- This solution relies on humans and that they update their version but there might not be any consistency in it since the data package standard describes the version attribute as: _"a version string conforming to the Semantic Versioning requirement"_
- I have to fetch the whole datapackage.json (it's not big I know but why download all that extra data I might not even want)
2. Go around data.okfn.org and look directly at the github repository. Problems:
- I have to find out where the repo is, use git and do a lot of extra stuff (I don't care how the data packages are stored, I just want a simple interface to fetch them)
- What would be the point of data.okfn.org/data? In my mind it collects data packages and provides a consistent interface to get the data packages irrespective of how they're stored.
I propose data.okfn.org provides an internal system to allow users to quickly check whether a new version might be released. This does not have to be an API. We could leverage HTTP's caching mechanism using an ETag header that would contain some hash value. This hash value can e.g. be the sha value of the heads ref objects served via the Github API:
```
https://api.github.com/repos/datasets/cpi/git/refs/heads/master
```
Software that works with data packages could then implement a caching strategy and just send a request with an If-None-Match header along with a GET request for datapackage.json to either get a new version of the descriptor (and look at the version in that file) or just serve the data from its cache.
_Copied from original issue: frictionlessdata/ideas#51_
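A sketch of the consumer side of that caching strategy, using the standard `requests` library; it assumes the registry returns an `ETag` header as proposed above, and the function/parameter names are illustrative.
```python
import requests

def fetch_descriptor(url, cached_etag=None, cached_descriptor=None):
    """Fetch datapackage.json only if it changed since we last cached it."""
    headers = {'If-None-Match': cached_etag} if cached_etag else {}
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:
        # Nothing changed upstream: keep serving the cached copy.
        return cached_etag, cached_descriptor
    resp.raise_for_status()
    return resp.headers.get('ETag'), resp.json()
```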
#### Revisioning - Implicit Versioning
#### Change Notifications
As a Consumer I want to be notified of changes to a package i care about so that I can check out what has changed and take action (like downloading the updated data)
As a Consumer I want to see how active the site is to see if I should get involved
### 6. Publishers
#### Create a New Publisher
TODO
#### Find a Publisher (and users?)
As a Consumer I want to browse and find publishers so that I can find interesting publishers and their packages (so that I can use them)
#### View a Publisher Profile
*view data packages associated to a publisher or user*
Implementation details: *https://hackmd.io/MwNgrAZmCMAcBMBaYB2eAWR72woghmLNIrAEb4AME+08s6VQA===*
As a Consumer I want to see a publisher's profile so that I can discover their packages and get a sense of how active and good they are
**As a Publisher I want to have a profile with a list of my data packages so that:**
* Others can find my data packages quickly and easily
* I can see how many data packages I have
* **I can find a data package I want to look at quickly [they can discover their own data]**
* **I can find the link for a data package to send to someone else**
* *People want to share what they have done. This is probably the number one way the site gets prominence at the start (along with simple google traffic)*
* so that I can check that members do not abuse their rights to publish and only publish topical data packages.
As a Consumer I want to view a publisher's profile so that I can see who is behind a particular package or to see what other packages they produce [navigate up from a package page] [so that: I can trust their published data packages enough to reuse them.]
**Details**
* Profile =
* Full name / title e.g. “World Bank”, identifier e.g. world-bank
* *picture, short description text (if we have this - we don't atm)*
* *(esp important to know if this is the world bank or not)*
* *Total number of data packages*
* List of data packages
* View by most recently created (updated?)
* For each DataPackage we want to see: title, number of resources (?), first 200 characters of the description, license (see data.okfn.org/data/ for example)
* Do we limit / paginate this list? No, not for the moment
* *[wontfix atm] Activity - this means data packages published, updated*
* *[wontfix atm] Quality … - we dont have anything on this*
* *[wontfix atm] List of users*
* What are the permissions here?
* Do we show private data packages? No
* Do we show them when “owner” viewing or sysadmin? Yes (but flag as “private”)
* What data packages to show? All the packages you own.
* What about pinning? No support for this atm.
##### Search among a publisher's packages
As a Consumer I want to search among all data packages owned by a publisher so that I can easily find one data package amongst all the data packages by this publisher.
##### Registered Users Profile and packages
*As a Consumer I want to see the profile and activity of a user so that …*
*As a Registered User I want to see the data packages i am associated with **so that** [like publisher]*
#### Publisher and User Leaderboard
As a ??? I want to see who the top publishers and users are so that I can emulate them or ???
#### Manage Publisher
##### Create and Edit Profile
As {Owner ...} I want to edit my profile so that it is updated with new information
##### Add and Manage Members
As an {Owner of a Publisher in the Registry} I want to invite an existing user to become a member of my publisher
* Auto lookup by user name (show username and fullname) - standard as per all sites
* User gets a notification on their dashboard + email with link to accept invite
* If invite is accepted notify the publisher (?) - actually do not do this.
As an {Owner of a Publisher in the Registry} I want to invite someone using their email to sign up and become a member of my Publisher so that they are authorized to publish data packages under my Publisher.
As a {Publisher Owner} I want to remove someone from membership in my publisher so they no longer have the ability to publish or modify my data packages
As a {Publisher Owner} I want to view all the people in my organization and what roles they have so that I can change these if I want
As a {Publisher Owner} I want to make a user an “owner” so they have full control
As a {Publisher Owner} I want to remove a user as an “owner” so they are just a member and no longer have full control
### 7. Web Hooks and Extensions
TODO: how do people build value added services around the system (and push back over the API etc …) - OAuth etc
### 8. Administer Site
#### Configure Site
As the Admin I want to set key configuration parameters for my site deployment so that I can change key information like the site title
* Main config database is the one thing we might need
#### See usage metrics
As an Admin I want to see key metrics about usage such as users, API usage, downloads etc so that I know how things are going
* Total users signed up, how many signed up in the last week / month etc
* Total publishers …
* Users per publisher distribution (?)
* API usage
* Downloads
* Billing: revenue in relevant periods
* Costs: how much are we spending on storage
#### Pricing and Billing
As an Admin I want to have a pricing plan and billing system so that I can charge users and make my platform sustainable
As a Publisher I want to know if this site has a pricing plan and what the prices are so that I can work out what this will cost me in the future and have a sense that these guys are sustainable (free forever does not work very well)
As a Publisher I want to sign up for a given pricing plan so that I am entitled to what it allows (e.g. private stuff …)
### Private Data Packages
cf npmjs.com
As a Publisher I want to have private data packages that I can share just with my team
### Sell My Data through your site
**EPIC: As a Publisher I want to sell my data through your site so that I make money and am able to sustain my publishing and my life …**

# Views
Producers and consumers of data want to have data presented in tables and graphs -- "views" on the data. They want this for a range of reasons, from simple eyeballing to drawing out key insights.
```mermaid
graph LR
data[Your Data] --> table[Table]
data --> grap[Graph]
data --> geo[Map]
```
To achieve this we need to provide:
* A tool-chain to create these views from the data.
* A descriptive language for specifying views such as tables, graphs, maps.
These requirements are addressed through the introduction of Data Package "Views" and associated tooling.
```mermaid
graph LR
subgraph Data Package
resource[Resource]
view[View]
resource -.-> view
end
view --> toolchain
toolchain --> svg["Rendered Graph (SVG)"]
toolchain --> table[Table]
toolchain --> map[Map]
```
This section describes the details of how we support [Data Package Views][views] in the DataHub.
It consists of two parts, the first describes the general tool chain we have. The second part describes how we use that to generate graphs in the showcase page.
**Quick Links**
* [Data Package Views introduction and spec][views]
* [datapackage-render-js][] - this is the library that implements conversion from the data package views spec to vega/plotly and then svg or png
[views]: /docs/dms/publishers/views
[datapackage-render-js]: https://github.com/frictionlessdata/datapackage-render-js
[dpr-js]: https://github.com/frictionlessdata/dpr-js
## The Tool Chain
***Figure 1: From Data Package View Spec to Rendered output***
```mermaid
graph TD
pre[Pre-cursor views e.g. Recline] --bespoke conversions--> dpv[Data Package Views]
dpv --"normalize (correct any variations and ensure key fields are present)"--> dpvn["Data Package Views<br />(Normalized)"]
dpvn --"compile in resource & data ([future] do transforms)"--> dpvnd["Self-Contained View<br />(All data and schema inline)"]
dpvnd --compile to native spec--> plotly[Plotly Spec]
dpvnd --compile to native spec--> vega[Vega Spec]
plotly --render--> html[svg/png/etc]
vega --render--> html
```
**IMPORTANT**: an important "convention" we adopt for the "compiling-in" of data is that resource data should be inlined into an `_values` attribute. If the data is tabular this attribute should be an array of *arrays* (not objects).
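A sketch of that compiling-in step in Python. The field handling (and using `requests` plus the `csv` module) is an assumption; the point being illustrated is the `_values` array-of-arrays convention.
```python
import csv
import io

import requests

def compile_in_resource(resource_descriptor):
    """Return a copy of the resource descriptor with its data inlined."""
    inline = dict(resource_descriptor)
    text = requests.get(resource_descriptor['path']).text
    rows = list(csv.reader(io.StringIO(text)))
    # Convention: tabular data goes into `_values` as an array of *arrays*
    # (data rows only; the header is assumed to be described by the schema).
    inline['_values'] = rows[1:]
    return inline
```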
### Graphs
***Figure 2: Conversion paths***
```mermaid
graph LR
inplotly["Plotly DP Spec"] --> plotly[Plotly JSON]
simple[Simple Spec] --> plotly
simple .-> vega[Vega JSON]
invega[Vega DP Spec] --> vega
vegalite[Vega Lite DP Spec] --> vega
recline[Recline] .-> simple
plotly --plotly lib--> svg[SVG / PNG]
vega --vega lib--> svg
classDef implemented fill:lightblue,stroke:#333,stroke-width:4px;
class recline,simple,plotly,svg,inplotly,invega,vega implemented;
```
Notes:
* Implemented paths are shown in lightblue - code for this is in [datapackage-render-js][]
* Left-most column (Recline): pre-specs that we can convert to our standard specs
* Second-from-left column: DP View spec types.
* Second-from-right column: the graphing libraries we can use (which all output to SVG)
### Geo support
**Note**: support for customizing maps is limited to JS atm - there is no real map "spec" in JSON yet beyond the trivial version.
**Note**: vega has some geo support but geo here means full geojson style mapping.
```mermaid
graph LR
geo[Geo Resource] --> map
map[Map Spec] --> leaflet[Leaflet]
classDef implemented fill:lightblue,stroke:#333,stroke-width:4px;
class geo,map,leaflet implemented;
```
### Table support
```mermaid
graph LR
resource[Tabular Resource] --> table
table[Table Spec] --> handsontable[HandsOnTable]
table --> html[Simple HTML Table]
classDef implemented fill:lightblue,stroke:#333,stroke-width:4px;
class resource,table,handsontable implemented;
```
### Summary
***Figure 3: From Data Package View to Rendered output flow (richer version of diagram 1)***
<img src="https://docs.google.com/drawings/d/1M_6Vcal4PPSHpuKpzJQGvRUbPb5yeaAdRHomIIbfnlY/pub?w=790&h=1402" />
## Views in the Showcase
To render Data Packages in browsers we use DataHub views written in JavaScript. The module is implemented in the ReactJS framework and can render tables, maps and various graphs using third-party libraries.
Implementing code can be found in:
* [dpr-js repo][dpr-js] - which in turn depends on [datapackage-render-js][]
```mermaid
graph TD
url["metadata URL passed from back-end"]
dp-js[datapackage-js]
dprender[datapackage-render-js]
table["table view"]
chart["graph view"]
hot[HandsOnTable]
map[LeafletMap]
vega[Vega]
plotly[Plotly]
browser[Browser]
url --> dp-js
dp-js --fetched dp--> dprender
dprender --spec--> table
table --1..n--> hot
dprender --geojson--> map
dprender --spec--> chart
chart --0..n--> vega
chart --0..n--> plotly
hot --table--> browser
map --map--> browser
vega --graph--> browser
plotly --graph--> browser
```
Notice that DataHub views render a table view per tabular resource. If a GeoJSON resource is given, they render a map. Graph views should be specified in the `views` property of a Data Package.
## Appendix
There is a separate page with [additional research material regarding views specification and tooling][views-research].
[views-research]: /docs/dms/datahub/developers/views-research

# DataHub v3 (Next)
## Introduction
Overview of the third generation DataHub. In planning since 2019 this will launch in 2021. For background on v1 and v2 and how we came to v3 see [History section below](#history).
## What
Make it stupidly easy, fast and reliable to share your data in a **useable**<sup>*</sup> way<sup>**</sup>.
<small><sup>*</sup> It is already easy to "share" data: just use dropbox, google drive, s3, github etc. However, it's not so easy to share it in a way that's usable e.g. with descriptions for the columns, data that's viewable and searchable (not just raw), with clearly associated docs, with an API etc.</small>
<small><sup>**</sup> Not only with others but with *yourself*. This may sound a bit odd: don't you already have the data? What we mean is, for example, going from a raw CSV to a browseable table (share it with your "eyes") or converting it to an API so that you can use it in a data driven app or analysis (sharing from one tool to another).</small>
Make it easy **for whom**? Power users like data engineers and data scientists. People familiar with a command line and github.
### An Analogy
There is a useful analogy with Vercel (Zeit). Vercel focuses on webapp deployment and developer experience. We focus on data "deployment" and data (wrangler) experience.
| Vercel | DataHub |
|--------|---------|
| A platform for deploying webapps (esp Next.JS) with a focus on simplicity, speed and DX | A platform for "deploying" datasets with a focus on simplicity, speed and DX (data experience). Deploying = a "portal-like" presentation of the data plus e.g. APIs, workflows etc. |
Aside: in a further analogy, we will also have an open-source data presentation framework "Portal.JS" which has some analogies with Next.JS:
| Vercel | DataHub |
|--------|---------|
| **Next.JS**: a framework for building webapps (with react) | **Portal.JS**: the data presentation framework (a framework for presenting data(sets) and building data-driven micro web apps) |
## Features
`data` will be used throughout for the DataHub command line tool.
* Get a Portal (Present your data)
* Local (Preview)
* Deployed online
* Data API
* (?) Local
* Deployed online
* Hub: management UI (and API)
### Portal
I have a dataset `my-data`
```bash
README.md
data.csv
## descriptor is optional (we infer if not there)
# datapackage.json
```
I can do:
```bash
cd my-data
data portal
```
And I get a nice dataset page like this available locally at e.g. http://localhost:3000:
![](https://i.imgur.com/KSEtNF1.png)
#### Details
* Elegant presentation
* Shows the data in a table etc (searchable / filterable)
* Supports other data formats e.g. json, xlsx etc
* Show graphs (vega, plotly)
* Show maps
* Data summary
* Works with …
* README + csv
* Frictionless dataset
* Frictionless resource
* pure README with frontmatter
Bonus
* Copes with lots of data files
* (?) Git history if you have it … (with data oriented support e.g. diffs)
* ?? (for local not sure?) gives me a queryable api with the data … (< 100MB)
Bonus ++:
* Customizable themes
### Deploy
```bash
cd my-data
data deploy
```
Gives me a url like:
```
https://dataset.owner.datahub.io
```
#### Details
* Deploys a shareable url with the content of show
* Semi-private
* Can integrate access control (?)
* Deploys a data API
* [Other integrations e.g. push to google spreadsheets]
### API
Run `data deploy` and you get an API to your data that you and others can use (as well as the portal):
```bash
https://dataset.owner.datahub.io/api
# query it ...
https://dataset.owner.datahub.io/api?file=data.csv&q=abc
```
NB: this is tabular data only (or JSON data in tabular structure.)
#### Details
* GraphQL by default (?)
* Maybe a basic
* Also can expose raw SQL
* Get an API explorer
* API shows up in portal presentation
* Can customize the API with a yml file
Bonus
* Authentication
* Usage tracking
* Billing per usage
### Hub
I can log in at datahub.io and go to datahub.io/dashboard and I have a Dashboard showing all my DataHub projects
### Github integration
Push to github and automatically deploy to DataHub.
#### Details
* Overview and add/track your GitHub projects from DataHub dashboard.
* In the future, we may add other platforms like GitLab etc.
* Deploy on every push to any branch
* Main branch => production => main URL
* Other branches => with hash or branch name => branchOrHash.dataset.username.datahub.io
* Create a new project/dataset from a template (?)
* Have a status check in the GitHub UI that shows if your deployment succeeded/failed/pending
## History
### v1 DataHub(.io)
This was the original CKAN.net starting from 2007 up until circa 2016. It was powered by CKAN and changed from CKAN.net to DataHub.io (first thedatahub.org) around 2012.
### v2 DataHub(.io)
In 2016-2017 DataHub v2 was created and launched. This was a rewrite from scratch with a vision of a next generation DataHub. Datasets were all Frictionless, all data was stored (no more metadata-only datasets), and there were built-in data workflows for processing data, including validation, data summarization and CSV to JSON conversion. Visualizations were supported, there was a completely new and elegant UI, and it was command line by default (in fact there was no UI for creating datasets, though one was planned).
v2 was actively worked on from ~2016 to late 2018. It was a major advance technically and product-wise on the old DataHub. However, it did not get traction and was rather complex (even if simpler than old CKAN): not only did we build presentation but we were also building our own data factory from the ground up, plus doing versioning. There are other people doing parts of that better (we even argued for using airflow vs building our own factory at the time). In addition, I would observe that:
* Code goes with the data so you want to keep them together
* People want to use their existing tools (e.g. pandas or airflow vs another ETL system)
* People want to keep their data (and code) in their “system” if possible (for security, compliance, privacy etc)
We were trying to solve several different problems (with very limited resource):
* Data showcasing
* Lightweight ETL (data factory)
* Data deployment (some combination of the above)
* Marketplace
In particular, we never resolved the lingering ambiguity of whether we were a data marketplace (we tried this as a pivot for some period from late 2017) or a data publishing platform.
### v3 DataHub
We've been thinking about a v3 of DataHub since 2019. Originally, the core idea was to retain catalog aspect but move to being more git(hub) backed. See for example this (deprecated) outline idea for v3: https://github.com/datopian/datahub-next (NB: Git(hub) backed data portals remain an active idea and as of Q1 2021 we've implemented one and plan more. However it is not the focus for DataHub but is instead probably part of the Portal.JS and CKAN evolution).

# DMS (Data Management System)
This document is an introduction to the technical design of Data Management Systems (DMS). This also covers Data Portals since Data Portals are one major solution one can build with a data management system.
## Domain Model
* Project: a data project. It has a single dataset in the same way a GitHub or GitLab "project" has a single repo. Traditionally in, say, CKAN, this has been implicit and identified with the dataset. There are, however, important differences: a project can include a dataset but also other related functionality such as issues, workflows etc.
* Dataset: a set of data, usually zero or more resources.
* Resource (or File): a single data object.
Revisioning
* Revision
* Tag
* (Branch)
Presentation
* View
* Showcase
* Data API
Identity and Permissions
* Account
* Profile
* Permission
Data Factory
* Task
* DAG (Pipeline)
* Run (Job)
### GraphQL version
```graphql=
type Project {
id: ID!
description: String
readme: String
dataset: Dataset
views: [View]
issues: [Issue]
actions: [Action]
}
type Dataset {
# data package descriptor structure
id: ID
name: String
...
resources: [Resource]
}
type Resource {
# follows Frictionless Resource
path: ...
id: ...
name: ...
schema: Schema
}
# Table Schema usually ...
type Schema {
}
# dataset view e.g. table, graph, map etc
type View {
id: ID!
}
```
## Actions / Flows [component]
* View Dataset: [Showcase page] a page displaying the dataset (or a resource)
* View a Revision / Tag / Branch:
* Add / Upload: ...
* Tag
## Components
* **Meta~~Store~~Service**: stores dataset metadata (and revisions)
* **HubStore**: stores all the users, organizations and their connections to the datasets.
* **SearchStore + Service**: search index and API
* **BlobStore**: stores blobs (for files)

# Data Processing: Data Flows and Data Factories
## Introduction
A common aspect of data management is **processing** data in some way or another: cleaning it, converting it from one format to another, integrating different datasets together etc. Such processing usually takes place in what are termed data (work)flows or pipelines. Each flow or pipeline consists of one or more stages with one particular operation (task) being done with the data at each stage. Finally, there is a need for something to manage and orchestrate the data flows/pipelines. This overall system which includes both the flows themselves and the framework for managing them needs a name. We call it a "Data Factory".
Let's have some concrete examples of simple pipelines:
* Loading a raw CSV file into a database (e.g. to power the data API)
* Converting a file from one format to another e.g. CSV to JSON
* Loading a file, validating it and then computing some summary statistics
Fig 1: A simple data pipeline to clean up a CSV file
```mermaid
graph TD
source[Source data e.g. load from CSV]
t1[Transform 1 e.g. delete trailing rows]
t2[Transform 2 e.g. lower case everything]
sink[Write output to CSV]
source -- resource --> t1 --resource--> t2 --resource--> sink
```
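For illustration, here is Fig 1 sketched as code using the `dataflows` library referenced in the appendix below; only the lower-casing transform is shown, and the file names and processor choice are assumptions.
```python
from dataflows import Flow, load, dump_to_path

def lower_case_everything(row):
    # One of the transforms from Fig 1: lower-case every string value in the row.
    for key, value in row.items():
        if isinstance(value, str):
            row[key] = value.lower()

Flow(
    load('source.csv'),       # Source: load from CSV
    lower_case_everything,    # Transform (further steps would slot in here)
    dump_to_path('clean'),    # Sink: write cleaned CSV + datapackage.json
).process()
```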
### Domain Model
* Tasks: a single processing step that is applied to data
* Flows (DAGs): a flow or pipeline of tasks. These tasks form a "directed acyclic graph" (DAG) where each task is a node.
* Factory: a system for creating and managing (orchestrating, monitoring etc) those flows.
Each flow or pipeline consists of one or more stages with one particular operation (task) being done with the data at each stage.
In a basic setup a flow is linear: the data arrives, operation A happens, then operation B, and, finally operation C. However, more complex flows/pipelines can involve branching e.g. the data arrives, then operation A, then there is a branch and operation B and C can happen independently.
**Fig 2: An illustration of a Generic Branching Flow (DAG)**
```mermaid
graph TD
a[Source] --> b
a --> c
b --> d
c --> d
d --> sink[Sink]
```
## CKAN v2
CKAN v2 has two implicit data factory systems embedded in other functionality. These systems are technically entirely independent:
* Data Load system for loading data to the DataStore -- see [Data Load page &raquo;](/docs/dms/load/)
* Harvesting system for importing dataset metadata from other catalogs -- see [Harvesting page &raquo;](/docs/dms/harvesting/)
## CKAN v3
The Data Factory system is called AirCan and is built on top of AirFlow. AirCan itself can be used on its own or integrated with CKAN.
AirCan:
* Runner: Apache AirFlow
* Pipelines/DAGs: https://github.com/datopian/aircan. This is a set of AirFlow DAGs designed for common data processing operations in a DMS such as loading data into Data API storage.
CKAN integration:
* CKAN extension: https://github.com/datopian/ckanext-aircan. This hooks key actions from CKAN into AirCan and provides an API to run the flows (DAGs).
* GUI: Under development.
**Status**: Beta. AirCan and ckanext-aircan are in active use in production. GUI is under development.
**Documentation**: including setup and use of all the components, including the CKAN integration, can be found at https://github.com/datopian/aircan
### Design
See [Design page](/docs/dms/flows/design).
## Links
* [Research](/docs/dms/flows/research) - list of tools and concepts with some analysis
* [History](/docs/dms/flows/history) - some previous thinking and work (2016-2019)
## Appendix: Our Previous Work
See also [History page](/docs/dms/flows/history).
* http://www.dataflows.org/ - new system Adam + Rufus designed in spring 2018 and Adam led development on
* https://github.com/datahq/dataflows
* https://github.com/datahq/dataflows/blob/master/TUTORIAL.md
* https://github.com/datahq/dataflows/blob/master/PROCESSORS.md
* https://github.com/datopian/dataflow-demo - Rufus outline on a very simple tool from April 2018
* https://github.com/datopian/factory - datahub.io Factory: "The service is responsible for running the flows for datasets that are frequently updated and maintained by Datahub. The service uses Datapackage Pipelines, a framework for declarative stream-processing of tabular data, and DataFlows to run the flows through pipelines to process the datasets."

# Data Factory Design
Our Data Factory system is called AirCan. A Data Factory is a set of services/components to process and integrate data (coming from different sources). Plus patterns / methods for integrating with CKAN and the DataHub.
## Components
```mermaid
graph LR
subgraph Orchestration
airflow[AirFlow]
airflowservice[AirFlow service]
end
subgraph CKAN integration
ckanhooks[CKAN extension to trigger and report on factory activity]
ckanapi[API for triggering DAGs etc]
ckanui[UI integration - display info on ]
end
subgraph Processors and Flows
ckandatastoreload[CKAN Loader lib]
ckanharveters[CKAN Harvesters]
validation[Validation Lib]
end
```
## DataStore Load job story
### Reporting Integration
When I upload a file to CKAN and it is getting loaded to the datastore (automatically), I want to know if that succeeded or failed so that I can share with my users that the new data is available (or do something about the error).
For a remote Airflow instance (let's say on Google Composer), describe the DAG tasks and the process. i.e.
* File upload on CKAN triggers the ckanext-aircan connector
* which makes API request to airflow on GCP and triggers a DAG with following parameters
* A f11s resource object including
* the remote location of the CSV file and the resource ID
* The target resource id
* An API key to use when loading to CKAN datastore
* [A callback url]
* The DAG
* deletes the datastore table if it exists
* creates a new datastore table
* loads the CSV from the specified location (information available in the DAG parameters)
* converts the CSV to JSON. The output of the converted JSON file will be in a bucket on GCP.
* upserts the JSON data row by row into the CKAN DataStore via CKAN's DataStore API
* This is what we have now: invoke `/api/3/action/datastore_create`, passing the contents of the json file
* OR using upsert with inserts (faster) NB: datapusher just pushes the whole thing into `datastore_create` so stick with that.
* OR: if we are doing postgres copy we need direct access to postgres DB
* ... [tbd] notifies CKAN instance of this (?)
Error Handling and other topics to consider
* How can we let CKAN know something went wrong? Shall we create a way to notify a certain endpoint on ckannext-aircan connector?
* Shall we also implement a timeout on CKAN?
* What are we going to display in case of an error?
* The "tmp" bucket on GCP will eventually get full of files; shall we flush it? How do we know when it's safe to delete a file?
* Lots of ways up this mountain.
* What do we do for large files?
## AirCan API
AirCan is built on AirFlow so we have the same basic API (TODO: insert link)
However, we have standard message formats to pass to DAGs, following these principles: all dataset and data resource objects should follow the Frictionless specs
Pseudo-code showing how we call the API:
```python=
airflow.dag_run({
    "conf": {
        "resource": json.dumps({  # f11s resource object
            "resource_id": ...,
            "path": ...,
            "schema": ...,
        }),
        "ckan_api_key": ...,
        "ckan_api_endpoint": "https://demo.ckan.org/api/",
    }
})
```
See for latest, up to date version: https://github.com/datopian/ckanext-aircan/blob/master/ckanext/aircan_connector/action.py#L68
## CKAN integration API
There is a new API as follows:
`http://ckan:5000/api/3/action/aircan_submit?dag_id=...&dataset=...&resource`
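A hedged sketch of calling that action over HTTP with the `requests` library; the payload shape and parameter names here are assumptions and should be checked against ckanext-aircan.
```python
import requests

def aircan_submit(ckan_url, api_key, dag_id, dataset, resource):
    resp = requests.post(
        f'{ckan_url}/api/3/action/aircan_submit',
        headers={'Authorization': api_key},
        json={'dag_id': dag_id, 'dataset': dataset, 'resource': resource},
    )
    resp.raise_for_status()
    return resp.json()

# e.g. aircan_submit('http://ckan:5000', API_KEY,
#                    'ckan_api_load_gcp', 'my-dataset', 'my-resource-id')
```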
Also DAGs can get triggered on events ... TODO: go look at Github actions and learn from it ...
## Architecture
Other principles of architecture:
* AirFlow tasks and DAGs should do very little themselves and should hand off to separate libraries. Why? To have better separation of concerns and **testability**. AirCan is reasonably cumbersome to test but an SDK is much more testable.
* Thus AirFlow tasks are often just going to pass through arguments TODO: expand this with an example ...
* AirFlow DAG will have incoming data and config set in "global" config for the DAG and so available to every task ...
* Tasks should be as decoupled as possible. Obviously there *is* some data and metadata passing between tasks and that should be done by writing those to a storage bucket. Metadata MUST be stored in f11s format.
* See this interesting blog post (not scientific) about why the previous approach, with side effects, is not very resilient in the long run of a project: https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
* don't pass data explicitly between tasks (rather it is passed implicitly via an expectation of where the data is stored ...)
* tasks and flows should be re-runnable ... (no side effects principle)
Each task can write to this location:
```
bucket/dagid/runid/taskid/resource.json
bucket/dagid/runid/taskid/dataset.json
bucket/dagid/runid/taskid/... # data files
```
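A tiny helper illustrating that storage convention; the function and bucket names are assumptions.
```python
def task_output_path(bucket, dag_id, run_id, task_id, filename):
    # bucket/dagid/runid/taskid/<file> -- where each task writes its outputs
    return f'{bucket}/{dag_id}/{run_id}/{task_id}/{filename}'

# e.g. task_output_path('aircan-tmp', 'ckan_api_load_gcp', 'run-42',
#                       'convert_csv_to_json', 'resource.json')
```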
## UI in DMS
URL structure on a dataset
```
# xxx is a dataset
/@myorg/xxx/actions/
/@myorg/xxx/actions/runs/{id}
```
Main question: to display to user we need some way to log what jobs are associated with what datasets (and users) and perhaps their status
* we want to keep factory relatively dumb (it does not know about datasets etc etc)
* in terms of capabilities we need a way to pass permissions into the data factory (you hand over the keys to your car)
Simplest approach:
* MetaStore (CKAN metadata db) has a Jobs table with the structure `| id | factory_id | job_type | created | updated | dataset | resource | user | status | info |` (where info is a json blob) - see the sketch after this list
* status = one of `WAITING | RUNNING | DONE | FAILED | CANCELLED`. If failed we should have stuff in info about that.
* `job_type` = one of `HARVEST | LOAD | VALIDATE ...` it is there so we could have several different factory jobs in one db
* `info`: likely stuff
* run time
* error information (on failure)
* success information: what was outcome, where are outputs if any etc
* On creating a job in the factory, the factory returns a factory id. The metastore stores the factory id in a new job object along with dataset and user info ...
* Qu: why have id and factory_id separate? is there any situation where you have a job w/o a factory id?
* Then on loading a job page in frontend you can poll the factory for info and status (if status is WAITING or RUNNING)
* => do we need the `info` column on the job (it's just a cache of this info)?
* Ans: useful for jobs which are complete so we don't keep polling the factory (esp if factory deletes stuff)
* Can list all jobs for a given dataset (or resource) with info about them
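A rough sketch of that Jobs table as SQLite DDL driven from Python. Column names follow the list above; the types and `CHECK` constraints are assumptions and would be extended as needed.
```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    id          TEXT PRIMARY KEY,
    factory_id  TEXT,
    job_type    TEXT CHECK (job_type IN ('HARVEST', 'LOAD', 'VALIDATE')),  -- extend as needed
    created     TIMESTAMP,
    updated     TIMESTAMP,
    dataset     TEXT,
    resource    TEXT,
    user        TEXT,
    status      TEXT CHECK (status IN
                 ('WAITING', 'RUNNING', 'DONE', 'FAILED', 'CANCELLED')),
    info        TEXT  -- JSON blob: run time, error / success details
);
"""

with sqlite3.connect('metastore.db') as conn:
    conn.executescript(DDL)
```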
Qus:
* For Data Factory what do I do with Runs that are stale etc - how do I know who they are associated with. Can I store metadata on my Runs like who requested it etc.
### UI Design
Example from Github:
![](https://i.imgur.com/xnTRq5T.png)
## Appendix
### Notes re AirCan API
https://medium.com/@ptariche/interact-with-apache-airflows-experimental-api-3eba195f2947
```
{"message":"Created <DagRun ckan_api_load_gcp @ 2020-07-14 13:04:43+00:00: manual__2020-07-14T13:04:43+00:00, externally triggered: True>"}
GET /api/experimental/dags/<string:dag_id>/dag_runs/<string:execution_date>
GET /api/experimental/dags/ckan_api_load_gcp/dag_runs/2020-07-14 13:04:43+00:00
https://b011229e45c662be6p-tp.appspot.com/api/experimental/dags/ckan_api_load_gcp/dag_runs/2020-07-14T13:04:43+00:00
Resp: `{"state":"failed"}`
```
### Google Cloud Composer
Google Cloud Composer is a hosted version of AirFlow on Google Cloud.
#### How Google Cloud Composer differs from local AirFlow
* File handling: On GCP, all the file handling must become interaction with a bucket ~rufus: what about from a url online (but not a bucket)
Specifying the csv resource location (on a local Airflow) must become sending a resource to a bucket (or just parsing it from the JSON body). When converting it to a JSON file, it must become an action of creating a file on a bucket.
* Authentication: TODO
### AirFlow Best Practices
* Should you and how do you pass information between tasks?
* https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
* https://towardsdatascience.com/airflow-sharing-data-between-tasks-7bbaa27eeb1
### What terminology should we use?
ANS: we use AirFlow terminology:
* Task
* DAG
* DagRun
For internals what are the options?
* Task or Processor or ...
* DAG or Flow or Pipeline?
TODO: table summarizing options in AirFlow, Luigi, Apache Beam etc.
#### UI Terminology
* Actions
* Workflows
Terminology options
* Gitlab
* Pipelines: you have
* Jobs (runs of those)
* Schedules
* Github
* Workflows
* Runs
* (Schedules - not explicit)
* Airflow
* DAGs
* Tasks
* DAG Runs

# Data Factory and Flows Design - Oct 2017 to Apr 2018
Date: 2018-04-08
> [!note]
>This is miscellaneous content from various HackMD docs. I'm preserving it because either a) there is material to reuse here that I'm not sure is elsewhere or b) there were various ideas in here we used later (and it's useful to see their origins).
>
>Key content:
>
>* March-April 2018: first planning of what became dataflows (had various names including dataos). A lot of my initial ideas ended up in this micro-demo https://github.com/datopian/dataflow-demo. This evolved with Adam into https://github.com/datahq/dataflows
>* Autumn 2017: planning of Data Factory which was the data processing system inside DataHub.io. This was more extensive than dataflows (e.g. it included a runner, an assembler etc) and was based original data-package-pipelines and its runner. Issues with that system was part of the motivation for starting work on dataflows.
>
>~Rufus May 2020
## Plan April 2018
* tutorial (what we want our first post to look like)
* And then implement minimum for that
* programmatic use of pipelines and processors in DPP
* processor abstraction defined ...
* DataResource and DataPackage object that looks like [Frictionless Lib pattern][frictionless-lib]
* processors library split out
* code runner you can call dataos.run.runSyncPipeline
* dataflow init => python and yaml
* @adam Write up Data Factory architecture and naming as it currently stands [2h]
[frictionless-lib]: http://okfnlabs.org/blog/2018/02/15/design-pattern-for-a-core-data-library.html
## 8 April 2018
Lots of notes on DataFlow which are now moved and refactored into https://github.com/datahq/dataflow-demo
The Domain Model of Factory
* Staging area
* Planner
* Runner
* Flow
* Pipelines
* Processors
```mermaid
graph TD
flow[Flow]
pipeline[Pipelines]
processor[Processors]
flow --> pipeline
pipeline --> processor
```
Assembler ...
```mermaid
graph LR
source[Source from User<br/>source.yaml] --assembler--> plan[Execution Plan<br/>DAG of pipelines]
plan --> pipeline[Pipeline Runner<br/>Optimizer/Dependency management]
```
=> Assembler generates a DAG.
- dest filenames in advance ...
- for each pipeline: the pipelines it depends on
- e.g. sqlite: depends on all derived csv pipelines running
- node depends: all csv, all json pipelines running
- zip: depends on all csv running
- Pipelines
```mermaid
graph LR
source[Source Spec Parse<br/>validate?] --> planner[Planner]
planner --> workplanb(Plan of Work<br/><br/>DAG of pipelines)
subgraph Planner
planner
subplan1[Sub Planner 1 e.g. SQLite]
end
```
## 5 Oct 2017
Notes:
* Adam: finding more and more bugs (edge cases) and then applying fixes but then more issues
* => Internal data model of pipelines was wrong ...
* Original data model has a store: with one element, the pipeline, + a state (it is idle, invalid, waiting to be executed, running, or dirty)
* Problem starts: you have a very long pipeline ...
* something changes and the pipeline gets re-added to the queue. Then you have the same pipeline in the queue in two different states. It should not be a state of the pipeline but a state of the execution of the pipeline.
* Split model: pipeline (with their hash) + "runs" ordered by time of request
Questions:
* Tests in assembler ...
### Domain Model
```mermaid
graph TD
flow[Flow]
pipeline[Pipelines]
processor[Processors]
flow --> pipeline
pipeline --> processor
```
* Pipelines have no branches; they are always linear
* Input: nothing or a file or a datapackage (source is stream or nothing)
* Output: datapackage - usually dumped to something (could be stream)
* Pipelines are a list of processors **and** their inputs
* A Flow is a DAG of pipelines
* In our case: one flow produces a "Dataset" at a given "commit/run"
* Source Spec + DataPackage[.json] => (via assembler) => Flow Spec
* Runner
* Pipelines runner: a set of DAG of pipelines (where each pipeline is schedule to run once all dependencies have been run)
* Events => lead to new flows or pipelines being created ... (or existing ones being stopped or destroyed)
```mermaid
graph LR
subgraph flow2
x --> z
y --> z
end
subgraph flow1
a --> c
b --> c
end
```
State of Factory(flows, state)
f(event, state) => state
flow dependencies?
Desired properties
* We economise on runs: we don't rerun processors (pipelines?) that have the same config and input data
* => one reason for breaking down into smaller "operators" is that we economise here ...
* Simplicity: the system is understandable ...
* Processors (pipelines) are atomic - they get their configuration and run ...
* We can generate from a source spec and an original datapackage.json a full set of pipelines / processors.
* Pipelines as Templates vs Pipelines as instances ...
pipeline id = hash(pipeline spec, datapackage.json)
{pipelineid}/...
next pipeline can rely on {pipelineid}
Planner ...
=> a pipeline is never rerun (once it is run)
| | Factory | Airflow |
|--------------|----------------------------|-----------|
| DAG | Implicit (no concept) | DAG |
| Node | Pipelines (or processors?) | Operators |
| Running Node | ? Running pipelines | Tasks |
| Comments | | |
https://airflow.incubator.apache.org/concepts.html#workflows
# Analysis from work for Preview
As a Publisher I want to upload a 50Mb CSV so that the showcase page works - it does not crash the browser (because it is trying to load and display 50Mb of CSV)
*plus*
As a Publisher I want to customize whether I generate a preview for a file or not so that I don't get inappropriate previews
> As a Publisher I want to have an SQLite version of my data auto-built
>
> As a Publisher I want to upload an Excel file and have csv versions of each sheet and an sqlite version
### Overview
*This is what we want*
```mermaid
graph LR
subgraph Source
file[vix-daily.csv]
view[timeseries-1<br/>Uses vix-daily]
end
subgraph "Generated Resources"
gcsv[derived/vix-daily.csv]
gjson[derived/vix-daily.json]
gjsonpre[derived/vix-daily-10k.json]
gjsonpre2[derived/view-time-series-1.json]
end
subgraph "Generated Views"
preview[Table Preview]
end
file --rule--> gcsv
file --rule--> gjson
file --rule--> preview
view --> gjsonpre2
preview --> gjsonpre
```
### How does this work?
#### Simple example: no previews, single CSV in source
```mermaid
graph LR
load[Load Data Package<br/>Parse datapackage.json]
parse[Parse source data]
dump((S3))
load --> parse
parse --> dcsv
parse --> djson
parse --> dumpdp
dcsv --> dump
djson --> dump
dsqlite --> dump
dnode --> dump
dumpdp --> dump
dcsv --> dnode
dcsv --> dsqlite
subgraph "Dumpers 1"
dcsv[Derived CSV]
djson[Derived JSON]
end
subgraph "Dumpers 2 - after Dumper 1"
dsqlite[SQLite]
dnode[Node]
end
dumpdp[Derived DataPackage.json<br/><br/>Assembler gives it the DAG info<br/>Runs after everything<br/>as needs size,md5 etc]
```
```yaml=
meta:
owner: <owner username>
ownerid: <owner unique id>
dataset: <dataset name>
version: 1
findability: <published/unlisted/private>
inputs:
- # only one input is supported atm
kind: datapackage
url: <datapackage-url>
parameters:
resource-mapping:
<resource-name-or-path>: <resource-url>
outputs:
- ... see https://github.com/datahq/pm/issues/17
```
#### With previews
```mermaid
graph LR
load[Load Data Package<br/>Parse datapackage.json]
parse[Parse source data]
dump((S3))
load --> parse
parse --> dcsv
parse --> djson
parse --> dumpdp
dcsv --> dump
djson --> dump
dsqlite --> dump
dnode --> dump
dumpdp --> dump
dcsv --> dnode
dcsv --> dsqlite
parse --> viewgen
viewgen --> previewgen
previewgen --view-10k.json--> dump
subgraph "Dumpers 1"
dcsv[Derived CSV]
djson[Derived JSON]
end
subgraph "Dumpers 2 - after Dumper 1"
dsqlite[SQLite]
dnode[Node]
viewgen[Preview View Gen<br/><em>Adds preview views</em>]
previewgen[View Resource Generator]
end
dumpdp[Derived DataPackage.json<br/><br/>Assembler gives it the DAG info<br/>Runs after everything<br/>as needs size,md5 etc]
```
### With Excel (multiple sheets)
Source is vix-daily.xls (with 2 sheets)
```mermaid
graph LR
load[Load Data Package<br/>Parse datapackage.json]
parse[Parse source data]
dump((S3))
dumpdp[Derived DataPackage.json<br/><br/>Assembler gives it the DAG info<br/>Runs after everything<br/>as needs size,md5 etc]
load --> parse
parse --> d1csv
parse --> d1json
parse --> d2csv
parse --> d2json
parse --> dumpdp
d1csv --> dump
d1json --> dump
d2csv --> dump
d2json --> dump
dsqlite --> dump
dnode --> dump
dumpdp --> dump
d1csv --> dnode
d1csv --> dsqlite
d2csv --> dnode
d2csv --> dsqlite
d1csv --> view1gen
d2csv --> view2gen
view1gen --> preview1gen
view2gen --> preview2gen
preview1gen --view1-10k.json--> dump
preview2gen --view2-10k.json--> dump
subgraph "Dumpers 1 sheet 1"
d1csv[Derived CSV]
d1json[Derived JSON]
end
subgraph "Dumpers 1 sheet 2"
d2csv[Derived CSV]
d2json[Derived JSON]
end
subgraph "Dumpers 2 - after Dumper 1"
dsqlite[SQLite]
dnode[Node]
view1gen[Preview View Gen<br/><em>Adds preview views</em>]
preview1gen[View Resource Generator]
view2gen[Preview View Gen<br/><em>Adds preview views</em>]
preview2gen[View Resource Generator]
end
```
datapackage.json
```javascript=
{
  resources: [
    {
      "name": "mydata",
      "path": "mydata.xls"
    }
  ]
}
```
```yaml=
meta:
owner: <owner username>
ownerid: <owner unique id>
dataset: <dataset name>
version: 1
findability: <published/unlisted/private>
inputs:
- # only one input is supported atm
kind: datapackage
url: <datapackage-url>
parameters:
resource-mapping:
<resource-name-or-path>: <resource-url>
processing:
-
input: <resource-name> # mydata
output: <resource-name> # mydata_sheet1
tabulator:
sheet: 1
-
input: <resource-name> # mydata
output: <resource-name> # mydata_sheet2
tabulator:
sheet: 2
```
```yaml=
meta:
owner: <owner username>
ownerid: <owner unique id>
dataset: <dataset name>
version: 1
findability: <published/unlisted/private>
inputs:
- # only one input is supported atm
kind: datapackage
url: <datapackage-url>
parameters:
resource-mapping:
<resource-name-or-path>: <resource-url> // excel file
=> (implicitly and in the cli becomes ...)
...
processing:
-
input: <resource-name> # mydata
output: <resource-name> # mydata-sheet1
tabulator:
sheet: 1
```
Result
```javascript=
{
  resources: [
    {
      "name": "mydata",
      "path": "mydata.xls"
    },
    {
      "path": "derived/mydata.xls.sheet1.csv",
      "datahub": {
        "derivedFrom": "mydata"
      }
    },
    {
      "path": "derived/mydata.xls.sheet2.csv",
      "datahub": {
        "derivedFrom": "mydata"
      }
    }
  ]
}
```
### Overall component design
Assembler ...
```mermaid
graph LR
source[Source from User<br/>source.yaml] --assembler--> plan[Execution Plan<br/>DAG of pipelines]
plan --> pipeline[Pipeline Runner<br/>Optimizer/Dependency management]
```
=> Assembler generates a DAG.
- dest filenames in advance ...
- for each pipeline: the pipelines it depends on
- e.g. sqlite: depends on all derived csv pipelines running
- node depends: all csv, all json pipelines running
- zip: depends on all csv running
- Pipelines
```mermaid
graph LR
source[Source Spec Parse<br/>validate?] --> planner[Planner]
planner --> workplanb(Plan of Work<br/><br/>DAG of pipelines)
subgraph Planner
planner
subplan1[Sub Planner 1 e.g. SQLite]
end
```
### NTS
```
function(sourceSpec) => (pipelines, DAG)
pipeline
pipeline-id
steps = n *
processor
parameters
schedule
dependencies
```

# Data Flows + Factory - Research
## Tooling
* Luigi & Airflow
* These are task runners - managing a dependency graph between 1000s of tasks.
* Neither of them focus on actual data processing and are not a data streaming solution. Tasks do not move data from one to the other.
* AirFlow: see further analysis below
* Nifi: Server based, Java, XML - not really suitable for quick prototyping
* Cascading: Only Java support
* Bubbles http://bubbles.databrewery.org/documentation.html - https://www.slideshare.net/Stiivi/data-brewery-2-data-objects
* mETL https://github.com/ceumicrodata/mETL mito ETL (yaml config files)
* Apache Beam: see below
* https://delta.io/ - ACID for data lakes (mid 2020). Comes out of DataBricks. Is this a pattern or tooling?
## Concepts
* Stream and Batch dichotomy is probably a false one -- and unhelpful. Batch is just some grouping of stream. Batch done regularly enough starts to be a stream.
* More useful is complete vs incomplete data sources
* Hard part of streaming (or batch) work is handling the case where events arrive "late". For example, let's say I want to total up total transaction volume at a bank per day, but some transactions arrive at the server late, e.g. a transaction at 2355 actually arrives at 1207 because of network delay or some other issue; then if I batch at 1200 based on what has arrived I have an issue. Most of the work and complexity in the Beam / DataFlow model relates to this.
* Essential duality between flows and states via difference and sum. E.g. transaction and balance:
* Balance over time -- differenced --> Flow
* Flow -- summed --> Balance
* Balance is often just a cached "sum".
* Also relevant to datasets: we often think of them as states but really they are a flow.
### Inbox
* [x] DataFlow paper: "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing" (2015)
* [ ] Stream vs Batch
* [x] Streaming 101: The world beyond batch. A high-level tour of modern data-processing concepts. (Aug 2015) https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 **Good intro to streaming and DataFlow by one of its authors**
* [ ] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Follow up to previous paper
* [ ] Apache Beam **in progress -- see below**
* [ ] dbt **initial review. Mainly a convenient way of tracking in-DB transforms**
* [ ] Frictionless DataFlows
* [x] Kreps (kafka author): https://www.oreilly.com/radar/questioning-the-lambda-architecture/
* lambda architecture is where you run both batch and streaming in parallel as a way to have traditional processing plus some kind of real-time results.
* basically Kreps says it's a PITA to keep two parallel systems running and you can just go "streaming" (remember we are beyond the dichotomy)
## Apache Beam
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
### Pipeline
https://beam.apache.org/releases/pydoc/2.2.0/apache_beam.pipeline.html
Pipeline, the top-level Beam object.
A pipeline holds a DAG of data transforms. Conceptually the nodes of the DAG are transforms (PTransform objects) and the edges are values (mostly PCollection objects). The transforms take as inputs one or more PValues and output one or more PValues.
The pipeline offers functionality to traverse the graph. The actual operation to be executed for each node visited is specified through a runner object.
Typical usage:
```python
import apache_beam as beam

# Create a pipeline object using a local runner for execution.
with beam.Pipeline('DirectRunner') as p:
    # Add to the pipeline a "Create" transform. When executed this
    # transform will produce a PCollection object with the specified values.
    pcoll = p | 'Create' >> beam.Create([1, 2, 3])
    # Another transform could be applied to pcoll, e.g., writing to a text file.
    # For other transforms, refer to transforms/ directory.
    pcoll | 'Write' >> beam.io.WriteToText('./output')
    # run() will execute the DAG stored in the pipeline (called on exiting the
    # `with` block). The execution of the nodes visited is done using the
    # specified local runner.
```
## Airflow
Airflow organizes tasks in a DAG. A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
* Each task could be Bash, Python or others.
* You can connect the tasks in a DAG as you want (which one depends on which).
* Tasks could be built from Jinja templates.
* It has a nice and comfortable UI.
You can also use _Sensors_: you can wait for certain files or database changes to activate other jobs.
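For illustration, a minimal DAG along those lines (Airflow 1.x import paths; the DAG id, schedule and task contents are placeholders):
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def summarize():
    print('compute summary statistics here')

with DAG('load_and_summarize',
         start_date=datetime(2020, 1, 1),
         schedule_interval='@daily') as dag:
    # A Bash task and a Python task, connected so one depends on the other.
    load = BashOperator(task_id='load_csv', bash_command='echo "load data.csv"')
    summary = PythonOperator(task_id='summarize', python_callable=summarize)
    load >> summary  # summarize runs only after load_csv succeeds
```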
References
* https://github.com/apache/airflow
* https://medium.com/videoamp/what-we-learned-migrating-off-cron-to-airflow-b391841a0da4
* https://medium.com/@rbahaguejr/airflow-a-beautiful-cron-alternative-or-replacement-for-data-pipelines-b6fb6d0cddef
### airtunnel
https://github.com/joerg-schneider/airtunnel
* https://medium.com/bcggamma/airtunnel-a-blueprint-for-workflow-orchestration-using-airflow-173054b458c3 - excellent piece on how to pattern airflow - "airtunnel", plus overview of key tooling
> This is why we postulate to have a central declaration file (as in YAML or JSON) per data asset, capturing all these properties required to run a generalized task (carried out by a custom operator). In other words, operators are designed in a generic way and receive the name of a data asset, from which they can grab its declaration file and learn how to parameterize and carry out the specific task.
```
├── archive
├── ingest
│ ├── archive
│ └── landing
├── ready
└── staging
├── intermediate
├── pickedup
└── ready
```

# Frictionless Data and Data Packages
## What's a Data Package?
A [Data Package](https://frictionlessdata.io/data-package/) is a simple container format used to describe and package a collection of data (a dataset).
A Data Package can contain any kind of data. At the same time, Data Packages can be specialized and enriched for specific types of data so there are, for example, Tabular Data Packages for tabular data, Geo Data Packages for geo data etc.
## Data Package Specs Suite
When you look more closely you'll see that Data Package is actually a *suite* of specifications. This suite is made of small specs, many of them usuable on their own, that you can also combine together.
This approach also reflects our philosophy of "small pieces, loosely joined" as well as "make the simple things simple and complex things possible": it is easy to just use the piece you need, as well as to scale up to more complex needs.
For example, the basic Data Package spec can be combined with Table Schema spec for tabular data (plus CSV as the base data format) to create the Tabular Data Package specification.
We also decomposed the overall Data Package spec into Data Package and Data Resource with the Data Resource spec just describing an individual file and a Data Package being a collection of one or more Data Resources with additional dataset-level metadata.
**Example: Data Resource spec + Table Schema spec becomes a Tabular Data Resource spec**
```mermaid
graph TD
dr[Data Resource] --add table schema--> tdr[Tabular Data Resource]
```
**Example: How a Tabular Data Package is composed out of other specs**
```mermaid
graph TD
dr[Data Resource] --> tdr
tdr[Tabular Data Resource] --> tdp[Tabular Data Package]
dp[Data Package] --> tdp
jts[Table Schema] --> tdr
csvddf[CSV Data Descriptor] --> tdr
style tdp fill:#f9f,stroke:#333,stroke-width:4px;
```
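For concreteness, a minimal Tabular Data Package descriptor combining these specs might look like the sketch below (shown as a Python dict for illustration; it would normally live in a `datapackage.json` file, and the names, path and fields are made up):
```python
# Sketch of a Tabular Data Package descriptor: a Data Package wrapping one
# Tabular Data Resource (path + Table Schema). All names and fields are illustrative.
tabular_data_package = {
    "name": "example-package",           # Data Package level metadata
    "title": "Example Package",
    "resources": [
        {
            "name": "example-resource",  # Data Resource spec
            "path": "data/example.csv",  # CSV as the base data format
            "format": "csv",
            "schema": {                  # Table Schema spec
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "date", "type": "date"},
                    {"name": "value", "type": "number"},
                ]
            },
        }
    ],
}
```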
Two different logics of grouping:
* By function e.g. Tabular stuff ...
* Tabular Data package
* Tabular Data resource
* Inheritance / Composition structure
* Resource -> Tabular Data Resource
* Data Package -> Tabular Data Package
For developers of the specs the latter may be better.
For ordinary users I imagine the former is better.
## Tutorials
Data Package Find-Prepare-Share Guide: https://datahub.io/docs/getting-started/datapackage-find-prepare-share-guide


@ -1,113 +0,0 @@
# Frontend
The (read) frontend component covers all of the traditional "read" frontend functionality of a data portal: front page, searching datasets, viewing datasets etc.
>[!tip]Announcing Portal.js📣
>Portal.js 🌀 is a javascript framework for building rich data portal frontends fast using a modern frontend approach (JavaScript, React, SSR).
>
>https://github.com/datopian/portal.js
## Features
* **Home Page** When visiting the data portal's home page, I want to see an overview of the portal (e.g. datasets) so that I understand if it's relevant for me.
* **Search/Browse the Catalog** When looking for a dataset, I want to search for specific strings (keywords, topics etc.) so that I can find it quickly, if available.
* **Dataset Showcase** When exploring a dataset I want to see a description and key information (title etc) and (if possible) data preview and download options so that I understand what it contains and decide if I want to use it or download it.
* **Organization and User Profiles**: I want to see the data published by a particular team, organization or user so that I can find the data I want or understand what a particular group is doing
* **Groups/Topics/Collections**: I want to browse datasets by topic so that I can find the data I want or find unexpected results
* **Custom additional pages**
* **Permissions**: I want to restrict access to some of the above based on a user's role and memberships so that I can share data with only the appropriate people
Developer Experience
* **Theme** (simple): When developing a new portal, I want to theme it quickly and easily so that its look and feel is aligned to the client's needs.
* **Customize Home Page**: When building a data portal home page I want to be able to customize it completely, integrating different widgets so that I have a great landing experience for users
* **i18n**: I want to be able to i18n content and enable the client to do this so that we have a site in the client's locale
* **Rich customization** (new routes, major page changes)
* When working on a data portal, I want to add frontend functionality to existing templates so that I can build upon past work and still extend the project to my own needs.
* When building up a new frontend I want to quickly add standard pages and components (and tweak them) so that I have a basic functional site quickly
* **Use common languages and tooling**: When working on a data portal, I want to build it using Javascript so that I can rely on the latest frontend technologies (frameworks/libraries).
* **Deploy quickly**: When delivering a data portal, I want to quickly and easily deploy changes to my frontend so that I can reduce the feedback loop.
## CKAN v2
The Frontend is implemented in the core app spread across various controllers, templates etc. For extending/theming a template, you have to write an extension (`ckanext-mysite`), and either override or inherit from the default files.
* Home page. The CKAN default template shows: Site title, Search element, The latest organizations, The latest groups. In order to change this, we need to create a CKAN extension and modify templates etc.
* Search/Browse the Catalog. Already available in CKAN Classic (v2) with the ability to search by facets etc.; see an example here - https://demo.ckan.org/dataset
* Dataset Showcase. It is already available by default, for example:
* Dataset page - https://demo.ckan.org/dataset/dataset_389383 - a summary of resources and package level metadata such as package title, description, license etc.
* Package controller - https://github.com/ckan/ckan/blob/master/ckan/controllers/package.py
* Package view module - https://github.com/ckan/ckan/blob/master/ckan/views/dataset.py
* Resource page - https://demo.ckan.org/dataset/dataset_389383/resource/331f57d1-74fc-46ad-9885-50eb26dde13a - showcase of individual resource including views etc.
* Resource view module - https://github.com/ckan/ckan/blob/master/ckan/views/resource.py
* Package and resource templates - https://github.com/ckan/ckan/tree/master/ckan/templates/package
### Developer Experience (DX)
Docs - https://docs.ckan.org/en/2.8/theming/index.html
* You need to do it in a new CKAN extension and follow recommended standards. There are no easy ways of reusing code from other projects, since most often they are not written in the required languages/frameworks/libraries.
* Nowadays, the best way to do it is to create an extension for each of the components.
* There's no easy documented path for achieving this.
* The easier way is to deploy a complete CKAN v2 stack using Docker Compose.
- Theming - https://docs.ckan.org/en/2.8/theming/index.html
- Create new helper functions https://docs.ckan.org/en/2.8/theming/templates.html#template-helper-functions
### Theming
Theming is done via CKAN Classic extensions. See https://docs.ckan.org/en/2.8/theming/index.html
### Extending (Plugins)
In CKAN Classic you extend the frontend e.g. adding new pages or altering existing ones by a) overriding templates b) creating an extension using specific plugin points (e.g. IController): https://docs.ckan.org/en/2.8/extensions/index.html
### Limitations
There are two main issues:
* There is no standard, satisfactory way to create data portals that integrates data and content. Current methods for integrating CMS content into CKAN are clunky and limited.
* Theming and frontend work is slow, painful and difficult for standard frontend devs because a) it requires installing and interacting with the full (complex) CKAN, b) it uses a very specific frontend stack (Python etc. rather than JavaScript), and c) template spaghetti (the curse of a million "slots"; inheritance rather than composition)
* There is too much coupling of frontend and backend e.g. logic layer doing dictize. Poor separation of concerns.
In more detail:
* Theming - styling, templating:
* It uses Bootstrap 3 (outdated). An upgrade takes a significant amount of effort because all the existing themes rely, or may rely, on it.
* No documented way of switching Bootstrap off and replace it for another framework.
* Although the documentation only mentions pure CSS, CKAN also uses LESS. It's not clear how a theme could be written in LESS, or whether that is recommended or even possible.
* For changing or adding a better overview, one needs to create a CKAN extension, with its own challenges.
* It needs to happen in Python/Jinja, overriding the existing actions and templates.
* The main challenge is general theming in CKAN Classic, e.g., you have to follow the CKAN Classic way using inheritance model of templates.
* Javascript:
* No viable way of extending it in other languages such as Javascript.
* It's not simple to achieve the common task of adding Javascript to the frontend.
* You must understand CKAN and a large portion of its architecture.
* You must run CKAN in its entirety.
* The document is far from short https://github.com/ckan/ckan/blob/2.8/doc/theming/javascript.rst
* Not (easily, at least) possible to develop a Single Page Application while still relying on CKAN for all the backend.
* Other:
* It's not easy to make configuration changes to how the existing feature works.
* The dataset URL follows a nested RESTful format, with non-human-readable IDs.
* Not good for SEO.
* It may be a reasonable default, but hardly works in practice as stakeholders have their own preferences.
## CKAN v3
>[!tip]Announcing Portal.js📣
>Portal.js 🌀 is a javascript framework for building rich data portal frontends fast using a modern frontend approach (JavaScript, React, SSR).
>
>https://github.com/datopian/portal.js
Previous (stable) version is `frontend-v2`: https://github.com/datopian/frontend-v2. It is written in NodeJS using ExpressJS. For templating it uses [Nunjucks][].
[Nunjucks]: https://mozilla.github.io/nunjucks/templating.html
>[!note]It is easy to write your own Next Gen frontend in any language or framework you like -- much like the frontend of a headless CMS site. And obviously you can still reuse the patterns (and even code if you are using JS) from the default approach presented here.
## RFC
Background and motivation in the RFC https://github.com/ckan/ideas/blob/master/rfcs/0005-decoupled-frontend.md


@ -1,190 +0,0 @@
# Giftless - the GIT LFS server
## Introduction
Our work on Giftless started from the context of two distinct needs:
* Need: direct-to-(cloud)-storage uploading (and downloading), including from the client => you need a service that will issue tokens to upload direct to storage -- what we term a "Storage Access Gateway" => The Git LFS server protocol actually provides this with its `batch` API. Rather than reinventing the wheel let's use this existing protocol.
* Need: git is already widespread and heavily used by data scientists and data engineers. However, git does not support large files well whilst data work often involves large files. Git LFS is the protocol designed to support large files stored outside of the git blob store. If we have our own git lfs server then we can integrate any storage we want with git.
From these we arrived at a vision for a standalone Git LFS server (standalone in contrast to the existing git lfs servers provided as an integrated part of existing git hosting providers such as github or gitlab). We also wanted to be able to customize it so it could be backed onto any major cloud storage backend (e.g. S3, GCS, Azure etc). We also had a preference for Python.
**Why build something new?** We looked around at the existing [Git LFS server implementations][impl] and couldn't find one that looked like it suited our needs: there were only a few standalone servers, only one in Python, and those that did exist were usually quite out of date and supported old versions of the LFS protocol (see appendix below for further details).
[impl]: https://github.com/git-lfs/git-lfs/wiki/Implementations
## Giftless API
Giftless follows the [git-lfs API][lfsapi] in general, with the following differences and extensions:
* Locking: no support at present
* Multipart: giftless adds support for multi-part transfers. See XXX for details
* Giftless adds optional support for `x-filename` object property, which allows specifying the filename of an object in storage (this allows storage backends to set the "Content-disposition" header when the file is downloaded via a browser, for example)
Below we summarize the key API endpoints.
[lfsapi]: https://github.com/git-lfs/git-lfs/tree/master/docs/api
### `POST /foo/bar/objects/batch`
```
{
"transfers": ["multipart-basic", "basic"],
"operation": "upload",
"objects": [
{
"oid": "20492a4d0d84f8beb1767f6616229f85d44c2827b64bdbfb260ee12fa1109e0e",
"size": 10000000
}
]
}
```
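For orientation, the response to a `batch` call follows the Git LFS batch API and roughly has the shape sketched below (shown as a Python dict; the hrefs, headers and expiry values are illustrative, not what a given Giftless deployment returns):
```python
# Rough sketch of a batch response per the Git LFS batch API. The actual URLs,
# headers and expiry returned by a Giftless deployment will differ.
batch_response = {
    "transfer": "basic",
    "objects": [
        {
            "oid": "20492a4d0d84f8beb1767f6616229f85d44c2827b64bdbfb260ee12fa1109e0e",
            "size": 10000000,
            "actions": {
                "upload": {
                    # Pre-signed / tokenized URL to upload the object directly to storage.
                    "href": "https://storage.example.com/bucket/prefix/foo/bar/20492a4d0d84",
                    "header": {"Authorization": "Bearer <token>"},
                    "expires_in": 3600,
                },
                "verify": {
                    # Callback to confirm the object landed in storage.
                    "href": "https://giftless.example.com/foo/bar/objects/storage/verify",
                    "header": {"Authorization": "Bearer <token>"},
                    "expires_in": 3600,
                },
            },
        }
    ],
}
```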
### Optional API endpoints
The following endpoints are also exposed by Giftless, but may not be used in some workflows, depending on your setup:
#### `POST /foo/bar/objects/storage/verify`
Verify an object in storage; This is used by different storage backends to check
a file after it has been uploaded, corresponding to the Git LFS `verify` action.
#### `PUT /foo/bar/objects/storage`
Store an object in local storage; This is only used if Giftless is configured to also
act as the storage backend server, which is not a typical production setup. This accepts
the file to be uploaded as HTTP request body.
An optional `?jwt=...` query parameter can be added to specify a JWT auth token, if JWT
auth is in use.
#### `GET /foo/bar/objects/storage`
Fetch an object from local storage; This is only used if Giftless is configured to also
act as the storage backend server, which is not a typical production setup. This will
return the file contents.
An optional `?jwt=...` query parameter can be added to specify a JWT auth token, if JWT
auth is in use.
### Comment: why the slightly weird API layout
The essence of giftless is to hand out tokens to store or get data in blob storage. One would anticipate the basic API being a bit simpler, e.g. something like what we implemented in our earlier effort `bitstore`: https://github.com/datopian/bitstore#get-authorized-upload-urls
```json
POST /authorize
{
"metadata": {
"owner": "<user-id-of-uploader>",
"name": "<data-set-unique-id>"
},
"filedata": {
"filepath": {
"length": 1234, #length in bytes of data
"md5": "<md5-hash-of-the-data>",
"type": "<content-type-of-the-data>",
"name": "<file-name>"
},
"filepath-2": {
"length": 4321,
"md5": "<md5-hash-of-the-data>",
"type": "<content-type-of-the-data>",
"name": "<file-name>"
}
...
}
}
```
However, the origins of git lfs with github mean that it has a slightly odd setup whereby the start of the url is `/foo/bar` corresponding to `{org}/{repo}`.
## Mapping from Giftless to Storage
One important question when using giftless is how a file maps from api call to storage ...
This call:
```
/foo/bar/objects/batch
oid: (sha256) xxxx
```
Maps in storage to ...
```
storage-bucket/{prefix}/foo/bar/xxx
```
Where `{prefix}` is configured as an environment variable on the Giftless server (and can be empty).
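A tiny sketch of that mapping (illustrative only, not Giftless's actual code):
```python
# Illustrative only: how an oid from a batch call against /foo/bar ends up as a
# storage key of the form {prefix}/foo/bar/{oid}. The prefix may be empty.
def storage_key(prefix: str, org: str, repo: str, oid: str) -> str:
    parts = [p for p in (prefix, org, repo, oid) if p]
    return "/".join(parts)


print(storage_key("lfs", "foo", "bar", "a1b2c3"))  # -> lfs/foo/bar/a1b2c3
print(storage_key("", "foo", "bar", "a1b2c3"))     # -> foo/bar/a1b2c3
```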
## Authentication and Authorization
How does giftless determine that you are allowed to upload to
```
/foo/bar/objects/batch
oid: XXXX
```
It does that based on scopes in the JWT token ...
See https://github.com/datopian/giftless/blob/feature/21-improve-documentation/docs/source/auth-providers.md for more details.
## Use with git
https://github.com/git-lfs/git-lfs/blob/master/docs/api/server-discovery.md
Set the LFS server URL in `.lfsconfig` (a git-config style file; point `lfs.url` at your Giftless server).
## Appendix: Summary of Git LFS
https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
* TODO: sequence diagram of git interaction with git lfs server
* TODO: summary of server API
### Summary of interaction (for git client)
* Perform server discovery to discover the git lfs server: https://github.com/git-lfs/git-lfs/blob/master/docs/api/server-discovery.md
* Use `.lfsconfig` if it exists
* Default: Git LFS appends `.git/info/lfs` to the end of a Git remote URL to build the LFS server URL it will use.
* Authentication (?)
* Send a `batch` API call to the server configured in the Git client's .lfsconfig (TODO: verify config location)
* The specifics of the `batch` request depend on the current operation being performed and the objects (that is files) operated on;
* The response to `batch` includes "instructions" on how to download / upload files to storage
* Storage can be on the same server as Git LFS (in some dedicated endpoints) or on a whole different server, for example S3 or Google Cloud Storage
* Git LFS defines a few "transfer modes" which define how to communicate with the storage server. The most basic mode (known as `"basic"`) uses HTTP GET and PUT to download / upload the files given a URL and some headers.
* There could be other transfer modes - for example, Giftless defines a custom transfer mode called `multipart-basic` which is specifically designed to take advantage of Cloud storage vendors' multipart upload features.
* Based on the transfer mode & instructions (typically URL & headers) specified in the response to the `batch` call, the git lfs client will now interact with the storage server to upload / download files
### How git lfs works for git locally
* Have special git lfs blob storage in git directory
* On checkout pull from that blob store rather than standard one
* On add and commit: write a pointer file into the git tree instead of the actual file and put the file in the git lfs blob storage
* On push: push git lfs blobs to the blob server
* On pull: pull the git lfs blobs needed for the current checkout
* TODO: do I pull other stuff?
### Git LFS Server API
https://github.com/git-lfs/git-lfs/tree/master/docs/api
### Batch API
https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md
TODO: summarize here.
## Appendix: Existing Git LFS server implementations
A review of some of the existing GIT LFS server implementations.
* https://github.com/kzwang/node-git-lfs Node, well developed but now archived and last updated 5y ago. This implementation provided inspiration for giftless.
* https://github.com/mgax/lfs: Python, only speaks legacy v1 API and last updated properly ~2y ago
* https://github.com/meltingice/git-lfs-s3 - Ruby, repo is archived. Last updated 6y ago.


@ -1,39 +0,0 @@
# Glossary
[Resource]: #Resource
## Data Management System
A Data Management System is a *framework* for building data management solutions such as data catalogs, data portals, data factories, data workflows and various combinations and extensions of these.
## Dataset
Dataset is a collection of related and (potentially) interconnected [Resource][]s. Example: Excel file with multiple sheets, Database etc.
## DMS
DMS is an acronym for Data Management System.
## File
Usually a *data* file. See Resource.
## Profile
The structure for general metadata for data. E.g. this dataset follows the "Biodiversity Data Publication v1.3 Profile".
## Resource
Resource (aka File) is a single data file or object. Strictly, the Resource should correspond to a single logical data structure e.g. a single table vs multiple tables. Example: CSV file, single sheet spreadsheet, geojson file.
Confusingly, an actual physical file/resource can correspond to multiple logical resources e.g. an Excel file with multiple sheets corresponds conceptually to a (logical) Dataset with multiple (logical) Resources.
## Schema
A schema for data (specifically a resource). For example:
* The set of fields present e.g. the columns in the spreadsheet
* The type of each field e.g. is this column a string, number, date etc
* Other restrictions e.g. all values in this field are positive
See Frictionless Table Schema for a detailed spec: http://frictionlessdata.io/table-schema/


@ -1,658 +0,0 @@
# Harvesting
## Introduction
Harvesting is the automated collection into a Data Portal of metadata (and maybe data) from other catalogs and sources.
The core epic is: As a Data Portal Manager I want to harvest datasets' metadata (and maybe data) from other portals into my portal so that all the metadata is in one place and hence searchable/discoverable there
### Features
Key features include:
* Harvest from multiple sources and with a variety of source metadata formats (e.g. data.json, DCAT, CKAN etc).
* Implied is the ability to create and maintain (generic) harvesters for different types of metadata (e.g. data.json, DCAT) (below we call these pipelines)
* Off-the-shelf harvesting for common metadata formats e.g. data.json, DCAT etc
* Incremental, efficient harvesting from a given source. For example, imagine a source catalog that has ~100k datasets and adds 100 new datasets every day. Assuming you have already harvested this catalog, you only want to harvest those 100 new datasets during your daily harvest (and not re-harvest all 100k). Similarly, you want to be able to handle deletions and modifications of existing datasets (see the sketch after this list).
* An even more complex case is where the harvested metadata is edited in the harvesting catalog and one has to handle merging of changes from the source catalog into the harvesting catalog (i.e. you can handle changes in both locations).
* Create and update harvest sources via API and UI
* Run and view harvests via API (and UI) and the background logging and monitoring to support that
* Detailed and useful feedback of harvesting errors so that harvest maintainers (or downstream catalog maintainers) can quickly and easily diagnose and fix issues
* Robust and reliable performance, for example supporting harvesting thousands or even millions of datasets a day
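A minimal sketch of the bookkeeping behind incremental harvesting (the identifiers and the "changed" test are illustrative; real harvesters typically compare modification dates or content hashes):
```python
# Given dataset-id -> fingerprint mappings (e.g. a modified date or content hash)
# from the previous run and from the current source listing, work out what to
# create, update and delete instead of re-harvesting everything.
def plan_incremental_harvest(previous: dict, current: dict):
    prev_ids, cur_ids = set(previous), set(current)
    to_create = cur_ids - prev_ids
    to_delete = prev_ids - cur_ids
    to_update = {i for i in cur_ids & prev_ids if current[i] != previous[i]}
    return to_create, to_update, to_delete


previous = {"ds-1": "2019-01-01", "ds-2": "2019-01-05"}
current = {"ds-2": "2019-02-01", "ds-3": "2019-02-02"}
print(plan_incremental_harvest(previous, current))
# ({'ds-3'}, {'ds-2'}, {'ds-1'})
```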
### Harvesting is ETL
"Harvesting" of metadata in its essence is exactly the same as any data collection and transformation process ("ETL"). "Metadata" is no different from any other data for our purposes (it is just quite small!).
This insight allows us to see harvesting as just like any other ETL process. At the same time, the specific nature of *this* ETL process e.g. that it is about collecting dataset metadata, allows us to design the system in specific ways.
We can use standard ETL tools to do harvesting, loosely coupling their operation to the CKAN Classic (or CKAN Next Gen) metastore.
### Domain Model
The Harvesting Domain Model has the following conceptual components:
* **Pipeline**: a generic "harvester" pipeline for harvesting a particular data type e.g. data.json, dcat. A pipeline consists of Processors.
* **Source (aka Harvester)**: the entire spec of a repeatable harvest from a given source including the pipeline to use, the source info and any additional configuration e.g. the schedule on which to run this
* **Run (Job)**: a given run of a Source
* **Dataset**: a resulting dataset.
* **Log (Entry)**: (including errors)
NB: the term harvester is often used both for a pipeline (e.g. the DCAT Harvester) and for a Source e.g. "XYZ Agency data.json Harvester". Because of this confusion we prefer to avoid the term, or to reserve it for an active Source e.g. "the GSA data.json harvester".
### Components
A Harvesting system has the following key subsystems and components:
#### ETL
* **Pipelines**: a standard way of creating pipelines and processors in code
* **Runner**: a system for executing Runs of the harvesters. This should be queue based.
* **Logging**: a system for logging (and reporting) including of errors
* **Scratch (Store)**: intermediate storage for temporary or partial outputs of the pipeline runs
* **API**: interface to the runner and errors
#### Sources and Configuration
* **Source Store**: database of Sources
* **API/UI**: UI, API and CLI usually covering Source Store plus reporting on them e.g. Runs, Errors etc
#### UI (web, command line etc)
* User interface (web and/or command line etc) to ETL e.g. runner, errors
* User interface to sources and configuration
#### MetaStore
* **MetaStore**: the store for harvested metadata -- this is considered to be outside the harvesting system itself
{/* <!-- TODO: explain how each of these is implemented in NG harvesting (maybe in a table) and compare with Classic --> */}
## CKAN v2
CKAN v2 implements harvesting via [ckanext-harvest extension][ckanext].
This extension stores configuration in the main DB and harvesters run off a queue process. A detailed analysis of how it works is in [the appendix below](#appendix-ckan-classic-harvesting-in-detail).
[ckanext]: https://github.com/ckan/ckanext-harvest
### Limitations
The main problem is that ckanext-harvest builds its own bespoke mini-ETL system and builds this into CKAN. A bespoke system is less powerful and flexible and harder to maintain, and building it in makes CKAN more bulky (conceptually, resource-wise etc.) and creates unnecessary coupling.
Good: Using CKAN as config store and having a UI for that
Not so good:
* An ETL system integrated with CKAN
* Tightly coupled: running tests of the ETL requires CKAN. This makes it harder to create tests (with the result that many harvesters have few or no tests).
* Installation is painful (CKAN + 6 steps) making it harder to use, maintain and contribute to
* Dependent on CKAN upgrade cycles (tied via code rather than service API)
* CKAN is more complex
* Bespoke ETL system is both less powerful and harder to maintain
* For example, the Runner is bespoke to CKAN rather than using something standard like e.g. Airflow
* Rigid structure (gather, fetch, import) which may not fit many situations, e.g. data.json harvesting (one big file with many datasets where everything is done in gather) or where there are dependencies between stages
* Logging and reporting is not great and hard to customize (logs into CKAN)
* Maintenance status: there is some maintenance activity but it does not look great, and quite a lot is outstanding (as of Aug 2019):
* 47 [open issues](https://github.com/ckan/ckanext-harvest/issues)
* 6 [open pull requests](https://github.com/ckan/ckanext-harvest/pulls) (some over a year old)
## CKAN v3
Next Gen harvesting decouples the core "ETL" part of harvesting into a small, self-contained microservice that is runnable on its own and communicates with the rest of CKAN over APIs. This is consistent with the general [next gen microservice approach](ckan-v3).
The design allows the Next Gen Harvester to be used with both CKAN Classic and CKAN Next Gen.
Perhaps most important of all, the core harvester can use standard third-party patterns and tools to make it more powerful, easier to maintain and easier to use. For example, it can use Airflow for its runner rather than a bespoke system built into CKAN.
### Features
Specific aspects of the next gen approach:
* Simple: easy to write harvesters -- just a Python script, and you can create harvesters while knowing almost nothing about CKAN
* Runnable and testable standalone (without the rest of CKAN) which makes running and testing much easier
* Uses the latest standard ETL techniques and technologies
* Multi-cluster support: run one harvester for multiple CKAN instances
* Data Package based
### Design
Here is an overview of the design; further details on specific parts (e.g. Pipelines) are in the following sections. Coloring indicates implementation status:
* Green: Implemented
* Pink: In progress
* Grey: Next up
```mermaid
graph TD
ui[WUI to CRUD Sources]
runui[Run UI]
config(Harvest Source API)
metastore(MetaStore API for<br /> harvested metadata)
logging(Reporting API)
runner[Runner + Queue - Airflow]
runapi[Run API - Logic of Running Jobs etc]
pipeline[Pipeline System + Library]
subgraph "CKAN Classic"
ui --> config
metastore
end
subgraph "Next Gen Components"
runui --> runapi
runapi --> runner
pipeline
end
pipeline --> metastore
config -.-> runapi
runner --> pipeline
pipeline -.reporting/errors.-> logging
subgraph "Pipeline System + Library"
framework[Framework<br/>DataFlows + Data Packages] --> dataerror[Data Errors]
dataerror --> ckansync[CKAN Syncing]
ckansync --> datajson[data.json impl]
end
classDef done fill:lightgreen,stroke:#333,stroke-width:2px;
classDef existsneedsmod fill:blue,stroke:#333,stroke-width:2px;
classDef started fill:yellow,stroke:#333,stroke-width:2px;
classDef todo fill:orange,stroke:#333,stroke-width:2px;
class config,metastore,runner,ui,datajson,ckansync,framework,pipeline done;
class runapi started;
class runui,logging todo;
```
### User Journey
* Harvest Curator goes to WUI for Sources and does Create Harvest Source ...
* Fills it in ...
* Comes back to the harvest source dashboard
* To run a harvest source you go to the new Run UI interface
* It lists all harvest sources like the harvest source ...
* Click on Run (even if this is just to set up the schedule ...)
* Go to Job page for this run which shows the status of this run and any results ...
* [TODO: how do we link up harvest sources to runs]
### Pipelines
These follow a standard pattern:
* Built in Python
* Use DataFlows by default as way to structure the pipeline (but you can use anything)
* Produce data packages at each step
* Pipelines should be divided into two parts:
* Extract and Transform: (fetch and convert) fetching remote datasets and converting them into a standard intermediate form (Data Packages)
* Load: loading that intermediate form into the CKAN instance. This includes not only the format conversion but the work to synchronize state i.e. to create, update or delete in CKAN based on whether a given dataset from the source already has a representation in CKAN (harvests run repeatedly).
```mermaid
graph LR
subgraph "Extract and Transform"
extract[Extract] --> transform[Transform]
end
subgraph "Load"
transform --data package--> load[Load]
end
load --> ckan((CKAN))
```
This pattern has these benefits:
* Load functionality to be reused across Harvest Pipelines (the same Load functionality can be used again and again)
* Cleaner testing: you can test extract and load without needing to have a CKAN instance
* Ability to reuse Data Package tooling
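A minimal sketch of this Extract/Transform shape using DataFlows (the source file, `title` column and slug logic are assumptions for illustration; the Data Package written to `out/harvested` is what a separate Load step would then sync into CKAN):
```python
# Illustrative Extract/Transform pipeline: read a (hypothetical) source listing,
# derive a slug per dataset, and dump an intermediate Data Package for the Load step.
from dataflows import Flow, add_field, dump_to_path, load


def add_slug(row):
    # Transform: derive a URL-friendly identifier from the source title.
    row["slug"] = row["title"].lower().replace(" ", "-")


Flow(
    load("datasets.csv"),           # Extract: assumes a datasets.csv with a "title" column
    add_field("slug", "string"),    # declare the new field in the Table Schema
    add_slug,
    dump_to_path("out/harvested"),  # writes datapackage.json + data
).process()
```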
#### Pipeline Example: Fetch and process a data.json
* **Extract**: Take say a data.json
* Validate
* **Transform**: Split into datasets and then transform into data packages
* Save to local
* **Transform 2** check the difference and write to the existing DB.
* **Load**: Write to DB (CKAN metastore / DB)
```mermaid
graph TD
get[Retrieve data.json]
validate[Validate data.json]
split[Split data.json into datasets]
datapkg[Convert each data.json dataset into Data Package]
write[Write results somewhere local]
ckan[Sync to CKAN - work out diff and implement]
get --> validate
validate --> split
split --> datapkg
datapkg --> write
write --> ckan
```
#### Pipeline example detailed
```mermaid
graph TD
download[Download data]
fetch[Fetch]
validate[Validate data]
get_previous_datasets(Get previous harvested data)
save_download_results(Save as Data Package)
compare(Compare)
save_compare_results(Save compare results)
write_destination(Write to destination)
save_previous_data(Save as Data Package)
save_log_report(Save final JSON log)
federal[Federal]
non_federal[Non Federal]
dataset_adapter[Dataset Adapter]
resource_adapter[Resource Adapter]
classDef green fill:lightgreen;
class download,fetch,validate,get_previous_datasets,save_download_results,compare,save_compare_results,write_destination,save_previous_data,save_log_report,federal,non_federal,dataset_adapter,resource_adapter green;
subgraph "Harvest source derivated (data.json, WAF, CSW)"
download
save_download_results
end
subgraph "Harvester core"
fetch
subgraph Validators
validate
federal
non_federal
end
subgraph Adapters
dataset_adapter
resource_adapter
end
end
subgraph "Harvest Source base class"
save_download_results
compare
save_compare_results
save_log_report
end
subgraph "Harvest Destination"
get_previous_datasets
write_destination
save_previous_data
end
download -.transform to general format.-> save_download_results
save_download_results --> compare
compare --> save_compare_results
save_compare_results --> dataset_adapter
dataset_adapter --> resource_adapter
resource_adapter --> write_destination
get_previous_datasets -.transform to general format.-> save_previous_data
save_previous_data --> compare
compare -.Logs.-> save_log_report
download --> fetch
fetch --> validate
validate --> download
```
### Runner
We use Apache Airflow for the Runner.
### Source Spec
This is the specification of the Source object.
You can compare this to [CKAN Classic HarvestSource objects below](#harvest-source-objects).
```javascript
id:
url:
title:
description:
date:
harvester_pipeline_id: // type in old CKAN
config:
enabled: // is this harvester enabled at the moment
owner: // user_id
publisher_id: // what is this?? Maybe the org the dataset is attached to ...
frequency: // MANUAL, DAILY etc
```
### UI
* Jobs UI: moves to next gen
* Source UI: stays in classic for now ...
### Installation
CKAN Next Gen is in active development and is being deployed in production.
You can find the code here:
https://github.com/datopian/ckan-ng-harvester-core
### Run it standalone
TODO
### How to integrate with CKAN Classic
Config in CKAN MetaStore, ETL in new System
* Keep storage and config in CKAN MetaStore (e.g. CKAN Classic)
* New ETL system for the actual harvesting process
**Pulling Config**
* Define a spec format for sources
* Script to convert this to Airflow DAGs
* Script to convert CKAN Harvest sources into the spec and hence into Airflow DAGs
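A sketch of what such a conversion script could look like -- one Airflow DAG generated per harvest source spec (the spec list, schedule and task body are placeholders; a real script would pull the specs from the harvest source API):
```python
# Illustrative only: generate one Airflow DAG per harvest source spec.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder specs; in practice these would come from the harvest source API.
SOURCES = [
    {"id": "example-data-json", "url": "https://example.gov/data.json", "frequency": "@daily"},
]


def run_harvest(url: str) -> None:
    print(f"harvesting {url}")  # placeholder for invoking the real pipeline


for source in SOURCES:
    dag = DAG(
        dag_id=f"harvest_{source['id']}",
        start_date=datetime(2024, 1, 1),
        schedule_interval=source["frequency"],
        catchup=False,
    )
    PythonOperator(
        task_id="harvest",
        python_callable=run_harvest,
        op_kwargs={"url": source["url"]},
        dag=dag,
    )
    # Register the generated DAG so the Airflow scheduler picks it up.
    globals()[f"harvest_{source['id']}"] = dag
```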
**Showing Status and Errors**
* We create a viewer from Airflow status => JS SPA
* and then embed in CKAN Classic Admin UI
```mermaid
graph LR
config[Configuration]
etl[ETL - ckanext-harvesting etc]
ckan[CKAN Classic]
new["New ETL System"]
subgraph "Current CKAN Classic"
config
etl
end
subgraph Future
ckan
new
end
config --> ckan
etl --> new
```
More detailed version:
```mermaid
graph TD
config[Configuration e.g. Harvest sources]
api[Web API - for config, status updates etc]
ui[User Interface to create harvests, see results]
metastore["Storage for the harvested metadata (+ data)"]
logging[Logging]
subgraph "CKAN Classic"
config
ui
metastore
api
logging
end
subgraph ETL
runner[Runner + Queue]
pipeline[Pipeline system]
end
pipeline --> metastore
config --> runner
runner --> pipeline
pipeline -.reporting/errors.-> logging
ui --> config
logging --> api
metastore --> api
config --> api
```
### How do I ...
Support parent-child relationships in harvested datasets e.g. in data.json?
Enhance / transform incoming datasets e.g. assigning topics based on sources e.g. this is geodata
## Appendix: CKAN Classic Harvesting in Detail
https://github.com/ckan/ckanext-harvest
README is excellent and definitely worth reading - key parts of it are also below.
### Key Aspects
* Redis and AMQP (does anyone use AMQP?)
* Logs to database with API access (off by default) - https://github.com/ckan/ckanext-harvest#database-logger-configurationoptional
* Dataset name generation (to avoid overwriting)
* Send mail when harvesting fails
* CLI - https://github.com/ckan/ckanext-harvest#command-line-interface
* Authorization - https://github.com/ckan/ckanext-harvest#authorization
* Built in CKAN harvester - https://github.com/ckan/ckanext-harvest#the-ckan-harvester
* Running it: you run the queue listeners (the gather and fetch consumers)
Existing harvesters
* CKAN - ckanext-harvest
* DCAT - https://github.com/ckan/ckanext-dcat/tree/master/ckanext/dcat/harvesters
* Spatial - https://github.com/ckan/ckanext-spatial/tree/master/ckanext/spatial/harvesters
### Domain model
See https://github.com/ckan/ckanext-harvest/blob/master/ckanext/harvest/model/__init__.py
* HarvestSource - a remote source for harvesting datasets from e.g. a CSW server or CKAN instance
* HarvestJob - a job to do the harvesting (done in 2 stages: gather and then fetch and import). This is basically state for the overall process of doing a harvest.
* HarvestObject - job to harvest one dataset. Also holds dataset on the remote instance (id / url)
* HarvestGatherError
* HarvestObjectError
* HarvestLog
#### Harvest Source Objects
https://github.com/ckan/ckanext-harvest/blob/master/ckanext/harvest/model/__init__.py#L230-L245
```python
# harvest_source_table
Column('id', types.UnicodeText, primary_key=True, default=make_uuid),
Column('url', types.UnicodeText, nullable=False),
Column('title', types.UnicodeText, default=u''),
Column('description', types.UnicodeText, default=u''),
Column('config', types.UnicodeText, default=u''),
Column('created', types.DateTime, default=datetime.datetime.utcnow),
Column('type', types.UnicodeText, nullable=False),
Column('active', types.Boolean, default=True),
Column('user_id', types.UnicodeText, default=u''),
Column('publisher_id', types.UnicodeText, default=u''),
Column('frequency', types.UnicodeText, default=u'MANUAL'),
Column('next_run', types.DateTime), # not needed
```
#### Harvest Error and Log Objects
https://github.com/ckan/ckanext-harvest/blob/master/ckanext/harvest/model/__init__.py#L303-L331
```python
# New table
harvest_gather_error_table = Table(
    'harvest_gather_error',
    metadata,
    Column('id', types.UnicodeText, primary_key=True, default=make_uuid),
    Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id')),
    Column('message', types.UnicodeText),
    Column('created', types.DateTime, default=datetime.datetime.utcnow),
)

# New table
harvest_object_error_table = Table(
    'harvest_object_error',
    metadata,
    Column('id', types.UnicodeText, primary_key=True, default=make_uuid),
    Column('harvest_object_id', types.UnicodeText, ForeignKey('harvest_object.id')),
    Column('message', types.UnicodeText),
    Column('stage', types.UnicodeText),
    Column('line', types.Integer),
    Column('created', types.DateTime, default=datetime.datetime.utcnow),
)

# Harvest Log table
harvest_log_table = Table(
    'harvest_log',
    metadata,
    Column('id', types.UnicodeText, primary_key=True, default=make_uuid),
    Column('content', types.UnicodeText, nullable=False),
    Column('level', types.Enum('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL', name='log_level')),
    Column('created', types.DateTime, default=datetime.datetime.utcnow),
)
```
### Key components
* Configuration: HarvestSource objects
* Pipelines: Gather, Fetch, Import stages in a given harvest extension
* Runner: bespoke queue system (backed by Redis or AMQP). Scheduler is external e.g. Cron
* Logging: logged into CKAN DB (as Harvest Errors)
* Interface: API, Web UI and CLI
* Storage:
* Final: Datasets are in CKAN MetaStore
* Intermediate: HarvestObject in CKAN MetaStore
### Flow
0. *Harvest run:* a regular run of the Harvester that generates a HarvestJob object. This is then passed to the gather stage. This is what is triggered by a cron `harvest run` execution (or from the web UI)
1. The **gather** stage compiles all the resource identifiers that need to be fetched in the next stage (e.g. in a CSW server, it will perform a GetRecords operation).
2. The **fetch** stage gets the contents of the remote objects and stores them in the database (e.g. in a CSW server, it will perform GetRecordById operations).
3. The **import** stage performs any necessary actions on the fetched resource (generally creating a CKAN package, but it can be anything the extension needs).
```mermaid
graph TD
gather[Gather]
run[harvest run] --generates a HarvestJob--> gather
gather --generates HarvestObject for each remote source item --> fetch
fetch --HarvestObject updated with remote metadata--> import
import --store package--> ckan[CKAN Package]
import --HarvestObject updated --> harvestdb[Harvest DB]
```
```python
def gather_stage(self, harvest_job):
    '''
    The gather stage will receive a HarvestJob object and will be
    responsible for:
        - gathering all the necessary objects to fetch in a later
          stage (e.g. for a CSW server, perform a GetRecords request)
        - creating the necessary HarvestObjects in the database, specifying
          the guid and a reference to its job. The HarvestObjects need a
          reference date with the last modified date for the resource, this
          may need to be set in a different stage depending on the type of
          source.
        - creating and storing any suitable HarvestGatherErrors that may
          occur.
        - returning a list with all the ids of the created HarvestObjects.
        - to abort the harvest, create a HarvestGatherError and raise an
          exception. Any created HarvestObjects will be deleted.

    :param harvest_job: HarvestJob object
    :returns: A list of HarvestObject ids
    '''

def fetch_stage(self, harvest_object):
    '''
    The fetch stage will receive a HarvestObject object and will be
    responsible for:
        - getting the contents of the remote object (e.g. for a CSW server,
          perform a GetRecordById request).
        - saving the content in the provided HarvestObject.
        - creating and storing any suitable HarvestObjectErrors that may
          occur.
        - returning True if everything is ok (ie the object should now be
          imported), "unchanged" if the object didn't need harvesting after
          all (ie no error, but don't continue to import stage) or False if
          there were errors.

    :param harvest_object: HarvestObject object
    :returns: True if successful, 'unchanged' if nothing to import after
              all, False if not successful
    '''

def import_stage(self, harvest_object):
    '''
    The import stage will receive a HarvestObject object and will be
    responsible for:
        - performing any necessary action with the fetched object (e.g.
          create, update or delete a CKAN package).
          Note: if this stage creates or updates a package, a reference
          to the package should be added to the HarvestObject.
        - setting the HarvestObject.package (if there is one)
        - setting the HarvestObject.current for this harvest:
            - True if successfully created/updated
            - False if successfully deleted
        - setting HarvestObject.current to False for previous harvest
          objects of this harvest source if the action was successful.
        - creating and storing any suitable HarvestObjectErrors that may
          occur.
        - creating the HarvestObject - Package relation (if necessary)
        - returning True if the action was done, "unchanged" if the object
          didn't need harvesting after all or False if there were errors.

    NB You can run this stage repeatedly using 'paster harvest import'.

    :param harvest_object: HarvestObject object
    :returns: True if the action was done, "unchanged" if the object didn't
              need harvesting after all or False if there were errors.
    '''
```
### UI
Harvest admin portal
![](https://i.imgur.com/Y0cyrUG.png)
Add a Harvest Source
![](https://i.imgur.com/uoBFAjY.png)
Clicking on Harvest Source gives you list of datasets harvested
![](https://i.imgur.com/hIRrDHP.png)
Clicking on about gives you..
![](https://i.imgur.com/S9GTl9n.png)
Admin view of a particular (Harvest) source
![](https://i.imgur.com/Loeshhv.png)
Edit harvester
![](https://i.imgur.com/3SOuCm5.png)
Jobs summary
![](https://i.imgur.com/QwD7nHi.png)
Individual jobs
![](https://i.imgur.com/ljKEk0l.png)


@ -1,31 +0,0 @@
# HubStore
> [!warning]Work in Progress
> This is early stage and still a work in progress.
A HubStore maintains a catalog of organizations and their ownership of projects / datasets.
Its name derives from the common appellation "Hub" for something that organizes a collection of individual items, e.g. Git*Hub* or Data*Hub*. The HubStore handles the information that makes a Hub a Hub.
## Domain Model
* Organization
* Account (User)
* MembershipRole e.g. admin, editor etc
* Project (which has a Dataset)
Associations
* Organization --owns--> Project
* Organization --membership--> Account
* Membership association has an associated MembershipRole.
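A rough sketch of this domain model as Python dataclasses (field names are illustrative, not a finalized schema):
```python
# Illustrative only: the HubStore domain model and its associations.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Account:
    id: str
    name: str


@dataclass
class Membership:
    account_id: str
    role: str  # MembershipRole, e.g. "admin", "editor"


@dataclass
class Project:
    id: str
    dataset_id: str


@dataclass
class Organization:
    id: str
    members: List[Membership] = field(default_factory=list)  # Organization --membership--> Account
    projects: List[Project] = field(default_factory=list)    # Organization --owns--> Project
```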
Potential extras:
* Do we allow Accounts to own Projects or only Organizations? Yes, we do. I think this is a key use case.
* Organization Hierarchies: Organization --parent--> Organization. One could allow for hierarchies of organizations. We do not by default but it is possible to do so.
* Team: a convenient grouping of Accounts for the purpose of assigning permissions to something
* All team members have the same status (if you want different statuses get different teams)
* Team --membership--> Account (without a role)
* Example: Github Teams
* Comment: hierarchical organizations *could* make much of this use case obsolete. IME teams are an annoying feature of github that bring complexity (who exactly has access to this thing? If I want to remove Person X from access I have to check all the teams with access etc).


@ -1,114 +0,0 @@
<div className="hero" style={{textAlign: "center"}}>
<h1>Datopian Tech</h1>
<a href="https://datopian.com/">
<img src="/images/datopian-light-logotype.svg" style={{maxWidth: "250px", display: "block", margin: "3rem auto 1.5rem"}} />
</a>
<p className="description" style={{fontSize: "1.6rem", lineHeight: 1.3}}>
We are experts in data management and engineering
This is an overview of our technology
</p>
</div>
## Data Management Systems
A [Data Management System (DMS)][dms] is a _framework_. It can be used to create a variety of _solutions_ such as [Data Portals][], [Data Catalogs][], [Data Lakes][] (or Data Meshes) etc. We have developed two DMS stacks that share a set of underlying core components:
- [CKAN][]: the open source data management system we created in 2007 and that we continue to develop and maintain. The main information on CKAN is at https://ckan.org/. Here we have some specific notes on how we develop and deploy CKAN as well as our thoughts on the [next generation of CKAN (v3)][v3].
- [DataHub][]: a simpler version of CKAN focused on SaaS platform at DataHub.io. DataHub and CKAN v3 share many of the same core components.
[data portals]: /docs/dms/data-portals
[data lakes]: /docs/dms/data-lake
[data catalogs]: /docs/dms/data-portals
[dms]: /docs/dms/dms
[CKAN]: /docs/dms/ckan
### Solutions
You can use a DMS to build many kinds of specific solutions
- [Data Portals][portals] are gateways to data. That gateway can be big or small, open or restricted. For example, data.gov is open to everyone, whilst an enterprise "intra" data portal is restricted to its personnel.
- Data Catalog: see https://ckan.org/
- Metadata manager: see [Publishing][]
- Data Lake: you can use a DMS to rapidly create a data lake using existing infrastructure. For example, using the DMS' catalog and storage gateway with existing cloud storage and data processing capabilities.
- Data Engineering: you can use components of the DMS to rapidly create, orchestrate and supply data pipelines.
[dms]: /docs/dms/dms
[portals]: /docs/dms/data-portals
[publishing]: /docs/dms/publish
[datahub]: /docs/dms/datahub
[ckan]: /docs/dms/ckan
[v3]: /docs/dms/ckan-v3
### Features
A DMS has a variety of features. This section provides an overview and links to specific feature pages that include details of how they work in CKAN and CKAN v3 / DataHub.
<img src="https://docs.google.com/drawings/d/e/2PACX-1vRdMzNeIAEkjDRGtBfuocy6zDyRg_qDujSkLrTe69U1qlu_1kfTYN0OL_v4IZKKo0eDXRbCzgzQMlFz/pub?w=622&amp;h=635" />
> [!tip] There are many ways to break down features and this is just one framing. We are thinking about others and if you have thoughts please get in touch.
- [Discovering and showcasing data (catalog and presenting)](/docs/dms/frontend)
- [Views on data](/docs/dms/views) including visualizing and previewing data as well [Data Explorers][explorer] and [Dashboards][]
- [Publishing data](/docs/dms/publish)
- [Data API DataStore](/docs/dms/data-api)
- [Permissions](/docs/dms/permissions) and [Authentication](/docs/dms/authentication)
- [Versioning](/docs/dms/versioning)
- [Harvesting](/docs/dms/harvesting)
[dashboards]: /docs/dms/dashboards
[explorer]: /docs/dms/data-explorer
### Components
A DMS has the following key components:
- [HubStore](/docs/dms/hubstore)
- [Data Flows and Factory](/docs/dms/flows)
- [Loading to DataStore](/docs/dms/load)
- [Storage](/docs/dms/storage)
- [Blob Storage](/docs/dms/blob-storage)
- [Structured Storage - see DataStore](/docs/dms/data-api)
https://coggle.it/diagram/Xiw2ZmYss-ddJVuK/t/data-portal-feature-breakdown
<iframe width='540' height='480' src='https://embed.coggle.it/diagram/Xiw2ZmYss-ddJVuK/b24d6f959c3718688fed2a5883f47d33f9bcff1478a0f3faf9e36961ac0b862f' frameBorder='0' allowFullScreen></iframe>
## Frictionless
The Frictionless approach to data. See https://frictionlessdata.io/
Our team created this whilst at Open Knowledge Foundation and continues to co-steward it.
## OpenSpending
https://openspending.org/
## Developer Experience
Service Reliability Engineering (SRE) and Developer Experience (DX) for our CKAN cluster technology.
- [Developer Experience][dx]
- [DX - Deploy](/docs/dms/dx/deploy)
- [DX - Cluster](/docs/dms/dx/cluster)
Old cluster
- [Deploy in old cluster](/docs/dms/deploy)
- [Exporting from CKAN-Cloud](/docs/dms/migration)
- [Cloud](/docs/dms/cloud) - start on CKAN cloud documentation
## Research
- [Data Frames and what would a JS data frame library look like](/docs/dms/dataframe)
- [Dataset Relationships](/docs/dms/relationships)
## Miscellaneous
- [Glossary &raquo;](/docs/dms/glossary)
- [Notebook -- our informal blog &raquo;](/docs/dms/notebook)
[dx]: /docs/dms/dx


@ -1,183 +0,0 @@
# Data Load
## Introduction
Data load covers the functionality for automatically loading structured data such as tables into a data management system. Data load is usually part of a larger [Data API (DataStore)][dapi] component.
Load is distinct from uploading raw files ("blobs") and from a "write" data API: from blobs because the data is structured (e.g. rows and columns) and that structure is expected to be preserved; from a write data API because the data is imported in bulk (e.g. a whole CSV file) rather than writing one row at a time.
The load terminology comes from ETL (extract, transform, load) though in reality this functionality will often include some extraction and transformation -- extracting the structured data from the source formats and potentially transformation if data needs some cleaning.
[dapi]: /docs/dms/data-api
### Features
As a Publisher I want to load my dataset (resource) into the DataStore quickly and reliably so that my data is available over the data API.
* Be “tolerant” where possible of bad data so that it still loads
* Get feedback on load progress, especially if something went wrong (with info on how I can fix it), so that I know my data is loaded (or if not what I can do about it)
* I want to update the schema for the data so that the data has right types (before and/or after load)
* I want to be able to update with a new resource file and only have it load the most recent
For sysadmins:
* Track Performance: As a Datopian Cloud Sysadmin I want to know if there are issues so that I can promptly address any problems for clients
* One Data Load Service per Cloud: As a Datopian Cloud Manager I may want to have one “DataLoad” service I maintain rather than one per instance for efficiency …
### Flows
#### Automatic load
* A user uploads a file to the portal using the Dataset editor
* This is stored in blob storage (i.e. local or cloud storage)
* A "PUSH" notification is triggered to the loader service
* The loader service loads the file into the data API backend (a structured database with a web API)
```mermaid
sequenceDiagram
participant a as User
participant b as Blob Storage
participant c as CKAN
participant d as Loader
participant e as DataStore
a->>c: create a resource with a location of remote file
c->>d: push notification
d->>b: pull it
d->>e: push it
d-->>c: success (or failure) notification
```
#### Sequence diagram for manual load
The load to the data API system can also be triggered manually:
```mermaid
sequenceDiagram
participant a as User
participant b as Blob Storage
participant c as CKAN
participant d as Loader
participant e as DataStore
a->>c: click on upload button
c->>d: push notification
d->>b: pull it
d->>e: push it
d-->>c: success (or failure) notification
```
## CKAN v2
The actual loading is provided by either DataPusher or XLoader. There is a common UI. There is no explicit API to trigger a load -- instead it is implicitly triggered, for example when a Resource is updated.
### UI
The UI shows runs of the data loading system with information on success or failure. It also allows editors to manually retrigger a load.
![](https://i.imgur.com/fSh2cwK.png)
TODO: add more screenshots
### DataPusher
Service (API) for pushing tabular data to the DataStore. Do not confuse it with `ckanext/datapusher` in the CKAN core codebase, which is simply an extension communicating with the DataPusher API. DataPusher itself is a standalone service, running separately from the CKAN app.
https://github.com/ckan/datapusher
https://docs.ckan.org/projects/datapusher/en/latest/
https://docs.ckan.org/en/2.8/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore
### XLoader
XLoader runs as async jobs within CKAN and bulk loads data via the Postgres COPY command. This is fast but it does mean it only loads data as strings and explicit type-casting must be done after the load (the user must edit the data dictionary). XLoader was built to address two major issues with DataPusher:
* Speed: DataPusher converts data row by row and writes over the DataStore write API and hence is quite slow.
* Dirty data: DataPusher attempts to guess data types and then cast, and this regularly led to failures which, though logical, were frustrating to users. XLoader gets the data in (as strings) and lets the user sort out types later.
https://github.com/ckan/ckanext-xloader
* `load_csv`: https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/loader.py#L40
* Loader: https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/jobs.py#L100
How does the queue system work? The job queue is done by RQ, which is simpler, is backed by Redis, and allows access to the CKAN model. Job results are currently still stored in its own database, but the intention is to move this relatively small amount of data into CKAN's database, to reduce the complication of installation.
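For illustration, the "load everything as text via Postgres COPY" approach looks roughly like the sketch below (using psycopg2; this is not the actual XLoader code, and it assumes the target table already exists with all-TEXT columns):
```python
# Illustrative sketch of bulk-loading a CSV into an all-TEXT table via COPY.
# Not the actual XLoader implementation.
import psycopg2


def copy_csv_as_text(dsn: str, table: str, csv_path: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur, open(csv_path) as f:
            # COPY is fast but type-agnostic here: every column lands as text,
            # so any casting happens later (e.g. via the Data Dictionary).
            cur.copy_expert(
                f'COPY "{table}" FROM STDIN WITH (FORMAT csv, HEADER true)',
                f,
            )
```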
### Flow of Data in Data Load
```mermaid
graph TD
datastore[Datastore API]
datapusher[DataPusher]
pg[Postgres DB]
filestore[File Store]
xloader[XLoader]
filestore --"Tabular data"--> datapusher
datapusher --> datastore
datastore --> pg
filestore -. or via .-> xloader
xloader --> pg
```
Sequence diagram showing the journey of a tabular file into the DataStore:
```mermaid
sequenceDiagram
Publisher->>CKAN Instance: Create a resource (or edit existing one) in a dataset
Publisher->>CKAN Instance: Add tabular resource from disk or URL
CKAN Instance-->>FileStore: Upload data to storage
CKAN Instance-->>Datapusher: Add job to queue
Datapusher-->>Datastore: Run the job - push data via API
Datastore-->>Postgres: Create table per resource and insert data
```
### FAQs
Q: What happens with non-tabular data?
A: CKAN has a list of types of data it can process into the DataStore (TODO:link) and will only process those.
### What Issues are there?
Generally: the Data Load system is a hand-crafted, bespoke mini-ETL process. It would seem better to use high-quality third-party ETL tooling here rather than hand-roll it, be that for pipeline creation, monitoring, orchestration etc.
Specific examples:
* No connection between the DataStore system and the CKAN validation extension powered by GoodTables (https://github.com/frictionlessdata/ckanext-validation). Thus, for example, users may edit the DataStore Data Dictionary and be confused that this has no impact on validation. More generally, data validation and data loading might naturally be part of one overall ETL process, but the Data Load system is not architected in a way that makes this easy to add.
* No support for the Frictionless Data specs and their ability to specify incoming data structure (CSV format, encoding, column types etc).
* Dependent on error-prone guessing of types or manual type conversion
* Makes it painful to integrate with broader data processing pipeline (e.g. clean separation would allow type guessing to be optimized elsewhere in another part of the ETL pipeline)
* Excel loading won't work or won't load all sheets
* DataPusher
* https://github.com/ckan/ckanext-xloader#key-differences-from-datapusher
* Works badly when loading larger data: it may crash after an hour of loading for no apparent reason, and then continue after a reload
* Is slow, especially for large datasets, and even for smallish datasets (e.g. 25MB)
* Often fails due to e.g. data validation/casting errors, but this is not clearly reported (and is unsatisfying to the user)
* XLoader:
* Doesn't work with XLS(X)
* Has problems fetching resources from Blob Storage (it fails and needs to wait until the Resource is uploaded)
* Raises a NotFound exception when CKAN has a delay creating resources
* Re-submits Resources when creating a new Resource
* Sets `datastore_active` before data is uploaded
## CKAN v3
The v3 implementation is named 💨🥫 AirCan: https://github.com/datopian/aircan
It's a lightweight, standalone service using Airflow.
Status: Beta (June 2020)
* Runs as a separate microservice with zero coupling with CKAN core (=> gives cleaner separation and testing)
* Uses Frictionless Data patterns and specs where possible e.g. Table Schema for describing or inferring the data schema
* Uses AirFlow as the runner
* Uses common ETL / [Data Flows][] patterns and frameworks
[Data Flows]: /docs/dms/flows
### Design
See [Design page &raquo;](/docs/dms/load/design/)


@ -1,183 +0,0 @@
# Data Load Design
Key point: this is classic ETL so let's reuse those patterns and tooling.
## Logic
```mermaid
graph LR
usercsv[User has CSV,XLS etc]
userdr[User has Tabular Data Resource]
dr[Tabular Data Resource]
usercsv --1. some steps--> dr
userdr -. direct .-> dr
dr --2. load --> datastore[DataStore]
```
In more detail, dividing E+T (extract and transform) from L (load):
```mermaid
graph TD
subgraph "Prepare (ET)"
rawcsv[Raw CSV] --> tidy[Tidy]
tidy --> infer[Infer types]
infer
end
infer --> tdr{{Tabular Data Resource<br/>csv/json + table schema}}
tdr --> dsdelete
subgraph "Loader (L)"
datastorecreate[DataStore Create]
dsdelete[DataStore Delete]
load[Load to CKAN via DataStore API or direct copy]
dsdelete --> datastorecreate
datastorecreate --> load
end
```
### Load step in even more detail
```mermaid
graph TD
tdr[Tabular Data Resource on disk from CSV in FileStore of a resource]
loadtdr[Load Tabular Data Resource Metadata]
dscreate[Create Table in DS if not exists]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG copy]
done[Data in DataStore]
tdr --> loadtdr
loadtdr --> dscreate
dscreate --> cleartable
cleartable --> pushdatacopy
pushdatacopy --> done
logstore[LogStore]
cleartable -. log .-> logstore
pushdatacopy -. log .-> logstore
```
## Runner
We will use AirFlow.
## Research
### What is a Tabular Data Resource?
See Frictionless Specs. For our purposes:
* A "Good" CSV file: Valid CSV - with one header row, No blank header etc...
* Encoding worked out -- usually we should have already converted to utf-8
* Dialect - https://frictionlessdata.io/specs/csv-dialect/
* Table Schema https://frictionlessdata.io/specs/table-schema
NB: even if you want to go the direct-loading route (a la XLoader) and ignore types, you still need encoding etc. sorted out -- and it still fits in the diagram above (the Table Schema is just trivial -- everything is strings).
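For illustration, here is a minimal sketch of inferring the encoding, dialect and Table Schema for a CSV. It assumes the `frictionless` Python package (the successor to `tableschema`/`goodtables`); `data.csv` is a placeholder file name.
```python
# A sketch, not the AirCan implementation: infer what we need for a
# Tabular Data Resource (encoding, CSV dialect, Table Schema) from a file.
from frictionless import describe

resource = describe("data.csv")   # returns a Resource descriptor
print(resource.encoding)          # detected encoding (ideally utf-8)
print(resource.dialect)           # CSV dialect (delimiter, quoting, ...)
print(resource.schema)            # inferred Table Schema: field names and types
```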
### What is the DataStore and how to create a DataStore entry
https://github.com/ckan/ckan/tree/master/ckanext/datastore
* provides an ad hoc database for storage of structured data from CKAN resources
* Connection with Datapusher: https://docs.ckan.org/en/2.8/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore
* Datastore API: https://docs.ckan.org/en/2.8/maintaining/datastore.html#the-datastore-api
* Making Datastore API requests: https://docs.ckan.org/en/2.8/maintaining/datastore.html#making-a-datastore-api-request
#### Create an entry
```
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
    "resource": {"package_id": "{PACKAGE-ID}"},
    "fields": [ {"id": "a"}, {"id": "b"} ]
  }'
```
https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
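The same call from Python, as a sketch assuming the `requests` package; the API key and package id are placeholders, as in the curl example above.
```python
# A sketch of calling datastore_create via the CKAN Action API.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/3/action/datastore_create",
    headers={"Authorization": "{YOUR-API-KEY}"},
    json={
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [{"id": "a"}, {"id": "b"}],
    },
)
resp.raise_for_status()
print(resp.json()["result"])  # the created DataStore table, including its resource_id
```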
### Options for Loading
There are 3 different paths we could take:
```mermaid
graph TD
pyloadstr[Load in python in streaming mode]
cleartable[Clear DS table if existing content]
pushdatacopy[Load to DS via PG copy]
pushdatads[Load to DS via DataStore API]
pushdatasql[Load to DS via sql over PG api]
done[Data in DataStore]
dataflows[DataFlows SQL loader]
cleartable -- 1 --> pyloadstr
pyloadstr --> pushdatads
pyloadstr --> pushdatasql
cleartable -- 2 --> dataflows
dataflows --> pushdatasql
cleartable -- 3 --> pushdatacopy
pushdatasql --> done
pushdatacopy --> done
pushdatads --> done
```
#### Pros and Cons of different approaches
| Criteria | Datastore Write API | PG Copy | Dataflows |
|----------|:--------------------|:--------|----------:|
| Speed | Low | High | ??? |
| Error Reporting | Yes | Yes | No(?) |
| Ease of implementation | Yes | No(?) | Yes |
| Works with Big data | No | Yes | Yes(?) |
| Works well in parallel | No | Yes(?) | Yes(?) |
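To make the "PG Copy" option concrete, here is a minimal sketch of loading a CSV into an existing DataStore table using psycopg2's COPY support. The connection string and table name are hypothetical, and this is not the XLoader implementation (that is linked further down).
```python
# Sketch only: COPY a CSV straight into a Postgres table in the DataStore DB.
import psycopg2

DSN = "postgresql://ckan:pass@localhost/datastore_default"  # hypothetical connection string

def copy_csv_into_datastore(csv_path, table):
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur, open(csv_path) as f:
            # HEADER skips the CSV header row; values load as the table's declared types
            cur.copy_expert(
                'COPY "{}" FROM STDIN WITH (FORMAT csv, HEADER true)'.format(table), f
            )
    finally:
        conn.close()

copy_csv_into_datastore("data.csv", "my-resource-id")
```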
### DataFlows
https://github.com/datahq/dataflows
DataFlows is a framework for loading, processing and manipulating data (see the sketch below).
* Loader (Loading from external source (or disk)): https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py
* Load to an SQL db (Dump processed data) https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/to_sql.py
* What does error reporting look like, what is the runner system, does it have a UI, does it have a queue system?
* Data Package Pipelines seems to take care of all of these: https://github.com/frictionlessdata/datapackage-pipelines
* DPP itself is also an ETL framework, just much heavier and a bit more complicated.
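As a rough illustration of the DataFlows SQL-loading path described above, a minimal sketch assuming the `dataflows` package; the table name and connection string are hypothetical.
```python
# Sketch: load a CSV and dump it into a SQL database with DataFlows.
from dataflows import Flow, load, dump_to_sql

Flow(
    load("data.csv"),  # the loaded resource is typically named after the file, here "data"
    dump_to_sql(
        {"my_table": {"resource-name": "data"}},
        engine="postgresql://user:pass@localhost/datastore",
    ),
).process()
```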
### Notes and Q&A (Sep 2019)
* Note: TDR needs info on CKAN Resource source so we can create right datastore entry ..
* No need to validate as we assume it is good ...
* We might want to do that ... still
* Pros and Cons
* Speed
* Error reporting ...
* What happens with Copy if you hit an error (e.g. a bad cast?)
* https://infinum.co/the-capsized-eight/superfast-csv-imports-using-postgresqls-copy
* https://wiki.postgresql.org/wiki/Error_logging_in_COPY
* Ease of implementation
* Good with inserting Big data
* Create as strings and cast later ... ?
* xloader implementation with COPY command: https://github.com/ckan/ckanext-xloader/blob/fb17763fc7726084f67f6ebd640809ecc055b3a2/ckanext/xloader/loader.py#L40
* Raw insert: ~15m (on 1m rows)
* Insert with begin/commit: ~5m
* COPY: ~82s (though it may be limited by bandwidth) -- and what happens if the pipe breaks?
Q: Is it better to put everything in the DB as strings and cast later, or to cast and then insert into the DB?
A: Probably cast first and insert after.
Q: Why do we rush to insert the data into the DB? We will have to wait until it's cast anyway before use.
A: It's much faster to do operations in the DB than outside it.


@ -1,593 +0,0 @@
# Notebook
Our lab notebook. Informal thoughts. A very raw blog.
# Data Literate - a small Product Idea 2021-05-17 @rufuspollock
I want to write a README with data and vis in it and preview it ...
* Markdown is becoming a lingua franca for writing developer and even research docs
* It's quick and ascii-like
* It's widely supported
* It's extensible ...
* Frontend tooling is rapidly evolving ...
* The distance between code and a tool is declining => I might as well write code ...
* MDX = Markdown + react
* RStudio did this a while ago ...
* Missing part is data ...
* You have Jupyter notebooks etc ... => they are quite high end ...
```
Notebooks (jupyter, literate programming) ==>
Write text and code together
Write code like in a terminal
Data oriented
```
Visualization
React
Markdown ...
---
Here are the kinds of docs I want to write:
```
## A Dataset
\```
# Global Solar Supply (Annual)
Solar energy supply globally.
Source: International Energy Association https://www.iea.org/reports/solar-pv.
| Year | Generation (TWh) | % of total energy |
|--|--|--|
|2008|12|
|2009|20|
|2010|32|
|2011|63|
|2012|99|
|2013|139|
|2014|190|
|2015|251|
|2016|329|
|2017|444|
|2018|585|
|2019|720| 2.7 |
\```
Europe Brent Spot Prices (Annual/ Monthly/ Weekly/ Daily) from U.S. Energy Information Administration (EIA).
Source: https://www.eia.gov/dnav/pet/hist/RBRTEd.htm
```
### Notes
R Markdown - https://rmarkdown.rstudio.com
> Use a productive notebook interface to weave together narrative text and code to produce elegantly formatted output.
## A DMS is a tool, a Data Portal is a solution 2021-03-14 @rufuspollock
Over the years, we've seen many different terms used to describe software like CKAN and the solutions it is used to create e.g. data catalog, data portal, data management system, data platform etc.
Over time, personally, I've converged towards [data management system (DMS)](/docs/dms/dms) and [data portal](/docs/dms/data-portals). But I've still got two terms and even people in my own team ask me to clarify what the difference is. Recently it became clear to me:
**A data management system (DMS) is a tool. A data portal is a solution.**
And a data management system is a tool you can use to build a data portal. Just like you can use a hammer to build a house.
## Data Factory Concepts 2020-09-03 @rufuspollock
Had this conceptual diagram hanging round for a couple of years.
```
Objects
Row
File
Dataset
Transformations
Operator
Pipeline
Flow
```
NTS
* A factory could be a (DA)G of flows b/c there could be dependencies between them ... e.g. run ComputeKeyMetrics only after all other flows have run ...
* But not always like that: flows can be completely independent.
## Current Data Factory Components (early 2019)
```
Factory - Runners, SaaS platform etc
datapackage-pipelines -> (dataflows-server / dataflows-runner)
dataflows-cli : generators, hello-world, 'init', 'run'
goodtables.io
"blueprints": standard setups for a factory (auto-guessed?)
DataFlows: Flow Libs
dataflows : processor definition and chaining
processors-library: stdlib, user contributed items [dataflows-load-csv]
Data Package Libs
data.py, DataPackage-py, GoodTables, ...
tabulator, tableschema
```
## Composition vs Inheritance approaches to building applications and esp web apps 2020-08-20 @rufuspollock
tl;dr: composition is better than inheritance but many systems are built with inheritance
Imagine we want a page like this:
```
<header>
{{title}}
</footer>
```
Inheritance / Slot model
```
def render_home():
    return render('base_template.html', {
        'title': 'hello world'
    })
```
Composition / declarative
```
def render_home():
    mytitle = 'hello world'
    response.write(get_header())
    response.write(mytitle)
    response.write(get_footer())
```
You can write templates two ways:
### Inheritance
Base template
```html
# base.html
<header>
<title>{{title}}</title>
{{content}}
</footer>
```
`blog-post.html`
```
{% extends "base.html" %}
<block name="title">{{title}} -- Blog</block>
```
### Composition
`blog-post.html`
```
<include/partial name='header.html' title="..."/>
{{content}}
<include name='footer.html' />
```
## How Views should work in CKAN v3 (Next Gen) 2020-08-10 @rufuspollock
Two key points:
1. Views should become an explicit (data) project level object
2. Previews: should be very simple and work *without* data API
Why?
* Views should become an explicit (data) project level object
* So I can show a view on dataset page (atm I can't)
* I have multiview view inside reclinejs but rest are single views ... (this is confusing)
* I can't create views across multiple resources
* They are nested under resources but they aren't really part of a resource
* Previews: should be very simple and work *without* data API
* so they work with revisions (atm views often depend on data API which causes problems with revisions and viewing old revisions of resources)
Distinguish 3 concepts
* Preview: a very simple method for previewing specific raw data types e.g. csv, excel, json, xml, text, geojson etc ...
* Key aspect are ability to sample a part and to present.
* Viz: graph, map, ... (visualizations)
* Query UI: UI for creating queries
* Viz Builder: a UI for creating charts, tables, maps etc
* Explorer (dashboard): combines query UI, Viz Builder and Viz Renderer
## What a (future) Data Project looks like
NB: To understand what a project is see [DMS](/docs/dms/dms).
It helps me to be very concrete and imagine what this looks like on disk:
```
datapackage.json # dataset
data.csv # dataset resources (could be anywhere)
views.yml
data-api.yml
flows.yml
```
Or, a bit more elegantly:
```
data/
gdp.csv | gdp.pdf | ...
views/
graph.json | table.json | ...
api/
gdp.json | gdp-ppp.json | ...
flows/
...
README.md
datapackage.json # ? does this just contain resources or more than that? Just resources
```
## Data Factory 2020-07-23 @rufuspollock
I've used the term Data Factory but it's not in common use. At the same time, there doesn't seem to be a good term out there in common use for what I'm referring to.
What am I referring to?
I can start with terms we do have decent-ish terminology for: data pipelines or data flows.
Idea is reasonably simple: I'm doing stuff to some data and it involves a series of processing steps which I want to run in order. It may get more complex: rather than a linear sequence of tasks I may have branching and/or parallelization.
Because data "flows" through these steps from one end to the otehr we end up with terminology like "flow" or "pipeline".
Broken down into its components we have two things:
* Tasks: the individual processing nodes, i.e. a single operator on a unit of data (aka Processors / Operators)
* Pipeline: which combines these tasks into a whole (aka Flow, Graph, DAG ...). It is a DAG (directed acyclic graph) of processors starting from one or more sources and ending in one or more sinks. Note the simple case of a linear flow is very common.
[NB: tasks in an actual flow could either be bespoke to that flow or configured instances of a template / abstract task, e.g. "load from s3" might be a template task which, as a live task in an actual flow, will be configured with a specific bucket. See the sketch below.]
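To make the Task/Pipeline split concrete, here is a hypothetical, minimal Airflow DAG: two configured task instances chained into a linear pipeline. (Operator import paths differ between Airflow 1.x and 2.x.)
```python
# Sketch only, not AirCan: a two-task linear DAG in Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator in 1.x

def fetch():
    ...  # read the source data

def load():
    ...  # push the data to its destination

with DAG(dag_id="toy_data_flow", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    fetch_task = PythonOperator(task_id="fetch", python_callable=fetch)  # a configured task
    load_task = PythonOperator(task_id="load", python_callable=load)     # another task
    fetch_task >> load_task  # the pipeline: a (linear) DAG of tasks
```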
So far, so good.
But what is the name for a system (like AirFlow, Luigi) for creating, managing and running data pipelines?
My answer: a Data Factory.
What I don't like with this is that it messes with the metaphor: factories are different from pipelines. If we went with Data Factory then we should talk about "assembly lines" rather than "pipelines" and "workers" (?) rather than tasks. If one stuck with water we'd end up with something like a Data Plant but that sounds weird.
Analogy: for data we clearly have a file and dataset. And a system for organizing and managing datasets is a data catalog or data management system. So what's the name for the system for processing datasets ...
And finally :checkered_flag: I should mention [AirCan][], our own Data Factory system we've been developing built on AirFlow.
[AirCan]: https://github.com/datopian/aircan/
### Appendix: Terminology match up
| Concept | Airflow | Luigi |
|---------|----------|-------|
| Processor | Task | Task |
| Pipeline | DAG | ? |
| Pipeline (complex, branching) | DAG | ? |
## Commonalities of Harvesting and Data(Store) Loading 2020-06-01 @rufuspollock
tags: portal, load, factory
Harvesting and data loading (for data API) are almost identical as mechanisms. As such, they can share the same "data factory" system.
Data API load to backing DB (CKAN DataStore + DataPusher stuff)
```mermaid
graph LR
subgraph Factory
read --> process
process --> load
orchestrator
end
load --> db[DB = DataStore]
orchestrator --> api
api --> wui[Dataset Admin UI]
```
Harvesting
```mermaid
graph LR
subgraph Factory
read --> process
process --> load[Load - sync with CKAN]
orchestrator
end
load --> db[DB = MetaStore]
orchestrator --> api
api --> wui[Harvesting Admin UI]
```
## 10 things I regret about NodeJS by Ryan Dahl (2018) 2020-05-17 @rufuspollock
Valuable generally and more great lessons for data packaging.
https://www.youtube.com/watch?v=M3BM9TB-8yA
### package.json and npm were a mistake
Why package.json was a mistake: https://youtu.be/M3BM9TB-8yA?t=595
![](https://i.imgur.com/Ia0qtVm.png)
* `npm` + a centralized repo are unnecessary (and were a mistake)
* Doesn't like the centralized npm repo (I agree) -- look at what Go is doing. Sure, you probably end up with something via the backdoor (e.g. Go is getting that) for caching and reliability purposes, but it is not strictly necessary.
ASIDE: It's the core tool (node) that makes a metadata format and packager relevant and successful: it's node's require allowing use of `package.json`, and `npm` being bundled in by default.
* Kind of obvious when you think about it
* Something I've always said re Data Packages (but not strongly enough and not followed enough): the tooling comes first and the format is, in many ways, a secondary convenience. To be fair to myself (and others) we did write `dpm` first (in 2007!), and do a lot of stuff with the workflow and toolchain, but it's easy to get distracted.
![](https://i.imgur.com/5E0Hffs.png)
* modules aren't strictly necessary and `package.json` metadata is not needed
* on the web you just have js files and you can include them ...
* "package.json has all this unnecessary noise in it. like license, repository. Why am i filling this out. I feel like a book-keeper or something. This is just unnecessary stuff *to do* when all I am trying to do is link to a library" [ed: **I think this is of major relevance for Data Packages. There's a tension between the book-keepers who want lots of metadata for publishing etc ... and ... the data scientists and data engineers who just want to get something done. If I had a choice (and I do!) I would prioritize the latter massively. And they just care about stuff like a) table schema b) get me the data on my hard disk fast**]
* "If only relative files and URLs were used when importing, the path defines the version. There is no need to list dependencies" [ed: cf Go that did this right]
* And he's borrowed from Go for deno.land
### Vendoring by default with `node_modules` was a mistake
Vendoring by default with `node_modules` was a mistake - just use an env variable `$NODE_PATH`
* `node_modules` then becomes massive
* module resolution is (needlessly) complex
### General point: KISS / YAGNI
E.g. `index.js` was "cute" but unnecessary; allowing `require xxx` without an extension (e.g. `.js` or `.ts`) means you have to probe the file system.
+data package. +data packaging. +frictionless. +lessons
## Go modules and dependency management (re data package management) 2020-05-16 @rufuspollock
Generally Go does stuff well. They also punted on dependency management initially. First, you just installed from a URL and it was up to you to manage your dependencies. Then there was a period of chaos as several package/dependency managers fought it out (GoDeps etc). Then, around 2018, the community came together, led by Russ Cox, and came up with a very solid proposal, which is official as of 2019.
Go's approach to module (package) and dependency management can be an inspiration for Frictionless and Data Packages. Just as we learnt and borrowed a lot from Python and Node so we can learn and borrow from Go.
1. Overview (by Russ Cox the author): https://research.swtch.com/vgo
2. The Principles of Versioning in Go https://research.swtch.com/vgo-principles
3. A Tour of Versioned Go (vgo) https://research.swtch.com/vgo-tour
4. cmd/go: add package version support to Go toolchain https://github.com/golang/go/issues/24301
5. Using Go Modules - https://blog.golang.org/using-go-modules (official introduction on go blog)
6. Publishing Go Modules https://blog.golang.org/publishing-go-modules
7. Main wiki article and overview https://github.com/golang/go/wiki/Modules
### Key principles
> These are the three principles of versioning in Go, the reasons that the design of Go modules deviates from the design of Dep, Cargo, Bundler, and others.
>
> 1. Compatibility. The meaning of a name in a program should not change over time.
> 2. Repeatability. The result of a build of a given version of a package should not change over time. https://research.swtch.com/vgo-principles#repeatability
> 3. Cooperation. To maintain the Go package ecosystem, we must all work together. Tools cannot work around a lack of cooperation.
Summary
* Go used urls for identifiers for packages (including special cases for github)
* e.g. `import rsc.io/quote`
* Brilliant! No more dependency resolution via some central service. Just use the internet.
* Go installed packages via `go get` e.g. `go get rsc.io/quote`. This would install the module into `$GOPATH` at `rsc.io/quote`
* They did the absolute minimum: grab the files onto your hard disk under `$GOPATH/src`. `import` would then search this (IIUC)
* There was no way originally to get a version but with go modules (go > 1.11) you could do `go get rsc.io/quote@[version]`
* Dependency management is actually complex: satisfying dependency requirements is NP complete. Solve this by ...
* The Node/Bundler/Cargo/Dep approach is one answer. Allow authors to specify arbitrary constraints on their dependencies. Build a given target by collecting all its dependencies recursively and finding a configuration satisfying all those constraints. => SAT solver => this is complex.
* Go has a different solution
* Versioning is central to dependency management => you need to get really clear on versioning. Establish a community rule that you can only break compatibility with major versions ...
* Put breaking version (e.g. major versions) **into the url** so that you actually have a different package ...
> For Go modules, we gave this old advice a new name, the import compatibility rule:
>> If an old package and a new package have the same import path,
>> the new package must be backwards compatible with the old package.
![](https://research.swtch.com/impver@2x.png)
* Install the minimal version of a package that satisfies the requirements (rather than the maximal version) => this yields repeatability (principle 2)
* In summary Go differs in that: all versions are explicit (no `<=`, `>=`). Since we can assume that all later versions of a module are backwards compatible (and any breaking change generates a new module with an explicit `vX` in the name) we can simply cycle through a module and its dependencies, find the highest version listed, and install that.
* Publishing a module is just pushing to github/gitlab or putting it somewhere on the web -- see https://blog.golang.org/publishing-go-modules
* Tagging versions is done with git tag
* "A module is a collection of related Go packages that are versioned together as a single unit."
Layout on disk in a module (see e.g. https://blog.golang.org/publishing-go-modules). Main file `go.mod` and one extra for storing hashes for verification (it's not a lock file)
```
$ cat go.mod
module example.com/hello
go 1.12
require rsc.io/quote/v3 v3.1.0
$ cat go.sum
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c h1:qgOY6WgZOaTkIIMiVjBQcw93ERBE4m30iBm00nkL0i8=
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
rsc.io/quote/v3 v3.1.0 h1:9JKUTTIUgS6kzR9mK1YuGKv6Nl+DijDNIc0ghT58FaY=
rsc.io/quote/v3 v3.1.0/go.mod h1:yEA65RcK8LyAZtP9Kv3t0HmxON59tX3rD+tICJqUlj0=
rsc.io/sampler v1.3.0 h1:7uVkIFmeBqHfdjD+gZwtXXI+RODJ2Wc4O7MPEh/QiW4=
rsc.io/sampler v1.3.0/go.mod h1:T1hPZKmBbMNahiBKFy5HrXp6adAjACjK9JXDnKaTXpA=
```
### Asides
#### Vendoring is an incomplete solution to package versioning problem
> More fundamentally, vendoring is an incomplete solution to the package versioning problem. It only provides reproducible builds. It does nothing to help understand package versions and decide which version of a package to use. Package managers like glide and dep add the concept of versioning onto Go builds implicitly, without direct toolchain support, by setting up the vendor directory a certain way. As a result, the many tools in the Go ecosystem cannot be made properly aware of versions. It's clear that Go needs direct toolchain support for package versions. https://research.swtch.com/vgo-intro
## 2020-05-16 @rufuspollock
Ruthlessly retain compatibility after v1 - inspiration from Go for Frictionless
> It is intended that programs written to the Go 1 specification will continue to compile and run correctly, unchanged, over the lifetime of that specification. Go programs that work today should continue to work even as future “point” releases of Go 1 arise (Go 1.1, Go 1.2, etc.).
>
> — https://golang.org/doc/go1compat
And they go further -- not just Go but also Go packages:
> Packages intended for public use should try to maintain backwards compatibility as they evolve. The Go 1 compatibility guidelines are a good reference here: dont remove exported names, encourage tagged composite literals, and so on. If different functionality is required, add a new name instead of changing an old one. If a complete break is required, create a new package with a new import path.
>
> The Go FAQ has said this since Go 1.2 in November 2013
+frictionless
## 2020-05-15 @rufuspollock
`Project` should be the primary object in a DataHub/Data Portal -- not Dataset.
Why? Because actually this is more than a Dataset. For example, it includes issues or workflows. A project is a good name for this that is both generic and specific.
cf. git hubs, e.g. Gitlab (and Github). Gitlab came later and did this right: its primary object is a Project which hasA Repository. Github still insists on calling them repositories (see the primary menu item which is "Create a new repository"). This is weird: a Github "repository" isn't actually just a repository: it has issues, stats, workflows and even a discussion board now. Calling it a project is the accurate description and the repository label is a historical artifact from when that was all it was. I sometimes create "repos" on Github just to have an issue tracker. Gitlab understands this and actually allows me to have projects without any associated repository.
TODO: take a screenshot to illustrate Gitlab and Github.
+flashes of insight. +datahub +data portal. +domain model.
## 2020-04-23 @rufuspollock
4 Stores of a DataHub
>[!tip]Naming is one of the most important things -- and hardest!
* MetaStore [service]: API (and store) of the metadata for datasets
* HubStore [service]: API for registry of datasets (+ potentially organizations and ownership relationships to datasets)
* BlobStore [service]: API for blobs of data
* StructuredStore [service]: API for structured data
Origins:
* Started using MetaStore in DataHub.io back in 2016
* Not used in CKAN v2
* Conceptually CKAN originally was MetaStore and HubStore.
In CKAN v2:
* MetaStore and HubStore (no explicit name) => main Postgres DB
* StructuredStore (called DataStore) => another separate Postgres DB
* BlobStore (called FileStore) => local disk (or cloud with an extension)
In CKAN v3: propose to separate these explicitly ...
## Data Portal vs DataHub vs Data OS 2020-04-23 @rufuspollock
Data Portal vs DataHub vs Data OS -- naming and definitions.
Is a Data Portal a DataHub? Is a DataHub a DataOS? If not, what are the differences?
+todo
## Data Concepts - from Atoms to Organisms 2020-03-05 @rufuspollock
```
Point -> Line -> Plane (0d -> 1d -> 2d -> 3d)
Atom -> Molecule -> Cell -> Organism
```
```mermaid
graph TD
cell[Cell]
row[Row]
column[Column]
table["Table (Sheet)"]
cube["Database (Dataset)"]
cell --> row
cell --> column
row --> table
column --> table
table --> cube
```
```
Domain =>
Dimension
|
V
```
| Dimension | ... | Math | Spreadsheets | Databases | Tables etc | Frictionless | Pandas | R |
|--|--|--|--|--|--|--|--|--|
| 0d | Datum | Value | Cell | Value | Scalar? | N/A | Value | ? |
| 1d | .... | Array / Vector | Row | Row | Row | N/A | | |
| 1d | .... | N/A | Column | | | | | |
| 2d | Grid? | Matrix | Sheet | Table | Table | TableSchema | | |
| 3d | Cube | 3d Matrix | Spreadsheet | Database (or Cube) | N/A | ? | ? | ? |
| 4d+ | HyperCube | n-d Matrix | | | | | | |
What's crucial about a table is that it is not just an array of arrays or a rowset but a rowset plus a fieldset.
```
Field => FieldSet
Row => RowSet
```
A Table is a FieldSet x RowSet (+ other information)
There is the question of whether there is some kind of connection or commonality at each dimension up ... e.g. you could have an array of arrays where each array has different fields ...
```json
{ "first": "joe", "height": 3 }
{ "last": "gautama", "weight": 50 }
```
But a table has common fields.
```json
{ "first": "joe", "height": 3 }
{ "first": "siddarta", "height": 50 }
```
(NB: one could always force a group of rows with disparate fields into being a table by creating the union of all the fields but that's hacky)
So a table is a RowSet plus a FieldSet where each row conforms to that FieldSet.
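A toy sketch of that idea (the names are illustrative, not a real library):
```python
# A Table is a FieldSet plus a RowSet where every row conforms to the FieldSet.
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Table:
    fields: List[str]            # the FieldSet (common to every row)
    rows: List[Dict[str, Any]]   # the RowSet

    def is_consistent(self) -> bool:
        # every row must use exactly the common fields
        return all(set(row) == set(self.fields) for row in self.rows)

t = Table(fields=["first", "height"],
          rows=[{"first": "joe", "height": 3},
                {"first": "siddarta", "height": 50}])
assert t.is_consistent()
```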
Similarly, we can aggregate tables. By default here the tables do *not* share any commonality -- sheets in a spreadsheet need not share any common aspect, nor do tables in a database. If they do, then we have a cube.
NEXT: moving to flows / processes.
## Is there room / need for a simple dataflow library? 2020-02-23 @rufuspollock
Is there room / need for a simple dataflow library ...?
What kind of library?
* So Apache Beam /Google DataFlow is great ... and it is pretty complex and designed for use out of the box in parallel etc
* Apache Nifi: got a nice UI, Java and heavy duty.
* My own experience (even just now with key metrics) is I want something that will load and pipe data between processors
Ideas
* create-react-app for data flows: quickly scaffold data flows
* What is the default runner


@ -1,60 +0,0 @@
# Permissions (Authorization)
As a Data Portal Owner I want only authenticated and authorized users to view (e.g. my staff), to edit (specific groups) so that we can put data in the portal and know that only appropriate people can use and contribute
* Access to data is only to those we have authorized (and we don't give access to public or competitors unless we choose to!)
* we don't disclose information inappropriately internally (e.g. info with privacy restrictions)
* People don't accidentally edit others datasets
Permissions breaks down into two parts:
* Authentication: who are you?
* Authorization: what can you do? => much bigger
## Authorization
As a Dataset Owner I want to be able to limit access, editing etc to my datasets at several levels and using org/teams and potentially other mechanisms so that I can easily comply with PII restrictions whilst making my data as widely available as possible and enabling collaborators to contribute easily
### Differentiating Metadata and Data Access
As a Dataset Owner I want to allow viewing of the dataset metadata including the list of resources whilst limiting access to the data itself (e.g. restricting download) so that I can allow others to discover the data i have (and request access) whilst complying with restrictions on data access (e.g. PII)
* TODO: what about *pre*viewing?
### Editing Controls
As a Dataset Owner I want to restrict those who can edit my dataset so that only those I authorize can edit the dataset
* I probably want to do this in bulk e.g. add my whole organization/team
### Update Permissions
As a Dataset Owner I want to control who has the ability to change permissions on my dataset so that only people I choose can do this ...
* Default would be e.g. Dataset Owner + Org Admin can do this ...
* Are other options desired / possible?
### Private Datasets
As a Dataset Owner I want to make a dataset "private" so that it is only visible to those who have "edit" access on the dataset and is invisible to everyone else
### Adding one-off collaborators
As a Dataset Owner I want to add someone outside of my organization to a restricted dataset so they can collaborate and review
### Differential resource access restrictions
As a Dataset Owner I want to grant different levels of access to resources in a dataset so that I can make some resources private and others public (because maybe one resource contains PII)
### I want to reuse the team/org structure already in LDAP
As an Org/Team manager I don't want to have to add everyone in my team again in CKAN when I have this already in LDAP, so that I save time and avoid the risk that things go out of sync
### Not Permissions (?)
#### Pre-Release Limits on Datasets
As a Dataset Owner (?? maybe someone else) I want to have a workflow for reviewing datasets before they go "public" so that they are a) in a good quality state b) are compliant with any regulations (e.g. around PII)
TODO: Is this really related to permissions?? Seems a broader issue ...


@ -1,427 +0,0 @@
# Publish Data
## Introduction
Publish functionality covers the whole area of creating and editing datasets and resources, including data upload. The core job story is something like:
> When a Data Curator has a data file or dataset they want to add it to their data portal/platform quickly and easily so that it is available there.
Publication can be divided by its *mode*:
* **Manual**: publication is done by people via a user interfaces or other tool
* **Programmatic**: publication is done programmatically using APIs and is usually part of automated processes
* **Hybrid**: which combines manual and programmatic. An example would be harvesting where setup and configuration may be done in a UI manually with the process then running automatically and programmatically (in addition, some new harvesting flows require manual programmatic setup e.g. writing a harvester in Python for a new source data format).
**Focus on Manual**: we will focus on the manual mode in this section: programmatic publication is by nature largely up to the client programmer (assuming the APIs are there), whilst [Harvesting][] has a section of its own. That said, many concepts here are relevant for other cases, e.g. the material on [profiles][] and [schemas][].
**Data uploading**: included in publish is the process of uploading data into the DMS, and specifically into [storage][] and especially [(blob) storage][blob].
[Harvesting]: /docs/dms/harvesting
[profiles]: /docs/dms/glossary#profile
[schemas]: /docs/dms/glossary#schema
[storage]: /docs/dms/storage
[blob]: /docs/dms/blob-storage
### Examples
At its simplest, a publishing process can just involve providing a few metadata fields in a form -- with the data itself stored elsewhere.
At the other end of the spectrum, we could have a multi-stage and complex process like this:
* Multiple (simultaneous) resource upload with shared metadata e.g. I'm creating a timeseries dataset with the last 12 months of data and I want each file to share the same column information but to have different titles
* A variety of metadata profiles
* Data validation (prior to ingest) including checking for PII (personally identifiable information)
* Complex workflow related to approval e.g. only publish if at least two people have approved
* Embargoing (only make public at time X)
### Features
* Create and edit datasets and resources
* File upload as part of resource creation
* Custom metadata for both profile and schemas
### Job Stories
When a Data Curator has a data file or dataset they want to add it manually (e.g. via drag and drop etc) to their data portal quickly and easily so that it is available there.
More specifically: As a Data Curator I want to drop a file in and edit the metadata and have it saved in a really awesome interactive way so that the data is “imported” and of good quality (and I get feedback)
#### Resources
>[!tip]A resource is any data item in a dataset e.g. a file.
When adding a resource to a dataset I want metadata pre-entered for me (e.g. resource name from file name, encoding, ...) to save time and reduce errors
When adding a resource to a dataset I want to be able to edit the metadata whilst uploading so that I save time
When uploading a resource's data as part of adding a resource to a dataset I want to see upload progress so that I have a sense of how long this will take
When adding resources to a dataset I want to be able to add and upload multiple files at once so that I save time and make one big change
When adding a resource which is tabular (e.g. csv, excel) I want to enter the (table) schema (i.e. the names, description and types of columns) so that my data is more useable, presentable, importable (e.g. to DataStore) and validatable
When adding a resource which is currently stored in dropbox/gdrive/onedrive I want to pull the bytes directly from there so as to speed up the upload process
### Domain Model
The domain model here is that of the [DMS](/docs/dms/dms) and we recommend visiting that page for more information. The key items are:
* Project
* Dataset
* Resource
[DMS]: /docs/dms/dms
### Principles
* Most ordinary data users don't distinguish resources and datasets in their everyday use. They also prefer a single (denormalized) view onto their data.
* Normalization is not normal for users (it is a convenience, economisation and consistency device)
* And in any case most of us start from files not datasets (even if datasets evolve later).
* Minimize the information the user has to provide to get going. For example, does a user *have* to provide a license to start with? If that is not absolutely required leave this item for later.
* Automate where you can but only where you can guess reliably. If you do guess, give the user ability to modify. Otherwise, magic often turns into mud. For example, if we are guessing file types let the user check and correct this.
## Flows
* Publish flows are highly custom: different platforms have different needs
* At the same time there are core components that most people will use (and customize) e.g. uploading a file, adding dataset metadata etc
* The flows shown here are therefore illustrative and inspirational rather than definitive
### Evolution of a Flow
Here's a simple illustration of how a publishing flow might evolve:
```mermaid
graph LR
a[Add a file]
b[Add metadata]
c[Save]
a --> b
b --> c
```
```mermaid
graph LR
a[Add a file]
b[Add metadata]
c[Save]
d[Add table schema]
a --> b
b -.-> d
d -.-> c
```
```mermaid
graph LR
a[Add a file]
b[Add metadata]
c[Save]
d[Add table schema]
e[Check for PII]
a --> b
b -.-> d
d -.-> e
e --> c
```
PII = personally identifiable info
### The 30,000 foot view
```mermaid
graph TD
user[User with data and/or metadata]
publish[Publish process]
platform["Storage (metadata and blobs)"]
user --> publish --> platform
```
### Add Dataset: High Level
```mermaid
graph TD
project[Project/Dataset create]
addres["Add resource(s)"]
save[Save / Commit]
project --> addres
addres --> save
```
### Add Dataset: Mid Level
```mermaid
graph TD
project[Project/Dataset create]
addres["Add resource(s)"]
addmeta["Add dataset metadata"]
save[Save / Commit]
project --> addres
addres -.optional.-> addmeta
addmeta -.-> save
addres -.-> save
```
The approach above is "file driven" rather than "metadata driven", in the sense that you start by providing a file rather than providing metadata.
Here's the metadata driven flow:
```mermaid
graph TD
project[Project/Dataset create]
addres["Add resource(s)"]
addmeta[Add dataset metadata]
save[Save / Commit]
project --> addmeta
addmeta --> addres
addres --> save
addmeta -.-> save
```
>[!tip]Comment: The file driven approach is preferable.
We think the "file driven" approach where the flow starts with a user adding and uploading a file (and then adding metadata) is preferable to a metadata driven approach where you start with a dataset and metadata and then add files (as is the default today in CKAN).
Why do we think a file driven approach is better? a) a file is what the user has immediately to hand b) it is concrete whilst "metadata" is abstract c) common tools for storing files e.g. Dropbox or Google Drive start with providing a file - only later, and optionally, do you rename it, move it etc.
That said, tools like GitHub or Gitlab require one to create a "project", albeit a minimal one, before being able to push any content. However, GitHub and Gitlab are developer-oriented tools that can assume a willingness to tolerate a slightly more cumbersome UX. Furthermore, the default use case is that you have a git repo that you wish to push, so the use case of a non-technical user uploading files is clearly secondary. Finally, in these systems you can create a project just to have an issue tracker or wiki (without having file storage). In this case, creating the project first makes sense.
In a DMS, we are often dealing with non-technical or semi-technical users. Thus, providing a simple, intuitive flow is preferable. That said, one may still have a very lightweight project creation flow so that we have a container for the files (just as in, say, Google Drive you already have a folder to put your files in).
### Dataset Metadata editor
There are lots of ways this can be designed. We always encourage minimalism.
* Adding information e.g. license, description, author …
* ? Default the license (and explain what the license means …)
### Add a Resource
From here on, we'll zoom in on the "publish" part of that process. Let's start with the simplest case of adding a single resource in the form of an uploading file:
```mermaid
graph TD
addfile[Select a file]
metadata[Add Metadata]
upload[Upload to Storage]
save[Save]
addfile -.in the background.-> upload
addfile --> metadata
upload -.progress reporting.-> save
metadata --> save
```
Notes
* Alternative to "Select a file" would be to just "Link" to a file that is already online and available
### Schema (Data Dictionary) for a Resource
One part of a publishing flow would be to describe the [schema][] for a resource. Usually, we restrict this to tabular data resources and hence this is a Table Schema.
[schema]: /docs/dms/glossary#schema
Usually adding and editing a schema for a resource will be an integrated part of managing the overall metadata for the resource but it can also be a standalone step. The following flow focuses solely on the add schema:
```mermaid
graph TD
addfile[Select a file]
infer[Infer the fields, their names and perhaps their types]
edit[Edit the fields and their details e.g. description, types]
save[Save]
addfile --> infer
infer --> edit
edit --> save
```
Notes
* We recommend using [Frictionless Table Schema][] as the format for storing table schema information (a minimal example follows below)
[Frictionless Table Schema]: https://frictionlessdata.io/table-schema/
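For illustration, a minimal Table Schema of the kind such an editor might produce (field names, types and descriptions are hypothetical), written here as a Python dict:
```python
# Sketch: a minimal Frictionless Table Schema for a two-column tabular resource.
table_schema = {
    "fields": [
        {"name": "year", "type": "integer", "description": "Year of observation"},
        {"name": "generation_twh", "type": "number", "description": "Generation in TWh"},
    ],
    "missingValues": [""],
}
```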
#### Schema editor
**Fig 1.2: Schema editor wireframe**
<img src="https://docs.google.com/drawings/d/e/2PACX-1vRD7XUc9iBYjEH11Zqsfrk7tAv688UTqEJMxOg4Bc-9p4Vkrcq8Oghpe5OfimfVoEzjfDRMLeUNIP63/pub?w=695&amp;h=434" />
* Can we add a title as well as a description? Maybe we should have both, but I often find them duplicative -- and why do people want a title ...? (For display in charting ...)
* Could pivot the display if there are lots of columns (e.g. have cols down the left). This is the traditional approach, e.g. in the CKAN 2 data dictionary
![](https://i.imgur.com/nhb5H7Q.png)
Advanced:
* Displaying validation errors could/should be live as you change types … (highlight with a hover)
* add semantic/taxonomy option (after format) i.e. ability to set rich type
#### Overview Deck
**Deck**: This deck (Feb 2019) provides an overview of the core flow of publishing a single tabular file e.g. CSV and includes a basic UI mockup illustrating the flow described below.
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQD09jo3Mwq-jM32rns_ehyd6GOv7cQ7F9UAK1U_jzO5G4ZgZ8ktG9rwK03-N-0XmQyJx-9kSW7-U4I/embed?start=false&loop=false&delayms=3000" frameborder="0" width="550" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true" />
#### Overview
For v1: could assume small data (e.g. < 5Mb) so we can load into memory ...?
**Tabular data only**
1. Load
   1. File select
   2. Detect type
   3. Preview <= start preview here and continue throughout (?)
   4. Choose the data
2. Structural check and correction
   1. Structural validation
   2. Error presentation
   3. Mini-ETL to correct errors
3. Table metadata
   1. [Review the headers]
   2. Infer data-types and review
   3. [Add constraints]
   4. Data validation (and correction?)
4. General metadata (if necessary)
   1. Title, description
   2. License
5. Publish (atm: just download metadata (and cleaned data))
#### 1. Load
1. User drops a file or uploads a file
* What about a url? Secondary for now
* What about import from other sources e.g. google sheets, dropbox etc? KISS => leave for now
* Size restrictions? Let's assume we're ok
* Error reporting: any errors loading the data file should be reported ...
* [Future]: in the background we'd be uploading this file to a file store while we do the rest of this process
* Tooling options: https://uppy.io/ (note it does lots more!), roll our own, filepicker.io (proprietary => no), ...
* How do we find something that just does file selection and provides us with the object
* [Final output] => a raw file object, raw file info (? or we already pass to data.js?)
2. Detect type / format (from file name ...)
* Prompt user to confirm the guess (or proceed automatically if guessed)?
* Tooling: data.js already does this ...
3. Choose the data (e.g. sheets from excel)
* Skip if CSV or if one sheet
* Multiple sheets:
* Present preview of the sheets ?? (limit to first 10 if a lot of sheets)
* Option of choosing all sheets
#### 2. Structural check and correction
1. Run a goodtables structure check on the data (see the sketch after this list)
* => ability to load a sample of the data (not all of it if big)
* => goodtables js version
2. Preview the data and show structural errors
3. [Optional / v2] Simple ETL in browser to correct this
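A rough sketch of step 1, using the goodtables Python library (the flow itself calls for the JS version; the Python API is shown only as an illustration, and `data.csv` is a placeholder):
```python
# Sketch: run a structural check on a sample of the file and list any errors.
from goodtables import validate

report = validate("data.csv", row_limit=1000)  # sample only part of a big file
if not report["valid"]:
    for table in report["tables"]:
        for error in table["errors"]:
            print(error["code"], error["message"])
```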
#### 3. Table metadata
All done in a tabular like view if possible.
Infer the types and present this in a view that allows review:
1. [Review the headers]
2. Infer data-types and review
3. [Add constraints] - optional and could leave out for now.
Then we do data validation against types (could do this live whilst they are editing ...)
4. Data validation (and correction?)
#### 4. General metadata (if necessary)
Add the general metadata.
1. Title, description
2. License
#### 5. Publish (atm: just download metadata (and cleaned data))
Show the dataresource.json and datapackage.json for now ...
## CKAN v2
In CKAN 2 the data publishing flow is an integral part of core CKAN. See this section of the user guide for a walkthrough: https://docs.ckan.org/en/2.9/user-guide.html#features-for-publishers
Key points of note:
* Classic python webapp approach using a combination of html templates (in Jinja) with processing code in Python backend code using controllers etc.
* Customization is done via client extensions using CKAN extensions model
* There is also a dedicated extension, ckanext-scheming, for creating forms from a JSON DSL.
### Data Dictionary
Integrated with DataStore extension since CKAN v2.7. Old documentation with visuals at:
https://extensions.ckan.org/extension/dictionary/
### Issues
* Classic webapp is showing its age vs modern javascript based web application development. Nowadays, you'd usually build a UI like this in javascript and e.g. React or VueJS. This has implications for both:
* UX: e.g. general lack of responsiveness, client side operations etc.
* Development: miss out on modern dev stack
* No client-side direct to cloud upload etc
* Extension model has got complex and cumbersome: the template inheritance model is now somewhat byzantine to navigate. Changing data structures can operate at multiple levels.
* The extension approach is "inheritance" based
* ckanext-scheming uses its own DSL. Today, one would probably use JSON Schema and use a javascript framework.
In short, building a rich UI like this today would almost certainly be done in pure JS in something like React.
## CKAN v3
We recommend a pure JS React-based approach. The CKAN dataset and resource editor becomes a React app.
We have developed a "DataPub(lishing)" framework that provides core components and template apps that you can use to get started building out a data publishing UI:
https://github.com/datopian/datapub/
### Design
See [Design page &raquo;][design]
[design]: /docs/dms/publish/design


@ -1,287 +0,0 @@
# Publish - Design
## Introduction
Design of a DMS publishing system, with a focus on CKAN v3 / DataHub.
Goal: an elegant single page application (that is easy to reuse and customize) that takes you from choosing your data file to a data package. Aka "a javascript app for turning a CSV into a data package".
This application would form the UI part of a "manual" user flow of importing a data file into a Data Portal.
### The framework approach
As a product, the Publish system should be thought of more as a framework than a single solution: a set of patterns, components and flows that can be composed together to build different applications and workflows.
A given solution is created by composing together different components into an overall flow.
This approach is designed to be extensible so that new workflows and their requirements can be easily accommodated.
## Design
### Principles and Remarks
* Simplicity: a focus on extraordinary, easy and elegant user experience
* Speed: not only responsiveness, but also speed to being "done"
* Tabular: focus on tabular data to start with
* Adaptable / Reusable: this system should be rapidly adaptable for other import flows
* A copy-and-paste-able template that is then adapted with minor changes
* Over time this may evolve to something configurable
* CKAN-compatible (?): (Longer term) e.g. be able to reuse ckanext-scheming config
### Technology
* Build in Javascript as a SPA (single page application)
* Use React as the framework
* ? Assume we can use NextJS as the SSR/SSG app
* Use Apollo local storage (rather than Redux) for state management
### Architecture
* Each step in a flow is "roughly" a react component
* Each component will pass information between itself and the central store / other components. Those structures that are related to Datasets and Resources will follow Frictionless formats.
* Encapsulate interaction with backend in a library. Backend is CKAN MetaStore and Blob Storage access, raw Blob Storage itself (almost certainly cloud)
* Split UI into clear components and even sub-applications (for example, a sub-application for resource adding)
* Use [Frictionless specs][f11s] for structuring the storage of data objects such as Dataset, Resource, Table Schema, etc.
* The specification of these formats themselves will be done in JSON schema and JSON Schema is what we use for specifying new metadata profiles (usually extensions or customizations of the base Frictionless ones)
* Data schemas are described using Table Schema (for tabular data) or JSON Schema.
Diagram: SDK library encapsulates interaction with backend
```mermaid
graph TD
client[Client]
subgraph Application
ui1[UI Component 1]
ui2[UI Component 2]
end
client --> ui1
client --> ui2
ui1 --> apollo[Apollo State Management]
apollo --internal state management--> apollo
apollo --> sdk[SDK]
sdk --> metastore[MetaStore]
sdk --> storagegate[Storage Gatekeeper]
sdk --> storage[Storage - likely Cloud]
client -.-> storage
```
Working assumptions
* Permissions is "outside" of the UI: we can assume that UI is only launched for people with relevant permissions and that permissions are ultimately enforced in the backend (i.e. if someone attempts to save a dataset they don't have permission for that will fail in the call to the backend). => we don't need to show/hide/restrict based on permissions.
## Key Flows
* **Ultra-simple resource(s) publishing**: Publish/share a resource (file) *fast* to a new project (and dataset) - ultra-simple version (like adding a file to Drive or DropBox)
* Implicitly creates a project and dataset
* **Publish resource(s) and make a dataset** Publish a file and create the dataset explicitly (ie. add title, license etc)
* **Add resource(s) to a dataset**: a new resource to an existing dataset
* Add multiple resources at once
* Edit the metadata of an existing dataset
Qu:
* Do we even permit the super-simple option? It's attractive but it brings some complexity to the UI (either we need to make the user provide project/dataset-level metadata at the end, or we guess it for them, and guessing usually goes wrong). Note that Github makes you create the "project" and its repo before you can push anything.
## Components
UI
* File uploader
* Resource
SDK
* File upload
* MetaStore
* (Project creation / updating)
* Dataset creation / updating
* ...
## Plan of Work
Task brainstorm
*
```mermaid
graph TD
```
## Design Research
### Uploading library
Atm we implement from scratch. Could we use an "off the shelf" solution, e.g. uppy?
Impressions of uppy:
- Good support
- Open-source MIT
- Beautiful design
- Customizable
- Support (dropbox, google drive, AWS S3, Tus, XHR)
Question:
* how to implement uppy + CKAN SDK?
* support for azure https://github.com/transloadit/uppy/issues/1591 (seems like it can work but maybe issue with large files (?))
[f11s]: https://f11s.com/
## Previous work
* 2019: https://github.com/datopian/import-ui - alpha React App. Working demo at http://datopian.github.io/import-ui
* 2018: https://github.com/datopian/data-import-ui (unfinished React App)
* https://github.com/datahq/pm/issues/90
* https://github.com/frictionlessdata/datapackage-ui
* Cf also openspending version
* can take from openspending but do it right :-)
* the spreadsheet view is best - see [example](https://docs.google.com/spreadsheets/d/1RoKbiTXaxT_N5Vio93Er-BA3ev3iwWlu4KYv-M7kvqc/edit#gid=0)
* maybe given option to rotate if a lot of rows
* v1 should assume tidy data
* (?) v2 should allow for some simple wrangling to get tidy data (??)
* This is a template for people building their own configurers
### Original Flow for DataHub `data` cli in 2016
Context:
* you are pushing the raw file
* and the extraction to get one or more data tables ...
* in the background we are creating a data package + pipeline
```
data push {file}
```
Algorithm:
1. Detect type / format
2. Choose the data (e.g. sheet from excel)
3. Review the headers
4. Infer data-types and review
5. [Add constraints]
6. Data validation
7. Upload
8. Get back a link - view page (or the raw url) e.g. http://datapackaged.com/core/finance-vix
* You can view, share, publish, [fork]
Details
1. Detect file type
1. file extension
1. Offer guess
2. Probable guess (options?)
3. Unknown - tell us
2. Detect encoding (for CSV)
2. Choose the data
1. Good data case
1. 1 sheet => ok
2. Multiple sheets guess and offer
3. Multiple sheets - ask them (which to include)
2. bad data case - e.g. selecting within table
3. Review the headers
* Here is what we found
* More than one option for headers - try to reconcile
## Appendix: Integration into CKAN v2 Flow
See https://github.com/datopian/datapub/issues/38
### Current system
```mermaid
graph TD
dnew[Click Dataset New]
dmeta[Dataset Metadata]
dnew --> dmeta
rnew[New Resource]
dmeta --saves dataset as draft--> rnew
rnew --finish--> dpage[Dataset Page]
rnew --add another--> rnew
rnew --previous-->dmeta
dedit[Click Edit Dataset] --> dupdate[Update Dataset Metadata]
dupdate --save/update--> dpage
dedit --resources--> res[Resources page]
res --add new--> rnew
res --click on existing resource --> redit[Resource Edit]
redit --delete--> dpage
redit --update--> resview[Resource view page]
```
### New system
```mermaid
graph TD
dnew[Click Dataset New]
dmeta[Dataset Metadata]
dnew --> dmeta
rnew[New Resource]
dmeta --saves dataset as draft--> rnew
rnew --save and publish + redirect--> dpage[Dataset Page]
dedit[Click Edit Dataset] --> dupdate[Update Dataset Metadata]
dupdate --save/update--> dpage
dedit --resources--> res[Resources page]
res --add new--> rnew
res --click on existing resource --> redit[Resource Edit]
redit --delete--> dpage
redit --update--> resview[Resource view page]
```
Resource editor
```mermaid
graph TD
start --> remove
remove --> showupload[Show Upload/Link options]
```
## Appendix: Project Creation Flow Comparison
### Github
Step 1
![](https://i.imgur.com/J5logpK.png)
Step 2
![](https://i.imgur.com/ruAQxCL.png)
### Gitlab
Step 1
![](https://i.imgur.com/jhtK4Ew.png)
Step 2
![](https://i.imgur.com/me8uTRo.png)
![](https://i.imgur.com/8HjRRFR.png)


@ -1,27 +0,0 @@
# Relationships (between Datasets)
## Why dataset relationships are a ~~bad~~ complex idea ...
Idea is simple: I want to show relationships between datasets e.g.
* This is part of that one
* This derives from that one
* This is a parent of (new version) of that one ...
But **why** do you want that? What does the user want to do? That turns out to be very variable. Furthermore, actual user experience and behaviour is quite complex. To take a couple of examples:
* **Parent-child**: parent/child is *really* "revisioning" where you have new revisions of a dataset (or maybe it is something else, something a bit like derivations …). But in revisioning the simple relationship is much less important than e.g. efficient storage of data versions, permissions structure etc.
* **Part-of**: this seems simple to start with, but is actually quite complex. How many levels of nesting? Can a child have multiple parents? Furthermore, it's not clear what actual user experience this information supports. Is it collections? Is it concatenation of data? Or is it just that the modelling team have got excited and there's no actual end-user need (common!).
Background
* CKAN relationships were implemented pretty early (at Rufus' suggestion)
* depends_on
* dependency_of
* derives_from
* has_derivation
* child_of
* parent_of
* CKAN relationships module not really maintained - https://github.com/ckan/ckan/issues/4212 (Fix, document and rewrite tests for dataset relationships (or remove) - 30 apr 2018 - with no movement on it)
* Some good example use cases
* https://github.com/ckan/ckan/wiki/Dataset-relationships


@ -1,10 +0,0 @@
# Data Storage for Data Portals
Data Portals often need to *store* data 😉 As such, they require a system for doing this. The systems for storing data can be classified into two types:
* [Blob Storage][] (aka Bulk or Raw): for storing data as "blobs", a raw stream of bytes like files on a filesystem. Think local filesystem or cloud storage like S3, GCS, etc. See [Blob Storage][] for more on this.
* Structured Storage: for storing data in a structured way e.g. tabular data in a relational database or "documents" in a document database.
>[!tip]In CKAN 2 Blob Storage (and associated functionality) was known as FileStore and Structured Storage was known as DataStore.
[Blob Storage]: /docs/dms/blob-storage

Some files were not shown because too many files have changed in this diff.