[examples/openspending] - openspending v0.2 (#907)
* [examples/openspending] - openspending v0.2 * [examples/openspending][m] - fix build * [examples/openspending][xs] - fix build * [examples/openspending][xs] - add prebuild step * [examples/openspending][m] - fix requested by demenech * [examples/openspending][sm] - remove links + fix bug
This commit is contained in:
@@ -0,0 +1,89 @@
|
||||
---
|
||||
author: $authornamehere
|
||||
redirect_from: /2012/03/uk-25k-spending-data/
|
||||
title: UK 25k Spending Data
|
||||
---
|
||||
Since the question just came up about what's going on there, figure
|
||||
this is a good day to tidy this up and send it out...
|
||||
|
||||
|
||||
Most people who managed to submit sources sent in data that was
|
||||
substantially correct, with the only inconveniences being:
|
||||
|
||||
- arbitrary variation in spelling of column headers. Standardise this
|
||||
- reporting of VAT: some include it, some exclude it, some do both, and most do not say which they do
|
||||
- currency of amounts. Talk to accountants and come up with a good
|
||||
way to report this; it is clear that the bulk of sources just want to
|
||||
use GBP, but some departments that operate overseas have complexities
|
||||
and no clear exchange rate.
|
||||
- reporting of VAT numbers is rare. We should reconsider whether this
|
||||
is worth bothering with
|
||||
- reporting of dates: invoice date, or payment date?
|
||||
|
||||
Many people included too many or too few fields. Standardise on a set
|
||||
of mandatory and optional fields. Strongly discourage the inclusion of
|
||||
data not in the optional set, because people tend to add extra columns
|
||||
reflecting their opinion of how the data should be structured, and
|
||||
neglect the recommended ones. 20% of submitted sources used exactly
|
||||
the recommended columns without prompting; let's get this number up.
|
||||
|
||||
Most people wanted to include a unique transaction reference
|
||||
number. Add this to the set of standard columns.
|
||||
|
||||
Some people wanted to include a narrative/description field. This
|
||||
should be encouraged; add as optional field.
|
||||
|
||||
Some people wanted to include commentary or cover notes in their
|
||||
spreadsheets. This should be strongly discouraged. It needs to be
|
||||
emphasised that they are supposed to be releasing *raw* data for
|
||||
analysis, not documents for people to read.
|
||||
|
||||
Invalid data is heavily skewed towards the same errors:
|
||||
|
||||
1. The URL supplied to data.gov.uk does not point to a csv or
|
||||
spreadsheet. This accounts for about 10% of all entries on data.gov.uk
|
||||
and in the bulk of those cases, the URL simply points to nothing; the
|
||||
largest remaining case is a URL pointing to a web page that talks
|
||||
about the data or lists places it can be downloaded, instead of
|
||||
pointing directly to the data files.
|
||||
|
||||
This could be eliminated entirely by fetching URLs when they are
|
||||
submitted to data.gov.uk, and rejecting anything that is not a csv or
|
||||
spreadsheet. A simple check of the first few bytes of each file is
|
||||
sufficient to identify almost every error immediately and reject URLs
|
||||
which are obviously wrong, and this would eliminate over 90% of bad
|
||||
submissions. No other action could be so valuable in terms of data
|
||||
gained from time spent, so this should be done first and soon. I would
|
||||
estimate the engineering cost to be substantially less than a day for
|
||||
a person familiar with the code.
|
||||
|
||||
A subset of these cases will be URLs that were once valid, but the
|
||||
files have since been removed. Data sources should be reminded of the
|
||||
need to maintain a permanent archive of this data at fixed
|
||||
URLs. data.gov.uk should regularly revalidate URLs and automatically
|
||||
mail responsible people when they go away.
|
||||
|
||||
2. Automated data extraction/reporting that went wrong - spreadsheets
|
||||
full of formulas or errors. Automated reporting is a good idea; nobody
|
||||
looked at these files before uploading them because they are obviously
|
||||
wrong. It should be straightforward to get them fixed if anybody ever
|
||||
tells the creator.
|
||||
|
||||
Errors not falling into the above two categories are mostly cases of
|
||||
complete nonsense or lack of understanding from the data
|
||||
submitter. These should be handled on a case-by-case basis.
|
||||
|
||||
There is no evidence of widespread difficulty or need for
|
||||
education. Clear and precise guidelines about the format to release
|
||||
data in, and validation of submitted URLs, should be the focus. Only a
|
||||
tiny number of submitters (<10) had an ignorance problem, and these
|
||||
are likely to be a simple case of the problem being dumped on a junior
|
||||
employee because nobody thought it was important.
|
||||
|
||||
Areas for further work after everything above this line has been done:
|
||||
|
||||
- unique identification of suppliers. Name isn't very good at
|
||||
identifying companies, and we should be able to link in other data
|
||||
about companies
|
||||
- what value should be in the departmentfamily, expensearea, and expensetype fields?
|
||||
- character set of submitted data
|
||||
Reference in New Issue
Block a user