[examples/openspending] - openspending v0.2 (#907)
* [examples/openspending] - openspending v0.2 * [examples/openspending][m] - fix build * [examples/openspending][xs] - fix build * [examples/openspending][xs] - add prebuild step * [examples/openspending][m] - fix requested by demenech * [examples/openspending][sm] - remove links + fix bug
@@ -0,0 +1,17 @@
|
||||
---
|
||||
lead: true
|
||||
title: Appendix
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
The appendix contains material which will later be integrated into the Spending Data Handbook, but which was important as direct reference material for this report.
|
||||
|
||||
* [Putting the Open Data into Open Budgets](./open-budgets-open-data/)
|
||||
* [Tool Ecosystem: What tools do people use to work with financial data?](./tool-ecosystem/)
|
||||
* [Common arguments against publishing data](./machinereadfaq/)
|
||||
* [How to publish spending data without disclosing personal information?](./privacyguide/)
|
||||
* [Other handy datasets](./other-handy-datasets/)
|
||||
|
||||
**Next**: [Putting the Open Data Into Open Budgets](./open-budgets-open-data)
|
||||
|
||||
**Up**: [Mapping the Open Spending Data Community](../)
|
||||
@@ -0,0 +1,72 @@
|
||||
---
|
||||
lead: true
|
||||
title: Common arguments against publishing data
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
Almost everyone in the community can tell stories of struggling with government officials to obtain transactional spending data in machine-readable format. Often publishers simply do not know that civil society wants data in a particular format, but there are also deliberate obstructions. In this FAQ we list the most typical excuses for refusing to release data in computer-friendly formats.
|
||||
|
||||
## ... in machine-readable format
|
||||
|
||||
### “PDFs are on my computer - therefore they are machine-readable”
|
||||
|
||||
FALSE: The fact they are on your computer means they are electronic copies, but not that they are machine-readable. PDFs are essentially a set of instructions for a printer on how to print a page. They look nice and appealing to the human eye, but to a computer they are little more than a picture.
|
||||
|
||||
PDFs go from bad to worse from the perspective of someone trying to do data work:
|
||||
|
||||
* [Better PDFs are machine-generated](https://www.gov.uk/service-manual/design-and-content/resources/creating-accessible-PDFs.html), typically something like an Excel or structured Word document converted into a PDF [(see example)](https://docs.google.com/a/okfn.org/file/d/1En9UbXiVwinRiMPf6gwL7LY-1rClPdEoM_aj75FWNgm5qLbIa42fg6y81YFv/edit). Often, you can copy and paste information from them, but there may be some formatting issues.
|
||||
* Worse PDFs are typically scanned documents. Often, to add to the misery, they will be copies of faxes that are smudged, speckled, tea-, water- or mould-stained, or crooked (sometimes all of the above).
|
||||
* Image files are not machine-readable for the same reasons.
|
||||
|
||||
### “If we publish in machine-readable, open formats - someone will alter the data and use it to discredit us.”
|
||||
|
||||
Again, FALSE. If someone wants to use data badly enough, they will use it even if they have to get it out of documents manually. If they have to get it out manually - mistakes could be introduced. Publishing the data in machine-readable format simply allows the user to start working with the data straight away.
|
||||
|
||||
Our advice would be the following:
|
||||
|
||||
<ul>
<li>Publish both machine-readable and non-machine-readable formats. We insist on the former for analysis, but the latter can also be useful, e.g. to cross-reference numbers and as an easily readable form for sharing reports.</li>
<li>Encourage users of the data to show their working. This allows anyone to check the accuracy of the working and verify the results. A good data project will usually:
<ul>
<li>Link back to the original source data.</li>
<li>Link to any modified data with an explanation of how it was changed, with the calculations and any underlying working clearly visible. When you provide such a clear audit trail, others will be able to replicate your work and verify that everything was done without errors. In journalism this is sometimes known as the “nerd box”.</li>
<li>Offer the data source the chance to comment on calculations from the data in order to clear up misunderstandings.</li>
</ul>
</li>
</ul>
|
||||
## ... in sufficient levels of detail
|
||||
|
||||
### “We cannot release spending data as it contains personal information”
|
||||
|
||||
FALSE: Public authorities holding spending data that includes personal information should not shirk their responsibility to publish the data. Instead, authorities should examine the data properly and redact personal data accordingly (workflows can be developed so that this effort is minimal). We see a real risk of local and national governments holding back spending data with this excuse, and have therefore co-written a guide for public authorities on how to deal with personal information in spending data (see the <a href="../privacyguide/">privacy guide</a>).
|
||||
|
||||
Access to data from the EU farm subsidy programme is a clear example: privacy (in this case for farmers) was the argument that decided a case at the European Court of Justice, which significantly [reduced access to data on farm subsidy payments](http://farmsubsidy.org/news/features/2012-data-harvest/).
|
||||
|
||||
### “We cannot release spending data involving third parties due to confidentiality concerns”
|
||||
|
||||
Public authorities should publish information about transactions between them and their contractors and commercial vendors. It is not uncommon, however, for either public officials or commercial contractors to attempt to block releases on the grounds of commercial confidentiality of the supplier (the third party).
|
||||
|
||||
The argument is most commonly made when requests are made for actual contracts, but even contracts are often [released in full](http://www.asktheeu.org/en/request/292/response/805/attach/2/Signed%20Framework%20Agreement%20with%20Eurocontrol.PDF.pdf) without redactions.
|
||||
|
||||
### “We cannot release granular data. You can get aggregated expenditures”
|
||||
|
||||
NOT USEFUL: Access to line-by-line transactional spending data is essential to ensure accountability. Investigating suppliers and procurement practices requires detailed transaction-level spending data.
|
||||
|
||||
Currently only a few countries release such data, with the UK, US, Brazil and Slovenia among the leaders in this field. Even in these countries, there is still work to do.
|
||||
|
||||
We have also noticed that several countries have introduced fairly high disclosure thresholds for transactional data. Such practices remain a serious concern and should be challenged, as a large share of public spending can fall below such thresholds.
|
||||
|
||||
Between countries disclosure thresholds vary widely:
|
||||
|
||||
* United States (federal level): USD 25,000
|
||||
* United Kingdom, National: GBP 25,000
|
||||
* United Kingdom, Councils: GBP 500 (for spending data), GBP 50,000 (for contracts)
|
||||
* Slovenia: No minimum disclosure threshold
|
||||
* Greece: No minimum disclosure threshold
|
||||
|
||||
Without knowing more about why these levels were chosen in each country, it is hard to judge whether they are reasonable.
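One rough way to judge a threshold is to measure how much spending it would hide. The sketch below is illustrative only: the function name `share_below_threshold` and the payment figures are invented for this example and do not come from any of the countries listed above.

```python
from decimal import Decimal


def share_below_threshold(amounts, threshold):
    """Return the fraction of total spending made up of payments below `threshold`."""
    total = sum(amounts, Decimal("0"))
    if total == 0:
        return Decimal("0")
    hidden = sum((a for a in amounts if a < threshold), Decimal("0"))
    return hidden / total


# Invented payment amounts, for illustration only.
payments = [
    Decimal("120000"),
    Decimal("24000"),
    Decimal("18000"),
    Decimal("9000"),
    Decimal("400"),
]

share = share_below_threshold(payments, Decimal("25000"))
print(f"{share:.1%} of this spending would not be disclosed")
```

Running the same calculation against a full ledger, where one is available, gives a concrete figure to bring to a discussion about whether a threshold is set too high.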
|
||||
|
||||
**Next**: [How to publish spending data without disclosing personal information](./privacyguide/)
|
||||
|
||||
**Up**: [Appendix](../)
|
||||
@@ -0,0 +1,108 @@
|
||||
---
|
||||
lead: true
|
||||
title: Putting the Open Data Into Open Budgets
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
We have looked in detail in this report at criteria which make it difficult for organisations to use data that has been released by governments. In January 2013, we hosted a community call to look at what the demands of the Open Data Community are with regard to Open Budgets. Although both feature the word “Open”, there is still a disconnect between the use of “open” in many circles to signify availability and its use in technical spheres to signify the absence of legal, technical and social restrictions.
|
||||
|
||||
The purpose of the call was to investigate whether it would be possible to specify the demands of the Open Data Community with relation to budget and spending data.
|
||||
|
||||
## What do we need and how do we need it?
|
||||
|
||||
### Structured data
|
||||
|
||||
So it’s not so labour-intensive to do analysis!
|
||||
|
||||
For definitions of structured data, please see section below: *Structured data: What data formats to provide*
|
||||
|
||||
### Bulk access
|
||||
|
||||
* *It should also be possible to download all of the budget information in bulk*.
|
||||
* Preventing bulk downloads by using systems such as CAPTCHA is not acceptable.
|
||||
* Some interviewees requested data to be released via an API. This is indeed a useful move particularly when data is updated regularly, but should not be the only method to acquire the data - many non-technical users require simply bulk download of the data.
|
||||
|
||||
### Updates and amendments
|
||||
|
||||
If there is a requirement to update or change the budget documents e.g. as new drafts are produced - it's important to show the versions and keep track of the changes. Some suggestions:
|
||||
|
||||
* Displaying what date the data was "updated on", or using version numbers would be acceptable.
|
||||
* Crucially, there should be access to all drafts (i.e. they should not be removed from their place of publication and should remain available) even when new versions are published.
|
||||
|
||||
### Timely data (that stays around)
|
||||
|
||||
Data is required:
|
||||
|
||||
* Within a period of time that would allow change to take place
|
||||
* Early in budget formulation process so that it is possible to participate in discussion about where the funds should actually go
|
||||
* After budget formulation so that you could monitor whether things had actually happened
|
||||
* Planned versus execution data while such comparisons still matter - for example, so that one might complain that a project didn’t actually happen, and the guy who would have been responsible for that is still in that job, and the people who would have benefited from it are still going to be angry
|
||||
|
||||
<div class="well">
|
||||
<h3>How long should data be available online? </h3>
|
||||
<ul>
|
||||
<li>The costs of storing information online nowadays are so minimal, that this question is essentially redundant (i.e. the answer is "forever"). </li>
|
||||
<li>If a government feels it is absolutely necessary to remove data after a certain period of time <em>(this should be a minimum of several years after original publication, longer if the period to which the budget relates is greater than a year)</em>, it should <strong>specify clearly, at the time of publication, the time and date on which the information will be removed</strong>. This will allow civil society organisations sufficient time to make a backup copy for themselves.</li>
|
||||
</ul>
|
||||
</div>
|
||||
### Classifications
|
||||
|
||||
Different users are interested in different aspects of budgets. Not all classifications will be available, and the availability and structure of classifications, as well as the requirements of individuals and organisations, will vary from country to country.
|
||||
|
||||
* All available classifications should be published.
|
||||
* Functional classifications are often the most comprehensible to citizens. They explain the particular themes or sectors on which money is spent. There are also international standards for comparing functional spending (e.g. COFOG).
|
||||
* Programmatic classifications are used particularly in developing countries to relate spending to multi-year development plans.
|
||||
* Administrative classifications show which department or agency received the money – and are therefore important for the accountability of funds down the chain.
|
||||
|
||||
##### Breakdown
|
||||
* Detailed information can be aggregated up to create more meaningful and digestible information, but the reverse (from aggregate to disaggregate information) is not possible.
|
||||
* Again, the availability of detailed information, as well as the requirements of individuals and organisations, will vary from country to country.
|
||||
* Therefore, budgets should be as detailed and disaggregated as possible.
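To illustrate why disaggregated data is the more useful starting point, here is a minimal sketch. The budget lines, field order and function name are hypothetical. The same detailed records can be rolled up along any classification, whereas published totals cannot be split back into their parts.

```python
from collections import defaultdict

# Hypothetical disaggregated budget lines:
# (functional classification, administrative unit, amount)
budget_lines = [
    ("Health", "Ministry of Health", 500),
    ("Health", "Regional Clinics Agency", 200),
    ("Education", "Ministry of Education", 300),
]


def aggregate(lines, key_index):
    """Sum the amount (third field) of each line, grouped by the chosen field."""
    totals = defaultdict(int)
    for line in lines:
        totals[line[key_index]] += line[2]
    return dict(totals)


by_function = aggregate(budget_lines, 0)  # totals per functional classification
by_agency = aggregate(budget_lines, 1)    # totals per administrative unit
```

Had only the functional totals been published, the per-agency view could not have been produced at all.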
|
||||
|
||||
### Spending standard
|
||||
|
||||
In the [Technology for Transparent and Accountable Public Finance Report](http://community.openspending.org/research/gift/), we identified the need for a global standard for opening up transaction-level spending data. A couple of further comments on this topic.
|
||||
|
||||
* This is probably going to be more useful at the international level – e.g. to pull all the data together and look at super-aggregate information.
|
||||
* It could also be useful at country level though, for inter-country comparisons.
|
||||
|
||||
The number one low-hanging fruit, which would vastly improve the usability of available budget and spending information (plus procurement and other types listed above), is to make data **machine-readable**.
|
||||
|
||||
<div class="well">
|
||||
<h2>What does Machine-readable mean?: Implementation guidelines from the UK government.</h2>
|
||||
<quote>
|
||||
The UK government has now issued very good, clear <a href="https://www.gov.uk/service-manual/design-and-content/choosing-appropriate-formats.html">plain-language guides</a> for service managers on which data formats are appropriate for publishing data. The US government has also decreed that all data shall be published in machine-readable formats. An extract from the UK service manual from gov.uk is copied below for the convenience of the reader:
|
||||
|
||||
<ul>
|
||||
<li><quote><strong>“For data, use CSV or a similar ‘structured data’ format (see also JSON and XML). Do not publish structured data in unstructured formats such as PDF</strong></quote>.</li>
|
||||
<li><quote><strong>If you are regularly publishing data (financial reports, statistical data, etc.) then your users may well wish to process this data programmatically, and it becomes especially important that your data is ‘machine-readable’. PDFs, Word documents and the like are not suitable formats for data publication. In addition, you should consider making your data available through an API if this will simplify your users’ interactions with your publications. [...]</strong></quote> </li>
|
||||
<li><quote><strong>If you are publishing a written report that contains statistical tables, provide the tables alongside or in addition to your report in suitable data formats.</strong></quote></li>
|
||||
</ul>
|
||||
</quote>
|
||||
|
||||
[...]
|
||||
|
||||
<quote>
|
||||
|
||||
<h2>Don’t assume your users can read proprietary formats</h2>
|
||||
Wherever possible, publish in accessible, patent-free, <a href="https://en.wikipedia.org/wiki/Open_format">open formats</a>, for which software is widely available on a variety of platforms. If publishing in proprietary formats, you should always make a non-proprietary alternative available.
|
||||
[...] For tabular data, provide <strong> <a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> or <a href="http://en.wikipedia.org/wiki/Tab-separated_values">TSV</a> </strong> rather than Excel spreadsheets (.xls/.xlsx).
|
||||
|
||||
</quote>
|
||||
Read the full version of the guidelines <a href="https://www.gov.uk/service-manual/design-and-content/choosing-appropriate-formats.html">here</a>.
|
||||
|
||||
</div>
|
||||
## Why is this so important?
|
||||
|
||||
Civil Society Organisations currently waste a huge amount of time and resources in converting data from non-machine-readable formats into ones that they can use for analysis, visualisations or other projects. Any data project has a data pipeline:
|
||||
|
||||

|
||||
|
||||
Typically, in the projects we have analysed in this report, finding data and getting data (including extracting data out of formats such as PDFs into usable formats) are the most time intensive part of all of the projects. Extracting data out of non-machine readable formats:
|
||||
|
||||
* **is a waste of time and resources for all involved**. In an ideal world, re-users of the data should be able to dedicate the majority of their time to analysis, presentation and action around the data. It can take weeks or months for organisations to extract all the information from one document or file, enough to make a visualisation or simple analysis.
|
||||
* **introduces transcription errors during conversion**. Even current software that extracts information from PDFs automatically can introduce errors.
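By contrast, when data is already published as CSV, analysis can start almost immediately. The following sketch uses only the Python standard library. The column names `supplier` and `amount` and the sample rows are assumptions made for illustration, not a real published dataset.

```python
import csv
import io
from decimal import Decimal

# Invented sample data standing in for a published CSV file.
csv_text = """supplier,amount
Acme Ltd,1200.50
Example Services,800.00
Acme Ltd,300.25
"""


def total_by_supplier(csv_file):
    """Sum the 'amount' column for each value of the 'supplier' column."""
    totals = {}
    for row in csv.DictReader(csv_file):
        supplier = row["supplier"]
        totals[supplier] = totals.get(supplier, Decimal("0")) + Decimal(row["amount"])
    return totals


totals = total_by_supplier(io.StringIO(csv_text))
for supplier, amount in sorted(totals.items()):
    print(supplier, amount)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file object. No manual extraction step is needed, so no transcription errors are introduced.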
|
||||
|
||||
**Next**: [Tool Ecosystem](./tool-ecosystem/)
|
||||
|
||||
**Up**: [Appendix](../)
|
||||
@@ -0,0 +1,46 @@
|
||||
---
|
||||
lead: true
|
||||
title: Other handy datasets
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
In our conclusions section, we highlighted the main types of data which are in demand (budgets, transaction-level spending, procurement...). We have kept the demands in the conclusion short for clarity's sake; however, there are many other datasets which organisations need to be able to combine with the key data:
|
||||
|
||||
<ul>
|
||||
<li><strong>Information on targets or outputs</strong>, these should be clearly mappable to the project or programme area to which they relate in order to be able to answer questions such as “What is the delivery rate?” or “Did that injection of funds/stimulus package result in better performance?”. These are not always produced by governments, but frequently in demand.</li>
|
||||
<li><strong>Geographic information</strong> should be available at reasonably granular levels. Governments often transfer grants and other payments to geographically identified areas, e.g. as part of redistribution schemes. Providing access to such regionalised accounts can be crucial in enabling CSOs to assess equality and the distribution of budgetary priorities. Note that regional data often provides too little granularity to expose local inequality.</li>
|
||||
</ul>
|
||||
Two cases exemplify to what extent including geographic information can be helpful for different missions:
|
||||
|
||||
> “What we would like to be able to do is pull out ward-level data [...] or very very micro-level data, neighbourhood level, most of the data which is released [in the UK] is Local Authority Level, and that’s just too big for us.” - <strong> Jez Hall of the Participatory Budgeting Unit (UK) </strong>
|
||||
|
||||
Additionally, in the [case study](../../case-studies-budgets/cbga/) from the <strong>Centre for Budget and Governance Accountability (India)</strong>, many of the questions the group strove to answer could have been answered simply by ensuring that the data contained information on which state received the funds (this is pretty high-level information, but was still unavailable).
|
||||
|
||||
<ul>
|
||||
<li><strong>Information on demographics:</strong> Most policy researchers want to be able to answer questions more specific than per-capita estimates. This makes data such as "Household Surveys" particularly important. They might ask questions such as:
|
||||
<ul>
|
||||
<li>“How many users of a particular service are there?”</li>
|
||||
<li>“How many households benefit from a particular policy?” </li>
|
||||
<li>“Of those households, how many are living below the poverty line?”</li>
|
||||
<li>“What is the income bracket of those people?” </li>
|
||||
<li>“How many young people/women/people with a disability/ people of a specific ethnicity/[...] benefit from a particular policy/programme?”</li>
|
||||
<li>“How much does a particular school place cost in different boroughs or regions?”</li>
|
||||
</ul>
</li>
|
||||
<li><strong>Information on the actual goods purchased as part of government funded work.</strong> The majority of questions related to state purchasing require details on the quantity, price and frequency of purchases (e.g. journalists will often want to know "how many computers were purchased?", or even "how many computers were purchased for X amount of money?", rather than "how much was spent on computers?"). For an illustration of the types of analysis groups need to be able to do, see a recent open data success story: <a href="http://www.bj-hc.co.uk/bjhc-news/news-detail.html?news=2327&lang=en&feed=130">Open Data probe shows NHS statin bill twice what it should be</a>. As another example, the US Medicare programme published a database of prices charged by various hospitals for thousands of the most common treatments.</li>
|
||||
<li><strong>Economic & Macroeconomic projections</strong> are becoming increasingly important as national fiscal policies are measured against models from international organisations, e.g. the EU deficit thresholds. For both EU member states and EU Neighbor Countries, economic reviews by the EU Commission can have substantial local impacts on policy. The public accessibility of macroeconomic governance models has until now not had a prominent place in the debate around these often contested models. However, it is clear that the public should be able to scrutinise the conditions and calculations behind such models.</li>
|
||||
<li><strong>Structured information on the planned pattern of cuts</strong> that could be tied to e.g. particular programmes / geographical area</li>
|
||||
<li><strong>Data showing how much governments / political candidates spend on media advertising</strong> (both through taxpayer funds and from campaign contributions).</li>
|
||||
</ul>
|
||||
This list is clearly not comprehensive; we list here only frequently recurring requests from users.
|
||||
|
||||
## Country-specific requests
|
||||
|
||||
Some countries have formulated their own detailed requests for information and reviews of currently available information, either as part of a public consultation, research or spontaneously:
|
||||
|
||||
* [Romania](https://docs.google.com/spreadsheet/ccc?key=0Anbfx9yMO3c8dGptNHF5aGhjdXdRN2U5aVlEMUJiMmc#gid=0) (In Romanian)
|
||||
* [India](http://www.accountabilityindia.in/accountabilityblog/2241-dating-data-what-are-characteristics-dream-government-data)
|
||||
* [Hungary](http://kmonitor.hu/files/page/OGP_ajanlasok_KM_TASZ.pdf) (In Hungarian)
|
||||
|
||||
See also the [user testimonials](http://community.openspending.org/research/gift/testimonials/) from the earlier report: “Technology for Transparent and Accountable Public Finance.”
|
||||
|
||||
**Up**: [Appendix](../)
|
||||
@@ -0,0 +1,379 @@
|
||||
---
|
||||
lead: true
|
||||
title: How to publish spending data without disclosing personal information
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
<p class="c3">
|
||||
<div class="well">by OpenSpending team and Ian Makgill, Ticon</div>
|
||||
|
||||
<p class="c13 c3"><span class="c7"></span>
|
||||
|
||||
<p class="c3"><span>This guide is intended to help governments publish spending data without compromising personal information. It has been drafted with UK local councils and other public authorities in mind, namely those who wish to publish transactional spending in accordance with UK regulations but who are concerned that their data may include personal information (such as personal names or addresses). While we recognise that governmental accounting systems and privacy regulations differ vastly across countries, we think this guide provides key practical advice which should, to some extent, be replicable elsewhere. </span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><span class="c2">Background</span></h2>
|
||||
|
||||
<p class="c3"><span>In January 2013 a freelance data specialist from the community used OpenSpending to identify a number of privacy breaches in an individual dataset published by a local council. This was due to inconsistent redaction of sensitive data by the local authority. Whilst the majority of these payments were to organisations (hence probably not highly sensitive), there were also a few unredacted payments to individuals. The person who uploaded the data immediately notified their local council, who in turn referred this to their audit committee. As a precaution, the dataset in question, the UK Local Council £500 spending data, was taken off the site immediately.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Data privacy should never be a valid justification for shutting off access to public spending information, as there should be simple processes in place to prevent private data from being published. As more spending data becomes public, government agencies will have to implement release procedures that prevent privacy breaches. For the financial transparency community the data privacy issue represents a challenge, as governments might be tempted to use it as a reason for limiting public disclosures.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><span>So why are we writing this guide?</span></h2>
|
||||
|
||||
<ol class="c6" start="1">
|
||||
<li class="c4 c3"><span>because we care about </span><span class="c0"><a class="c1" href="http://blog.okfn.org/2013/02/22/open-data-my-data/">privacy of individual citizens</a></span></li>
|
||||
<li class="c4 c3"><span>because we care about Open Data, which we think is a vital part of making Government transparent and accountable</span></li>
|
||||
<li class="c4 c3"><span>because the presence of personal data within transactional spending data has been identified as a barrier to making such data available to the public. (In April 2013 Copenhagen City Council rejected a Freedom of Information request for 1 million transactions worth EUR 2.5bn on the grounds that the data contained personal data which could not be removed without extensive use of personnel resources.)</span></li>
|
||||
</ol>
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><span class="c2">What are the rules on data privacy and obligations for publishing transactional level spending data?</span></h2>
|
||||
|
||||
<p class="c3"><span>Incidents of privacy breaches highlight the importance of proper procedures to ensure that data from public sector bodies is properly redacted before being published. The UK government produces a </span><span class="c0"><a class="c1" href="http://data.gov.uk/blog/local-spending-data-guidance">guideline document for data publishers</a></span><span>, which is intended to ensure that issues like this are prevented and hence remain very rare.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>The Local Government Association (UK) has published this </span><span class="c0"><a class="c1" href="http://localnewcontracts.readandcomment.com/appendix-c-inclusions-and-exemptions-for-publishing-data-2/">guide</a></span><span>.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><span class="c2">Understanding the problem</span></h2>
|
||||
|
||||
<p class="c3"><span>In a broad sense, the law is quite simple: you can’t publish anything that might identify an individual. Complying with the law is less straightforward. It would be nice if we could just search our output files for all the occurrences of "Mr.", "Mrs." or "Miss" and redact accordingly, but personal data is often quite difficult to locate, and successfully suppressing it requires diligent checking and good organisational practices.</span>
|
||||
|
||||
<p class="c3"><span> </span>
|
||||
|
||||
<p class="c3"><span>To complicate matters, many companies and organisations use personal names as their identifiers. Many companies in the Companies House register include “Mr.” in their name, and there are still more companies with names that could be confused with personal names.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><span class="c2">Where personal information is usually found in spending data</span></h2>
|
||||
|
||||
<p class="c3"><span>The primary source of personal data is the</span><span> “name” field</span><span> of the transaction. Ensuring that this data has been cleansed is likely to ensure that most of your potential breaches have been resolved. However, at times transactions can include privacy-sensitive information in the “description” field of the invoice, which could include a name, case file or social security number. For this reason all columns in the dataset should be analysed.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Columns to pay special attention to:</span>
|
||||
|
||||
<ol class="c6" start="1">
|
||||
<li class="c4 c3"><span>Name</span></li>
|
||||
<li class="c4 c3"><span>Address</span></li>
|
||||
<li class="c4 c3"><span>Narrative / description</span></li>
|
||||
<li class="c4 c3"><span>Department</span></li>
|
||||
<li class="c4 c3"><span>Category</span></li>
|
||||
</ol>
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span class="c2">Identifying names</span>
|
||||
|
||||
<p class="c3"><span>There are a number of typical indicators that the payment is made to an individual:</span>
|
||||
|
||||
<ol class="c6" start="1">
|
||||
<li class="c4 c3"><span>Use of "Mr", "Mrs", "Miss" etc at the start of the supplier name</span></li>
|
||||
<li class="c4 c3"><span>Use of an initial followed by a name e.g. "D. Harrison"</span></li>
|
||||
<li class="c4 c3"><span>Payment is not associated with an invoice </span></li>
|
||||
<li class="c4 c3"><span>The payment instruction details specify a refund or specifics such as "Direct Payment"</span></li>
|
||||
</ol>
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>It is possible to use a procedure called 'pattern matching' that can highlight any items in a database that match a certain pattern of characters. Using these routines will make it possible to highlight entries that may include personal name data.</span>
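As a minimal sketch of what such pattern matching might look like, the Python snippet below flags supplier names that start with a personal title or with an initial followed by a surname. The two regular expressions and the function name are illustrative assumptions, not a complete or recommended rule set. They will flag some companies (for example ones whose name begins with "Mr.") and miss some individuals, so flagged entries still need human review.

```python
import re

# Illustrative patterns only; a real deployment would need a broader, tested set.
# A personal title at the start of the name, e.g. "Mr", "Mrs.", "Miss".
TITLE_PATTERN = re.compile(r"^(mr|mrs|miss|ms|dr)\b\.?", re.IGNORECASE)
# A single initial followed by one capitalised word, e.g. "D. Harrison".
INITIAL_PATTERN = re.compile(r"^[A-Z]\.?\s+[A-Z][a-z]+$")


def looks_like_personal_name(supplier_name):
    """Return True if the supplier name matches one of the personal-name patterns."""
    name = supplier_name.strip()
    return bool(TITLE_PATTERN.search(name) or INITIAL_PATTERN.search(name))


# Invented supplier names, for illustration only.
suppliers = ["Mr John Smith", "D. Harrison", "Acme Ltd", "Missouri Water Co"]
flagged = [name for name in suppliers if looks_like_personal_name(name)]
```

The word boundary (`\b`) after the title keeps names such as "Missouri Water Co" from being flagged merely because they begin with the letters "Miss".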
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Countries vary in the extent to which they allow the names of companies and sole proprietorships to be disclosed to the public.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><a name="h.sxy4odoqd9n2"></a><span>How to address the issue of personal data?</span></h2>
|
||||
<p class="c3"><span>As mentioned before the most important field in spending data is the supplier name, as this will most likely contain the most valuable personal data, but publishers need to be aware of the potential for identities to be triangulated from additional data, such as narratives and transaction descriptions. It is therefore necessary to review all fields in a data set.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3 class="c3"><span class="c2">Step 1: Flagging at source</span></h3>
|
||||
|
||||
<p class="c3"><span>Evaluating the entries in the supplier name field to assess whether the data includes an individual's name is an inefficient and largely ineffective method for flagging personal data breaches. Instead, the most effective method of suppressing publication of this data is to ensure that personal data is flagged as such when payments are made. Every Department in the Council will have legitimate reasons for issuing payments to individuals, so it is advisable to establish an organisation-wide protocol for flagging such payments. Most Councils (UK) have a standard procedure for co-ordinating payments that includes raising a purchase order (PO). Users that generate POs for payments to individuals should simply append a predetermined code to the recipient's name, which can then be picked up by the IT department so that the data can be suppressed before publication.</span>
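The suppression step could be sketched roughly as follows. The flag code `[IND]`, the field name `supplier_name` and the replacement text are hypothetical choices made for this example; each organisation would agree its own code as part of its protocol.

```python
# Hypothetical flag code; the guide does not prescribe a specific value.
INDIVIDUAL_FLAG = "[IND]"
REDACTED = "REDACTED PERSONAL DATA"


def suppress_flagged(rows, name_field="supplier_name"):
    """Return copies of the rows with flagged recipient names replaced."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the finance export itself is not modified
        if row[name_field].rstrip().endswith(INDIVIDUAL_FLAG):
            row[name_field] = REDACTED
        cleaned.append(row)
    return cleaned


# Invented payment records, for illustration only.
payments = [
    {"supplier_name": "Jane Doe [IND]", "amount": "650.00"},
    {"supplier_name": "Acme Ltd", "amount": "1200.00"},
]
publishable = suppress_flagged(payments)
```

Because the decision is made by the person raising the purchase order, the script only has to look for the agreed code and never has to guess whether a name belongs to an individual.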
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3>Step 2: Monitoring the Supplier Database</h3>
|
||||
|
||||
<p class="c3"><span>The data used in payment reports will originate from the organisation's finance system, which includes a record of suppliers to whom payments are made. To prevent fraud, there is normally a strict procedure for adding suppliers to this database. This procedure should include a requirement that any personal data is flagged for later suppression, effectively creating a second filter to prevent personal data breaches. It is important to note that this procedure should not form the primary prevention mechanism, as the name in the supplier field is often simply not enough to identify whether a payment is for an individual or not. However, this step should be used to flag any payments that appear to be to an individual.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3>Step 3: Pre-publication reviews</h3>
|
||||
<p class="c3"><span>All data that is to be published should go through a two-stage pre-publication review. The first part should include an automated review of the data, where a script is used to select entries that look like they may include personal data. </span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>The scripts should be capable of screening for the following:</span>
|
||||
|
||||
<ol class="c6" start="1">
|
||||
<li class="c4 c3"><span>Pre-determined flags that show that a payment is to an individual.</span></li>
|
||||
<li class="c4 c3"><span>Common pattern matches used in names (e.g. "Miss")</span></li>
|
||||
<li class="c4 c3"><span>Names of payees that are known to the Council (e.g. they have been identified as personal payments before)</span></li>
|
||||
<li class="c4 c3"><span>Any specific funding codes that are likely to indicate that a payment is going to an individual (e.g. Social Care Direct Payments).</span></li>
|
||||
</ol>
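<p class="c3"><span>The four screening checks above could be sketched roughly as follows in Python. This is a minimal illustration rather than a definitive implementation: the flag code <code>[IND]</code>, the list of titles, the known-payee list, the funding code <code>SC-DP</code> and the field names <code>supplier</code> and <code>funding_code</code> are all hypothetical and would need to be replaced with an organisation's own values.</span>

```python
import re

# Every value below is a hypothetical placeholder, not taken from a real finance system.
INDIVIDUAL_FLAG = "[IND]"                    # code appended to the payee name at source
TITLE_PATTERN = re.compile(r"\b(?:mr|mrs|miss|ms|dr)\b", re.IGNORECASE)
KNOWN_INDIVIDUALS = {"a n other"}            # names confirmed in earlier reviews (lower case)
PERSONAL_FUNDING_CODES = {"SC-DP"}           # e.g. a code for Social Care Direct Payments


def screen(row):
    """Return the reasons why a payment row may contain personal data."""
    reasons = []
    name = row["supplier"].strip()
    if name.endswith(INDIVIDUAL_FLAG):
        reasons.append("flagged at source")
    if TITLE_PATTERN.search(name):
        reasons.append("title pattern")
    if name.lower() in KNOWN_INDIVIDUALS:
        reasons.append("known individual")
    if row.get("funding_code") in PERSONAL_FUNDING_CODES:
        reasons.append("funding code")
    return reasons


rows = [
    {"supplier": "Acme Paving Ltd", "funding_code": "HW-01"},
    {"supplier": "Miss A Example", "funding_code": "HW-01"},
    {"supplier": "A Example [IND]", "funding_code": "SC-DP"},
]
for row in rows:
    print(row["supplier"], "->", screen(row) or "no match")
```

<p class="c3"><span>Returning the list of reasons, rather than a simple yes or no, makes it possible to tell rows that were flagged at source apart from rows that only triggered a pattern or funding-code match.</span>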
|
||||
<p class="c3 c13"><span></span>
|
||||
|
||||
<p class="c3"><span>Once data has been selected, it should be reviewed manually to confirm whether the data provides sufficient information to identify an individual. There is no need to manually review data that has already been flagged as an individual by the Department making the payment, or that has previously been identified as an individual through earlier work to prevent breaches. Data that has been flagged because it triggered a pattern match or a funding code should be checked manually.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Once a data line has been identified as a payment to an individual, the key pieces of text should be stored as a record that allows the Council to suppress that data in the future (see step 3 above) and for use in the automated flagging procedures for ensuing months.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>A further, manual review needs to be undertaken to ensure that personal payments are not missed. Typically payments to individuals will involve small sums (relative to the amounts paid to companies) in a small number of transactions. It is therefore worth ordering the data by transaction value, lowest first, and then looking through the payment lines to identify any payments to individuals. Care should be taken to ensure that reviewers are aware of the potential for foreign names to appear in the text, and steps should be taken to ensure that a foreign language review is undertaken where necessary. Although this work sounds onerous, in actuality it is a very small task; a regular monthly review should occupy just minutes of staff time, not hours.</span>
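<p class="c3"><span>As a rough sketch, the ordering step for this manual review could look like the following in Python. The field names <code>supplier</code> and <code>amount</code>, the sample payments and the default limit of 50 rows are assumptions made purely for illustration.</span>

```python
def lowest_value_first(rows, limit=50):
    """Return up to `limit` rows, ordered from the smallest payment upwards."""
    return sorted(rows, key=lambda row: row["amount"])[:limit]


# Hypothetical sample payments.
payments = [
    {"supplier": "Acme Paving Ltd", "amount": 18250.00},
    {"supplier": "A Example", "amount": 42.50},
    {"supplier": "Northside Chambers", "amount": 3100.00},
]
for row in lowest_value_first(payments):
    print(f'{row["amount"]:>10.2f}  {row["supplier"]}')
```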
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3>Step 4: Removing data</h3>
|
||||
|
||||
<p class="c3"><span>You should never delete whole rows of data; instead you should overwrite any data that might constitute a data breach. In particular, there should be no reason for removing either date or value fields from a transaction, as these cannot be used to identify an individual. Additional data such as department and narrative information should only be overwritten if it contains data that could identify an individual. Councils should also take steps to detail why the data has been censored. For example, it would be suitable to replace a person's name with the following: "Redacted to comply with the Data Protection Act". Providing this additional information gives the data user a good understanding of the nature of the underlying data and advises data consumers that the Council is undertaking its role as a data producer responsibly.</span>
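<p class="c3"><span>A minimal Python sketch of this overwrite-rather-than-delete approach is shown below. The field names and the sample row are hypothetical; only the redaction notice comes from the text above.</span>

```python
REDACTION_NOTICE = "Redacted to comply with the Data Protection Act"


def redact(row, fields=("supplier",)):
    """Return a copy of `row` with the named fields overwritten by the notice.

    The row itself is kept, and the date and value fields are left untouched.
    """
    redacted = dict(row)
    for field in fields:
        if field in redacted:
            redacted[field] = REDACTION_NOTICE
    return redacted


# Hypothetical sample row.
row = {"date": "2013-01-15", "supplier": "Miss A Example", "amount": 42.50}
print(redact(row))
```

<p class="c3"><span>Passing additional field names (for example a narrative field) in <code>fields</code> overwrites those too, which matches the advice to redact such fields only when they actually contain identifying data.</span>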
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h2 class="c3"><a name="h.wi2gl3oiyddl"></a><span>What not to do: The use of overly restrictive disclosure procedures</span></h2>
|
||||
<p class="c3"><span>Data privacy issues should never act as justification for avoiding public disclosure. An example of this might be the suppression of spend data relating to a Barrister on the grounds that the data could be used to identify the individual. Whilst it is right to suppress personal data, the Barrister will be a member of a Chamber of Barristers, and any transaction could be allocated to the Chambers rather than to the individual Barrister. The same applies to payments to Doctors: payments should be allocated to the Doctor's practice, not to the individual Doctor concerned.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Ticon has noticed a worrying trend among Councils to redact data on the basis that the payment was made to a 'sensitive supplier', or that the transaction was 'commercially sensitive'. The LGA cites arbitration, commercial confidence and transactions relating to the underwriting of debt as suitable reasons to redact spend information (see below).</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<a href="#" name="db587461f48ca9d947b08d54b3901ee8dd196ccf"></a><a href="#" name="0"></a>
|
||||
<table border="1" cellpadding="0" cellspacing="0" class="c58">
|
||||
<tbody>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span class="c2">No</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span class="c2">Examples of transactions that may be excluded from publication</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span class="c2">Reason</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span class="c2">Redacted or Excluded</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>1</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Salary payments to staff (including bonuses), except when published under the senior salary scheme. These will be published separately </span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Personal information protected by the Data Protection Act</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Excluded</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>2</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Pension contributions (excluding service charge) and National Insurance Contributions</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Personal information protected by the Data Protection Act</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Excluded</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>3</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Severance payments</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Personal information protected by the Data Protection Act</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Excluded</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>4</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Payments to individuals from legal process - compensation payments, legal settlements, fraud payments</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Personal information protected by the Data Protection Act</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Redacted </span>
|
||||
|
||||
<p class="c3"><span>(in exceptional cases exclude the data)</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>5</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Competition prizes – where a normal part of operations</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Personal information protected by the Data Protection Act</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Redacted</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>6</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Settlements made with companies as an arbitration which is conditional on confidentiality</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Commercial-in-confidence – exempt under FOI</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Redacted</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>7</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Potential betrayal of a commercial confidence, or prejudice to a legitimate commercial interest</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Very rare and will need to be justified</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Redacted</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>8</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Transactions relating to the financing or underwriting of debt e.g. purchase of credit default swaps</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c3"><span>Outside the definition of expenditure for this purpose</span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Excluded</span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr class="c25">
|
||||
<td class="c38">
|
||||
<p class="c3"><span>9</span>
|
||||
|
||||
</td>
|
||||
<td class="c29">
|
||||
<p class="c3"><span>Provisions or promises to pay not yet realised</span>
|
||||
|
||||
</td>
|
||||
<td class="c17">
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
</td>
|
||||
<td class="c35">
|
||||
<p class="c3"><span>Excluded </span>
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>If Councils are to be genuinely open about their activities, there should be no suppression of payments to a commercial entity listed in Companies House (or any other companies register). Individual companies can choose to protect their tender submissions and other commercial data that they send to the Council from release under the Freedom of Information Act; however, there really should be no situation in which the simple fact of a payment is seen as commercially sensitive. Whilst the suppression of data on payments to companies as part of settlements is a common legal practice, Councils should do their utmost to resist activity which is so protectionist and undemocratic. Perhaps financing payments should be recorded separately, but it is hard to justify their exclusion from open publication.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3>Privacy for farmers receiving farm subsidies in the EU</h3>
|
||||
|
||||
<p class="c3"><span>In 2010 the European Court of Justice ruled that mandatory disclosure of the names of</span><span><a class="c1" href="http://www.ft.com/intl/cms/s/0/16973ef0-ec2d-11df-9e11-00144feab49a.html#axzz2KIkhha4O"> </a></span><span class="c0"><a class="c1" href="http://www.ft.com/intl/cms/s/0/16973ef0-ec2d-11df-9e11-00144feab49a.html#axzz2KIkhha4O">German recipients of EU farm subsidies</a></span><span> amounted to a breach of their privacy. The ECJ decision has not been overturned since and has led to a substantial decrease in access to data on overall EU spending, due to the large share that the Common Agricultural Policy (CAP) occupies within the total budget. Much like the Barristers, these farms are commercial entities, and the farms should be named on each transaction. </span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<p class="c3"><span>Whilst the need to protect individuals in receipt of non-commercial payments from Government (e.g. Housing Benefit or Pensions payments) should be recognised, Open Spending believes that all commercial payments should be published openly. If individuals who receive monies from Government in exchange for services, or as part of a grant for a commercial enterprise, need to remain anonymous, they can always choose to reject that payment.</span>
|
||||
|
||||
<p class="c13 c3"><span></span>
|
||||
|
||||
<h3>Summary</h3>
|
||||
|
||||
<p class="c3"><span>The world is just getting used to the existence of open spending data, but as the data attracts increased usage, governments will come under greater pressure to create dependable, consistent and accurate datasets. Now is the time to ensure that your data is correctly presented, free of data that may breach regulations, and usable by organisations like openspending.org.</span>
|
||||
|
||||
**Next**: [Other handy datasets](./other-handy-datasets)
|
||||
|
||||
**Up**: [Appendix](../)
|
||||
@@ -0,0 +1,268 @@
|
||||
---
|
||||
lead: true
|
||||
title: Tool Ecosystem
|
||||
authors:
|
||||
- Neil Ashton
|
||||
---
|
||||
## Spending Data: The Tool Ecosystem
|
||||
|
||||
There is a set of staple tools that can be used to tackle many of the issues highlighted by the organisations in this report. For each one, we’ve outlined the tool, what it’s useful for and what the barrier to entry is.
|
||||
|
||||
We continue to hunt for more and better tools to do the job and hope that some of the problems, such as governments publishing their data in PDFs or HTML, will soon be irrelevant, so that we can all focus on more important things.
|
||||
|
||||
*If you would like to suggest a tool to be added to this ecosystem - please email info [at] openspending.org*
|
||||
|
||||
### Key
|
||||
|
||||
For each tool, we’ve outlined its use and the barrier to entry. Here is a guide to the rough categorisation we used:
|
||||
|
||||
<strong>Basic = An off-the-shelf tool that can be learned and used independently within 1 day. No installation on servers etc. required.</strong>
|
||||
|
||||
<strong>Intermediate = Between 1 day and 1 week to master basic functionality. May require tweaking existing code, but not writing new code. </strong>
|
||||
|
||||
<strong>Advanced = Requires code. </strong>
|
||||
|
||||
## Stage 1: Extracting and getting data
|
||||
|
||||
<table border="1">
|
||||
<tr>
|
||||
<td><strong>Issue</strong></td>
|
||||
<td><strong>Tools</strong></td>
|
||||
<td><strong>Level</strong></td>
|
||||
<td><strong>Notes</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Data not available</td>
|
||||
<td>Freedom of Information Portals (e.g. <a href="https://www.whatdotheyknow.com/">What Do They Know</a>, <a href="https://fragdenstaat.de/">Frag den Staat</a>).</td>
|
||||
|
||||
<td>Basic - though some education may be required to inform people that they have the right to ask, how to phrase an FOI request, whether it is possible to submit these requests electronically etc.</td>
|
||||
<td>While Freedom of Information portals are a good way of getting data, results often end up scattered. It would be useful to have results structured into data directories, so that successful responses could be searched together with proactively released data and there was one common source for data.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Data available online but not downloadable. (e.g. in HTML tables on webpages).</td>
|
||||
<td> <em>For simple sites</em> (information on an individual webpage) Google Spreadsheets and ImportHTML Function, or the <a href ="https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd?hl=en">Google scraper extension</a> (basic).
|
||||
<em>For more complex webpages</em> (information spread across numerous pages) - a scraper will be required. Scrapers are ways to extract structured information from websites using code. There is a useful online tool that makes this easier: <a href="https://scraperwiki.com/">Scraperwiki</a> (advanced).
|
||||
|
||||
</td>
|
||||
<td>For the basic level, anyone who can use a spreadsheet and functions can use it. It is not, however, a well-known command, and awareness must be spread about how it can be used. (People are often daunted because they presume scraping involves code.) Scraping using code is advanced, and requires knowledge of at least one programming language. </td>
|
||||
<td>
|
||||
The need to be able to scrape was mentioned in <em>every</em> country we interviewed in the Athens to Berlin Series.
|
||||
|
||||
For more information, or to learn to start scraping, see the <a href="http://schoolofdata.org/handbook/courses/scraping/">School of Data course on Scraping</a>.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Data available only in PDFs (or worse, images)</td>
|
||||
<td>A variety of tools are available to extract this information. The most promising non-code variants are <a href="http://finereader.abbyy.com/">ABBYY Finereader</a> (not free) and <a href="http://tabula.nerdpower.org/">Tabula</a> (new software, still a bit buggy, and users must be able to host it themselves).</td>
|
||||
<td>Most require knowledge of coding, though some progress is being made on non-technical tools. For more information, and to see some of the advanced methods, see the <a href="http://schoolofdata.org/handbook/courses/extracting-data-from-pdf/">School of Data course</a>.</td>
|
||||
<td>
|
||||
Note: these tools are still imperfect and it is still vastly preferable to advocate for data in the correct formats, rather than teach people how to extract.
|
||||
|
||||
Recently published <a href="https://www.gov.uk/service-manual/design-and-content/choosing-appropriate-formats.html">guidelines</a> coming directly from government in the UK and US can now be cited as examples to get data in the required formats.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Leaked data</td>
|
||||
<td>Several projects made use of secure dropboxes and services for whistleblowers. </td>
|
||||
<td>Advanced - security of utmost concern.</td>
|
||||
<td>For example: <a href="http://atlatszo.hu/magyarleaks/">MagyarLeaks</a></td>
|
||||
</tr>
|
||||
</table>
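
For the "data available online but not downloadable" case in the table above, the following is a minimal Python sketch of what a scraper for a single HTML table does, using only the standard library. The sample HTML, the class name `TableParser` and the helper `parse_table` are our own illustrative assumptions; a real page would usually need fetching, error handling and handling of nested tables on top of this.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table rows found in `html` as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


# Hypothetical sample page content.
sample = """
<table>
  <tr><th>Supplier</th><th>Amount</th></tr>
  <tr><td>Acme Paving Ltd</td><td>18250.00</td></tr>
</table>
"""
print(parse_table(sample))
```

The resulting list of rows can then be written out as CSV or loaded into a spreadsheet.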
|
||||
## Stage 2: Cleaning, Working with and Analyzing Data
|
||||
|
||||
<table border="1">
|
||||
<tr>
|
||||
<td><strong>Issue</strong></td>
|
||||
<td><strong>Tools</strong></td>
|
||||
<td><strong>Level</strong></td>
|
||||
<td><strong>Notes</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Messy data, typos, blanks (various)</td>
|
||||
<td>Spreadsheets, <a href="http://openrefine.org/">Open Refine</a>, Powerful text editors e.g. <a href="http://www.barebones.com/products/textwrangler/">Text Wrangler</a> plus knowledge of Regular Expressions.</td>
|
||||
<td>Basic -> Advanced</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Need to reconcile entities against one another to answer questions such as, "what is company X?", "Is company X Ltd. the same as company X?" (ditto for other types of entities e.g. departments, people).</td>
|
||||
<td><a href="http://nomenklatura.okfnlabs.org/">Nomenklatura</a>, <a href="http://opencorporates.com/">OpenCorporates</a>, <a href="http://publicbodies.org/">publicbodies.org</a></td>
|
||||
<td>Advanced (all)</td>
|
||||
<td>
|
||||
Reconciling entities is complicated both because of the tools needed and because of the often inaccurate state of the data.
|
||||
|
||||
Working with data without common identifiers and data of poor quality makes entity reconciliation highly complicated and can cause big gaps in analysis.
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Need to be able to conceptualize networks and relationships between entities (See dedicated section on Network Mapping below).</td>
|
||||
<td>Gephi</td>
|
||||
<td>Intermediate - advanced.</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Need to be able to work with very many lines of data (too many to fit in Excel).</td>
|
||||
<td> OpenSpending.org, other database software (PostgreSQL, MySQL), command line tools</td>
|
||||
<td> OpenSpending.org is easy for basic upload, search and interrogation. In OpenSpending and other databases, some advanced queries may require knowledge of coding. </td>
|
||||
<td> Note: As few countries currently release transaction-level data, this is not a frequent problem, but it is already problematic in places such as Brazil, the US and the UK. As we push for greater disclosure, this will be needed ever more.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Performing repetitive tasks or modelling </td>
|
||||
<td>Macros - Excel</td>
|
||||
<td>Basic - Intermediate.</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Entity Extraction (e.g. from large bodies of documents) </td>
|
||||
<td> <a href="http://www.opencalais.com/">Open Calais</a>, <a href="http://developer.yahoo.com/contentanalysis/">Yahoo/YQL Content Analysis API</a>, <a href="http://openup.tso.co.uk/des">TSO data enrichment service</a></td>
|
||||
<td> Intermediate</td>
|
||||
<td> This is far from a perfect method and it would be vastly easier to answer questions relating to entities if they were codified by a unique identifier. </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Analysis needs to be performed on datasets that are published in different languages (e.g. in India) </td>
|
||||
<td>To some extent: Google Translate for web based data.</td>
|
||||
<td>Basic</td>
|
||||
<td>Still searching for a solution to automatically translate offline spreadsheets. </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Figures change in data after publication </td>
|
||||
<td>
|
||||
For non-machine readable data - tricky.
|
||||
|
||||
For simple, machine readable file formats, such as CSV - version control is a possibility.
|
||||
|
||||
For web-based data - some scrapers can be configured to trigger (e.g. email someone) whenever a field changes.</td>
|
||||
|
||||
<td> Intermediate to advanced </td>
|
||||
<td> Future projects that are likely to tackle this problem: <a href="http://dansinker.com/post/49856260511/opennews-code-sprints-do-some-spring-cleaning-on-data">DeDupe</a>.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Finding statistical patterns in spending data (such analysis depends on high data quality) </td>
|
||||
<td> R (free), SPSS (proprietary) and other statistical software for clustering and anomaly detection (also see note). </td>
|
||||
<td> Advanced </td>
|
||||
<td> Examples: Data from Supervizor has been used to track how spending on contractors changes when the government changes
|
||||
(<a href="https://www.kpk-rs.si/en/project-transparency/supervizor-73">Supervizor</a>).
|
||||
<em>A note on statistical analysis software can be found below.</em></td>
|
||||
</tr>
|
||||
</table>
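
For the "figures change in data after publication" row above, a version control system such as Git will show line-level differences between two versions of a CSV file. The Python sketch below illustrates the same idea at the level of individual rows. It assumes, purely for illustration, that each row carries a stable identifier column called `transaction_id`, which many published datasets will not have.

```python
import csv
import io


def load(csv_text, key="transaction_id"):
    """Index the rows of a CSV document by their identifier column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def changed_rows(old_text, new_text, key="transaction_id"):
    """Return {id: (old_row, new_row)} for rows added, removed or altered."""
    old, new = load(old_text, key), load(new_text, key)
    return {
        row_id: (old.get(row_id), new.get(row_id))
        for row_id in sorted(old.keys() | new.keys())
        if old.get(row_id) != new.get(row_id)
    }


# Hypothetical snapshots of the same file, downloaded at two different times.
january = "transaction_id,supplier,amount\n1,Acme Paving Ltd,18250.00\n2,Northside Chambers,3100.00\n"
march = "transaction_id,supplier,amount\n1,Acme Paving Ltd,18250.00\n2,Northside Chambers,2100.00\n"
print(changed_rows(january, march))
```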
|
||||
<strong>Note on SPSS and R:</strong> It’s our impression that most interviewees had been trained to use SPSS. R is nonetheless important to mention, as it offers free access to a broad selection of the same models, though through a programming interface.
|
||||
|
||||
A few examples of analysis on spending data, which can be done with statistical software such as SPSS or R:
|
||||
|
||||
|
||||
<strong>a)</strong> <a href="http://en.wikipedia.org/wiki/Hidden_Markov_model">Hidden Markov</a>: Hidden Markov models are widely used for finding patterns in bioinformatics, and have also turned out to be useful for predicting fraudulent and corrupt behaviour. Using Hidden Markov models requires high quality data; they were for instance used to analyse spending data from 50 million transactions in the Slovenian platform Supervizor.
|
||||
|
||||
|
||||
<strong>b)</strong> <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford's law</a>: Benford's law compares the distribution of leading digits in your data with the distribution that would be expected in naturally occurring figures. Deviations from the expected distribution can help detect fraudulent reporting (e.g. if companies tend to report earnings of less than $500 million in order to fit a particular regulation, Benford’s law could be a tool to detect that). See this example using Benford’s law to test the release of all <a href="http://friism.com/tax-records-for-danish-companies">Danish corporate tax filings</a> and this <a href="http://friism.com/tax-records-for-danish-companies">R blog post on the topic</a>.
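
To make the Benford's law check concrete, here is a minimal sketch. It is written in Python rather than R or SPSS purely for illustration, and the sample amounts are invented. Under Benford's law, the expected share of values with leading digit d is log10(1 + 1/d); the sketch compares that with the observed shares. A real analysis would need far more values and a proper statistical test before drawing any conclusions.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value):
    """First non-zero digit of a non-zero number."""
    # Scientific notation always starts with the leading digit, e.g. '4.25...e-02'.
    return int(f"{abs(value):.15e}"[0])


def first_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v)
    total = sum(counts.values())
    if total == 0:
        return {d: 0.0 for d in range(1, 10)}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Invented sample amounts; far too few for a meaningful test.
amounts = [120.0, 1450.5, 19.99, 230.0, 3100.0, 47.2, 1800.0, 96.0, 1.5]
observed = first_digit_shares(amounts)
for d in range(1, 10):
    print(d, f"observed={observed[d]:.2f}", f"expected={benford_expected(d):.2f}")
```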
|
||||
|
||||
Finally, a few notes on the differences between SPSS and R: though SPSS is fairly easy to get started with, it can be difficult to collaborate around, as it uses its own SPSS data format. Some models might also be unavailable in the basic SPSS package. R is the free alternative: it uses a programming interface, all extensions are accessible, and community support and code samples are widely available. One possible compromise bridging the convenience of SPSS and the wide usability of R is the proprietary software <a href="http://www.revolutionanalytics.com/">Revolution R</a>.
|
||||
|
||||
## Stage 3: Presenting Data
|
||||
|
||||
<table border="1">
|
||||
<tr>
|
||||
<td><strong>Issue</strong></td>
|
||||
<td><strong>Tools</strong></td>
|
||||
<td><strong>Level</strong></td>
|
||||
<td><strong>Notes</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Basic visualisation, time series, bar charts </td>
|
||||
<td> <a href="http://datawrapper.de/">DataWrapper</a>, <a href="http://www.tableausoftware.com/public/">Tableau Public</a>, <a href="http://www-958.ibm.com/software/analytics/manyeyes/">Many Eyes</a>, Google Tools </td>
|
||||
<td> Basic</td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> More advanced visualisation</td>
|
||||
<td> <a href="http://d3js.org/">D3.js</a> </td>
|
||||
<td> Advanced </td>
|
||||
<td> Used in e.g. <a href="http://openbudgetoakland.org/2012-2013-sankey.html">OpenBudgetOakland</a> </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Mapping </td>
|
||||
<td> <a href="http://www.mapbox.com/tilemill/">TileMill</a>, <a href="http://www.google.com/drive/apps.html#fusiontables">Fusion Tables</a>, <a href="http://kartograph.org/">Kartograph</a>, <a href="http://www.qgis.org/">QGIS</a> </td>
|
||||
<td> Basic -> Advanced </td>
|
||||
<td></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Creating a citizen’s budget </td>
|
||||
<td> OpenSpending.org, off-the-shelf tools listed above. A Disqus commenting module has been added to OpenSpending for commenting and feedback.</td>
|
||||
<td> OpenSpending.org - making a custom visualisation - basic. Making a custom site enabling discussion - advanced. </td>
|
||||
<td> Used in e.g. <a href="http://openbudgetoakland.org/2012-2013-sankey.html">OpenBudgetOakland</a> </td>
|
||||
</tr>
|
||||
</table>
|
||||
## Publishing Data
|
||||
|
||||
<table border="1">
|
||||
<tr>
|
||||
<td><strong>Issue</strong></td>
|
||||
<td><strong>Tools</strong></td>
|
||||
<td><strong>Level</strong></td>
|
||||
<td><strong>Notes</strong></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td> Need a place online to store and manage raw data, especially from Freedom of Information requests. </td>
|
||||
<td> DataNest, CKAN, Socrata - various Data Portal Software options. </td>
|
||||
<td> Basic to use. </td>
|
||||
<td> Can require a programmer to get running and set up a new instance. </td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Individual storage of and online collaboration around datasets</td>
|
||||
<td> Google Spreadsheets, Google Fusion Tables, Github </td>
|
||||
<td> Google Spreadsheets and Google Fusion Tables - basic. Github - intermediate. </td>
|
||||
<td> </td>
|
||||
</tr>
|
||||
</table>
|
||||
### Notes
|
||||
|
||||
See also the resources section in the <a href="http://openspending.org/resources/handbook/ch014_resources.html">Spending Data Handbook</a>
|
||||
|
||||
Note: Many of these tools will have difficulty working on Internet Explorer (especially older versions), but we have yet to find more powerful tools which also work there.
|
||||
|
||||
## A note on Network Analysis
|
||||
|
||||
As you will see from the case studies in the videos, Network Analysis is an area that more and more people are looking into with regard to public procurement and contracts.
|
||||
|
||||
Network visualisations are commonly used as a solution to this problem. However, we would caution that they should be used sparingly: because of the amount of data on which they are often used, they can be overwhelming, and the average non-expert can find them hard to interpret.
|
||||
|
||||
Often the types of information that it is possible to extract from a network visualisation, e.g. “who is best connected?” or “are there links between person X and person Y?”, could more easily be found with a searchable database of connections.
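
As a rough illustration of what such a searchable database of connections might do, the Python sketch below answers both questions without drawing anything. The entities and links are invented, and the function names `build_index`, `best_connected` and `find_path` are our own; tools such as Gephi offer comparable non-visual functions.

```python
from collections import defaultdict, deque


def build_index(links):
    """Map each entity to the set of entities it is directly linked to."""
    index = defaultdict(set)
    for a, b in links:
        index[a].add(b)
        index[b].add(a)
    return index


def best_connected(index):
    """Entity with the most direct links."""
    return max(index, key=lambda entity: len(index[entity]))


def find_path(index, start, goal):
    """Shortest chain of links from `start` to `goal`, or None if there is none."""
    queue = deque([[start]])   # breadth-first search over partial paths
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(index.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented example links (e.g. directorships or contracts).
links = [
    ("Person X", "Company A"),
    ("Company A", "Person Y"),
    ("Person Z", "Company A"),
    ("Person Y", "Company B"),
]
index = build_index(links)
print(best_connected(index))
print(find_path(index, "Person X", "Person Y"))
```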
|
||||
|
||||
It may also be wise to separate tools suitable for investigating the data from tools used to present the data. In the latter case, aiming for clarity and not overloading the visualisation will most likely yield a better result, so this is one area in which custom infographics may win out in terms of delivering value.
|
||||
|
||||
### Existing solutions for network mapping:
|
||||
|
||||
For producing network visualisations there are currently several solutions, some of them open source:
|
||||
|
||||
* [Gephi](https://gephi.org/) (Again note that Gephi has non-visualisation functions to explore the data, which at times may be more useful in exploring the interconnections than the visualisations themselves).
|
||||
* [Mapa 76](http://mapa76.com/) - This is also interesting due to the function which is being developed to extract individual entities.
|
||||
* [RelFinder](http://www.visualdataweb.org/relfinder.php) Based on DBpedia, this tool structures and maps out relations between entities based on which articles they feature in on Wikipedia.
|
||||
* Google Fusion Tables has a network function
|
||||
* NodeXL is a powerful network toolkit for Excel.
|
||||
* [Cytoscape](http://www.cytoscape.org/)
|
||||
|
||||
## Some favourite examples of (non) Network ways of presenting hierarchies, relationships and complex systems:
|
||||
|
||||
* [Connected China (Reuters)](http://connectedchina.reuters.com/) - enables the user to easily see family connections, political coalitions, leaders and connections. Additionally - it gives a detailed organisational diagram of the Communist Party of China, as well as timelines of people’s rise to power.
|
||||
* [Little Sis](http://littlesis.org/). This is an American database of political connections, including party donations, career histories and family members. Read their About Page for more details of the questions they seek to answer.
|
||||
|
||||
### Further reading:
|
||||
|
||||
<ul>
|
||||
<li><a href="http://www.ucl.ac.uk/secret/events/event-tabbed-box/seminars-accordian/social-network">UCL Materials</a></li>
|
||||
<li> <a href="http://www.cgi.com/sites/cgi.com/files/white-papers/Implementing-social-network-analysis-for-fraud-prevention.pdf">CGI Materials</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li>A pipeline for local councils to address privacy concerns about publishing transaction-level data. In the UK, despite clear guidelines about what should be removed from data before publication, a few councils have published sensitive data over the past year. Some companies are looking at maintaining suppression lists for this data; ideally, however, this should be done within government, prior to publication, so workflows need to be developed for it.</li>
|
||||
<li>Tools to help spot absence of publication as it happens. Currently, civil-society-led initiatives such as the Open Budget Survey can only monitor publication of key budget documents retrospectively, and only with large amounts of people power. A couple of possibilities spring to mind:</li>
|
||||
<ul>
|
||||
<li>In the UK, the OpenSpending team have been working with the data.gov.uk team to <a href="http://community.openspending.org/2012/09/uk-departmental-government-spending-improving-the-quality-of-reporting/">produce automated reports</a> that help those enforcing transparency obligations to see which departments are not compliant with them. The reports check both the timeliness and the structure and format of the data. This proved very successful at prompting data release initially: departments were given advance warning that the tool would be featured and that any department without up-to-date data would be flagged in red; by launch date, nearly all departments had updated their data. This is possible where:</li>
|
||||
<ul>
|
||||
<li>The data are published via a central platform (e.g. data.gov.uk)</li>
|
||||
<li>The data are machine readable, so a computer program can quickly ascertain whether the required fields are present.</li>
|
||||
<li>There is a standard layout for the data, so a computer can quickly verify whether column headings are correct and all present.</li>
|
||||
</ul>
|
||||
<li>Introducing a calendar of expected publication dates for particular datasets, so that organisations know when a document is due and can press for its release if it does not appear. This could be done either on a country-by-country basis, or simply by aligning with internationally recognised best-practice guidelines.</li>
|
||||
</ul>
|
||||
<li>Tools which help to remove duplication of effort. For example, if one organisation has already cleaned up data or extracted it from a PDF, it could be encouraged to share that data so that another organisation does not have to waste time doing the same.</li>
|
||||
</ul>
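The automated checks described in the list above (required fields present, standard column headings, publication on time) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind the data.gov.uk reports: the column names in `REQUIRED_COLUMNS`, the dataset names and the calendar structure are all hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical standard layout; a real check would use the official template.
REQUIRED_COLUMNS = ["Department", "Date", "Supplier", "Amount"]


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column headings absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def overdue_datasets(calendar, published, today):
    """Return names of datasets whose expected date has passed unpublished.

    calendar:  dict mapping dataset name -> expected publication date
    published: set of dataset names that have already been published
    today:     the date to compare against
    """
    return sorted(
        name
        for name, due in calendar.items()
        if due < today and name not in published
    )


if __name__ == "__main__":
    # Made-up inputs, purely illustrative.
    sample = "Department,Date,Amount\nDept A,2012-09-01,100\n"
    print("Missing columns:", missing_columns(sample))

    calendar = {
        "dept-a-spend": date(2012, 9, 30),
        "dept-b-spend": date(2012, 9, 30),
    }
    print("Overdue:", overdue_datasets(calendar, {"dept-a-spend"}, date(2012, 10, 15)))
```

Both checks depend on the conditions listed above: without machine-readable files and an agreed layout, there is no header row to compare against, and without a central platform or publication calendar there is nothing to tell the program what should have appeared by when.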
|
||||
**Next**: [Common arguments against publishing data](./machinereadfaq/)
|
||||
|
||||
**Up**: [Appendix](../)
|
||||