---
lead: true
title: Tool Ecosystem
authors:
- Neil Ashton
---
## Spending Data: The Tool Ecosystem
There is a set of staple tools that can be used to tackle many of the issues highlighted by the organisations in this report. For each one, we’ve outlined what the tool is, what it’s useful for and what the barrier to entry is.
We continue to hunt for more and better tools to do the job and hope that some of the problems, such as governments publishing their data in PDFs or HTML, will soon be irrelevant, so that we can all focus on more important things.
*If you would like to suggest a tool to be added to this ecosystem, please email info [at] openspending.org.*
### Key
Here’s a guide to the rough categorisation of barriers to entry we used:

- **Basic**: an off-the-shelf tool that can be learned, and a first independent use made of, within one day. No installation on servers etc. required.
- **Intermediate**: between one day and one week to master basic functionality. May require tweaking of existing code, but not writing new code.
- **Advanced**: requires writing code.
## Stage 1: Extracting and Getting Data
| Issue | Tools | Level | Notes |
|-------|-------|-------|-------|
| Data not available | Freedom of Information portals (e.g. What Do They Know, Frag den Staat). | Basic, though some education may be required: that people have the right to ask, how to phrase an FOI request, whether requests can be submitted electronically, etc. | While Freedom of Information portals are a good way of getting data, results often end up scattered. It would be useful to have results structured into data directories, so that successful responses could be searched together with proactively released data from one common source. |
| Data available online but not downloadable (e.g. in HTML tables on webpages). | For simple sites (information on an individual webpage): Google Spreadsheets and the ImportHTML function, or the Google scraper extension (basic). For more complex sites (information spread across numerous pages) a scraper is required. Scrapers extract structured information from websites using code; ScraperWiki is a useful online tool that makes this easier (advanced). | At the basic level, anyone who can use a spreadsheet and functions can do this; the function is simply not well known, and awareness must be spread about how it can be used (people are often daunted because they presume scraping involves code). Scraping using code is advanced and requires knowledge of at least one programming language. | The need to be able to scrape was mentioned in every country we interviewed in the Athens to Berlin series. For more information, or to start learning, see the School of Data course on scraping, and the sketch below this table. |
| Data available only in PDFs (or worse, images). | A variety of tools can extract this information. The most promising non-code options are ABBYY FineReader (not free) and Tabula (new software, still a bit buggy, and users must host it themselves). | Most methods require knowledge of coding, though some progress is being made on non-technical tools. For more information, and some of the advanced methods, see the School of Data course. | These tools are still imperfect, and it remains vastly preferable to advocate for data in the correct formats rather than teach people how to extract it. Recently published guidelines from governments in the UK and US can now be cited as examples when asking for data in the required formats. |
| Leaked data | Several projects made use of secure dropboxes and services for whistleblowers. | Advanced; security is of the utmost concern. | For example: MagyarLeaks. |
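For anyone trying the basic route above, the Google Spreadsheets formula looks like `=IMPORTHTML("http://example.gov/spending", "table", 1)`, which imports the first HTML table found on a page (the URL here is a placeholder). For the coded route, the sketch below shows the general shape of a scraper in Python, assuming the same placeholder URL and the widely used requests and beautifulsoup4 libraries; any real government site will need its own site-specific handling.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the page that holds the table you need.
URL = "http://example.gov/spending"

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table")  # naive: takes the first table on the page

with open("spending.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        # Header rows use <th>, data rows use <td>; handle both.
        cells = row.find_all(["th", "td"])
        writer.writerow(cell.get_text(strip=True) for cell in cells)
```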
## Stage 2: Cleaning, Working with and Analysing Data
| Issue | Tools | Level | Notes |
|-------|-------|-------|-------|
| Messy data: typos, blanks, etc. | Spreadsheets; OpenRefine; powerful text editors (e.g. TextWrangler) plus knowledge of regular expressions. | Basic to advanced | See the cleaning sketch below this table. |
| Need to reconcile entities against one another to answer questions such as “What is company X?” or “Is company X Ltd. the same as company X?” (likewise for other types of entities, e.g. departments or people). | Nomenklatura, OpenCorporates | Advanced (all) | Reconciling entities is complicated, both because of the tools needed and because of the often inaccurate state of the data. Working with poor-quality data that lacks common identifiers makes entity reconciliation highly complicated and can cause big gaps in analysis. See the matching sketch below this table. |
| Need to be able to conceptualise networks and relationships between entities (see the dedicated section on network mapping below). | Gephi | Intermediate to advanced | |
| Need to be able to work with very many lines of data (too big to fit in Excel). | OpenSpending.org; other database software (PostgreSQL, MySQL); command-line tools. | OpenSpending.org is easy for basic upload, search and interrogation; in OpenSpending and other databases, some advanced queries may require knowledge of coding. | As few countries currently release transaction-level data, this is not yet a frequent problem, but it is already an issue in places such as Brazil, the US and the UK. As we push for greater disclosure, this will be needed ever more. |
| Performing repetitive tasks or modelling | Excel macros | Basic to intermediate | |
| Entity extraction (e.g. from large bodies of documents) | OpenCalais, Yahoo/YQL Content Analysis API, TSO data enrichment service | Intermediate | This is far from a perfect method; it would be vastly easier to answer questions relating to entities if they were codified with unique identifiers. |
| Analysis needs to be performed on datasets published in different languages (e.g. in India). | To some extent: Google Translate for web-based data. | Basic | Still searching for a solution to automatically translate offline spreadsheets. |
| Figures change in data after publication. | For non-machine-readable data: tricky. For simple machine-readable formats such as CSV, version control is a possibility. For web-based data, some scrapers can be configured to trigger an alert (e.g. email someone) whenever a field changes. | Intermediate to advanced | Future projects likely to tackle this problem: DeDupe. |
| Finding statistical patterns in spending data (such analysis depends on high data quality). | R (free), SPSS (proprietary) and other statistical software for clustering and anomaly detection. | Advanced | Example: data from Supervizor has been used to track changes in spending on contractors across changes in government. A note on statistical analysis software can be found below. |
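To illustrate the “messy data” row above, here is a minimal cleaning sketch using only Python’s standard csv and re modules. The file name and column name are invented for the example; real datasets will need their own rules.

```python
import csv
import re

# Hypothetical input: a CSV with a messy "amount" column containing
# values such as " £1,234.50 " or blanks.
AMOUNT_RE = re.compile(r"[^\d.\-]")  # drop everything except digits, dot, minus

def clean_amount(raw):
    """Return a float, or None for blank or unparseable cells."""
    value = AMOUNT_RE.sub("", raw.replace(",", ""))  # strip symbols, separators
    try:
        return float(value)
    except ValueError:
        return None

with open("spending_raw.csv", newline="") as f:
    for row in csv.DictReader(f):
        amount = clean_amount(row.get("amount", ""))
        if amount is None:
            print("needs manual review:", row)
```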
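Entity reconciliation proper needs tools like Nomenklatura, but a rough first pass can be scripted with fuzzy string matching to surface candidate matches for human review. A minimal sketch using only Python’s standard library, with invented company names and an arbitrary 0.8 confidence threshold:

```python
from difflib import SequenceMatcher

# A hypothetical canonical list and some raw supplier names to reconcile.
canonical = ["Acme Ltd", "Globex Corporation", "Initech"]
raw_names = ["ACME LIMITED", "Globex Corp.", "Umbrella plc"]

def normalise(name):
    # Lower-case and collapse a few common suffix variants before comparing.
    name = name.lower().replace("limited", "ltd").replace("corporation", "corp")
    return " ".join(name.split()).rstrip(".")

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

for raw in raw_names:
    best = max(canonical, key=lambda c: similarity(raw, c))
    score = similarity(raw, best)
    if score > 0.8:
        print(f"{raw!r} -> {best!r} (score {score:.2f})")
    else:
        print(f"{raw!r}: no confident match; review manually")
```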
## Presenting Data
| Issue | Tools | Level | Notes |
|-------|-------|-------|-------|
| Basic visualisation: time series, bar charts | DataWrapper, Tableau Public, Many Eyes, Google tools | Basic | See the charting sketch below this table. |
| More advanced visualisation | D3.js | Advanced | Used in e.g. OpenBudgetOakland. |
| Mapping | TileMill, Fusion Tables, Kartograph, QGIS | Basic to advanced | |
| Creating a citizen’s budget | OpenSpending.org, plus the off-the-shelf tools listed above. A Disqus commenting module was added to OpenSpending for commenting and feedback. | Making a custom visualisation on OpenSpending.org is basic; making a custom site that enables discussion is advanced. | Used in e.g. OpenBudgetOakland. |
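The tools in this table are point-and-click, but the same basic chart takes only a few lines of code. A sketch using Python’s matplotlib, with invented figures:

```python
import matplotlib.pyplot as plt

# Invented example figures: spending by department, in millions.
departments = ["Health", "Education", "Transport", "Defence"]
spending = [120.5, 98.3, 45.1, 77.9]

fig, ax = plt.subplots()
ax.bar(departments, spending)
ax.set_ylabel("Spending (millions)")
ax.set_title("Example: spending by department")
fig.tight_layout()
fig.savefig("spending_by_department.png")
```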
## Publishing Data
| Issue | Tools | Level | Notes |
|-------|-------|-------|-------|
| Need a place online to store and manage raw data, especially from Freedom of Information requests. | DataNest, CKAN, Socrata: various data portal software options. | Basic to use. | Can require a programmer to set up and run a new instance. See the query sketch below this table. |
| Individual storage of, and online collaboration around, datasets | Google Spreadsheets, Google Fusion Tables, GitHub | The Google tools are basic; GitHub is intermediate. | |
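Portals built on CKAN also expose the data catalogue through a JSON API, which is useful once FOI results are being stored there. A minimal sketch using the requests library, assuming CKAN’s public demo instance at demo.ckan.org; substitute your own portal’s base URL:

```python
import requests

# CKAN's action API; demo.ckan.org is CKAN's public demo instance.
BASE = "https://demo.ckan.org/api/3/action"

resp = requests.get(f"{BASE}/package_search", params={"q": "spending"}, timeout=30)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} matching datasets")
for dataset in result["results"][:5]:
    print("-", dataset["title"])
```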
### Notes
See also the resources section in the