Python for open data projects: who is using it and how you can too

Written by Dr Timothy Mansfield
Published on 16 July 2019

About the author

Tim Mansfield is a strategist, culture consultant and futures researcher, specialising in the cultural sector. He has been the CEO of the Interaction Consortium since August 2016.

Open data – detailed, freely available information – opens the way to many kinds of projects. Visualisations and analytics add value to it for use or resale. Public and community sector organisations, academic researchers, governments and market researchers can discover trends and needs. The possibilities are as varied as the data.

The Open Data Handbook lists cases where open data has:

improved transparency, democratic control and participation
improved or created new commercial products and services
improved efficiency and effectiveness of government services
generated new knowledge by combining data sources and uncovering patterns in large data volumes

Python is arguably the leading language for work with open data. Many valuable open data tools and code libraries are written in the language. Python is suitable for anything from quick scripts to elaborate applications.

Let’s talk more about open data and what makes Python good for open data projects.

What is open data?

According to the Open Definition from the Open Knowledge Foundation (OKF):

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

In my view, there are two aspects to being open: the legal and the technical.

Legally, open data should be available to everyone, under a licence that allows any kind of use. At most, the licence may require attributing the information to its source, and it may require anyone who republishes the information to keep it open. Open data licences won't prohibit republishing the data in a different format or creating an interactive application that uses it.

The OKF lists some conformant licences.

On the technical side, the data should use standard formats and open APIs. Using the data shouldn't require proprietary software. It's not obligatory to use open-source software to access open data, but it should always be an option. Commonly used formats include RDF, XML, CSV and JSON.
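As a minimal sketch of why these formats count as open, Python's standard library can read both CSV and JSON with no proprietary software involved. The sample records and the field names `station` and `pm25` below are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in CSV and in JSON.
CSV_TEXT = "station,pm25\nNorth,12.5\nSouth,8.0\n"
JSON_TEXT = '[{"station": "North", "pm25": 12.5}, {"station": "South", "pm25": 8.0}]'


def read_csv_records(text):
    """Parse CSV text into a list of dicts, converting pm25 to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"station": row["station"], "pm25": float(row["pm25"])} for row in reader]


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_records = read_csv_records(CSV_TEXT)
json_records = read_json_records(JSON_TEXT)
```

CSV carries every value as text, so the sketch converts `pm25` to a number explicitly, while JSON keeps the numeric type. After that conversion, both routes yield the same list of dictionaries.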

I think that ideally the data should also be well-structured. Presentation-focused formats such as HTML and PDF are good for human-readable reports, but usable data should also be available in a structured, machine-friendly format.

There are important differences between numerical data, structured data and linked data, which I’m going to gloss over in this short article. Open data may be any of these three types and you may use different toolsets for each of them.

Linked open data is a big topic, which we’ll come back to in a later article (but for now, have a peek at the Linked Open Data Cloud diagram).

Where does open data come from?

Open data can come from any source that has a lot of data and is willing to make it available. We hear about it a lot in connection with opening up government information. For example, several governments host directories of their open data:

Australia’s data.gov.au
The US government's data.gov
The UK's data.gov.uk
Uruguay's Open Data Catalog
Open Data Philippines
Open Data For Africa collects data for many countries across the continent

… it’s everywhere.

It can also come from research institutions and nonprofit organisations. For example, the Esri UK Open Data Portal catalogues UK-based sources of open data. The Wikimedia Foundation hosts Wikidata, the “central storage for the structured data of […] Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others”.

The Open Data Institute promotes open data internationally through advocacy and by building peer networks. It holds that “everyone must be able to take part in making data work for us all”. It recognises that not all kinds of data should be open and works towards defining ethical standards.

Python tools

Python regularly appears in lists of the top five programming languages. It isn't hard to learn, and it's easier to write quick scripts in Python than in Java or C. Python is an interpreted language, which means you can run it directly from the source code. Being able to write a few lines of Python and run them immediately is often handy.

Python's built-in data structures, such as lists, dictionaries and sets, make it easy to write code that reads and manipulates complex data. There are also well-tested modules for all kinds of purposes in the Python Package Index.
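To give a feel for that, here is a small sketch that summarises some records using only built-in and standard-library data structures. The records and the field names (`suburb`, `year`, `trees_planted`) are invented for the example.

```python
from collections import defaultdict

# Hypothetical records, as they might look after loading an open dataset.
records = [
    {"suburb": "Glebe", "year": 2018, "trees_planted": 40},
    {"suburb": "Glebe", "year": 2019, "trees_planted": 55},
    {"suburb": "Newtown", "year": 2019, "trees_planted": 30},
]

# Group totals by suburb with a dictionary that defaults to zero.
totals = defaultdict(int)
for record in records:
    totals[record["suburb"]] += record["trees_planted"]

# A set comprehension collects the distinct years covered.
years = {record["year"] for record in records}

# Sort suburbs from most to fewest trees planted.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Grouping, de-duplicating and ranking each take a line or two here, which is the sort of quick exploratory work a new dataset usually calls for.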

Specifically, there are many packages to handle open data. Python has lots of generic tools to access APIs, and also to use and manipulate CSV, RDF, JSON and XML data. Some specific tools that stand out:

The Pandas data analysis library gives Python developers additional tools for modelling and analysis. It saves a lot of coding effort on indexing, merging, format conversion and statistical work. Pandas is open source and available at no cost.
Jupyter Notebook, likewise based on Python, is an interactive environment for experimenting with ideas and generating reports and visualisations. Jupyter understands Pandas, and the two together make a great environment for creating anything from experimental views to polished reports.
Datasette is a tool for exploring and publishing data. It serves data from SQLite databases, and companion tools such as csvs-to-sqlite load in the least enjoyable but most used data format: CSV files.
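As a rough illustration of the merging and statistical work Pandas takes off your hands, the sketch below joins two tiny tables and derives a rate. It assumes Pandas is installed. The CSV extracts and the column names (`region`, `population`, `incidents`) are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV extracts from two open datasets sharing a "region" column.
POPULATION_CSV = "region,population\nNorth,1000\nSouth,2500\n"
INCIDENTS_CSV = "region,incidents\nNorth,5\nSouth,10\nSouth,15\n"

population = pd.read_csv(io.StringIO(POPULATION_CSV))
incidents = pd.read_csv(io.StringIO(INCIDENTS_CSV))

# Total the incidents per region, then join onto the population table.
incident_totals = incidents.groupby("region", as_index=False)["incidents"].sum()
combined = population.merge(incident_totals, on="region", how="left")

# Derive a rate per 1,000 residents.
combined["per_1000"] = combined["incidents"] * 1000 / combined["population"]
```

In a Jupyter Notebook, evaluating `combined` in a cell displays the result as a formatted table, which is a handy way to check each step as you go.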

Perhaps the most significant platform for publishing open data is CKAN.

The CKAN platform

CKAN is a powerful management system for open data, written primarily in Python. The name once stood for “Comprehensive Knowledge Archive Network”, but no one could remember that, so now it just stands for itself. A lot of public institutions use it, including the governments of the US, Canada, Italy, and Australia. Within Australia, the governments of New South Wales and South Australia have done significant data publication using CKAN.

CKAN is similar to data management systems such as Dataverse and DSpace. The latter two show up more in academic and research settings, while public institutions favour CKAN. Software modules make it possible to use them together, moving datasets between them or using the best features of each system.

A lot of CKAN's strength comes from its large software community. Over 200 third-party extensions are available. Open-source applications are also available to act as CKAN clients and to process extracted data. The CKAN Association Government Working Group encourages development of policies, taxonomies and tools for open government data.

Uses for CKAN open data

Government agencies have found many exciting uses for open data. Data.SA, the South Australian Government's open data portal, is CKAN-powered and makes over 1,500 datasets available. Topics include legislation, air quality, crime, roads, public transportation and many more. All of them are available for analysis or research. The information is available in a variety of formats, including CSV, XML and JSON.
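CKAN portals expose a JSON Action API, so a catalogue like this can be searched from a script. The sketch below builds a `package_search` request and summarises the formats each matching dataset offers. The portal address `https://ckan.example.org` is a placeholder, the helper names are my own, and the sample response is made up and trimmed to the fields used here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical portal address; substitute the base URL of a real CKAN site.
PORTAL = "https://ckan.example.org"


def search_url(portal, query, rows=5):
    """Build a package_search URL for CKAN's Action API (version 3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal}/api/3/action/package_search?{params}"


def summarise(response):
    """Return (title, sorted resource formats) for each dataset in a response."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [
        (
            dataset["title"],
            sorted({res["format"] for res in dataset.get("resources", []) if res.get("format")}),
        )
        for dataset in response["result"]["results"]
    ]


def search(portal, query, rows=5):
    """Fetch and summarise matching datasets (needs network access)."""
    with urlopen(search_url(portal, query, rows)) as reply:
        return summarise(json.load(reply))


# A trimmed, made-up response in the shape package_search returns.
SAMPLE = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {"title": "Air quality readings",
             "resources": [{"format": "CSV"}, {"format": "JSON"}, {"format": "CSV"}]}
        ],
    },
}
summary = summarise(SAMPLE)
```

Calling `search(PORTAL, "air quality")` against a real portal would perform the same summary on live results. The parsing is kept separate from the network call so it can be tried offline first.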

New South Wales boasts over 3,300 CKAN-based datasets. Some of the more intriguing titles include “StormTracker”, “NSW Live Trains”, and “Koala corridors in south-west Sydney”. Almost any kind of information can become an open dataset, as long as it doesn't expose government secrets or infringe on privacy.

The information as published is raw data. What's really exciting is what imaginative people can do with it. Identifying trends or combining different datasets can lead to discoveries that weren't at all apparent previously.

Entrepreneurial opportunities for open data

The beauty of open data is that anyone can take it and add value to it by applying it to the needs of citizens or customers. Having a large collection of numbers is one thing; having graphs and analytics that identify and address the needs and opportunities of real people is quite another.

Consulting businesses can specialise in analyses for clients, and businesses can analyse data for their own use. Putting together information in the right way provides material for identifying growing markets and new opportunities. Public-interest groups will find information that lets them argue for or against policy changes.

The beauty of Python and its ecosystem of open data tools is that an ordinary person with a big idea can muck around and see what they can make. It’s easy to find other developers to collaborate with. There’s lots of support for problem solving.

So… Try! Explore some of the catalogues we linked to above and see what ideas you can dream up. See how far you can get with some of the awesome tools you can access for free.

If you’d like some help – we love Python, we love open data, and we’ve got years of experience integrating weird stuff for cool purposes. So, if you have a great idea (and some budget to make it real), let’s talk and see if we can help.