What is data journalism?

  • Sourcing reliable datasets
  • Finding stories in the data
  • Visualising them for our audience

Many different types of datasets

  • Open data: Government departments, Office for National Statistics, Eurostat, United Nations, World Health Organisation etc
  • Exclusive data: Freedom of Information requests, leaks, contacts, web scraping
  • Creating your own datasets

A project using open data

A project using exclusive data

A project using data we collected

How do we find stories in the data?

  • Combine different datasets together
  • Ask the right questions of the data
  • Work closely with subject-matter experts
  • Look for anything unexpected
  • Examine the outliers
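
As a rough illustration of two of the points above, combining datasets and examining outliers, here is a minimal sketch. The area names, figures, and the 1.5 x interquartile range rule are all assumptions chosen for the example, not taken from any real project.

```javascript
// Hypothetical figures, invented purely for illustration.
var spending = [
  { area: 'A', spend: 1200 },
  { area: 'B', spend: 950 },
  { area: 'C', spend: 2000 },
  { area: 'D', spend: 2200 },
  { area: 'E', spend: 1050 }
];
var population = [
  { area: 'A', people: 100 },
  { area: 'B', people: 100 },
  { area: 'C', people: 50 },
  { area: 'D', people: 200 },
  { area: 'E', people: 100 }
];

// Combine the two datasets on the shared `area` key.
var peopleByArea = {};
population.forEach(function (row) { peopleByArea[row.area] = row.people; });
var combined = spending.map(function (row) {
  return { area: row.area, spendPerPerson: row.spend / peopleByArea[row.area] };
});

// Quantile of a sorted array, using linear interpolation between ranks.
function quantile(sorted, q) {
  var pos = (sorted.length - 1) * q;
  var lo = Math.floor(pos);
  var hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Flag rows lying more than 1.5 interquartile ranges outside the quartiles.
function findOutliers(rows, key) {
  var sorted = rows.map(function (r) { return r[key]; })
    .sort(function (a, b) { return a - b; });
  var q1 = quantile(sorted, 0.25);
  var q3 = quantile(sorted, 0.75);
  var margin = 1.5 * (q3 - q1);
  return rows.filter(function (r) {
    return r[key] < q1 - margin || r[key] > q3 + margin;
  });
}

var outliers = findOutliers(combined, 'spendPerPerson');
console.log(outliers);
```

Area D has the highest raw spend, but only area C stands out once spend is divided by population, which is the kind of pattern that only appears after datasets are combined. An outlier like this is a prompt for further reporting, not a story in itself: it may equally be a data error.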

But first, we need to clean the data

  • Each dataset has its own challenges to clean
  • Take nothing at face value: make sure each variable means what you think it means
  • Just because the data comes from an official source doesn't mean it's clean
  • Lots of data is still in PDFs
  • Data from the web may come in unfamiliar or difficult-to-work-with formats
  • Even when the data comes in spreadsheet format, it often needs a lot of cleaning before it is ready for analysis
  • Common problems include trailing spaces, merged cells, spelling errors, concatenated columns, 0s meaning no data, and many more
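
A few of these problems can be sketched in code. The example below is a minimal illustration only: the rows, the column names (`name`, `regionAndCode`, `value`), and the rule that 0 means "no data" are all hypothetical, and a real dataset would need its own checks.

```javascript
// Hypothetical rows, as they might arrive from a messy spreadsheet export.
var rawRows = [
  { name: 'Northtown ', regionAndCode: 'North - N01', value: '1,204' },
  { name: '  Southville', regionAndCode: 'South - S02', value: '0' }
];

// Assumption: in this made-up dataset, a value of 0 means "no data".
function cleanRow(row) {
  // Split the concatenated "region - code" column into two fields.
  var parts = row.regionAndCode.split('-');
  // Remove thousands separators before converting text to a number.
  var value = parseFloat(String(row.value).replace(/,/g, ''));
  return {
    name: row.name.trim(),            // strip leading and trailing spaces
    region: parts[0].trim(),
    code: parts[1].trim(),
    value: value === 0 ? null : value // record missing data explicitly
  };
}

var cleanRows = rawRows.map(cleanRow);
console.log(cleanRows);
```

Turning the placeholder 0 into `null` keeps it out of later sums and averages, where it would otherwise be counted as a genuine zero.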

Data analysis

  • In its simplest form, this will involve some basic summing, ranking, finding averages, looking at change over time
  • Using statistical models and techniques to analyse data
  • Analysing very large datasets - millions of observations
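
The simplest kind of analysis mentioned above (summing, averaging, ranking, change over time) might look like the sketch below. The areas and figures are invented, and the field names `before` and `after` are placeholders for two time periods.

```javascript
// Hypothetical figures for three areas at two points in time.
var figures = [
  { area: 'A', before: 200, after: 250 },
  { area: 'B', before: 400, after: 380 },
  { area: 'C', before: 100, after: 150 }
];

function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

function mean(values) {
  return sum(values) / values.length;
}

// Percentage change from one period to the next.
function percentChange(before, after) {
  return (after - before) * 100 / before;
}

var afterValues = figures.map(function (row) { return row.after; });
var total = sum(afterValues);
var average = mean(afterValues);

// Rank areas from the largest increase to the largest decrease.
var ranked = figures.map(function (row) {
  return { area: row.area, change: percentChange(row.before, row.after) };
}).sort(function (a, b) { return b.change - a.change; });

console.log(total, average, ranked);
```

Ranking by percentage change rather than by raw totals puts area C first, even though it has the smallest figures, which is often where the more interesting question lies.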

Grouping and ranking

Statistical modelling

// Assumes Underscore.js is loaded and available as `_`.
var properties = ['weight', 'height'];

// Add up an array of numbers.
var sum = function (values) {
	return _.reduce(values, function (a, b) { return a + b; }, 0);
};

var getSquaredDifference = function (a, b) {
	return Math.pow(a - b, 2);
};

// Absolute gap between two dates, in milliseconds.
var getDateDifference = function (a, b) {
	return Math.abs(new Date(a) - new Date(b));
};

// Attach to each athlete the squared Euclidean distance from the user
// across `properties`, plus the gap between their dates of birth.
var setDissimilarityDistance = function (athletes, user) {
	_.each(athletes, function (athlete) {
		var squareDifferences = [];
		_.each(properties, function (prop) {
			squareDifferences.push(getSquaredDifference(athlete[prop], user[prop]));
		});
		athlete['distance'] = sum(squareDifferences);
		athlete['birthDifference'] = getDateDifference(athlete.dob, user.dob);
	});
	return athletes;
};

// Return the three closest athletes. _.sortBy is stable, so sorting by
// birthDifference first makes it the tie-breaker for equal distances.
var getBodyMatches = function (athletes, user) {
	var dataWithDistances = setDissimilarityDistance(athletes, user);
	var athleteMatches = _.sortBy(_.sortBy(dataWithDistances, 'birthDifference'), 'distance');
	return athleteMatches.slice(0, 3);
};


Thankfully, it worked...

Working with large datasets

A detailed look at house prices

  • Started off with more than 22 million rows of all property transactions in England and Wales
  • Analysed 8.5 million property transactions over a 10-year period
  • Carried out inflation-adjustment to find how house prices had changed at very detailed geographical levels
  • Reproducible data analysis was key
  • An added benefit is that the analysis can be re-run quickly when the data is updated
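
To show what inflation adjustment involves, here is a minimal sketch. It is not the method used in the project: the price index values, the transactions, the area name, and the choice of the median as the summary statistic are all assumptions made for illustration.

```javascript
// Hypothetical price index (base year 2014 = 100). Not real figures.
var priceIndex = { 2004: 80, 2014: 100 };

// Hypothetical transactions for one made-up area.
var transactions = [
  { area: 'X', year: 2004, price: 150000 },
  { area: 'X', year: 2004, price: 160000 },
  { area: 'X', year: 2004, price: 200000 },
  { area: 'X', year: 2014, price: 210000 },
  { area: 'X', year: 2014, price: 220000 },
  { area: 'X', year: 2014, price: 300000 }
];

// Convert a price paid in `year` into base-year money.
function toRealPrice(price, year, baseYear) {
  return price * priceIndex[baseYear] / priceIndex[year];
}

function median(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median inflation-adjusted price for one area in one year.
function realMedian(area, year, baseYear) {
  var prices = transactions.filter(function (t) {
    return t.area === area && t.year === year;
  }).map(function (t) {
    return toRealPrice(t.price, t.year, baseYear);
  });
  return median(prices);
}

var median2004 = realMedian('X', 2004, 2014);
var median2014 = realMedian('X', 2014, 2014);
var realChange = (median2014 - median2004) * 100 / median2004;
console.log(median2004, median2014, realChange);
```

In this made-up case the median sale price rises from 160,000 to 220,000 in cash terms (37.5%), but only from 200,000 to 220,000 (10%) once 2004 prices are expressed in 2014 money. Keeping every step in a script like this is also what allows the analysis to be re-run when new data arrives.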

What tools do we use to clean and analyse data?

  • Excel
  • OpenRefine
  • Tabula: extract data tables from PDFs
  • R
  • Python
  • QGIS

Visualising data

A great data visualisation should...

  • Tell a story
  • Be visually stimulating
  • Be simple to understand
  • Be rewarding

Visualising data

  • Most of the data visualisations we do are static
  • If you ask the reader to do anything other than scroll, it needs to be worth it
  • If it doesn't work well on mobile, you are excluding more than half your audience
  • When dealing with large numbers, try to put them in terms that the audience can relate to

Personal relevance calculators

Tell me about me

  • Lets people explore the part of the data that is relevant to them
  • People like to compare themselves and see where they stand in relation to others
  • But make sure you are telling people something about THEMSELVES they don’t know – often for this you need granular data
  • Ask for as little info as you need upfront: don't create unnecessary barriers to entry
  • A personal relevance calculator can be much more engaging than a regular story or dataviz
  • More shareable than a regular story or dataviz
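
The core of a "where do I stand?" calculator can be very small. The sketch below assumes a hypothetical list of salaries and asks the user for a single number; a real calculator would draw on a much larger, more granular dataset.

```javascript
// Hypothetical salaries, invented for illustration.
var salaries = [18000, 21000, 24000, 27000, 30000,
                35000, 42000, 50000, 65000, 90000];

// Percentage of values in the dataset that fall below the user's value.
function percentileRank(values, userValue) {
  var below = values.filter(function (v) { return v < userValue; }).length;
  return below * 100 / values.length;
}

var userSalary = 32000; // the only input requested from the reader
var rank = percentileRank(salaries, userSalary);
console.log('You earn more than ' + rank + '% of people in this dataset.');
```

Only one input is needed here, which keeps the barrier to entry low while still giving each reader a result that is specific to them.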

Thank you

