Working in data journalism

Nassos Stylianou

Data journalist | BBC News

@nassos_

What is data journalism?

  • Sourcing reliable datasets
  • Finding stories in the data
  • Visualising them for our audience

Many different types of datasets

  • Open data: Government departments, Office for National Statistics, Eurostat, United Nations, World Health Organisation etc
  • Exclusive data: Freedom of Information requests, leaks, contacts, web scraping
  • Creating your own datasets

A project using open data

A project using exclusive data

A project using data we collected

How do we find stories in the data?

  • Combine different datasets together
  • Ask the right questions of the data
  • Work closely with subject-matter experts
  • Look for anything unexpected
  • Examine the outliers
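
As a minimal sketch of two of the points above (combining datasets and examining outliers), the snippet below joins two small tables on a shared key and flags values outside the common 1.5 × interquartile-range fences. All figures, area names and function names (`joinByKey`, `quantile`) are hypothetical, invented for illustration; the IQR rule is one conventional choice, not necessarily the method used in the projects shown.

```javascript
// Hypothetical figures, invented purely for illustration.
const population = [
  { area: 'A', people: 1000 },
  { area: 'B', people: 2000 },
  { area: 'C', people: 1500 },
  { area: 'D', people: 1200 },
  { area: 'E', people: 1800 },
];
const incidents = [
  { area: 'A', count: 10 },
  { area: 'B', count: 22 },
  { area: 'C', count: 15 },
  { area: 'D', count: 60 },
  { area: 'E', count: 18 },
];

// Join two datasets on a shared key, keeping only rows present in both.
function joinByKey(left, right, key) {
  const lookup = new Map(right.map((row) => [row[key], row]));
  return left
    .filter((row) => lookup.has(row[key]))
    .map((row) => ({ ...row, ...lookup.get(row[key]) }));
}

// Quantile of an already sorted array, using linear interpolation.
function quantile(sortedValues, q) {
  const pos = (sortedValues.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sortedValues[lower] + (sortedValues[upper] - sortedValues[lower]) * (pos - lower);
}

// Combining the datasets lets us compare rates rather than raw counts.
const joined = joinByKey(population, incidents, 'area').map((row) => ({
  ...row,
  ratePer1000: (row.count * 1000) / row.people,
}));

// Flag anything beyond 1.5 x IQR from the quartiles as worth a closer look.
const rates = joined.map((row) => row.ratePer1000).sort((a, b) => a - b);
const q1 = quantile(rates, 0.25);
const q3 = quantile(rates, 0.75);
const fence = 1.5 * (q3 - q1);
const outliers = joined.filter(
  (row) => row.ratePer1000 < q1 - fence || row.ratePer1000 > q3 + fence
);

console.log(outliers.map((row) => row.area));
```

An outlier flagged this way is a prompt for reporting, not a finding in itself: it may be a story, or it may be a data error.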

But first, we need to clean the data

  • Each dataset has its own challenges to clean
  • Take nothing at face value: make sure each variable means what you think it means
  • Just because the data comes from an official source doesn't mean it's clean
  • Lots of data is still published in PDFs
  • Data from the web may come in unfamiliar or hard-to-use formats
  • Even when the data comes in spreadsheet format, it often needs a lot of cleaning before it is ready for analysis
  • Trailing spaces, merged cells, spelling errors, concatenated columns, 0s standing in for missing data, and many more
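
A few of the problems listed above can be sketched in code. The rows, column names and the `cleanRow` helper below are hypothetical, and treating 0 as "no data" is an assumption that would need checking with whoever published the data.

```javascript
// Hypothetical raw rows, as they might come out of a messy spreadsheet.
const rawRows = [
  { area: ' Northtown ', result: '120 / 3400', price: '250000' },
  { area: 'northtown', result: '95 / 3100', price: '0' },
  { area: 'Southby  ', result: '40 / 900', price: ' 310,000 ' },
];

function cleanRow(row) {
  // Trim stray spaces and normalise capitalisation so names match up.
  const area = row.area.trim().toLowerCase();

  // Split a concatenated "cases / population" column into two numbers.
  const [cases, population] = row.result
    .split('/')
    .map((part) => Number(part.trim()));

  // Strip thousands separators, then treat 0 as missing (an assumption).
  const priceNumber = Number(row.price.replace(/,/g, '').trim());
  const price = priceNumber === 0 ? null : priceNumber;

  return { area, cases, population, price };
}

const cleanRows = rawRows.map(cleanRow);
console.log(cleanRows);
```

Keeping the cleaning steps in a script rather than editing cells by hand means they can be reviewed and re-run when the source is updated.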

Data analysis

  • In its simplest form, this will involve some basic summing, ranking, finding averages, looking at change over time
  • Using statistical models and techniques to analyse data
  • Analysing very large datasets - millions of observations
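
The simplest of these operations fit in a few lines. This is an illustrative sketch only: the yearly totals are invented and the helper names (`sum`, `mean`, `percentChange`) are hypothetical.

```javascript
// Hypothetical yearly totals, invented for illustration.
const yearlyTotals = [
  { year: 2015, total: 200 },
  { year: 2016, total: 250 },
  { year: 2017, total: 225 },
];

const sum = (values) => values.reduce((a, b) => a + b, 0);
const mean = (values) => sum(values) / values.length;

// Percentage change from one value to the next.
const percentChange = (from, to) => ((to - from) * 100) / from;

const totals = yearlyTotals.map((row) => row.total);
const average = mean(totals);

// Year-on-year change: compare each year with the one before it.
const changes = yearlyTotals.slice(1).map((row, i) => ({
  year: row.year,
  change: percentChange(yearlyTotals[i].total, row.total),
}));

console.log(average, changes);
```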

Grouping and ranking
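
A rough sketch of what grouping and ranking can look like in code, assuming a flat list of records. The regions, prices and the `groupBy` helper are hypothetical.

```javascript
// Hypothetical records, invented for illustration.
const sales = [
  { region: 'North', price: 100 },
  { region: 'South', price: 300 },
  { region: 'North', price: 200 },
  { region: 'East', price: 250 },
  { region: 'South', price: 500 },
];

// Collect rows into a Map keyed by the value of `key`.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row[key])) groups.set(row[key], []);
    groups.get(row[key]).push(row);
  }
  return groups;
}

// Summarise each group, sort from highest to lowest average, then add a rank.
const ranked = [...groupBy(sales, 'region')]
  .map(([region, rows]) => ({
    region,
    count: rows.length,
    average: rows.reduce((total, row) => total + row.price, 0) / rows.length,
  }))
  .sort((a, b) => b.average - a.average)
  .map((row, index) => ({ ...row, rank: index + 1 }));

console.log(ranked);
```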

Statistical modelling


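// Assumes Underscore.js is loaded and available as `_`.

// Properties compared between the user and each athlete.
var properties = ['weight', 'height'];

// Add up an array of numbers (0 for an empty array).
var sum = function (values) {
	return _.reduce(values, function (a, b) { return a + b; }, 0);
};

// Squared difference between two numbers.
var getSquaredDifference = function (a, b) {
	return Math.pow(a - b, 2);
};

// Absolute gap between two dates, in milliseconds.
var getDateDifference = function (a, b) {
	return Math.abs(new Date(a) - new Date(b));
};

// Attach to each athlete the squared Euclidean distance from the user
// across `properties`, plus the gap between their dates of birth.
var setDissimilarityDistance = function (athletes, user) {
	_.each(athletes, function (athlete) {
		var squareDifferences = [];
		_.each(properties, function (prop) {
			squareDifferences.push(getSquaredDifference(athlete[prop], user[prop]));
		});
		athlete.distance = sum(squareDifferences);
		athlete.birthDifference = getDateDifference(athlete.dob, user.dob);
	});

	return athletes;
};

// Return the three closest athletes: sorted by distance, with ties broken
// by the closest date of birth (_.sortBy is a stable sort).
var getBodyMatches = function (athletes, user) {
	var dataWithDistances = setDissimilarityDistance(athletes, user);
	var athleteMatches = _.sortBy(_.sortBy(dataWithDistances, 'birthDifference'), 'distance');
	return athleteMatches.slice(0, 3);
};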


					

Thankfully, it worked...

Working with large datasets

A detailed look at house prices

  • Started off with more than 22 million rows of all property transactions in England and Wales
  • Analysed 8.5 million property transactions over a 10-year period
  • Carried out inflation-adjustment to find how house prices had changed at very detailed geographical levels
  • Reproducible data analysis was key
  • Added benefit is that the analysis can be re-run quickly when the data is updated
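
Inflation adjustment generally means rescaling each price by the ratio of a price index in a chosen base year to the index in the year of sale. The sketch below assumes a simple yearly index with invented values; the slides do not say which inflation measure or base year was used, so `priceIndex`, `toRealPrice` and all figures are hypothetical.

```javascript
// Hypothetical price index (base year 2017 = 100), invented for illustration.
const priceIndex = { 2007: 80, 2012: 90, 2017: 100 };

// Convert a nominal price into base-year money:
// real = nominal * index[baseYear] / index[year]
function toRealPrice(nominalPrice, year, baseYear, index) {
  return (nominalPrice * index[baseYear]) / index[year];
}

const percentChange = (from, to) => ((to - from) * 100) / from;

// Hypothetical prices for one small area in two different years.
const price2007 = 160000;
const price2017 = 220000;

const real2007 = toRealPrice(price2007, 2007, 2017, priceIndex);
const nominalChange = percentChange(price2007, price2017);
const realChange = percentChange(real2007, price2017);

console.log({ real2007, nominalChange, realChange });
```

With these made-up numbers the nominal rise of 37.5% shrinks to 10% once inflation is taken into account. Because every step is scripted, the same calculation can be re-run whenever new transactions are published.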

What tools do we use to clean and analyse data?

  • Excel
  • Open Refine
  • Tabula: extract data tables from PDFs
  • R
  • Python
  • QGIS

Visualising data

A great data visualisation should...

  • Tell a story
  • Be visually stimulating
  • Be simple to understand
  • Be rewarding

Visualising data

  • Most of the data visualisations we do are static
  • If you ask the reader to do anything other than scroll, it needs to be worth it
  • If it doesn't work well on mobile, you are excluding more than half your audience
  • When dealing with large numbers, try to put them in terms that the audience can relate to
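
Putting a large number in relatable terms is often just a division. A small sketch with invented figures; `perPerson` and `oneIn` are hypothetical helper names.

```javascript
// Hypothetical figures, invented for illustration.
const totalSpending = 3000000000; // a headline total in pounds
const population = 60000000;
const peopleAffected = 12000;

// A total expressed as an amount per person.
const perPerson = (total, people) => total / people;

// A count expressed as "one in N" people, rounded to a whole number.
const oneIn = (count, people) => Math.round(people / count);

console.log(`£${perPerson(totalSpending, population)} per person`);
console.log(`1 in ${oneIn(peopleAffected, population)} people`);
```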

Personal relevance calculators

Tell me about me

  • Lets people explore the part of the data that is relevant to them
  • People like to compare themselves and see where they stand in relation to others
  • But make sure you are telling people something about THEMSELVES they don’t know – often for this you need granular data
  • Ask only for the info you need upfront: don't create unnecessary barriers to entry
  • A personal relevance calculator can be much more engaging than a regular story or dataviz
  • More shareable than a regular story or dataviz
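
At its core, showing people where they stand in relation to others can be a percentile calculation. This is a minimal sketch, assuming the calculator has a list of comparison values to hand; the data and the `percentileRank` helper are hypothetical.

```javascript
// Hypothetical comparison values (e.g. heights in cm), invented for illustration.
const heights = [150, 160, 165, 170, 175, 180, 185, 190, 195, 200];

// Percentage of values in the dataset that fall below the user's value.
function percentileRank(values, userValue) {
  const below = values.filter((value) => value < userValue).length;
  return (below * 100) / values.length;
}

const userHeight = 178; // the single piece of information asked of the user
const rank = percentileRank(heights, userHeight);

console.log(`You are taller than ${rank}% of the people in this dataset`);
```

Only one input is needed from the user here, which keeps the barrier to entry low.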

Links to all the stories mentioned in this presentation (1)

Links to all the stories mentioned in this presentation (2)

Thank you
