
Case Studies

03 00 cover

In this section we take a more in-depth, behind-the-scenes look at several data journalism projects—from apps developed in a day to nine-month investigations. We learn about how data sources have been used to augment and improve coverage of everything from elections to spending, riots to corruption, the performance of schools to the price of water. As well as larger media organizations such as the BBC, the Chicago Tribune, the Guardian, the Financial Times, Helsingin Sanomat, La Nación, the Wall Street Journal, and Zeit Online, we learn from smaller initiatives such as California Watch, Hacks/Hackers Buenos Aires, ProPublica, and a group of local Brazilian citizen journalists called Friends of Januária.

The Opportunity Gap

The Opportunity Gap used never-before-released U.S. Department of Education civil rights data and showed that some states, like Florida, have leveled the playing field and offer rich and poor students roughly equal access to high-level courses, while other states, like Kansas, Maryland, and Oklahoma, offer less opportunity in districts with poorer families.

03 YY
Figure 01. The Opportunity Gap project (ProPublica)

The data included every public school in a district with 3,000 students or more. More than three-quarters of all public-school children were represented. A reporter in our newsroom obtained the data and our Computer-Assisted Reporting Director cleaned it very extensively.

It was roughly a three-month project. Altogether, six people worked on the story and news application: two editors, a reporter, a CAR person, and two developers. Most of us weren’t working on it exclusively throughout that period.

The project really required our combined skills: deep domain knowledge, an understanding of data best practices, design and coding skills, and so on. More importantly it required an ability to find the story in the data. It also took editing, not only for the story that went with it, but for the news app itself.

For the data cleaning and analysis we used mostly Excel and cleaning scripts, as well as MS Access. The news app was written in Ruby on Rails and uses JavaScript pretty extensively.

In addition to an overview story, our coverage included an interactive news application, which let readers understand and find examples within this large national dataset that related to them. Using our news app, a reader could find their local school—say, for example, Central High School in Newark, N.J.—and immediately see how well the school does in a wide variety of areas. Then they could hit a button that says "Compare to High and Low Poverty Schools", and immediately see other high schools, their relative poverty, and the extent to which they offer higher math, Advanced Placement, and other important courses. In our example, Central High is bookended by Millburn Sr. High. The Opportunity Gap shows how only 1% of Millburn students get Free or Reduced Price lunch but 72% of them are taking at least one AP course. At the other extreme, International High has 85% of its students getting Free/Reduced Price lunch and only 1% taking AP courses.

Through this example a reader can use something they know—​a local high school—​to understand something they don’t know: the distribution of educational access, and the extent to which poverty is a predictor of that access.

We also integrated the app with Facebook, so readers could log into Facebook and our app would automatically let them know about schools that might interest them.

Traffic to all of our news apps is excellent, and we’re particularly proud of the way this app tells a complex story; more to the point, it helps readers tell their own particular story for themselves.

As with many projects that start with government data, the data needed a lot of cleaning. For instance, while there are only around 30 possible Advanced Placement courses, some schools reported having hundreds of them. This took lots of manual checking and phone calls to schools for confirmation and corrections.
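
As a minimal illustration of that kind of check (this is not ProPublica's actual cleaning script, and the file and column names are hypothetical), a short script can flag any school whose reported number of AP courses exceeds a plausible ceiling, producing a call list for reporters:

[source,python]
----
import csv

MAX_PLAUSIBLE_AP_COURSES = 30  # roughly how many distinct AP courses exist

# Hypothetical file and columns: flag implausible AP course counts for follow-up.
with open("schools.csv", newline="") as f:
    for row in csv.DictReader(f):
        reported = int(row["ap_courses_offered"] or 0)
        if reported > MAX_PLAUSIBLE_AP_COURSES:
            print(f"{row['school_name']} ({row['state']}): "
                  f"reports {reported} AP courses -- needs a confirmation call")
----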

We also worked really hard at making sure the app told a ``far'' story and a ``near'' story. That is, the app needed to present the reader with a broad, abstract national picture; a way to compare how states did relative to each other on educational access. But given that abstraction sometimes leaves readers confused as to what the data means to them, we also wanted readers to be able to find their own local school and compare it to high- and low-poverty schools in their area.

If I were to advise aspiring data journalists interested in taking on this kind of project, I’d say you have to know the material and be inquisitive! All of the rules that apply to other kinds of journalism apply here. You have to get the facts right, make sure you tell the story well, and—​crucially—​make sure your news app doesn’t disagree with a story you’re writing. If it does, one of the two might be wrong.

Also, if you want to learn to code, the most important thing is to start. You might like learning through classes or through books or videos, but make sure you have a really good idea for a project and a deadline by which you have to complete it. If there’s a story in your head that can only come out as a news app, then not knowing how to program won’t stop you!

Scott Klein, ProPublica

A Nine Month Investigation into European Structural Funds

In 2010, the Financial Times and the Bureau of Investigative Journalism (BIJ) joined forces to investigate European Structural Funds. The intention was to review who the beneficiaries of European Structural Funds are and check whether the money was put to good use. At €347 billion over seven years, the Structural Funds make up the second largest subsidy program in the EU. The program has existed for decades, but apart from broad, generalized overviews, there was little transparency about who the beneficiaries are. As part of a rule change in the current funding round, authorities are obliged to make public a list of beneficiaries, including a project description and the amount of EU and national funding received.

03 OO 01
Figure 02. EU Structural Funds investigation (Financial Times and The Bureau of Investigative Journalism)

The project team was made up of up to 12 journalists and one full-time coder collaborating for nine months. Data gathering alone took several months.

The project resulted in five days of coverage in the Financial Times and the BIJ, a BBC radio documentary, and several TV documentaries.

Before you tackle a project of this level of effort, you have to be certain that the findings are original, and that you will end up with good stories nobody else has.

The process was broken up into a number of distinct steps.

1. Identify who keeps the data and how it is kept

The European Commission’s Directorate General for the Regions has a portal to the websites of regional authorities that publish the data. We believed that the Commission would have an overarching database of project data that we could either access directly, or which we could obtain through a Freedom of Information request. No such database exists to the level of detail we required. We quickly realized that many of the links the Commission provided were faulty and that most of the authorities published the data in PDF format, rather than analysis-friendly formats such as CSV or XML.

A team of up to 12 people worked to identify the latest data and collate the links into one spreadsheet we used for collaboration. Since the data fields were not uniform (for example, headers were in different languages, some datasets used different currencies, and some included breakdowns of EU and National Funding) we needed to be as precise as possible in translating and describing the data fields available in each dataset.

2. Download and prepare the data

The next step consisted of downloading all the spreadsheets, PDFs, and, in some cases, web scraping the original data.

Each dataset had to then be standardized. Our biggest task was extracting data out of PDFs, some hundreds of pages long. Much of this was done using UnPDF and ABBYY FineReader, which allow data to be extracted to formats such as CSV or Excel.

It also involved checking and double-checking that the PDF extraction tools had captured the data correctly. This was done using filtering, sorting, and summing up totals (to ensure it corresponded with what was printed on the PDFs).
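
A minimal sketch of that sanity check, assuming a hypothetical extracted CSV and column name: sum the funding amounts captured from a PDF and compare the result with the total printed on the document.

[source,python]
----
import csv

PRINTED_TOTAL = 1_234_567.89  # total shown on the source PDF, typed in by hand

extracted_total = 0.0
with open("beneficiaries_extracted.csv", newline="") as f:  # hypothetical file
    for row in csv.DictReader(f):
        # strip thousands separators before converting
        extracted_total += float(row["eu_and_national_funding"].replace(",", ""))

if abs(extracted_total - PRINTED_TOTAL) > 0.01:
    print(f"Mismatch: extracted {extracted_total:,.2f} vs printed {PRINTED_TOTAL:,.2f}")
else:
    print("Extracted rows add up to the printed total")
----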

3. Create a database

The team’s coder set up a SQL database. Each of the prepared files was then used as a building block for the overall SQL database. A once-a-day process would upload all the individual data files into one large SQL database, which could be queried on the fly through its front end via keywords.
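
The sketch below illustrates the general approach rather than the team's actual system: prepared CSV files are loaded into a single SQLite table, which can then be searched by keyword. File and column names are illustrative.

[source,python]
----
import csv
import glob
import sqlite3

conn = sqlite3.connect("structural_funds.db")
conn.execute("""CREATE TABLE IF NOT EXISTS beneficiaries
                (country TEXT, beneficiary TEXT, project TEXT, amount REAL)""")

# Load every prepared file into the single table (columns are illustrative).
for path in glob.glob("prepared/*.csv"):
    with open(path, newline="") as f:
        rows = [(r["country"], r["beneficiary"], r["project"], float(r["amount"]))
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO beneficiaries VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Keyword search, e.g. for hotel projects
for country, beneficiary, amount in conn.execute(
        "SELECT country, beneficiary, amount FROM beneficiaries "
        "WHERE project LIKE ?", ("%hotel%",)):
    print(country, beneficiary, amount)
----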

4. Double-checking and analysis

The team analyzed the data in two main ways:

Via the database front end

This entailed typing particular keywords of interest (e.g., "tobacco," "hotel," "company A") into the search engine. With the help of Google Translate, which was plugged into the search functionality of our database, those keywords would be translated into 21 languages and would return appropriate results. These could be downloaded and reporters could do further research on the individual projects of interest.

By macro-analysis using the whole database

Occasionally, we would download a full dataset, which could then be analyzed (for example, using keywords, or aggregating data by country, region, type of expenditure, number of projects by beneficiary, etc.).

Our story lines were informed by both these methods, but also through on-the-ground and desk research.

Double-checking the integrity of the data (by aggregating and checking against what authorities said had been allocated) took a substantial amount of time. One of the main problems was that authorities would for the most part only divulge the amount of "EU and national funding". Under EU rules, each program is allowed to fund a certain percentage of the total cost using EU funding. The level of EU funding is determined, at program level, by the so-called co-financing rate. Each program (e.g., regional competitiveness) is made up of numerous projects. At the project level, technically one project could receive 100 percent EU funding and another none at all, as long as, grouped together, the amount of EU funding at the program level does not exceed the approved co-financing rate.

This meant that we needed to check each EU amount of funding we cited in our stories with the beneficiary company in question.
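
To make the co-financing rule concrete, here is a toy example with invented numbers: two projects in the same program, one funded entirely by the EU and one not at all, still keep the program within an assumed 50 percent co-financing rate.

[source,python]
----
# Invented numbers: project-level EU funding can range from 0% to 100%
# as long as the program-level share stays within the approved rate.
projects = [
    {"name": "Project A", "total_cost": 1_000_000, "eu_funding": 1_000_000},  # 100% EU
    {"name": "Project B", "total_cost": 1_000_000, "eu_funding": 0},          # 0% EU
]
CO_FINANCING_RATE = 0.50  # assumed approved rate for the program

program_cost = sum(p["total_cost"] for p in projects)
program_eu = sum(p["eu_funding"] for p in projects)

print(f"Program-level EU share: {program_eu / program_cost:.0%}")   # 50%
assert program_eu / program_cost <= CO_FINANCING_RATE                # within the rate
----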

Cynthia O’Murchu, Financial Times

The Eurozone Meltdown

So we’re covering the Eurozone meltdown. Every bit of it. The drama as governments clash and life savings are lost; the reaction from world leaders, austerity measures, and protests against austerity measures. Every day in the Wall Street Journal, there are charts on job losses, declining GDP, plunging world markets. It is incremental. It is numbing.

The Page One editors call a meeting to discuss ideas for year-end coverage and as we leave the meeting, I find myself wondering: what must it be like to be living through this?

Is this like 2008, when I was laid off and dark news was incessant? We talked about jobs and work and money every night at dinner, nearly forgetting how it might upset my daughter. And weekends, they were the worst. I tried to deny the fear that seemed to have a permanent grip at the back of my neck and the anxiety tightening my rib cage. Is this what it is like right now to be a family in Greece? In Spain?

I turned back and followed Mike Allen, the Page One editor, into his office and pitched the idea of telling the crisis through families in the Eurozone by looking first at the data, finding demographic profiles to understand what made up a family, and then surfacing that along with pictures, interviews, and audio of the generations. We’d use beautiful portraiture, the voices—and the data.

Back at my desk, I wrote a précis and drew a logo.

03 ZZ 01
Figure 03. The Eurozone Meltdown: précis (Wall Street Journal)

For the next three weeks I chased numbers: metrics on marriage, mortality, family size, and health spending. I read up on living arrangements and divorce rates, looked at surveys on well-being and savings rates. I browsed national statistics divisions, called the UN population bureau, the IMF, Eurostat, and the OECD until I found an economist who had spent his career tracking families. He led me to a scholar on family composition. She pointed me to white papers on my topic.

With my editor, Sam Enriquez, we narrowed down the countries. We gathered a team to discuss the visual approach and which reporters could deliver words, audio and story. Matt Craig, the Page One photo editor, set to work finding the shooters. Matt Murray, the Deputy Managing Editor for world coverage, sent a memo to the bureau chiefs requesting help from the reporters. (This was crucial: sign-off from the top.)

But first the data. Mornings I’d export data into spreadsheets and make charts to see trends: savings shrinking, pensions disappearing, mothers returning to work, health spending, along with government debt and unemployment. Afternoons I’d look at those data in clusters, putting the countries against each other to find stories.

I did this for a week before I got lost in the weeds and started to doubt myself. Maybe this was the wrong approach. Maybe it wasn’t about countries, but it was about fathers and mothers, and children and grandparents. The data grew.

And shrank. Sometimes I spent hours gathering information only to find out that it told me, well, nothing. That I had dug up the entirely wrong set of numbers. Sometimes the data were just too old.

03 ZZ 04
Figure 04. Judging the usefulness of a dataset can be a very time-consuming task (Sarah Slobin)

And then the data grew again as I realized I still had questions, and I didn’t understand the families.

I needed to see it, to shape it. So I made a quick series of graphics in Illustrator, and began to arrange and edit them.

As the charts emerged, so did a cohesive picture of the families.

03 ZZ 06
Figure 05. Graphic visualization: making sense of trends and patterns hidden in the datasets (Sarah Slobin)
03 ZZ 07
Figure 06. Numbers are people: the value of data lies in the individual stories they represent (Wall Street Journal)

We launched. I called each reporter. I sent them the charts, the broad pitch and an open invitation to find stories that they felt were meaningful, that would bring the crisis closer to our readers. We needed a small family in Amsterdam, and larger ones in Spain and Italy. We wanted to hear from multiple generations to see how personal history shaped responses.

From here on in, I would be up early to check my email to be mindful of the time-zone gap. The reporters came back with lovely subjects, summaries, and surprises that I hadn’t anticipated.

For photography, we knew we wanted portraits of the generations. Matt’s vision was to have his photographers follow each family member through a day in their lives. He chose visual journalists who had covered the world, covered news and even covered war. Matt wanted each shoot to end at the dinner table. Sam suggested we include the menus.

From here it was a question of waiting to see what story the photos told. Waiting to see what the families said. We designed the look of the interactive. I stole a palette from a Tintin novel, and we worked through the interaction. And when it was all together and we had storyboards, we added back in some (not much, but some) of the original charts. Just enough to punctuate each story, just enough to harden the themes. The data became a pause in the story, a way to switch gears.

03 ZZ 09
Figure 07. Life in the Euro Zone (Wall Street Journal)

In the end, the data were the people; they were the photographs and the stories. They were what was framing each narrative and driving the tension between the countries.

By the time we published, right before the New Year as we were all contemplating what was on the horizon, I knew all the family members by name. I still wonder how they are now. And if this doesn’t seem like a data project, that’s fine by me. Because those moments documented in Life in the Eurozone, these stories of sitting down for a meal and talking about work and life with your family, were something we were able to share with our readers. Understanding the data is what made it possible.

Sarah Slobin, Wall Street Journal

Covering the Public Purse with OpenSpending.org

In 2007, Jonathan came to the Open Knowledge Foundation with a one-page proposal for a project called Where Does My Money Go?, which aimed to make it easier for UK citizens to understand how public funds are spent. This was intended to be a proof of concept for a bigger project to visually represent public information, based on the pioneering work of Otto and Marie Neurath’s Isotype Institute in the 1940s.

03 PP 02
Figure 08. Where Does My Money Go? (Open Knowledge Foundation)

The Where Does My Money Go? project enabled users to explore public data from a wide variety of sources using intuitive open source tools. We won an award to help to develop a prototype of the project, and later received funding from Channel 4’s 4IP to turn this into a fully fledged web application. Information design guru David McCandless (from Information is Beautiful) created several different views of the data that helped people relate to the big numbers—​including the "Country and Regional Analysis," which shows how money is disbursed in different parts of the country, and "Daily Bread", which shows citizens a breakdown of their tax contributions per day in pounds and pence.

03 PP 01
Figure 09. The Where Does My Money Go? Daily Bread tax calculator (Open Knowledge Foundation)

Around that time, the holy grail for the project was the cunningly acronymed Combined Online Information System (or COINS) data, which was the most comprehensive and detailed database of UK government finance available. Working with Lisa Evans (before she joined the Guardian Datablog team), Julian Todd and Francis Irving (now of Scraperwiki fame), Martin Rosenbaum (BBC), and others, we filed numerous requests for the data—​many of them unsuccessful (the saga is partially documented by Lisa in the sidebar Using FOI to Understand Spending).

When the data was finally released in mid-2010, it was widely considered a coup for transparency advocates. We were given advance access to the data to load it into our web application, and we received significant attention from the press when this fact was made public. On the day of the release, we had dozens of journalists showing up on our IRC channel to discuss and ask about the release, as well as to enquire about how to open and explore it (the files were tens of gigabytes in size). While some pundits claimed the massive release was so complicated it was effectively obscurity through transparency, lots of brave journalists got stuck into the data to give their readers an unprecedented picture of how public funds are spent. The Guardian live-blogged the release, and numerous other media outlets covered it and offered analyses of findings from the data.

It wasn’t long before we started to get requests and enquiries about running similar projects in other countries around the world. Shortly after launching OffenerHaushalt—a version of the project for the German state budget created by Friedrich Lindenberg—​we launched OpenSpending, an international version of the project, which aimed to help users map public spending from around the world a bit like OpenStreetMap helped them to map geographical features. We implemented new designs with help from the talented Gregor Aisch, partially based on David McCandless’s original designs.

03 PP 03
Figure 10. OffenerHaushalt, the German version of Where Does My Money Go? (Open Knowledge Foundation)

With the OpenSpending project, we have worked extensively with journalists to acquire, represent, interpret, and present spending data to the public. OpenSpending is first and foremost an enormous, searchable database of public spending—both high-level budget information and transaction-level actual expenditure. On top of this sits a series of out-of-the-box visualizations, such as treemaps and bubbletrees. Anyone can load in their local council data and produce visualizations from it.

While initially we thought there would be a greater demand for some of our more sophisticated visualizations, after speaking to news organizations we realized that there were more basic needs that needed to be satisfied first, such as the ability to embed dynamic tables of data in their blog posts. Keen to encourage news organizations to give the public access to the data alongside their stories, we built a widget for this too.

Our first big release was around the time of the first International Journalism Festival in Perugia. A group of developers, journalists and civil servants collaborated to load Italian data into the OpenSpending platform, which gave a rich view of how spending was broken down amongst central, regional, and local administrations. It was covered in Il Fatto Quotidiano, Il Post, La Stampa, Repubblica, and Wired Italia, as well as in the Guardian.

03 PP 04
Figure 11. The Italian version of Where Does My Money Go? (La Stampa)

In 2011 we worked with Publish What You Fund and the Overseas Development Institute to map aid funding to Uganda from 2003 to 2006. This was new because for the first time you could see aid funding flows alongside the national budget—enabling you to see to what extent the priorities of donors aligned with the priorities of governments. There were some interesting conclusions—for example, both counter-HIV programs and family planning emerged as almost entirely funded by external donors. This was covered in the Guardian.

We’ve also been working with NGOs and advocacy groups to cross-reference spending data with other sources of information. For example, Privacy International approached us with a big list of surveillance technology companies and a list of agencies attending a well-known international surveillance trade show, known colloquially as the ``wiretappers' ball''. By systematically cross-referencing company names with spending datasets, it was possible to identify which companies had government contracts—which could then be followed up with FOI requests. This was covered by the Guardian.
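
A minimal sketch of that cross-referencing step (not Privacy International's actual method; file and column names are hypothetical): normalize company names on both sides, then look the surveillance vendors up in a spending dataset.

[source,python]
----
import csv
import re

def normalize(name):
    """Lowercase, drop punctuation and common corporate suffixes."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    name = re.sub(r"\b(ltd|limited|inc|llc|plc|gmbh)\b", " ", name)
    return " ".join(name.split())

# Hypothetical input files and columns.
with open("surveillance_companies.csv", newline="") as f:
    targets = {normalize(r["company"]) for r in csv.DictReader(f)}

with open("government_spending.csv", newline="") as f:
    for row in csv.DictReader(f):
        if normalize(row["supplier"]) in targets:
            print(row["supplier"], row["amount"], row["department"])
----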

We’re currently working to increase fiscal literacy among journalists and the public as part of a project called Spending Stories, which lets users link public spending data to public spending related stories to see the numbers behind the news, and the news around the numbers.

Through our work in this area, we’ve learned that:

  • Journalists are often not used to working with raw data, and many don’t consider it a necessary foundation for their reporting. Sourcing stories from raw information is still a relatively new idea.

  • Analyzing and understanding data is a time-intensive process, even with the necessary skills. Fitting this into a short-lived news cycle is hard, so data journalism is often used in longer-term, investigative projects.

  • Data released by governments is often incomplete or outdated. Very often, public databases cannot be used for investigative purposes without the addition of more specific pieces of information requested through FOI.

  • Advocacy groups, scholars, and researchers often have more time and resources to conduct extensive data-driven research than journalists do. It can be very fruitful to team up with them.

Lucy Chambers and Jonathan Gray, Open Knowledge Foundation

Finnish Parliamentary Elections and Campaign Funding

In recent months there have been ongoing trials related to the election campaign funding of the Finnish general election of 2007.

After the elections in 2007, the press found out that the laws on publicizing campaign funding had no effect on politicians. Basically, campaign funding had been used to buy favors from politicians, who then failed to declare their funding as mandated by Finnish law.

After these incidents, the laws became stricter. After the general election in March 2011, Helsingin Sanomat decided to carefully explore all the available data on campaign funding. The new law stipulates that election funding must be declared, and only donations below 1,500 euros may be anonymous.

1. Find data and developers

Helsingin Sanomat has organized HS Open hackathons since March 2011. We invite Finnish coders, journalists, and graphic designers to the basement of our building. Participants are divided into groups of three, and they are encouraged to develop applications and visualizations. We have had about 60 participants in each of our three events so far. We decided that campaign funding data should be the focus of HS Open #2, May 2011.

The National Audit Office of Finland is the authority that keeps records of campaign funding. That was the easy part. Chief Information Officer Jaakko Hamunen built a website that provides real-time access to their campaign funding database. The Audit Office built this just two months after our request.

The Vaalirahoitus.fi website will provide the public and the press with information on campaign funding for every election from now on.

03 DD
Figure 12. Election financing (Helsingin Sanomat)

2. Brainstorm for ideas

The participants of HS Open 2 came up with twenty different prototypes about what to do with the data. You can find all the prototypes on our website (text in Finnish).

A bioinformatics researcher called Janne Peltola noted that campaign funding data looked like the gene data they research, in terms of containing many interdependencies. In bioinformatics there is an open source tool called Cytoscape that is used to map these interdependencies. So we ran the data through Cytoscape, and got a very interesting prototype.

3. Implement the idea on paper and on the Web

The law on campaign funding states that elected members of parliament must declare their funding two months after the elections. In practice this meant that we got the real data in mid-June. In HS Open, we had data only from MPs who had filed before the deadline.

There was also a problem with the data format. The National Audit Office provided the data as two CSV files. One contained the total budget of campaigns, the other listed all the donors. We had to combine these two, creating a file that contained three columns: donor, receiver, and amount. If the politicians had used their own money, in our data format it looked like Politician A donated X euros to Politician A. Counter-intuitive perhaps, but it worked for Cytoscape.
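
A short sketch of that reshaping step, with hypothetical file and column names: the two CSV files are merged into a single donor, receiver, amount edge list, with self-funding recorded as a politician donating to themselves.

[source,python]
----
import csv

edges = []  # (donor, receiver, amount)

with open("donations.csv", newline="") as f:          # hypothetical donor file
    for row in csv.DictReader(f):
        edges.append((row["donor"], row["candidate"], row["amount"]))

with open("own_contributions.csv", newline="") as f:  # hypothetical budget file
    for row in csv.DictReader(f):
        # Self-funding becomes "Politician A donated X euros to Politician A"
        edges.append((row["candidate"], row["candidate"], row["own_money"]))

with open("network_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["donor", "receiver", "amount"])
    writer.writerows(edges)
----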

When the data was cleaned and reformatted, we just ran it through Cytoscape. Then our graphics department made a full-page graphic out of it.

Finally we created a beautiful visualization on our website. This was not a network analysis graphic. We wanted to give people an easy way to explore how much campaign funding there is and who gives it. The first view shows the distribution of funding between MPs. When you click on one MP, you get the breakdown of his or her funding. You can also vote on whether this particular donor is good or not. The visualization was made by Juha Rouvinen and Jukka Kokko, from an ad agency called Satumaa.

The web version of campaign funding visualization uses the same data as the network analysis.

4. Publish the data

Of course, the National Audit Office already publishes the data, so there was no need to republish. But, as we had cleaned up the data and given it a better structure, we decided to publish it. We give out our data with a Creative Commons Attribution license. Subsequently several independent developers have made visualizations of the data, some of which we have published.

The tools we used for the project were Excel and Google Refine for data cleaning and analysis; Cytoscape for network analysis; and Illustrator and Flash for the visualizations. The Flash should have been HTML5, but we ran out of time.

What did we learn? Perhaps the most important lesson was that data structures can be very hard. If the original data is not in a suitable format, recalculating and converting it will take a lot of time.

Electoral Hack in Realtime (Hacks/Hackers Buenos Aires)

03 FF
Figure 13. Elections 2011 (Hacks/Hackers Buenos Aires)

Electoral Hack is a political analysis project that visualizes data from the provisional ballot results of the October 2011 elections in Argentina. The system also features information from previous elections and socio-demographic statistics from across the country. The project was updated in real time with information from the provisional ballot count of the national elections of 2011 in Argentina, and gave summaries of election results. It was an initiative of Hacks/Hackers Buenos Aires with the political analyst Andy Tow, and was a collaborative effort of journalists, developers, designers, analysts, political scientists, and others from the local chapter of Hacks/Hackers.

What Data Did We Use?

All data came from official sources: the National Electoral Bureau provided access to data of the provisional count by Indra; the Department of the Interior provided information about elected posts and candidates from different political parties; a university project provided biographical information and the policy platforms of each presidential ticket; while socio-demographic information came from the 2001 National Census of Population and Housing (INDEC), the 2010 Census (INDEC), and the Ministry of Health.

How Was It Developed?

The application was generated during the 2011 Election Hackathon by Hacks/Hackers Buenos Aires the day before the election on October 23, 2011. The hackathon saw the participation of 30 volunteers with a variety of different backgrounds. Electoral Hack was developed as an open platform that could be improved over time. For the technology, we used Google Fusion Tables, Google Maps, and vector graphics libraries.

We worked on the construction of polygons for displaying geographic mapping and electoral demographics. Combining polygons in GIS software and geometries from public tables in Google Fusion Tables, we generated tables with keys corresponding to the electoral database of the Ministry of Interior, Indra, and sociodemographic data from INDEC. From this, we created visualizations in Google Maps.

Using the Google Maps API, we published several thematic maps representing the spatial distribution of voting with different tones of color, where the intensity of color represented the percentage of votes for the various presidential tickets in different administrative departments and polling stations, with particular emphasis on major urban centers: the City of Buenos Aires, the 24 districts of Greater Buenos Aires, the City of Cordoba, and Rosario.
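
The sketch below illustrates only the color-scale idea behind such thematic maps, not the actual Fusion Tables setup; the department names and vote shares are invented.

[source,python]
----
def vote_share_to_color(share, light=(255, 237, 160), dark=(189, 0, 38)):
    """Linearly interpolate between a light and a dark RGB tone by vote share (0-1)."""
    return tuple(round(l + (d - l) * share) for l, d in zip(light, dark))

# Invented departments and vote shares, for illustration only.
departments = {"Department A": 0.62, "Department B": 0.41, "Department C": 0.33}

for name, share in departments.items():
    r, g, b = vote_share_to_color(share)
    print(f"{name}: {share:.0%} -> #{r:02x}{g:02x}{b:02x}")
----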

We used the same technique to generate thematic maps of previous elections, namely the presidential primaries of 2011 and the election of 2007, as well as of the distribution of sociodemographic data, such as for poverty, child mortality, and living conditions, allowing for analysis and comparison. The project also showed the spatial distribution of the differences in percentage of votes obtained by each ticket in the general election of October compared to the August primary election.

Later, using partial data from the provisional ballot counts, we created an animated map depicting the anatomy of the count, in which the progress of the vote count is shown from the closing of the local polls until the following morning.

Pros

  • We set out to find and represent data and we were able to do that. We had UNICEF’s database of child sociodemographics at hand, as well as the database of candidates created by the yoquierosaber.org group of Torcuato Di Tella University. During the hackathon we gathered a large volume of additional data that we did not end up including.

  • It was clear that the journalistic and programming work was enriched by scholarship. Without the contributions of Andy Tow and Hilario Moreno Campos, the project would have been impossible to achieve.

Cons

  • The sociodemographic data we could use was not up to date (most was from the 2001 census), and it was not very granular. For example, it did not include detail about local average GDP, main economic activity, education level, number of schools, doctors per capita, and lots of other things that it would have been great to have.

  • Originally the system was intended as a tool that could be used to combine and display any arbitrary data, so that journalists could easily display data that interested them on the Web. But we had to leave this for another time.

  • As the project was built by volunteers in a short time frame, it was impossible to do everything that we wanted to do. Nevertheless, we made a lot of progress in the right direction.

  • For the same reason, all the collaborative work of 30 people ended up resting on a single programmer when the data offered by the government began to appear, and we ran into some problems importing data in real time. These were solved within hours.

Implications

The Electoral Hack platform had a big impact in the media, with television, radio, print and online coverage. Maps from the project were used by several media platforms during the elections and in subsequent days. As the days went by, the maps and visualizations were updated, increasing traffic even more. On Election Day, the site created that very day received about 20,000 unique visitors, and its maps were reproduced on the cover page of the newspaper Página/12 for two consecutive days, as well as in articles in La Nación. Some maps appeared in the print edition of the newspaper Clarín. It was the first time that an interactive display of real-time maps had been used in the history of Argentine journalism. In the central maps one could clearly see the overwhelming victory of Cristina Fernandez de Kirchner, with 54 percent of the vote, broken up by color saturation. It also served to help users understand specific cases where local candidates had landslide victories in the provinces.

Mariano Blejman, Mariana Berruezo, Sergio Sorín, Andy Tow, and Martín Sarsale from Hacks/Hackers Buenos Aires

Data in the News: WikiLeaks

It began with one of the investigative reporting team asking, ``You’re good with spreadsheets, aren’t you?'' And this was one hell of a spreadsheet: 92,201 rows of data, each one containing a detailed breakdown of a military event in Afghanistan. This was the WikiLeaks war logs. Part one, that is. There were to be two more episodes to follow: Iraq and the cables. The official term was SIGACTS: the US military Significant Actions Database.

The Afghanistan war logs—shared with The New York Times and Der Spiegel—were data journalism in action. What we wanted to do was enable our team of specialist reporters to get great human stories from the information—and we wanted to analyze it to get the big picture, to show how the war was really going.

We decided quite early on that we would not publish the full database. WikiLeaks was already going to do that, and we wanted to make sure that we didn’t reveal the names of informants or unnecessarily endanger NATO troops. At the same time, we needed to make the data easier to use for our team of investigative reporters led by David Leigh and Nick Davies (who had negotiated releasing the data with Julian Assange). We also wanted to make key information simpler to access, and as clear and open as we could make it.

The data came to us as a huge Excel file: 92,201 rows of data, some of them empty or poorly formatted. It didn’t help reporters trying to trawl through the data for stories and was too big to run meaningful reports on.

Our team built a simple internal database using SQL. Reporters could now search stories for key words or events. Suddenly the dataset became accessible and generating stories became easier.

The data was well structured: each event had the following key data: time, date, a description, casualty figures, and—​crucially—​detailed latitude and longitude.
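
A minimal sketch of such an internal database, not the Guardian's actual system: the events are loaded into SQLite with fields matching the structure described above, and reporters can search the description text by keyword.

[source,python]
----
import sqlite3

conn = sqlite3.connect("warlogs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS events
                (event_date TEXT, category TEXT, description TEXT,
                 casualties INTEGER, latitude REAL, longitude REAL)""")

def search(keyword):
    """Return events whose description mentions the keyword."""
    return conn.execute(
        "SELECT event_date, category, casualties, description FROM events "
        "WHERE description LIKE ? ORDER BY event_date",
        (f"%{keyword}%",)).fetchall()

for event_date, category, casualties, description in search("ambush"):
    print(event_date, category, casualties, description[:80])
----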

03 GG
Figure 14. The WikiLeaks war logs (the Guardian)

We also started filtering the data to help us tell one of the key stories of the war: the rise in IED (improvised explosive device) attacks, homemade roadside bombs which are unpredictable and difficult to fight. This dataset was still massive, but easier to manage. There were around 7,500 IED explosions or ambushes (an ambush is where the attack is combined with, for example, small arms fire or rocket grenades) between 2004 and 2009. There were another 8,000 IEDs which were found and cleared. We wanted to see how they changed over time—​and how they compared. This data allowed us to see that the south, where British and Canadian troops were based then, was the worst-hit area—​which backed up what our reporters who had covered the war knew.
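
Reusing the hypothetical events table from the previous sketch (and assuming a category column, since the real SIGACTS categories differ), the year-by-year comparison is a simple aggregate query:

[source,python]
----
import sqlite3

conn = sqlite3.connect("warlogs.db")
for year, count in conn.execute(
        "SELECT substr(event_date, 1, 4) AS year, COUNT(*) "
        "FROM events WHERE category LIKE '%IED%' "
        "GROUP BY year ORDER BY year"):
    print(year, count)  # e.g. how explosions changed from 2004 to 2009
----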

The Iraq war logs release in October 2010 dumped another 391,000 records of the Iraq war into the public arena.

This was in a different league to the Afghanistan leak; there’s a good case for saying this made the war the most documented in history. Every minor detail was now there for us to analyze and break down. But one factor stands out: the sheer volume of deaths, most of them civilians.

As with Afghanistan, the Guardian decided not to republish the entire database, largely because we couldn’t be sure the summary field didn’t contain confidential details of informants and so on.

But we did allow our users to download a spreadsheet containing the records of every incident where somebody died, nearly 60,000 in all. We removed the summary field so it was just the basic data: the military heading, numbers of deaths, and the geographic breakdown.

We also took all these incidents where someone had died and put them on a map using Google Fusion Tables. It was not perfect, but a start in trying to map the patterns of destruction that had ravaged Iraq.

December 2010 saw the release of the cables. This was in another league altogether, a huge dataset of official documents: 251,287 dispatches, from more than 250 worldwide US embassies and consulates. It’s a unique picture of US diplomatic language—​including over 50,000 documents covering the current Obama administration. But what did the data include?

The cables themselves came via the huge Secret Internet Protocol Router Network, or SIPRNet. SIPRNet is the worldwide US military Internet system, kept separate from the ordinary civilian Internet and run by the Department of Defense in Washington. Since the attacks of September 2001, there had been a move in the US to link up archives of government information, in the hope that key intelligence no longer gets trapped in information silos or "stovepipes." An increasing number of US embassies have become linked to SIPRNet over the past decade, so that military and diplomatic information can be shared. By 2002, 125 embassies were on SIPRNet: by 2005, the number had risen to 180, and by now the vast majority of US missions worldwide are linked to the system—​which is why the bulk of these cables are from 2008 and 2009. As David Leigh wrote:

An embassy dispatch marked SIPDIS is automatically downloaded onto its embassy classified website. From there, it can be accessed not only by anyone in the state department, but also by anyone in the US military who has a security clearance up to the "Secret" level, a password, and a computer connected to SIPRNet.

…which astonishingly covers over 3 million people. There are several layers of data in here; all the way up to SECRET NOFORN, which means that they are designed never to be shown to non-US citizens. Instead, they are supposed to be read by officials in Washington up to the level of Secretary of State Hillary Clinton. The cables are normally drafted by the local ambassador or subordinates. The ``Top Secret'' and above foreign intelligence documents cannot be accessed from SIPRNet.

Unlike the previous releases, this was predominantly text, not quantified data in identical fields. This is what was included:

A source

The embassy or body which sent it.

A list of recipients

Normally cables were sent to a number of other embassies and bodies.

A subject field

A summary of the cable.

Tags

Each cable was tagged with a number of keyword abbreviations.

Body text

The cable itself. We opted not to publish these in full for obvious security reasons.

One interesting nuance of this story is how the cables have almost created leaks on demand. They led the news for weeks upon being published; now, whenever a story comes up about some corrupt regime or international scandal, access to the cables gives us access to new stories.

Analysis of the cables is an enormous task which may never be entirely finished.

This is an edited version of a chapter first published in Facts are Sacred: The Power of Data by Simon Rogers, the Guardian (published on Kindle)

Mapa76 Hackathon

We opened the Buenos Aires chapter of Hacks/Hackers in April 2011. We hosted two initial meetups to publicize the idea of greater collaboration between journalists and software developers, with between 120 and 150 people at each event. For a third meeting we had a 30-hour hackathon with eight people at a digital journalism conference in the city of Rosario, 300 kilometers from Buenos Aires.

A recurring theme in these meetings was the desire to scrape large volumes of data from the Web, and then to represent it visually. To help with this, a project called Mapa76.info was born, which helps users to extract data and then to display it using maps and timelines. Not an easy task.

03 MM
Figure 15. Mapa76 (Hacks/Hackers Buenos Aires)

Why Mapa76? On March 24, 1976 there was a coup in Argentina, which lasted until 1983. In that period, there were an estimated 30,000 disappeared people, thousands of deaths, and 500 children born in captivity appropriated by the military dictatorship. Over 30 years later, the number of people in Argentina convicted of crimes against humanity committed during the dictatorship stood at 262 (as of September 2011). Currently there are 14 ongoing trials and 7 with definite starting dates. There are 802 people in various open court cases.

These prosecutions generate large volumes of data that are difficult for researchers, journalists, human rights organizations, judges, prosecutors, and others to process. Data is produced in a distributed manner and investigators often don’t take advantage of software tools to assist them with interpreting it. Ultimately this means that facts are often overlooked and hypotheses are often limited. Mapa76 is an investigative tool providing open access to this information for journalistic, legal, juridical, and historical purposes.

To prepare for the hackathon, we created a platform which developers and journalists could use to collaborate on the day of the event. Martin Sarsale developed some basic algorithms to extract structured data from simple text documents. Some libraries from the DocumentCloud.org project were also used, but not many. The platform would automatically analyze and extract names, dates, and places from the texts—and would enable users to explore key facts about different cases (e.g., date of birth, place of arrest, alleged place of disappearance, and so on).
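
A rough sketch in the spirit of that extraction step, not Mapa76's actual code: regular expressions pull candidate dates and capitalized personal or place names out of plain text. Real name disambiguation is considerably harder.

[source,python]
----
import re

DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})\b")
NAME_RE = re.compile(r"\b([A-ZÁÉÍÓÚÑ][a-záéíóúñ]+(?:\s+[A-ZÁÉÍÓÚÑ][a-záéíóúñ]+)+)\b")

def extract_facts(text):
    """Return candidate dates and multi-word capitalized names found in a document."""
    dates = ["/".join(m.groups()) for m in DATE_RE.finditer(text)]
    names = sorted(set(NAME_RE.findall(text)))
    return {"dates": dates, "names": names}

sample = "Juan Pérez fue detenido el 24/03/1976 en Buenos Aires."
print(extract_facts(sample))
# {'dates': ['24/03/1976'], 'names': ['Buenos Aires', 'Juan Pérez']}
----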

Our goal was to provide a platform for the automatic extraction of data on the judgments of the military dictatorship in Argentina. We wanted a way to automatically (or at least semi-automatically) display key data related to cases from 1976-1983 based on written evidence, arguments and judgments. The extracted data (names, places and dates) are collected, stored, and can be analyzed and refined by the researcher, as well as being explored using maps, timelines, and network analysis tools.

The project will allow journalists and investigators, prosecutors and witnesses to follow the story of a person’s life, including the course of their captivity and subsequent disappearance or release. Where information is absent, users can comb through a vast number of documents for information that could be of possible relevance to the case.

For the hackathon, we made a public announcement through Hacks/Hackers Buenos Aires, which then had around 200 members (at the time of writing, there are around 540). We also contacted many human rights associations. The meeting was attended by about forty people, including journalists, advocacy organizations, developers and designers.

During the hackathon, we identified tasks that different types of participants could pursue independently to help things run smoothly. For example, we asked designers to work on an interface that combined maps and timelines, we asked developers to look into ways of extracting structured data and algorithms to disambiguate names, and we asked journalists to look into what happened with specific people, to compare different versions of stories, and to comb through documents to tell stories about particular cases.

Probably the main problem we had after the hackathon was that our project was very ambitious, our short-term objectives were demanding, and it is hard to coordinate a loose-knit network of volunteers. Nearly everyone involved with the project had a busy day job and many also participated in other events and projects. Hacks/Hackers Buenos Aires had 9 meetings in 2011.

The project is currently under active development. There is a core team of four people working with over a dozen collaborators. We have a public mailing list and code repository through which anyone can get involved with the project.

Mariano Blejman, Hacks/Hackers Buenos Aires

The Guardian Datablog’s Coverage of the UK Riots

During the summer of 2011, the UK was hit by a wave of riots. At the time, politicians suggested that these actions were categorically not linked to poverty and those who did the looting were simply criminals. Moreover, the Prime Minister, along with leading conservative politicians, blamed social media for causing the riots, suggesting that incitement had taken place on these platforms and that riots were organized using Facebook, Twitter, and Blackberry Messenger (BBM). There were calls to temporarily shut social media down. Because the government did not launch an inquiry into why the riots happened, the Guardian, in collaboration with the London School of Economics, set up the groundbreaking Reading the Riots project to address these issues.

03 ZZ
Figure 16. The UK Riots: every verified incident (the Guardian)

The newspaper extensively used data journalism to enable the public to better understand who was doing the looting and why. What is more, they also worked with another team of academics, led by Professor Rob Procter at the University of Manchester, to better understand the role of social media, which the Guardian itself had extensively used in its reporting during the riots. The Reading the Riots team was led by Paul Lewis, the Guardian’s Special Projects Editor. During the riots Paul reported on the front line in cities across England (most notably via his Twitter account, @paullewis). This second team worked on 2.6 million riot tweets donated by Twitter. The main aim of this social media work was to see how rumors circulate on Twitter, what function different users/actors have in propagating and spreading information flows, whether the platform was used to incite, and what other forms of organization took place.

In terms of the use of data journalism and data visualizations, it is useful to distinguish between two key periods: the period of the riots themselves and the ways in which data helped tell stories as the riots unfolded; and then a second period of much more intense research with two sets of academic teams working with the Guardian, to collect data, analyze it, and write in-depth reports on the findings. The results from the first phase of the Reading the Riots project were published during a week of extensive coverage in early December 2011. Below are some key examples of how data journalism was used during both periods.

Phase One: The Riots As They Happened

By using simple maps, the Guardian data team showed the locations of confirmed riot spots, and by mashing up deprivation data with where the riots took place, started debunking the main political narrative that there was no link to poverty. Both of these examples used off-the-shelf mapping tools, and the second example combined location data with another dataset to start making other connections and links.
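
A minimal sketch of that kind of mashup, with hypothetical files and columns (an area code and a deprivation score are the sort of fields one might join on): each verified incident is matched to the deprivation index of the area where it happened.

[source,python]
----
import csv

# Hypothetical files: an area-level deprivation index and the verified incidents.
deprivation = {}
with open("deprivation_index.csv", newline="") as f:
    for row in csv.DictReader(f):
        deprivation[row["area_code"]] = float(row["deprivation_score"])

with open("riot_incidents.csv", newline="") as f:
    for row in csv.DictReader(f):
        score = deprivation.get(row["area_code"])  # None if the area is unknown
        print(row["location"], row["area_code"], score)
----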

In relation to the use of social media during the riots (in this case, Twitter), the newspaper created a visualization of riot-related hashtags used during this period, which highlighted that Twitter was mainly used to respond to the riots rather than to organize people to go looting, with #riotcleanup, the spontaneous campaign to clean up the streets after the rioting, showing the most significant spike during the riot period.

Phase Two: Reading the Riots

When the paper reported its findings from months of intensive research and working closely with two academic teams, two visualizations stand out and have been widely discussed. The first one, a short video, shows the results of combining the known places where people rioted with their home addresses, showing a so-called "riot commute." Here the paper worked with transport mapping specialist ITO World to model the most likely route traveled by the rioters as they made their way to various locations to go looting, highlighting different patterns for different cities, with some traveling long distances.

The second one deals with the ways in which rumors spread on Twitter. In discussion with the academic team, seven rumors were agreed on for analysis. The academic team then collected all data related to each rumor and devised a coding schedule that classified each tweet according to four main codes: people simply repeating the rumor (making a claim), rejecting it (making a counter claim), questioning it (query), or simply commenting (comment). All tweets were coded in triplicate and the results were visualized by the Guardian Interactive Team. The Guardian team has written about how they built the visualization.
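
A minimal sketch of the aggregation that sits behind such a visualization (not the Guardian Interactive Team's code; file and column names are hypothetical): count the coded tweets per code in hourly buckets, which is the raw material for a rumor's life-cycle chart.

[source,python]
----
import csv
from collections import Counter

counts = Counter()
with open("coded_tweets.csv", newline="") as f:   # hypothetical file and columns
    for row in csv.DictReader(f):
        hour = row["timestamp"][:13]              # e.g. "2011-08-08T21"
        counts[(hour, row["code"])] += 1          # claim / counter claim / query / comment

for (hour, code), n in sorted(counts.items()):
    print(hour, code, n)
----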

What is so striking about this visualization is that it powerfully shows something that is very difficult to describe: the viral nature of rumors and the way their life cycles play out over time. The role of the mainstream media is evident in some of these rumors (for example, outright debunking them, or indeed confirming them quickly as news), as is the corrective nature of Twitter itself in terms of dealing with such rumors. This visualization not only greatly aided the storytelling, but also gave a real insight into how rumors work on Twitter, which provides useful information for dealing with future events.

What is clear from the last example is the powerful synergy between the newspaper and an academic team capable of an in-depth analysis of 2.6 million riot tweets. Although the academic team built a set of bespoke tools to do their analysis, they are now working to make these widely available to anyone who wishes to use them in due course, providing a workbench for their analysis. Combined with the how-to description provided by the Guardian team, it will provide a useful case study of how such social media analysis and visualization can be used by others to tell such important stories.

Farida Vis, University of Leicester

Illinois School Report Cards

Each year, the Illinois State Board of Education releases school "report cards," data on the demographics and performance of all the public schools in Illinois. It’s a massive dataset—this year’s drop was ~9,500 columns wide. The problem with that much data is choosing what to present. (As with any software project, the hard part is not building the software, but building the right software.)

We worked with the reporters and editor from the education team to choose the interesting data. (There’s a lot of data out there that seems interesting but which a reporter will tell you is actually flawed or misleading.)

We also surveyed and interviewed folks with school-age kids in our newsroom. We did this because of an empathy gap—​nobody on the news apps team has school-age kids. Along the way, we learned much about our users and much about the usability (or lack thereof!) of the previous version of our schools site.

03 EE
Figure 17. 2011 Illinois school report cards (Chicago Tribune)

We aimed to design for a couple of specific users and use cases:

  • Parents with a child in school who want to know how their school measures up

  • Parents who’re trying to sort out where to live, since school quality often has a major impact on that decision.

The first time around, the schools site was about a six-week, two-developer project. Our 2011 update was a four-week, two-developer project. (There were actually three people actively working on the recent project, but none were full-time, so it adds up to about two.)

A key piece of this project was information design. Although we present far less data than is available, it’s still a lot of data, and making it digestible was a challenge. Luckily, we got to borrow someone from our graphics desk—​a designer who specializes in presenting complicated information. He taught us much about chart design and, in general, guided us to a presentation that is readable, but does not underestimate the reader’s ability or desire to understand the numbers.

The site was built in Python and Django. The data is housed in MongoDB—​the schools data is heterogeneous and hierarchical, making it a poor fit for a relational database. (Otherwise we probably would have used PostgreSQL.)
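
As an illustration of why a document store suits this kind of data (the record below is invented, not the Tribune's schema), nested school records can be stored and queried by nested fields with pymongo; this assumes a local MongoDB instance.

[source,python]
----
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["schools"]

# An invented, simplified record: nested fields suit heterogeneous report cards.
db.report_cards.insert_one({
    "school": "Example Elementary",
    "district": "Example District 123",
    "demographics": {"low_income_pct": 85.1, "enrollment": 412},
    "scores": {"reading": {"2010": 61.2, "2011": 63.5}},
})

# Query by a nested field, e.g. schools above 80% low-income enrollment.
for doc in db.report_cards.find({"demographics.low_income_pct": {"$gt": 80}}):
    print(doc["school"], doc["demographics"]["low_income_pct"])
----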

We experimented for the first time with Twitter’s Bootstrap user interface framework on this project, and were happy with the results. The charts are drawn with Flot.

The app is also home to the many stories about school performance that we’ve written. It acts as sort of a portal in that way; when there’s a new school performance story, we put it at the top of the app, alongside lists of schools relevant to the story. (And when a new story hits, readers of www.chicagotribune.com are directed to the app, not the story.)

Early reports are that readers love the schools app. The feedback we’ve received has been largely positive (or at least constructive!), and page views are through the roof. As a bonus, this data will remain interesting for a full year, so although we expect the hits to trail off as the schools stories fade from the homepage, our past experience is that readers have sought out this application year-round.

A few key ideas we took away from this project are:

  • The graphics desk is your friend. They’re good at making complex information digestible.

  • Ask the newsroom for help. This is the second project for which we’ve conducted a newsroom-wide survey and interviews, and it’s a great way to get the opinion of thoughtful people who, like our audience, are diverse in background and generally uncomfortable with computers.

  • Show your work! Much of our feedback has been requests for the data that the application used. We’ve made a lot of the data publicly available via an API, and we will shortly release the stuff that we didn’t think to include initially.

Brian Boyer, Chicago Tribune

Hospital Billing

Investigative reporters at California Watch received tips that a large chain of hospitals in California might be systematically gaming the federal Medicare program that pays for the costs of medical treatments of Americans aged 65 or older. The particular scam that was alleged is called upcoding, which means reporting patients as having more complicated conditions—worth higher reimbursement—than actually existed. But a key source was a union that was fighting with the hospital chain’s management, and the California Watch team knew that independent verification was necessary for the story to have credibility.

Luckily, California’s department of health has public records that give very detailed information about each case treated in all the state’s hospitals. The 128 variables include up to 25 diagnosis codes from the "International Statistical Classification of Diseases and Related Health Problems" manual (commonly known as ICD-9) published by the World Health Organization. While patients aren’t identified by name in the data, other variables tell the age of the patient, how the costs are paid, and which hospital treated them. The reporters realized that with these records, they could see if the hospitals owned by the chain were reporting certain unusual conditions at significantly higher rates than were being seen at other hospitals.

03 AA
Figure 18. Kwashiorkor (California Watch)

The datasets were large: nearly 4 million records per year. The reporters wanted to study six years' worth of records in order to see how patterns changed over time. They ordered the data from the state agency; it arrived on CD-ROMs that were easily copied onto a desktop computer. The reporter doing the actual data analysis used a system called SAS to work with the data. SAS is very powerful (allowing analysis of many millions of records) and is used by many government agencies, including the California health department, but it is expensive—​the same kind of analysis could have been done using any of a variety of other database tools, such as Microsoft Access or the open-source MySQL.

With the data in hand and the programs written to study it, finding suspicious patterns was relatively simple. For example, one allegation was that the chain was reporting various degrees of malnutrition at much higher rates than were seen at other hospitals. Using SAS, the data analyst extracted frequency tables that showed the numbers of malnutrition cases being reported each year by each of California’s more than 300 acute care hospitals. The raw frequency tables then were imported into Microsoft Excel for closer inspection of the patterns for each hospital; Excel’s ability to sort, filter and calculate rates from the raw numbers made seeing the patterns easy.

Particularly striking were reports of a condition called Kwashiorkor, a protein deficiency syndrome that is almost exclusively seen in starving infants in famine-afflicted developing countries. Yet the chain was reporting its hospitals were diagnosing Kwashiorkor among elderly Californians at rates as much as 70 times higher than the state average of all hospitals.
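
As an illustration of the technique, here is a minimal sketch of the same frequency-and-rate comparison in Python with pandas, assuming the discharge records have been exported to CSV. The column names are hypothetical, and this is not the SAS code the reporters actually used.

[source,python]
----
# A sketch of the frequency-and-rate analysis, assuming the state
# discharge records sit in a CSV. Column names (hospital_id, diag_*) are
# hypothetical. ICD-9 code 260 is kwashiorkor.
import pandas as pd

records = pd.read_csv("discharges_2009.csv", dtype=str)

# Flag discharges carrying the kwashiorkor code in any diagnosis field.
diag_cols = [c for c in records.columns if c.startswith("diag")]
records["kwashiorkor"] = records[diag_cols].eq("260").any(axis=1)

# Frequency table: cases and total discharges per hospital.
by_hospital = records.groupby("hospital_id").agg(
    cases=("kwashiorkor", "sum"),
    discharges=("kwashiorkor", "size"),
)
by_hospital["rate_per_1000"] = (
    1000 * by_hospital["cases"] / by_hospital["discharges"]
)

# Compare each hospital's rate to the statewide rate, as the reporters
# did in Excel, to spot outlier reporting.
statewide = 1000 * by_hospital["cases"].sum() / by_hospital["discharges"].sum()
outliers = by_hospital.assign(vs_state=by_hospital["rate_per_1000"] / statewide)
print(outliers.sort_values("vs_state", ascending=False).head(10))
----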

For other stories, the analysis used similar techniques to examine the reported rates of conditions like septicemia, encephalopathy, malignant hypertension, and autonomic nerve disorder. And another analysis looked at allegations that the chain was admitting from its emergency rooms into hospital care unusually high percentages of Medicare patients, whose source of payment for hospital care is more certain than is the case for many other emergency room patients.

To summarize, stories like these become possible when you use data to independently test allegations made by sources who may have their own agendas. These stories are also a good example of the necessity for strong public records laws; the reason the government requires hospitals to report this data is so that these kinds of analyses can be done, whether by government, academics, investigators, or even citizen journalists. The subject of these stories is important because it examines whether millions of dollars of public money are being spent properly.

Steve Doig, Walter Cronkite School of Journalism, Arizona State University

Care Home Crisis

A Financial Times investigation into the private care home industry exposed how some private equity investors turned elderly care into a profit machine and highlighted the deadly human costs of a business model that favored investment returns over good care.

The analysis was timely, because the financial problems of Southern Cross, then the country’s largest care home operator, were coming to a head. The government had for decades promoted a privatization drive in the care sector and continued to tout the private sector for its astute business practices.

Our inquiry began with analyzing data we obtained from the UK regulator in charge of inspecting care homes. The information was public, but it required a lot of persistence to get the data in a form that was usable.

The data included ratings (now defunct) on individual homes' performance and a breakdown of whether they were private, government-owned, or non-profit. The Care Quality Commission (CQC), up to June 2010, rated care homes on quality, from 0 stars (poor) to 3 stars (excellent).

The first step required extensive data cleaning, as the data provided by the Care Quality Commission contained, for example, categorizations that were not uniform. This was done primarily in Excel. We also determined—​through desk and phone research—​whether particular homes were owned by private-equity groups. Before the financial crisis, the care home sector was a magnet for private equity and property investors, but several—​such as Southern Cross—​had begun to face serious financial difficulties. We wanted to establish what effect, if any, private equity ownership had on quality of care.

A relatively straightforward set of Excel calculations enabled us to establish that the non-profit and government-run homes, on average, performed significantly better than the private sector. Some private equity-owned care home groups performed well above average, and others well below average.
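
A minimal sketch of that comparison, assuming the cleaned ratings live in a CSV with hypothetical column names (the FT's actual spreadsheet is not public):

[source,python]
----
# A sketch of the core comparison. Columns are hypothetical:
# home_name, ownership (private / government / non-profit),
# owner_group, stars (0-3).
import pandas as pd

ratings = pd.read_csv("care_home_ratings.csv")

# Average star rating by ownership type: the calculation behind the
# finding that non-profit and government-run homes performed better.
print(ratings.groupby("ownership")["stars"].agg(["mean", "count"]))

# The same comparison by owner group, to see which chains sit above or
# below the overall average.
overall = ratings["stars"].mean()
by_group = ratings.groupby("owner_group")["stars"].mean().sort_values()
print(by_group[by_group < overall].head())   # groups below average
print(by_group[by_group > overall].tail())   # groups above average
----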

Paired with on-the-ground reporting, case studies of neglect, an in-depth look at the failures in regulatory policies, as well as other data on levels of pay, turnover rates, etc., our analysis was able to paint a picture of the true state of elderly care.

Some tips:

  • Make sure you keep notes on how you manipulate the original data.

  • Keep a copy of the original data and never change the original.

  • Check and double-check the data. Do the analysis several times (if need be, from scratch).

  • If you mention particular companies or individuals, give them a right to reply.

Cynthia O’Murchu, Financial Times

The Tell-All Telephone

Most people’s understanding of what can actually be done with the data provided by our mobile phones is theoretical; there have been few real-world examples. That is why Malte Spitz from the German Green party decided to publish his own data. To access the information, he had to file a suit against telecommunications giant Deutsche Telekom. The data, contained in a massive Excel document, was the basis for Zeit Online’s accompanying interactive map. Each of the 35,831 rows of the spreadsheet represents an instance when Spitz’s mobile phone transferred information over a half-year period.

Seen individually, the pieces of data are mostly harmless. But taken together they provide what investigators call a profile: a clear picture of a person’s habits and preferences, and indeed, of her life. This profile reveals when Spitz walked down the street, when he took a train, when he was in a plane. It shows that he mainly works in Berlin and which cities he visited. It shows when he was awake and when he slept.

03 BB
Figure 19. The Tell-All Telephone (Zeit Online)

Deutsche Telekom’s dataset already kept one part of Spitz’s data record private, namely, whom he called and who called him. That kind of information could not only infringe on the privacy of many other people in his life, it would also—​even if the numbers were encrypted—​reveal much too much about Spitz (but government agents in the real world would have access to this information).

We asked Lorenz Matzat and Michael Kreil from OpenDataCity to explore the data and find a solution for the visual presentation. “At first we used tools like Excel and Fusion Tables to understand the data ourselves. Then we started to develop a map interface to allow the audience to interact with the data in a non-linear way,” said Matzat. To illustrate just how much detail about someone’s life can be mined from this stored data, the records were finally augmented with publicly accessible information about Spitz’s whereabouts (Twitter, blog entries, and party information such as public calendar entries from his website). It is the kind of process that any good investigator would likely use to profile a person under observation. Together with Zeit Online’s in-house graphics and R&D team, they finalized a great interface to navigate: by pushing the play button, you’ll set off on a trip through Malte Spitz’s life.
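
Since the dataset is public, it is easy to reproduce the first steps of this kind of exploration. The sketch below is not OpenDataCity's code; it assumes the spreadsheet has been exported to CSV with hypothetical column names.

[source,python]
----
# A sketch of the first step: load the retention records and slice them
# by time, which is what a "play button" interface does frame by frame.
# Assumes hypothetical columns: timestamp, latitude, longitude.
import pandas as pd

pings = pd.read_csv("spitz_retention_data.csv", parse_dates=["timestamp"])
pings = pings.sort_values("timestamp")

# How often did the phone report in per day? Dense days alone already
# sketch the rhythm of a life: travel days, working days, quiet weekends.
per_day = pings.set_index("timestamp").resample("D").size()
print(per_day.describe())

# The positions for one (arbitrarily chosen) day are what a single map
# frame of the interface would draw.
day = pings[pings["timestamp"].dt.strftime("%Y-%m-%d") == "2009-10-01"]
print(day[["timestamp", "latitude", "longitude"]].head())
----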

After a very successful launch of the project in Germany, we noticed very high traffic from outside Germany and decided to create an English version of the app. After earning the German Grimme Online Award, the project was honored with an ONA Award in September 2011, the first time a German news website had received one.

All of the data is available in a Google Docs spreadsheet. Read the story on Zeit Online.

Sascha Venohr, Zeit Online

Which Car Model? MOT Failure Rates

In January 2010, the BBC obtained data about the MOT pass and fail rates for different makes and models of cars. This is the test that assesses whether a car is safe and roadworthy; any car over three years old has to have an MOT test annually.

We obtained the data under freedom of information following an extended battle with VOSA, the Department for Transport agency that oversees the MOT system. VOSA turned down our FOI request for these figures on the grounds that it would breach commercial confidentiality. It argued that it could be 'commercially damaging' to vehicle manufacturers with high failure rates. However, we then appealed to the Information Commissioner, who ruled that disclosure of the information would be in the public interest. VOSA then released the data, 18 months after we asked for it.

We analyzed the figures, focusing on the most popular models and comparing cars of the same age. This showed wide discrepancies. For example, among three-year-old cars, 28% of Renault Méganes failed their MOT, in contrast to only 11% of Toyota Corollas. The figures were reported on television, radio, and online.

03 CC
Figure 20. MOT failure rates released (BBC)

The data was given to us as a 1,200-page PDF document, which we then had to convert into a spreadsheet to do the analysis. As well as reporting our conclusions, we published this Excel spreadsheet (with over 14,000 lines of data) on the BBC News website along with our story. This gave everyone else access to the data in a usable form.
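
As an aside, converting that kind of document is much easier to script today. The sketch below is not how the BBC did it in 2010; it assumes a hypothetical file name and the pdfplumber library.

[source,python]
----
# A sketch of converting a long tabular PDF into a spreadsheet.
# Requires: pip install pdfplumber pandas
import pdfplumber
import pandas as pd

rows = []
with pdfplumber.open("mot_results.pdf") as pdf:
    for page in pdf.pages:
        table = page.extract_table()
        if table:
            # Keep the header from the first page only; later pages
            # repeat it, so skip their first row.
            rows.extend(table if not rows else table[1:])

header, data = rows[0], rows[1:]
df = pd.DataFrame(data, columns=header)
df.to_csv("mot_results.csv", index=False)
print(len(df), "rows extracted")
----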

The result was that others then used this data for their own analyses, which we did not have time to do in the rush to get the story out quickly (and which in some cases would have stretched our technical capabilities at the time). This included examining the failure rates for cars of other ages, comparing the records of manufacturers rather than individual models, and creating searchable databases for looking up the results of individual models. We added links to these sites to our online news story, so our readers could get the benefit of this work.

This illustrated some advantages of releasing the raw data to accompany a data-driven story. There may be exceptions (for example, if you are planning to use the data for other follow-up stories later and want to keep it to yourself in the meantime), but on the whole publishing the data has several important benefits:

  • Your job is to find things out and tell people about them. If you’ve gone to the trouble of obtaining all the data, it’s part of your job to pass it on.

  • Other people may spot points of significant interest which you’ve missed, or simply details that matter to them even if they weren’t important enough to feature in your story.

  • Others can build on your work with further, more detailed analysis of the data, or different techniques for presenting or visualizing the figures, using their own ideas or technical skills that may probe the data productively in alternative ways.

  • It’s part of incorporating accountability and transparency into the journalistic process. Others can understand your methods and check your work if they want to.

Martin Rosenbaum, BBC

Bus Subsidies in Argentina

Since 2002, subsidies for the public bus transportation system in Argentina have been growing exponentially, breaking a new record every year. But in 2011, after winning the elections, Argentina’s new government announced cuts in subsidies for public services starting in December of that year. At the same time, the national government decided to transfer the administration of local bus and metro lines to the City of Buenos Aires government. Since the transfer of the corresponding subsidies had not been clarified, and local funds were insufficient to guarantee the safety of the transportation system, the government of the City of Buenos Aires rejected this decision.

As this was happening, I and my colleagues at La Nación were meeting for the first time to discuss how to start our own data journalism operation. Our Financial Section Editor suggested that the subsidies data published by the Secretaría de Transporte (the Department of Transportation) would be a good challenge to start with, as it was very difficult to make sense of due to the format and the terminology.

The poor condition of the public transportation system affects the lives of more than 5,800,000 passengers every day. Delays, strikes, vehicle breakdowns, and even accidents happen frequently. We thus decided to look into where the subsidies for the public transportation system in Argentina go and to make this data easily accessible to all Argentinian citizens by means of a “Transport Subsidies Explorer,” which is currently in the making.

03 LL 01
Figure 21. The Transport Subsidies Explorer (La Nación)

We started with calculating how much bus companies receive every month from the government. To do this, we looked at the data published on the website of the Department of Transportation, where more than 400 PDFs containing monthly cash payments to more than 1,300 companies since 2006 were published.

03 LL 02
Figure 22. Ranking subsidized transport companies (La Nación)

We teamed up with a senior programmer to develop a scraper in order to automate the regular download and conversion of these PDFs into Excel and Database files. We are using the resulting dataset with more than 285,000 records for our investigations and visualizations, in both print and online. Additionally, we are making this data available in machine-readable format for every Argentinian to reuse and share.
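
The download half of such a scraper can be quite small. The sketch below is not La Nación's actual code; the listing URL is a placeholder.

[source,python]
----
# A sketch of the download step: find every PDF linked from a listing
# page and save it locally. The URL is a placeholder, not the real
# Department of Transportation address.
# Requires: pip install requests beautifulsoup4
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://example.org/subsidios/"  # placeholder
os.makedirs("pdfs", exist_ok=True)

page = requests.get(LISTING_URL, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    url = urljoin(LISTING_URL, link["href"])
    filename = os.path.join("pdfs", url.rsplit("/", 1)[-1])
    if os.path.exists(filename):
        continue  # already fetched on a previous run
    response = requests.get(url, timeout=60)
    with open(filename, "wb") as f:
        f.write(response.content)
    print("saved", filename)
----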

The next step was to identify how much the monthly maintenance of a public transport vehicle cost the government on average. To find this out we went to another government website, that of the Comisión Nacional de Regulación del Transporte (CNRT, or The National Commission for the Regulation of Transport), responsible for regulating transportation in Argentina. On this website, we found a list of bus companies that owned 9,000 vehicles altogether. We developed a normalizer to allow us to reconcile bus company names and cross-reference the two datasets.
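
The sketch below illustrates the kind of reconciliation such a normalizer performs, using Python's standard difflib; the company names are invented, and the real normalizer was tailored to La Nación's own lists.

[source,python]
----
# A sketch of name reconciliation: collapse spelling variants of a
# company name to one canonical form so two datasets can be joined.
import difflib
import re

canonical = ["TRANSPORTES DEL SUR S.A.", "LINEA EXPRESO NORTE S.R.L."]

def normalize(name):
    """Uppercase and strip punctuation and extra spaces before matching."""
    name = re.sub(r"[^\w\s]", " ", name.upper())
    return re.sub(r"\s+", " ", name).strip()

canonical_map = {normalize(c): c for c in canonical}

def reconcile(raw_name, cutoff=0.85):
    """Return the closest canonical company name, or None if no good match."""
    matches = difflib.get_close_matches(
        normalize(raw_name), canonical_map.keys(), n=1, cutoff=cutoff
    )
    return canonical_map[matches[0]] if matches else None

print(reconcile("Transportes del Sur SA"))      # -> TRANSPORTES DEL SUR S.A.
print(reconcile("Expreso Norte Linea S.R.L."))  # may need manual review
----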

To proceed, we needed the registration number of each vehicle. On the CNRT website, we found a list of vehicles per bus line per company with their license plates. Vehicle registration numbers in Argentina are composed of letters and numbers that correspond to the vehicle’s age. For example, my car has the registration number IDF234, and the “I” corresponds to March-April 2011. We reverse engineered the license plates for buses belonging to all listed companies to find the average age of buses per company, in order to show how much money goes to each company and compare the amounts based on the average age of their vehicles.
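
Conceptually, the reverse engineering boils down to a lookup table from the plate's first letter to a registration period, plus an average per company. The mapping below is simplified and partly invented for illustration; the real correspondence is far more granular (down to the month of registration).

[source,python]
----
# A sketch of estimating fleet age from license plates. The
# letter-to-year mapping is a hypothetical approximation for
# illustration only.
FIRST_LETTER_YEAR = {
    "A": 1995, "B": 1997, "C": 1999, "D": 2000, "E": 2003,
    "F": 2005, "G": 2007, "H": 2009, "I": 2011, "J": 2011,
}

def registration_year(plate):
    """Estimate the registration year from the first letter of the plate."""
    return FIRST_LETTER_YEAR.get(plate.strip().upper()[0])

def average_fleet_age(plates, current_year=2011):
    """Average vehicle age for one company's list of license plates."""
    years = [registration_year(p) for p in plates]
    ages = [current_year - y for y in years if y is not None]
    return sum(ages) / len(ages)

# Example: a small, invented fleet for one company.
print(average_fleet_age(["IDF234", "HRT102", "CBA667"]))
----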

In the middle of this process, the content of the government-released PDFs containing the data we needed mysteriously changed, although the URLs and names of the files remained the same. Some PDFs were now missing the vertical "totals," making it impossible to cross-check totals across the entire investigated time period, 2002-2011.

We took this case to a hackathon organized by Hacks/Hackers in Boston, where developer Matt Perry generously created what we call the “PDF Spy.” This application won the “Most Intriguing” category at that event. The PDF Spy points at a web page full of PDFs and checks if the content within the PDFs has changed. “Never be fooled by ‘government transparency’ again,” writes Matt Perry.
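
The idea behind a tool like the PDF Spy is simple to sketch (this is not Matt Perry's code): hash every PDF linked from a page and compare the hashes with those recorded on the previous run.

[source,python]
----
# A minimal sketch of the PDF-change-watching idea: fetch every PDF
# linked from a page, hash its bytes, and flag files whose hash differs
# from the one recorded last time. The URL is a placeholder.
# Requires: pip install requests beautifulsoup4
import hashlib
import json
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

WATCH_URL = "https://example.org/subsidios/"  # placeholder
STATE_FILE = "pdf_hashes.json"

old = {}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        old = json.load(f)

new = {}
soup = BeautifulSoup(requests.get(WATCH_URL, timeout=30).text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    url = urljoin(WATCH_URL, link["href"])
    digest = hashlib.sha256(requests.get(url, timeout=60).content).hexdigest()
    new[url] = digest
    if url in old and old[url] != digest:
        print("CHANGED:", url)  # same name and URL, different content

with open(STATE_FILE, "w") as f:
    json.dump(new, f, indent=2)
----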

03 LL 03
Figure 23. Comparing age of fleets to the amount of money they receive from government (La Nación)

Who Worked on the Project?

A team of seven people (journalists, programmers, and an interactive designer) worked on this investigation for 13 months.

The skills we needed for this project were:

  • Journalists with knowledge of how the subsidies for the public transportation system work and what the risks were; knowledge of the bus companies market.

  • A programmer skilled in Web scraping, parsing and normalizing data, and extracting data from PDFs into Excel spreadsheets.

  • A statistician for conducting the data analysis and the different calculations.

  • A designer for producing the interactive data visualizations.

What Tools Did We Use?

We used Visual Basic for Applications, Excel macros, Tableau Public, and the Junar Open Data Platform, as well as Ruby on Rails, the Google Charts API, and MySQL for the Subsidies Explorer.

The project had a great impact. We’ve had tens of thousands of views and the investigation was featured on the front page of La Nación’s print edition.

The success of this first data journalism project helped us internally to make the case for establishing a data operation that would cover investigative reporting and provide service to the public. This resulted in Data.lanacion.com.ar, a platform where we publish data on various topics of public interest in machine-readable format.

Angélica Peralta Ramos, La Nación (Argentina)

Citizen Data Reporters

Large newsrooms are not the only ones that can work on data-powered stories. The same skills that are useful for data journalists can also help citizen reporters access data about their locality and turn it into stories.

This was the primary motivation of the citizen media project Friends of Januária, in Brazil, which received a grant from Rising Voices, the outreach arm of Global Voices Online, and additional support from the organization Article 19. Between September and October 2011, a group of young residents of a small town in the north of the state of Minas Gerais, one of the poorest regions of Brazil, were trained in basic journalism techniques and budget monitoring. They also learned how to make Freedom of Information requests and access publicly available information from official databases on the Internet.

Januária, a town of approximately 65,000 residents, is also notorious for the failings of its local politicians. In three four-year terms, it had seven different mayors. Almost all of them were removed from office due to wrongdoing in their public administrations, including charges of corruption.

Small towns like Januária often fail to attract attention from the Brazilian media, which tends to focus on larger cities and state capitals. However, residents of small towns have the potential to become important allies in the monitoring of public administration, because they know the daily challenges facing their local communities better than anyone. With the Internet as another important ally, residents can now better access information such as budgets and other local data.

03 XX
Figure 24. The Friends of Januária citizen media project teaches key skills to citizens to turn them into data journalists

After taking part in twelve workshops, some of the new citizen reporters from Januária began to demonstrate how this concept of accessing publicly available data in small towns can be put into practice. For example, Soraia Amorim, a 22-year-old citizen journalist, wrote a story about the number of doctors on the city payroll according to Federal Government data. However, she found that the official number did not correspond with the situation in the town. To write this piece, Soraia had access to health data, which is available online at the website of the SUS (Sistema Único de Saúde or Unique Health System), a federal program that provides free medical assistance to the Brazilian population. According to SUS data, Januária should have 71 doctors in various health specialties.

The number of doctors indicated by SUS data did not match what Soraia knew about doctors in the area: residents were always complaining about the lack of doctors, and some patients had to travel to neighboring towns to see one. Later, she interviewed a woman who had recently been in a motorcycle accident and could not find medical assistance at Januária’s hospital because no doctor was available. She also talked to the town’s Health Secretary, who admitted that there were fewer doctors in town than the number published by SUS.

These initial findings raise many questions about the reasons for this difference between the official information published online and the town’s reality. One of them is that the federal data may be wrong, which would mean that there are significant gaps in health information in Brazil. Another possibility is that Januária is incorrectly reporting the information to SUS. Both of these possibilities should lead to a deeper investigation to find the definitive answer. However, Soraia’s story is an important part of this chain because it highlights an inconsistency and may also encourage others to look more closely at this issue.

"I used to live in the countryside, and finished high school with a lot of difficulty," says Soraia. "When people asked me what I wanted to do with my life, I always told them that I wanted to be a journalist. But I imagined that it was almost impossible due to the world I lived in." After taking part in the Friends of Januária training, Soraia believes that access to data is an important tool to change the reality of her town. "I feel able to help to change my town, my country, the world," she adds.

Another citizen journalist from the project is 20-year-old Alysson Montiériton, who also used data for an article. It was during the project’s first class, when the citizen reporters walked around the city to look for subjects that could become stories, that Alysson decided to write about a broken traffic light at a very important intersection, which had remained broken since the beginning of the year. After learning how to look for data on the Internet, he searched for the number of vehicles registered in the town and the amount of taxes paid by those who own cars. He wrote:

The situation in Januária gets worse because of the high number of vehicles in town. According to IBGE (the most important statistics research institute in Brazil), Januária had 13,771 vehicles (among which 7,979 were motorcycles) in 2010. … The town’s residents believe that the delay in fixing the traffic light is not a result of lack of resources. According to the Treasury Secretary of Minas Gerais state, the town received 470 thousand reais in vehicle taxes in 2010.

By having access to data, Alysson was able to show that Januária has many vehicles (almost one for every five residents), and that a broken traffic light could put a lot of people in danger. Furthermore, he was able to tell his audience the amount of funds received by the town from taxes paid by vehicle owners and, based on that, to question whether this money would not be enough to repair the traffic light and provide safe conditions for drivers and pedestrians.

Although the two stories written by Soraia and Alysson are very simple, they show that data can be used by citizen reporters. You don’t need to be in a large newsroom with a lot of specialists to use data in your articles. After twelve workshops, Soraia and Alysson, neither of whom has a background in journalism, were able to work on data-powered stories and write interesting pieces about their local situation. In addition, their articles show that data itself can be useful even on a small scale. In other words, there is also valuable information in small datasets and tables—​not only in huge databases.

Amanda Rossi, Friends of Januária

The Big Board for Election Results

Election results provide great visual storytelling opportunities for any news organization, but for many years this was an opportunity missed for us. In 2008, we and the graphics desk set out to change that.

We wanted to find a way to display results that told a story and didn’t feel like just a jumble of numbers in a table or on a map. In previous elections, that’s exactly what we did.

Not that there is necessarily anything wrong with a big bag of numbers, or what I call the ``CNN model'' of tables, tables, and more tables. It works because it gives the reader pretty much exactly what she wants to know: who won?

And the danger in messing with something that isn’t fundamentally broken is significant. By doing something radically different and stepping away from what people expect, we could have made things more confusing, not less.

In the end, it was Shan Carter of the graphics desk who came up with the right answer, what we eventually ended up calling the ``big board''. When I saw the mockups for the first time, it was quite literally a head-slap moment.

It was exactly right.

03 ZZ ZZ
Figure 25. The big board for election results (New York Times)

What makes this a great piece of visual journalism? To begin with, the reader’s eye is immediately drawn to the big bar showing the electoral college votes at the top, what we might in the journalism context call the lede. It tells the reader exactly what she wants to know, and it does so quickly, simply and without any visual noise.

Next, the reader is drawn to the five-column grouping of states below, organized by how likely The Times felt a given state was to go for one candidate or the other. There in the middle column is what we might call, in the journalism context, our nut graph, where we explain why Obama won. The interactive makes that crystal clear: Obama took all the states he was expected to and four of the five toss-up states.

To me, this five-column construct is an example of how visual journalism differs from other forms of design. Ideally, a great piece of visual journalism will be both beautiful and informative. But when deciding between story or aesthetics, the journalist must err on the side of story. And while this layout may not be the way a pure designer might choose to present the data, it does tell the story very, very well.

And finally, like any good web interactive, this one invites the reader to go deeper still. Details like state-by-state vote percentages, the number of electoral votes, and percent reporting are deliberately played down so as not to compete with the main points of the story.

All of this makes the ``big board'' a great piece of visual journalism that maps almost perfectly to the tried-and-true inverted pyramid.

Aron Pilhofer, New York Times

Crowdsourcing the Price of Water

Since March 2011, information about the price of tap water throughout France has been gathered through a crowdsourcing experiment. In just 4 months, over 5,000 people fed up with corporate control of the water market took the time to look for their water utility bill, scan it, and upload it to the Prix de l’Eau ("price of water") project. The result is an unprecedented investigation that brought together geeks, NGOs, and traditional media to improve transparency around water projects.

03 WW
Figure 26. The Price of Water (Fondation France Liberté)

The French water utility market consists of more than 10,000 customers (cities buying water to distribute to their taxpayers) and just a handful of utility companies. The balance of power in this oligopoly is distorted in favor of the corporations, which sometimes charge different prices to neighboring towns!

The French NGO France Libertés has been dealing with water issues worldwide for the past 25 years. It now focuses on improving transparency in the French market and empowering citizens and mayors, who negotiate water utility deals. The French government decided to tackle the problem 2 years ago with a nationwide census of water prices and quality. So far, only 3% of the data has been collected. To go faster, France Libertés wanted to get citizens directly involved.

Together with the OWNI team, I designed a crowdsourcing interface where users would scan their water utility bill and enter the price they paid for tap water on prixdeleau.fr. In the past 4 months, 8,500 users have signed up and over 5,000 bills have been uploaded and validated.

While this does not allow for a perfect assessment of the market situation, it showed stakeholders such as national water-overseeing bodies that there was a genuine, grassroots concern about the price of tap water. They were skeptical at first about transparency, but changed their minds over the course of the operation, progressively joining France Libertés in its fight against opacity and corporate malpractice. What can media organizations learn from this?

Partner with NGOs

NGOs need large amounts of data to design policy papers. They will be more willing to pay for a data collection operation than a newspaper executive.

Users can provide raw data

Crowdsourcing works best when users do a data collection or data-refining task.

Ask for the source

We pondered whether to ask users for a scan of the original bill, thinking it would deter some of them (especially as our target audience was older than average). While it might have put off some, it increased the credibility of the data.

Set up a validation mechanism

We designed a point system and a peer-review mechanism to vet user contributions. This proved too convoluted for users, who had little incentive to make repeated visits to the website. It was used by the France Libertés team, however, whose 10 or so employees did feel motivated by the points system.
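
For what it's worth, such a scheme can be sketched in a few lines; the data model and thresholds below are invented and are not the prixdeleau.fr implementation.

[source,python]
----
# A minimal sketch of a points-plus-peer-review scheme. Everything here
# (fields, thresholds, point values) is invented for illustration.
from dataclasses import dataclass, field

REVIEWS_NEEDED = 2   # independent confirmations before a bill counts
POINTS_UPLOAD = 10
POINTS_REVIEW = 2

@dataclass
class Submission:
    user: str
    price_per_m3: float
    approvals: set = field(default_factory=set)

    @property
    def validated(self):
        return len(self.approvals) >= REVIEWS_NEEDED

points = {}

def upload(sub):
    points[sub.user] = points.get(sub.user, 0) + POINTS_UPLOAD

def review(sub, reviewer, approve):
    if approve and reviewer != sub.user:
        sub.approvals.add(reviewer)
    points[reviewer] = points.get(reviewer, 0) + POINTS_REVIEW

bill = Submission(user="alice", price_per_m3=3.92)
upload(bill)
review(bill, "bob", True)
review(bill, "carol", True)
print(bill.validated, points)  # True {'alice': 10, 'bob': 2, 'carol': 2}
----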

Keep it simple

We built an automated mailing mechanism so that users could file a Freedom of Information request regarding water pricing in just a few clicks. Though innovative and well-designed, this feature did not provide substantial ROI (only 100 requests have been sent).

Target your audience

France Libertés partnered with consumers' rights news magazine 60 Millions de Consommateurs, who got their community involved in a big way. It was the perfect match for such an operation.

Choose your key performance indicators carefully

The project gathered only 45,000 visitors in 4 months, equivalent to 15 minutes' worth of traffic on nytimes.com. What’s really important is that 1 in 5 signed up and 1 in 10 took the time to scan and upload his or her utility bill.

Nicolas Kayser-Bril, Journalism++