Want to be a data journalist? Learn these important tools
As the world of journalism changes, many journalists are looking to learn
new skills better suited to an industry that is increasingly
digitised and visual. For many, that probably means learning something
about data journalism and visualisation. But if you’re from a strictly
printed-words background, the change can be daunting.
For a start, there is an ever-growing list of data journalism tools
available, which can be overwhelming. The question becomes: where to start?
There is no single right answer. What you need to do is decide what
you want to achieve, taking into account your particular working circumstances. If you
work in a newsroom and your primary output is a newspaper then you
probably don’t need to learn to make interactive graphics. But if you
work online then you may want to learn some data visualisation tools.
The important thing to understand here is that no matter what kind of
journalism you do, you can benefit from learning some basic data journalism
techniques. And don’t be fooled by the all-too-often portrayal of data
journalists as code hackers. There is a place for great programmers, but
you don’t have to be a programmer to be a data journalist.
What follows is an opinionated list of tools worth taking the time to
explore. Most of these are tools I have come to rely on for a range of
different projects, such as data-driven stories like this. This is not a comprehensive list of tools, just a shortlist that makes up a good toolbox.
Part 1: The data journalism basics
Spreadsheets
Yes, you can’t escape it. Spreadsheets are the core tool for any data
journalism project. Too often journalists fall back on the old pretense
that they’re no good with maths. You don’t need a PhD in mathematics to
use a spreadsheet, but a basic understanding of averages, means and medians,
and the ability to work with a spreadsheet, will boost your reporting
skills. If you’re completely new to spreadsheets there are many tutorials online that will have you up and running in no time.
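To see why the difference between a mean and a median matters for reporting, here is a minimal sketch in Python. The figures are made up purely for illustration. It shows how one unusually large value pulls the mean upwards while the median stays close to what most of the values look like.

```python
from statistics import mean, median

# Hypothetical monthly salaries for a five-person department (invented figures).
salaries = [12_000, 13_500, 14_000, 15_000, 95_000]

average = mean(salaries)   # pulled upwards by the single large salary
middle = median(salaries)  # the middle value once the list is sorted

print(f"Mean:   {average:,.0f}")
print(f"Median: {middle:,.0f}")
```

A spreadsheet will do the same calculation with its built-in average and median functions. The point is simply that "the average salary" can describe almost nobody in the data.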
For most people the first thing they think of when they hear spreadsheets
is Excel, which is a great option but by no means the only one. Google Sheets
is preferred by many spreadsheet newcomers because its simplified set
of options gives them the bits they need without the huge array of
functions in Excel. If you want something free but powerful, the LibreOffice spreadsheet is one of the best options.
Document organisation and collaboration
One of the challenges in doing data journalism is how to manage large numbers of documents without losing your way. Again, Google Drive
is a good starting point. Drive stores all of your documents in the
cloud and makes it easy to share them with other users. Drive
also has built-in version tracking, although it’s not immediately
obvious, which means you can go back to previous versions of a document
if you end up in a data dead end or if you make a mistake.
While Drive has a ton of uses, sometimes you need something a little more focused on the task at hand, which is where Document Cloud
comes in. Document Cloud is also an online document storage service but
it has a number of features that make it a great tool for data
journalism. One of the most useful of these is the ability to upload
PDFs to Document Cloud and have it convert these to text for you. Not
only that, but Document Cloud also indexes documents, so over time it
becomes possible to search across all your stored documents for
particular words or names. Document Cloud includes annotations, can
build timelines from documents and makes it easy to embed portions of
documents into your online stories. Multiple users can also collaborate
on the documents. Your newsroom will need to apply for an account but
the service itself is free for news organisations.
If you’re looking for something a little different to Document Cloud or Google Drive then it’s worth taking a look at Git and GitHub.
Git has largely been the domain of programmers but increasingly
journalists and other writers are turning to Git and GitHub for a range of
reasons. Git is a version control system. You can create files and edit
them while being able to revert to previous versions at any point. You
can also “branch” files, which means creating a second or third version
of your files which you can experiment with. If these experiments work
out you can then “merge” the changes back into your main files. If not,
you can dump the experiment and switch back to your original files. If
you’re keen to try out Git and GitHub then do yourself a favour and
watch Daniel Shiffman’s entertaining Git and GitHub for Poets YouTube series.
Collecting and cleaning data
The other reality about data journalism is that it is a rare occasion when
you get to deal with clean data. Either you’ll be dealing with dozens of
PDF files that need to be converted into something useful and verified,
or you’ll have a dump of messy CSV or Excel files.
If you’re looking to convert PDFs into text or numbers there are dozens of
tools that do good to excellent conversions. The problem is that
PDFs are tricky things and your success in converting them depends largely
on how they were created. PDFs that were created directly from
spreadsheets are typically easier to convert than PDFs that were
made by scanning in a document and then saving it to PDF. More often
than not you’ll deal with this latter type, especially if you’re getting
leaked data.
If
you’ve got a Document Cloud account this should be your first stop
because it has PDF conversion built in. If you’re looking to convert
just a portion of a PDF, or multiple similar portions of a document then
try Tabula.
With a little bit of practice Tabula can be made to do pretty reliable
PDF conversions, even if your data is spread throughout multiple
documents.
There are also a number of online PDF conversion tools that work with varying degrees of success. One of the more popular is CometDocs which does conversion to multiple file formats. Zamzar
offers a similar service. If you’re looking for something a little more
robust then Nitro is worth testing. Nitro offers a free online PDF conversion service
but it is also available as a paid-for desktop application. It’s not
cheap but it’s very capable if you’re dealing with multiple documents on
a regular basis.
Once you’ve got your data you’ll probably need to clean it. If the data is not too
messy or detailed then a spreadsheet is a good starting place. But if
you’ve got a file with hundreds or thousands of rows and multiple
problems then Open Refine
is the tool of choice. Open Refine used to be called Google Refine and
it makes it relatively easy to clean up dirty datasets. One of its
strengths is its ability to work with just portions of your dataset at
a time. For my money, if you’re going to commit to learning anything then
Refine would be my choice. Once you’re over the initial learning curve and
you discover the power of Refine you won’t look back, and there are some good introductory tutorials available for Open Refine.
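To give a feel for the kind of clean-up involved, here is a rough sketch in Python of one common trick: reducing each messy value to a simplified key so that different spellings of the same name end up grouped together. It is loosely modelled on the idea of clustering similar values and is not Open Refine's own code. The function names and the sample values are invented for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a messy value to a simplified key used for grouping."""
    cleaned = value.strip().lower()
    # Drop punctuation so "Cape Town, City of" and "Cape Town City of" match.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so that word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group the original values by their fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


# Hypothetical column of names as they might appear in a messy file.
raw_names = ["City of Cape Town", "city of cape town ", "Cape Town, City of", "Durban"]
clusters = cluster(raw_names)

for key, originals in clusters.items():
    print(key, "->", originals)
```

Each group is only a candidate for merging. You still need to check by eye that the values really refer to the same thing before you standardise them.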
A tool similar to Open Refine is Data Wrangler
which aims to make it as easy as possible to clean up and manipulate
large data sets. I’m not overly familiar with Data Wrangler so my
preference is for Open Refine but I mention it because it looks to be a
promising tool.
Part 2: Analysing and visualising data
Once
you’ve got your data cleaned and sorted you’ll want to see what the
data is telling you. If you’ve read anything about data journalism
you’ve probably heard someone say that you need to interview your data
like you would interview a source. Just because you’ve got a set of data
doesn’t mean you have a story. What you need to do is look at the data
in multiple different ways to see what stands out. Also, when you do
this you might well spot anomalies in the data, a sudden spike or dip in
values. Sometimes these are the stories but often these are the result
of a problem in your data.
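As a rough illustration of what spotting a sudden spike or dip can look like, here is a small Python sketch that flags values sitting far from the median of a series. The function name, the 50% tolerance and the sample numbers are all assumptions made for the example, and it assumes the values are positive counts.

```python
from statistics import median


def flag_anomalies(values, tolerance=0.5):
    """Return (position, value) pairs that differ from the median
    by more than `tolerance`, expressed as a fraction of the median."""
    mid = median(values)
    return [
        (position, value)
        for position, value in enumerate(values)
        if abs(value - mid) > tolerance * mid
    ]


# Hypothetical monthly counts; two entries look suspicious.
monthly_incidents = [102, 98, 110, 105, 1040, 99, 101, 12]
suspects = flag_anomalies(monthly_incidents)

for position, value in suspects:
    print(f"Check entry {position}: {value}")
```

Anything flagged this way is a question to ask of the data, not an answer. The 1040 could be a real surge or a stray extra zero, which is exactly why it needs checking against the source.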
One of the easiest tools for doing a quick visualisation or two is Google
Sheets. Excel or Libre Office could also be used but Google Sheets is
perhaps the easiest of the tools when you’re looking for a quick chart.
It’s worth looking at your data in multiple different views to see what
the patterns look like.
Another way to do initial visualisation is with one of a number of online tools. One of the easiest to use is Datawrapper
which outputs your charts in multiple different ways. It’s a useful way
to switch between different views quickly to get a sense of what works
well. There are a few other services online, such as RAW or Quartz’s Atlas charts which produce good results.
Once
you’ve got an idea of what you want to do then it’s time to start
creating. Most of the programs mentioned above will produce embeddable
versions of the charts you’ve made but they may be limited in adding
other elements like images, text areas or extra labels. For that you’ll
need to look at some other tools.
ogram
are among the best and easiest at doing this. Both make it easy to
combine charts with other visual elements, and if you start with one of
the pre-built templates you’ll have something decent looking in next to
no time.
If you’re looking for something more detailed with more than just a few default chart types then you should probably try out Tableau Public
which is free and extremely powerful. It can build everything from the
simplest charts to complete interlinked dashboards. But be warned, the
initial learning curve can be a little daunting for first-timers. If
you’re serious about data visualisation then take the time to learn more
about Tableau Public. But if you just want the occasional chart to
dress up a story then stick with one of the other options.
Part 3: Maps and mapping
If
you do any kind of data journalism you’re bound to come across
geographic data. Which brings up the issue of mapping tools, some of
which are simple point and click affairs while others border on the
arcane. So you need to think carefully about what you’re trying to
achieve with geographic data.
Too often the first instinct is to plot the points on a map. That is worth
doing in the initial exploratory stages in almost all cases, but often a
map is not the best way to illustrate the point of a story. For
example, a map with 200 points all clustered around a small area
is often not the most informative way to display data, while shaded
contiguous areas that indicate some sort of distribution can be far more
effective.
Having said that, a good map done right can add huge amounts to a data story, so what are the best tools?
Once again Google is a good starting point. Google My Maps
is one of the simplest tools to use. It’s pretty intuitive and
makes it easy to look up geographic points, draw lines and shapes on maps
and even add driving directions. If you just want to illustrate where
or how something happened geographically then there is no better place
to start.
A step up from My Maps is Google Fusion Tables.
This is also part of the Google Drive suite of tools. Fusion Tables in
fact does a lot more than just make maps, though that is one of its
strengths.
Fusion Tables also makes it easy to filter data sets, do some
cleaning up of data, merge multiple datasets into one and a fair amount
more. It’s a little tricky at first but is a good choice when you’re
dealing with larger data sets.
If you’re really getting into this mapping thing and you want a bit more than the previous two options then CartoDB
is your next step. Carto is all about maps and it has the potential to
make excellent maps with multiple layers and different designs so long
as you’re prepared to put in a little initial work. Personally I find
Carto an excellent choice for mocking up a quick sample map or merging
sets of data to include geographic points. It makes it pretty simple to
visualise larger sets of data and make decisions about where you should
go with your project. Carto also makes it easy to export the cleaned and
fixed datasets into many formats, which makes them easy to use in other
applications.
There are literally dozens of other applications for making maps, some of
which are extremely powerful but often also very complex. ArcGIS is a popular tool, as is the open source QGIS
application, but both are aimed at fairly experienced mappers so the
learning curve can be steep. If you’re keen to try your hand at making
your own map styles then Mapbox is great for that. Mapshaper.org
is another of my most commonly used mapping tools because it makes it easy
to get a quick visual representation of the data in your map files and
it also makes it easy to simplify map shapes, something that can be
extremely useful in keeping download times down.
In conclusion
Data journalism is a broad area of work with room for many different
skills. Some might favour the visualisation side of data journalism
while others may prefer the mapping side. No matter what you prefer
doing or what the limitations of your newsroom are, there is always
something more to be learned about data journalism. My recommended
route is to start with the basics above and then gradually move
into some of the more detailed areas.
From experience, the best way to become better at data journalism is
to practise. Find a real-world dataset and see what you can make out of
it. It’s only when you’re working in a real-world scenario that you’ll
really learn the ins and outs of good data analysis.
Programme:
The one that got away
29 dollars
Pasties and a G-string (not: Till the money runs out, as indicated on movie)
Tango
Invitation to the Blues / Eggs and Sausage
Mr. Siegal
Mathilda (Tom Traubert's Blues)
Drums: Ben Riley. Bass: Larry Gales. Tenor Saxophone: Charlie Rouse.
Norway 1966
1. Lulu's Back In Town
2. Blue Monk
3. 'Round Midnight
Denmark '66
1. Lulu's Back In Town
2. Don't Blame Me
3. Epistrophy
Aaron Neville Track Listings
1. Tell It Like It Is 00:00
2. Over You 02:47
3. The Bells 05:10
4. Don't Take Away My Heaven 08:37
5. Warm Your Heart 13:20
6. You Never Can Tell 17:15
7. Close Your Eyes 20:14
8. The Grand Tour 23:30
9. Louisiana 1927 26:53
10. Everybody Plays the Fool 30:02
11. Don't Go, Please Stay 34:32
12. Angola Bound 37:18
13. A Change Is Gonna Come 41:57
14. Betcha by Golly, Wow 45:41
15. Stardust 49:41
16. Use Me 54:21
17. ...To Make Me Who I Am 59:23
18. Don't Know Much 01:05:00
Bukowski
was disgusting, his actual real fiction is awful, he’s been called a
misogynist, overly simplistic, the worst narcissist, (and probably all
of the above are true to an extent) and whenever there’s a collection of
“Greatest American Writers” he’s never included.
And yet… he’s probably the greatest American writer ever. Whether
you’ve read him or not, and most have not, there are 6 things worth
learning from an artist like Bukowski.
I consider “Ham on Rye” by Bukowski probably the greatest American
novel ever written. It’s an autobiographical novel (as are all his
novels except “Pulp” which is so awful it’s unreadable) about his
childhood, being beaten by his parents, avoiding war, and beginning his
life of destitution, hardship, alcoholism, and the beginnings of his
education as a writer.
I’m almost embarrassed to admit he’s an influence. Many people hate
him and I’m much more afraid of being judged than he ever was.
1) Honesty. His first four novels are extremely
autobiographical. He details the suffering he had as a child (putting
his parents in a very bad light but he didn’t care), he details his
experiences with prostitutes, his lack of interest in holding down a
job, his horrible experiences and lack of real respect for the women he
was in relationships with, and on and on. His fiction and poetry
document thoroughly the people he hates, the authors he despises, the
establishment he couldn’t care less about (and he hated the
anti-establishment just as much. One quote about a potential plan the
hippie movement was going to do: “Run a pig for president? What the fuck
is that? It excited them. It bored me.”)
Most fiction writers do what fiction writers do: they make stuff up.
They tell stories that come from their imagination. Bukowski wasn’t
really able to do that. Whenever he attempted fiction (his last novel
being a great example) it fell flat. Even his poetry is non-fiction.
There’s one story he wrote (I forget the name) where he’s sitting in a
bar and he wants to be alone and some random guy starts talking to him:
“It’s horrible about all those girls who were burned” and Bukowski says
(I’m getting the words a little off. Doing this from memory), “I don’t
know.” And the guy and everyone else in the bar starts yelling, “This
guy doesn’t care that all those little girls burned to death”. But
Bukowski was honest, “It was a newspaper headline. If it happened in
front of me I’d probably feel different about it.” And he refused to
back down and stayed in the bar until closing time.
(Matt Dillon playing a young Bukowski in "Factotum")
He had very few boundaries as to how far his honesty could go. He
never wrote about his daughter after she reached a certain age. That’s
about the only boundary I can find. Every other writer has so many
things they can’t write about: family, spouses, exes, children, jobs,
bosses, colleagues, friends. That’s why they make stuff up. Bukowski
didn’t let himself get hampered by that so we see real raw honesty, a
real anthropological survey of being down and out for 60+ years without
anything being held back. No other writer before or since has done that.
For a particular example, see his novel, “Women” which detailed every
sexual nuance of every woman who dared to sleep with him after he
achieved some success. Most of these women were horrified after the book
came out.
I try as hard as possible to remove all boundaries. But it’s a challenge with each post I do.
2) Persistence. Bukowski got two stories
published when he was young (24 and 26 years old) but almost all of his
stories were rejected by publishers. So he quit writing for ten years.
Then, in the mid 1950s he started up again. He submitted tons of poems
and stories everywhere he could. It took him years to get published. It
took him even more years to get really noticed. And it finally took him
about 15 years of writing every day and writing thousands of poems and
stories before he finally started making a living as a writer. He wrote his first novel at the age of 49 and it was financially successful. After 25 years of plugging away at it he was finally a successful writer.
25 years!
Most people give up much earlier, much younger. Both my grandfather
and father wanted to be musicians, for instance. Both gave up in their
20s and 30s and took what they thought was the safer route. (The safer
route being, in my opinion, what ultimately killed both of them).
And this persistence was while he was going through three marriages,
dozens of jobs, and non-stop alcoholism. Some of this is documented
(poorly) in the movie “Barfly” but I think a better movie about Bukowski
is the indie that Matt Dillon did about his novel, “Factotum” which
details the 10 years he was going from job to job, woman to woman, just
trying to survive as an alcoholic in a world that kept beating him down.
He wrote his first novel in 19 days. Michael Hemmingson, who I write
about below, wrote me and said Bukowski had to finish that novel so fast
because he was desperately afraid he was going to be a failure at being
a successful writer and didn’t want to disappoint John Martin, who had
essentially given him an advance for the novel.
(a tattoo of the epitaph on Bukowski's tombstone)
3) Survival. When I think “constant alcoholic” I
usually equate that with being a homeless bum. Bukowski, at some deep
level, realized that he needed to survive. He couldn’t just be a
homeless bum and kill himself, no matter how many disappointments he
had. He worked countless factory jobs (the basis of the non-fiction
novel, “Factotum”) but even that wasn’t stable enough for him. Finally,
he took a job working for the US Government (you can’t get more stable)
working in the post office for 11 years. He didn’t miss child support
payments (although he constantly wrote about how ugly the mother of his
child was), and as far as I know he was never homeless or totally down
and out from his early 30s ’til the time he started having success as a
writer.
And despite writing about the overwhelming poverty he had, he did
have a small inheritance from his father, a savings account he built up,
and a steady paycheck. The post office job is documented, in full, in
his first “novel” called, appropriately, “Post Office”. Many people
think that’s his best novel but I put it third or fourth behind “Ham on
Rye” and “Factotum” and possibly “Women”. He also wrote a novel,
“Hollywood” about the blow-by-blow experience of doing the movie
“Barfly”. All the names are changed (hence its claim to be fiction) but
once you figure out who everyone is, it’s totally non-fiction. Like all
of his other novels (not counting “Pulp”, which was the worst American
novel ever written and published).
[See, 33 Unusual Ways to Be a Better Writer – many tips I got from reading his books.]
4) Discipline. Imagine working a brutal 10 hour
shift at the Post Office, coming home and arguing with your wife or
girlfriend, or half-girlfriend, half-prostitute that was living with
you, finishing off three or four six-packs of beer and then…writing. He
did it every day. Most people want to write that novel, or finish that
painting, or start that business, but have zero discipline to actually
sit down and do it. If there was any talent that Bukowski had that I
can’t actually figure out how he got it, it’s that discipline.
When he was younger (early 20s, late teens) he spent almost every day
in the library, falling in love with all the great writers. The love
must have been so great it superseded almost everything else in his
life. He had to write like them or he really felt like he would die. He
had to “put down a good line” as he would say. And every day he would
try. And good, bad, or ugly, he probably ultimately ended up publishing
(many posthumously) everything he ever wrote. I try to match that
discipline. Even when I don’t post a blog post I write seven days a
week, every morning. At least 1000 words and a completed post. I used to
do this in my 20s when I was trying to write fiction. My minimum then
was 3000 words. I did that for five years.
It adds up. The average book is 60,000 words. If you can write 1000
words a day then you’ll have 6 books by the end of the year. Because
poetry books are much smaller, Bukowski probably had around 80 or so
books published by the time he was dead and I bet there are more coming.
(his first novel at age 49. You're never too old).
5) His “literary map”. He was inspired by
several writers and he inspired many more. Some of my favorite writers
come from both categories. He was probably most inspired by three
writers: Celine, Knut Hamsun, and John Fante. I highly recommend
Celine’s “Journey to the End of the Night”. Celine is almost a more raw
version of Bukowski. He was constantly angry and trying to survive and
do whatever it took to survive. The thing about Bukowski, as opposed to
many other writers, is he didn’t concern himself with flowery images or
beautiful sunsets. He totally wrote as if he were speaking to you and
Celine does that to an extreme but he’s so raw and smart that the way he
“speaks” is like an insane person trying to spew out as much venom as
possible. 600 pages later his first book is a masterpiece and I often
use it in my pre-writing hour every morning when I read stuff to inspire
myself to write.
John Fante wrote the underappreciated “Ask the Dust” which was
completely forgotten until Bukowski’s publisher republished it and all
of Fante’s books. (I also recommend the movie with Colin Farrell and a
naked Salma Hayek).
(maybe Hayek's best role)
Bukowski was almost afraid to admit how much Fante directly
influenced him. He wrote in one “short story”, “I realized that
admitting John Bante had been such a great influence on my writing might
detract from my own work, as if part of me was a carbon copy, but I
didn’t give a damn. It’s when you hide things that you choke on them.”
Note he spelled “Fante” as “Bante”. That’s the extent of Bukowski’s
fiction. Another interesting thing is the last line. Nothing flowery,
nothing descriptively beautiful, yet a line like that is what made
Bukowski unique and one of the best writers ever, getting at the hidden
truth of what was really happening in his head, rather than telling yet
another boring story filled with flowery descriptions like most books
and stories are.
Then there’s the authors Bukowski influenced. Michael Hemmingson
wrote an excellent review of Bukowski in the book “The Dirty Realism
Duo: Bukowski and Carver” which I highly recommend. Raymond Carver comes
from the same genre of down-and-out, oppressive relationships that were
beyond his ability to cope with them, and realist, simple writing that
was mostly autobiographical (although that’s a little less clear in
Carver’s case). I’d also throw Denis Johnson’s book of short stories
(Jesus’ Son) in that category (Johnson studied with Carver) and more
recently, books like the above-mentioned Michael Hemmingson’s “Crack
Hotel”, “The Comfort of Women”, “My Date(Rape) with Kathy Acker” and
other stories. I’m dying to find other writers in this category.
(I haven't seen the movie. Is it good?)
I read how Denis Johnson needed $10,000 to pay the IRS. So he threw
together some vignettes he had forgotten about, called the collection
“Jesus’ Son” and sent it off to Jonathan Galassi and said, “here, you
can have these if you pay the IRS”. So I Facebook-friended Galassi and
asked him if he could tell me one author in Denis Johnson’s league but
I’m still waiting for a response.
I wish I could find more writers like these. Perhaps William Vollmann
who wrote “Butterfly Stories” but his bigger fiction is too difficult
for me to read (anecdote: he wrote the afterword to the recently
re-published Celine’s “Journey to the End of the Night” so all of these writers
tend to recognize their common lineage.)
6) Poetry. I really hate poetry. When I open up
the New Yorker (blecch!) and read the latest poems in there I can’t
understand them, they all seem like gibberish to me, they all seem too
intellectual. And yet, out of all the poets I’ve read, the only ones I
really like are: Bukowski, Raymond Carver, and Denis Johnson. Poetry
allowed them to master making each word in a sentence effective and
powerful. It was this training that allowed them to destroy the
competition when they sat down to write their longer pieces. It makes me
want to try my hand at poetry but even the word “poetry” sounds so
pseudo-intellectual I just have no interest in doing it.
Bukowski: Alcoholic, postal worker, misogynist (there’s a video you
can easily find on YouTube where he must be almost 60 and he literally
kicks his wife in anger while he’s being interviewed.), anti-war,
anti-peace, anti-everything, hated everyone, probably insecure,
extremely honest, and he had to write every day or it would kill him.
In his own words, words which I hope to live by: “What a joy it must
be to be a truly great writer, even if it means a shotgun at the
finish”.
Suggested Reading:
Poem: “You Don’t Know What Love Is (an evening with Bukowski)” by Raymond Carver.
Article: “John Fante, father of LA Literature”
Movies: “Factotum”
If anyone can think of anybody else in this specific “dirty
realism” category, please put it in the comments. I’d also like to read
women in this category but I think it’s a particularly male category.
Jack Kerouac falls somewhere in there but he’s more “beat” which I think
is different. And Chad Kultgen’s recent books (“The Average American
Male”, for instance) are also somewhat in the realism category but not
quite “dirty” enough.