Using the command line to talk directly to your computer with lines of text, rather than pointing and clicking, can seem daunting at first. However, Max Harlow of the Financial Times explained at the 11th Global Investigative Journalism Conference that it is not as challenging as it seems.
“Even though this stuff can seem intimidating, it’s based on simple principles,” Harlow said.
Learning a few lines of code and using basic tools can speed up your data gathering and analysis, and even help you find stories you wouldn’t have found without it.
“If you can get familiar with those, it opens up this world of cool, useful tools,” he said.
Harlow was a software developer before becoming a journalist, but most of the stories he used as examples during his presentation were done by reporters who had little to no experience using the command line.
The command line is available on all operating systems, though it is less commonly used on Windows. On Mac and Linux distributions, you can use it by opening Terminal. For Windows, there is a free tool called Cygwin, which provides functionality similar to a Linux distribution.
Using commands to speak directly to your computer will improve your data analysis speed, power, and efficiency, Harlow said.
For fact-checking purposes, commands allow you to track every step of your data analysis. You can save your lines of code so you can check how you got your results, and also apply them to a similar set of data. Another reason to use the command line is that sometimes it is the only way to analyze the data in the way your story requires.
“Sometimes there is no other option [because] there is no other tool that does that thing you need to do,” Harlow said. “Using the command line is the only way.”
The first tool Harlow showed tells you who owns the domain name of a website.
When police apprehended a suspect for the murder of British MP Jo Cox, news organizations were scrambling to find out more details on the suspect. The Financial Times found that the suspect’s name, Thomas Mair, was mentioned on a Neo-Nazi website.
“They had the website but it didn’t have any contact details,” Harlow said.
Using command line tools whois and grep, the Financial Times found out who registered the website and retrieved a phone number. By calling the website’s owner, a member of the Neo-Nazi organization, they learned that Mair had left the organization because it wasn’t right wing enough for him.
The tool whois will show you the owner of a domain name, while grep allows you to search inside text. Both come preinstalled on macOS and can be easily installed on Linux or Cygwin.
To find out who owns the domain name of the Financial Times, for example, you can type “whois ft.com”, which returns the domain’s registration record.
To easily find a line of text within the whois result, or any folder or file that has text, you can use the grep tool. Using a pipe, a symbol used to connect different tools, you can use the output of whois as the input for grep.
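The pipe can be tried without a network connection. The sketch below is only an illustration: the record text is invented placeholder data standing in for whois output, and the variable names are hypothetical.

```shell
# Invented placeholder lines standing in for a whois record (not real data).
record='Domain Name: example.com
Registrant Name: Example Registrant
Registrant Phone: +1.5550100
Admin Phone: +1.5550101'

# The pipe (|) passes the text printed on the left to grep on the right,
# and grep keeps only the lines that contain "Phone".
phone_lines=$(printf '%s\n' "$record" | grep Phone)

printf '%s\n' "$phone_lines"
```

With a real lookup, the left-hand side of the pipe would be the whois command itself rather than the placeholder text.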
To look for a phone number in the whois results, the command would be “whois ft.com | grep Phone”, which returns only the lines of the record that contain the word “Phone”.
Data Analysis
You can also use the command line to do fast data analysis. Typing a line of code is much faster than opening a CSV in a spreadsheet program such as Excel, and you can combine tools and commands in various ways to make more efficient queries.
A simple tool to use for data analysis is called xsv. You can use a “count” function to quickly see how many rows a spreadsheet has, use “headers” to see all the column names and use “frequency” to see how often names come up, for example.
To turn these results into a reusable file, you can save everything as a CSV using a greater than symbol.
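As a rough sketch of those xsv functions, assuming xsv is installed: the file names, column names, and values below are invented for illustration and do not come from the presentation.

```shell
# Build a tiny sample spreadsheet; the names and amounts are invented.
cat > payments.csv <<'EOF'
name,amount
Acme Ltd,100
Beta PLC,250
Acme Ltd,75
EOF

# Only run the xsv commands if xsv is actually installed.
if command -v xsv >/dev/null 2>&1; then
    xsv count payments.csv                 # number of data rows
    xsv headers payments.csv               # column names
    xsv frequency -s name payments.csv     # how often each name appears

    # The greater-than symbol writes the output to a file
    # instead of printing it on the screen.
    xsv frequency -s name payments.csv > name_counts.csv
else
    echo "xsv is not installed; skipping the xsv commands" >&2
fi
```

The redirect works the same way for any command that prints text, so the saved file can be opened later or passed on to another tool.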
Harlow also showed a tool that many journalists will find very useful: a fuzzy name matching tool called csvmatch. It allows you to find identical or very similar names across two spreadsheets.
“There is a million different ways a name can be written out,” Harlow said. “It is the same with companies, which have the little suffixes [such as] limited, PLC, … that can be written with dots or capitalization.”
Again, this only takes a single line of text to put into the command line. Using a story by The New Humanitarian as an example, Harlow explained how to use csvmatch. Reporters there found a match between a spreadsheet of UN contractors and a spreadsheet of companies blacklisted by the UN.
“This news organization just did a fuzzy match on the names on those two lists,” Harlow said. “Are there any companies that the UN pay but that are also
The command they used for this is:
The backslashes are added to tell the computer to consider all the different lines as one command and to make the command more readable. You could also write it all in one line.
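The effect of the backslashes can be seen with any command. The sketch below uses grep on an invented list of names rather than the reporters’ actual command, and checks that the one-line and multi-line forms give the same answer.

```shell
# Invented sample lines to search through.
printf 'Acme Ltd\nacme limited\nBeta PLC\n' > names.txt

# The command written on a single line:
# -i ignores capitalization, -c counts the matching lines.
one_line=$(grep -i -c acme names.txt)

# The same command split across several lines. Each backslash at the
# end of a line tells the shell that the command continues on the next line.
multi_line=$(grep \
    -i \
    -c \
    acme \
    names.txt)

echo "$one_line $multi_line"
```

Both forms count the same two matching lines. The backslash must be the very last character on its line, with no space after it.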
To enrich large data sets with additional information that is online in bulk, Harlow made his own tool called reconcile. With reconcile, you can add the company number and owners to a list of company names, for example.
To find out how to use reconcile and more tools, take a look at Harlow’s full presentation.
Jelter Meers is a researcher and reporter at the Organized Crime and Corruption Reporting Project and a coordinator and editor at the Investigative Journalism Education Consortium. He helped organize the data and academic speaker tracks at #GIJC2017 and #GIJC2019.