What Data-Mining TV's Political Coverage Tells Us
Editor’s Note: RCP readers, meet Kalev Leetaru, the newest voice in our growing stable of writers and political analysts. Kalev is a senior fellow at RealClearFoundation whose beats will include media trends and visualization, which he explains in his inaugural piece below. In the coming months, Kalev will explore how data can be used to re-imagine the evolving world of journalism. Since founding his first Internet startup in the eighth grade, a year after the debut of the groundbreaking Mosaic browser, Kalev has spent the last two decades examining how the massive computing power of the web has reshaped how we interact with the world. His work has been featured in the media in some 100 nations, and will now grace the pages of RealClearPolitics.
Television offers a powerful lens through which to explore American politics. In contrast to the unlimited space of online news, the fixed daily airtime of television forces networks to carefully balance how much attention they pay to each story, bringing their agenda-setting decisions into sharp relief. It is also still a dominant news medium for much of the electorate, granting it unique relevance to the understanding of the public information environment.
Yet television has largely remained beyond the reach of the kinds of large-scale data-mining techniques increasingly being applied to textual news. The difficulty of accessing broadcast material (unlike a web page, it can’t simply be downloaded with a web crawler) and of converting its audiovisual content into formats more amenable to traditional data mining has long stymied widespread adoption of television mining for political news analysis.
To address this, over the past two years the open data GDELT Project that I lead has worked closely with the Internet Archive’s Television News Archive to examine how its vast holdings of American television news programming can be harnessed through data mining and intelligent search tools to allow researchers and political journalists to finally explore the rich landscape of broadcast political news. The Television News Archive today holds more than 2 million hours of television news programming totaling more than 5.7 billion words from over 150 American television stations spanning July 2009 to present (though not all stations were monitored for the entire period).
As one of our first collaborations, we took the 74 distinct political television ads that aired in the Philadelphia market leading up to the 2014 midterm election and created an interactive visualization that allows researchers “to explore the television advertising landscape of Philadelphia [that] fall, comparing any pair of candidates, parties, races, status, win/lost, sponsor, sponsor type, television channel, or even keywords found in the transcripts, or any combination therein.”
To create the underlying dataset, trained human volunteers initially searched through the television coverage and flagged each airing of an ad, building up a database of how many times and where each ad was shown. Given the scalability limits of relying on humans to examine an archive with more than 2 million hours of content, the task of identifying each ad airing was eventually delegated to automated “audio fingerprinting” algorithms that, as their name suggests, create a “fingerprint” of the audio of each advertisement and then scan all of the monitored television stations for any re-airing of that content. A similar visualization was later created for the 2015 San Francisco municipal elections.
Figure 1 -- Tracking which lines from the 2015 State of the Union Address went "viral" on television news
Expanding upon this approach, we explored what it would look like to measure “virality” through the medium of television. We took President Obama’s 2015 State of the Union Address, broke it into sentences, and tracked how many times each of his statements aired as a sound bite during the following two weeks across all of the stations monitored by the archive, including a selection of international broadcasts. The result was an interactive timeline of which statements “went viral” across the television news spectrum.
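For readers curious how such sentence-level tracking might look in practice, here is a minimal sketch. It is a text-matching stand-in that assumes plain-text closed-caption transcripts are available; it is not necessarily how the actual timeline was built, and the function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def split_sentences(speech):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", speech) if s.strip()]


def count_rebroadcasts(speech, transcripts):
    """Count how often each sentence of the speech appears in the transcripts.

    Assumption: a rebroadcast sound bite shows up in the captions as the
    same word sequence, so exact matching on normalized text is enough.
    """
    # Pad with spaces so matches start and end on word boundaries.
    padded = [" " + normalize(t) + " " for t in transcripts]
    counts = Counter()
    for sentence in split_sentences(speech):
        key = normalize(sentence)
        if key:
            counts[sentence] = sum(t.count(" " + key + " ") for t in padded)
    return counts
```

Real captions contain typos and dropped words, so a production system would likely need fuzzy matching or audio-based matching rather than exact text comparison.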
Figure 2 -- Tracking which statements from the first Republican presidential debate went "viral" on television news
This was followed with a series of visualizations repeating the process for each of the Republican and Democratic presidential debates through November 2015, tracking which comments and which candidates generated the most buzz over the following days. On this basis, Donald Trump won the first Republican debate, with his comments accounting for nearly a third of all debate sound bites rebroadcast on the major television news shows over the following week. One notable finding was that Fox News fairly evenly split its coverage across statements made by each of the candidates, while CNN, MSNBC and Univision were the most skewed in favor of comments by Trump, with Univision devoting a full 68 percent of its subsequent debate clips to statements by him.
Figure 3 -- Tracking which presidential candidates were mentioned the most on television news programming
In August 2015 we launched the Candidate Television Tracker, debuting in The Atlantic, which scanned the raw closed-captioning streams of each of the major television networks monitored by the Internet Archive and counted how many times each candidate was mentioned each day. Despite its incredible simplicity, this visualization offered one of the few daily looks at how the candidates were faring in terms of television news coverage and became a fixture of political media analysis, appearing everywhere from The Washington Post to FiveThirtyEight, Politico to The Guardian, Aftenposten to Al-Shorouk.
From a technical standpoint, the tool was quite basic: It simply extracted the textual closed-captioning streams from each television network and performed a keyword search for each candidate’s name (along with multiple common misspellings and some additional false positive filtering). Yet even a process this basic yielded something that had to date been missing: a simple, open method for quantifying just how much attention the major television networks were paying to each of the candidates. Given how different television news attention can be from that of print and online counterparts (a nuance missed by some outlets), the tool offered a distinctive daily look at who was “winning” the media war. The tool also enforced strict normalization, reporting attention in terms of the number of times each candidate was mentioned divided by the total number of mentions of all candidates that day, rather than as an absolute count. This ensured that reduced coverage of the race on weekends and holidays did not skew the results, an issue that not all outlets fully appreciated.
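The counting and normalization described above can be sketched in a few lines. This is an illustration only, not the tracker's actual code: the misspelling variants, the input format (caption text grouped by day) and the function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical name patterns; the misspelling variants are invented for illustration.
CANDIDATE_PATTERNS = {
    "Trump": re.compile(r"\b(?:trump|trumpp)\b", re.IGNORECASE),
    "Clinton": re.compile(r"\b(?:clinton|clintin)\b", re.IGNORECASE),
    "Sanders": re.compile(r"\b(?:sanders|saunders)\b", re.IGNORECASE),
}


def count_mentions(caption_text):
    """Count keyword hits for each candidate in one block of caption text."""
    return Counter(
        {name: len(pattern.findall(caption_text))
         for name, pattern in CANDIDATE_PATTERNS.items()}
    )


def daily_share(captions_by_day):
    """Return each candidate's share of all candidate mentions, per day.

    captions_by_day maps a date string to a list of caption texts
    (an assumed input format). Shares for a day sum to 1.0 unless
    no candidate was mentioned at all that day.
    """
    shares = {}
    for day, texts in captions_by_day.items():
        totals = Counter()
        for text in texts:
            totals.update(count_mentions(text))
        all_mentions = sum(totals.values())
        shares[day] = {
            name: (totals[name] / all_mentions if all_mentions else 0.0)
            for name in CANDIDATE_PATTERNS
        }
    return shares
```

Because each day is divided by its own total, a quiet Sunday with 50 total mentions and a busy Tuesday with 5,000 are directly comparable.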
The lesson learned from this tool was that even simple keyword searches of closed-captioning streams can offer powerful insights into the political discourse of direct relevance to scholars and journalists. Its daily updates even made it possible to track which of Trump’s statements resonated the most with the media, highlighting, for example, his autumn 2015 barrage against Muslims that catapulted him back into the media spotlight.
Figure 4 -- A glimpse of some of the television news coverage of Russian hacking
Building on this early experiment, in December 2016 we released an upgraded version of the tool that allows you to search for any arbitrary keyword or phrase across television closed-captioning. This makes it possible to quantitatively examine concepts like agenda setting by comparing which topics each network chooses to emphasize. Over the last six months of 2016, CNN spent 58 percent more of its coverage on Russian hacking than did Fox and 274 percent more time on Trump’s crude “Access Hollywood” statements. On the other hand, Fox spent three times more airtime on the Hillary Clinton email saga than CNN did. In short, if you were watching CNN, the Russians were everywhere and all powerful, hacking without mercy, while if you were a Fox News viewer, Clinton was recklessly shipping the nation’s most sensitive secrets off to her personal computer in her home basement. Lest there be any doubt, we truly live in parallel Americas, depending on which television network we watch.
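For those who want to check comparisons like these themselves, the arithmetic is straightforward. The sketch below assumes coverage is measured as the fraction of a network's airtime units (for example, caption segments) that match a keyword; the unit and the function names are assumptions, not a description of the tool's internals.

```python
def coverage_share(matching_units, total_units):
    """Fraction of a network's airtime units that mention the topic."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return matching_units / total_units


def percent_more(share_a, share_b):
    """How much larger share_a is than share_b, as a percentage of share_b."""
    if share_b <= 0:
        raise ValueError("share_b must be positive")
    return (share_a - share_b) / share_b * 100.0


def ratio_from_percent_more(pct):
    """Convert 'X percent more' into a multiplier: 274 percent more is about 3.74 times."""
    return 1.0 + pct / 100.0
```

Comparing shares rather than raw counts matters because networks differ in how many hours of news they air, and "274 percent more" works out to roughly 3.74 times as much, not 2.74 times.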
Try it out for yourself and search for any keyword of interest to see how the major national news channels compare in their coverage or see what topics are trending at any moment across CNN, Bloomberg, CNBC, Fox Business, Fox News, MSNBC and BBC News London.
This kind of insight can be coupled with the GDELT Project’s global monitoring of broadcast, print and online news sources and mass machine translation of 65 languages to ask questions like “What is the domestic news media of each country in the world saying about Trump right now?”
During a campaign season, television news channels contain more than just news and commentary, however: They are saturated with political ads that offer powerful glimpses into the messaging strategies of each candidate. In January 2016 the Internet Archive formally debuted a special archive of these ads that used audio fingerprinting to track every airing of each ad across the major national networks and a selection of regional affiliates in key battleground states. While the collection was primarily intended as a resource for journalists, its availability as an open dataset in machine-friendly formats meant it could also be readily subjected to computerized analysis. The collection, which then held 267 distinct television ads totaling 196 minutes (and which had aired at that point a combined 72,807 times), was run through Google’s Cloud Vision API, which applies the company’s deep-learning algorithms to examine each image and catalogue the objects and activities it depicts, the number of human faces present and their average emotional expression, corporate logos and geographic locations, among other attributes.
The resulting analysis was essentially a second-by-second catalogue of the majority of the major political television ads run by the presidential candidates to that point on the networks monitored by the Internet Archive.
Using this data, one could ask a question like “How much of a political ad’s airtime focuses on people?” The answer turned out to be just under two-thirds of total airtime. Of the airtime featuring people, around 70 percent of it focused on a single person, either the candidate or someone discussing them, reinforcing that politics is inherently about people. Around 60 percent of airtime contained a textual overlay of some form, especially framing messages, such as American Crossroads’ opener of “What is it with Hillary Clinton lying about terrorists and videos?” to prime the viewer for the following clip of Clinton speaking at a debate, or Clinton’s own ad showing “Strengthening the economy” on the screen before presenting her economic message. Even in the visual world of television, sometimes you need to summarize your message for your viewers.
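As a rough illustration of how figures like these could be derived from a second-by-second catalogue, the sketch below assumes each second of an ad has already been annotated with the number of faces detected in it. The input format and function name are hypothetical and do not reflect the actual analysis pipeline.

```python
def people_airtime_stats(face_counts_per_second):
    """Summarize how much airtime shows people.

    face_counts_per_second: list of ints, one per second of footage,
    giving the number of faces detected (an assumed input format).
    """
    total = len(face_counts_per_second)
    if total == 0:
        return {"people_share": 0.0, "single_person_share": 0.0}

    # Seconds in which at least one face is on screen.
    with_people = [n for n in face_counts_per_second if n > 0]
    # Of those, seconds showing exactly one face.
    single = sum(1 for n in with_people if n == 1)

    return {
        "people_share": len(with_people) / total,
        "single_person_share": single / len(with_people) if with_people else 0.0,
    }
```

Note that the second figure is computed only over airtime that features people, mirroring the way the 70 percent number above is framed.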
Applying this same image analysis to online news coverage and looking at the images used to illustrate articles about Bernie Sanders vs. those about Clinton, it turned out that images of Sanders through the middle of last year focused nearly exclusively on the size of the crowds at his rallies, while those about Clinton focused on global events. In short, in terms of the imagery they used, the media covered Sanders as a cultural phenomenon, while Clinton was covered as a potential head of state.
Looking to the future, the archive recently unveiled a collaboration with a start-up company specializing in facial recognition and has begun scanning several major television networks for any appearance of Donald Trump or several major congressional leaders. The idea behind this new filtering service is to be able to measure how much “face time” each political leader receives on air as a complement to the purely textual and audio fingerprinting analysis we’ve used to date.
Putting this all together, over the past two years we’ve shown that by combining the Internet Archive’s Television News Archive of more than 2 million hours of programming with techniques from simple keyword searches to audio fingerprinting to deep learning image recognition, we can assess the landscape of political television news in powerful new ways we are just starting to explore. Moreover, the view this data-driven lens offers can sometimes challenge our preconceived notions of political television news coverage, bringing statistical rigor to what is often anecdotal discourse. Over the coming months we will be using many of these new analytic tools to explore the changing landscape of television journalism, so stay tuned!