“Content Analysis of 150 Years of British Periodicals” published in PNAS

What could be learnt about the world if you could read the news from over 100 local newspapers for a period of 150 years? This is what a team of Artificial Intelligence (AI) researchers from the University of Bristol have done, together with a social scientist and a historian, who had access to 150 years of British regional newspapers.

The patterns that emerged from the automated analysis of 35 million articles ranged from the detection of major events, to the subtle variations in gender bias across the decades. The study has investigated transitions such as the uptake of new technologies and even new political ideas, in a new way that is more like genomic studies than traditional historical investigation.

The team of academics, led by Professor Nello Cristianini, collaborated closely with the company findmypast, which is digitising historical newspapers from the British Library as part of its British Newspaper Archive project.

The main focus of the study was to establish if major historical and cultural changes could be detected from the subtle statistical footprints left in the collective content of local newspapers. How many women were mentioned? In which year did electricity start being mentioned more than steam? Crucially, this work goes well beyond counting words, and deploys AI methods to identify people and their gender, or locations and their position on the map.

The landmark study, part of the University of Bristol’s ThinkBIG project, collected a huge number of regional newspapers from the UK, including geographical and time-based information that is not available in other textual data such as books. Over 35 million articles and 28.6 billion words, from the British Library’s newspaper collections, representing 14 per cent of all British regional outlets from 1800 to 1950, were used for the study.

Nello Cristianini, Professor of Artificial Intelligence, from the Department of Engineering Mathematics, said: “The key aim of the study was to demonstrate an approach to understanding continuity and change in history, based on the distant reading of a vast body of news, which complements what is traditionally done by historians.

“The research team showed that changes and continuities detected in newspaper content can reflect culture, biases in representation or actual real-world events.”

Simple content analysis allowed the researchers to detect specific key events like wars, epidemics, coronations or gatherings with high accuracy, while the use of more refined techniques from AI enabled the research team to move beyond counting words by detecting references to named entities, such as individuals, companies and locations.

Some of the results were to be expected, and acted as a sanity check for the approach, while other outcomes were not so obvious at the start of the analysis.

In the areas of values, beliefs and UK politics, the researchers found that in the 19th century Gladstone was much more newsworthy than Disraeli; that until the 1930s Liberals were mentioned more than Conservatives; and that reference to British identity took off in the 20th century.

In the subjects of technology and economy, the research team tracked the steady decline of steam and the rise of electricity, with a crossing point of 1898; trains overtook horses in popularity in 1902; and the four largest peaks for ‘panic’ corresponded with negative market movements linked to banking crises in 1826, 1847, 1857 and 1866.

In the subjects of social change and popular culture, the researchers have shown that the Suffragette movement fell within a delimited time interval, 1906 to 1918; that ‘actors’, ‘singers’ and ‘dancers’ began to increase in the 1890s, rising significantly from then on, while references to ‘politicians’, by contrast, gradually declined from the early 20th century; and that ‘football’ was more prominent than ‘cricket’ from 1909.
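The crossing points reported above (electricity overtaking steam, trains overtaking horses, football overtaking cricket) can be illustrated with a minimal sketch. It assumes each term has a yearly relative-frequency series and takes the crossing point to be the first year from which one series stays above the other. That definition, the function name and the toy numbers are assumptions made for illustration; the article does not describe how the study computed or smoothed its series.

```python
from typing import Dict, Optional


def crossing_year(rising: Dict[int, float], falling: Dict[int, float]) -> Optional[int]:
    """Return the first year from which `rising` stays above `falling`.

    Both arguments map a year to the relative frequency of a term.
    Returns None if `rising` never overtakes `falling` for good.
    """
    candidate = None
    for year in sorted(set(rising) & set(falling)):
        if rising[year] > falling[year]:
            if candidate is None:
                candidate = year  # possible crossing point
        else:
            candidate = None  # overtaken again, so discard the candidate
    return candidate


# Synthetic toy series indexed by an abstract year counter (NOT study data).
falling_term = {1: 5.0, 2: 4.5, 3: 4.2, 4: 3.8, 5: 3.1}
rising_term = {1: 3.0, 2: 4.7, 3: 4.0, 4: 4.1, 5: 4.4}

print(crossing_year(rising_term, falling_term))
```

In the toy data the rising term briefly leads in year 2 but falls back in year 3, so the sketch only accepts the later, lasting crossover.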

Replicating a previous study done on book content, the researchers then moved on to link famous people in the news to their profession, finding that politicians and writers are most likely to achieve notoriety within their lifetimes, while scientists and mathematicians are less likely to achieve fame but decline less sharply.

More importantly, the researchers found that males are systematically more present than females during the entire period studied, but there is a slow increase of the presence of women after 1900, although it is difficult to attribute this to a single factor at the time. Interestingly, the amount of gender bias in the news over the period of investigation is not very different from current levels.

Dr Tom Lansdall-Welfare, Research Associate in Machine Learning in the Department of Computer Science, who led the computational part of the study, said: “We have demonstrated that computational approaches can establish meaningful relationships between a given signal in large-scale textual corpora and verifiable historical moments.

“However, what cannot be automated is the understanding of the implications of these findings for people, and that will always be the realm of the humanities and social sciences, and never that of machines.”

The researchers believe that these data-driven approaches can complement the traditional method of close reading in detecting trends of continuity and change in historical corpora. The contribution that Big Data and AI can make to the field of Digital Humanities is still largely unexplored, and is one of the most exciting areas of cross-disciplinary research enabled by the new field of Data Science. The ThinkBIG project is aimed at exploring the interplay between Social Science, Humanities, and large-scale data-driven AI.

Paper: Content Analysis of 150 Years of British Periodicals by Thomas Lansdall-Welfare, Saatviga Sudhahar, James Thompson, Justin Lewis, The FindMyPast Newspaper Team and Nello Cristianini in Proceedings of the National Academy of Sciences of the United States of America (PNAS).

Secondary Data from the Study: http://data.bris.ac.uk/data/dataset/dobuvuu00mh51q773bo8ybkdz

Two workshop papers accepted for presentation at ICDM 2016

Two papers from the ThinkBIG project have been accepted for presentation at workshops at the International Conference on Data Mining, taking place in Barcelona, Spain, from 12th to 15th December 2016.

The first paper, Seasonal Fluctuations in Collective Mood Revealed by Wikipedia Searches and Twitter Posts, will be presented at the Sentiment Elicitation from Natural Text for Information Retrieval and Extraction workshop (SENTIRE) by Fabon Dzogang. In this paper, we investigate seasonal fluctuations in mood and mental health by analyzing the access logs of Wikipedia pages and the content of Twitter in the UK over a period of four years. By using standard methods of Natural Language Processing, we extract daily indicators of negative affect, anxiety, anger and sadness from Twitter and compare these with the overall daily traffic to Wikipedia pages about mental health disorders.
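One simple way to picture a daily affect indicator is as the fraction of words posted on a given day that belong to a fixed word list. The sketch below assumes exactly that; the word list, the tokenisation and the function name are hypothetical, since the announcement above only says that standard Natural Language Processing methods were used.

```python
import re
from collections import defaultdict

# Hypothetical mini word list; the lexicon used in the paper is not given here.
SADNESS_WORDS = {"sad", "cry", "lonely", "grief"}


def daily_indicator(posts, word_list):
    """Map each day to the fraction of its tokens that appear in `word_list`.

    `posts` is an iterable of (date_string, text) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, text in posts:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[day] += len(tokens)
        hits[day] += sum(1 for token in tokens if token in word_list)
    return {day: hits[day] / totals[day] for day in totals if totals[day] > 0}


# Invented example posts, for illustration only.
posts = [
    ("2016-01-01", "So sad and lonely today"),
    ("2016-01-01", "Lovely walk in the park"),
    ("2016-01-02", "Great match tonight"),
]
indicators = daily_indicator(posts, SADNESS_WORDS)
print(indicators)
```

Normalising by the total number of tokens per day keeps the indicator comparable between busy and quiet days.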

The second paper, Change-point Analysis of the Public Mood in UK Twitter during the Brexit Referendum, will be presented at the Data Mining in Politics workshop (DMiP) by Thomas Lansdall-Welfare. In this paper, we study the changes in public mood within the contents of Twitter in the UK, in the days before and after the Brexit referendum. We measure the levels of anxiety, anger, sadness, negative affect and positive affect in various geographic regions of the UK, at hourly intervals. We analyse these affect time series by looking for change-points common to all five components, locating points of simultaneous change in the multivariate series using the fast group LARS algorithm, an algorithm originally developed for bioinformatics applications.
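The idea of a change-point shared by several series can be sketched with a much simpler technique than the one used in the paper. The example below is not the fast group LARS algorithm: it is a brute-force least-squares search for a single common change-point, and the function name and toy series are assumptions made for illustration.

```python
def common_change_point(series):
    """Find one change-point shared by all series (brute-force least squares).

    `series` is a list of equal-length lists, one per affect component.
    Each component is modelled by one mean before index t and another mean
    from t onwards; the t with the lowest total squared error is returned.
    """
    n = len(series[0])

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_t, best_cost = None, float("inf")
    for t in range(1, n):
        cost = sum(sse(s[:t]) + sse(s[t:]) for s in series)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


# Synthetic toy series with a simultaneous jump at index 4 (NOT study data).
anxiety = [1.0, 1.1, 0.9, 1.0, 2.0, 2.1, 1.9, 2.0]
sadness = [0.5, 0.4, 0.6, 0.5, 1.5, 1.4, 1.6, 1.5]

print(common_change_point([anxiety, sadness]))
```

Summing the error over all components is what makes the detected point common to the whole multivariate series, rather than specific to one signal.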

 

Public lecture on “Living in a Data Obsessed Society”

On 2nd December 2016, Nello Cristianini gave a public lecture with James Ladyman and Andrew Charlesworth, hosted by Abigail Fraser, to discuss the theme of “living in a data obsessed society” with the public.

A new unified data infrastructure that mediates a broad spectrum of our daily transactions, communications, and decisions has emerged from the data revolution of the past decade. New AI technologies permit this infrastructure to infer our inclinations and predict our behaviour for an increasing range of activities, whether social, economic or regulatory. As opting-out is no longer a realistic option, we must strive to understand the effects this new reality can have on society.

Presently, we are ‘sleepwalking’ into unquestioning acceptance of a data ideology which presupposes that data-driven decisions are inherently neutral, objective and effective. Growing evidence to the contrary requires that such assumptions must be rigorously and robustly questioned. From privacy to persuasion, this technology will affect all of us.

Issues that demand wider debate include addressing the risks of unintended discrimination, challenging spurious claims of objectivity, the need to uphold an ethics of privacy and autonomy, and the importance of understanding the future roles and capabilities of intelligent machines.

A data scientist, a philosopher of science, and a legal scholar will present their work on the theme of “living in a data obsessed society”.

Big data shows people’s collective behaviour follows strong periodic patterns

New research has revealed that analysing massive data sets of modern and historical news, social media and Wikipedia page views exposes periodic patterns in the collective behaviour of the population that could otherwise go unnoticed.

Academics from the University of Bristol’s ThinkBIG project, led by Nello Cristianini, Professor of Artificial Intelligence, have published two papers that have analysed periodic patterns in daily media content and consumption: the first investigated historical newspapers, the second Twitter posts and Wikipedia visits.

The two sets of findings, taken together, show that people’s collective behaviour follows strong periodic patterns and is more predictable than previously thought.  However, these patterns can often only be revealed when analysing the activities of a large number of people for a very long time, and until recently this has been a very difficult task.

By using big data technologies it is now possible to obtain a unified look at newspaper content for dozens of newspapers at the same time, spanning several decades, or to analyse the content posted on Twitter by large numbers of users, or even the Wikipedia pages visited.

Professor Nello Cristianini, from the Department of Engineering Mathematics, said: “What emerges is a glimpse at the regularities in our behaviour that are hidden behind the day-to-day variations in our lives.

“Our two papers have shown that by analysing massive data sets of modern and historical news, social media and Wikipedia page views, we can obtain an unprecedented look at our collective behaviour, revealing cycles that we certainly suspected, but that have never been observed before.”

The first paper, published in the journal PLOS ONE, analysed 87 years of US and UK newspapers between 1836 and 1922.  The researchers found people’s leisure and work were strongly regulated by the weather and seasons, with words like picnic or excursion consistently peaking every summer in the UK and US.

Much of our diet was influenced by the seasons too, with very predictable peak times for different fruits and foods, and even flowers, in the historical news. The same was found for diseases: the peak season for measles in both countries was late March to early April. Interestingly, a strong indicator was provided by the very periodic re-appearance of gooseberries every June, which is no longer found in modern news, along with many other lost traditions.

This may seem obvious, but the research team also noticed that certain activities that used to be highly regular, like Christmas lectures, have now all but disappeared, and have been replaced by other periodic activities, like football, Ibiza and Oktoberfest. In some ways, the TV has partly replaced the weather as a major factor of synchronisation of people’s lives.

In the second paper, to be presented next month at a workshop at the 2016 IEEE International Conference on Data Mining (ICDM), the researchers discovered that seasons may also have strong effects on mental health.  The team analysed the aggregate sentiment in Twitter in the UK, plus aggregate Wikipedia access over four years.  They found that negative sentiment is overexpressed in the winter, peaking in November, and anxiety and anger are overexpressed between September and April.

At the same time, an analysis of Wikipedia visits for mental health pages, globally but strongly dominated by northern hemisphere traffic, showed clear seasonality in searches for specific forms of mental health issues. For example, visits to the page on seasonal affective disorder peak in late December and panic disorder visits peak in April, at the same time as visits to the page on acute stress disorder.
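Seasonal peaks of the kind described above can be read off a daily series by pooling all years and averaging by calendar month. The sketch below assumes this simple monthly-mean approach; the function name and the toy values are hypothetical and are not taken from either paper, which may have used a different seasonality test.

```python
from collections import defaultdict
from datetime import date


def peak_month(daily_values):
    """Return the calendar month (1-12) with the highest mean value.

    `daily_values` maps a datetime.date to a number, such as a daily
    sentiment indicator or a daily page-view count. Years are pooled.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in daily_values.items():
        sums[day.month] += value
        counts[day.month] += 1
    return max(sums, key=lambda month: sums[month] / counts[month])


# Synthetic toy values spread over two years (NOT study data).
toy_series = {
    date(2014, 8, 3): 0.9,
    date(2015, 8, 9): 0.8,
    date(2014, 6, 3): 0.2,
    date(2015, 6, 9): 0.3,
    date(2015, 2, 1): 0.5,
}

print(peak_month(toy_series))
```

Pooling several years before averaging is what separates a recurring seasonal peak from a one-off spike in a single year.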

Together, these two articles show that the use of multiple sources of big data can enable researchers to look at the collective behaviour, and even the mood and mental health, of large populations, revealing for the first time cycles that have long been suspected but were difficult to observe.

New study in PLOS ONE shows women are seen more than heard in online news

It has long been argued that women are under-represented and marginalised in relation to men in the world’s news media. New research, using artificial intelligence (AI), has analysed over two million articles to find out how gender is represented in online news. The study, which is the largest undertaken to date, found men’s views and voices are represented more in online news than women’s.

What is perhaps more interesting is that the research found – while being overall under-represented – women appear proportionally more in images than men, while men are mentioned more in text than women.  A breakdown of topics shows that women feature more in articles about fashion, followed by entertainment and art, while being least present in topics including sport and politics.

A team of AI experts at the University of Bristol’s Intelligent Systems Laboratory (ISL), led by Nello Cristianini, Professor of Artificial Intelligence, teamed up with social scientist Dr Cynthia Carter from Cardiff University to ask a very old question on a very new scale. How many men and how many women are mentioned in the news, or portrayed in newspaper images, over a long period of time and in hundreds of different newspapers?

Modern AI, which is frequently in the news, is a great tool to support research and can automate tasks that would take humans an impossible number of person-hours to complete. It is now possible to automate the task of recognising the gender of a face with a remarkable level of accuracy, and it is also possible to detect references to people in online text, along with their gender.

The paper, published in PLOS ONE, reports the findings from a large-scale, data-driven study of gender representation in online English language news media. The researchers analysed both words and images to give a broader picture of how gender is represented in online news.

The team gathered a body of news consisting of 2,353,652 articles collected over a period of six months from more than 950 different news outlets. From this dataset, they extracted 2,171,239 references to named persons and 1,376,824 images, resolving the gender of names and faces using AI.

The researchers found that males were represented more often than females in both images and text, but in proportions that changed across topics, news outlets and style.

Additionally, the proportion of females was consistently higher in images than in text for virtually all topics and news outlets.  Women were more likely to be represented visually than mentioned as a news actor or source.
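The comparison between text and images comes down to computing, for each topic, the share of female references in each channel. The sketch below shows that arithmetic on invented counts; the topic labels, numbers and function name are assumptions for illustration and do not reproduce the study's figures.

```python
def female_share(counts):
    """Return the fraction of references that are female.

    `counts` has the form {"female": int, "male": int}.
    """
    total = counts["female"] + counts["male"]
    return counts["female"] / total if total else 0.0


# Invented counts per topic and channel (NOT study data).
topics = {
    "topic_a": {
        "text": {"female": 20, "male": 80},
        "images": {"female": 35, "male": 65},
    },
    "topic_b": {
        "text": {"female": 45, "male": 55},
        "images": {"female": 60, "male": 40},
    },
}

shares = {
    topic: {channel: female_share(c) for channel, c in channels.items()}
    for topic, channels in topics.items()
}

for topic, by_channel in shares.items():
    print(topic, by_channel["text"], by_channel["images"])
```

Using shares rather than raw counts lets topics of very different sizes be compared on the same scale.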

Professor Nello Cristianini, from the University’s Department of Engineering Mathematics, said: “Just a few years ago, it would not have been possible for a computer to determine the gender of a face, or to process such a large amount of text, with the necessary accuracy and speed. The analysis of millions of articles and images is one of the ways in which modern AI can help scientific research. When Big Data meets AI we see benefits in many areas of business and technology, now we can also see benefits in the way we do science.”

Dr Cynthia Carter, Senior Lecturer in the Cardiff School of Journalism, Media and Cultural Studies, added: “Our large-scale, data-driven analysis offers important empirical evidence of macroscopic patterns in news content, supporting feminist researchers’ longstanding claim that the marginalisation of women’s voices in the news media under-values their potential contributions to society, and in the process, diminishes democracy.”

Paper