Spanish Newspaper Editions
Newspapers are the tangible record of the history lived by a society, and their importance lies in the way they observe, describe and record the facts. How do they record reality? How have editorial lines and political trends marked these records?
This project aims to be a first step towards a critical analysis of these questions, as well as an invitation to read newspapers in a different way and to continue researching and analysing the richness contained in historical newspapers.
What influences newspaper style?
OUR PROJECT
The Spanish Newspaper Project analyzes 100 editions (complete and semi-complete) of 16 historical Spanish newspapers with different political and editorial lines. Our dataset and metadata were built with material available in the Digital Newspaper Library of the National Library of Spain. The analysis includes the articles published in these newspapers between 1890 and 1940. This is a subjective selection, based on historical issues published in this period.
DATA
DESCRIPTION
In general, the scanned texts were in acceptable reading condition when we collected them, but we had to address two problems before starting the process:
1. There were extra spaces in some words, and
2. Certain characters were not recognized by the OCR.
OPEN THE NOTEBOOK
100 editions of historical Spanish newspapers
16 newspapers
4 types of format
7 different ideologies
3 regions
2 types of audience
2,316,825
RAW TOKENS ANALYZED
To fix the words with extra spaces, we wrote a function that tries to recompose those words by removing the spaces and converts all tokens to lowercase.
To address the OCR issues, we used the spaCy Spanish model "es_core_news_md" to remove unrecognized text.
2,000,333
TOKENS RECOGNIZED BY SPACY
330,434
UNIQUE RAW WORDS
65,876
UNIQUE WORDS
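The two cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `recompose_tokens` and `keep_recognized`, the `max_parts` limit, and the small `known_words` set (standing in for a lookup against the "es_core_news_md" vocabulary) are all assumptions made for the example.

```python
def recompose_tokens(tokens, known_words, max_parts=4):
    """Lowercase tokens and glue adjacent fragments back into known words.

    known_words is a hypothetical stand-in for the model vocabulary;
    max_parts is an assumed limit on how many fragments are joined.
    """
    lowered = [t.lower() for t in tokens]
    result = []
    i = 0
    while i < len(lowered):
        merged = False
        # Try the longest run of fragments first, down to two fragments.
        for size in range(min(max_parts, len(lowered) - i), 1, -1):
            candidate = "".join(lowered[i:i + size])
            if candidate in known_words:
                result.append(candidate)
                i += size
                merged = True
                break
        if not merged:
            # No merge found: keep the token as it is.
            result.append(lowered[i])
            i += 1
    return result


def keep_recognized(tokens, known_words):
    """Drop tokens that the vocabulary does not recognize (OCR debris)."""
    return [t for t in tokens if t in known_words]


# Toy vocabulary and OCR-damaged input, invented for illustration.
known_words = {"el", "gobierno", "de", "madrid"}
raw = "El go bier no de Ma drid xq#".split()

recomposed = recompose_tokens(raw, known_words)
clean = keep_recognized(recomposed, known_words)
print(recomposed)
print(clean)
```

A greedy merge like this can also join two legitimate neighbouring words when their concatenation happens to be a word, so the result is worth spot-checking on real pages.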
Explore, analyze, and visualize the data
It is important to keep in mind that we selected a small dataset to allow faster analysis and iteration. It is therefore not very representative, and we cannot draw any conclusions about general developments in the Spanish press over this period.
This is an exploration of our dataset.
STYLO
Stylistic analysis of newspaper edition text
For the analysis of our corpus of historical Spanish newspapers, the team created different subcorpora based on the feature we wanted to analyse. The visualisations should be as clear as possible, and colouring plays an essential role in that. Creating subcorpora from the main corpus makes it possible to plot, for example, all newspapers with a primarily adult audience in the same colour for easy visualisation.
With this in mind, the analysis started by inputting the plain text and choosing the Spanish language from the stylo() parameters. What makes stylo() such an innovative package is that it lets the user choose from a range of preset settings, matched to the complexity of the corpus, that are suited to fast exploratory analysis. In the features section, we chose to count words and left the n-gram size at 1.
The Most Frequent Words parameter was set to 500 MFW. To exclude unwanted words from the analysis, the culling parameter was set to min=max=20, meaning that a given word has to appear in at least 20% of the texts. Pronouns were also excluded. In the statistics section, the best exploratory method for the corpus at hand was Multidimensional Scaling, and to obtain the most precise measure of similarity between texts, Eder’s Delta was chosen because Spanish is a highly inflected language. No sampling was performed.
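To make the effect of the MFW and culling settings concrete, here is a small sketch in plain Python. It is not stylo's own code (stylo is an R package), and the order of operations shown here, ranking by corpus frequency, culling, then truncating, is an assumption made for illustration. The function name `select_features` and the toy texts are hypothetical.

```python
from collections import Counter


def select_features(texts, mfw=500, culling_percent=20):
    """Return up to `mfw` frequent words that survive culling.

    texts: list of token lists, one per newspaper edition.
    culling_percent: a word must occur in at least this share of texts.
    """
    total = Counter()     # corpus-wide word frequencies
    doc_freq = Counter()  # number of texts each word occurs in
    for tokens in texts:
        total.update(tokens)
        doc_freq.update(set(tokens))

    min_docs = len(texts) * culling_percent / 100
    ranked = [word for word, _ in total.most_common()]
    culled = [word for word in ranked if doc_freq[word] >= min_docs]
    return culled[:mfw]


# Five tiny invented "editions"; with 40% culling a word needs 2 of 5 texts.
texts = [
    ["de", "la", "de", "sol"],
    ["de", "la", "luna"],
    ["de", "el"],
    ["de", "la"],
    ["de", "mar"],
]
features = select_features(texts, mfw=500, culling_percent=40)
print(features)
```

With the project's settings (500 MFW, 20% culling), a word that is frequent in only one or two editions would be dropped, which keeps edition-specific vocabulary from dominating the comparison.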
FOR MORE DETAILS, OPEN THE NOTEBOOK
NEWSPAPERS
The first visualisation consists of all the newspaper editions that make up the corpus. For the most part, editions cluster together according to their newspaper, and the majority are very similar in writing style. Of course, there is a very small number of outliers.
AUDIENCE
The corpus was divided previously into two categories: youth and adult audiences.
From an audience point of view, the adult-targeted newspapers cluster clearly together. The newspaper editions aimed at youth audiences seem to be split into two groups. The number of adult-targeted newspapers clearly surpasses the number of youth newspapers, so the main target audience was not people aged between 15 and 24.
FORMAT
The corpus was divided into four categories according to the number of appearances in a month: daily, biweekly, weekly, and monthly. The format is important to the analysis because it usually reflects how well-established a newspaper is. As we can see, the majority of the newspapers had daily editions, and very few of them had either weekly or monthly editions. For the latter, this could mean either that they had a niche audience, such as youth, or that they covered a limited distribution area.
HEADQUARTERS (HQ) - REGION
From the two visualisations, we observe that most national newspapers had their headquarters in Madrid or Barcelona, and regional newspapers were based in either Sevilla or Santander.
IDEOLOGY
From this depiction, one can see how on a general level the different ideologies specific to each newspaper tend to position themselves close to each other. It is also interesting to observe how some specimens of unknown ideology were placed in the cluster with specimens of socialist and regionalist ideology.
A Network analysis based on stylo results
In Gephi, we laid out the stylistic data generated by Stylo with the Force Atlas 2 algorithm. Nodes were sized according to their degree and edges according to their weight. Nodes were then coloured according to the feature relevant for each analysis.
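As a rough illustration of what sizing nodes by degree means, the sketch below computes degrees from a weighted edge list and scales them to display sizes. It does not reproduce Gephi's internals: the function names, the size range, and the edition labels are invented for the example, and the edge-list shape (source, target, weight) is an assumption about how the Stylo output was exported.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct neighbours per node in an undirected edge list."""
    neighbours = defaultdict(set)
    for source, target, _weight in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}


def scale_sizes(degrees, min_size=10.0, max_size=50.0):
    """Map degrees linearly onto a display size range (assumed values)."""
    low, high = min(degrees.values()), max(degrees.values())
    if high == low:
        # All nodes have the same degree: give them all the same size.
        return {node: min_size for node in degrees}
    span = max_size - min_size
    return {
        node: min_size + (degree - low) / (high - low) * span
        for node, degree in degrees.items()
    }


# Hypothetical editions and weights, purely for illustration.
edges = [
    ("edition_a", "edition_b", 1.0),
    ("edition_a", "edition_c", 2.0),
    ("edition_b", "edition_c", 1.0),
    ("edition_a", "edition_d", 0.5),
]
degrees = node_degrees(edges)
sizes = scale_sizes(degrees)
print(degrees)
print(sizes)
```

An edition with a high degree is stylistically close to many other editions, which is why it appears as a larger node in the network.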
FOR MORE DETAILS, OPEN THE NOTEBOOK
GEPHI
NEWSPAPERS
Individual editions of newspapers cluster together fairly strongly, though to various degrees. For example, all editions of Vida Socialista are the closest to each other, while some editions of La Dinastía are spread out fairly widely. This could be a result of actual stylistic differences, but errors in OCR and token recognition are the most likely cause.
You can also see the different degrees of these newspapers, represented by the size of the nodes. Stylistically, El Sol and El Imparcial seem to have had the most editions similar to them, as they have the highest degrees. This fits with descriptions of both as influential newspapers in this period.
FORMAT
The main cluster is obviously formed by newspapers with a daily publication cycle. Only a few non-daily newspapers are even close to the main cluster.
IDEOLOGY
The centre of the main cluster is formed by Liberal, Republican or Conservative newspapers. Anarchists, Socialists or Carlists are on the margins, especially in the case of El Cruzado Español on the top right. Interestingly, the other Carlist newspaper CEDA is located rather close to Anarchists and Socialists. There are also two non-political newspapers, shown here as “nan”, which are the ones aimed at youth audiences. They are both completely separate from the main cluster.
HEADQUARTERS (HQ)
In this case, the centre of the main cluster is formed by newspapers located in the Spanish capital of Madrid. A ring of publications set in Barcelona, Spain’s second city and the capital of Catalonia, surrounds it. One newspaper set in Cantabria’s Santander is well-connected to the main cluster, while the single Andalusian newspaper headquartered in Sevilla is completely separate from the rest.
A possible explanation is that Barcelona was such an important city that its newspapers were still written for a national readership, but its different location made them recognisably distinct from the Madrid ones. Santander’s regional style may not have had much influence on its newspaper, while the Andalusian accent is famously strong, which suggests a stronger regional influence on the Sevilla newspaper.
AUDIENCE
As in the ideology analysis, the two newspapers aimed at a youth audience are completely separate from the main cluster.
PUBLICATION YEAR
Editions in the main cluster are ordered from an earlier to a later year of publication, from left to right.
CONCLUSIONS OF THE PROJECT
What influences newspaper style?
PROJECT TEAM
Lisa Raulli
S5352029
Data wrangling & Website management
Maximilian Henning
S5305403
Data wrangling & Gephi analysis
Catalina Cruceanu
S5367530
Stylo analysis
Maria Pilar Uribe Silva
S5341191
General Coordinator, Data wrangling & Website management