Data Science and Text Mining for Historians with Python

By Kaspar Beelen and Luke Blaxill

⚠️ Warning ⚠️ Still Under construction

These lectures are part of the Text Mining and Statistics course for Historians. For an overview of the full course content go here.

Run all on Binder

Unit 0: Intro and Setup

Lecture 0: How to access Notebooks on Binder

Course 1: Text Mining in Python

Unit 5: Python, the Basics

Lecture A: Introduction to the course

We start with a brief introduction to the aims and principles of this course: why should a historian bother to learn a programming language for analysing textual and other types of data? Why Python (notebooks) in particular? We also discuss what to expect from this course (and what not?) and give an overview of the skills you will obtain.

Lecture B: Basic Python, a gentle introduction

This notebook starts with a gentle introduction to the basic elements of the Python syntax. We discuss how to create and manipulate variables, and demonstrate common operations. Some topics are more extensively discussed in ‘break out’ notebooks or in external documentation.

Lecture C: Text and String Methods

Finally, we move on from more fundamental syntax to working with actual text data. In this notebook, we introduce ‘string methods’, which are Python tools for processing and manipulating text files. We also demonstrate how to open and read text files (at scale).

Unit 6: Text Processing and Corpus Exploration

Lecture A: Processing Texts

This lesson introduces core Python objects such as lists and dictionaries that you will need when processing text files. We discuss the application of Natural Language Processing tools to historical documents. More precisely, we show how to use the NLTK and SpaCy to splitting a text into tokens and analyse the grammatical structure of a sentence with part-of-speech tagging.

Lecture B: Corpus Selection

In this notebook, we introduce techniques for selecting relevant information from large data sets. We discuss how to filter and select information based on their metadata as well as textual content. The strategies covered here allow you to select documents that are relevant to your research question and build question-specific subcorpora,

Lecture C: Corpus Exploration

After building a subcorpus, you need tools to explore and analyse the texts meaningfully. We focus on a wide range of tools provided by the Natural Language Toolkit, such as concordance or Keyword in Context (KWIC), collocation analysis and feature selection. We use reports written by Victorian Medical Officers of Health as a case study.

Lecture D: Trends over Time

The last notebook in the text mining series focuses on studying discursive trends over time. The goal of this notebook is to understand the changing content of British political manifestos.

Course 2: Statistics in Python

Unit 5: Explorative Data Analysis

Lecture A: Exploring DataFrames with Pandas (Part I)

This notebook introduces the Pandas library and explores tools for working programmatically with tabular data in. We have a closer look at realistic and complex metadata derived from the British Library catalogue and demonstrate how you can refine and reorganise information with the goals of studying trends over time.

Lecture B: Exploring DataFrames with Pandas (Part II)

This notebook uses “synthetic” demographic data about age and gender in late Victorian London. We discuss different types of variables and strategies for visualising distributions. We proceed with summarising information using descriptive statistics, such as mean and median. From a historical point of view, we investigate whether men are generally younger than women in late-Victorian London.

Unit 6: Hypothesis Testing

Lecture A: Distributions and Hypothesis Testing

In this section, we move from descriptive to inferential statistics. We assess the statistical ‘significance’ of the gendered differences observed in the previous notebook (on descriptive statistics). We pursue a data-driven and intuitive approach to significance testing. First, We “bootstrap” confidence intervals and then explore permutation for hypothesis testing.

Unit 7: Regression

Lecture A: Correlation and Linear Regression

This session has a closer look at modelling the relation between different variables. The first notebook (click here) discusses how to compute and interpret correlation coefficients and then continue with a gentle introduction to linear regression. The goal is to understand variation in lifespans in late-Victorian London. We try to understand if residents in more affluent boroughs tend to live longer?

Lecture B: Generalised Linear Models

The second notebook on linear regression turns to more advanced techniques: Generalised Linear Models (GLMs). We use GLMs to model and predict count outcomes. We explore two case studies in detail: a) gender bias in university applications and b) gender and participation in the British House of Commons.

Unit 8: Machine Learning (⚠️ Under construction ⚠️)

Lecture A: Supervised Classification

Lecture B: Topic Modelling

Lecture C: Word Vectors