1 Methodology

The point of this chapter is to explain the choices I made in building the model that the book is based around. But to understand those choices, it helps to know a little bit about what a Latent Dirichlet Allocation (LDA) model does.

The inputs to the model are some texts, and a number. The model doesn’t care about the ordering of words in the texts, so really the input isn’t the texts themselves, but a list of lists of ordered pairs. Each ordered pair is a word and a number. In the version I’m using, the outer list is a list of philosophy articles, and each element of that list is a list of the words in that article, along with the number of times each word appears.
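To make that format concrete, here is a minimal sketch of how texts get turned into that list-of-lists-of-pairs structure. It uses Python and the gensim library, which is an assumption for the sake of illustration, not a record of the pipeline behind the book’s model.

```python
from gensim.corpora import Dictionary

# Toy stand-ins for tokenised journal articles.
articles = [
    ["kant", "categorical", "imperative", "kant", "duty"],
    ["composition", "constitution", "statue", "clay", "composition"],
]

# Map each distinct word to an integer id.
dictionary = Dictionary(articles)

# Each article becomes a list of (word id, count) pairs - the
# "list of lists of ordered pairs" described above.
corpus = [dictionary.doc2bow(article) for article in articles]
print(corpus)
# e.g. [[(0, 1), (1, 1), (2, 1), (3, 2)], [(4, 1), (5, 2), (6, 1), (7, 1)]]
```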

Along with that, you give the model a number. This is the number of topics that you want the model to divide the texts into. I’ll call this number \(t\) in this introduction. And intuitively there is a function \(T\) that maps articles into the \(t\) topics.
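Continuing the sketch above, the number of topics is simply one more argument handed to the model when it is built. Again, gensim is an assumption for illustration, and the value of \(t\) below is a placeholder, not the number used for the book’s model.

```python
from gensim.models import LdaModel

# Continuing the sketch above: corpus and dictionary were built there.
t = 10  # placeholder number of topics, not the book's actual choice
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=t)
```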

What the model outputs is, for our purposes, a pair of probability functions - one for articles, and one for words.

The probability function for articles gives, for each article \(a\) and topic number \(n \in \{1, \dots, t\}\), a probability for \(T(a) = n\). That is, it gives a probability that the article is in that topic. Notably, it doesn’t identify the topics with anything more than numbers. I’m going to give names to the topics - this one is Kant, this one is Composition and Constitution, etc. - but the model doesn’t do that. For it, the topics really are just integers between 1 and \(t\).

The probability function for words gives, for each word \(w\) from any of the articles, and topic number \(n \in \{1, \dots, t\}\), the probability that a randomly chosen word from the articles in that topic is \(w\). So in the Kant topic, the probability that a randomly chosen word is ‘Kant’ is about 0.14.
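To show what those two functions look like in practice, here is a continuation of the earlier sketch. The library (gensim) and all the printed numbers are illustrative assumptions, not a record of how the book’s model was queried.

```python
# Continuing the sketch above: lda is a fitted model, corpus and
# dictionary are as before. The printed numbers are illustrative.

# The probability function for articles: for an article a, a list of
# pairs (topic number n, probability that T(a) = n).
first_article = corpus[0]
print(lda.get_document_topics(first_article, minimum_probability=0.0))
# e.g. [(0, 0.41), (1, 0.07), (2, 0.02), ...]

# The probability function for words: for a topic n, the most probable
# words, and for each the probability that a randomly chosen word from
# an article in that topic is that word.
print(lda.show_topic(0, topn=5))
# e.g. [('kant', 0.14), ('imperative', 0.03), ...]
```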

That number feels absurdly high, but it makes sense for a couple of reasons. One is that to make the models compile in even semi-reasonable time, I filtered out a lot of words. What it’s really saying is that the word ‘Kant’ produces about 1/7 of the tokens that remain. The other is that what it’s really giving you here is the probability that a random word in an article is ‘Kant’ conditional on the probability of that article being in the Kant topic being 1. And in fact the model is never that confident. Even for articles that you or I would say are unambiguously articles about Kant, the model is rarely more than 40% confident that that’s what they are about. And this is for a good reason. Most articles about Kant in philosophy journals are, naturally enough, about Kantian philosophy. And any part of Kantian philosophy is, well, philosophy. So the model has a topic on Beauty, and when it sees an article on Kantian aesthetics, it gives some probability to the correct classification of that article being the topic on Beauty. So the word probabilities are quite abstract things - they are something like word frequencies in a certain kind of stereotyped article. What the model really wants to do is find \(t\) such stereotypes such that each article we started with is a linear mixture of the stereotypes.

The way the model approaches this goal is by building two probability functions, checking how well they cohere, and recursively refining them in places where they don’t cohere. The model wants, as much as possible, the word frequencies in the input to match the modelled frequencies you get by looking at the probability an article is in a particular topic, combined with the word frequencies for each of those topics. This is strictly speaking impossible; there aren’t enough degrees of freedom. But it can minimise the error, and it does so recursively.
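Put a little more formally (this is just a gloss on the description in the last two paragraphs, not the model’s official definition), the modelled frequency of a word \(w\) in an article \(a\) is a weighted average of the stereotypes’ word frequencies, weighted by how probable it is that the article is in each topic:

\[
\Pr(w \mid a) \approx \sum_{n=1}^{t} \Pr(T(a) = n) \cdot \Pr(w \mid T(a) = n)
\]

The model’s job is to choose the two probability functions so that the left-hand side, as estimated from the actual word counts, comes as close as it can to the right-hand side, for every word and every article.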

The process involved is slow. I was able to build all the models I’ll discuss on personal computers, but it takes some processing time. The particular model I’m primarily using took about 20 hours to build, but I ran through many more hours than that building other models to compare it to.

And the process is very path dependent. The algorithm, like many algorithms, has the basic structure of picking a somewhat random starting point, then looking for a local equilibrium. The equilibrium you end up at is heavily dependent on how you start, and somewhat dependent on how you travel.

The point of this chapter is to describe how I chose the inputs to the model I ended up using, and then how I set various parameters within the model. The parameters are primarily, in terms of the metaphor of the previous paragraph, the starting point of the search, and how long the search should go before we decide we’re at something close enough to an equilibrium.
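To give a sense of what those parameters look like in practice, here is what they correspond to in the gensim sketch used above. The library and the particular values are assumptions for illustration, not the settings behind the book’s model: the random seed fixes the starting point of the search, and the pass and iteration counts fix how long it runs before stopping.

```python
from gensim.models import LdaModel

# Continuing the earlier sketch: corpus, dictionary and t as before.
# All values below are placeholders, not the book's actual settings.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=t,
    random_state=42,   # fixes the somewhat random starting point
    passes=10,         # how many sweeps through the whole corpus
    iterations=400,    # how long to refine each article's estimate per sweep
)
```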

The inputs are more complex. Very roughly, the inputs I used are the frequently occurring contentful words from research articles in twelve important philosophy journals. I’ll start by talking about how and why I selected the particular twelve journals that I did.