1.6 Choosing Between The Models

Even once the number of topics is set, there are still many ways the model can vary. Building a model starts with a somewhat random assignment of words and articles to topics, followed by a series of steps (themselves each involving a degree of randomisation) towards a local equilibrium. But there is a lot of path dependency in this process, as there always is in finding a local equilibrium.

Rather than walk through the mathematics of why this is so, I find it more helpful to think about what the model is trying to achieve, and why it is such a hard thing to achieve. Let’s just focus on one subject matter in philosophy, friendship, and think about how we could classify it if we’re trying to divide all of philosophy up into 60-90 topics.

It’s too small a subject matter to be its own topic. We’ll do best if the topics are roughly equal in size, and discussions that are primarily about friendship are, I’d guess, about 0.001 to 0.002 of the articles in these twelve journals. That’s an order of magnitude short of being its own topic. So it has to be grouped in with neighbouring subjects. But which ones? For some subjects, the problem is that there aren’t enough natural neighbours. This is why the models never quite know what to do with vagueness, or feminism, or Freud. But here the problem is that there are too many.

One natural enough thing to do is to group papers on friendship with papers on love, and both of them with papers on other emotions, or perhaps with papers on other reactive attitudes. That gives you a nice set of papers about aspects of the mental lives of humans that are central to actually being human, but not obviously well captured by simple belief-desire models.

Another natural thing to do is to group papers on friendship with papers on families, and perhaps include both of them in broader discussions of ways in which special connections to particular others should be accounted for in a good ethical theory. Again, you get a reasonably nice set of papers here, with the general theme of special connections to others.

Or yet another natural thing to do is to group papers on friendship with papers on cooperation. And once you’re thinking about cooperation, the natural paper to center the topic around is Michael Bratman’s very highly cited paper ‘Shared Cooperative Activity’. From there, there are a few different ways you could go. You could expand the topic to Bratman’s work on intention more broadly, and the literature it has spawned. Or you could expand it to include other work on group action, and even perhaps on group agency. (I teach that Bratman paper in a course on groups and choices, which is centered around game theory. Though I think getting from friendship to game theory in a single one of our 60-90 topics would be a step too far.)

Which of these is right? Well, I saw all of them when I ran the algorithm enough times. And they all seem like sensible choices to me. How should we choose which model to use when different models draw such different boundaries within the space of articles? A tempting thought is to see which one looks most like what one thinks philosophy really looks like, and choose it. But now we’re back to imposing our prejudices on the model, rather than letting the model teach us something about the discipline.

A better thing to do is to run the model a bunch of times, and find the one that most commonly appears. Intuitively, we’re looking for an equilibrium, and there’s something to be said for picking the equilibrium with the largest basin of attraction. This is more or less what I did, though there are two problems.

The first is that ‘run the model a bunch of times’ is easier said than done. On the computers I was using (pretty good personal computers), it took about 8 hours to come up with a model with 60 topics. So running a bunch of them to find an average was a bit of work. (The University of Michigan has a good unit for doing intensive computing jobs like this. But I kept feeling I was close enough to being done that running things on my own devices was less work than setting up an account there. This ended up being a bad mistake.) But I could just let them run overnight every night for a couple of weeks, and eventually I had 16 60-topic models to average out.

The models are distinguished by their seed. This is a number that you can specify to seed the random number generator. The intended use of it is to make it possible to replicate work like this that relies on randomisation. But it also means that we can run a bunch of models, then make slight changes to the one that seems most representative. And that’s what I ended up doing. The seeds I used at this stage were famous dates from the revolutions of 1848. And to get ahead of ourselves, the model the book is based around has seed value 22031848, the date of both the end of the Five Days of Milan, and of the start of the Venetian Revolution.6
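To illustrate the role of the seed, here is a minimal sketch (in Python, which is not necessarily what the actual models were built with): seeding the random number generator makes the otherwise random initial assignment of words to topics exactly repeatable, so a run can be reproduced, or tweaked, later. The function name and word list are purely illustrative.

```python
import random

def initial_assignment(words, num_topics, seed):
    # Seeding the generator makes the otherwise random initial
    # word-to-topic assignment exactly reproducible.
    rng = random.Random(seed)
    return {w: rng.randrange(num_topics) for w in words}

words = ["friendship", "love", "cooperation", "intention"]

run_a = initial_assignment(words, 60, seed=22031848)
run_b = initial_assignment(words, 60, seed=22031848)
run_c = initial_assignment(words, 60, seed=14071789)

# Same seed: byte-for-byte the same starting point.
assert run_a == run_b
# A different seed almost certainly gives a different starting
# point, and hence (after many randomised steps) a different
# local equilibrium.
```

The same idea scales up: the real model-building steps are also randomised, so fixing the seed fixes the whole path the model takes, not just its starting point.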

The second is that it isn’t obvious how to average them. At one level, what the model produces is a giant probability function. And there is a lot of literature on how to merge probability functions into a single function, or (more or less equivalently), how to find the most representative of a set of probability functions. But this literature assumes that the probability functions are defined over (more or less) the same possibility spaces. And that’s precisely what isn’t true here. When you build one of these models, what you’re left with is a giant probability function all right. But no two model runs give you a function over the same space. Indeed, the most interesting thing about any model is what space it decides is most relevant. So the standard tools for merging probability functions don’t apply.

What I did instead was look for two things.

The model doesn’t just say that an article goes in a topic. It says that the article goes in that topic with probability p. Indeed, it gives a non-zero probability to each article being in each topic. So one thing you can look at for a model is which articles it thinks have the highest probability of being in any given topic. That is, roughly speaking, which articles it thinks are the paradigms of the different topics it discovers. Then, across a range of models, you can ask how much each model agrees with the others about which articles are the paradigms. So, for instance, you can find the 10 articles with the highest probability of being in each of the 60 topics. And then you can ask, of the 600 articles that this model thinks are the clearest instances of a particular topic, how many are also among the 600 articles that other models think are the paradigms of a particular topic. So that was one of the things I looked for: which models had canonical articles that were also canonical articles in a lot of other models.
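That overlap check can be sketched as follows. This is not the code I actually used; it is a toy illustration that assumes each model is represented as a mapping from articles to their topic-probability vectors, and the names `paradigm_articles` and `agreement` are my own for this sketch.

```python
import heapq

def paradigm_articles(doc_topic_probs, top_n=10):
    """For each topic, take the top_n articles with the highest
    probability of belonging to it; return them as one flat set."""
    num_topics = len(next(iter(doc_topic_probs.values())))
    paradigms = set()
    for t in range(num_topics):
        best = heapq.nlargest(top_n, doc_topic_probs,
                              key=lambda doc: doc_topic_probs[doc][t])
        paradigms.update(best)
    return paradigms

def agreement(model_a, model_b, top_n=10):
    """Fraction of model_a's paradigm articles that are also
    paradigm articles (of some topic or other) in model_b."""
    a = paradigm_articles(model_a, top_n)
    b = paradigm_articles(model_b, top_n)
    return len(a & b) / len(a)

# Two toy "models" over four articles and two topics.
model_a = {"d1": [0.9, 0.1], "d2": [0.8, 0.2],
           "d3": [0.1, 0.9], "d4": [0.2, 0.8]}
model_b = {"d1": [0.85, 0.15], "d2": [0.1, 0.9],
           "d3": [0.6, 0.4], "d4": [0.3, 0.7]}

score = agreement(model_a, model_b, top_n=1)  # → 0.5
```

Note that the comparison deliberately ignores topic labels: it only asks whether an article is a paradigm of *some* topic in both models, since topic numbering is arbitrary across runs.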

The models don’t just give you probabilistic judgments of an article being in a particular topic; they give you probabilistic judgments of a word being in an article in that topic. So the model might say that the probability of the word ‘Kant’ turning up in an article in topic 25 is 0.1, while the probability of it turning up in most other topics is more like 0.001. That tells you that topic 25 is about Kant, but it also tells you that the model thinks ‘Kant’ is a keyword for a topic. Since some words will turn up frequently in a lot of topics no matter what, you have to focus here not just on the raw probabilities (like the 0.1 above), but on the ratio between the probability of a word being in one topic and the probability of it being in others. That tells you how characteristic the word is of the topic. And again you can use this trick to find the 600 characteristic words of a particular model, and ask how often those 600 words are characteristic words of the other models as well. There is a lot of overlap here: the vast majority of models have a topic where ‘Aristotle’ is a characteristic word in this sense, for example. But there are also idiosyncrasies, and the models with the fewest idiosyncrasies seem like better bets for being more representative. So that was another thing I looked for: which models had keywords that were also keywords in a lot of other models.
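The ratio test can be sketched the same way. Again, this is an illustration rather than the actual code, assuming each topic comes with a word-probability table; the small floor value is there to avoid dividing by zero for words a topic never uses.

```python
def characteristic_words(topic_word_probs, top_n=10):
    """topic_word_probs: dict mapping topic id -> {word: probability}.
    A word is characteristic of a topic when its probability there is
    high *relative* to its average probability in the other topics."""
    topics = list(topic_word_probs)
    chars = {}
    for t in topics:
        others = [topic_word_probs[u] for u in topics if u != t]
        def score(word):
            # Floor of 1e-9 guards against division by zero for
            # words that never appear in the other topics.
            avg_elsewhere = (sum(o.get(word, 1e-9) for o in others)
                             / len(others))
            return topic_word_probs[t][word] / avg_elsewhere
        ranked = sorted(topic_word_probs[t], key=score, reverse=True)
        chars[t] = ranked[:top_n]
    return chars

# Toy example: 'kant' is common in topic 25 and rare elsewhere,
# while 'the' is equally common everywhere.
topics = {25: {"kant": 0.1, "the": 0.3},
          7:  {"kant": 0.001, "the": 0.3},
          8:  {"kant": 0.002, "the": 0.3}}

chars = characteristic_words(topics, top_n=1)  # chars[25] == ["kant"]
```

The point of the ratio, rather than the raw probability, is visible in the toy data: ‘the’ has a higher raw probability in topic 25 than ‘kant’ does, but its ratio is 1, so it is characteristic of nothing.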

The problem was that these two approaches (and a couple of variations of them that I tried) didn’t really pick out a unique model. They told me that three of the models were better than the others, but not really which of those three was best. So I chose one in particular. Partially this was because I could convince myself it was a bit better on the two representativeness tests from the last two paragraphs, though honestly the other two would have done just as well. Partially it was because it did better on the four criteria from the previous section. But largely it was because the flaws it had all seemed to go one way; they were all flaws where the model failed to make distinctions I felt it should be making. The other models had a mix: some missing distinctions, but also some needless distinctions. And I felt at the time that having all the errors go one way was a good thing. All I had to do now was run the same model with slightly more topics, and I’d have a really good model. And that sort of worked, though it was more complicated than I’d hoped.


  1. Why 1848 and not some other historical event? Well, I had originally been using dates from the French Revolution. But I made so many mistakes that I had to start again. In particular, I didn’t learn how many words I needed to filter out, and how many articles I needed to filter out, until I saw how much they were distorting the models. And by that stage I had so many files with names starting with 14071789 and the like that I needed a clean break. So 1848, with all its wins and all its losses, it was.↩︎