1.5 Choosing the Number of Topics

The model-building algorithm automates most of the work; it even chooses what the topics are. But the one thing it doesn’t do is choose how many topics there are. You have to specify that in advance, and it’s a big choice.

In principle, you can give it as few as two topics to work with. If you ask the model to divide all the articles into two groups, it will usually divide them into something like ethics articles and something like M&E articles. I say ‘usually’ because it’s a fairly random process. And about a quarter of the time it will find some other way of dividing the articles in two: earlier articles versus later ones, for example, or articles that look maximally like philosophy of science versus articles that look maximally unlike it. But none of these are helpful models; they tell us more about the nature of the modeling function than about the history of philosophy.
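To make this concrete, here is a minimal sketch in R of what fitting a two-topic model looks like, using the topicmodels package. It is not the pipeline used for this book; the document-term matrix (dtm) and the seed are placeholders.

```r
library(topicmodels)

# 'dtm' is assumed to be a DocumentTermMatrix built from the journal articles.
# Fit a latent Dirichlet allocation model with just two topics.
fit2 <- LDA(dtm, k = 2, control = list(seed = 1234))

# The ten most probable words in each topic. With these journals, the split
# usually looks roughly like ethics on one side and M&E on the other, though
# a different seed can produce a quite different division.
terms(fit2, 10)
```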

The topicmodels package itself comes with a measure that’s intended for this purpose. The ‘perplexity’ function asks the model, in effect, how confused it is by the data once the model has been built.5 The thought is that once you’ve got enough topics, the perplexity score stops improving as you add more. That’s a sign that you’ve reached a natural limit. But it didn’t help here. As far as I could tell, I could have had something like 400 topics and the perplexity score would still have been falling every time I added more. Philosophers are just too idiosyncratic; the topics have to be very fine-grained before the computer is confident that it has the classification of articles into topics right.
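As a rough illustration of the perplexity check described above (and of the held-out variant mentioned in the footnote), the following sketch splits the corpus, fits a model at each candidate number of topics, and computes perplexity on the held-out documents. The 80/20 split, the range of candidate values, and the seed are all illustrative assumptions, not the settings used for this book.

```r
library(topicmodels)

set.seed(1234)
train_idx <- sample(nrow(dtm), size = floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train_idx, ]
dtm_test  <- dtm[-train_idx, ]

ks <- seq(10, 100, by = 10)
perp <- sapply(ks, function(k) {
  fit <- LDA(dtm_train, k = k, control = list(seed = 1234))
  perplexity(fit, newdata = dtm_test)
})

# If the curve keeps falling as k grows, as it did for these journals,
# perplexity alone won't settle the question of how many topics to use.
plot(ks, perp, type = "b",
     xlab = "Number of topics", ylab = "Held-out perplexity")
```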

But a model with 400 topics wouldn’t help anyone. (I did build one such model, and the rest of this paragraph is about why I’m not using it.) On its own, it’s too fine-grained to be useful; I don’t think anyone would actually read it closely. To make the model human-readable, I’d have to bundle the 400 topics into something like the familiar categories: ethics, metaphysics, philosophy of science, and so on. But when I tried to do that, I found just as many edge cases as clear cases. The only data that would come out of this approach that was legible to humans would be a product of my choices, not of the underlying model. And the aim was to get my prejudices out of the system as much as possible.

So I needed something more coarse-grained than the model with the lowest perplexity, but obviously more fine-grained than just two topics. I ended up doing a lot of trial and error, looking at how the models changed as I varied the number of topics. (This seems to be what most people using topic modeling tools end up doing.)
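One way to do that kind of trial-and-error comparison is to fit a model at each candidate number of topics and write out the top words per topic for side-by-side inspection. This is only a sketch of such a workflow; the candidate values of k, the seed, and the file names are placeholders.

```r
library(topicmodels)

for (k in c(40, 60, 90)) {
  fit <- LDA(dtm, k = k, control = list(seed = 1234))

  # Collapse the twelve most probable words for each topic into one line,
  # then write one file per candidate k for manual inspection.
  top_words <- apply(terms(fit, 12), 2, paste, collapse = ", ")
  writeLines(sprintf("Topic %d: %s", seq_len(k), top_words),
             sprintf("topics_k%d.txt", k))
}
```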

When I looked at the models produced with different numbers of topics, I was generally weighing four factors. The first two push towards having more topics; the second two push towards having fewer.

First, how often did the model come up with topics that simply looked disjunctive? The point of the model is to group the articles into n topics, and hopefully each of these topics has a sensible theme. But sometimes the theme is a disjunction - i.e., the topic consists of papers from philosophical debate X and papers from a mostly unrelated debate Y. There are always some of these. Some debates are distinctive enough that the papers in them always cluster together - the model can tell that it shouldn’t be separating them - but small enough (in these twelve journals) that the model doesn’t want to use up a valuable topic on just that debate. Three of these almost always came up: feminism, Freud, and vagueness. If you build a model out of these journals with, say, 40 topics, then it is almost certain that three of the topics you end up with will simply be disjunctive, with one of the disjuncts being one of these three topics. My favourite was an otherwise sensible model that decided one of the topics in philosophy consisted of papers on material constitution and papers on feminist philosophy. Now there are links there - some important feminist theorists spend a lot of effort carefully distinguishing causation from constitution - but it’s really a disjunctive topic. And the fewer topics you have, the more disjunctive topics you get. So it’s good to get rid of disjunctions, and that’s a reason to increase the number of topics.

Second, how often did the model make divisions that cross-cut familiar disciplinary boundaries? Some such divisions are unavoidable, and the model I use ends up with a lot of them. But in the first instance I’d prefer, for example, a model that separates papers on the metaphysics of causation from papers on the semantics of counterfactuals to one that puts them together. The debates are obviously closely related - but there was a big advantage to me if they were separated. If they were, then measuring how prominent Metaphysics is in the journals becomes one step easier, and so does measuring how prominent Philosophy of Language is. So I preferred models that split them up.

Third, how often did the model divide up debates not in terms of what question they were asking, but in terms of what answers they were giving (or at least taking seriously)? For instance, sometimes the model would decide to split up work on causation into, roughly, the papers that did and the papers that did not take counterfactuals to be central to understanding causation. This tracked pretty closely (but not perfectly) the division into papers before and after David Lewis’s paper ‘Causation’ (Lewis 1973). (Though, amusingly, models that made this division usually put Lewis’s own paper into the pre-Lewisian category; which makes sense, since most of that paper is about theories of causation that had come before.) This seemed bad - we want a division into topics, and different answers to the same question shouldn’t count as different topics.

Fourth, how often did the model make divisions that only specialists would understand? A bunch of models I looked at divided up, for instance, the philosophy of biology articles along dimensions that I, a non-specialist, couldn’t see the reason behind. The point is not that there are no real divisions there, or that the model was in any sense wrong. It’s rather that I want the model to be useful to people across philosophy, and if non-experts can’t see what the difference is between two topics just by looking at the headline data about them, then the model isn’t serving its function.

Still, after a lot of trial and error, it seemed like the best balance between these four criteria was hit at around 60 topics. This isn’t to say it was perfect. For one thing, even with a fixed number of topics, different model runs produce very different models, and as I’ll discuss in the next section, we have to choose between them. For another, the optimal balance between these criteria would come at different points in different fields. So perhaps at 48 topics you’d see a pretty good balance between these criteria within ethics (broadly construed), but it might be double that before you saw the right balance in philosophy of mind. So there are a lot of trade-offs, as you might expect given that we’re trying to detect trends in the absence of anything like clear boundary lines.
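As the next section will discuss, different runs at a fixed number of topics then have to be compared and chosen between. A sketch of generating such candidate runs, where the seeds and k = 60 are illustrative assumptions:

```r
library(topicmodels)

# Fit three models with the same number of topics but different seeds.
fits <- lapply(c(101, 202, 303), function(s) {
  LDA(dtm, k = 60, control = list(seed = s))
})

# Compare the runs by looking at the top words for a few topics in each.
lapply(fits, function(fit) terms(fit, 8)[, 1:5])
```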

But you might notice something odd at this stage. I said that the best balance came at around 60 topics, yet the model I’ve based the book on has 90 topics. How I got to that model involves yet more choices. I think each of the choices I made was defensible, but the reason this chapter is so long is that there really were quite a lot of choices, and I think it’s worthwhile to lay them all out.


  5. You can ask it this about the data that was used to build the model, or hold back some of the data from the model-building stage and compute the perplexity on the held-back data. The second probably makes more sense theoretically, but it didn’t make a huge difference here.