The model-building algorithm automates most of the work; it even chooses what the topics are. But the one thing it doesn’t do is choose how many topics there are. That has to be specified in advance. And it’s a big choice.
In principle, it can be given as few as two topics to work with. If the model is asked to divide all the articles into two groups, it will usually divide them into something like ethics articles and something like metaphysics and epistemology articles. I say “usually” because it’s a fairly random process. And about a quarter of the time, it will find some other way of dividing the articles in two, such as earlier or later, or perhaps things that look maximally like philosophy of science and maximally unlike philosophy of science. But none of these are helpful models; they say more about the nature of the modeling function than they say about the history of philosophy.
The topicmodels package itself comes with a measure that’s intended to be used for this purpose. The “perplexity” function asks the model, in effect, how confused it is by the data once it has built the model.[5] The thought is that once there are too many topics, the perplexity score won’t change as more topics are added. That’s a sign that a natural limit has been reached. But it didn’t help here. As far as I could tell, I could have had something like four hundred topics and the perplexity score would still have fallen every time I added more topics. Philosophers are just too idiosyncratic, and topics need to be fine-grained before the computer is comfortable thinking it has the classifications of articles into topics right.
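To make the plateau idea concrete, here is a minimal sketch in Python (not the R code behind the book, and not the topicmodels API). It assumes the standard definition of perplexity as the exponential of the negative log-likelihood per token. The function names, the 1 percent tolerance, and both sets of scores are hypothetical, chosen only to show the difference between a curve that levels off and one that keeps falling.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Perplexity as exp(-log-likelihood per token); lower means less 'confused'."""
    return math.exp(-log_likelihood / n_tokens)

def plateau_point(scores, tolerance=0.01):
    """Return the smallest topic count after which adding topics improves
    perplexity by less than `tolerance` (as a fraction), or None if the
    scores never level off."""
    ks = sorted(scores)
    for smaller, larger in zip(ks, ks[1:]):
        improvement = (scores[smaller] - scores[larger]) / scores[smaller]
        if improvement < tolerance:
            return smaller
    return None

# Hypothetical scores keyed by number of topics (illustration only, not real data).
levelling_off = {20: 1500.0, 40: 1300.0, 60: 1295.0, 80: 1294.0}
still_falling = {100: 1400.0, 200: 1250.0, 300: 1150.0, 400: 1080.0}

print(plateau_point(levelling_off))
print(plateau_point(still_falling))
```

With the first set of scores the check stops at the point where extra topics no longer help. With the second it returns nothing, which is the situation described above: the score keeps falling, so it offers no natural stopping point.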
But a model with four hundred topics wouldn’t help anyone. (I did build one such model, and the rest of this paragraph is about why I’m not using it.) On its own, it’s too fine-grained to be useful. I don’t think anyone would actually read it closely. To make the model human readable, I’d have to bundle the four hundred topics into familiar categories (e.g., ethics, metaphysics, philosophy of science). But when I tried to do that, I found just as many edge cases as clear cases. The only data that would come out of this approach that would be legible to humans would be a product of my choices, not the underlying model. And the aim was to get my prejudices out of the system as much as possible.
I needed something more coarse-grained than the model with lowest perplexity but obviously more fine-grained than simply two topics. I ended up doing a lot of trial and error, building models with different numbers of topics and looking at what each one came up with. (This feels like the thing that most people using topic-modeling tools end up doing.)
When I looked at the models that were produced with different numbers of topics, I was generally looking at four factors, which I will describe in detail. The first two factors push toward more and more topics. The last two put downward pressure on the number of topics.
First, how often did the model come up with topics that simply looked disjunctive? The point of the model is to group the articles into n topics, and hopefully each of these topics has a sensible theme. But sometimes the theme is a disjunction (i.e., the topic consists of papers from philosophical debate X and papers from mostly unrelated debate Y). There are always some of these. Some debates are distinctive enough that the papers within that topic always cluster together, since the model can tell that it shouldn’t be separating them, but small enough (in these twelve journals) that the model doesn’t want to use up a valuable topic on just that debate. There were three of these that almost always came up: feminism, Freud, and vagueness. If a model is built out of these journals with, say, forty topics, then it is almost certain that three of the topics are simply disjunctive, with one of the disjuncts being one of these three topics. My favorite was an otherwise sensible model that decided one of the topics in philosophy consisted of papers on material constitution and papers on feminist philosophy. Now there are links there (some important feminist theories carefully distinguish causation from constitution), but it’s really a disjunctive topic. And the fewer topics there are, the more disjunctive topics you get. It’s good to get rid of disjunctions, and that’s a reason to increase the number of topics.
Second, how often did the model make divisions that cross-cut familiar disciplinary boundaries? Some such divisions are unavoidable, and the model I use ends up with a lot of them. But in the first instance I’d prefer, for example, a model that separates papers on the metaphysics of causation from papers on the semantics of counterfactuals to a model that puts them together. The debates are obviously closely related, but there was a big advantage to me if they were separated. If they were, then measuring how prominent metaphysics is in the journals becomes one step easier, as does measuring how prominent philosophy of language is. So I’d rather have models that split them up.
Third, how often did the model divide up debates not in terms of what question they were asking but in terms of what answers they were giving (or at least taking seriously)? For instance, sometimes the model would decide to split up work on causation into, roughly, those papers that did and those that did not take counterfactuals as central to understanding causation. This tracked pretty closely (but not perfectly) the division into papers before and after David Lewis’s paper “Causation” (D. Lewis 1973). (Though, amusingly, models that made this division usually put Lewis’s own paper into the pre-Lewisian category, which makes sense since most of that paper is about theories of causation that had come before.) This seemed bad: division should be into topics, and different answers to the same question shouldn’t count as different topics.
Fourth, how often did the model make divisions that only specialists would understand? A bunch of models I looked at divided up, for instance, the philosophy of biology articles along dimensions that I, a nonspecialist, couldn’t see the reason behind. The point of this is not that there are no real divisions there, or that the model was in any sense wrong. It’s rather that I want the model to be useful to people across philosophy, and if nonexperts can’t see what the difference is between two topics just by looking at the headline data about the topic, then it isn’t serving its function.
Still, after a lot of trial and error, it seemed like the best balance between these four criteria was hit at around sixty topics. This isn’t to say it was perfect. For one thing, even with a fixed number of topics, different model runs produce very different models, and as I’ll discuss in the next section, I have to choose between them. For another, the optimal balance between these criteria would come at different points in different fields. So perhaps at forty-eight topics a pretty good balance between these criteria within ethics (broadly construed) would be seen, but it might be double that before seeing the right balance in philosophy of mind. There are a lot of trade-offs, as might be expected given that I’m trying to detect trends in the absence of anything like clear boundary lines.
But something odd might be noticed at this stage. I said that I got the best balance at around sixty topics. Yet the model I’ve based the book on has ninety topics. How I got to that model involves yet more choices. I think each of the choices I made was defensible, but the reason this chapter is so long is that there really were quite a lot of choices, and I think it’s worthwhile to lay them all out.
[5] One can ask the model how confused it is about the data that was used to build the model, or hold back some of the data from the model-building stage and ask how confused it is by the held-back data. The second probably makes more sense theoretically, but it didn’t make a huge difference here.
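The two ways of measuring perplexity can be sketched as follows. This is an illustration under loud assumptions: a smoothed word-frequency model stands in for the topic model, the six-word vocabulary and four tiny “articles” are invented, and none of the names correspond to functions in topicmodels. The only point is the shape of the comparison: fit on one set of documents, then score both that set and a held-back set.

```python
import math
from collections import Counter

def fit_unigram(docs, vocab):
    """Add-one-smoothed word probabilities: a stand-in for a topic model."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values()) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}

def perplexity(model, docs):
    """exp of the negative average log-probability per token."""
    tokens = [word for doc in docs for word in doc]
    log_likelihood = sum(math.log(model[word]) for word in tokens)
    return math.exp(-log_likelihood / len(tokens))

# Invented toy "articles" (hypothetical data, for illustration only).
docs = [
    ["cause", "effect", "cause", "law"],
    ["duty", "virtue", "duty", "harm"],
    ["cause", "law", "effect", "effect"],
    ["virtue", "harm", "duty", "virtue"],
]
vocab = {word for doc in docs for word in doc}

train, held_out = docs[:3], docs[3:]   # hold back the last article
model = fit_unigram(train, vocab)

in_sample = perplexity(model, train)         # data the model was built from
out_of_sample = perplexity(model, held_out)  # data the model never saw

print(in_sample, out_of_sample)
```

On this toy data the held-back article is more surprising to the model than the articles it was built from. How large that gap is in practice depends on the corpus, and as the footnote says, it was not large enough to matter here.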