5.1 Two Failed Attempts
The first thing I tried was to look at the correlations between topics. That is, for any two topics, measure the correlation between the probability the model assigns to each article being in the first topic, and the probability it assigns to each article being in the second topic. If the topics are part of a common category, this should be reasonably high.
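As a minimal sketch of that calculation, assume the model's output is available as a matrix with one row per article and one column per topic. The function name `topic_correlations` and the numbers in `example` are hypothetical, made up purely for illustration; they are not the data or code behind the results reported here.

```python
import numpy as np

def topic_correlations(doc_topic):
    """Pearson correlation between every pair of topic columns.

    doc_topic: array of shape (n_articles, n_topics); entry [i, j] is the
    probability the model assigns to article i being in topic j.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    # rowvar=False treats each column (topic) as a variable and each
    # row (article) as an observation.
    return np.corrcoef(doc_topic, rowvar=False)

# Made-up probabilities for 5 articles and 3 topics (illustrative only).
example = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.50, 0.40, 0.10],
])

corr = topic_correlations(example)
# corr[a, b] is the correlation between topic a and topic b across articles.
```

In this toy data the first two topics tend to get weight in the same articles, so their correlation is positive, while the third topic is negatively correlated with the first.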
There are some interesting results from looking at the data this way, and I’ll talk about them more later. But it doesn’t work well as a way of generating categories. For one thing, there are too many false positives. For another, the approach fails just when you need it most: at telling apart topics that are intuitively on the border between two categories. I did rely on correlations at one point below, but mostly it was a bad idea.
So then I tried simply categorising the topics by hand. And this got me a lot of the way, but there were just too many hard cases for it to be reliable. That said, thinking about how to categorise the topics by hand led to two crucial realisations.
First, some of the topics seem so disjunctive that they don’t fit naturally into any category, but they do seem like they should be divisible in a way that makes them easier to classify.
Second, it’s important not to get too ‘realist’ about what we’re trying to do here. If you sit down and ask yourself, “Is this topic really in category X or category Y?”, you can end up making the following bad mistake. You can decide over and over again that while it’s a close call, it’s really category X. And even if every one of those answers is defensible, the conjunction of them is not.
The aim here is not to match some Platonic ideal of correct classification. The aim is to tell a story about what happened to philosophy over time. And if every close call gets decided the same way, that story won’t be any good.
In sporting terms, this is a case where you really need to have ‘make-up calls’. If we want to track how well Philosophy of Science, say, was represented in these journals, then about half the close calls involving whether to put a topic in the Philosophy of Science category should be resolved in favour of saying it is in that category. That principle is something I’ll come back to a few times in what follows.