So that’s the methodology I used. Now that I’ve written the whole thing up, there are a few things I wish I’d done differently. I don’t wish this so strongly that I decided to scrap the whole project and start again.[8] But I hope that others will learn from what I’ve done, and to that end I want to be upfront about my mistakes.
First, I should have filtered even more words. There are three kinds of words I wish I’d been more aggressive about filtering out.
- There are some systematic OCR errors in the early articles. I caught ‘anid’, which appears over 3000 times. (It’s almost always meant to be ‘and’, I think.) But I missed ‘aind’, which appears about 1500 times. And there are other less common words that are also OCR errors and should be filtered out.
- I caught a lot of LaTeX words, but somehow missed ‘rightarrow’, as well as a few much rarer words.
- And I caught a lot of words that almost always appear in bibliographies, headers or footers, but missed ‘basil’ (which turns up in a table later) and ‘noûs’ (though I caught ‘nous’).
In general I could have been way more aggressive filtering words like these out.
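To make the three kinds of filtering concrete, here is a minimal Python sketch. It is not the code used for this project. The set names and the `drop_junk` helper are hypothetical, and the sets contain only the tokens mentioned above; a real list would be much longer.

```python
# Hypothetical junk lists. Only the tokens quoted in the text are included.
OCR_ERRORS = {"anid", "aind"}            # systematic OCR misreadings of 'and'
LATEX_TOKENS = {"rightarrow"}            # leftover LaTeX command names
BOILERPLATE = {"basil", "noûs", "nous"}  # bibliography, header and footer words

JUNK = OCR_ERRORS | LATEX_TOKENS | BOILERPLATE


def drop_junk(word_counts):
    """Return a copy of a {word: count} mapping with junk tokens removed."""
    return {w: c for w, c in word_counts.items() if w.lower() not in JUNK}


if __name__ == "__main__":
    article = {"and": 120, "aind": 2, "rightarrow": 1, "knowledge": 14}
    print(drop_junk(article))
```

Keeping the three lists separate makes it easier to see which kind of junk a newly spotted token belongs to, and so to extend each list as more errors turn up.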
But second, I think it was a mistake to filter out words that appear 1-3 times in an article, whatever the article’s length. That makes perfect sense for long articles, and for some long articles you could get rid of words that appear 4 or 5 times as well. But it’s too aggressive for short articles. I needed some kind of rule like filtering out words that make up less than 1 in 2000 of the words in the article. It is important, I think, to filter out the words that appear just once, or else you have to be perfect in catching OCR errors and weird LaTeX code. But after that you need some kind of sliding scale.
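One way such a sliding scale might look is sketched below in Python. This is an illustration of the rule just described, not what was actually run. The function names, the `per_words` and `floor` parameters, and the choice to measure article length by summing the word counts are all assumptions.

```python
def min_count(article_length, per_words=2000, floor=2):
    """Smallest count a word needs in order to be kept.

    A word is kept only if it appears at least once per `per_words` words
    of the article, and never if it appears fewer than `floor` times.
    """
    # Ceiling division in integers, to avoid floating-point rounding.
    scaled = -(-article_length // per_words)
    return max(floor, scaled)


def filter_rare(word_counts, per_words=2000, floor=2):
    """Drop words that are too rare relative to the article's length."""
    # Assumption: article length is the total of the counts passed in.
    total = sum(word_counts.values())
    threshold = min_count(total, per_words, floor)
    return {w: c for w, c in word_counts.items() if c >= threshold}


if __name__ == "__main__":
    for length in (500, 3000, 8000, 12000):
        print(length, min_count(length))
```

With these defaults, an 8000-word article still loses words that appear 1-3 times, a 12000-word article also loses words that appear 4 or 5 times, and a 3000-word article loses only the words that appear once.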
The next three things are much more systematic, though also less clearly errors.
The third problem was that my model selection was too stepwise, and not holistic enough. I found the best 60-topic model I could find. Then I increased the number of topics in it (eventually to 90) until the topics looked as good as they could get, holding fixed the seed number from the search through 60-topic models. Then I ran refinements on it until the refinements looked like they were damaging the model. Then I split some of the topics up for categorisation. What I didn’t do at any step was look back and ask, for example, how the other 60-topic models would look if I applied these adjustments to them.
Now there was a reason for that. Each of those adjustments cost quite a lot of my time, and even more computer time. Doing the best you can at each step and then locking in the result makes the process at least a bit manageable. But I should have been (a) a bit more willing to revisit earlier decisions, and (b) more forward-looking when making each of those intermediate decisions. I was a bit forward-looking at one point; one of my criteria for choosing between 60-topic models was a preference for unwanted conflations over unwanted splits. And that was because I knew I could fix conflations in various ways. But I should have done more of this. And maybe I could have stuck much closer to 60 topics if I had.
The fourth problem was that I didn’t realise how bad a topic Arguments would turn out to be. For the purposes of the kind of study I’m doing, it’s really important that the topics really be topics in the ordinary sense, and not tools or methods. Now this is hard in philosophy, because philosophy is so methodologically self-conscious that there are articles that really are about all the tools and methods you might care about. But I wish I’d avoided making one of them a topic. (I’ll come back in section 8.10 to a formal method one can use for detecting these kinds of topics early in the process.)
The fifth problem, if it is a problem, is that I wasn’t more aggressive about expanding the list of stop words. This model has a topic on Ordinary Language Philosophy. Actually, all the models I built had a topic like this (at least once they had 15 or so topics). But the keywords characteristic of this topic are words that really could have been included on a stop words list. They are words like ‘ask’ and ‘try’. And one side-effect of this is that the model keeps thinking a huge proportion of the articles in the data set are maybe kind of Ordinary Language Philosophy articles.
Another way to put this is that the boundary between a stop word and a contentful word (in this context) is pretty vague. And given that Ordinary Language Philosophy was a thing that happened, and that affected how everyone (at least in the UK) was writing for a while, there is a good case for taking a very expansive understanding of what the stop words were.
The choice I made was to not lean on the scales at all, and just use the most common off-the-shelf list of stop words. And there was a good reason for that; I wanted the model to not simply replicate my prejudices. But I half-think I made the wrong call here, and that the model would be more useful if I had filtered out more ‘ordinary language’.
[8] I went through several cycles of building a model, writing it up, seeing mistakes that way, and restarting the process. This is the model that survived.