1.7 Two Refinements

So now I had a model, with 60 topics, that looked good but not quite right. And, by design, there was a natural way to fix the problems: just add topics. It turns out that if you keep the seed the same, and just give the model more topics to play with, it makes very few changes. Or, to be a bit more precise, it makes very few changes apart from permuting the topic numbers. So if you build two models with the same seed, and the second has one more topic than the first, the vast majority of topics in the first model will have a ‘matching’ topic in the second model. By ‘matching’ topic I mean that the correlation between the probabilities the two models give to articles being in those topics is very high, above 0.99 or so. Matching topics won’t always have the same number, so they aren’t always easy to find. But by simply looking at the correlations between all pairs of topics (one from each model), they usually jumped out.
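To make that concrete, here’s a minimal sketch of how the comparison can go, assuming two fitted models with the (hypothetical) names lda_60 and lda_61, built from the same document-term matrix with the same seed. The posterior() function in topicmodels gives the per-article topic probabilities.

```r
library(topicmodels)

# Per-article topic probabilities for each model (one row per article,
# one column per topic).
gamma_60 <- posterior(lda_60)$topics
gamma_61 <- posterior(lda_61)$topics

# Correlate every topic in the first model with every topic in the second.
topic_cor <- cor(gamma_60, gamma_61)   # a 60 x 61 matrix

# For each topic in the first model, find its best match in the second.
matches <- data.frame(
  old_topic   = seq_len(ncol(gamma_60)),
  new_topic   = apply(topic_cor, 1, which.max),
  correlation = apply(topic_cor, 1, max)
)

# Correlations above 0.99 or so count as 'matches'.
subset(matches, correlation > 0.99)
```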

That meant that every time a few topics were added, it was possible to simply look at the new topics and ask whether they were improvements. In an earlier attempt at this project, one that was fatally undermined by not filtering out enough LaTeX and bibliographic words, this had led to a clear optimum arising around 70 topics. And that’s what I expected this time. But it didn’t happen.

Instead what happened was that as I kept adding topics, the model kept (a) finding relatively sensible new topics to add, and (b) not splitting up the topics I really hoped it would split. This was something of a disappointment - the project would have been more manageable for me if the model had found an optimal number of topics in the low 70s or lower. But it simply didn’t; by the standards I’d set before looking at the models, they just kept getting better as the number of topics got higher.
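(For what it’s worth, building that series of models is just a loop over topic counts with the seed held fixed. Here’s a sketch, assuming all_dtm is the DTM we started with; the other names are mine.)

```r
# Fit a family of models that differ only in the number of topics.
# With the seed held fixed, successive models mostly agree, so only
# the genuinely new topics need inspecting.
ks <- seq(60, 90, by = 5)
models <- lapply(ks, function(k) {
  LDA(all_dtm, k = k, control = list(seed = 22031848, verbose = 1))
})
names(models) <- ks
```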

Eventually I settled on 90 topics. That was a bit more than I wanted, and I could have gone even higher. But things were starting to get a little more fine-grained than I wanted - the model already had three distinct topics in philosophy of biology, for example. Still, the runs where I asked for 96 topics, and then for 100, weren’t clearly worse than the one with 90 (by the standards I’d set myself). So stopping at 90 was somewhat arbitrary.

Once I had the 90-topic model, it still wasn’t perfect. There were a few places where it looked like the model had put some things in very odd spots. Some of this remains in the finished product - the model bundles together some work on probability and coherence with historical work on Hume, and puts one half of the Freud papers with Medical Ethics and the other half with Intention. But at this stage there were more of these overlaps than I liked.

So I relied on one last feature of the topicmodels package. The algorithm doesn’t stop when it reaches an equilibrium; it stops when it sees insufficient progress towards equilibrium. One thing you can do is adjust what counts as ‘insufficient’, but I found this hard to control. A related approach is to start not with a random distribution, but with a finished model, and then ask the algorithm to approach equilibrium from that starting point. It won’t move very far; the model was finished to start with. But it will end up with a model that it likes slightly better. (It will, for example, have a lower perplexity score.) I’ll call the resulting model a refinement.
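In code, a single refinement step looks something like this (a sketch, assuming lda_90 is my name for the finished 90-topic model; the exact call I used is in the footnote below). The perplexity() function is the package’s own measure of fit, and lower is better.

```r
# Re-run the fit, but initialise from the finished model rather than
# from a random distribution. The algorithm won't move far, but the
# result should fit slightly better.
refined <- LDA(all_dtm, k = 90, model = lda_90,
               control = list(seed = 22031848, initialize = "model"))

# The refinement should lower the perplexity score a little.
perplexity(lda_90)
perplexity(refined)
```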

The refinement process takes a model as input and returns a model as output, so it can be iterated.¹ And at this stage I had a clever thought. Since the refinement process improves the model, and it can be iterated, I should just iterate it as often as I could to get a better and better model. At the back of my mind I had two worries at this point. One was that this was a bit like tightening a string: if you do it too much, it will just snap. The other was that I had lost my mind, and was fretting about mathematical models of large text libraries using half-baked metaphors drawn from the physics of everyday objects.

Reader, it snapped.

After 100 iterations, the model ended up making an interesting, and amusing, mistake.

One signature problem with the kind of text mining I’m doing is that it can’t tell the difference between a change of vocabulary that is the result of a change in subject matter, and a change of vocabulary that is the result of a change in verbal fashions. If you build these kinds of models with almost any parameter settings, you’ll get a distinctive topic (or two) for ordinary language philosophy. Why? Because the language of the ordinary language philosophers was so distinctive. That’s not great, but it’s unavoidable. Ideally, that would be the only such topic. And one of the reasons I filtered out so many words was to avoid having more such topics.

But it turns out that there is another period with a somewhat distinctive vocabulary: the twenty-first century. It’s not as distinctive as mid-century British philosophy. And usually it isn’t distinctive enough to really confuse most of these models. But it is just distinctive enough that if you run refinements iteratively for, let’s say, four days while you’re away at a conference, the model will find this distinctive language. So after 100 iterations, we ended up with a model containing a topic that wasn’t a philosophical topic at all, but was characterised by the buzzwords of recent philosophy.

Still, it turns out the refinements weren’t an entirely bad idea. After 15 refinements, the model had separated out some of the disjunctive categories I’d hoped it would, and was only starting to get thrown by the weird language of very recent philosophy. So that’s the model I ended up using - the one with seed 22031848, 90 topics, and 15 iterations of the refinement process.


  1. If you’re interested in doing this yourself, the magic code looks like refinedlda <- LDA(all_dtm, k = 90, model = refinedlda, control = list(seed = 22031848, verbose = 1, initialize = "model")). That is, refinedlda is an LDA model built from the DTM we started with, with 90 topics, and initialised from a model, where that model is refinedlda itself. If loops don’t scare you, you can simply loop this process to get as many iterations of refinement as you like. Each iteration took about 45 minutes to run when I did them.
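If you do want the loop, here is a minimal sketch, using the same names as above (with lda_90, a name I’ve made up here, standing for the unrefined 90-topic model):

```r
# Start from the finished 90-topic model and refine it 15 times.
refinedlda <- lda_90
for (i in 1:15) {
  refinedlda <- LDA(all_dtm, k = 90, model = refinedlda,
                    control = list(seed = 22031848, verbose = 1,
                                   initialize = "model"))
}
```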