5.2 Splitting Up Topics

Some of the topics just look disjunctive, no matter how hard I tried to get rid of disjunctiveness. And this affects the categorisation.

Consider, for instance, the Sets and Grue topic. This just looks like it is made up of two parts: discussion of set theory, and discussion of the grue paradox. And while both of these are connected to Nelson Goodman, and more generally involve technical challenges facing a certain kind of mid-century empiricist, they aren’t really connected to each other. The set theory discussion looks like it should go in either Metaphysics or Logic; the grue discussion looks like it should go in Epistemology or Philosophy of Science. But putting the whole topic in any one of these four seemed mistaken.

Fortunately, there is a nice technique to resolve this problem, and it involves yet more applications of the LDA model. Take the articles that are in this topic (i.e., have a higher probability of being in this topic than in any other), and use the LDA technique to sort them into a two-topic model. I’ll call this a binary sort in what follows. So instead of taking all 32261 articles and sorting them into 90 (or more) topics, just take the 555 articles in this topic and sort them into 2 topics. If we’re lucky, one side of the sort will be the set theory articles, and the other side will be the grue articles.15
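
The two steps of a binary sort (pick out the articles whose most probable topic is the one in question, then refit a two-topic model on just those articles) can be sketched as follows. This is only an illustration under stated assumptions, not the code behind the book: the names `articles_in_topic` and `fit_lda`, the toy data, and the small collapsed Gibbs sampler standing in for a real LDA implementation are all hypothetical.

```python
import random


def articles_in_topic(gamma, topic):
    """Indices of articles whose most probable topic is `topic`."""
    return [i for i, row in enumerate(gamma)
            if max(range(len(row)), key=row.__getitem__) == topic]


def fit_lda(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler (a stand-in for a real LDA library).

    Returns one row per document: estimated probabilities over k topics.
    """
    rng = random.Random(seed)
    vocab_list = sorted({word for doc in docs for word in doc})
    vocab = {word: i for i, word in enumerate(vocab_list)}
    n_words = len(vocab)
    doc_topic = [[0] * k for _ in docs]           # counts per document
    topic_word = [[0] * n_words for _ in range(k)]  # counts per topic
    topic_total = [0] * k
    assign = []
    # Start from a random topic assignment for every word token.
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            t = rng.randrange(k)
            assign[d].append(t)
            doc_topic[d][t] += 1
            topic_word[t][vocab[w]] += 1
            topic_total[t] += 1
    # Repeatedly resample each token's topic given all the others.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = assign[d][i], vocab[w]
                doc_topic[d][t] -= 1
                topic_word[t][wi] -= 1
                topic_total[t] -= 1
                weights = [(doc_topic[d][j] + alpha)
                           * (topic_word[j][wi] + beta)
                           / (topic_total[j] + n_words * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wi] += 1
                topic_total[t] += 1
    return [[(doc_topic[d][j] + alpha) / (len(doc) + k * alpha)
             for j in range(k)] for d, doc in enumerate(docs)]


# Made-up stand-in for the big model's output: 5 articles, 3 topics.
gamma_full = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.4, 0.1],
]
# Made-up token lists for the same 5 articles.
docs = [
    ["grue", "green", "emeralds", "grue", "projectible"],
    ["knowledge", "belief", "justification"],
    ["frege", "membership", "ordinal", "russell", "frege"],
    ["virtue", "duty", "obligation"],
    ["grue", "emeralds", "entrenchment", "green"],
]

in_topic = articles_in_topic(gamma_full, 0)   # articles assigned to topic 0
binary_gamma = fit_lda([docs[i] for i in in_topic], k=2)
```

On real data the refit would be done with whatever LDA library produced the original model; the point of the sketch is only that the second model sees nothing but the articles assigned to the one topic, which is why it is so much cheaper to run.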

And it turns out we are more or less lucky in just that way. Here are the keywords and paradigm articles for each of the two subtopics in this binary sort.16

First Subtopic

grue, green, emeralds, examined, projectible, verisimilitude, entrenchment, projectibility, emerald, entrenched

Characteristic Articles

Second Subtopic

frege, membership, pure, null, abstraction, plural, boolos, mathematics, ordinal, russell

Characteristic Articles

The first subtopic isn’t exclusively articles about grue, but the keywords suggest that they are going to be a big chunk of what’s there. And the second subtopic looks like set theory. The binary sort worked.

So now instead of asking how the articles in this topic should be classified, we can ask the two questions of how the articles in these subtopics should be classified.17 And while neither question is trivial, they seem at least a bit more tractable. Ultimately, I ended up putting set theory in Logic and Mathematics, and grue in Philosophy of Science. Just why I made those choices is for later sections, but for now I just wanted to show how the subtopics were generated.

There is one more thing we can note about this binary sort: the model is very confident in its answers. In the original 90-topic model, there is precisely 1 article that the model gives a probability greater than 0.99 of being in a particular topic.18 In the binary sort I just described, 57% of the articles are such that the model gives them a probability of at least 0.99 of being in one particular topic. Now obviously it’s easier to be more confident in a two-way sort than a ninety-way sort. But this gives us a check of how disjunctive the model itself thinks the topic is. And I’ll use that to check whether it really makes sense to split a topic up in this way.
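
The confidence check described here amounts to counting the articles whose largest topic probability clears a threshold. A minimal sketch, assuming the model's output is available as one row of topic probabilities per article; the function name `share_confident` and the example numbers are made up for illustration.

```python
def share_confident(gamma, threshold=0.99):
    """Fraction of articles whose most probable topic has probability
    at least `threshold`."""
    if not gamma:
        raise ValueError("gamma must contain at least one article")
    confident = sum(1 for row in gamma if max(row) >= threshold)
    return confident / len(gamma)


# Made-up output of a binary sort over four articles.
example_gamma = [
    [0.995, 0.005],
    [0.400, 0.600],
    [0.001, 0.999],
    [0.970, 0.030],
]
share = share_confident(example_gamma)  # 2 of the 4 rows clear 0.99
```

A share near 1 would suggest the model sees two cleanly separable groups of articles, while a low share would suggest the split is less natural.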

  15. One advantage of doing things this way rather than looking for a more and more fine-grained model of the whole universe is speed. It would be somewhat interesting to see what happened if we sorted the 32261 articles into 120 topics. But that would take something like 12 hours on a good personal computer. The binary sort I described in the text takes well under 12 seconds.

  16. An embarrassing admission: Due to a coding error, I ended up using 1954 for the seed for these binary sorts, not 22031848 like I’ve used for everything else. I only realised this after I’d done so much work building on them that it would have been too much to go back and change it, especially since the value of a random seed shouldn’t matter too much. But it was annoying to have had this change slip in.

  17. Just to be clear, I’m using ‘topic’ for the elements of the 90-way partition that the original model generated, and ‘subtopic’ for elements of the 2-way partition that these new binary sorts generate.

  18. It’s “Contextualism, Hawthorne’s Invariantism and Third-Person Cases” by Anthony Brueckner.