In the previous section I presented one way of graphing the trends in these ninety topics. In making that graph I made three major choices, and in each case there is a good argument for doing things the other way. In this section I’m going to say what these choices are, and in the subsequent sections I’ll show what the graphs look like with the other choices. This is getting particularly deep in the weeds, and one wouldn’t lose a lot by jumping ahead to the next chapter rather than going over these three questions.
First, should the Y-axis be a probability sum, or a count? That is, for each topic-year pair, do we work out the expected number of articles in that topic from that year (given the LDA-generated probability function), or do we count the number of articles from that year whose probability of being in this topic is maximal? I’ll call these options the weighted count and raw count respectively.
Second, do we present the result as a sum, or do we normalize the result by presenting it as a ratio of the total number of articles from that year? I’ll call these the sum and frequency options.
Third, do we take articles or pages to be the basic unit? In practice, taking articles as the basic unit means adding up how many articles are in a topic, while taking pages as the basic unit involves weighting each article by its page length. (And, if using frequencies, suitably increasing the denominator being used for normalization.)
These three choices fully cross-cut each other, so there are eight possible measures to put on the Y-axis.
| Number | Short Description | Long Description |
|--------|-------------------|------------------|
| 1 | Weighted sum of articles | What we already saw: the expected number of articles in a topic in a year |
| 2 | Raw sum of articles | How many articles are in each topic each year, where ‘in’ is defined as having a higher probability of being in that topic than any other |
| 3 | Weighted frequency of articles | The value in 1, divided by the number of articles in that year |
| 4 | Raw frequency of articles | The value in 2, divided by the number of articles in that year |
| 5 | Weighted number of pages | For each article in a year, the probability of being in that topic, times its length in pages |
| 6 | Raw number of pages | The sum of the pages of the articles in a topic (in the sense of 2) in a given year |
| 7 | Weighted frequency of pages | The value in 5, divided by the total number of pages in that year |
| 8 | Raw frequency of pages | The value in 6, divided by the total number of pages in that year |
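The eight measures can be sketched in a few lines of code. The inputs below — topic probabilities, years, and page lengths — are made-up illustrative numbers, not figures from the journal data set, and all the variable names are my own assumptions.

```python
# theta[i][k]: LDA probability that article i belongs to topic k
# (hypothetical values for four articles and three topics)
theta = [[0.7, 0.2, 0.1],
         [0.3, 0.4, 0.3],
         [0.1, 0.1, 0.8],
         [0.5, 0.4, 0.1]]
years = [2000, 2000, 2001, 2001]   # publication year of each article
pages = [10, 30, 20, 5]            # page length of each article
K = len(theta[0])                  # number of topics

def measures(year):
    """Return the eight Y-axis options (keyed 1-8 as in the table) for one year."""
    rows = [(t, p) for t, y, p in zip(theta, years, pages) if y == year]
    n_articles = len(rows)
    n_pages = sum(p for _, p in rows)
    out = {i: [0.0] * K for i in range(1, 9)}
    for t, p in rows:
        top = t.index(max(t))               # "raw" assignment: most probable topic
        for k in range(K):
            out[1][k] += t[k]               # 1: weighted sum of articles
            out[5][k] += t[k] * p           # 5: weighted number of pages
        out[2][top] += 1                    # 2: raw sum of articles
        out[6][top] += p                    # 6: raw number of pages
    for k in range(K):
        out[3][k] = out[1][k] / n_articles  # 3: weighted frequency of articles
        out[4][k] = out[2][k] / n_articles  # 4: raw frequency of articles
        out[7][k] = out[5][k] / n_pages     # 7: weighted frequency of pages
        out[8][k] = out[6][k] / n_pages     # 8: raw frequency of pages
    return out

m = measures(2000)
```

With these toy inputs, the weighted sum for 2000 credits each topic with its expected share of both articles, while the raw sum files each article under a single topic.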
So why did I go with option 1? Well, all of the other options have flaws that made me want to try something different.
The raw counts are too uneven, and require trendlines to be added to the graph. And they are fairly arbitrary. If the model says that something has probability 0.21 of being in topic X, 0.18 of being in topic Y, 0.16 of being in topic Z, and so on for several other topics, it feels like throwing away information to simply classify it in X. This is especially true if Y and Z are very similar topics, and the model could easily have merged them, while X is fairly different.
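To make the arbitrariness concrete, here is that example in code. The probabilities are the hypothetical ones from the text, and the topic labels X, Y, and Z are just placeholders:

```python
# Hypothetical article-level topic probabilities from the example in the text;
# the remaining 0.45 is spread over several other topics.
probs = {"X": 0.21, "Y": 0.18, "Z": 0.16}

raw_topic = max(probs, key=probs.get)
# The raw count files the article under X, on the strength of a 0.03 edge.

# But if the model had happened to merge the two similar topics Y and Z:
merged = {"X": 0.21, "Y+Z": probs["Y"] + probs["Z"]}
merged_topic = max(merged, key=merged.get)
# Now the merged topic (0.34) beats X (0.21), and the raw classification flips.
```

A small, essentially arbitrary modeling choice changes the raw classification entirely, while the weighted measures shift only smoothly.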
Given the variation in the number of articles and pages being published each year, it makes sense to express trends as a proportion of the annual whole, rather than as a count. The problem is that there is too much of a monoculture (or perhaps biculture) in the early years of the data. Idealism and psychology routinely account for more than 30% of the articles (or pages) in a year. So any single graph would need a Y-axis that stretches that high. But post-1970, the difference between the prominent and less prominent topics is the difference between being 1 percent and 3 percent of the total. There is no natural way to represent both of those things on one graph.
There are a few ways around this, and the proportion-based measures make so much sense that I’ll use most of the following tricks somewhere in the book.
- We can present the different topics on different charts and hope that a combination of labelling the axes carefully and putting reminders about the axis labels in the text will make the differences apparent. I did that in the previous chapter.
- We can separate out the journals so prominent modern topics are a reasonable percentage of a particular journal. I did that in the previous chapter too.
- We can leave off the early years of the data set, so the outliers go away. That works, but it literally involves removing data, and by the time we’re done, the quantity of articles and pages has stabilized enough that it’s less important to normalize.
- We can literally leave off some data points, setting the limit to the Y-axis below where outlier points are, and just list the outlier points in the text. For the purposes of these eight graphs there are too many points for that to be feasible, but I will use this trick from time to time later in the book.
While all of these work, none of them seem perfect, especially if we want a single graph to show everything. So I think it’s best, at least for this chapter, to use sums not frequencies.
Third, it is much simpler to use article counts, but there is some value to using pages as well. If one topic is frequently the subject of very short articles, especially in Analysis, and another topic is never discussed in short articles, then the difference between the article count and the page count will be substantial. And there is a sense in which the page count more reflects what is going on in the journal as a whole.
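Here is a toy illustration of how the two measures can come apart. The numbers are invented, not drawn from the journal data:

```python
# Made-up page lengths: topic A appears in many short, Analysis-style articles,
# topic B in a few long ones.
topic_a_pages = [4] * 50    # 50 articles of 4 pages each
topic_b_pages = [30] * 12   # 12 articles of 30 pages each

articles_a, articles_b = len(topic_a_pages), len(topic_b_pages)
pages_a, pages_b = sum(topic_a_pages), sum(topic_b_pages)

# By article count, A dominates (50 to 12); by page count, B does (200 to 360).
```

Which topic looks bigger depends entirely on which unit is chosen, even though the underlying journals are the same.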
To give you a sense of how this can happen, here are the weighted article counts and page counts for two topics, restricting attention to the 2000s (i.e., 2000–2009).
The articles on truth were largely (though not exclusively) in Analysis. The articles on liberal democracy were largely (though again not exclusively) in Philosophy and Public Affairs. Which of these is said to have a bigger presence in the journals in the 2000s is very much a function of whether the measure used is based on articles or pages. If one has a strong view about the journals in the 2000s, one might be able to use this to test which of the two is a more accurate measure.
For my money, both seem like reasonably interesting facts about the journals. There were more articles about truth, and more pages about liberal democracy, and that’s all there is to say. A philosopher working on one of these topics but not the other would probably feel their home topic was the bigger deal. But it seems to me like both measures are useful.
Still, I’ve usually used articles rather than pages, for a few reasons.
One is that the LDA assigns probabilities to articles, not individual pages, and it seems potentially misleading to use these article-based measures to implicitly say that there were, say, 1,369 journal pages on liberal democracy in the 2000s. If my data included not just words in articles, but words on pages, I could have set the LDA up to assign topics to each page, and then it would have made more sense to count the pages. But that’s not the data available.
Another is that the number of journal pages has been growing more rapidly than the number of journal articles. So if we measure by pages, the case for normalizing (i.e., showing frequencies not counts) is even stronger. But the problems with normalizing remain.
These aren’t super conclusive reasons, but they were enough that I’m mostly using article-based, rather than page-based, measures in this book.
But that said, I think all eight graphs are interesting, and in the rest of this chapter I’ll show what they look like.