3.2 Graph Choices
In the previous section I presented one way of graphing the trends in these 90 topics. In making that graph I made three major choices, and for each of them I can see good arguments either way. In this section I’m going to say what these choices are, and in the subsequent sections I’ll show you what the graphs look like with the other choices. This is getting particularly deep in the weeds, and you wouldn’t lose a lot by jumping ahead to the next chapter rather than going over these three questions.
First, should the y-axis be a probability sum, or a count? That is, for each topic-year pair, do we work out the expected number of articles in that topic from that year (given the LDA-generated probability function), or do we count the number of articles from that year whose probability of being in this topic is maximal? I’ll call these options the weighted count and raw count respectively.
Second, do we present the result as a sum, or do we normalise the result by presenting it as a ratio of the total number of articles from that year? I’ll call these the sum and frequency options.
Third, do we take articles or pages to be the basic unit? In practice, that means: do we simply add up how many articles are in a topic, or do we weight the articles by their page length? (And, if we’re using frequencies, suitably increase the denominator we’re using for normalisation.)
These three choices totally cross-cut each other, so we get eight possible things to go on the y-axis.
| Number | Short Description | Long Description |
|---|---|---|
| 1 | Weighted sum of articles | What we already saw: the expected number of articles in a topic in a year |
| 2 | Raw sum of articles | How many articles are in each topic each year, where ‘in’ is defined as having a higher probability of being in that topic than in any other |
| 3 | Weighted frequency of articles | The value in 1, divided by the number of articles in that year |
| 4 | Raw frequency of articles | The value in 2, divided by the number of articles in that year |
| 5 | Weighted number of pages | For each article in a year, the probability of being in that topic, times its length in pages, summed over the articles |
| 6 | Raw number of pages | The sum of the pages of the articles in a topic (in the sense of 2) in a given year |
| 7 | Weighted frequency of pages | The value in 5, divided by the total number of pages in that year |
| 8 | Raw frequency of pages | The value in 6, divided by the total number of pages in that year |
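To make the relationships between these eight measures concrete, here is a sketch of how one might compute all of them from an article-topic probability matrix for a single year. The data, shapes, and variable names are invented for illustration; the book’s actual model has 90 topics and many more articles.

```python
import numpy as np

# Toy data: the LDA assigns each article a probability distribution over
# topics.  Both the number of articles and the page lengths are invented.
rng = np.random.default_rng(0)
n_articles, n_topics = 6, 3
probs = rng.dirichlet(np.ones(n_topics), size=n_articles)  # each row sums to 1
pages = np.array([4, 30, 12, 8, 25, 10])                   # pages per article

# 1. Weighted sum of articles: expected number of articles in each topic.
weighted_sum = probs.sum(axis=0)

# 2. Raw sum of articles: count each article only in its modal topic.
modal = probs.argmax(axis=1)
raw_sum = np.bincount(modal, minlength=n_topics)

# 3-4. Article frequencies: divide by the number of articles that year.
weighted_freq = weighted_sum / n_articles
raw_freq = raw_sum / n_articles

# 5. Weighted number of pages: probability times page length, summed.
weighted_pages = (probs * pages[:, None]).sum(axis=0)

# 6. Raw number of pages: total pages of the articles modally in each topic.
raw_pages = np.bincount(modal, weights=pages, minlength=n_topics)

# 7-8. Page frequencies: divide by the total pages that year.
weighted_page_freq = weighted_pages / pages.sum()
raw_page_freq = raw_pages / pages.sum()
```

Note that the weighted measures conserve the totals: the weighted sums add up to the number of articles (or pages) in the year, just as the raw ones do, so the two differ only in how the totals are distributed across topics.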
So why did I go with option 1? Well, all of the other options have flaws that made me want to try something different.
The raw counts are too uneven, and require trendlines to be added to the graph. And they are fairly arbitrary. If the model says that something has probability 0.21 of being in topic X, and 0.18 of being in topic Y, and 0.16 of being in topic Z, and so on for several other topics, it feels like throwing away information to simply classify it in X. This is especially true if Y and Z are very similar topics, and the model could easily have merged them, while X is fairly different.
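The worry about merged topics can be made precise with a small arithmetic sketch. The probabilities below follow the hypothetical example in the text, with the remaining mass spread over a few other invented topics:

```python
# Hypothetical topic probabilities for one article, as in the text: 0.21 for
# X, 0.18 for Y, 0.16 for Z, with the rest spread over other topics.
probs = {"X": 0.21, "Y": 0.18, "Z": 0.16, "W": 0.15, "V": 0.15, "U": 0.15}

# Raw counting classifies the article entirely in its modal topic, X...
modal_topic = max(probs, key=probs.get)

# ...but had the model merged the two similar topics Y and Z, the merged
# topic's probability (0.34) would beat X's, and the raw count would flip.
merged = probs["Y"] + probs["Z"]
```

So a raw count can be sensitive to a fairly arbitrary modelling decision about whether two nearby topics are kept separate, while the weighted count changes smoothly.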
Given the variation in the number of articles and pages being published each year, it makes sense to express trends as a proportion of the annual whole, rather than as a count. The problem is that there is too much of a mono-culture (or perhaps bi-culture) in the early years of the data. Idealism and Psychology routinely account for more than 30% of the articles (or pages) in a year. So any graph would need a y-axis that stretches that high. But post-1970, the difference between the prominent and less prominent topics is the difference between being 1% and 3% of the total. There is no natural way to represent those things on a single graph.
There are a few ways around this, and the proportion-based measures make so much sense that I’ll use most of the following tricks somewhere in the book.
- We can present the different topics on different charts, and hope that a combination of labelling the axes carefully and putting reminders about the axis labels in the text will make the differences apparent. I did that in the previous chapter.
- We can separate out the journals, so prominent modern topics are a reasonable percentage of a particular journal. I did that in the previous chapter too.
- We can leave off the early years of the data set, so the outliers go away. That works, but it literally involves removing data, and by the time we’re done, the quantity of articles and pages has stabilised enough that it’s less important to normalise.
- We can literally leave off some data points, setting the limit to the y-axis below where outlier points are, and just list the outlier points in the text. For the purposes of these eight graphs there are too many points for that to be feasible, but I will use this trick from time to time later in the book.
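The last of these tricks can be sketched in a few lines. The frequencies below are invented for illustration (though the outlier topics and rough magnitudes follow the discussion above), and the cap is an arbitrary choice:

```python
# A sketch of the "cap the y-axis and list the outliers" trick.  The
# frequencies here are invented; in practice they would come from the model.
freqs = {
    "Idealism": 0.32,           # early-years outlier
    "Psychology": 0.30,         # early-years outlier
    "Truth": 0.03,
    "Liberal Democracy": 0.02,
    "Vagueness": 0.01,
}
cap = 0.05  # the chosen y-axis limit

plotted = {t: f for t, f in freqs.items() if f <= cap}
outliers = {t: f for t, f in freqs.items() if f > cap}
# The topics in `plotted` go on the chart; those in `outliers` are
# reported in the surrounding text instead of being drawn.
```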
While all of these work, none of them seem perfect, especially if we want a single graph to show everything. So I think it’s best, at least for this chapter, to use sums not frequencies.
Third, it is much simpler to use article counts, but there is some value to using pages as well. If one topic is frequently the subject of very short articles, especially in Analysis, and another topic is never discussed in short articles, then the difference between the article count and the page count will be substantial. And there is a sense in which the page count more reflects what is going on in the journal as a whole.
To give you a sense of how this can happen, here are the weighted article counts and page counts for two topics, restricting attention to the 2000s (i.e., 2000-2009).
The articles on Truth were largely (though not exclusively) in Analysis. The articles on Liberal Democracy were largely (though again not exclusively) in Philosophy and Public Affairs. And as you can see, which of these two has a bigger presence in the journals in the 2000s is very much a function of whether you count by articles or pages. If you have a strong view about the journals in the 2000s, you might be able to use this to test which of the two is a more accurate measure.
For my money, both seem like reasonably interesting facts about the journals. There were more articles about Truth, and more pages about Liberal Democracy, and that’s all there is to say. If you worked on one but not the other, you would probably feel like it was a bigger deal, because both of them were very important in the 2000s. But otherwise it seems to me like both measures are useful.
Still, I’ve usually used articles rather than pages, for a few reasons.
One is that the LDA assigns probabilities to articles, not individual pages, and it seems potentially misleading to use these article-based measures to implicitly say that there were, say, 1369 journal pages on Liberal Democracy in the 2000s. If my data included not just words in articles, but words on pages, I could have set the LDA up to assign topics to each page, and then it would have made more sense to count the pages. But that’s not the data available.
Another is that the number of journal pages has been growing more rapidly than the number of journal articles. So if you measure by pages, the case for normalising, i.e., showing frequencies not counts, is even stronger. But the problems with normalising remain.
These aren’t super conclusive reasons, but they were enough that I’m mostly using article-based, rather than page-based, measures in this book.
But that said, I think all eight graphs are interesting, and in the rest of this chapter I’ll show what they look like.