3.5 Raw Frequency of Articles

The problems of the previous section are just exacerbated if we go to raw frequencies.

Figure 3.8: All 90 topics - Raw Frequency of Articles

Figure 3.9: The 90 topics - Raw Frequency of Articles (Faceted)

All this tells us is that there is a lot more diversity, and a lot more specialisation, in journals in the last 30 years than there was 120 years ago. Everything else gets lost in the noise.

It’s only a little clearer even if you filter down to the last 75 years.

Figure 3.10: All 90 topics - Raw Frequency of Articles

The animation is a bit more revealing.

But what it reveals is primarily that these raw counts are very unstable. That’s because the measure they are built on is subject to severe tipping-point effects. Whether an article gets probability 0.26 for being in one category and probability 0.25 for being in another, or the other way around, really just depends on where we stopped the algorithm. (It’s bob-of-the-head stuff in horse racing terms.) But it makes all the difference to these raw counts. This is why I’ve tried, unlike most work I’ve seen that uses topic modeling, to de-emphasise these raw counts in favour of the weighted counts.
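To make the tipping-point worry concrete, here is a minimal sketch in Python. It assumes that a raw count credits each article wholly to its single most probable topic, and that a weighted count sums each topic's probability across articles. The topic names, the probabilities other than 0.26 and 0.25, and the function names are all hypothetical and chosen only for illustration; this is not the code behind the graphs.

```python
from collections import Counter


def raw_counts(articles):
    """Credit each article once, to its single most probable topic (assumed definition)."""
    return Counter(max(probs, key=probs.get) for probs in articles)


def weighted_counts(articles):
    """Sum each topic's probability over all articles (assumed definition)."""
    totals = Counter()
    for probs in articles:
        for topic, p in probs.items():
            totals[topic] += p
    return totals


# One hypothetical article under two model runs that differ only in
# which of its top two topics gets 0.26 and which gets 0.25.
run_a = [{"ethics": 0.26, "metaphysics": 0.25, "logic": 0.17, "mind": 0.16, "language": 0.16}]
run_b = [{"ethics": 0.25, "metaphysics": 0.26, "logic": 0.17, "mind": 0.16, "language": 0.16}]

for name, run in (("run A", run_a), ("run B", run_b)):
    print(name, "raw:", dict(raw_counts(run)))
    print(name, "weighted ethics:", round(weighted_counts(run)["ethics"], 2))
```

On these made-up numbers the article moves entirely from one topic to the other in the raw count, while the weighted count for each topic shifts by only 0.01. That is the sense in which the raw counts are unstable and the weighted counts are not.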

The rest of the graphs look at what happens when we focus on pages rather than articles. I’m using articles as the basic unit of measure for everything else in this book, but it’s worth spending a little time seeing how things look if you focus on pages instead.