1.3 Selecting the Words
The JSTOR data excludes a few stop words (like ‘the’ and ‘and’), as well as all words of one or two characters. It also treats non-letters as word breaks. So ‘doesn’t’ is split into ‘doesn’ and ‘t’, and the second is rejected as too short. Hyphenated words are split as well; it turned out that this made ‘est’ a reasonably common word. Even with these exclusions in place, I did not want to include every remaining word, for reasons I’ll set out in this section.
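As a rough illustration of the splitting behaviour described above (this is not JSTOR’s actual code, and the function name `jstor_tokenize` is mine), the rules can be sketched in a few lines of Python:

```python
import re

def jstor_tokenize(text):
    """Split on runs of non-letters, then drop tokens of one or two
    characters, mimicking the word-break rules described above."""
    tokens = re.split(r"[^A-Za-z]+", text)
    return [t.lower() for t in tokens if len(t) > 2]

# "doesn't" splits into "doesn" and "t"; only "doesn" survives,
# and the two-letter "It" is dropped as too short.
jstor_tokenize("It doesn't matter")  # ['doesn', 'matter']
```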
It seems common in text mining to exclude a more expansive list of ‘stop words’ than JSTOR leaves out. I played around with making my own list of stop words, but I decided it would be more objective to use the commonly used list from the tm package, which excludes the following words.
- i, me, my, myself, we, our, ours, ourselves, you, your, yours, yourself, yourselves, he, him, his, himself, she, her, hers, herself, it, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, would, should, could, ought, i’m, you’re, he’s, she’s, it’s, we’re, they’re, i’ve, you’ve, we’ve, they’ve, i’d, you’d, he’d, she’d, we’d, they’d, i’ll, you’ll, he’ll, she’ll, we’ll, they’ll, isn’t, aren’t, wasn’t, weren’t, hasn’t, haven’t, hadn’t, doesn’t, don’t, didn’t, won’t, wouldn’t, shan’t, shouldn’t, can’t, cannot, couldn’t, mustn’t, let’s, that’s, who’s, what’s, here’s, there’s, when’s, where’s, why’s, how’s, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very
I excluded all these words from the analysis. The intuition here is that including them would make the analysis more sensitive to stylistic tics than to content, and in practice that seemed right: the models did look more reflective of substance than style with the stop words excluded. In principle I’m not sure it was right to exclude all those quantifiers from the end of the list, but it doesn’t seem to have hurt the analysis. I’ll come back to this point at the end of the chapter; it is possible I should have been more aggressive in filtering out stop words.
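Mechanically, the exclusion is just set-membership filtering. A minimal Python sketch, using only a handful of entries from the list above (the real filter of course used the whole list):

```python
# A few entries from the tm stop word list; the actual filter
# used the full list reproduced above.
TM_STOPWORDS = {"i", "me", "the", "and", "is", "not", "very", "doesn't"}

def drop_stopwords(tokens):
    """Remove any token that appears in the stop word set."""
    return [t for t in tokens if t not in TM_STOPWORDS]

drop_stopwords(["the", "analysis", "is", "very", "sensitive"])
# ['analysis', 'sensitive']
```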
The stop words list from tm includes a lot of contractions. I wrote a small script to extract the parts of those contractions before the apostrophe, and excluded them too. The parts after the apostrophe were always one or two letters, so they were already excluded.
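The extraction script itself isn’t reproduced here, but the idea can be sketched in Python. (The sketch uses straight apostrophes; the tm list as printed above uses curly ones, which would need normalising first.)

```python
def contraction_prefixes(words):
    """Return the part of each contraction before the apostrophe,
    deduplicated and sorted."""
    return sorted({w.split("'")[0] for w in words if "'" in w})

contraction_prefixes(["i'm", "doesn't", "won't", "let's", "that's"])
# ['doesn', 'i', 'let', 'that', 'won']
```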
I also looked through the list of the 5000 most common words in the data set for words that shouldn’t be there, and the rest of this section describes what was cut on that basis.
In some cases, JSTOR’s source for the text was the LaTeX code for the article, so there was a lot of LaTeX junk in the text file. I’m sure I didn’t remove all of it, but I got rid of a lot by deleting the following words.
- aastex, amsbsy, amsfonts, amsmath, amssymb, amsxtra, begin, cal, cyr, declaremathsize, declaretextfont, document, document class, empty, encodingdefault, end, fontenc, landscape, mayhem, mathrsfs, math strut, newcommand, normalfont, pagestyle, pifont, portland, renewcommand, rmdefault, selectfont, sfdefault, stmaryrd, textcomp, textcyr, usepackage, wmcyr, wncyss, xspace, documentclass, declaretextfontcommand, wncyr, declaremathsizes, mathrm, vert, mathstrut, hat, mathbf, thinspace, ldots, neg, bbb, ebc, cdot, boldsymbol, vec, langle, rangle, leq, infty, mathsf, vdash, boldmath, boldsymbol, cwmi, forall, mathrel, mbox, prfm, neq, anid
I’m a bit worried that excluding ‘document’ meant I lost some signal about historical articles in the LaTeX noise. But this was unavoidable.
Also note that ‘anid’ is not a LaTeX term, but it was worthwhile to exclude it here. Something about how the text recognition software JSTOR uses interacted with 19th and early 20th century articles meant that several words, especially ‘and’, got coded as ‘anid’. This was the OCR version of a typo, and best deleted. (There were a few more of these that were not in the 5000 most common words that on reflection I wish I’d cut too. But I don’t think they make a huge difference to the analysis given how rare they are.)
Somewhat reluctantly, I deleted a bunch of spelled-out Greek letter names for the same reason; they were mostly from LaTeX code. This meant deleting the following words.
- alpha, beta, gamma, delta, omega, theta, lambda, rho, psi, phi, sigma
I’m sure this lost some signal. But there was so much LaTeX noise that it was unavoidable.
Next I deleted a few honorifics, in particular:
- prof, mrs, professor
These just seemed to mark the article as being old, not anything about the content of the article. I didn’t need to exclude ‘mr.’ or ‘dr.’ since they were already excluded as too short.
Although I was trying to exclude foreign language articles, I also excluded a bunch of foreign words. One reason was that it was a check on whether I missed any foreign language articles. Another was that if I didn’t do this, then articles that had extensive quotation from foreign languages would be seen by the model as being in their own distinctive topic merely in virtue of having non-English quotations. And that seemed wrong. So to fix it, I excluded these words.
- auch, aussi, autre, cette, diese, haben, leur, soit, toute, peut, noch, habe, wenn, einem, doch, durch, kann, comme, aber, mais, nur, wird, wie, sont, ich, dieser, oder, avec, une, werden, bien, sie, auf, einer, dans, dass, esta, nicht, entre, uns, ont, que, wir, nach, einen, como, esprit, seine, elles, fait, elle, eine, lui, selbst, aus, deux, vom, pensee, schon, zum, nin, propre, les, pour, espace, las, una, amour, sind, etre, ueber, biran, das, bei, qui, temps, mich, alcan, sich, ein, zur, idee, welt, philosophique, mir, vie, homme, ces, maupertuis, leipzig, als, essai, del, sens, hier, monde, und, histoire, soi, por, des, den, bachelard, logique, sans, meyerson, filosofia, bourgeois, sein, philosophie, ist, meiner, zeit, raison, tarde, begriff, los, theorie, dem, der, pas, revue, uber, veblen, mas, weil, ser, philosophische, psychologie, milieu, geschichte, sur, dire, ses, une, les, que, est, etc
Finally, I excluded a bunch of words that seemed to turn up primarily in bibliographies, or in in-text citations. Including them seemed to make the model more sensitive to the referencing style of the journal than to the content. Here the deletions really did cost some content, because some of the words were philosophically relevant. But I deleted them because they seemed to be turning up more often in bibliographies than in the body text.
- doi, proceedings, review, journal, press, compilation, compilation, editors, supplementary, quarterly, aristotelian, kegan, dordrecht, minnesota, reidel, edu, stanford, oxford, cambridge, basil, blackwell, thanks, cit, mit, eds, loc, york, university, nous, chicago, clarendon, edited
The surprising one there is ‘compilation’. But it most often appears because some journals have a footer saying “Journal compilation © …”.
Then to speed up processing, I deleted any word that appeared in a given article three times or fewer. This did lose some content, but it sped up the processing a lot. Some of the steps I’ll describe below took several days of computing time; without this restriction they would have taken several weeks. And words that appear one to three times in an article shouldn’t be that significant for determining its content.
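This per-article frequency cut-off can be sketched as follows (again a Python illustration, not the actual pipeline code); `min_count=4` keeps only words appearing at least four times in an article, i.e. it drops those appearing three times or fewer:

```python
from collections import Counter

def drop_rare(tokens, min_count=4):
    """Keep only tokens that occur at least min_count times
    within this one article's token list."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# 'dog' appears only three times in this toy article, so it is dropped.
drop_rare(["cat"] * 4 + ["dog"] * 3)  # ['cat', 'cat', 'cat', 'cat']
```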