10Pubmed Data Set Used in the Exemplar-based Visualization (EV) Software

  • Summary


  • The 10Pubmed data set is a collection of approximately 15,500 medical documents covering 10 different diseases. It consists of abstracts published in the MEDLINE database from 2000 to 2008. The abstracts were retrieved by querying MEDLINE with the "MajorTopic" tag together with the disease-related MeSH terms. Common words and stop words were removed from all retrieved abstracts, and the remaining words were stemmed using Porter's suffix-stripping algorithm. Finally, a document-word matrix of size 15565 x 22437 and the corresponding list of 22437 words were built.
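The pre-processing steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline actually used to build 10Pubmed: the stop-word list is a hypothetical toy list, `stem` is a crude placeholder rather than Porter's algorithm, and the matrix is assumed to hold raw term counts (the source does not say how the entries are weighted).

```python
import re
from collections import Counter

# Hypothetical toy stop-word list; the list used for 10Pubmed is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "with", "for"}


def stem(word):
    """Placeholder stemmer (assumption): strips a few common suffixes.

    This is NOT Porter's suffix-stripping algorithm; it only marks the
    place in the pipeline where a real stemmer would be called.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_doc_word_matrix(docs):
    """Return (matrix, word_list); matrix[i][j] counts word j in document i."""
    token_lists = [tokenize(d) for d in docs]
    word_list = sorted({w for tokens in token_lists for w in tokens})
    column = {w: j for j, w in enumerate(word_list)}
    matrix = [[0] * len(word_list) for _ in docs]
    for i, tokens in enumerate(token_lists):
        for word, count in Counter(tokens).items():
            matrix[i][column[word]] = count
    return matrix, word_list


# Two made-up sentences standing in for MEDLINE abstracts.
docs = [
    "Gout is treated with allopurinol.",
    "Migraine attacks and gout attacks.",
]
matrix, word_list = build_doc_word_matrix(docs)
```

Applied to the full collection, the same idea would give one row per document and one column per vocabulary word, which matches the 15565 x 22437 shape quoted above.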


  • Organization


  • The data is organized into 10 different files, each corresponding to a different disease. The 10 diseases are:
    Gout,
    Chickenpox,
    Raynaud Disease,
    Jaundice,
    Hepatitis A,
    Hay Fever,
    Kidney Calculi,
    Age-related Macular Degeneration,
    Migraine,
    Otitis.


  • Data


  • The original data downloaded from MEDLINE are available here in 10Pubmed.zip bundles.
    You will need unzip to open them. Each subdirectory in the bundle holds the documents for one disease, and each document within a subdirectory is indexed by number. The raw download contains 15569 documents. After pre-processing, 15565 documents remain, because the Porter stemming step skips 4 of them. The Matlab version (below) therefore represents 15565 documents. The number of documents for each disease is listed in the following table.

    Disease                            Number of Documents
    Gout                               543
    Chickenpox                         732
    Raynaud Disease                    343
    Jaundice                           503
    Hepatitis A                        796
    Hay Fever                          1517
    Kidney Calculi                     1549
    Age-related Macular Degeneration   3283
    Migraine                           3703
    Otitis                             2596
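
As a quick consistency check, the per-disease counts in the table can be summed and compared with the stated total of 15565 documents. The short sketch below does this and also reports each disease's share of the collection, which makes the class imbalance visible. The names `DOC_COUNTS` and `summarize` are illustrative only and are not part of the data set.

```python
# Document counts copied from the table above.
DOC_COUNTS = {
    "Gout": 543,
    "Chickenpox": 732,
    "Raynaud Disease": 343,
    "Jaundice": 503,
    "Hepatitis A": 796,
    "Hay Fever": 1517,
    "Kidney Calculi": 1549,
    "Age-related Macular Degeneration": 3283,
    "Migraine": 3703,
    "Otitis": 2596,
}


def summarize(counts):
    """Return the total document count and each class's share of it."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(DOC_COUNTS)
largest = max(DOC_COUNTS, key=DOC_COUNTS.get)
smallest = min(DOC_COUNTS, key=DOC_COUNTS.get)

print(f"total documents: {total}")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s}{share:6.1%}")
```

The classes are far from balanced, so anyone evaluating a clustering or classification method on this data may want to report per-class results as well as overall figures.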


  • Matlab Download


  • Below is a processed version of the 10Pubmed data set that is easy to read into Matlab, including:

    docWordMat.mat
    label.mat
    wordList.mat
    map.txt

    • docWordMat.mat contains the document-word matrix (15565 documents x 22437 words).
    • label.mat is simply a list of label ids (i.e., 1-10).
    • wordList.mat contains the vocabulary for the indexed data. The line number corresponds to the index number of the word; that is, the word on the first line is word #1, the word on the second line is word #2, etc.
    • map.txt maps label ids to label names.
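
Because word and label ids start at 1, code written in a 0-based language needs an offset when looking things up. The sketch below shows one way to handle this in Python, assuming the file contents have already been read into plain lists and a dictionary (for example with `scipy.io.loadmat` for the .mat files). The variable names stored inside the .mat files and the exact layout of map.txt are not documented here, so the toy values and the helper names `word_for_id` and `documents_per_label` are hypothetical.

```python
from collections import Counter


def word_for_id(word_list, word_id):
    """Look up a word by its 1-based id (word #1 is the first entry)."""
    if not 1 <= word_id <= len(word_list):
        raise IndexError(f"word id {word_id} is outside 1..{len(word_list)}")
    return word_list[word_id - 1]


def documents_per_label(labels, label_names):
    """Count documents per label name, given one label id per document."""
    counts = Counter(labels)
    return {label_names[i]: counts.get(i, 0) for i in sorted(label_names)}


# Toy stand-ins for wordList.mat, label.mat and map.txt (hypothetical values).
word_list = ["allopurinol", "attack", "gout"]
labels = [1, 1, 2, 1]
label_names = {1: "disease A", 2: "disease B"}

first_word = word_for_id(word_list, 1)
per_label = documents_per_label(labels, label_names)
```

On the real data, the per-label counts computed this way should reproduce the document counts listed in the Data section, which is a convenient check that the files were read correctly.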