Next: Datasets Up: Document Recovery from Bag-of-Word Previous: Recovery from bigram count


Table: Statistics and example sentences from each domain. s/d = sentences per document; w/d = word tokens per document. The fourth column lists (1) the original document, (2) the document recovered from a count BOW, (3) the document recovered from an indicator BOW, and (4) the document recovered from a stopwords-removed count BOW.
Domain   s/d  w/d   Example
Medical  8    153   (1) She also had some breathlessness .
                    (2) She also had some breathlessness .
                    (3) She also had some breathlessness .
                    (4) We have had some breathlessness .

                    (1) He was not wearing a helmet and was seen unconscious when paramedics arrived .
                    (2) He was unconscious when paramedics arrived and was seen not wearing a helmet .
                    (3) He was not wearing a helmet and was unconscious when paramedics arrived . seen .
                    (4) He was seen wearing a helmet was not unconscious when paramedics arrived .

         22   487   (1) These regiments are under a divisional headquarters called the 324 B Division .
                    (2) These are called the B 324 Division regiments under a divisional headquarters .
                    (3) B Division 324 These are called the regiments under a divisional headquarters .
                    (4) 324 regiments under Division B of A are called the divisional headquarters .

                    (1) The President had his breakfast during the meeting is [sic] the Situation Room Conference Room .
                    (2) The Situation Room is the President had his breakfast meeting during the Conference Room .
                    (3) The President had his breakfast . The Situation Room is the meeting during the Conference .
                    (4) It is during this meeting that the President had breakfast Situation Room Conference Room .

         11   228   (1) I sincerely wish all of you the best in your future endeavors .
                    (2) I sincerely wish all of you the best in your future endeavors .
                    (3) I sincerely wish all of you the best of the best in all your future endeavors .
                    (4) sincerely wish all the best in future endeavors .

                    (1) PIRA is coming in May to do their semi - annual energy outlook .
                    (2) May PIRA is coming in to do their semi - annual energy outlook .
                    (3) May . PIRA is coming in to do their semi - annual energy outlook .
                    (4) PIRA is a semi - annual energy outlook for this coming May .

         13   400   (1) Our quality systems are ISO/TS16949 ( 2002 version ) certified .
                    (2) Our quality systems are certified ISO/TS16949 ( 2002 version ) .
                    (3) ISO/TS16949 Our quality systems are certified version ) ( 2002 ) .
                    (4) ISO/TS16949 quality systems are certified to the 2002 and version ( ) .

                    (1) This department operates under the name of Stock Yards Trust Company .
                    (2) This department operates under the name of Trust Stock Yards Company .
                    (3) This department operates under the name of the Trust . Stock Yards Company .
                    (4) Trust in the Stock Yards Company operates under the name department .

         111  1771  (1) B : uh - huh um - hum
                    (2) B : huh uh - um - hum .
                    (3) B : uh huh um - hum .
                    (4) Go to : A - B - um uh huh hum .

                    (1) A : halfway there so that 's good .
                    (2) A halfway there : so that 's good .
                    (3) A : halfway there so that 's good .
                    (4) so good : it 's halfway there .
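
The three bag-of-words (BOW) variants named in the table caption differ only in what they keep about each token. As a rough illustration, the Python sketch below builds all three from one of the Medical example sentences. The function names and the stopword list are hypothetical, chosen only for this example; the source does not specify which stopword list was used or whether tokens were case-folded.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only (not the list used in the source).
STOPWORDS = {"he", "was", "not", "a", "and", "when"}


def count_bow(tokens):
    """Count BOW: map each word type to the number of times it occurs."""
    return Counter(tokens)


def indicator_bow(tokens):
    """Indicator BOW: record only whether a word type occurs (1), not how often."""
    return {word: 1 for word in set(tokens)}


def stopword_removed_count_bow(tokens, stopwords=STOPWORDS):
    """Count BOW after dropping stopwords (matched case-insensitively here, by assumption)."""
    return Counter(t for t in tokens if t.lower() not in stopwords)


# The sentence is already tokenized in the table, so splitting on spaces suffices.
sentence = ("He was not wearing a helmet and was seen unconscious "
            "when paramedics arrived .")
tokens = sentence.split()

counts = count_bow(tokens)
indicators = indicator_bow(tokens)
reduced = stopword_removed_count_bow(tokens)

print(counts["was"])      # "was" occurs twice in the sentence
print(indicators["was"])  # the indicator BOW discards the repetition
print("was" in reduced)   # stopwords are absent from the reduced BOW
```

In all three cases word order is discarded, which is why the recovered sentences in rows (2) to (4) can reorder, drop, or insert words relative to the original in row (1).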

Nathanael Fillmore 2008-07-18