|
| 1 | +Lexeme Features |
| 2 | +=============== |
| 3 | + |
| 4 | +A lexeme is an entry in the lexicon --- the vocabulary --- for a word, punctuation |
| 5 | +symbol, whitespace unit, etc. Lexemes come with a lot of pre-computed information |
| 6 | +that helps you write good feature functions. Features are integer-valued where |
| 7 | +possible: instead of storing strings, spaCy refers to them by consecutive ID numbers, |
| 8 | +which you can use to look up the string values if necessary. |
| 9 | + |
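The idea of referring to strings by consecutive ID numbers can be sketched as a small interning table. This is only an illustration: the class name `StringTable` and its methods `intern` and `lookup` are hypothetical and are not taken from spaCy's API.

```python
class StringTable:
    """Hypothetical two-way mapping between strings and consecutive integer IDs."""

    def __init__(self):
        self._ids = {}      # string -> ID
        self._strings = []  # ID -> string (the list index is the ID)

    def intern(self, string):
        """Return the ID for `string`, assigning the next free ID if it is new."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Recover the original string from its ID."""
        return self._strings[string_id]


table = StringTable()
ids = [table.intern(word) for word in ["the", "cat", "saw", "the", "dog"]]
words = [table.lookup(i) for i in ids]
```

Feature functions can then compare and hash small integers, and the string is recovered only when it is actually needed.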
| 10 | +String features |
| 11 | +--------------- |
| 12 | + |
| 13 | ++---------+-------------------------------------------------------------------+ |
| 14 | +| SIC     | The word as it appeared in the sentence, unaltered.               | |
| 15 | ++---------+-------------------------------------------------------------------+ |
| 16 | +| NORM    | For frequent words, case normalization is applied.                | |
| 17 | +|         | Otherwise, back off to SHAPE.                                     | |
| 18 | ++---------+-------------------------------------------------------------------+ |
| 19 | +| SHAPE   | Remap the characters of the word as follows:                      | |
| 20 | +|         |                                                                   | |
| 21 | +|         | a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*| |
| 22 | +|         |                                                                   | |
| 23 | +|         | Trim sequences of length 3+ to 3, e.g.                            | |
| 24 | +|         |                                                                   | |
| 25 | +|         | apples --> xxx, Apples --> Xxxx, app9LES@ --> xxxdXXX*            | |
| 26 | ++---------+-------------------------------------------------------------------+ |
| 27 | +| ASCIIED | Use unidecode.unidecode(sic) to approximate the word using        | |
| 28 | +|         | ASCII characters.                                                 | |
| 29 | ++---------+-------------------------------------------------------------------+ |
| 30 | +| PREFIX  | sic_unicode_string[:1]                                            | |
| 31 | ++---------+-------------------------------------------------------------------+ |
| 32 | +| SUFFIX  | sic_unicode_string[-3:]                                           | |
| 33 | ++---------+-------------------------------------------------------------------+ |
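The SHAPE, PREFIX and SUFFIX rows can be sketched in a few lines of Python. This is a minimal sketch, not spaCy's implementation: the function names are hypothetical, and it assumes that "trim" means collapsing any run of four or more identical mapped characters down to three.

```python
import re

# Punctuation characters that the SHAPE rule maps to themselves.
KEPT_PUNCT = set(',.;:"\'?!$-')


def word_shape(sic):
    """Map each character to its shape class, then trim long runs (assumed rule)."""
    mapped = []
    for ch in sic:
        if "a" <= ch <= "z":
            mapped.append("x")
        elif "A" <= ch <= "Z":
            mapped.append("X")
        elif "0" <= ch <= "9":
            mapped.append("d")
        elif ch in KEPT_PUNCT:
            mapped.append(ch)
        else:
            mapped.append("*")
    # Assumption: a run of 4+ identical mapped characters is cut down to 3.
    return re.sub(r"(.)\1{3,}", r"\1\1\1", "".join(mapped))


def prefix(sic):
    """First character of the word, as in sic_unicode_string[:1]."""
    return sic[:1]


def suffix(sic):
    """Last three characters of the word, as in sic_unicode_string[-3:]."""
    return sic[-3:]


shapes = [word_shape(w) for w in ["apples", "Apples", "app9LES@"]]
```

Under the `0-9 --> d` rule, the digit in `app9LES@` maps to `d`, so the sketch produces `xxxdXXX*` for that word.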
| 34 | + |
| 35 | + |
| 36 | +Integer features |
| 37 | +---------------- |
| 38 | + |
| 39 | ++--------------+--------------------------------------------------------------+ |
| 40 | +| LENGTH       | Length of the string, in Unicode characters                  | |
| 41 | ++--------------+--------------------------------------------------------------+ |
| 42 | +| CLUSTER      | Brown cluster                                                | |
| 43 | ++--------------+--------------------------------------------------------------+ |
| 44 | +| POS_TYPE     | K-means cluster of word's tag affinities                     | |
| 45 | ++--------------+--------------------------------------------------------------+ |
| 46 | +| SENSE_TYPE   | K-means cluster of word's sense affinities                   | |
| 47 | ++--------------+--------------------------------------------------------------+ |
| 48 | + |
| 49 | +Boolean features |
| 50 | +---------------- |
| 51 | + |
| 52 | ++-------------+--------------------------------------------------------------+ |
| 53 | +| IS_ALPHA    | The result of sic.isalpha()                                  | |
| 54 | ++-------------+--------------------------------------------------------------+ |
| 55 | +| IS_ASCII    | Check whether all the word's characters are ASCII characters | |
| 56 | ++-------------+--------------------------------------------------------------+ |
| 57 | +| IS_DIGIT    | The result of sic.isdigit()                                  | |
| 58 | ++-------------+--------------------------------------------------------------+ |
| 59 | +| IS_LOWER    | The result of sic.islower()                                  | |
| 60 | ++-------------+--------------------------------------------------------------+ |
| 61 | +| IS_PUNCT    | Check whether all characters are in the class TODO           | |
| 62 | ++-------------+--------------------------------------------------------------+ |
| 63 | +| IS_SPACE    | The result of sic.isspace()                                  | |
| 64 | ++-------------+--------------------------------------------------------------+ |
| 65 | +| IS_TITLE    | The result of sic.istitle()                                  | |
| 66 | ++-------------+--------------------------------------------------------------+ |
| 67 | +| IS_UPPER    | The result of sic.isupper()                                  | |
| 68 | ++-------------+--------------------------------------------------------------+ |
| 69 | +| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  | |
| 70 | +|             | for a low false-negative rate.                               | |
| 71 | ++-------------+--------------------------------------------------------------+ |
| 72 | +| LIKE_NUMBER | Check whether the string looks like it could be a numeric    | |
| 73 | +|             | entity, e.g. 10,000, 10th or .10; aims for a low             | |
| 74 | +|             | false-negative rate.                                         | |
| 75 | ++-------------+--------------------------------------------------------------+ |
| 76 | +| IN_LIST     | Facility for loading arbitrary run-time word lists?          | |
| 77 | ++-------------+--------------------------------------------------------------+ |
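Most of the boolean features map directly onto Python string methods, so a rough sketch is short. The function names and the LIKE_NUMBER pattern below are assumptions made for illustration; the table does not specify how spaCy implements that check. IS_PUNCT is left out because its character class is still marked TODO.

```python
import re


def boolean_features(sic):
    """Compute the flags that are defined directly in terms of str methods."""
    return {
        "IS_ALPHA": sic.isalpha(),
        "IS_ASCII": all(ord(ch) < 128 for ch in sic),
        "IS_DIGIT": sic.isdigit(),
        "IS_LOWER": sic.islower(),
        "IS_SPACE": sic.isspace(),
        "IS_TITLE": sic.istitle(),
        "IS_UPPER": sic.isupper(),
    }


# Hypothetical pattern: optional sign, digits with commas, optional decimal
# part, optional ordinal suffix. Deliberately permissive.
_NUMBER_LIKE = re.compile(r"^[+-]?(\d[\d,]*)?(\.\d+)?(st|nd|rd|th)?$")


def like_number(sic):
    """Guess whether the string could be a numeric entity (assumed heuristic)."""
    has_digit = any(ch.isdigit() for ch in sic)
    return has_digit and bool(_NUMBER_LIKE.match(sic))


flags = boolean_features("Apple")
numeric = [like_number(s) for s in ["10,000", "10th", ".10", "apple"]]
```

The pattern is loose on purpose: for a feature like this, accepting a few non-numbers costs less than missing real ones, which matches the stated preference for a low false-negative rate.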
| 78 | + |