Commit 0a1ec40

* Add draft work on features
1 parent 7d432b7 commit 0a1ec40

1 file changed

Lines changed: 78 additions & 0 deletions

docs/source/features.rst

@@ -0,0 +1,78 @@
Lexeme Features
===============

A lexeme is an entry in the lexicon (the vocabulary) for a word, punctuation
symbol, whitespace unit, etc. Lexemes come with a range of pre-computed
information that helps you write good feature functions. Features are
integer-valued where possible: instead of storing strings, spaCy assigns each
string a consecutive ID number, which you can use to look up the string value
if necessary.

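The string-to-ID idea described above can be illustrated with a small interning table. This is a minimal sketch under stated assumptions, not spaCy's actual implementation: the class name ``ToyStringTable`` and its methods ``intern`` and ``lookup`` are hypothetical and exist only for this example.

```python
class ToyStringTable:
    """Hypothetical illustration: map strings to consecutive integer IDs."""

    def __init__(self):
        self._ids = {}      # string -> ID
        self._strings = []  # ID -> string (the list index is the ID)

    def intern(self, string):
        """Return the ID for `string`, assigning the next free ID if it is new."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Recover the string value for a previously assigned ID."""
        return self._strings[string_id]


table = ToyStringTable()
ids = [table.intern(w) for w in ["the", "cat", "saw", "the", "dog"]]
print(ids)              # [0, 1, 2, 0, 3]
print(table.lookup(1))  # cat
```

Feature functions can then compare and hash small integers, and only consult the table when the original string is actually needed.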
String features
---------------

+---------+-------------------------------------------------------------------+
| SIC     | The word as it appeared in the sentence, unaltered.               |
+---------+-------------------------------------------------------------------+
| NORM    | For frequent words, case normalization is applied.                |
|         | Otherwise, back off to SHAPE.                                     |
+---------+-------------------------------------------------------------------+
| SHAPE   | Remap the characters of the word as follows:                      |
|         |                                                                   |
|         | a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*|
|         |                                                                   |
|         | Trim runs of the same symbol longer than 3 to length 3, e.g.      |
|         |                                                                   |
|         | apples --> xxx, Apples --> Xxxx, app9LES@ --> xxxdXXX*            |
+---------+-------------------------------------------------------------------+
| ASCIIED | Use unidecode.unidecode(sic) to approximate the word using only   |
|         | ASCII characters.                                                 |
+---------+-------------------------------------------------------------------+
| PREFIX  | sic_unicode_string[:1]                                            |
+---------+-------------------------------------------------------------------+
| SUFFIX  | sic_unicode_string[-3:]                                           |
+---------+-------------------------------------------------------------------+


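The SHAPE, PREFIX and SUFFIX rows can be sketched in plain Python. This is a minimal sketch, not spaCy's implementation: the function names ``word_shape``, ``prefix`` and ``suffix`` are hypothetical, and the sketch assumes that the trimming rule applies to runs of the same shape symbol and that digits map to ``d``, as the mapping row states.

```python
# Characters that the SHAPE mapping keeps unchanged ("self" in the table).
_KEEP = set(",.;:\"'?!$-")


def word_shape(word):
    """Map each character to a shape symbol and cap runs of one symbol at 3."""
    shape = []
    run_symbol, run_length = None, 0
    for ch in word:
        if "a" <= ch <= "z":
            symbol = "x"
        elif "A" <= ch <= "Z":
            symbol = "X"
        elif "0" <= ch <= "9":
            symbol = "d"
        elif ch in _KEEP:
            symbol = ch
        else:
            symbol = "*"
        # Track the length of the current run of identical shape symbols.
        if symbol == run_symbol:
            run_length += 1
        else:
            run_symbol, run_length = symbol, 1
        # Assumption: emit at most three symbols per run.
        if run_length <= 3:
            shape.append(symbol)
    return "".join(shape)


def prefix(word):
    """First character of the word, as in sic_unicode_string[:1]."""
    return word[:1]


def suffix(word):
    """Last three characters of the word, as in sic_unicode_string[-3:]."""
    return word[-3:]


print(word_shape("apples"))    # xxx
print(word_shape("Apples"))    # Xxxx
print(word_shape("app9LES@"))  # xxxdXXX*
print(prefix("Apples"), suffix("Apples"))  # A les
```

Slicing rather than indexing keeps ``prefix`` and ``suffix`` safe on empty or very short strings.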
Integer features
----------------

+--------------+--------------------------------------------------------------+
| LENGTH       | Length of the string, in Unicode characters                  |
+--------------+--------------------------------------------------------------+
| CLUSTER      | Brown cluster                                                |
+--------------+--------------------------------------------------------------+
| POS_TYPE     | K-means cluster of the word's tag affinities                 |
+--------------+--------------------------------------------------------------+
| SENSE_TYPE   | K-means cluster of the word's sense affinities               |
+--------------+--------------------------------------------------------------+

Boolean features
----------------

+-------------+--------------------------------------------------------------+
| IS_ALPHA    | The result of sic.isalpha()                                  |
+-------------+--------------------------------------------------------------+
| IS_ASCII    | Check whether all the word's characters are ASCII characters |
+-------------+--------------------------------------------------------------+
| IS_DIGIT    | The result of sic.isdigit()                                  |
+-------------+--------------------------------------------------------------+
| IS_LOWER    | The result of sic.islower()                                  |
+-------------+--------------------------------------------------------------+
| IS_PUNCT    | Check whether all characters are in the class TODO           |
+-------------+--------------------------------------------------------------+
| IS_SPACE    | The result of sic.isspace()                                  |
+-------------+--------------------------------------------------------------+
| IS_TITLE    | The result of sic.istitle()                                  |
+-------------+--------------------------------------------------------------+
| IS_UPPER    | The result of sic.isupper()                                  |
+-------------+--------------------------------------------------------------+
| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  |
|             | for a low false negative rate.                               |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric    |
|             | entity (e.g. 10,000, 10th, .10). Aims for a low false        |
|             | negative rate.                                               |
+-------------+--------------------------------------------------------------+
| IN_LIST     | Facility for loading arbitrary run-time word lists?          |
+-------------+--------------------------------------------------------------+
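Most of the boolean features map directly onto Python string methods, so they can be sketched in a few lines. The following is an illustration under stated assumptions, not spaCy's code: the names ``boolean_features`` and ``like_number`` are hypothetical, and the regular expression behind ``like_number`` is only a guess at a permissive rule that accepts the examples given in the table. IS_PUNCT, LIKE_URL and IN_LIST are left out because the text does not specify them.

```python
import re

# Hypothetical pattern: optional sign, digits with optional thousands
# separators, optional decimal part, optional ordinal suffix.
_NUMBER_RE = re.compile(r"^[+-]?(\d[\d,]*)?(\.\d+)?(st|nd|rd|th)?$")


def like_number(word):
    """Permissive check: require at least one digit and a number-like layout."""
    has_digit = any(ch.isdigit() for ch in word)
    return has_digit and _NUMBER_RE.match(word) is not None


def boolean_features(sic):
    """Compute the boolean features that the table defines unambiguously."""
    return {
        "IS_ALPHA": sic.isalpha(),
        "IS_ASCII": all(ord(ch) < 128 for ch in sic),
        "IS_DIGIT": sic.isdigit(),
        "IS_LOWER": sic.islower(),
        "IS_SPACE": sic.isspace(),
        "IS_TITLE": sic.istitle(),
        "IS_UPPER": sic.isupper(),
        "LIKE_NUMBER": like_number(sic),  # assumption, see the lead-in
    }


features = boolean_features("Apples")
print(features["IS_ALPHA"], features["IS_TITLE"], features["IS_LOWER"])
# True True False
print([like_number(w) for w in ["10,000", "10th", ".10", "ten"]])
# [True, True, True, False]
```

The pattern deliberately errs on the side of accepting odd strings such as ``10,``, in keeping with the stated preference for a low false negative rate.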