Terimber 2.0
tokenizer Class Reference
Tokenizes an input string into atomic tokens.
#include <tokenizer.h>
|
Public Member Functions |
| tokenizer () |
| constructor
|
| ~tokenizer () |
| destructor
|
bool | add_regex (const char *regex, size_t key, size_t min, size_t max) |
| adds a regular expression to the internal map; the tokenizer recognizes the longest possible match
|
bool | load_regex (const char *file_name) |
| loads regular expressions from a file; each line must have the format: regex TAB key TAB min TAB max, terminated by a newline
|
void | clear_regex () |
| cleans up regular expressions
|
bool | add_abbreviation (const char *abbr) |
| adds abbreviation to the internal map
|
bool | load_abbr (const char *file_name) |
| loads abbreviations from a file; the file must have the following format: abbr
|
void | clear_abbr () |
| cleans up abbreviations
|
bool | tokenize (const char *str, tokenizer_output_sequence_t &out, byte_allocator &all, size_t flags=T_ALL) const |
| tokenizes a string into an array of tokens
|
const string_t & | get_last_error () const |
| gets the last error
|
Private Member Functions |
void | clear () |
| clears all internal resources
|
void | do_regex (const char *phrase, tokenizer_output_sequence_t &tokens) const |
| processes regular expressions, if any
|
void | do_abbr (const char *phrase, tokenizer_output_sequence_t &tokens) const |
| processes abbreviations
|
void | do_hyphen (const char *phrase, tokenizer_output_sequence_t &tokens) const |
| processes hyphens
|
size_t | match (const char *x, size_t len, size_t tokens, size_t &key) const |
| matches the input string and the regular expression
|
Private Attributes |
string_t | _error |
| error
|
abbreviation_map_t | _abbr_map |
| map of hashed abbreviations
|
regex_map_t | _regex_map |
| map of regular expressions
|
Detailed Description
Tokenizes an input string into atomic tokens.
Definition at line 136 of file tokenizer.h.
Constructor & Destructor Documentation
tokenizer::~tokenizer()
Member Function Documentation
bool tokenizer::add_regex(const char *regex, size_t key, size_t min, size_t max)
Adds a regular expression to the internal map. The tokenizer recognizes the longest possible match.
- Parameters:
  - regex: regular expression
  - key: a parameter that will be assigned to the detected regular expression
  - min: minimum number of atomic tokens in a regular expression
  - max: maximum number of atomic tokens in a regular expression
Definition at line 132 of file tokenizer.cpp.
References _error, _regex_map, base_map< K, T, Pr, M >::end(), and map< K, T, Pr, M >::insert().
Referenced by load_regex().
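The longest-match rule and the min/max token bounds described above can be illustrated with a minimal, self-contained sketch. It is not Terimber's implementation: it uses std::regex in place of the library's PCRE-based map, and pattern_entry and longest_match are hypothetical names introduced only for this example. It assumes a match must cover a whole window of between min and max consecutive atomic tokens.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for one registered pattern (not Terimber's pcre_entry).
struct pattern_entry {
    std::regex  re;
    std::size_t key;  // value reported when this pattern matches
    std::size_t min;  // fewest atomic tokens the pattern may span
    std::size_t max;  // most atomic tokens the pattern may span
};

// Returns how many atomic tokens the longest match starting at tokens[pos]
// consumes, or 0 if no pattern matches. key_out is set only on a match.
std::size_t longest_match(const std::vector<pattern_entry>& patterns,
                          const std::vector<std::string>& tokens,
                          std::size_t pos, std::size_t& key_out)
{
    if (pos >= tokens.size()) return 0;
    const std::size_t avail = tokens.size() - pos;
    std::size_t best = 0;
    for (const pattern_entry& p : patterns) {
        // Try the widest window first; stop once a window cannot beat `best`.
        for (std::size_t n = std::min(p.max, avail); n >= p.min && n > best; --n) {
            std::string window;
            for (std::size_t i = 0; i < n; ++i) window += tokens[pos + i];
            if (std::regex_match(window, p.re)) {
                best = n;
                key_out = p.key;
                break;
            }
        }
    }
    return best;
}
```

Checking wider windows first means the first hit for each pattern is already its longest, so shorter windows for that pattern can be skipped.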
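bool tokenizer::load_regex(const char *file_name)
Loads regular expressions from a file. Each line must have the format: regex TAB key TAB min TAB max, terminated by a newline.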
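The member summary gives the regex file format as regex, key, min, and max separated by tabs, one entry per line. A rough sketch of parsing one such line follows. The names regex_line and parse_regex_line are hypothetical, and the checks for an empty regex and for min <= max are assumptions of this sketch rather than documented behavior of load_regex().

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Hypothetical record for one line of the regex file.
struct regex_line {
    std::string regex;
    std::size_t key = 0;
    std::size_t min = 0;
    std::size_t max = 0;
};

// Parses "regex<TAB>key<TAB>min<TAB>max". Returns false on malformed input.
bool parse_regex_line(const std::string& line, regex_line& out)
{
    std::istringstream in(line);
    std::string key, min, max;
    if (!std::getline(in, out.regex, '\t') || !std::getline(in, key, '\t') ||
        !std::getline(in, min, '\t') || !std::getline(in, max))
        return false;  // fewer than four tab-separated fields
    try {
        out.key = std::stoul(key);
        out.min = std::stoul(min);
        out.max = std::stoul(max);
    } catch (const std::exception&) {
        return false;  // a numeric field did not start with a number
    }
    // Assumed sanity checks; the original page does not state them.
    return !out.regex.empty() && out.min <= out.max;
}
```

Each successfully parsed record could then be passed to add_regex(), which the page lists as referenced by load_regex().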
void tokenizer::clear_regex()
cleans up regular expressions
Definition at line 247 of file tokenizer.cpp.
References _regex_map, base_map< K, T, Pr, M >::begin(), map< K, T, Pr, M >::clear(), base_map< K, T, Pr, M >::end(), and base_map< pcre_key, pcre_entry, less< pcre_key >, M >::iterator.
Referenced by clear(), and fuzzy_matcher_impl::reset().
bool tokenizer::add_abbreviation(const char *abbr)
adds abbreviation to the internal map
- Parameters:
  - abbr: abbreviation to add
Definition at line 260 of file tokenizer.cpp.
References _abbr_map, do_hash_lowercase(), base_map< K, T, Pr, M >::end(), map< K, T, Pr, M >::insert(), and os_minus_one.
Referenced by load_abbr().
bool tokenizer::load_abbr(const char *file_name)
Loads abbreviations from a file. The file must have the following format: abbr.
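void tokenizer::clear_abbr()
Cleans up abbreviations.
bool tokenizer::tokenize(const char *str, tokenizer_output_sequence_t &out, byte_allocator &all, size_t flags = T_ALL) const
Tokenizes a string into an array of tokens.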
- Parameters:
  - str: input string
  - out: [out] tokenized atomic tokens
  - all: allocator
  - flags: tokenization flags
Definition at line 321 of file tokenizer.cpp.
References base_list< T >::begin(), base_list< T >::clear(), detect_token_type(), do_abbr(), do_hyphen(), do_regex(), base_list< T >::end(), _list< T, A >::push_back(), T_ABBR, T_HYPHEN, T_REGEX, TT_ALPHABETIC, TT_DIGIT, TT_UNKNOWN, and TT_WHITESPACE.
Referenced by fuzzy_matcher_impl::_match(), fuzzy_matcher_impl::add(), and fuzzy_matcher_impl::remove().
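The references above (detect_token_type(), TT_ALPHABETIC, TT_DIGIT, TT_WHITESPACE, TT_UNKNOWN) suggest that tokenize() first splits the input into atomic tokens by character class before the regex, abbreviation, and hyphen passes run. The sketch below shows one plausible way to do that first split. It is an assumption, not the library's code: token_kind, atom, classify, and split_atoms are hypothetical names, and treating every unclassified character as its own token is a choice made for this example.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token classes, loosely mirroring the TT_* names on this page.
enum class token_kind { alphabetic, digit, whitespace, unknown };

struct atom {
    token_kind  kind;
    std::string text;
};

token_kind classify(unsigned char c)
{
    if (std::isalpha(c)) return token_kind::alphabetic;
    if (std::isdigit(c)) return token_kind::digit;
    if (std::isspace(c)) return token_kind::whitespace;
    return token_kind::unknown;
}

// Splits s into maximal runs of one character class. Characters classified
// as unknown (punctuation and the like) each become a separate atom.
std::vector<atom> split_atoms(const std::string& s)
{
    std::vector<atom> out;
    for (char ch : s) {
        const token_kind k = classify(static_cast<unsigned char>(ch));
        if (!out.empty() && out.back().kind == k && k != token_kind::unknown)
            out.back().text += ch;                   // extend the current run
        else
            out.push_back({k, std::string(1, ch)});  // start a new atom
    }
    return out;
}
```

Later passes can then merge neighboring atoms, which is how the page describes do_regex() and do_abbr() operating on the token sequence in place.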
const string_t& tokenizer::get_last_error() const [inline]
Gets the last error.
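void tokenizer::clear() [private]
Clears all internal resources.
void tokenizer::do_regex(const char *phrase, tokenizer_output_sequence_t &tokens) const [private]
Processes regular expressions, if any.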
- Parameters:
  - phrase: input phrase
  - tokens: [in,out] atomic tokens
Definition at line 420 of file tokenizer.cpp.
References _regex_map, base_list< T >::begin(), base_map< pcre_key, pcre_entry, less< pcre_key >, M >::const_iterator, base_map< K, T, Pr, M >::empty(), base_list< T >::end(), base_map< K, T, Pr, M >::end(), _list< T, A >::erase(), match(), TT_REGEX, and base_map< K, T, Pr, M >::upper_bound().
Referenced by tokenize().
void tokenizer::do_abbr(const char *phrase, tokenizer_output_sequence_t &tokens) const [private]
Processes abbreviations.
- Parameters:
  - phrase: input phrase
  - tokens: [in,out] atomic tokens
Definition at line 484 of file tokenizer.cpp.
References _abbr_map, base_list< T >::begin(), do_hash_lowercase(), base_map< K, T, Pr, M >::end(), base_list< T >::end(), _list< T, A >::erase(), base_map< K, T, Pr, M >::const_iterator::key(), base_map< K, T, Pr, M >::lower_bound(), str_template::strnocasecmp(), TT_ABBR, TT_ALPHABETIC, and TT_DOT.
Referenced by tokenize().
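The references to do_hash_lowercase(), strnocasecmp(), TT_ALPHABETIC, TT_DOT, and TT_ABBR suggest that the abbreviation pass looks for an alphabetic token followed by a dot, compares it case-insensitively against the stored abbreviations, and merges the pair into one abbreviation token. The following is a simplified sketch of that idea only. It uses std::unordered_set rather than the library's hashed map, assumes abbreviations are stored in lowercase with a trailing dot, and does not handle multi-dot forms. to_lower and merge_abbreviations are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Lowercases an ASCII string so lookups are case-insensitive.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Merges "word" + "." into a single token when the lowercase form "word."
// is a known abbreviation. All other tokens are copied unchanged.
std::vector<std::string> merge_abbreviations(
    const std::vector<std::string>& tokens,
    const std::unordered_set<std::string>& abbrs)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i + 1 < tokens.size() && tokens[i + 1] == "." &&
            abbrs.count(to_lower(tokens[i]) + ".") != 0) {
            out.push_back(tokens[i] + ".");  // keep the original casing
            ++i;                             // skip the dot just consumed
        } else {
            out.push_back(tokens[i]);
        }
    }
    return out;
}
```

Keeping the dot attached to a known abbreviation prevents a later stage from mistaking it for a sentence boundary, which is the usual motivation for a pass like this.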
size_t tokenizer::match(const char *x, size_t len, size_t tokens, size_t &key) const [private]
Matches the input string against the registered regular expressions.
- Parameters:
  - x: input string
  - len: length
  - tokens: tokens to match
  - key: [out] regex key
Definition at line 390 of file tokenizer.cpp.
References _regex_map, base_map< pcre_key, pcre_entry, less< pcre_key >, M >::const_iterator, base_map< K, T, Pr, M >::end(), and base_map< K, T, Pr, M >::lower_bound().
Referenced by do_regex().
Member Data Documentation
The documentation for this class was generated from the following files:
- tokenizer.h
- tokenizer.cpp