Terimber 2.0



tokenizer Class Reference

Tokenizes an input string, finding atomic tokens.

#include <tokenizer.h>


Public Member Functions

 tokenizer ()
 constructor
 ~tokenizer ()
 destructor
bool add_regex (const char *regex, size_t key, size_t min, size_t max)
 adds a regular expression to the internal map; the tokenizer recognizes the longest possible regular expression
bool load_regex (const char *file_name)
 loads regular expressions from a file; each line must have the format: regex TAB key TAB min TAB max ENDLINE
void clear_regex ()
 cleans up regular expressions
bool add_abbreviation (const char *abbr)
 adds an abbreviation to the internal map
bool load_abbr (const char *file_name)
 loads abbreviations from a file; the file must have the format: abbr
void clear_abbr ()
 cleans up abbreviations
bool tokenize (const char *str, tokenizer_output_sequence_t &out, byte_allocator &all, size_t flags=T_ALL) const
 tokenizes a string into an array of tokens
const string_t & get_last_error () const
 gets the last error

Private Member Functions

void clear ()
 clears all internal resources
void do_regex (const char *phrase, tokenizer_output_sequence_t &tokens) const
 processes regular expressions, if any
void do_abbr (const char *phrase, tokenizer_output_sequence_t &tokens) const
 processes abbreviations
void do_hyphen (const char *phrase, tokenizer_output_sequence_t &tokens) const
 processes hyphens
size_t match (const char *x, size_t len, size_t tokens, size_t &key) const
 matches the input string against the regular expression

Private Attributes

string_t _error
 error
abbreviation_map_t _abbr_map
 map of hashed abbreviations
regex_map_t _regex_map
 map of regular expressions


Detailed Description

Tokenizes an input string, finding atomic tokens.

Definition at line 136 of file tokenizer.h.


Constructor & Destructor Documentation

tokenizer::tokenizer (  ) 

constructor

Definition at line 113 of file tokenizer.cpp.

tokenizer::~tokenizer (  ) 

destructor

Definition at line 118 of file tokenizer.cpp.

References clear().


Member Function Documentation

bool tokenizer::add_regex ( const char *  regex,
size_t  key,
size_t  min,
size_t  max 
)

Adds a regular expression to the internal map. The tokenizer recognizes the longest possible regular expression.

Parameters:
regex  regular expression
key  a parameter, which will be assigned to the detected regular expression
min  minimum atomic tokens in a regular expression
max  maximum atomic tokens in a regular expression

Definition at line 132 of file tokenizer.cpp.

References _error, _regex_map, base_map< K, T, Pr, M >::end(), and map< K, T, Pr, M >::insert().

Referenced by load_regex().
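
The "longest possible regular expression" rule can be illustrated with a minimal sketch. It is not the Terimber implementation: it swaps in std::regex for the PCRE-based map the class references, measures length in characters rather than atomic tokens, and the names pattern_list and longest_match are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the internal regex map: (pattern, key) pairs.
typedef std::vector<std::pair<std::regex, std::size_t> > pattern_list;

// Returns the length of the longest match anchored at the start of `text`
// and stores the key of the winning pattern in `key`.
// Returns 0 and leaves `key` untouched if no pattern matches.
std::size_t longest_match(const pattern_list& patterns,
                          const std::string& text,
                          std::size_t& key) {
    std::size_t best = 0;
    for (const auto& p : patterns) {
        std::smatch m;
        // match_continuous forces the match to begin at the first character.
        if (std::regex_search(text, m, p.first,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            if (len > best) {
                best = len;
                key = p.second;
            }
        }
    }
    return best;
}
```

With the patterns `[0-9]+` (key 1) and `[0-9]+\.[0-9]+` (key 2), the input "3.14 is pi" is claimed by key 2, because its match is longer than the integer match.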

bool tokenizer::load_regex ( const char *  file_name  ) 

Loads regular expressions from a file. Each line of the file must have the following format: regex TAB key TAB min TAB max ENDLINE

Parameters:
file_name  file name

Definition at line 154 of file tokenizer.cpp.

References _error, add_regex(), and str_template::strscan().
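
The line format described for load_regex() (regex, key, min and max separated by tabs) can be parsed roughly as follows. This is a sketch under stated assumptions, not the library's parser: the struct regex_line and the function parse_regex_line are hypothetical, and the checks for an empty regex and for min greater than max are additions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>

// Hypothetical record for one line of a regex file; not a Terimber type.
struct regex_line {
    std::string regex;
    std::size_t key;
    std::size_t min_tokens;
    std::size_t max_tokens;
};

// Parses "regex<TAB>key<TAB>min<TAB>max" into `out`.
// Returns false if a field is missing, a number cannot be read,
// the regex is empty, or min exceeds max.
bool parse_regex_line(const std::string& line, regex_line& out) {
    std::istringstream in(line);
    std::string regex, key, min_s, max_s;
    if (!std::getline(in, regex, '\t') || !std::getline(in, key, '\t') ||
        !std::getline(in, min_s, '\t') || !std::getline(in, max_s)) {
        return false;
    }
    try {
        regex_line parsed;
        parsed.regex = regex;
        parsed.key = static_cast<std::size_t>(std::stoul(key));
        parsed.min_tokens = static_cast<std::size_t>(std::stoul(min_s));
        parsed.max_tokens = static_cast<std::size_t>(std::stoul(max_s));
        if (parsed.regex.empty() || parsed.min_tokens > parsed.max_tokens) {
            return false;
        }
        out = parsed;
        return true;
    } catch (const std::exception&) {
        // std::stoul throws when a numeric field cannot be converted.
        return false;
    }
}
```

A caller would read the file line by line, pass each line to the parser, and forward the four fields to add_regex().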

void tokenizer::clear_regex (  ) 

cleans up regular expressions

bool tokenizer::add_abbreviation ( const char *  abbr  ) 

adds abbreviation to the internal map

Parameters:
abbr  abbreviation

Definition at line 260 of file tokenizer.cpp.

References _abbr_map, do_hash_lowercase(), base_map< K, T, Pr, M >::end(), map< K, T, Pr, M >::insert(), and os_minus_one.

Referenced by load_abbr().
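
The reference to do_hash_lowercase() suggests that abbreviations are stored case-insensitively. A minimal sketch of that idea follows. It assumes ASCII input and uses std::unordered_set instead of the library's own map. The class abbreviation_set, the helper to_lower_ascii, and the rule that duplicates and empty strings return false are all assumptions of this example.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns an ASCII-lowercased copy of the input.
std::string to_lower_ascii(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Hypothetical case-insensitive abbreviation container.
class abbreviation_set {
public:
    // Returns false for an empty string or an abbreviation already stored.
    bool add(const std::string& abbr) {
        if (abbr.empty()) return false;
        return _items.insert(to_lower_ascii(abbr)).second;
    }

    // Looks the word up after lowercasing it, so "DR." finds "Dr.".
    bool contains(const std::string& word) const {
        return _items.count(to_lower_ascii(word)) != 0;
    }

private:
    std::unordered_set<std::string> _items;
};
```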

bool tokenizer::load_abbr ( const char *  file_name  ) 

Loads abbreviations from a file. The file must have the following format: abbr

Parameters:
file_name  file name

Definition at line 273 of file tokenizer.cpp.

References _error, and add_abbreviation().

void tokenizer::clear_abbr (  ) 

cleans up abbreviations

Definition at line 314 of file tokenizer.cpp.

References _abbr_map, and map< K, T, Pr, M >::clear().

Referenced by clear(), and fuzzy_matcher_impl::reset().

bool tokenizer::tokenize ( const char *  str,
tokenizer_output_sequence_t &  out,
byte_allocator &  all,
size_t  flags = T_ALL 
) const

tokenizes a string into an array of tokens

Parameters:
str  input string
out  [out] tokenized atomic tokens
all  allocator
flags  tokenization flags

Definition at line 321 of file tokenizer.cpp.

References base_list< T >::begin(), base_list< T >::clear(), detect_token_type(), do_abbr(), do_hyphen(), do_regex(), base_list< T >::end(), _list< T, A >::push_back(), T_ABBR, T_HYPHEN, T_REGEX, TT_ALPHABETIC, TT_DIGIT, TT_UNKNOWN, and TT_WHITESPACE.

Referenced by fuzzy_matcher_impl::_match(), fuzzy_matcher_impl::add(), and fuzzy_matcher_impl::remove().
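
The first pass of tokenization, splitting the input into atomic tokens by character class, might look like the sketch below. The token classes loosely mirror the TT_* names referenced above, but the enum token_type, the struct token, and the function split_atomic are hypothetical. Whether the real tokenizer keeps whitespace tokens or treats each symbol separately is not stated in this reference, so both are assumptions here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token classes; not the library's TT_* constants.
enum class token_type { alphabetic, digit, whitespace, symbol };

struct token {
    token_type type;
    std::string text;
};

// Classifies a single character.
token_type classify(unsigned char c) {
    if (std::isalpha(c)) return token_type::alphabetic;
    if (std::isdigit(c)) return token_type::digit;
    if (std::isspace(c)) return token_type::whitespace;
    return token_type::symbol;
}

// Groups consecutive characters of the same class into one token.
// Each symbol character becomes a token of its own.
std::vector<token> split_atomic(const std::string& s) {
    std::vector<token> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const token_type t = classify(static_cast<unsigned char>(s[i]));
        if (!out.empty() && out.back().type == t && t != token_type::symbol) {
            out.back().text += s[i];
        } else {
            out.push_back(token{t, std::string(1, s[i])});
        }
    }
    return out;
}
```

Later passes such as do_regex(), do_abbr() and do_hyphen() would then combine these atomic tokens into larger ones, depending on the flags.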

const string_t& tokenizer::get_last_error (  )  const [inline]

gets the last error

Definition at line 190 of file tokenizer.h.

References _error.

void tokenizer::clear (  )  [private]

clears all internal resources

Definition at line 124 of file tokenizer.cpp.

References clear_abbr(), and clear_regex().

Referenced by ~tokenizer().

void tokenizer::do_regex ( const char *  phrase,
tokenizer_output_sequence_t &  tokens 
) const [private]

processes regular expressions, if any

void tokenizer::do_abbr ( const char *  phrase,
tokenizer_output_sequence_t &  tokens 
) const [private]

processes abbreviations

void tokenizer::do_hyphen ( const char *  phrase,
tokenizer_output_sequence_t &  tokens 
) const [private]

processes hyphens

Parameters:
phrase  input phrase
tokens  [in,out] atomic tokens

Definition at line 529 of file tokenizer.cpp.

References base_list< T >::begin(), ch_dash, base_list< T >::end(), _list< T, A >::erase(), TT_ALPHABETIC, TT_COMPOSE, TT_DIGIT, and TT_SYMBOL.

Referenced by tokenize().
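
The references to ch_dash, erase() and TT_COMPOSE suggest that hyphen processing merges word, dash, word runs into one compound token. The exact rule is not documented here, so the following is only a sketch of that general idea on plain strings. The functions is_alnum_word and merge_hyphens are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if the string is non-empty and every character is a letter or digit.
bool is_alnum_word(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s) {
        if (!std::isalnum(static_cast<unsigned char>(c))) return false;
    }
    return true;
}

// Merges each "word", "-", "word" run into a single compound token.
// Chains such as state-of-the-art collapse into one token.
std::vector<std::string> merge_hyphens(const std::vector<std::string>& in) {
    std::vector<std::string> out;
    bool back_is_word = false;  // out.back() is a word or a compound of words
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == "-" && back_is_word && i + 1 < in.size() &&
            is_alnum_word(in[i + 1])) {
            out.back() += "-" + in[i + 1];
            ++i;  // the right-hand word has been consumed
        } else {
            out.push_back(in[i]);
            back_is_word = is_alnum_word(in[i]);
        }
    }
    return out;
}
```

A dash that is not surrounded by words on both sides, for example after whitespace or at the end of the input, is left as a separate token.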

size_t tokenizer::match ( const char *  x,
size_t  len,
size_t  tokens,
size_t &  key 
) const [private]

matches the input string against the regular expression

Parameters:
x  input string
len  length
tokens  tokens to match
key  [out] regex key

Definition at line 390 of file tokenizer.cpp.

References _regex_map, base_map< pcre_key, pcre_entry, less< pcre_key >, M >::const_iterator, base_map< K, T, Pr, M >::end(), and base_map< K, T, Pr, M >::lower_bound().

Referenced by do_regex().


Member Data Documentation

string_t tokenizer::_error [private]

error

Definition at line 222 of file tokenizer.h.

Referenced by add_regex(), get_last_error(), load_abbr(), and load_regex().

abbreviation_map_t tokenizer::_abbr_map [private]

map of hashed abbreviations

Definition at line 223 of file tokenizer.h.

Referenced by add_abbreviation(), clear_abbr(), and do_abbr().

regex_map_t tokenizer::_regex_map [private]

map of regular expressions

Definition at line 224 of file tokenizer.h.

Referenced by add_regex(), clear_regex(), do_regex(), and match().


The documentation for this class was generated from the following files:

tokenizer.h
tokenizer.cpp

© Copyright Terimber 2003-.