tokenizer Class Reference

tokenizes an input string into atomic tokens

#include <tokenizer.h>

Public Member Functions

tokenizer ()

constructor

~tokenizer ()

destructor

bool add_regex (const char *regex, size_t key, size_t min, size_t max)

adds a regular expression to the internal map; the tokenizer recognizes the longest possible regular expression match

bool load_regex (const char *file_name)

loads regular expressions from a file; each line must have the format: regex <tab> key <tab> min <tab> max <endline>

void clear_regex ()

cleans up regular expressions

bool add_abbreviation (const char *abbr)

adds abbreviation to the internal map

bool load_abbr (const char *file_name)

loads abbreviations from a file; the file must have the following format: abbr

void clear_abbr ()

cleans up abbreviations

bool tokenize (const char *str, tokenizer_output_sequence_t &out, byte_allocator &all, size_t flags=T_ALL) const

tokenizes a string into an array of tokens

const string_t & get_last_error () const

gets the last error

Private Member Functions

void clear ()

clears all internal resources

void do_regex (const char *phrase, tokenizer_output_sequence_t &tokens) const

processes regular expressions, if any

void do_abbr (const char *phrase, tokenizer_output_sequence_t &tokens) const

processes abbreviations

void do_hyphen (const char *phrase, tokenizer_output_sequence_t &tokens) const

processes hyphens

size_t match (const char *x, size_t len, size_t tokens, size_t &key) const

matches the input string against the regular expressions

Private Attributes

string_t _error

error

abbreviation_map_t _abbr_map

map of hashed abbreviations

regex_map_t _regex_map

map of regular expressions

Detailed Description

tokenizes an input string into atomic tokens

Definition at line 136 of file tokenizer.h.

Constructor & Destructor Documentation

tokenizer::tokenizer ( )

constructor

Definition at line 113 of file tokenizer.cpp.

tokenizer::~tokenizer ( )

destructor

Definition at line 118 of file tokenizer.cpp.

References clear().

Member Function Documentation

bool tokenizer::add_regex ( const char * regex, size_t key, size_t min, size_t max )

adds a regular expression to the internal map; the tokenizer recognizes the longest possible regular expression match

Parameters:

regex	regular expression
key	a parameter, which will be assigned to the detected regular expression
min	minimum atomic tokens in a regular expression
max	maximum atomic tokens in a regular expression

Definition at line 132 of file tokenizer.cpp.

References _error, _regex_map, base_map< K, T, Pr, M >::end(), and map< K, T, Pr, M >::insert().

Referenced by load_regex().
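
The "longest possible regular expression" rule can be illustrated with a minimal, self-contained sketch. It is not the Terimber implementation: it uses std::regex instead of the library's own regex map, it ignores the min and max token limits, and regex_entry, match_result, and longest_match are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for one registered expression; not a Terimber type.
struct regex_entry {
    std::regex  re;
    std::size_t key;   // value reported when this expression wins
};

// Result of a lookup: matched length, winning key, and whether anything matched.
struct match_result {
    std::size_t length = 0;
    std::size_t key = 0;
    bool        found = false;
};

// Tries every registered expression at the start of `text` and keeps the longest match.
inline match_result longest_match(const std::vector<regex_entry>& entries,
                                  const std::string& text)
{
    match_result best;
    for (const regex_entry& e : entries) {
        std::smatch m;
        // match_continuous anchors the match at the first character of `text`.
        if (std::regex_search(text, m, e.re, std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            assert(len <= text.size());
            if (!best.found || len > best.length) {
                best.length = len;
                best.key = e.key;
                best.found = true;
            }
        }
    }
    return best;
}
```

With an integer pattern registered under key 1 and a decimal pattern under key 2, the input "3.14 rest" is reported with key 2, because the decimal pattern covers four characters and the integer pattern only one.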

bool tokenizer::load_regex ( const char * file_name )

loads regular expressions from a file; each line must have the format: regex <tab> key <tab> min <tab> max <endline>

Parameters:

file_name	file name

Definition at line 154 of file tokenizer.cpp.

References _error, add_regex(), and str_template::strscan().
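
The line format described above (regex, key, min, max separated by tabs) could be parsed as in the following sketch. It is only an illustration in standard C++: the library itself references str_template::strscan(), whose behavior is not documented here, and regex_line, to_size, and parse_regex_line are hypothetical names. Rejecting non-numeric fields is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record for one line of the regex file; not a Terimber type.
struct regex_line {
    std::string regex;
    std::size_t key = 0;
    std::size_t min = 0;
    std::size_t max = 0;
};

// Converts a string of decimal digits to size_t; rejects anything else.
inline bool to_size(const std::string& s, std::size_t& out)
{
    if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos)
        return false;
    try {
        out = static_cast<std::size_t>(std::stoul(s));
    } catch (const std::out_of_range&) {
        return false;
    }
    return true;
}

// Parses "regex<TAB>key<TAB>min<TAB>max"; returns false on a malformed line.
inline bool parse_regex_line(const std::string& line, regex_line& out)
{
    // The caller is expected to pass a single line without the end-of-line character.
    assert(line.find('\n') == std::string::npos);

    std::istringstream in(line);
    std::string key, min, max;
    if (!std::getline(in, out.regex, '\t') || !std::getline(in, key, '\t') ||
        !std::getline(in, min, '\t') || !std::getline(in, max))
        return false;

    return !out.regex.empty() && to_size(key, out.key) &&
           to_size(min, out.min) && to_size(max, out.max);
}
```

Each successfully parsed record would then be passed on to add_regex(regex, key, min, max), which matches the reference from load_regex() to add_regex() listed above.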

void tokenizer::clear_regex ( )

cleans up regular expressions

Definition at line 247 of file tokenizer.cpp.

References _regex_map, base_map< K, T, Pr, M >::begin(), map< K, T, Pr, M >::clear(), base_map< K, T, Pr, M >::end(), and base_map< pcre_key, pcre_entry, less< pcre_key >, M >::iterator.

Referenced by clear(), and fuzzy_matcher_impl::reset().

bool tokenizer::add_abbreviation ( const char * abbr )

adds abbreviation to the internal map

Parameters:

abbr	abbreviation

Definition at line 260 of file tokenizer.cpp.

References _abbr_map, do_hash_lowercase(), base_map< K, T, Pr, M >::end(), map< K, T, Pr, M >::insert(), and os_minus_one.

Referenced by load_abbr().
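
The reference to do_hash_lowercase() suggests that abbreviations are stored in a case-insensitive form. The sketch below shows that idea with standard containers only; it is an assumption about the behavior, not the library's code. abbreviation_set and its members are hypothetical names, and returning false for empty or duplicate input is a choice made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical case-insensitive abbreviation container; not a Terimber type.
class abbreviation_set {
public:
    // Stores the lowercase form; returns false for null, empty, or duplicate input.
    bool add(const char* abbr)
    {
        if (!abbr || !*abbr)
            return false;
        const std::string key = to_lower(abbr);
        assert(!key.empty());  // guaranteed by the check above
        return _items.insert(key).second;
    }

    // Looks up a word regardless of its letter case.
    bool contains(const std::string& word) const
    {
        return _items.count(to_lower(word)) != 0;
    }

private:
    // Returns a lowercase copy of the input.
    static std::string to_lower(std::string s)
    {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return s;
    }

    std::unordered_set<std::string> _items;
};
```

After add("Dr"), a lookup for "DR" or "dr" succeeds, while a lookup for "Mr" does not.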

bool tokenizer::load_abbr ( const char * file_name )

loads abbreviations from a file; the file must have the following format: abbr

Parameters:

file_name	file name

Definition at line 273 of file tokenizer.cpp.

References _error, and add_abbreviation().

void tokenizer::clear_abbr ( )

cleans up abbreviations

Definition at line 314 of file tokenizer.cpp.

References _abbr_map, and map< K, T, Pr, M >::clear().

Referenced by clear(), and fuzzy_matcher_impl::reset().

bool tokenizer::tokenize ( const char * str, tokenizer_output_sequence_t & out, byte_allocator & all, size_t flags = T_ALL ) const

tokenizes a string into an array of tokens

Parameters:

str	input string
out	[out] tokenized atomic tokens
all	allocator
flags	tokenization flags

Definition at line 321 of file tokenizer.cpp.

References base_list< T >::begin(), base_list< T >::clear(), detect_token_type(), do_abbr(), do_hyphen(), do_regex(), base_list< T >::end(), _list< T, A >::push_back(), T_ABBR, T_HYPHEN, T_REGEX, TT_ALPHABETIC, TT_DIGIT, TT_UNKNOWN, and TT_WHITESPACE.

Referenced by fuzzy_matcher_impl::_match(), fuzzy_matcher_impl::add(), and fuzzy_matcher_impl::remove().
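
The token types referenced above (TT_ALPHABETIC, TT_DIGIT, TT_WHITESPACE, TT_UNKNOWN) suggest that the first pass splits the input by character class. The following sketch shows one way such a split could work; it is an assumption for illustration, not the library's algorithm, and it omits the regex, abbreviation, and hyphen passes. char_class, atomic_token, classify, and split_atomic are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical character classes, loosely modeled on the TT_* names above.
enum class char_class { alphabetic, digit, whitespace, other };

// Hypothetical token record; not the library's tokenizer_output_sequence_t element.
struct atomic_token {
    char_class  type;
    std::string text;
};

// Maps a single character to its class.
inline char_class classify(unsigned char c)
{
    if (std::isalpha(c)) return char_class::alphabetic;
    if (std::isdigit(c)) return char_class::digit;
    if (std::isspace(c)) return char_class::whitespace;
    return char_class::other;
}

// Groups consecutive characters of the same class into one token.
// Characters of class `other` (punctuation and symbols) always stand alone.
inline std::vector<atomic_token> split_atomic(const std::string& str)
{
    std::vector<atomic_token> out;
    for (unsigned char c : str) {
        const char_class cls = classify(c);
        if (!out.empty() && out.back().type == cls && cls != char_class::other)
            out.back().text.push_back(static_cast<char>(c));
        else
            out.push_back({cls, std::string(1, static_cast<char>(c))});
    }
    assert(out.empty() == str.empty());
    return out;
}
```

For the input "U.S. 42" this sketch yields six tokens: "U", ".", "S", ".", a single space, and "42". A later pass such as do_abbr() would presumably be the place where "U.S." is merged back into one abbreviation token.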

const string_t& tokenizer::get_last_error ( ) const [inline]

gets the last error

Definition at line 190 of file tokenizer.h.

References _error.

void tokenizer::clear ( ) [private]

clears all internal resources

Definition at line 124 of file tokenizer.cpp.

References clear_abbr(), and clear_regex().

Referenced by ~tokenizer().

void tokenizer::do_regex ( const char * phrase, tokenizer_output_sequence_t & tokens ) const [private]

processes regular expressions, if any

Parameters:

phrase	input phrase
tokens	[in,out] atomic tokens

Definition at line 420 of file tokenizer.cpp.

References _regex_map, base_list< T >::begin(), base_map< pcre_key, pcre_entry, less< pcre_key >, M >::const_iterator, base_map< K, T, Pr, M >::empty(), base_list< T >::end(), base_map< K, T, Pr, M >::end(), _list< T, A >::erase(), match(), TT_REGEX, and base_map< K, T, Pr, M >::upper_bound().

Referenced by tokenize().

void tokenizer::do_abbr ( const char * phrase, tokenizer_output_sequence_t & tokens ) const [private]

processes abbreviations

Parameters:

phrase	input phrase
tokens	[in,out] atomic tokens

Definition at line 484 of file tokenizer.cpp.

References _abbr_map, base_list< T >::begin(), do_hash_lowercase(), base_map< K, T, Pr, M >::end(), base_list< T >::end(), _list< T, A >::erase(), base_map< K, T, Pr, M >::const_iterator::key(), base_map< K, T, Pr, M >::lower_bound(), str_template::strnocasecmp(), TT_ABBR, TT_ALPHABETIC, and TT_DOT.

Referenced by tokenize().

void tokenizer::do_hyphen ( const char * phrase, tokenizer_output_sequence_t & tokens ) const [private]

processes hyphens

Parameters:

phrase	input phrase
tokens	[in,out] atomic tokens

Definition at line 529 of file tokenizer.cpp.

References base_list< T >::begin(), ch_dash, base_list< T >::end(), _list< T, A >::erase(), TT_ALPHABETIC, TT_COMPOSE, TT_DIGIT, and TT_SYMBOL.

Referenced by tokenize().

size_t tokenizer::match ( const char * x, size_t len, size_t tokens, size_t & key ) const [private]

matches the input string against the regular expressions

Parameters:

x	input string
len	length
tokens	tokens to match
key	[out] regex key

Definition at line 390 of file tokenizer.cpp.

References _regex_map, base_map< pcre_key, pcre_entry, less< pcre_key >, M >::const_iterator, base_map< K, T, Pr, M >::end(), and base_map< K, T, Pr, M >::lower_bound().

Referenced by do_regex().

Member Data Documentation

string_t tokenizer::_error [private]

error

Definition at line 222 of file tokenizer.h.

Referenced by add_regex(), get_last_error(), load_abbr(), and load_regex().

abbreviation_map_t tokenizer::_abbr_map [private]

map of hashed abbreviations

Definition at line 223 of file tokenizer.h.

Referenced by add_abbreviation(), clear_abbr(), and do_abbr().

regex_map_t tokenizer::_regex_map [private]

map of regular expressions

Definition at line 224 of file tokenizer.h.

Referenced by add_regex(), clear_regex(), do_regex(), and match().

The documentation for this class was generated from the following files:

tokenizer.h
tokenizer.cpp

	© Copyright Terimber 2003-.