Services for searching and matching of text.
Interface for different indexing engines for the Translate Toolkit.
Base class for interfaces to indexing engines for Pootle.
translate.search.indexing.CommonIndexer.CommonDatabase(basedir, analyzer=None, create_allowed=True)
Base class for indexing support.
Any real implementation must override most methods of this class.

ANALYZER_DEFAULT = 6
The default analyzer to be used if nothing is configured.

ANALYZER_EXACT = 0
Exact matching: the query string must equal the whole term string.

ANALYZER_PARTIAL = 2
Partial matching: a document matches even if the query string only matches the beginning of the term value.

ANALYZER_TOKENIZE = 4
Tokenize terms and queries automatically.

INDEX_DIRECTORY_NAME = None
Override this with a string to be used as the name of the indexing directory/file in the filesystem.

QUERY_TYPE = None
Override this with the query class of the implementation.
begin_transaction()
Begin a transaction.
Multiple modifications of a database can be grouped into a transaction. This avoids time-consuming intermediate flushing and ensures that a changeset is committed either completely or not at all. No changes are written to disk until commit_transaction() is called; cancel_transaction() can be used to revert an ongoing transaction.
Database types that do not support transactions may silently ignore this.
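
As a hedged sketch of grouping several writes into one transaction (the 'indexer' object is assumed to be an instance of a CommonDatabase subclass created elsewhere; the field names are illustrative):

    # Sketch only: 'indexer' is an assumed CommonDatabase subclass instance.
    indexer.begin_transaction()
    try:
        indexer.index_document({"source": "Open file", "target": "Datei öffnen"})
        indexer.index_document({"source": "Save file", "target": "Datei speichern"})
    except Exception:
        # revert everything queued since begin_transaction()
        indexer.cancel_transaction()
        raise
    else:
        # write all queued changes to disk
        indexer.commit_transaction()
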
cancel_transaction()
Cancel an ongoing transaction.
See begin_transaction() for details.

commit_transaction()
Submit the currently ongoing transaction and write changes to disk.
See begin_transaction() for details.
delete_doc(ident)
Delete the documents returned by a query.
Parameters: ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs, dict(s) describing a query, or query object(s)

delete_document_by_id(docid)
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted

field_analyzers = {}
Mapping of field names and analyzers; see set_field_analyzers().

flush(optimize=False)
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
Parameters: optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field(s) whose analyzer is requested; leave empty (or None) to request all fields
Returns: the analyzer setting of the field (see CommonDatabase.ANALYZER_???) or a dict of field names and analyzers
Return type: int | dict

get_query_result(query)
Return an object containing the results of a query.
Parameters: query (a query object of the real implementation) – a pre-compiled query
Returns: an object that allows access to the results
Return type: subclass of CommonEnquire
index_document(data)
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.
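
For illustration, a sketch of the accepted argument forms (the 'indexer' instance and field names are assumptions, not part of the API description above):

    # fieldname:value pairs - each value is indexed under its field
    indexer.index_document({"source": "file not found", "target": "Datei nicht gefunden"})

    # a list of strings is indexed as plain terms
    indexer.index_document(["file", "not", "found"])

    # a None key means the value is treated as (a list of) plain terms
    indexer.index_document({None: ["file", "not", "found"], "context": "errors"})
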
make_query(args, require_all=True, analyzer=None)
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter 'match_text_partial' can override the previously defined default setting.
Returns: the combined query
Return type: query type of the specific implementation
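
A sketch of building and combining queries (the 'indexer' instance and field names are assumed, and passing a list of previously built queries is inferred from the description above; the exact objects returned depend on the backend):

    # a plain text query
    q_text = indexer.make_query("file not found")

    # a field search described as a dict
    q_field = indexer.make_query({"source": "file"})

    # combine queries: require_all=True means AND, False means OR
    q_and = indexer.make_query([q_text, q_field], require_all=True)
    q_or = indexer.make_query([q_text, q_field], require_all=False)
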
search(query, fieldnames)
Return a list of the contents of specified fields for all matches of a query.
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts
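
For example (a sketch; the query, field names and 'indexer' instance are assumptions):

    query = indexer.make_query({"source": "file"})
    for hit in indexer.search(query, ["source", "target"]):
        # each hit is a dict limited to the requested fields
        print(hit.get("source"), "->", hit.get("target"))
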
set_field_analyzers(field_analyzers)
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict of field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
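
A sketch of mapping analyzers to fields, including a bitwise combination (the field names are illustrative; the import path follows the module layout shown above):

    from translate.search.indexing.CommonIndexer import CommonDatabase

    indexer.set_field_analyzers({
        # exact matching for identifiers
        "msgid": CommonDatabase.ANALYZER_EXACT,
        # tokenized plus partial matching for free text
        "comment": CommonDatabase.ANALYZER_TOKENIZE | CommonDatabase.ANALYZER_PARTIAL,
    })
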
translate.search.indexing.CommonIndexer.CommonEnquire(enquire)
An enquire object contains the information about the result of a request.

get_matches(start, number)
Return a specified number of qualified matches of a previous query.
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); "matches" is a dictionary of: ["rank", "percent", "document", "docid"]
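
A sketch of paging through results (it assumes each entry in "matches" is a dictionary exposing the keys listed above):

    results = indexer.get_query_result(query)
    returned, available, matches = results.get_matches(0, 10)
    for entry in matches:
        print(entry["percent"], entry["docid"])
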
Interface for the PyLucene (v2.x) indexing engine.
See PyLuceneIndexer1.py for the PyLucene v1.x interface.

translate.search.indexing.PyLuceneIndexer.PyLuceneDatabase(basedir, analyzer=None, create_allowed=True)
Manage and use a PyLucene indexing database.

begin_transaction()
PyLucene does not support transactions.
This function therefore just opens the database for write access. Call cancel_transaction() or commit_transaction() to close write access and remove the exclusive lock from the database directory.

cancel_transaction()
PyLucene does not support transactions.
This function therefore just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.

commit_transaction()
PyLucene does not support transactions.
This function therefore just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.
delete_document_by_id(docid)
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted

flush(optimize=False)
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
Parameters: optimize (bool) – should the index be optimized if possible?

get_field_analyzers(fieldnames=None)
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field(s) whose analyzer is requested; leave empty (or None) to request all fields
Returns: the analyzer setting of the field (see CommonDatabase.ANALYZER_???) or a dict of field names and analyzers
Return type: int | dict

get_query_result(query)
Return an object containing the results of a query.
Parameters: query (a query object of the real implementation) – a pre-compiled query
Returns: an object that allows access to the results
Return type: subclass of CommonEnquire

index_document(data)
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.

search(query, fieldnames)
Return a list of the contents of specified fields for all matches of a query.
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts

set_field_analyzers(field_analyzers)
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict of field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
translate.search.indexing.PyLuceneIndexer.PyLuceneHits(enquire)
An enquire object contains the information about the result of a request.

get_matches(start, number)
Return a specified number of qualified matches of a previous query.
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); "matches" is a dictionary of: ["rank", "percent", "document", "docid"]
Interface to the Xapian indexing engine for the Translate Toolkit.
Xapian v1.0 or higher is supported.
If you are interested in writing an interface for Xapian 0.x, you should check out the following:
svn export -r 7235 https://translate.svn.sourceforge.net/svnroot/translate/src/branches/translate-search-indexer-generic-merging/translate/search/indexer/
It is not completely working, but it should give you a good start.
translate.search.indexing.XapianIndexer.XapianDatabase(basedir, analyzer=None, create_allowed=True)
Interface to the Xapian indexer.

begin_transaction()
Begin a transaction.
Xapian supports transactions to group multiple database modifications. This avoids intermediate flushing and therefore increases performance.

cancel_transaction()
Cancel an ongoing transaction.
No changes since the last execution of begin_transaction() are written.

commit_transaction()
Submit the changes of an ongoing transaction.
All changes since the last execution of begin_transaction() are written.
delete_doc(ident)
Delete the documents returned by a query.
Parameters: ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs, dict(s) describing a query, or query object(s)

delete_document_by_id(docid)
Delete a specified document.
Parameters: docid (int) – the document ID to be deleted

flush(optimize=False)
Force the current changes to be written to disk immediately.
Parameters: optimize (bool) – ignored for Xapian

get_field_analyzers(fieldnames=None)
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
Parameters: fieldnames (str | list of str | None) – the field(s) whose analyzer is requested; leave empty (or None) to request all fields
Returns: the analyzer setting of the field (see CommonDatabase.ANALYZER_???) or a dict of field names and analyzers
Return type: int | dict

get_query_result(query)
Return an object containing the results of a query.
Parameters: query (xapian.Query) – a pre-compiled Xapian query
Returns: an object that allows access to the results
Return type: XapianIndexer.CommonEnquire

index_document(data)
Add the given data to the database.
Parameters: data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately; lists of strings are treated as plain terms.
search(query, fieldnames)
Return a list of the contents of specified fields for all matches of a query.
Returns: a list of dicts containing the specified field(s)
Return type: list of dicts

set_field_analyzers(field_analyzers)
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
Parameters: field_analyzers (dict of field names and analyzers) – mapping of field names and analyzers
Raises: TypeError – invalid values in field_analyzers
translate.search.indexing.XapianIndexer.XapianEnquire(enquire)
Interface to the Xapian object for storing sets of matches.

get_matches(start, number)
Return a specified number of qualified matches of a previous query.
Returns: a set of matching entries and some statistics
Return type: tuple of (returned number, available number, matches); "matches" is a dictionary of: ["rank", "percent", "document", "docid"]
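
Putting the pieces together, a minimal end-to-end sketch with the Xapian backend (it assumes the xapian Python bindings are installed; the index directory and field names are illustrative):

    from translate.search.indexing.XapianIndexer import XapianDatabase

    indexer = XapianDatabase("/tmp/tm-index", create_allowed=True)
    indexer.index_document({"source": "Open file", "target": "Datei öffnen"})
    indexer.flush()

    query = indexer.make_query({"source": "open"})
    for hit in indexer.search(query, ["source", "target"]):
        print(hit)
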
A class to calculate a similarity based on the Levenshtein distance.
See http://en.wikipedia.org/wiki/Levenshtein_distance.
If available, the python-Levenshtein library will be used, which provides better performance as it is implemented natively.
translate.search.lshtein.distance(a, b, stopvalue=0)
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.

translate.search.lshtein.native_distance(a, b, stopvalue=0)
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.

translate.search.lshtein.python_distance(a, b, stopvalue=-1)
Calculates the distance for use in similarity calculation. Python version.
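
For example, computing the raw edit distance between two strings (a sketch; the stopvalue is assumed to be an early-exit threshold, which the defaults above effectively disable):

    from translate.search import lshtein

    # distance() uses the fast C implementation when python-Levenshtein is available
    print(lshtein.distance("kitten", "sitting"))         # Levenshtein distance of 3
    print(lshtein.python_distance("kitten", "sitting"))  # pure-Python fallback, same result
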
Class to perform translation memory matching from a store of translation units.

translate.search.match.matcher(store, max_candidates=10, min_similarity=75, max_length=70, comparer=None, usefuzzy=False)
A class that will do matching and store configuration for the matching process.

buildunits(candidates)
Builds a list of units conforming to base API, with the score in the comment.

extendtm(units, store=None, sort=True)
Extends the memory with extra unit(s).

getstartlength(min_similarity, text)
Calculates the minimum length we are interested in. The extra fat is because we don't use plain character distance only.

getstoplength(min_similarity, text)
Calculates a length beyond which we are not interested. The extra fat is because we don't use plain character distance only.

inittm(stores, reverse=False)
Initialises the memory for later use. We use simple base units for speedup.
matches(text)
Returns a list of possible matches for the given source text.
Parameters: text (string) – the text that will be searched for in the translation memory
Return type: list
Returns: a list of units with the source and target strings from the translation memory. If self.addpercentage is True (default), the match quality is given as a percentage in the notes.
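
A sketch of querying a translation memory built from an existing store (the file name and the use of translate.storage.factory are illustrative assumptions):

    from translate.search import match
    from translate.storage import factory

    store = factory.getobject("translations/af.po")  # any bilingual store
    tm = match.matcher(store, max_candidates=5, min_similarity=75)

    for candidate in tm.matches("Open file"):
        # candidates are translation units; the match quality is recorded
        # as a percentage in the unit's notes (see above)
        print(candidate.source, "->", candidate.target, candidate.getnotes())
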
setparameters(max_candidates=10, min_similarity=75, max_length=70)
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.

usable(unit)
Returns whether this translation unit is usable for TM.

translate.search.match.sourcelen(unit)
Returns the length of the source string.

translate.search.match.terminologymatcher(store, max_candidates=10, min_similarity=75, max_length=500, comparer=None)
A matcher with settings specifically for terminology matching.

buildunits(candidates)
Builds a list of units conforming to base API, with the score in the comment.

extendtm(units, store=None, sort=True)
Extends the memory with extra unit(s).

inittm(store)
Normal initialisation, but convert all source strings to lower case.

matches(text)
Normal matching after converting text to lower case. Then replace with the original unit to retain comments, etc.
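
Terminology lookup follows the same pattern as the generic matcher; a sketch (the glossary file and loading via translate.storage.factory are assumptions):

    from translate.search import match
    from translate.storage import factory

    glossary = factory.getobject("terminology/glossary.po")
    term_tm = match.terminologymatcher(glossary)

    # matching is case-insensitive; the returned units are the original
    # glossary units, so their comments are preserved
    for term in term_tm.matches("Click the File menu to open a file"):
        print(term.source, "->", term.target)
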
setparameters(max_candidates=10, min_similarity=75, max_length=70)
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.

usable(unit)
Returns whether this translation unit is usable for terminology.

translate.search.match.unit2dict(unit)
Converts a pounit to a simple dict structure for use over the web.
Module to deal with different types and uses of segmentation.

translate.search.segment.character_iter(text)
Returns an iterator over the characters in text.

translate.search.segment.characters(text)
Returns a list of characters in text.

translate.search.segment.sentence_iter(text)
Returns an iterator over the sentences in text.

translate.search.segment.sentences(text)
Returns a list of sentences in text.

translate.search.segment.word_iter(text)
Returns an iterator over the words in text.

translate.search.segment.words(text)
Returns a list of words in text.
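
For example (a sketch; the exact tokenisation rules are those of the module):

    from translate.search import segment

    text = "Hello world. This is the second sentence."
    print(segment.words(text))       # list of word tokens
    print(segment.sentences(text))   # list of sentences
    print(len(segment.characters(text)))
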
A class that does terminology matching.