search

Services for searching and matching of text.

indexing

Interface for different indexing engines for the Translate Toolkit.

CommonIndexer

Base class for interfaces to indexing engines for Pootle.

class translate.search.indexing.CommonIndexer.CommonDatabase(basedir, analyzer=None, create_allowed=True)

Base class for indexing support.

Any real implementation must override most methods of this class.

ANALYZER_DEFAULT = 6

the default analyzer to be used if nothing is configured

ANALYZER_EXACT = 0

exact matching: the query string must equal the whole term string

ANALYZER_PARTIAL = 2

partial matching: a document matches even if the query string only matches the beginning of the term value

ANALYZER_TOKENIZE = 4

tokenize terms and queries automatically

INDEX_DIRECTORY_NAME = None

override this with a string to be used as the name of the indexing directory/file in the filesystem

QUERY_TYPE = None

override this with the query class of the implementation
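The analyzer constants above are bit flags and combine bitwise; notably, the default value 6 is exactly tokenizing plus partial matching. A minimal sketch using the documented values (the helper function is illustrative, not part of the API):

```python
# Documented analyzer flag values (from CommonDatabase above).
ANALYZER_EXACT = 0      # query must equal the whole term
ANALYZER_PARTIAL = 2    # prefix matching on term values
ANALYZER_TOKENIZE = 4   # tokenize terms and queries

# Flags combine bitwise; the documented default (6) is
# tokenizing plus partial matching.
ANALYZER_DEFAULT = ANALYZER_TOKENIZE | ANALYZER_PARTIAL

def wants_partial(analyzer):
    """Check whether an analyzer setting includes partial matching."""
    return bool(analyzer & ANALYZER_PARTIAL)

print(ANALYZER_DEFAULT)                  # 6
print(wants_partial(ANALYZER_DEFAULT))   # True
print(wants_partial(ANALYZER_TOKENIZE))  # False
```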

begin_transaction()

Begin a transaction.

You can group multiple modifications of a database into a transaction. This avoids time-consuming database flushing and ensures that a changeset is committed either completely or not at all. No changes are written to disk until ‘commit_transaction’; ‘cancel_transaction’ can be used to revert an ongoing transaction.

Database types that do not support transactions may silently ignore them.
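The contract above (changes are buffered, written only on commit, and discarded on cancel) can be sketched with a toy in-memory store; this is purely illustrative and stands in for the real engine-specific locking and flushing:

```python
class ToyTransactionalStore:
    """Illustrative only: buffers writes until commit, following the
    begin/commit/cancel contract described above."""

    def __init__(self):
        self.on_disk = []     # stands in for the flushed index
        self.pending = None   # buffered changes during a transaction

    def begin_transaction(self):
        self.pending = []

    def index_document(self, data):
        target = self.pending if self.pending is not None else self.on_disk
        target.append(data)

    def commit_transaction(self):
        self.on_disk.extend(self.pending)
        self.pending = None

    def cancel_transaction(self):
        self.pending = None   # buffered changes never reach disk

store = ToyTransactionalStore()
store.begin_transaction()
store.index_document({"source": "hello"})
store.cancel_transaction()
print(store.on_disk)  # [] - cancelled changes were never written
```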

cancel_transaction()

Cancel an ongoing transaction.

See ‘begin_transaction’ for details.

commit_transaction()

Submit the currently ongoing transaction and write changes to disk.

See ‘begin_transaction’ for details.

delete_doc(ident)

Delete the documents returned by a query.

Parameters:ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)

Delete a specified document.

Parameters:docid (int) – the document ID to be deleted
field_analyzers = {}

mapping of field names and analyzers - see set_field_analyzers()

flush(optimize=False)

Flush the content of the database - to force changes to be written to disk.

Some databases also support index optimization.

Parameters:optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)

Return the analyzer that was mapped to a specific field.

See set_field_analyzers() for details.

Parameters:fieldnames (str | list of str | None) – the analyzer of this field (or all/multiple fields) is requested; leave empty (or None) to request all fields.
Returns:The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
Return type:int | dict
get_query_result(query)

Return an object containing the results of a query.

Parameters:query (a query object of the real implementation) – a pre-compiled query
Returns:an object that allows access to the results
Return type:subclass of CommonEnquire
index_document(data)

Add the given data to the database.

Parameters:data (dict | list of str) – the data to be indexed. A dictionary will be treated as fieldname:value combinations. If the fieldname is None then the value will be interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
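The accepted shapes of data can be illustrated with a small normalizer that applies the rules above; this helper is hypothetical and not part of the toolkit:

```python
def normalize_index_data(data):
    """Illustrative only: turn the accepted index_document() shapes
    into a list of (fieldname, term) pairs, following the rules
    documented above. None as the fieldname marks a plain
    (field-less) term."""
    pairs = []
    if isinstance(data, dict):
        for field, value in data.items():
            values = value if isinstance(value, list) else [value]
            for term in values:   # lists of terms are indexed separately
                pairs.append((field, term))
    else:
        for term in data:         # a list of strings: plain terms
            pairs.append((None, term))
    return pairs

print(normalize_index_data({"source": "hello", None: ["a", "b"]}))
print(normalize_index_data(["plain", "terms"]))
```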
make_query(args, require_all=True, analyzer=None)

Create simple queries (strings or field searches) or combine multiple queries (AND/OR).

To specify rules for field searches, you may want to take a look at set_field_analyzers(). The ‘analyzer’ parameter can override the previously defined default setting.

Parameters:
  • args (list of queries | single query | str | dict) –

    queries, a search string, or a description of a field query. Examples:

    [xapian.Query("foo"), xapian.Query("bar")]
    xapian.Query("foo")
    "bar"
    {"foo": "bar", "foobar": "foo"}
    
  • require_all (boolean) – boolean operator (True -> AND (default) / False -> OR)
  • analyzer (int) –

    (only applicable for ‘dict’ or ‘str’) Define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???.

    This can override previously defined field analyzer settings.

    If analyzer is None (default), then the configured analyzer for the field is used.

Returns:

the combined query

Return type:

query type of the specific implementation
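The argument shapes and the require_all switch can be mimicked with plain predicate functions standing in for real query objects; everything below is illustrative only, not the toolkit's implementation:

```python
def toy_make_query(args, require_all=True):
    """Illustrative only: mirrors the documented argument shapes of
    make_query() using predicate functions instead of real query
    objects. A dict combines per-field queries; require_all picks
    AND (True) versus OR (False)."""
    if callable(args):                  # already a compiled (toy) query
        return args
    if isinstance(args, str):           # plain search string
        return lambda doc: any(args in v for v in doc.values())
    if isinstance(args, dict):          # field -> search string
        subs = [lambda doc, f=f, s=s: s in doc.get(f, "")
                for f, s in args.items()]
    else:                               # list of (toy) queries
        subs = [toy_make_query(a) for a in args]
    combine = all if require_all else any
    return lambda doc: combine(q(doc) for q in subs)

doc = {"source": "hello world", "target": "hallo welt"}
q_and = toy_make_query({"source": "hello", "target": "welt"})
print(q_and(doc))   # True - both field queries match
q_or = toy_make_query(["missing", "hallo"], require_all=False)
print(q_or(doc))    # True - the second term matches
```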

search(query, fieldnames)

Return a list of the contents of specified fields for all matches of a query.

Parameters:
  • query (a query object of the real implementation) – the query to be issued
  • fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns:

a list of dicts containing the specified field(s)

Return type:

list of dicts

set_field_analyzers(field_analyzers)

Set the analyzers for different fields of the database documents.

All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.

Parameters:field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises:TypeError – invalid values in field_analyzers
class translate.search.indexing.CommonIndexer.CommonEnquire(enquire)

An enquire object contains the information about the result of a request.

get_matches(start, number)

Return a specified number of qualified matches of a previous query.

Parameters:
  • start (int) – index of the first match to return (starting from zero)
  • number (int) – the number of matching entries to return
Returns:

a set of matching entries and some statistics

Return type:

tuple of (returned number, available number, matches) “matches” is a dictionary of:

["rank", "percent", "document", "docid"]

get_matches_count()

Return the estimated number of matches.

Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.

Returns:The estimated number of matches
Return type:int
translate.search.indexing.CommonIndexer.is_available()

Check if this indexing engine interface is usable.

This function must exist in every module that contains indexing engine interfaces.

Returns:is this interface usable?
Return type:bool
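A conventional way to use is_available() is to probe candidate interface modules in order and take the first usable one; the helper below is an illustrative sketch, not toolkit code:

```python
import importlib

def first_usable_indexer(module_names):
    """Illustrative only: try each indexing interface module in turn
    and return the first one that imports and reports itself usable
    via its is_available() function."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        is_available = getattr(module, "is_available", None)
        if is_available is not None and is_available():
            return module
    return None

# With no usable engine importable, selection falls through to None.
print(first_usable_indexer(["no_such_engine"]))  # None
```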

PyLuceneIndexer

interface for the PyLucene (v2.x) indexing engine

take a look at PyLuceneIndexer1.py for the PyLucene v1.x interface

class translate.search.indexing.PyLuceneIndexer.PyLuceneDatabase(basedir, analyzer=None, create_allowed=True)

Manage and use a pylucene indexing database.

begin_transaction()

PyLucene does not support transactions.

Thus this function just opens the database for write access. Call “cancel_transaction” or “commit_transaction” to close write access and remove the exclusive lock from the database directory.

cancel_transaction()

PyLucene does not support transactions.

Thus this function just closes the database write access and removes the exclusive lock.

See ‘begin_transaction’ for details.

commit_transaction()

PyLucene does not support transactions.

Thus this function just closes the database write access and removes the exclusive lock.

See ‘begin_transaction’ for details.

delete_document_by_id(docid)

Delete a specified document.

Parameters:docid (int) – the document ID to be deleted
flush(optimize=False)

Flush the content of the database - to force changes to be written to disk.

Some databases also support index optimization.

Parameters:optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)

Return the analyzer that was mapped to a specific field.

See set_field_analyzers() for details.

Parameters:fieldnames (str | list of str | None) – the analyzer of this field (or all/multiple fields) is requested; leave empty (or None) to request all fields.
Returns:The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
Return type:int | dict
get_query_result(query)

Return an object containing the results of a query.

Parameters:query (a query object of the real implementation) – a pre-compiled query
Returns:an object that allows access to the results
Return type:subclass of CommonEnquire
index_document(data)

Add the given data to the database.

Parameters:data (dict | list of str) – the data to be indexed. A dictionary will be treated as fieldname:value combinations. If the fieldname is None then the value will be interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
search(query, fieldnames)

Return a list of the contents of specified fields for all matches of a query.

Parameters:
  • query (a query object of the real implementation) – the query to be issued
  • fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns:

a list of dicts containing the specified field(s)

Return type:

list of dicts

set_field_analyzers(field_analyzers)

Set the analyzers for different fields of the database documents.

All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.

Parameters:field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises:TypeError – invalid values in field_analyzers
class translate.search.indexing.PyLuceneIndexer.PyLuceneHits(enquire)

An enquire object contains the information about the result of a request.

get_matches(start, number)

Return a specified number of qualified matches of a previous query.

Parameters:
  • start (int) – index of the first match to return (starting from zero)
  • number (int) – the number of matching entries to return
Returns:

a set of matching entries and some statistics

Return type:

tuple of (returned number, available number, matches) “matches” is a dictionary of:

["rank", "percent", "document", "docid"]

get_matches_count()

Return the estimated number of matches.

Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.

Returns:The estimated number of matches
Return type:int

XapianIndexer

Interface to the Xapian indexing engine for the Translate Toolkit.

Xapian v1.0 or higher is supported.

If you are interested in writing an interface for Xapian 0.x, you should check out the following:

svn export -r 7235 https://translate.svn.sourceforge.net/svnroot/translate/src/branches/translate-search-indexer-generic-merging/translate/search/indexer/

It is not completely working, but it should give you a good start.

class translate.search.indexing.XapianIndexer.XapianDatabase(basedir, analyzer=None, create_allowed=True)

Interface to the Xapian indexer.

begin_transaction()

Begin a transaction.

Xapian supports transactions to group multiple database modifications. This avoids intermediate flushing and therefore increases performance.

cancel_transaction()

Cancel an ongoing transaction.

No changes since the last execution of ‘begin_transaction’ are written.

commit_transaction()

Submit the changes of an ongoing transaction.

All changes since the last execution of ‘begin_transaction’ are written.

delete_doc(ident)

Delete the documents returned by a query.

Parameters:ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)

Delete a specified document.

Parameters:docid (int) – the document ID to be deleted
flush(optimize=False)

Force the current changes to be written to disk immediately.

Parameters:optimize (bool) – ignored for xapian
get_field_analyzers(fieldnames=None)

Return the analyzer that was mapped to a specific field.

See set_field_analyzers() for details.

Parameters:fieldnames (str | list of str | None) – the analyzer of this field (or all/multiple fields) is requested; leave empty (or None) to request all fields.
Returns:The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
Return type:int | dict
get_query_result(query)

Return an object containing the results of a query.

Parameters:query (xapian.Query) – a pre-compiled xapian query
Returns:an object that allows access to the results
Return type:XapianIndexer.CommonEnquire
index_document(data)

Add the given data to the database.

Parameters:data (dict | list of str) – the data to be indexed. A dictionary will be treated as fieldname:value combinations. If the fieldname is None then the value will be interpreted as a plain term or as a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
search(query, fieldnames)

Return a list of the contents of specified fields for all matches of a query.

Parameters:
  • query (xapian.Query) – the query to be issued
  • fieldnames (string | list of strings) – the name(s) of a field of the document content
Returns:

a list of dicts containing the specified field(s)

Return type:

list of dicts

set_field_analyzers(field_analyzers)

Set the analyzers for different fields of the database documents.

All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.

Parameters:field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
Raises:TypeError – invalid values in field_analyzers
class translate.search.indexing.XapianIndexer.XapianEnquire(enquire)

Interface to the Xapian object for storing sets of matches.

get_matches(start, number)

Return a specified number of qualified matches of a previous query.

Parameters:
  • start (int) – index of the first match to return (starting from zero)
  • number (int) – the number of matching entries to return
Returns:

a set of matching entries and some statistics

Return type:

tuple of (returned number, available number, matches) “matches” is a dictionary of:

["rank", "percent", "document", "docid"]

get_matches_count()

Return the estimated number of matches.

Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.

Returns:The estimated number of matches
Return type:int

lshtein

A class to calculate a similarity based on the Levenshtein distance.

See http://en.wikipedia.org/wiki/Levenshtein_distance.

If available, the python-Levenshtein package will be used, which provides better performance as it is implemented natively.

translate.search.lshtein.distance(a, b, stopvalue=0)

Same as python_distance in functionality. This uses the fast C version if we detected it earlier.

Note that this does not support arbitrary sequence types, but only string types.

translate.search.lshtein.native_distance(a, b, stopvalue=0)

Same as python_distance in functionality. This uses the fast C version if we detected it earlier.

Note that this does not support arbitrary sequence types, but only string types.

translate.search.lshtein.python_distance(a, b, stopvalue=-1)

Calculates the distance for use in similarity calculation. Python version.
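For illustration, a pure-Python Levenshtein distance in the spirit of python_distance; the early-exit handling of stopvalue is an assumption about its purpose, not the toolkit's exact behaviour:

```python
def toy_levenshtein(a, b, stopvalue=-1):
    """Illustrative pure-Python Levenshtein distance (dynamic
    programming, one row at a time). The stopvalue handling is an
    assumption: a non-negative stopvalue lets the computation bail
    out once every remaining path already exceeds it."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        if 0 <= stopvalue < min(current):
            return stopvalue + 1  # distance is known to exceed stopvalue
        previous = current
    return previous[-1]

print(toy_levenshtein("kitten", "sitting"))  # 3
```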

match

Class to perform translation memory matching from a store of translation units.

class translate.search.match.matcher(store, max_candidates=10, min_similarity=75, max_length=70, comparer=None, usefuzzy=False)

A class that will do matching and store configuration for the matching process.

buildunits(candidates)

Builds a list of units conforming to base API, with the score in the comment.

extendtm(units, store=None, sort=True)

Extends the memory with extra unit(s).

Parameters:
  • units – The units to add to the TM.
  • store – Optional store from where some metadata can be retrieved and associated with each unit.
  • sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
getstartlength(min_similarity, text)

Calculates the minimum length we are interested in. The extra fat is because we don’t use plain character distance only.

getstoplength(min_similarity, text)

Calculates a length beyond which we are not interested. The extra fat is because we don’t use plain character distance only.

inittm(stores, reverse=False)

Initialises the memory for later use. We use simple base units for speedup.

matches(text)

Returns a list of possible matches for given source text.

Parameters:text (String) – The text that will be searched for in the translation memory
Return type:list
Returns:a list of units with the source and target strings from the translation memory. If self.addpercentage is True (default) the match quality is given as a percentage in the notes.
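The min_similarity threshold is compared against a percentage score. One common way to derive such a percentage from edit distance looks like this; the toolkit's exact formula may differ:

```python
def similarity_percent(source, candidate):
    """Illustrative only: turn a Levenshtein distance into the kind
    of percentage that min_similarity is compared against. The
    formula (100 * (1 - distance / longer length)) is a common
    choice, not necessarily the toolkit's."""
    if not source and not candidate:
        return 100.0
    # simple dynamic-programming Levenshtein distance
    previous = list(range(len(candidate) + 1))
    for i, ca in enumerate(source, start=1):
        current = [i]
        for j, cb in enumerate(candidate, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    distance = previous[-1]
    return 100.0 * (1 - distance / max(len(source), len(candidate)))

# Two insertions against a length-11 candidate: about 81.8 percent.
print(round(similarity_percent("Open file", "Open a file"), 1))
```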
setparameters(max_candidates=10, min_similarity=75, max_length=70)

Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.

usable(unit)

Returns whether this translation unit is usable for TM.

translate.search.match.sourcelen(unit)

Returns the length of the source string.

class translate.search.match.terminologymatcher(store, max_candidates=10, min_similarity=75, max_length=500, comparer=None)

A matcher with settings specifically for terminology matching.

buildunits(candidates)

Builds a list of units conforming to base API, with the score in the comment.

extendtm(units, store=None, sort=True)

Extends the memory with extra unit(s).

Parameters:
  • units – The units to add to the TM.
  • store – Optional store from where some metadata can be retrieved and associated with each unit.
  • sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
inittm(store)

Normal initialisation, but converts all source strings to lower case.

matches(text)

Normal matching after converting text to lower case. Matches are then replaced with the original units to retain comments, etc.

setparameters(max_candidates=10, min_similarity=75, max_length=70)

Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.

usable(unit)

Returns whether this translation unit is usable for terminology.

translate.search.match.unit2dict(unit)

Converts a pounit to a simple dict structure for use over the web.

segment

Module to deal with different types and uses of segmentation.

translate.search.segment.character_iter(text)

Returns an iterator over the characters in text.

translate.search.segment.characters(text)

Returns a list of characters in text.

translate.search.segment.sentence_iter(text)

Returns an iterator over the sentences in text.

translate.search.segment.sentences(text)

Returns a list of sentences in text.

translate.search.segment.word_iter(text)

Returns an iterator over the words in text.

translate.search.segment.words(text)

Returns a list of words in text.
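For illustration, regex-based sketches of word and sentence segmentation; the toolkit's own segmentation rules may differ:

```python
import re

def toy_words(text):
    """Illustrative only: a regex-based sketch of word segmentation,
    treating runs of word characters as words."""
    return re.findall(r"\w+", text, re.UNICODE)

def toy_sentences(text):
    """Illustrative only: split on sentence-ending punctuation
    followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(toy_words("Hello, world!"))               # ['Hello', 'world']
print(toy_sentences("First one. Second one!"))  # ['First one.', 'Second one!']
```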

terminology

A class that does terminology matching.