Creating a terminology list from your existing translations¶
If you did not create a terminology list when you started your translation project or if you have inherited some old translations you probably now want to create a terminology list.
A terminology list or glossary is a list of words and phrases with their expected translation. They are useful for ensuring that your translations are consistent across your project.
With existing translations you have embedded a list of valid translation. This example will help you to extract the terms. It is only the first step you will need to review the terms and must not regard this as a complete list. And of course you would want to take your corrections and feed them back into the original translations.
Quick Overview¶
This describes a multi-stage process for extracting terminology from translation files. It is provided for historical interest and completeness, but you will probably find that using poterminology is easier and will give better results than following this process.
Filter our phrases of more than N words
Remove obviously erroneous phrases such as numbers and punctuation
Create a single PO compendium
Extract and review items that are fuzzy and drop untranslated items
Create a new PO files and process into CSV and TMX format
Get short phrases from the current translations¶
We will not be able to identify terminology within bodies of text, we are only going to extract short bit of text i.e. ones that are between 1 and 3 words long.
pogrep --header --search=msgid -e '^\w+(\s+\w+){0,2}$' zulu zulu-short
We use --header
to ensure that the PO files have a header entry (which
is important for encoding). We are searching only in the msgid and the regular
expression we use is looking for a string with between 1 and 3 words in it. We
are searching through the folder zulu and outputting the result in
zulu-short
Remove any translations with issues¶
You can for instance remove all entries with only a single letter. Useful for eliminating all those spurious accelerator keys.
pogrep --header --search=msgid -v -e "^.$" zulu-short zulu-short-clean
We use the -v
option to invert the search. Our cleaner potential
glossary words are now in zulu-short-clean. What you can eliminate is only
limited by your ability to build regular expressions but yu could eliminate:
Entries with only numbers
Entries that only contain punctuation
Create a compendium¶
Now that we have our words we want to create a single files of all terminology. Thus we create a PO compendium:
~/path/to/pocompendium -i -su zulu-gnome-glossary.po -d zulu-short-clean
You can use various methods but our bash script is quite good. Here we ignore
case, -i
, and ignore the underscore (_) accelerator key, -su
,
outputting the results in.
We now have a single file containing all glossary terms and the clean up and review can begin.
Split the file¶
We want to split the file into translated, untranslated and fuzzy entries:
~/path/to/posplit ./zulu-gnome-glossary.po
This will create three files:
zulu-gnome-glossary-translated.po – all fully translated entries
zulu-gnome-glossary-untranslated.po – messages with no translation
zulu-gnome-glossary-fuzzy.po – words that need investigation
rm zulu-gnome-glossary-untranslated.po
We discard zulu-gnome-glossary-untranslated.po
since they are of no use to
us.
Dealing with the fuzzies¶
The fuzzies come in two kinds. Those that are simply wrong or needed updating and those where there was more then one translation for a given term. So if someone had translated ‘File’ differently across the translations we’d have an entry that was marked fuzzy with the two options displayed.
pofilter -t compendiumconflicts zulu-gnome-glossary-fuzzy.po zulu-gnome-glossary-conflicts.po
These compendium conflicts are what we are interested in so we use pofilter to filter them from the other fuzzies.
rm zulu-gnome-glossary-fuzzy.po
We discard the other fuzzies as they where probably wrong in the first place. You could review these but it is not recommended.
Now edit zulu-gnome-glossary-conflicts.po
to resolve the conflicts. You
can edit them however you like but we usually follow the format:
option1, option2, option3
You can get them into that layout by doing the following:
sed '/#, fuzzy/d; /\"#-#-#-#-# /d; /# (pofilter) compendiumconflicts:/d; s/\\n"$/, "/' zulu-gnome-glossary-conflicts.po > tmp.po
msgcat tmp.po > zulu-gnome-glossary-conflicts.po
Of course if a word is clearly wrong, misspelled etc. then you can eliminate it. Often you will find the “problem” relates to the part of speech of the source word and that indeed there are two options depending on the context.
You now have a cleaned fuzzy file and we are ready to proceed.
Put it back together again¶
msgcat zulu-gnome-glossary-translated.po zulu-gnome-glossary-conflicts.po > zulu-gnome-glossary.po
We now have a single file zulu-gnome-glossary.po
which contains our
glossary texts.
Create other formats¶
It is probably good to make your terminology available in other formats. You can create CSV and TMX files from your PO.
po2csv zulu-gnome-glossary.po zulu-gnome-glossary.csv
po2tmx -l zu zulu-gnome-glossary.po zulu-gnome-glossary.tmx
For the terminology to be usable by Trados or Wordfast translators they need to be in the following formats:
Trados – comma delimited file
source,target
Wordfast – tab delimited file
source[tab]target
In that format they are now available to almost all localisers in the world.
FIXME need scripts to generate these formats.
The work has only just begun¶
The lists you have just created are useful in their own right. But you most likely want to keep growing them, cleaning and improving them.
You should as a first step review what you have created and fix spelling and other errors or disambiguate terms as needed.
But congratulations a Terminology list or Glossary is one of your most important assets for creating good and consistent translations and it acts as a valuable resource for both new and experienced translators when they need prompting as to how to translate a term.