Clunky method to create glossaries

Note

A newer tool called poterminology is now part of the Translate Toolkit and should work better for all users. This page is only left for reference.

This old method was written for Windows users, but the principles apply on Linux as well. It creates a subject-specific glossary for a specific software translation project. The idea is to extract the jargon that occurs frequently in the project so that it can be translated beforehand.

1. Extract the source text

Run po2csv on the PO file:

po2csv file.po file.csv

Open the CSV file in Excel or Calc.

Copy the “original” (or “source”) column into a text editor, like Babelpad, and save the file as text.txt.
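If no spreadsheet is at hand, the copy-and-paste step can also be scripted. The following is a minimal Python sketch, not part of the Translate Toolkit. It assumes the CSV has a header row that names the column "source" or "original" and that the files are UTF-8; the function names are hypothetical.

```python
import csv
import io


def extract_source_column(csv_text, candidates=("source", "original")):
    """Return the non-empty cells of the source-text column.

    Assumption: the first row is a header containing one of `candidates`.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    for name in candidates:
        if name in header:
            index = header.index(name)
            break
    else:
        raise ValueError("no source column found in header: %r" % (rows[0],))
    # Skip the header row and any rows whose source cell is missing or blank.
    return [row[index] for row in rows[1:]
            if len(row) > index and row[index].strip()]


def csv_to_text(csv_path="file.csv", txt_path="text.txt"):
    """Write the source column of csv_path to txt_path, one string per line."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        strings = extract_source_column(handle.read())
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(strings) + "\n")
```

Calling csv_to_text() with the default names reads file.csv and writes text.txt, matching the file names used in this step.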

2. Extract commonly occurring words

Use ExtPhr32 to extract the most frequently used terms from text.txt:

  • Do not select a stoplist (click “Cancel” when asked)
  • Go to Options ‣ Minimum occurrences, and select “1”
  • Go to Options ‣ Maximum words in phrase, and select “1”
  • Go to File ‣ Extract from, and select the file with the source text (text.txt)
  • Raise “Minimum occurrences” step by step until about 200 terms remain (a minimum of about 5 occurrences usually does it).
  • Save the result as 5ormore.txt.
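The counting done in this step can be approximated in a few lines of Python. This is a rough sketch and not a reimplementation of ExtPhr32: it assumes a word is a run of ASCII letters (with an optional apostrophe part), folds case before counting, and uses hypothetical function names.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with an
# apostrophe part such as "don't". Accelerator marks, digits and
# punctuation act as separators.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def frequent_words(text, min_occurrences=5):
    """Return single words occurring at least min_occurrences times,
    most frequent first (ties sorted alphabetically)."""
    counts = Counter(word.lower() for word in WORD_RE.findall(text))
    kept = [word for word, n in counts.items() if n >= min_occurrences]
    return sorted(kept, key=lambda word: (-counts[word], word))


def threshold_for(text, target=200):
    """Raise the minimum-occurrences setting until at most `target`
    terms remain, mirroring the manual procedure above."""
    minimum = 1
    while len(frequent_words(text, minimum)) > target:
        minimum += 1
    return minimum
```

With the contents of text.txt in a string, frequent_words(text, threshold_for(text)) gives a list comparable to 5ormore.txt.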

3. Create general common words list

Create a file containing the 3000 most frequently used words in English. I used Ogden’s 850 words, plus the 3000 words, adjectives, adverbs and pronouns from here, but you could use Kevin’s lists as well (that would actually be better, because then you could legally distribute the file).

4. Remove general words from extracted common words list

Use Kastrul to spellcheck 5ormore.txt against the common words list, so that only the words missing from that list are reported:

  • Go to File ‣ Open dictionary, and select your file of 3000 most common words.
  • Go to File ‣ Check file, and select 5ormore.txt.
  • Copy the result, and save it as glossary.txt.

That is your glossary. Remember to convert it all to lowercase and to remove any dud words.
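The filtering and lowercasing can be sketched in Python as well. This assumes both word lists are available as sequences of strings (for example, one word per line read from each file). build_glossary is a hypothetical helper, and unlike Kastrul it does a plain set lookup rather than a spellcheck. Dud words still have to be weeded out by hand.

```python
def build_glossary(candidates, common_words):
    """Lowercase the candidate terms and drop general English words.

    candidates   -- extracted frequent words (e.g. lines of 5ormore.txt)
    common_words -- general word list (e.g. the 3000 most used words)
    """
    common = {word.strip().lower() for word in common_words if word.strip()}
    seen = set()
    glossary = []
    for word in candidates:
        term = word.strip().lower()
        # Keep each term once, in its original order of appearance.
        if term and term not in common and term not in seen:
            seen.add(term)
            glossary.append(term)
    return glossary
```

For example, with candidates "File", "the" and "Plugin" and a common words list containing "the", the result is "file" and "plugin".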