Annotator Generator

Annotator Generator is an examples-driven generator of fast text annotators. "Annotate" in this context means to add pronunciation or other information to each word, and/or to split text into words in a language that does not use spaces.

If you have a collection of high-quality hand-edited annotations, your generated annotator can:

Download and Usage

Download annogen.py; you will need Python and a command prompt.

Version 0.6282
Usage: annogen.py [options]

Options:

-h, --help
show this help message and exit
--infile=INFILE
Filename of a text file (or a compressed .gz, .bz2 or .xz file) to read the input examples from. If this is not specified, standard input is used.
--incode=INCODE
Character encoding of the input file (default utf-8)
--mstart=MARKUPSTART
The string that starts a piece of text with annotation markup in the input examples; default <ruby><rb>
--mmid=MARKUPMID
The string that occurs in the middle of a piece of markup in the input examples, with the word on its left and the added markup on its right (or the other way around if mreverse is set); default </rb><rt>
--mend=MARKUPEND
The string that ends a piece of annotation markup in the input examples; default </rt></ruby>
--mreverse
Specifies that the annotation markup is reversed, so the text BEFORE mmid is the annotation and the text AFTER it is the base text
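To illustrate how --mstart, --mmid, --mend and --mreverse describe the example input, here is a minimal Python sketch that pulls (word, annotation) pairs out of marked-up text. It is not annogen's own parser; the function name parse_examples and the sample text are assumptions made for this illustration.

```python
import re

def parse_examples(text, mstart="<ruby><rb>", mmid="</rb><rt>",
                   mend="</rt></ruby>", mreverse=False):
    """Return a list of (word, annotation) pairs found in text."""
    pattern = re.compile(
        re.escape(mstart) + "(.*?)" + re.escape(mmid) + "(.*?)" + re.escape(mend),
        re.DOTALL)
    pairs = []
    for left, right in pattern.findall(text):
        # With mreverse, the text BEFORE mmid is the annotation
        pairs.append((right, left) if mreverse else (left, right))
    return pairs

# Hypothetical sample input using the default markup strings
sample = ("<ruby><rb>你好</rb><rt>nǐ hǎo</rt></ruby>"
          "<ruby><rb>吗</rb><rt>ma</rt></ruby>")
print(parse_examples(sample))
```

A quick check like this on your own examples file can confirm that the markup strings you intend to pass actually match the input.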
--reference-sep=REFERENCE_SEP
Reference separator code used in the example input. If you want to keep example source references for each rule, you can label the input with 'references' (chapter and section numbers or whatever), and use this option to specify what keyword or other markup the input will use between each 'reference'. The name of the next reference will be whatever text immediately follows this string. Note that the reference separator, and the reference name that follows it, should not be part of the text itself and should therefore not be part of any annotation markup. If this option is not set then references will not be tracked.
--ref-name-end=REF_NAME_END
Sets what the input uses to END a reference name. The default is a single space, so that the first space after the reference-sep string will end the reference name.
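The way --reference-sep and --ref-name-end label the input can be sketched in Python as follows. This only illustrates the labelling scheme described above and is not annogen's code; the separator "@@", the reference names and the function name split_references are all hypothetical, and how annogen treats text before the first separator is not stated here (the sketch labels it None).

```python
def split_references(text, reference_sep, ref_name_end=" "):
    """Split labelled input into (reference_name, following_text) pairs."""
    parts = text.split(reference_sep)
    sections = []
    if parts[0]:
        # Text before the first separator has no reference name (assumption)
        sections.append((None, parts[0]))
    for part in parts[1:]:
        # The name is whatever immediately follows the separator,
        # up to the first occurrence of ref_name_end
        name, _, body = part.partition(ref_name_end)
        sections.append((name, body))
    return sections

# Hypothetical labelled input: "@@" is the separator, a space ends each name
labelled = "@@ch1.2 first example text @@ch1.3 second example text"
print(split_references(labelled, "@@"))
```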
--ref-pri=REF_PRI
Name of a reference to be considered "high priority" for Yarowsky-like seed collocations (if these are in use). Normally the Yarowsky-like logic tries to identify a "default" annotation based on what is most common in the examples, with the exceptions indicated by collocations. If however a word is found in a high priority reference then the first annotation found in that reference will be considered the ideal "default" even if it's in a minority in the examples; everything else will be considered as an exception. In languages without spaces, this override should normally be used only for one-character words; if used with longer words it might have unexpected effects on rule-overlap ambiguities.
-s, --spaces
Set this if you are working with a language that uses whitespace in its non-markedup version (not fully tested). The default is to assume that there will not be any whitespace in the language, which is correct for Chinese and Japanese.
-c, --capitalisation
Don't try to normalise capitalisation in the input. Normally, to simplify the rules, the analyser will try to remove start-of-sentence capitals in annotations, so that the only remaining words with capital letters are the ones that are ALWAYS capitalised such as names. (That's not perfect: some words might always be capitalised just because they never occur mid-sentence in the examples.) If this option is used, the analyser will instead try to "learn" how to predict the capitalisation of ALL words (including start of sentence words) from their contexts.
-w, --annot-whitespace
Don't try to normalise the use of whitespace and hyphenation in the example annotations. Normally the analyser will try to do this, to reduce the risk of missing possible rules due to minor typographical variations.
--keep-whitespace=KEEP_WHITESPACE
Comma-separated list of words (without annotation markup) for which whitespace and hyphenation should always be kept even without the --annot-whitespace option. Use when you know the variation is legitimate. This option expects words to be encoded using the system locale (UTF-8 if it cannot be detected).
--glossfile=GLOSSFILE
Filename of an optional text file (or compressed .gz, .bz2 or .xz file) to read auxiliary "gloss" information. Each line of this should be of the form: word (tab) annotation (tab) gloss. Extra tabs in the gloss will be converted to newlines (useful if you want to quote multiple dictionaries). When the compiled annotator generates ruby markup, it will add the gloss string as a popup title whenever that word is used with that annotation. The annotation field may be left blank to indicate that the gloss will appear for any annotation of that word. The entries in glossfile do NOT affect the annotation process itself, so it's not necessary to completely debug glossfile's word segmentation etc.
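The glossfile line format can be illustrated with a short Python sketch. It follows the description above (word, tab, annotation, tab, gloss, with extra tabs in the gloss becoming newlines) but is not taken from annogen; the function name parse_gloss_line is hypothetical, and skipping lines with fewer than three fields is an assumption.

```python
def parse_gloss_line(line):
    """Parse one glossfile line into (word, annotation, gloss), or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None  # assumption: lines without three fields are skipped
    word, annotation = fields[0], fields[1]
    # Extra tabs in the gloss are converted to newlines
    gloss = "\n".join(fields[2:])
    return word, annotation, gloss

# A gloss quoting two (hypothetical) dictionaries, separated by a tab
print(parse_gloss_line("好\thǎo\tgood\tfine, well\n"))
# A blank annotation field: the gloss applies to any annotation of the word
print(parse_gloss_line("好\t\tgood\n"))
```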
--glossmiss=GLOSSMISS
Name of an optional file to which to write information about words recognised by the annotator that are missing in glossfile (along with frequency counts and references, if available)
--glossmiss-omit
Omit rules containing any word not mentioned in glossfile. Might be useful if you want to train on a text that uses proprietary terms and don't want to accidentally 'leak' those terms (assuming they're not accidentally included in glossfile also). Words may also be listed in glossfile with an empty gloss field to indicate that no gloss is available but rules using this word needn't be omitted.
--manualrules=MANUALRULES
Filename of an optional text file (or compressed .gz, .bz2 or .xz file) to read extra, manually-written rules. Each line of this should be a marked-up phrase (in the input format) which is to be unconditionally added as a rule. Use this sparingly, because these rules are not taken into account when generating the others and they will be applied regardless of context (although a manual rule might fail to activate if the annotator is part-way through processing a different rule); try checking messages from --diagnose-manual.
--rulesFile=RULESFILE
Filename of an optional auxiliary binary file to hold the accumulated rules. Adding .gz, .bz2 or .xz for compression is acceptable. If this is set then the rules will be written to it (in binary format) as well as to the output. Additionally, if the file already exists then rules will be read from it and incrementally updated. This might be useful if you have made some small additions to the examples and would like these to be incorporated without a complete re-run. It might not work as well as a re-run but it should be faster. If using a rulesFile then you must keep the same input (you may make small additions etc, but it won't work properly if you delete many examples or change the format between runs) and you must keep the same ybytes-related options if any.
--no-input
Don't actually read the input, just use the rules that were previously stored in rulesFile. This can be used to increase speed if the only changes made are to the output options. You should still specify the input formatting options (which should not change), and any glossfile or manualrules options (which may change).
--c-filename=C_FILENAME
Where to write the C program. Defaults to standard output, or annotator.c in the system temporary directory if standard output seems to be the terminal (the program might be large, especially if Yarowsky-like indicators are not used, so it's best not to use a server home directory where you might have limited quota). If MPI is in use then the default will always be standard output.
--c-compiler=C_COMPILER
The C compiler to run if standard output is not connected to a pipe. The default is to use the "cc" command which usually redirects to your "normal" compiler. You can add options (remembering to enclose this whole parameter in quotes if it contains spaces), but if the C program is large then adding optimisation options may make the compile take a LONG time. If standard output is connected to a pipe, then this option is ignored because the C code will simply be written to the pipe. You can also set this option to an empty string to skip compilation. Default: cc -o annotator
--max-or-length=MAX_OR_LENGTH
The maximum number of items allowed in an OR-expression in non table-driven code (used when ybytes is in effect). When an OR-expression becomes larger than this limit, it will be made into a function. 0 means unlimited, which works for tcc and gcc; many other compilers have limits. Default: 100
--nested-switch=NESTED_SWITCH
Allow C/C#/Java/Go switch() constructs to be nested to about this depth. Default 0 tries to avoid nesting, as it slows down most C compilers for small savings in executable size. Setting 1 nests 1 level deeper which can occasionally help get around memory problems with Java compilers. -1 means nest to unlimited depth, which is not recommended.
--outcode=OUTCODE
Character encoding to use in the generated parser and rules summary (default utf-8, must be ASCII-compatible i.e. not utf-16)
-S, --summary-only
Don't generate a parser, just write the rules summary to standard output
--no-summary
Don't add a large rules-summary comment at the end of the parser code
-O SUMMARY_OMIT, --summary-omit=SUMMARY_OMIT
Filename of a text file (or a compressed .gz, .bz2 or .xz file) specifying what should be omitted from the rules summary. Each line should be a word or phrase, a tab, and its annotation (without the mstart/mmid/mend markup). If any rule in the summary exactly matches any of the lines in this text file, then that rule will be omitted from the summary (but still included in the parser). Use for example to take out of the summary any entries that correspond to things you already have in your dictionary, so you can see what's new.
--maxrefs=MAXREFS
The maximum number of example references to record in each summary line, if references are being recorded (0 means unlimited). Default is 3.
--norefs
Don't write references in the rules summary (or the glossmiss file). Use this if you need to specify reference-sep and ref-name-end for the ref-pri option but you don't actually want references in the summary (which speeds up summary generation slightly). This option is automatically turned on if --no-input is specified.
--newlines-reset
Have the annotator reset its state on every newline byte. By default newlines do not affect state such as whether a space is required before the next word, so that if the annotator is used with Web Adjuster's htmlText option (which defaults to using newline separators) the spacing should be handled sensibly when there is HTML markup in mid-sentence.
--compress
Compress annotation strings in the C code. This compression is designed for fast on-the-fly decoding, so it saves only a limited amount of space (typically 10-20%) but that might help if memory is short; see also --data-driven.
--ios=IOS
Include Objective-C code for an iOS app that opens a web-browser component and annotates the text on every page it loads. The initial page is specified by this option: it can be a URL, or a markup fragment starting with < to hard-code the contents of the page. Also provided is a custom URL scheme to annotate the local clipboard. You will need Xcode to compile the app (see the start of the generated C file for instructions); if it runs out of space, try using --data-driven
--data-driven
Generate a program that works by interpreting embedded data tables for comparisons, instead of writing these as code. This can take some load off the compiler (so try it if you get errors like clang's "section too large"), as well as compiling faster and reducing the resulting binary's RAM size (by 35-40% is typical), at the expense of a small reduction in execution speed. Javascript and Python output is always data-driven anyway.
--zlib
Enable --data-driven and compress the embedded data table using zlib, and include code to call zlib to decompress it on load. Useful if the runtime machine has the zlib library and you need to save disk space but not RAM (the decompressed table is stored separately in RAM, unlike --compress which, although giving less compression, at least works 'in place'). Once --zlib is in use, specifying --compress too will typically give an additional disk space saving of less than 1% (and a runtime RAM saving that's greater but more than offset by zlib's extraction RAM).
--windows-clipboard
Include C code to read the clipboard on Windows or Windows Mobile and to write an annotated HTML file and launch a browser, instead of using the default cross-platform command-line C wrapper. See the start of the generated C file for instructions on how to compile for Windows or Windows Mobile.
--c-sharp
Instead of generating C code, generate C# (not quite as efficient as the C code but close; might be useful for adding an annotator to a C# project; see comments at the start for usage)
--java=JAVA
Instead of generating C code, generate Java, and place the *.java files in the directory specified by this option, removing any existing *.java files. See --android for example use. The last part of the directory should be made up of the package name; a double slash (//) should separate the rest of the path from the package name, e.g. --java=/path/to/wherever//org/example/package and the main class will be called Annotator.
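As a reading aid for the double-slash convention in --java, the following Python sketch separates the path from the package name. It is an illustration only, not annogen's code; the function name split_java_option is hypothetical, and the assumption that the files end up in the directory formed by joining both parts is mine.

```python
import posixpath

def split_java_option(value):
    """Split a --java value into (base directory, package name, output directory)."""
    directory, sep, package_path = value.partition("//")
    if not sep:
        raise ValueError("expected // between the path and the package name")
    package_path = package_path.strip("/")
    # The package name uses dots where the path uses slashes
    package_name = package_path.replace("/", ".")
    # Assumption: the *.java files live in base directory + package path
    output_dir = posixpath.join(directory, package_path)
    return directory, package_name, output_dir

print(split_java_option("/path/to/wherever//org/example/package"))
```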
--android=ANDROID
URL for an Android app to browse. If this is set, code is generated for an Android app which starts a browser with that URL as the start page, and annotates the text on every page it loads. A function to annotate the local clipboard is also provided. You will need the Android SDK to compile the app; see comments in MainActivity.java for details.
--ndk=NDK
Android NDK: make a C annotator and use ndk-build to compile it into an Android JNI library. This is a more complex setup than a Java-based annotator, but it improves speed and size. The --ndk option should be set to the name of the package that will use the library, and --android should be set to the initial URL. See comments in the output file for details.
--javascript
Instead of generating C code, generate JavaScript. This might be useful if you want to run an annotator on a device that has a JS interpreter but doesn't let you run native code. The JS will be table-driven to make it load faster (and --no-summary will also be set). See comments at the start for usage.
--python
Instead of generating C code, generate a Python module. Similar to the Javascript option, this is for when you can't run native code, and it is table-driven for fast loading.
--golang=GOLANG
Package name for a Go library to generate instead of C code. See comments in the generated file for how to run this on AppEngine.
--reannotator=REANNOTATOR
Shell command through which to pipe each word of the original text to obtain new annotation for that word. This might be useful as a quick way of generating a new annotator (e.g. for a different topolect) while keeping the information about word separation and/or glosses from the previous annotator, but it is limited to commands that don't need to look beyond the boundaries of each word. If the command is prefixed by a # character, it will be given the word's existing annotation instead of its original text, and if prefixed by ## it will be given text#annotation. The command should treat each line of its input independently, and both its input and its output should be in the encoding specified by --outcode.
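A command suitable for --reannotator only needs to map each input line to an output line. The Python sketch below shows the general shape of such a filter for the ## form, where each line arrives as text#annotation. The lookup table, its romanisation and the function names are hypothetical, and the sketch assumes the text itself contains no # character.

```python
import io

# Hypothetical replacement table; a real command would use its own data
NEW_ANNOTATIONS = {"你好": "nei5 hou2"}

def reannotate_line(line):
    """Map one 'text#annotation' line to a new annotation."""
    text, _, old_annotation = line.partition("#")
    # Fall back to the existing annotation when the word is unknown
    return NEW_ANNOTATIONS.get(text, old_annotation)

def run_filter(lines_in, out):
    """Treat every input line independently, writing one output line for each."""
    for line in lines_in:
        out.write(reannotate_line(line.rstrip("\n")) + "\n")

# Demonstration on an in-memory stream; as a real filter this would be
# run_filter(sys.stdin, sys.stdout), with both streams in the --outcode encoding
demo_out = io.StringIO()
run_filter(io.StringIO("你好#nǐ hǎo\n吗#ma\n"), demo_out)
print(demo_out.getvalue(), end="")
```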
-o, --allow-overlaps
Normally, the analyser avoids generating rules that could overlap with each other in a way that would leave the program not knowing which one to apply. If a short rule would cause overlaps, the analyser will prefer to generate a longer rule that uses more context, and if even the entire phrase cannot be made into a rule without causing overlaps then the analyser will give up on trying to cover that phrase. This option allows the analyser to generate rules that could overlap, as long as none of the overlaps would cause actual problems in the example phrases. Thus more of the examples can be covered, at the expense of a higher risk of ambiguity problems when applying the rules to other texts. See also the -y option.
-P, --primitive
Don't bother with any overlap or conflict checks at all, just make a rule for each word. The resulting parser is not likely to be useful, but the summary might be.
-y YBYTES, --ybytes=YBYTES
Look for candidate Yarowsky seed-collocations within this number of bytes of the end of a word. If this is set then overlaps and rule conflicts will be allowed if the seed collocations can be used to distinguish between them. Markup examples that are completely separate (e.g. sentences from different sources) must have at least this number of (non-whitespace) bytes between them.
--ybytes-max=YBYTES_MAX
Extend the Yarowsky seed-collocation search to check over larger ranges up to this maximum. If this is set then several ranges will be checked in an attempt to determine the best one for each word, but see also ymax-threshold.
--ymax-threshold=YMAX_THRESHOLD
Limits the length of word that receives the narrower-range Yarowsky search when ybytes-max is in use. For words longer than this, the search will go directly to ybytes-max. This is for languages where the likelihood of a word's annotation being influenced by its immediate neighbours more than its distant collocations increases for shorter words, and less is to be gained by comparing different ranges when processing longer words. Setting this to 0 means no limit, i.e. the full range will be explored on ALL Yarowsky checks.
--ybytes-step=YBYTES_STEP
The increment value for the loop between ybytes and ybytes-max
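One possible reading of how --ybytes, --ybytes-max, --ybytes-step and --ymax-threshold interact is sketched below in Python. This is an interpretation of the option descriptions, not annogen's actual loop: the exact endpoints, the unit in which word length is measured, and the function name candidate_ranges are all assumptions.

```python
def candidate_ranges(word_len, ybytes, ybytes_max=0, ybytes_step=1,
                     ymax_threshold=0):
    """List the collocation search ranges that might be tried for one word."""
    if not ybytes_max or ybytes_max <= ybytes:
        # ybytes-max not in use: only the single ybytes range is searched
        return [ybytes]
    if ymax_threshold and word_len > ymax_threshold:
        # Longer words go directly to the widest range
        return [ybytes_max]
    # Otherwise step from ybytes up to ybytes-max (assumed to be inclusive)
    ranges = list(range(ybytes, ybytes_max, ybytes_step))
    ranges.append(ybytes_max)
    return ranges

# Hypothetical settings: ybytes=20, ybytes-max=60, ybytes-step=20
print(candidate_ranges(1, 20, 60, 20))                    # short word
print(candidate_ranges(5, 20, 60, 20, ymax_threshold=2))  # longer word
```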
--warn-yarowsky
Warn when absolutely no distinguishing Yarowsky seed collocations can be found for a word in the examples
--yarowsky-all
Accept Yarowsky seed collocations even from input characters that never occur in annotated words (this might include punctuation and example-separation markup)
--yarowsky-debug=YAROWSKY_DEBUG
Report the details of seed-collocation false positives if there are a large number of matches and at most this number of false positives (default 1). Occasionally these might be due to typos in the corpus, so it might be worth a check.
--single-words
Do not consider any rule longer than 1 word, although it can still have Yarowsky seed collocations if -y is set. This speeds up the search, but at the expense of thoroughness. You might want to use this in conjunction with -y to make a parser quickly. It is like -P (primitive) but without removing the conflict checks.
--max-words=MAX_WORDS
Limits the number of words in a rule; rules longer than this are not considered. 0 means no limit. --single-words is equivalent to --max-words=1. If you need to limit the search time, and are using -y, it should suffice to use --single-words for a quick annotator or --max-words=5 for a more thorough one.
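The advice above about combining -y with --single-words or --max-words=5 could be turned into a command line as in this Python sketch. The helper name annogen_command, the input filename and the ybytes value of 30 are placeholders chosen for illustration, not recommended settings.

```python
def annogen_command(infile, quick=True, ybytes=30):
    """Build an annogen.py argument list for a quick or a more thorough run."""
    cmd = ["python", "annogen.py",
           "--infile=" + infile,
           "--ybytes=%d" % ybytes]  # placeholder value, not a recommendation
    # --single-words is equivalent to --max-words=1
    cmd.append("--single-words" if quick else "--max-words=5")
    return cmd

print(" ".join(annogen_command("examples.txt")))
print(" ".join(annogen_command("examples.txt", quick=False)))
```

The resulting list can be passed to subprocess.run if you want to batch-run several configurations from a script.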
--checkpoint=CHECKPOINT
Periodically save checkpoint files in the specified directory. These files can save time when starting again after a reboot (and it's easier than setting up Condor etc). As well as a protection against random reboots, this can be used for scheduled reboots: if a file called ExitASAP appears in the checkpoint directory, annogen will checkpoint, remove the ExitASAP file, and exit. After a run has completed, the checkpoint directory should be removed, unless you want to re-do the last part of the run for some reason.
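A scheduled-reboot script could request this early exit by creating the ExitASAP file, as in the Python sketch below. The function name request_exit is hypothetical, a temporary directory stands in for the real checkpoint directory, and the sketch assumes that an empty file is enough, since only the file's appearance is mentioned above.

```python
import os
import tempfile

def request_exit(checkpoint_dir):
    """Create an empty ExitASAP file in the given checkpoint directory."""
    path = os.path.join(checkpoint_dir, "ExitASAP")
    # Assumption: only the file's presence matters, so it is left empty
    open(path, "w").close()
    return path

# Stand-in for the directory that was passed to --checkpoint
checkpoint_dir = tempfile.mkdtemp()
flag_path = request_exit(checkpoint_dir)
print(os.path.basename(flag_path))
```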
-d DIAGNOSE, --diagnose=DIAGNOSE
Output some diagnostics for the specified word. Use this option to help answer "why doesn't it have a rule for...?" issues. This option expects the word without markup and uses the system locale (UTF-8 if it cannot be detected).
--diagnose-limit=DIAGNOSE_LIMIT
Maximum number of phrases to print diagnostics for (0 means unlimited); can be useful when trying to diagnose a common word in rulesFile without re-evaluating all phrases that contain it. Default: 10
--diagnose-manual
Check and diagnose potential failures of --manualrules
--diagnose-quick
Ignore all phrases that do not contain the word specified by the --diagnose option, for getting a faster (but possibly less accurate) diagnostic. The generated annotator is not likely to be useful when this option is present. You may get quick diagnostics WITHOUT these disadvantages by loading a --rulesFile instead.
--time-estimate
Estimate time to completion. The code to do this is unreliable and is prone to underestimate. If you turn it on, its estimate is displayed at the end of the status line as days, hours or minutes.
--single-core
Use only one CPU core even when others are available. (If this option is not set, multiple cores are used if a 'futures' package is installed or if run under MPI or SCOOP; this currently requires --checkpoint + shared filespace, and is currently used only for large collocation checks in limited circumstances.)
-p STATUS_PREFIX, --status-prefix=STATUS_PREFIX
Label to add at the start of the status line, for use if you batch-run annogen in multiple configurations and want to know which one is currently running

License

Annotator Generator is free software licensed under the Apache License, Version 2.0 (this is also the license used by Web Adjuster). If you use it in a good project, I'd appreciate hearing about it.

Citation

If you need to cite a peer-reviewed paper:
Silas S. Brown.  Web Annotation with Modified-Yarowsky and Other Algorithms.  Overload 112 (December 2012) pp.4-7 

All material © Silas S. Brown unless otherwise stated.