Status and Changes
Overview
Generic TODO list for the SVN version
Release notes for specific releases are listed below. For the latest version in the SVN repository, a TODO list is kept in the repository itself, which can be found here: TODO.txt
Web as corpus Toolkit version pre 3 (November 2005)
Download
- See Downloads section.
New
- The first version to be officially released under the terms of the GNU General Public License
Open issues
Some filter modules are still at the alpha or beta stage:
- insert-wactags: Conflicts with boilerplate and is far from usable
- doctor-unicode: Needs continual improvement to repair previously unseen encoding errors
- string-tokenizer: Too slow and not intelligent. Should perhaps be rewritten
- british-american: Language classifier that still needs evaluation
- generic-shell-filter: Currently not parallelization-safe
- Sketchout should support output formats other than the input format for the Word Sketch Engine. Real XML output will be added in the future.
- Unicode::MapUTF8 is reported as required, which is incorrect. See FAQ.
Wishes and Remarks
People using or testing version pre3 came up with the following:
FilterPack:
- ... should also work as a standalone application on raw data.
- to-unicode relies on the character set information of web pages, which is unfortunate for Chinese. Guessing with Encode::Guess is not very accurate compared to other tools.
- boilerplate can be too radical and may not be usable at all with languages that do not use white space as a word delimiter.
Last edited on Monday 28 August 2006 14:00:30