HomePage
As of August 2010, these pages are not going to be maintained any further and the WaC TK project has been discontinued. The web site will be kept up as an archive.
Release pre3 is functional, but the latest version from the SVN repository (see below) includes many more features, such as MS Word and PDF input and the management of multiple corpus configurations/profiles.
(This site is a work in progress.)
Welcome to the web site of the Web as Corpus Toolkit. This web page is based on a wiki to make maintenance easier and to bring you features such as RSS feeds and full-text search.
What is the Web as Corpus Toolkit?
The Web as Corpus Toolkit is a collection of programs that can be used to create a (large) text corpus from a list of URLs. The corpus can then be used for linguistic purposes or for lexicography. While it is questionable whether you are allowed to distribute a corpus of web pages whose copyright you do not hold, it is much easier to distribute only pointers to those pages: a list of URLs.
The programs are easy to use and written entirely in Perl. Some extra Perl modules from CPAN are required. Linux is recommended; other Un*x systems have not been verified to work with it. A detailed manual is available. The tools are licensed under the terms of the GNU General Public License.
If you are now wondering what exactly a corpus is and what to do with it, we suggest a short excursion to Wikipedia.
If you wish to get a quick overview, try the Schematic graphic that shows the processing steps performed by the WaC Toolkit. There is also a Screenshots page.
Features
Here is an incomplete list of the most important features of the latest release. It does not include upcoming features that are already present in the SVN repository:
- Parallel downloading from URLs to make full use of your internet connection
- Runs with parallel processes wherever possible to make full use of multi-CPU machines
- Extensive reporting in log files: you can find out everything
- Uses Unicode
- A number of filter modules do everything for you. If they don't do enough, you can write your own
- The common problems of web-as-corpus work are addressed: wrong character set information, conversion to Unicode, boilerplate removal (navigation frames, etc.), sentence segmentation, tokenization, etc.
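To give a rough idea of what the processing steps named in the list above involve, here is a minimal sketch in Python. It is not the toolkit's code (the toolkit is written in Perl) and does not reflect its filter modules; all names (`decode_bytes`, `TextExtractor`, `html_to_text`, `split_sentences`, `tokenize`) are hypothetical, and each step is deliberately naive: a charset fallback, a crude tag-based boilerplate filter, and regex-based sentence segmentation and tokenization.

```python
import re
from html.parser import HTMLParser


def decode_bytes(raw, declared="utf-8"):
    """Try the declared charset first; fall back to Latin-1, which never fails."""
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # Declared charset was wrong or unknown (assumed fallback policy).
        return raw.decode("latin-1")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style and nav elements."""

    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Very crude boilerplate removal: drop markup and skipped elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```

For example, a Latin-1 page that wrongly declares UTF-8 can be passed through `decode_bytes`, `html_to_text`, `split_sentences` and `tokenize` in that order. Real web pages need far more robust handling at every step (abbreviations, nested frames, language-specific tokenization), which is the kind of work the filter modules are meant to take off your hands.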
More Information (Instead of a Menu)
- Status and Changes - what remains to be done and what has been done
- Downloads - get the toolkit
- SVN Repository - get the most recent stuff
- Documentation - read the detailed manual
- FAQ - what people want to know and what is not answered by the manual
- Screenshots - everybody likes those
- People - people who are involved
- History - why the WaC Toolkit exists and how it all started
Contacting Developers
Ways of contacting us are listed on the People page.
Related Projects and Sites
- Web as Corpus Workshop 2005
- Web as Corpus Workshop 2006
- Web as Corpus Kool Ynitiative
- CLEANEVAL Project
- Word Sketch Engine is more than a concordancer. And it can use the output of the WaC Tk.
- OpenNLP provides open-source libraries and tools for natural language processing.
- BootCat Utilities - Bootstrap Corpora and Terms from the Web
The original contents of the virgin Wiki have been moved to the VirginWiki page.
Last edited on Sunday 1 October 2006 12:58:09