As of August 2010, these pages are not going to be maintained any further and the WaC TK project has been discontinued. The web site will be kept up as an archive.

Release pre3 is functional, but the latest version from the SVN repository (see below) includes many more features, such as MS Word and PDF input and the management of multiple corpus configurations/profiles.

What is the Web as Corpus Toolkit?

The Web as Corpus Toolkit is a collection of programs that can be used to create a (large) text corpus from a list of URLs. The corpus can then be used for linguistic purposes or for lexicography. While it is questionable whether you are allowed to distribute a corpus of web pages whose copyright you do not hold, it is much easier to distribute only pointers to those pages: a list of URLs.

The programs are easy to use and written entirely in Perl. Some extra Perl modules from CPAN are required. Linux is recommended; the tools have not been confirmed to work on other Un*x systems. A detailed manual is available. The tools are licensed under the terms of the GNU General Public License.

Here is an incomplete list of the most important features of the latest release, not including upcoming features that are already present in the SVN repository:

