FAQ

This page collects (probably) frequently asked questions about the WaC Tk and about creating a corpus.

FAQ: Table of Contents

URL lists
Where do I get such a URL list?
How do I create my own URL list?
Spidering or Crawling?
Is ParaGet suitable for spidering?
What can I do if I want to download whole sites?
Installation Issues
I can't find HTML::Strip::StripX on CPAN?!
Unicode::MapUTF8 won't install from CPAN. What to do?
What is happening to my data? (Or: Logfile analysis)
FilterPack mixes logfile entries?!
What are the most common errors that cause documents to be filtered away by FilterPack?
ParaGet created a huge mess!?
Obscure Runtime Errors
Could not create semaphore set: No space left on device

URL lists

Where do I get such a URL list?

A good starting point is Serge Sharoff's collection of internet corpora. He provides URL lists for all the corpora he created. At least the ones for English and German will work. Processing Russian and Chinese proved to be difficult with the WaC Tk.

Affected releases: All

How do I create my own URL list?

You can use Marco Baroni's BootCaT tools to create a corpus and/or a list of URLs. His tools are based on queries to Google.

A more complex solution might be to use a crawler yourself.

Affected releases: All

Spidering or Crawling?

Is ParaGet suitable for spidering?

No. ParaGet can only download the URLs given in the list. Some ideas for turning it into a crawler exist, but they are difficult to implement. Maybe there will be a fork called DeepParaGet in the future that does this.

Affected releases: All

What can I do if I want to download whole sites?

For downloading whole sites, there is a hacky workaround based on wget. It works like this:

1. Create a file with one site/URL per line.
2. Create a directory to dump data to. Change to that directory.
3. Use wget to crawl over the sites, storing its output. (The directory from the previous step will be filled with a lot of files.)
4. Extract all URLs from the wget output to create a URL list for ParaGet.
5. Run ParaGet with the created URL list and use the WaC Tk in the normal fashion.

For step 3, you can use this command:

wget -r -np -S -E -T 10 -c -R ".pdf,.png,.jpg,.gif,.avi,.wmv,.wma,.mp3,.mid,.ogg,.wav,.mpg,.iso,.gz,.zip,.rar" -i site-addresses.dat 2> wget-output.dat

For extracting the URLs from wget's output (step 4), use this Perl script:

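#!/usr/bin/perl
# Print every URL found in wget's output, one per line.
use warnings;
use strict;
while (my $in = <>) {
        # A URL (http or https) preceded by two spaces, up to the end of the line.
        if ($in =~ m{  (https?://.*)}) {
                my $url = $1;
                $url =~ s/^\s+//;   # trim leading whitespace
                $url =~ s/\s+$//;   # trim trailing whitespace
                print "$url\n";
        }
}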

Yes, this workaround will make you download all data twice.
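The extracted list may contain the same URL more than once. A minimal shell sketch for removing duplicates before handing the list to ParaGet could look like this. The function name dedupe_urls and the file names extract-urls.pl and url-list.dat are assumptions chosen for illustration; they are not part of the WaC Tk.

```shell
#!/bin/sh
# Hypothetical helper: read URLs on stdin and print each distinct URL once.
# Note that the output is sorted, so the original order is not preserved.
dedupe_urls() {
    sort -u
}

# Example usage (script and file names are assumptions):
#   perl extract-urls.pl wget-output.dat | dedupe_urls > url-list.dat
```

Sorting is the simplest way to drop duplicates here, since the order of the URLs in the list should not matter for a plain download.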

Affected releases: All

Installation Issues

I can't find HTML::Strip::StripX on CPAN?!

Sure you can't, it is not there. It ships with the WaC Tk. As soon as HTML::Strip is installed from CPAN, it will become available and modules won't complain about it any more.

Affected releases: All

Unicode::MapUTF8 won't install from CPAN. What to do?

Ignore it and remove the corresponding line

use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset);

from fp-filters/doctor-uniccode.pm.

The module is not required; we forgot to remove the line in release pre3.

Affected releases: pre3

What is happening to my data? (Or: Logfile analysis)

FilterPack mixes logfile entries?!

This is true. Multiple processes are writing to the output at the same time, which can have inconvenient effects. This has been fixed in the SVN version.

Affected releases: All before SVN revision 19

What are the most common errors that cause documents to be filtered away by FilterPack?

You can create statistics on the FilterPack logfile with the following shell command. It counts identical log lines containing -FIL and lists the most frequent ones first:

grep "\-FIL" filterpack.log | sort | uniq -c | sort -n -r

Affected releases: All

ParaGet created a huge mess!?

This can happen if you ran ParaGet several times. The reason is that it does not remove old .dl files from the directory tree, so you end up with two mixed versions of the download. To fix this, rename or delete the data directory you specified in paraget.conf.
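If you would rather keep the old download than delete it, the renaming can be sketched in shell as follows. The function name archive_dir and the example path are assumptions for illustration only; use the data directory that is actually set in your paraget.conf.

```shell
#!/bin/sh
# Hypothetical helper: move a directory aside under a timestamped name,
# so that the next ParaGet run starts with a fresh data directory.
archive_dir() {
    dir=${1%/}    # strip a trailing slash, if any
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    mv -- "$dir" "$dir.old.$(date +%Y%m%d-%H%M%S)"
}

# Example usage (the path is an assumption):
#   archive_dir /path/to/paraget-data
```

The timestamp suffix keeps earlier downloads apart if you repeat this several times.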

Affected releases: All

Obscure Runtime Errors

Could not create semaphore set: No space left on device

This message can occur with FilterPack and ParaGet. They both use System V Inter-process Communication (IPC) to talk to their child processes. In rare cases, killing the parent process (e.g. by pressing CTRL+C) leaves shared memory segments and semaphores behind, and these accumulate until you hit the system's limit. You can use ipcs to list them.
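To see how many leftover objects belong to you, the ipcs output can be filtered by owner. The following is only a sketch: the function name count_owned is made up for illustration, and it assumes that the owner is the third column, as in the output of ipcs on Linux. Check the column layout on your system first.

```shell
#!/bin/sh
# Hypothetical helper: count the ipcs rows on stdin whose owner column
# (assumed to be the third field, as in Linux ipcs output) equals $1.
count_owned() {
    awk -v user="$1" '$3 == user { n++ } END { print n + 0 }'
}

# Example usage:
#   ipcs -s | count_owned "$(id -un)"    # semaphore sets you own
#   ipcs -m | count_owned "$(id -un)"    # shared memory segments you own
#   ipcs -l                              # show the system limits (Linux ipcs)
```

If the count for semaphore sets is close to the limit reported by ipcs -l, leftover objects are the likely cause of the error.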

To remove all your shared memory segments and semaphores you can use the following dangerous commands that you never want to run as root:

ipcs -m | awk '{print($2)}' |grep  '^[0-9]' | xargs -n 1 ipcrm -m
ipcs -s | awk '{print($2)}' |grep  '^[0-9]' | xargs -n 1 ipcrm -s

This might disturb other programs running on your machine under your user account. Be careful!

Affected releases: pre3 and maybe later TODO: FIND OUT AND FILL IN HERE!

Last edited on Monday 4 February 2008 11:37:38
