LEA

strip_headers.cpp File Reference


Detailed Description

This program preprocesses text files from Project Gutenberg.

It strips the headers and footers from a Project Gutenberg ebook text file (http://www.gutenberg.org/). This is necessary because there is no standard delimiter separating the actual text from the header and footer, so the program has to rely on heuristics to find the boundaries.

This program has been tested on nearly all Project Gutenberg texts. For most of the thousands of files it determines the boundaries correctly. For a few files it may leave some header lines in place or remove too many lines.
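
To illustrate what a boundary heuristic of this kind can look like, here is a minimal sketch. It is not the logic of strip_headers.cpp itself: the marker strings and the function names (looks_like_start_marker, looks_like_end_marker, strip_boundaries) are assumptions chosen for illustration, based on marker lines that appear in many, but not all, Project Gutenberg files.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Upper-case a copy of the line so marker matching is case-insensitive.
static std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Assumed start markers; the real program may use different, more numerous rules.
static bool looks_like_start_marker(const std::string& line) {
    const std::string u = to_upper(line);
    return u.find("*** START OF") != std::string::npos ||
           u.find("*END*THE SMALL PRINT") != std::string::npos;
}

// Assumed end markers, matched the same way.
static bool looks_like_end_marker(const std::string& line) {
    const std::string u = to_upper(line);
    return u.find("*** END OF") != std::string::npos ||
           u.find("END OF PROJECT GUTENBERG") != std::string::npos ||
           u.find("END OF THE PROJECT GUTENBERG") != std::string::npos;
}

// Keep the lines after the first start marker and before the first end
// marker that follows it. If a marker is missing, that side is left intact.
std::vector<std::string> strip_boundaries(const std::vector<std::string>& lines) {
    std::size_t begin = 0;
    std::size_t end = lines.size();
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (looks_like_start_marker(lines[i])) {
            begin = i + 1;
            break;
        }
    }
    for (std::size_t i = begin; i < lines.size(); ++i) {
        if (looks_like_end_marker(lines[i])) {
            end = i;
            break;
        }
    }
    return std::vector<std::string>(
        lines.begin() + static_cast<std::ptrdiff_t>(begin),
        lines.begin() + static_cast<std::ptrdiff_t>(end));
}
```

A fixed marker list like this one fails on files that use other wording, which is why a real tool needs several fallback rules and can still misjudge a few files.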

Usage:

  • strip_headers INFILE OUTFILE
    
Parameters:
INFILE    The name of the input file downloaded from Project Gutenberg.
OUTFILE   The name of the output file. It must differ from the input file, because the two files are read and written at the same time.
Returns:
0 on success, a nonzero value on error.
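
The argument contract above can be sketched as a small check. This is a hypothetical helper, not code from strip_headers.cpp: the name validate_args and the specific error codes 1 and 2 are assumptions, since the documentation only promises a nonzero value on error. Comparing the names as strings is also a simplification, because two different paths can refer to the same file.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: args holds the program name followed by INFILE and OUTFILE.
// Returns 0 if the arguments are usable, a nonzero value otherwise.
int validate_args(const std::vector<std::string>& args) {
    if (args.size() != 3) {
        std::cerr << "Usage: strip_headers INFILE OUTFILE\n";
        return 1;  // assumed code for a wrong argument count
    }
    if (args[1] == args[2]) {
        // Input and output are processed at the same time, so writing to
        // the input file would corrupt the data still being read.
        std::cerr << "Error: OUTFILE must differ from INFILE\n";
        return 2;  // assumed code for identical file names
    }
    return 0;
}
```

A main function could build the vector with std::vector<std::string>(argv, argv + argc) and return the result of validate_args directly when it is nonzero.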

Download: