yaw angle

The right direction..


Accoutrements for document processing / text extraction

Posted by lordoftheflame on February 19, 2007

Here are some tools and tricks one should (read: I should) never forget when trying to extract text from documents of all kinds and file formats. Here they are:

The survival armour:

  • “lynx -dump”: The first phase of screen-scraping. Nothing else comes before this. The next step that comes to mind is to convert the given file (of arbitrary format) to plain text and start climbing the hill from there.
  • Convert to XML: OpenOffice does an excellent job of converting several kinds of formats into standard XML documents. OpenOffice rocks, except for its memory consumption while starting up.

Windows? You will disappear into antiquity. Need a suggestion? The GNU homepage will tell you.
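
The “lynx -dump” first pass mentioned above can also be driven from a script. Here is a minimal sketch in Python (picked only for brevity; the post does not name a language for this step). It assumes lynx is installed and on the PATH. The function names and the 30-second timeout are made up for illustration.

```python
import shutil
import subprocess


def lynx_dump_command(source):
    """Build the argument list for a plain-text dump of a URL or local HTML file."""
    return ["lynx", "-dump", source]


def dump_text(source, timeout=30):
    """Return lynx's text rendering of `source`, or None if lynx is not installed."""
    if shutil.which("lynx") is None:
        return None
    result = subprocess.run(
        lynx_dump_command(source),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        timeout=timeout,
        check=True,           # raise if lynx exits with an error
    )
    return result.stdout
```

Calling dump_text("page.html") (a hypothetical local file) would hand back the rendered text, ready for sed, grep and friends.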

  • HTML Parser: See SourceForge.
  • XML Parsing: Parsers are available in all flavours and in all languages. But when it comes to handling huge files, go for SAX parsing, which streams through the document instead of loading it all into memory. I prefer Java.
  • Body Text Extraction: This is by far the best script (sorry, program) I’ve ever known for extracting the body (in its real sense) from any HTML page. It does have performance problems and cannot be used as-is for real-time extraction, but a minor modification involving dynamic programming will make it ready for the race.
  • Along with that, we have up our sleeve the divine editor “vi”, the Swiss Army knife “sed”, and the dark horse “grep”, along with “tr” and many of its friends from GNU, to solve many of the problems that seem to trouble us.
  • For anything beyond that, the god “Perl” takes over the reins. The moment you get tired of piling on more hacks, Java shows its importance. The story should end there; otherwise you are making a huge mistake somewhere.
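
To make the SAX advice above concrete, here is a small sketch of event-driven text extraction. It uses Python's standard xml.sax module rather than Java, purely to keep the example short; Java's org.xml.sax follows the same handler pattern. The class and function names and the sample document are invented for illustration.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Gathers character data as the parser streams through the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # May be called several times for one run of text, so keep chunks raw.
        self.chunks.append(content)

    def endElement(self, name):
        # Separate the text of adjacent elements with a space.
        # (This also puts a space after inline tags, which is fine for a sketch.)
        self.chunks.append(" ")


def extract_text(xml_stream):
    """Parse a file-like object with SAX and return its text, whitespace-normalised."""
    handler = TextCollector()
    xml.sax.parse(xml_stream, handler)
    return " ".join("".join(handler.chunks).split())


sample = io.BytesIO(b"<doc><title>Hello</title><body>streamed <b>text</b></body></doc>")
print(extract_text(sample))  # prints: Hello streamed text
```

Because the handler only sees one event at a time, memory use stays roughly flat however large the input file is, which is the whole point of choosing SAX for huge documents.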

That’s probably not everything. I can’t remember the others right away, probably because of my half-hearted interest in posting this and the sound sleep that’s taking over. Good night.
