Posted by lordoftheflame on February 23, 2007
After my post on the tools and methods for screen scraping, I was on the wild internet again to find some more interesting and useful tools, given the travails I’ve gone through to obtain a machine-readable format of the ICD-9-CM codes. I still wonder why, even in this age, people think only of themselves and don’t even give a dime to the stupid machines working that hard to make our lives simpler. Anyway, here we go:
Posted in Research | 2 Comments »
Posted by lordoftheflame on February 19, 2007
Here are some tools and things one should (read: I should) never forget while trying to extract text from documents of all kinds and file formats. Here they go:
The survival armour:
- “lynx -dump”: The first phase of screen scraping. Nothing else comes before this. The next thing that comes to mind is to convert the given file (of arbitrary format) to plain text and start ascending the hill from there.
- Convert to XML: OpenOffice has excellent support for converting several kinds of formats into standard XML documents. OpenOffice rocks, except for its memory constraints while starting up.
Windows? You will disappear into antiquity. Need a suggestion? The GNU homepage will tell you.
- HTML Parser: See SourceForge.
- XML Parsing: Parsers are available in all flavours and in all languages. But when it comes to handling huge files, go for SAX parsing, which streams the document instead of loading it whole into memory. I prefer Java.
- Body Text Extraction: This is by far the best script (sorry, program) I’ve ever known for extracting the body (in its real sense) from any HTML page. It does have performance problems, so it cannot be used as is for real-time extraction, but a minor modification involving dynamic programming will make it ready for the race.
- Along with that, we have up our sleeve the divine editor “vi”, the Swiss-army knife “sed”, and the dark horse “grep”, along with “tr” and many of its friends from GNU, to solve many of the problems that seem to trouble us.
- Anything beyond that, the God “Perl” takes over the reins. The moment you get tired of piling up hacks, Java shows its importance. The story should end there; otherwise you are making a huge mistake somewhere.
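To make the SAX advice from the list concrete, here is a minimal sketch. It is written in Python with the standard xml.sax module rather than in Java, purely to keep it short; the element name "code" and the tiny sample document are made up for illustration. The handler sees one event at a time, so the whole file never has to sit in memory.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects the text of every element named `target`, without building a tree."""

    def __init__(self, target):
        super().__init__()
        self.target = target      # element name to look for (hypothetical)
        self.inside = False       # True while we are inside a target element
        self.buffer = []          # text chunks of the current target element
        self.results = []         # finished texts, in document order

    def startElement(self, name, attrs):
        if name == self.target:
            self.inside = True
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self.inside:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.target:
            self.results.append("".join(self.buffer).strip())
            self.inside = False


def extract_text(stream, target):
    """Stream-parse `stream` (a file name or file object) and return target texts."""
    handler = TextCollector(target)
    xml.sax.parse(stream, handler)
    return handler.results


# Made-up sample; a real run would pass a file name or an open file instead.
sample = b"<codes><code>A1</code><code>B2</code></codes>"
print(extract_text(io.BytesIO(sample), "code"))
```

The sketch assumes target elements are not nested inside each other; handling that would need a depth counter instead of a single flag.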
That’s probably not everything. I can’t remember the others right away, probably because of my half-hearted interest in posting this and probably because of the sound sleep that’s taking over. Good night.
Posted in Blogroll | No Comments »
Posted by lordoftheflame on February 17, 2007
After that TS (test series, namesake), I am participating (read: working towards participation / planning to participate) in the International Challenge: Classifying Clinical Free Text Using Natural Language Processing, which involves assigning ICD-9-CM codes to clinical free text.
I hope it will be a decent effort, if not an outright success.
Posted in Blogroll, Research | No Comments »
Posted by lordoftheflame on February 8, 2007
It’s been quite some time. A few days back, I was trying to figure out ways to post to my blog by email, without even logging into it. Blogspot does give me this, but somehow I find wp far better than Blogspot; it lets me enjoy freedom, tons of it. But then, every time I try to log in to wp, I quickly realise how busy it is from its load time (don’t tell me its software is bad). Eventually I failed to figure out any such trick, forcing myself back to posting through wp online.
And then the big idea! I don’t know for sure how many hours a day I spend online, but I am afraid that if I really start taking statistics, it would make me face some music from my parents. Leaving that aside, I often feel this. Yes, the web 2.0 way of social bookmarking is good. Pretty good. Equally good was the Google psearch (personalized search; underline ‘was’, since for me delicious is far better). But would it not be better if a Firefox plugin (I don’t take the risk of suggesting new features for IE, even after witnessing the ‘grand’ release of Vista!) could observe me all the time, noting down what links I am traversing and storing them in some cute data structure, so that I can return later and figure out what I was looking at and found so interesting that day after the Math class? (That one seems to run quite long. The sentence, that is.) That would be one more aspect of personalizing search and the online experience. For example, it would be helpful if a browser plugin noted that I visited the Wikipedia main page, then went to the ACL wiki page, from there to the nlpers blog page, and then (after realising this idea) to the WordPress login page to submit this post, and so on.
If current bookmarking, which gives no importance to the links followed before and after a page, corresponds to the bag-of-words approach of the state-of-the-art search engines, my idea would map to something like discourse in natural language processing (’twas a boring analogy, perhaps).
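A rough sketch of what that “cute data structure” might look like, in Python. This is only an illustration of the idea, not an actual Firefox plugin: the class name, its methods and the short page labels are all made up, and a real plugin would have to get the referrer from the browser itself.

```python
class BrowsingTrail:
    """Remembers which page led to which, in visiting order (hypothetical design)."""

    def __init__(self):
        self.visits = []      # every page, in the order it was visited
        self.came_from = {}   # page -> the page whose link first led to it

    def visit(self, url, referrer=None):
        """Record a visit; `referrer` is the page the link was followed from."""
        self.visits.append(url)
        if referrer is not None and referrer != url and url not in self.came_from:
            self.came_from[url] = referrer

    def path_to(self, url):
        """Walk the referrers backwards to rebuild the trail that led to `url`."""
        path = [url]
        seen = {url}
        while path[-1] in self.came_from:
            previous = self.came_from[path[-1]]
            if previous in seen:   # guard against link cycles
                break
            path.append(previous)
            seen.add(previous)
        return list(reversed(path))


# The trail from the example above, with made-up labels instead of real URLs.
trail = BrowsingTrail()
trail.visit("wikipedia-main")
trail.visit("acl-wiki", referrer="wikipedia-main")
trail.visit("nlpers-blog", referrer="acl-wiki")
trail.visit("wordpress-login", referrer="nlpers-blog")
print(trail.path_to("wordpress-login"))
```

Unlike a flat bookmark list, asking for the path to a page gives back the pages that came before it, which is the “discourse” part of the analogy.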
And realising that my age-old plans to contribute to the open source community with some decent project are still only plans, here I sign off with a brand new, exciting plan to jump into open source software contribution, along with bolstering my research background. See ya.
Posted in Blogroll | 1 Comment »