Posted by lordoftheflame on February 23, 2007
After my post on the tools and methods for screen scraping, I was out on the wild internet again to find some more interesting and useful tools, given the travails I’ve gone through to obtain a machine-readable format of the ICD-9-CM codes. I still wonder why, even in this age, people think only of themselves and don’t even give a dime to the stupid machines working that hard to make our life simpler. Anyway, here we go:
Posted in Research | 2 Comments »
Posted by lordoftheflame on February 19, 2007
Here are some tools and things one should (read: I should) never forget while trying to extract text from documents of all kinds and file formats. Here they go:
The survival armour:
- “lynx -dump”: The first phase of screen scraping. Nothing else comes before this. The next thing that comes to mind is to convert the given file (of arbitrary file format) to plain text and start ascending the hill from there.
- Convert to XML: OpenOffice has excellent filters for converting several kinds of formats into standard XML documents. OpenOffice rocks, except for its memory demands while starting up.
Windows? You will disappear into antiquity. Need a suggestion? The GNU homepage will tell you.
- HTML Parser : See sourceforge.
- XML Parsing: Parsers of all kinds are available, in all flavours and in all languages. But when it comes to handling huge files, go for SAX parsing; I prefer Java.
- Body Text Extraction: This is by far the best script (sorry, program) I’ve ever known for extracting the body (in its real sense) from any HTML page. It does have performance problems and cannot be used as is for real-time extraction, but a minor modification involving dynamic programming will make it ready for the race.
- Along with that, we have up our sleeve the divine editor “vi”, the Swiss Army knife “sed”, the dark horse “grep”, along with “tr” and many of its friends from GNU, to solve many of the problems that seem to trouble us.
- Anything beyond that, the God “Perl” takes over the reins. The moment you get tired of any more hacks, Java evinces its importance. The story should end there; otherwise you are making a mistake somewhere, a huge one.
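To make the SAX advice above concrete, here is a minimal sketch of streaming through an XML document without loading it into memory. It uses Python's standard `xml.sax` module rather than the Java I said I prefer, and the element name `code`, the sample document and the `scan` helper are made up purely for illustration.

```python
import io
import xml.sax
from collections import Counter


class TagCollector(xml.sax.ContentHandler):
    """Counts every element name and collects the text of one chosen element."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # hypothetical element name to collect
        self.counts = Counter()   # element name -> number of occurrences
        self.texts = []           # text content of each wanted element
        self._buffer = None       # holds text only while inside a wanted element

    def startElement(self, name, attrs):
        self.counts[name] += 1
        if name == self.wanted:
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.wanted and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def scan(stream, wanted):
    """Stream-parse an XML file object; only the handler's small state is kept."""
    handler = TagCollector(wanted)
    xml.sax.parse(stream, handler)
    return handler


# Made-up sample; a real input would be a large file opened in binary mode.
sample = b"<codes><code>001.0</code><code>002.1</code></codes>"
result = scan(io.BytesIO(sample), "code")
print(result.texts)
```

The handler never sees the whole tree at once, which is the reason SAX copes with huge files where a DOM parser would run out of memory.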
That’s probably not everything. I can’t remember others right away, probably because of half-interest in posting this and probably because of the sound sleep that’s taking over. Good night.
Posted in Blogroll | Leave a Comment »
Posted by lordoftheflame on February 17, 2007
After that TS (test series, name sake), I am participating (read: working towards participation / planning to participate) in the International Challenge: Classifying Clinical Free Text Using Natural Language Processing, which involves assigning ICD-9-CM codes to clinical free text.
I hope it will be a decent effort, if not an outright success.
Posted in Blogroll, Research | Leave a Comment »
Posted by lordoftheflame on February 8, 2007
It’s been quite some time. Some days back, I was trying to figure out ways to post to my blog by email, without even logging into it. Blogspot does give me this, but somehow I clearly find wp far better than blogspot; it lets me enjoy freedom, tons of it. But then, every time I try to log in to wp, I quickly realise how busy it is from its load time (don’t tell me its software is bad). Eventually I failed to figure out such tricks, forcing myself back to posting through wp online.
And then the big idea! I don’t know for sure how many hours a day I spend online, but I am afraid that if I really start taking statistics, it would make me face some music from my parents. Leaving that aside, I often feel this: yes, the web 2.0 way of social bookmarking is good. Pretty good. Equally good was the google psearch (personalized search; underline ‘was’, for me delicious is far better). But would it not be better if a firefox plugin (I don’t take the risk of suggesting new features for IE, even after witnessing the ‘grand’ release of Vista!) could observe me all the time, noting down nicely what links I am traversing and storing them in some cute data structure, so that I can return later and figure out what I was looking for and found very interesting that day after the Math class? (That one seems to run quite long.. the sentence, that is.) That would be one more aspect of personalizing search and the online experience. For example, I see that it would be helpful if a browser plugin noted down that I visited the wikipedia main page, then went to the ACL wiki page, then from that page to the nlpers blog page, and then (after realising this idea) visited the wordpress login page to submit this post, and so on.
If the current bookmarking, which gives no importance to the hyperlinks followed before and after, corresponds to the bag-of-words approach of the state-of-the-art search engines, my idea would map to something like discourse in natural language processing (’twas a boring analogy perhaps).
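Just to pin down what such a “cute data structure” might look like, here is a rough sketch. Nothing here is a real Firefox API; the `BrowsingTrail` class, its method names and the page labels are all hypothetical, and a real plugin would hook into the browser's navigation events instead of being called by hand.

```python
from collections import defaultdict


class BrowsingTrail:
    """Remembers, for every page, which pages I arrived from (hypothetical sketch)."""

    def __init__(self):
        self.previous = defaultdict(list)  # page -> pages it was reached from, in order
        self.last = None                   # the page currently being viewed

    def visit(self, url):
        # Record the link from the page I was on to the page I am on now.
        if self.last is not None and self.last != url:
            self.previous[url].append(self.last)
        self.last = url

    def path_to(self, url, max_hops=10):
        """Walk backwards along the most recent incoming link of each page."""
        path = [url]
        seen = {url}
        while len(path) <= max_hops and self.previous.get(path[-1]):
            prev = self.previous[path[-1]][-1]
            if prev in seen:  # stop on loops such as A -> B -> A
                break
            path.append(prev)
            seen.add(prev)
        return list(reversed(path))


trail = BrowsingTrail()
for page in ["wikipedia-main", "acl-wiki", "nlpers-blog", "wordpress-login"]:
    trail.visit(page)

print(trail.path_to("wordpress-login"))
```

Unlike a flat bookmark, the trail keeps the context: asking how I got to a page returns the chain of pages that led there.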
And, recalling my age-old plans to contribute to the open source community with some decent project, here I sign off with a brand new exciting plan to jump into open source software contribution, along with bolstering my research background. See ya.
Posted in Blogroll | 1 Comment »
Posted by lordoftheflame on January 18, 2007
From now on I had better post all my notes in this category. Here is the first post in the Research category.
My interest in NLP and Question Answering urges me to make a deep study of Information Retrieval as well. Here are some points that keep recurring (reference: wikipedia and the links from there):
- The performance measures rely heavily on a collection of documents and queries for which the relevance of the documents is known.
- And they assume binary relevance: a document is either relevant or completely irrelevant, which is different from what we face in practice.
- Precision, recall, F-measure, fall-out and average precision are measures for IR. Question answering needs different kinds of measures, or modified versions of these.
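For my own reference, the measures above can be sketched in a few lines, using their standard textbook definitions under the binary-relevance assumption. The function names, the document ids and the collection size of 10 are made up for illustration.

```python
def evaluate(retrieved, relevant, collection_size):
    """Set-based measures for one query, assuming binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Fall-out: fraction of the non-relevant documents that were retrieved.
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "fallout": fallout}


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Made-up example: 4 documents retrieved, 3 relevant, 10 in the collection.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
scores = evaluate(ranked, relevant, collection_size=10)
ap = average_precision(ranked, relevant)
print(scores, ap)
```

Note that only average precision looks at the order of the results; the other four treat the retrieved documents as an unordered set.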
Modelling the document for retrieval
- Set-theoretic models represent documents by sets: boolean and fuzzy models.
- Algebraic models represent documents and queries usually as vectors, matrices or tuples. Those vectors, matrices or tuples are transformed by the use of a finite number of algebraic operations to a one-dimensional similarity measurement: the vector space model and, oh, latent semantic analysis. I’ve read quite a bit about it, but could never fit it into the big picture. Things now seem to arrange themselves into the divine order.
- Probabilistic models: And this is my favourite, for reasons that remained unknown even to me till now. It’s all probabilities down the lane, which would probably justify my interest: Bayesian inference, whose points of application I studied hard to grasp, and of course, yes, the language models, which kept haunting me, just like LSA never fitting into the big picture. Conditional random fields, I’ve heard, are the latest model in town, and are involved in all of this.
- A statistical language model assigns a probability to a sequence of words P(w1..n) by means of a probability distribution(wikipedia).
- Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, and hence some sequences are not observed during training of the language model (the data sparseness problem). For that reason these models are often approximated using smoothed N-gram models.
- Fits perfectly in speech recognition, in guessing the next few words.
- And what does it signify when used in IR? A language model is then associated with a document in a collection. With query Q as input, retrieved documents are ranked based on the probability that the document’s language model would generate the terms of the query, P(Q|Md).
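The ranking by P(Q|Md) described above can be sketched as follows. This is only one simple way to do it: I assume a unigram model per document with add-one (Laplace) smoothing, which is my choice for the sketch and not something the sources prescribe. The two documents and the query are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Crude whitespace tokenizer, good enough for a sketch."""
    return text.lower().split()


def query_log_likelihood(query, document, vocabulary_size):
    """log P(Q | Md) under a unigram model of the document.

    Add-one smoothing (an assumption of this sketch) keeps query terms
    that never occur in the document from driving the probability to zero.
    """
    counts = Counter(tokenize(document))
    total = sum(counts.values())
    score = 0.0
    for term in tokenize(query):
        p = (counts[term] + 1) / (total + vocabulary_size)
        score += math.log(p)  # sum of logs avoids underflow on long queries
    return score


# Made-up mini collection.
docs = {
    "d1": "language models assign probabilities to word sequences",
    "d2": "lynx dumps web pages as plain text",
}
vocabulary = {term for text in docs.values() for term in tokenize(text)}
query = "language models"

ranked = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], len(vocabulary)),
    reverse=True,
)
print(ranked)
```

The document whose own language model is most likely to have generated the query terms comes first, which is exactly the data sparseness problem and its smoothing fix seen from the retrieval side.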
Posted in Research | Leave a Comment »
Posted by lordoftheflame on August 18, 2006
Here is my first academic work which can be dubbed a project (in my opinion). This is work I did during the summer of ’06 at IIIT.
Posted in Blogroll | 1 Comment »
Posted by lordoftheflame on August 6, 2006
This is a record of my drop-in-the-ocean contributions towards bettering this world.
Posted in Blogroll | 3 Comments »