yaw angle

The right direction..

Archive for the 'Research' Category

My notes of what I am learning and what my ideas are..

cat data-extraction.techniques | more

Posted by lordoftheflame on February 23, 2007

After my post on tools and methods for screen scraping, I went out on the wild internet again to find some more interesting and useful tools, given the travails I’ve gone through to obtain a machine-readable format of the ICD-9-CM codes. I still wonder why, even in this age, people think only of themselves and don’t give a dime to the stupid machines working so hard to make our lives simpler. Anyway, here we go:

Posted in Research | 2 Comments »

Challenge

Posted by lordoftheflame on February 17, 2007

After that TS (test series, for name’s sake), I am participating (read: working towards participation / planning to participate) in the International Challenge: Classifying Clinical Free Text Using Natural Language Processing, which involves assigning ICD-9-CM codes to clinical free text.

I hope it will be a decent effort, if not an outright success.

Posted in Blogroll, Research | No Comments »

IR and language modelling

Posted by lordoftheflame on January 18, 2007

From now on I had better post all my notes in this category. Here is the first post in the Research category.

My interest in NLP and Question Answering urges me to make a deep study of Information Retrieval as well. Here are some points that keep recurring (reference: Wikipedia and the links from there):

Information Retrieval

  • The performance measures rely heavily on a collection of documents and a query for which the relevance of each document is known.
  • They also assume binary relevance: a document is either relevant or completely irrelevant, which is different from what we face in practice.
  • Precision, recall, F-measure, fall-out and average precision are measures for IR. Question answering needs different kinds of measures, or modified versions of these.
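
To make these measures concrete for myself, here is a minimal sketch in Python, assuming binary relevance judgements as described above. The function names and the toy document ids are my own inventions for illustration, not from any library.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 under binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def fall_out(retrieved, relevant, collection_size):
    """Fraction of the non-relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    non_relevant = collection_size - len(relevant)
    false_positives = len(retrieved - relevant)
    return false_positives / non_relevant if non_relevant else 0.0


def average_precision(ranking, relevant):
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


# Hypothetical toy data: a ranked result list and the known relevant set.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(precision_recall_f1(ranking, relevant))
print(fall_out(ranking, relevant, collection_size=10))
print(average_precision(ranking, relevant))
```

Note that only average precision looks at the order of the results; the other measures treat the result list as a plain set.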

Modelling the document for retrieval

  • Set-theoretic models represent documents as sets: Boolean and fuzzy models.
  • Algebraic models usually represent documents and queries as vectors, matrices or tuples, which are transformed by a finite number of algebraic operations into a one-dimensional similarity measurement: the vector space model and, oh, latent semantic analysis. I’ve read quite a bit about LSA, but could never fit it into the big picture. Things now seem to arrange themselves into the divine order.
  • Probabilistic models: this is my favourite, for reasons that remained unknown even to me till now. It’s all probabilities down the lane, which would probably justify my interest: Bayesian inference, whose points of application I studied hard to grasp, and of course the language models, which kept haunting me, just like LSA, by never fitting into the big picture. Conditional random fields, I’ve heard, are the latest model in town and are involved in all of this.
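
As a small illustration of the algebraic family, here is a rough sketch of the vector space model in Python. It assumes raw term counts as vector weights and whitespace tokenisation, which is the simplest choice I could think of (real systems usually weight terms, for example with tf-idf). The function names and the toy documents are made up for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> raw count (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Collapse two term vectors into one similarity number in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (name, score) pairs, most similar document first."""
    q = term_vector(query)
    scored = [(name, cosine_similarity(q, term_vector(text)))
              for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "clinical text coding",
    "d2": "speech recognition models",
    "d3": "coding clinical free text with codes",
}
for name, score in rank("clinical text", docs):
    print(name, round(score, 3))
```

Unlike the Boolean model, the score here is graded, so a document that shares no terms with the query simply lands at the bottom with a score of zero.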

Language modelling

  • A statistical language model assigns a probability P(w1..n) to a sequence of words by means of a probability distribution (Wikipedia).
  • Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, so some sequences are not observed during training of the language model (the data sparseness problem). For that reason these models are often approximated using smoothed N-gram models.
  • This fits perfectly in speech recognition, for guessing the next few words.
  • And what does it signify when used in IR? A language model is associated with each document in a collection. With query Q as input, retrieved documents are ranked by the probability that the document’s language model would generate the terms of the query, P(Q|Md).
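
Here is how I picture the ranking by P(Q|Md), as a hedged sketch in Python. It assumes the simplest case, a unigram (1-gram) model per document, and it assumes Jelinek-Mercer smoothing, which mixes the document model with a collection-wide model. That particular smoothing method is my own pick for illustration; the notes above only say "smoothed". The function names, the `lam` parameter and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, collection_counts, collection_len, lam=0.5):
    """log P(Q | Md) under a unigram model with Jelinek-Mercer smoothing."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len if collection_len else 0.0
        # Smoothing: back off to the collection model for terms the document lacks.
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Return (name, log-likelihood) pairs, most likely generator first."""
    tokenised = {name: text.lower().split() for name, text in docs.items()}
    collection_counts = Counter(t for toks in tokenised.values() for t in toks)
    collection_len = sum(collection_counts.values())
    q = query.lower().split()
    scores = {
        name: query_log_likelihood(q, toks, collection_counts, collection_len, lam)
        for name, toks in tokenised.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": "icd codes for clinical text",
    "d2": "speech recognition guesses the next word",
    "d3": "clinical text clinical codes",
}
for name, score in rank_documents("clinical text", docs):
    print(name, round(score, 3))
```

Without the smoothing term, any document missing a single query word would get probability zero, which is the data sparseness problem showing up at retrieval time.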

Posted in Research | No Comments »