
Bug#504504: RFP: lemur -- Toolkit for Language Modeling and Information Retrieval



Package: wnpp
Severity: wishlist

* Package name    : lemur
  Version         : 4.7
  Upstream Author : The Lemur Project
* URL             : http://www.lemurproject.org/
* License         : MIT/X
  Programming Lang: C, C++
  Description     : Toolkit for Language Modeling and Information Retrieval

The Lemur Toolkit is designed to facilitate research in language
modeling and information retrieval. Lemur supports a wide range of
industrial and research language applications such as ad-hoc retrieval,
site-search, and text mining.

The toolkit supports indexing of large-scale text databases, the
construction of simple language models for documents, queries, or
subcollections, and the implementation of retrieval systems based on
language models as well as a variety of other retrieval models. 

The system is written in the C and C++ languages.

Below is a summary listing of the features found within the Lemur
Toolkit:

 * Sophisticated structured query languages (using InQuery and Indri)
 * Support for XML and structured document retrieval
 * Commonly used with a wide range of research test collections (e.g.,
   TREC CDs 1-5, wt10g, RCV1, gov, gov2)
 * "Out-of-the-box" site search capability for indexing your web pages
 * Interactive interfaces for Windows, Linux, and Web
 * Distributed information retrieval and document clustering
   applications
 * Cross-platform, fast and modular code written in C++
 * C++, Java and C# APIs
 * In use since 2002 by a large and growing user community

Indexing features:

 * Multiple indexing methods for small, medium and large-scale
   (terabyte) collections
 * Built-in support for English, Chinese and Arabic text
 * Porter and Krovetz word stemming
 * Incremental indexing
 * Out-of-the-box indexing support for TREC Text, TREC Web, plain text,
   HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
 * Indexes inline and offset text annotations (e.g., part-of-speech and
   named entities)
 * Indexes document attributes

Retrieval features:

 * Supports major language modeling approaches such as Indri and
   KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
 * Relevance and pseudo-relevance feedback
 * Wildcard term expansion (using Indri)
 * Passage and XML element retrieval
 * Cross-lingual retrieval
 * Smoothing via Dirichlet priors and Markov chains
 * Supports arbitrary document priors (e.g., PageRank, URL depth)
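
To give an idea of what "smoothing via Dirichlet priors" means in the
list above, here is a minimal, self-contained C++ sketch of the textbook
formula p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu). It does not use
the Lemur API; the names Counts, dirichletProb and queryLogLikelihood
are made up for this illustration, and the choice to skip query terms
that never occur in the collection is an assumption of the sketch.

```cpp
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper type: term -> count, for one document or the whole collection.
using Counts = std::unordered_map<std::string, double>;

// Dirichlet-smoothed estimate of p(term | document):
//   (c(term, doc) + mu * p(term | collection)) / (|doc| + mu)
double dirichletProb(const std::string& term,
                     const Counts& doc, double docLen,
                     const Counts& coll, double collLen,
                     double mu) {
    auto d = doc.find(term);
    auto c = coll.find(term);
    double tf = (d == doc.end()) ? 0.0 : d->second;
    double pColl = (c == coll.end()) ? 0.0 : c->second / collLen;
    return (tf + mu * pColl) / (docLen + mu);
}

// Query log-likelihood under the smoothed document model.
// Assumption of this sketch: terms with zero probability (unseen in the
// whole collection) are skipped so that log(0) is never taken.
double queryLogLikelihood(const std::vector<std::string>& query,
                          const Counts& doc, double docLen,
                          const Counts& coll, double collLen,
                          double mu) {
    double score = 0.0;
    for (const auto& term : query) {
        double p = dirichletProb(term, doc, docLen, coll, collLen, mu);
        if (p > 0.0) {
            score += std::log(p);
        }
    }
    return score;
}
```

With a 3-term document containing "lemur" twice, a 100-term collection
containing "lemur" 10 times, and mu = 7, dirichletProb gives
(2 + 7 * 0.1) / (3 + 7) = 0.27 for "lemur". A larger mu pulls the
estimate closer to the collection frequency.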

-----------------------------------------------------------------------------

I'll start working on the packaging myself today.  My work will likely
appear somewhere on http://non-gnu.uvt.nl/.

Bye,

Joost


