Bug#504504: RFP: lemur -- Toolkit for Language Modeling and Information Retrieval
Package: wnpp
Severity: wishlist
* Package name    : lemur
  Version         : 4.7
  Upstream Author : The Lemur Project
* URL             : http://www.lemurproject.org/
* License         : MIT/X
  Programming Lang: C, C++
  Description     : Toolkit for Language Modeling and Information Retrieval
The Lemur Toolkit is designed to facilitate research in language
modeling and information retrieval. Lemur supports a wide range of
industrial and research language applications such as ad-hoc retrieval,
site-search, and text mining.
The toolkit supports indexing of large-scale text databases, the
construction of simple language models for documents, queries, or
subcollections, and the implementation of retrieval systems based on
language models as well as a variety of other retrieval models.
The system is written in the C and C++ languages.
Below is a summary listing of the features found within the Lemur
Toolkit:
* Sophisticated structured query languages (using InQuery and Indri)
* Support for XML and structured document retrieval
* Commonly used with a wide range of research test collections (e.g.,
  TREC CDs 1-5, wt10g, RCV1, gov, gov2)
* Index your web pages with an "out-of-the-box" site search capability
* Interactive interfaces for Windows, Linux, and Web
* Distributed information retrieval and document clustering
  applications
* Cross-platform, fast and modular code written in C++
* C++, Java and C# APIs
* In use since 2002 by a large and growing user community
Indexing features:
* Multiple indexing methods for small, medium and large-scale
  (terabyte) collections
* Built-in support for English, Chinese and Arabic text
* Porter and Krovetz word stemming
* Incremental indexing
* Out-of-the-box indexing support for TREC Text, TREC Web, plain text,
  HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
* Indexes inline and offset text annotations (e.g., part-of-speech and
  named entities)
* Indexes document attributes
Retrieval features:
* Supports major language modeling approaches such as Indri and
  KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
* Relevance and pseudo-relevance feedback
* Wildcard term expansion (using Indri)
* Passage and XML element retrieval
* Cross-lingual retrieval
* Smoothing via Dirichlet priors and Markov chains
* Supports arbitrary document priors (e.g., PageRank, URL depth)
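
For readers unfamiliar with the smoothing feature listed above, here is a
minimal sketch of query-likelihood scoring with Dirichlet prior smoothing,
p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu). It is not Lemur's code or API:
the function name dirichlet_score, the value mu=2500 and the two toy
documents are assumptions made purely for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_counts, collection_len, mu=2500.0):
    """Log query likelihood of a document under Dirichlet prior smoothing.

    Hypothetical helper for illustration only; not part of the Lemur API.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Collection (background) language model probability of the term.
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0.0:
            # Assumption of this sketch: ignore terms unseen in the collection.
            continue
        # Smoothed document model: document counts backed off to the collection.
        p = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score


# Toy collection (made up for this example).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "document clustering and text mining".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())

query = ["language", "retrieval"]
ranking = sorted(
    docs,
    key=lambda d: dirichlet_score(query, docs[d], collection_counts, collection_len),
    reverse=True,
)
print(ranking)
```

A document that contains the query terms scores higher than one that does
not, but the latter still gets a non-zero probability from the collection
model, which is the point of smoothing.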
-----------------------------------------------------------------------------
I'll start working on the packaging myself today. My work will likely
appear somewhere on http://non-gnu.uvt.nl/.
Bye,
Joost