Apache Lucene – Knowledge and References

Dealing With Raw Data

Published in Cliff Wootton, Developing Quality Metadata, 2009

Apache Lucene is an open source, high-performance, full-featured text search engine library written entirely in Java. It is suitable for a wide variety of applications that require full-text search, especially cross-platform ones. For the moment it is available only in Java, but you could download the source code and integrate it with other languages via a bridge.

Information retrieval

Published in Catherine Dawson, A–Z of Digital Research Methods, 2019

Catherine Dawson

There is a wide variety of digital tools and software packages available for information retrieval. Some of these are open source and free to use, whereas others have significant costs attached. Tools and software vary enormously, and your choice will depend on your research topic, needs and purpose. The following list, in alphabetical order, provides a snapshot of what is available at the time of writing: Apache Lucene (http://lucene.apache.org/core); Cheshire (http://cheshire.berkeley.edu); Elasticsearch (www.elastic.co/products/elasticsearch); Hibernate Search (http://hibernate.org/search); Lemur (www.lemurproject.org/lemur); OpenText™ Search Server (www.opentext.com); Solr (http://lucene.apache.org/solr); Terrier (http://terrier.org); UltraSearch (www.jam-software.com/ultrasearch); Windows Search (https://docs.microsoft.com/en-us/windows/desktop/search/windows-search); Xapian (https://xapian.org); Zettair (www.seg.rmit.edu.au/zettair).

Towards intelligent geospatial data discovery: a machine learning framework for search ranking

Published in International Journal of Digital Earth, 2018

Yongyao Jiang, Yun Li, Chaowei Yang, Fei Hu, Edward M. Armstrong, Thomas Huang, David Moroni, Lewis J. McGibbney, Christopher J. Finch

Some authors consider that the core search functionality of most existing geospatial data portals is powered by Apache Lucene, an open-source information retrieval library, or by products built upon Lucene such as Apache Solr or Elasticsearch (Li, Goodchild, and Raskin 2014). For example, NOAA’s OneStop project is based on Elasticsearch, and the search engine of PO.DAAC is developed using Solr. Lucene-based techniques use the Boolean model to find matching documents (e.g. data) and various similarity algorithms to calculate relevance (Gormley and Tong 2015). The practical scoring function is one of the most widely used of these similarity algorithms, and its formula is described in the Appendix. Relying solely on the practical scoring function is insufficient for discovering the most applicable dataset out of a vast range of available geospatial datasets, because it considers only text content and ignores domain knowledge (e.g. spatial resolution and processing level). Therefore, two questions need to be answered in order to address the ranking challenge of geospatial data discovery: (1) What features can represent users’ search preferences for geospatial data? (2) How can the ranking reach a balance among all these features?
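The two-stage idea in the passage above (Boolean matching first, then a similarity score for ranking) can be illustrated with a small pure-Python sketch. This is not Lucene code and it is not the practical scoring function from the Appendix; the scoring here is a deliberately simplified TF-IDF stand-in, and the sample documents and the names `boolean_match`, `tf_idf_score` and `search` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical dataset descriptions, used only for illustration.
docs = {
    "d1": "sea surface temperature daily global",
    "d2": "sea ice concentration monthly arctic",
    "d3": "ocean surface wind speed surface roughness",
}

def tokenize(text):
    return text.lower().split()

def boolean_match(query_terms, tokens):
    # Boolean AND: every query term must occur in the document.
    return all(term in tokens for term in query_terms)

def tf_idf_score(query_terms, tokens, doc_freq, n_docs):
    # Simplified stand-in for a similarity algorithm:
    # raw term frequency times a smoothed inverse document frequency.
    counts = Counter(tokens)
    return sum(
        counts[term] * math.log(1 + n_docs / doc_freq[term])
        for term in query_terms
    )

def search(query, documents):
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    query_terms = tokenize(query)
    results = [
        (doc_id, tf_idf_score(query_terms, tokens, doc_freq, len(documents)))
        for doc_id, tokens in tokenized.items()
        if boolean_match(query_terms, tokens)  # stage 1: Boolean filter
    ]
    # Stage 2: rank the matching documents by score, highest first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

for doc_id, score in search("surface", docs):
    print(doc_id, round(score, 3))
```

The score depends only on the text of each description, which is the limitation the passage points out: attributes such as spatial resolution or processing level never enter the ranking.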

A Comparison of Lucene Search Queries Evolved as Text Classifiers

Published in Applied Artificial Intelligence, 2018

Laurence Hirsch, Teresa Brunsdon

Systems using methods based on Darwinian evolution are generally computationally intensive. In our case, each individual in the population will produce a search query for each category of the dataset, and the fitness is evaluated by applying the search query to a potentially large set of text documents. With a population of a reasonable size (for example, 1024 individuals) evolving over 100 or more generations, it is critical that such queries can be executed in a timely and efficient manner. For this reason, we decided to use Apache Lucene, which is an open source, high-performance, full-featured text search engine. We use Lucene to build inverted indexes on the text datasets and to execute the queries produced by the genetic algorithm (GA).
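To make the workflow above concrete, here is a minimal pure-Python sketch of an inverted index, a query executed against it, and a fitness value computed from the result. It does not use Lucene, and several details are assumptions made for illustration: the query is a simple OR of terms, the fitness measure is F1, and the documents, labels and function names are hypothetical. The excerpt does not state which query form or fitness measure the authors used.

```python
from collections import defaultdict

# Hypothetical labelled documents: ids 1 and 3 belong to a "finance" category.
docs = {
    1: "stocks fell as markets closed",
    2: "team wins cup final",
    3: "markets rally on earnings",
    4: "coach praises team",
}
finance_docs = {1, 3}

def build_inverted_index(documents):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def run_or_query(index, terms):
    # Assumed query form: a document matches if it contains any of the terms.
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits

def f1_fitness(retrieved, relevant):
    # Assumed fitness: F1 of the retrieved set against the category's documents.
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

index = build_inverted_index(docs)

# Two candidate queries, as two individuals in a population might produce.
for candidate in (["markets"], ["markets", "team"]):
    hits = run_or_query(index, candidate)
    print(candidate, sorted(hits), round(f1_fitness(hits, finance_docs), 3))
```

Because the index is built once and each query only looks up posting sets, evaluating many candidate queries per generation stays cheap, which is the efficiency concern the passage raises.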