Solr: data indexing for fun and profit

Solr is not very friendly to novice users. Preparing a good schema file requires some experience. Assuming that the configuration files are ready, what remains is to feed our data to the search server and to keep it up to date.

There are a few ways to import data:

Update Handler
CSV Request Handler
Data Import Handler
Extracting Request Handler (Solr Cell)
Client libraries (for example Solrj)
Apache Connector Framework (formerly Lucene Connector Framework)
Apache Nutch

In addition to the methods mentioned above, you can stream your data to the search server. As you can see, there is some confusion here, and it is hard to tell at first glance which method is best in a particular case.

Update Handler

Perhaps the most popular method because of its simplicity. It requires preparing a corresponding XML file, which you then send to the Solr server via HTTP. It allows boosting of whole documents and of individual fields.
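To make the XML format more concrete, here is a minimal Python sketch that builds an add message with a document boost and a field boost. The function name, the input layout and the sample field names (id, title) are my own assumptions for illustration. The boost attributes follow the XML update format of the Solr versions this article is about; newer releases may ignore or reject them.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of document dicts.

    Each dict has "fields" (name -> value, or name -> (value, boost))
    and an optional "boost" for the whole document.
    This input layout is an assumption made for this sketch.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        if "boost" in doc:
            doc_el.set("boost", str(doc["boost"]))
        for name, value in doc["fields"].items():
            boost = None
            if isinstance(value, tuple):
                value, boost = value
            field_el = ET.SubElement(doc_el, "field", name=name)
            if boost is not None:
                field_el.set("boost", str(boost))
            field_el.text = str(value)
    return ET.tostring(add, encoding="unicode")


# One document boosted as a whole, with an extra boost on its title field.
xml_body = build_add_xml([
    {"boost": 2.5, "fields": {"id": "doc-1", "title": ("Solr indexing", 2.0)}},
])
```

The resulting string would be sent with an HTTP POST to the update handler (usually mapped to /update) with a text/xml content type, followed by a commit so that the changes become visible.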

CSV Request Handler

When we have data in CSV format (Comma Separated Values) or in TSV format (Tab Separated Values), this option may be the most convenient. Unfortunately, in contrast to the Update Handler, it is not possible to boost documents or fields.
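A rough sketch of what preparing such a request could look like in Python. The helper names, the sample fields and the host are assumptions. The /update/csv path is the one typically registered in the example solrconfig.xml and may differ in your setup; separator and commit are parameters of the CSV handler.

```python
import csv
import io
from urllib.parse import urlencode


def rows_to_csv(fieldnames, rows, delimiter=","):
    """Serialize dict rows to CSV text with a header line of field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def csv_update_url(base_url, delimiter=",", commit=True):
    """Build the URL the CSV body would be POSTed to (path is an assumption)."""
    params = {"separator": delimiter, "commit": "true" if commit else "false"}
    return base_url.rstrip("/") + "/update/csv?" + urlencode(params)


body = rows_to_csv(["id", "title"], [{"id": "1", "title": "Hello, Solr"}])
tsv_url = csv_update_url("http://localhost:8983/solr", delimiter="\t")
```

For TSV data the only change is the delimiter, which ends up URL-encoded in the separator parameter.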

Data Import Handler

This method is less common and requires additional, sometimes quite complicated configuration, but it allows direct linking to the data source. With DIH we do not need any additional scripts to export data from the source into the format required by Solr. What we get out of the box is: integration with databases (based on JDBC), integration with sources available in XML (for example RSS), e-mail integration (via the IMAP protocol) and integration with documents which can be parsed by Apache Tika (like OpenOffice documents, Microsoft Word, RTF, HTML, and many, many more). In addition, it is possible to develop your own sources and transformations.
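Once DIH is configured, an import is triggered with a plain HTTP request. The sketch below only builds such a request URL. The helper name and the host are assumptions, and /dataimport is merely the conventional handler name, since the real path is whatever you registered in solrconfig.xml.

```python
from urllib.parse import urlencode

# Commands understood by the Data Import Handler.
DIH_COMMANDS = ("full-import", "delta-import", "status", "reload-config", "abort")


def dih_command_url(base_url, command, handler="/dataimport", **options):
    """Build a DIH command URL; boolean options become "true"/"false"."""
    if command not in DIH_COMMANDS:
        raise ValueError("unknown DIH command: %r" % command)
    params = [("command", command)]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        params.append((name, str(value)))
    return base_url.rstrip("/") + handler + "?" + urlencode(params)


# Full import that keeps the existing index content and commits at the end.
url = dih_command_url(
    "http://localhost:8983/solr", "full-import", clean=False, commit=True
)
```

Requesting the status command against the same handler is a simple way to check whether an import is still running.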

Extracting Request Handler (Solr Cell)

A specialized handler for indexing the content of documents stored in files of different formats. The list of supported formats is quite extensive, and the parsing is performed by Apache Tika. The drawback of this method is that you have to build an additional solution that tells Solr about each document and its identifier, and that support for additional metadata, external to the document, is limited.
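The "additional solution" usually boils down to a small script that sends each file together with its identifier. A minimal sketch, assuming the handler is registered under /update/extract and that the unique key field is called id (both depend on your configuration). The function name and the sample values are hypothetical. The identifier travels in a literal.id request parameter.

```python
import urllib.request
from urllib.parse import urlencode


def extract_request(base_url, doc_id, file_bytes, content_type, commit=True):
    """Prepare (but do not send) a POST request for Solr Cell."""
    params = [("literal.id", doc_id)]  # identifier supplied from outside the file
    if commit:
        params.append(("commit", "true"))
    url = base_url.rstrip("/") + "/update/extract?" + urlencode(params)
    return urllib.request.Request(
        url,
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for the content of a real PDF file.
request = extract_request(
    "http://localhost:8983/solr", "report-2010", b"%PDF-1.4 ...", "application/pdf"
)
```

Passing the prepared request to urllib.request.urlopen would actually send the file to a running Solr instance.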

Client Libraries

Solr provides client libraries for many programming languages. Their capabilities differ, but if the data is generated on the fly by the application and has to become searchable very shortly afterwards, this way of indexing is often the only available option.
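To show the idea without tying it to any particular library, here is a deliberately tiny, hypothetical client (TinySolrClient is not a real package). It wraps a single document in an add message and uses the commitWithin attribute to ask Solr to make it searchable within a given number of milliseconds. Real client libraries such as Solrj offer far more than this.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


class TinySolrClient:
    """Hypothetical minimal client, written only for this illustration."""

    def __init__(self, base_url):
        self.update_url = base_url.rstrip("/") + "/update"

    def add_request(self, doc, commit_within_ms=1000):
        """Prepare a POST that adds one document (a dict of field -> value)."""
        fields = "".join(
            "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            for name, value in doc.items()
        )
        body = '<add commitWithin="%d"><doc>%s</doc></add>' % (
            commit_within_ms,
            fields,
        )
        return urllib.request.Request(
            self.update_url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
            method="POST",
        )

    def send(self, request):
        """Send a prepared request; needs a running Solr server."""
        with urllib.request.urlopen(request) as response:
            return response.status


client = TinySolrClient("http://localhost:8983/solr/")
req = client.add_request({"id": "7", "title": "a < b"}, commit_within_ms=500)
```

The application would call add_request and send right where the data is created, so there is no separate export step between the application and the index.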

Apache Connector Framework

ACF is a relatively new project, which was revealed to a wider audience in early 2010. It started as an internal project at the company MetaCarta, was donated to the open source community, and is currently being developed within the Apache Incubator. The idea is to build a system that connects to data sources with the help of a series of plug-ins. At the moment there is no released version, but the system is already worth a look if you need to integrate with systems such as: FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Patriarch (Memex), Meridio (Autonomy), Windows shares (Microsoft) and SharePoint (Microsoft).

Apache Nutch

Nutch is in fact a separate project run by Apache (previously under Apache Lucene, now a top-level project). For someone using Solr, Nutch is interesting because it can crawl web pages and index them in Solr.

A word about streaming

Streaming means the ability to tell Solr where to fetch the data to be indexed from. This avoids unnecessary data transmission over the network if the data is on the same server as the indexer, and it avoids double data transmission (from the source to the importer and from the importer to Solr).
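In practice this is done with the stream.file and stream.url request parameters, so the request itself carries no data at all. A small sketch that builds such URLs follows. The helper name, the host and the file path are assumptions, and remote streaming has to be enabled in solrconfig.xml for these parameters to be honoured.

```python
from urllib.parse import urlencode


def streaming_update_url(base_url, handler, file_path=None, url=None,
                         content_type=None, commit=True):
    """Point a handler at a local file or a remote URL instead of a request body."""
    if (file_path is None) == (url is None):
        raise ValueError("pass exactly one of file_path or url")
    if file_path is not None:
        params = [("stream.file", file_path)]  # path as seen by the Solr server
    else:
        params = [("stream.url", url)]  # Solr fetches the data itself
    if content_type:
        params.append(("stream.contentType", content_type))
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + handler + "?" + urlencode(params)


# CSV file that already lives on the Solr server.
local = streaming_update_url(
    "http://localhost:8983/solr",
    "/update/csv",
    file_path="/data/books.csv",
    content_type="text/plain;charset=utf-8",
)
```

With stream.file the data never crosses the network. With stream.url it crosses it once, straight from the source to Solr.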

And a word about security

Solr, by design, is intended to be used in an architecture that assumes a safe environment. It is very important to control who can query Solr and how. While the returned data can easily be restricted by forcing the use of filters in the handler definition, with indexing it is not so easy. In particular, the most dangerous seems to be Solr Cell: it will not only allow reading any file Solr has access to (e.g. files with passwords), but will also provide a convenient way of searching in those files 😉

Other options

I tried to mention all the methods that do not require any additional work to make indexing work. The problem may be the definition of this additional work, because sometimes it is easier to write an additional plug-in than to fight through numerous configuration options and create a giant XML file. Therefore, the choice of methods was guided by my own judgment, which resulted in skipping some of them (like fetching data from web pages with Apache Droids or Heritrix, or solutions based on Open Pipeline or Open Pipe).

Certainly in this short article I managed to miss some interesting methods. If so, please comment, I'll be glad to update this entry 🙂

Solr.pl