Lucene Eurocon 2010

Following the Apache Software Foundation's announcement that it would stop organizing ApacheCon on the old continent, we Europeans were left with no Apache conference nearby. But as we all know, nature abhors a vacuum, and so Lucid Imagination, in cooperation with the sponsors, decided to organize the first conference dedicated to Lucene and Solr: Lucene Eurocon. Since we had the pleasure of attending, we decided to give you a short account of how it went.

The conference was divided into two parts: training and the conference proper. I won't write about the training because I did not take part in it. The training sessions were conducted by two committers of the Lucene/Solr project: Erik Hatcher and Grant Ingersoll. More information on the topics covered in that part can be found at http://lucene-eurocon.org/training.html.

The Beginning

The conference proper began on Thursday, May 20, with a welcome from Grant Ingersoll, followed by two presentations: “The Search Revolution – How Lucene & Solr Are Changing the World” by Lucid Imagination's CEO, Eric Gries (slides), and “From Publisher To Platform: How The Guardian Used Content, Search, and Open Source To Build a Powerful New Business Model” (slides), presented by Stephen Dunn, head of the strategy and technology department at The Guardian. Eric Gries focused on the increasing demand for information processing and information retrieval, and of course on the Lucene/Solr project, which can be used in both cases. Naturally, he also mentioned the growing demand for people with skills in both projects. I must say that I envy his ability to hold an audience's interest. I take my hat off to him. The second presentation, by Stephen Dunn, discussed The Guardian's new developer platform, “The Open Platform”, which gives access to a database of articles published by The Guardian since 1991 (millions of documents and still growing). He focused on the technical details of the implementation, which is based on the Solr search engine, and on the benefits of using the platform.

Then the conference split into two tracks. Since I could only be at one of two parallel presentations, I will describe only those I attended. For completeness, the full agenda is available at http://lucene-eurocon.org/agenda.html.

Tika and the future of Lucene

I decided to go with track one. It started with some interesting and at times detailed information about Tika in a presentation called “Text and Metadata Extraction with Apache Tika” (slides) by Jukka Zitting. He began with some background, the project's assumptions and a historical overview. Then came purely technical material, including code fragments showing how to use the framework. Overall, from a developer's perspective, I think it was a good presentation. Without any break we moved on to a second presentation, by Uwe Schindler and Simon Willnauer, under the intriguing title “Lucene Forecast: Version, Unicode, Flex and Modules” (slides). It gave me an enormous amount of information about the future of both Lucene and Solr. The presentation began by explaining the recent move in the Lucene and Solr projects: the development merge. From now on both projects share a single trunk and the same committers. Furthermore, the trunk is now where the newest versions of Lucene and Solr are developed. We also heard that version 4.0 of Lucene won't be backwards compatible. The rest was no less interesting, for example the plans to port the current faceting mechanism to Lucene. Much time was devoted to the changes associated with full Unicode support (the ICU module of Lucene), and then the speakers moved on to a topic I find very interesting, flexible indexing. I'll try to write more about ICU in the near future.

Below are the titles of the two presentations on the second track. Since I did not attend them, I cannot say anything more about them:

“Use of Solr at Trovit, A Leading Search Engine For Classified Ads”, led by Marc Sturlese
“Implementing Solr in Online Media As An Alternative to Commercial Search Products”, led by Bo Raun

Magic of post-processing

After the lunch break I decided not to change tracks and went to see a presentation titled “Munching and Crunching: Lucene Index Post-processing” (slides) by Andrzej Białecki, and having seen the whole thing I think it was the right choice. I won't write much about the topic here because, in my opinion, the amount of information deserves a separate post. He talked mainly about the possibilities of Lucene index post-processing, such as splitting, cleaning, filtering and storing the whole index in RAM. As I said, more on these topics in a separate entry. Meanwhile, on the second track, Peter Kiraly gave a presentation titled “Bringing Solr to Drupal: A General, and a Library-Specific Use Case” (slides).

Solr and NLP

I decided to leave the first track and the presentation “Solr Schema Demystified” by Uri Boness in favor of “Integration of Natural Language Processing tools with Solr” (slides) by Joan Codina-Filbà. I thought the topic would be interesting for anyone seeking to integrate Solr with linguistic analysis tools, that is, to store the results returned by these tools and use them. The material included the use of UIMA, the problems the developers encountered and how they were resolved. Joan also mentioned how they used Carrot2 in their systems. The presenter also showed the difference between a stem and a lemma, how they determined which part of speech a term is, and how this information was used to classify comments about products as having a positive or negative connotation. All of this was discussed in the context of Lucene and Solr, both with and without the use of payloads. It's a pity that it was discussed only in the context of the Spanish language, but hey, you can't have everything, right? 😉
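
To make the stem versus lemma distinction concrete, here is a minimal sketch in Python. It is not the tooling shown in the talk (which dealt with Spanish): the English examples, the suffix list and the tiny lemma dictionary are all made up for illustration, and a real system would use a proper stemmer and a morphological lexicon. The idea is only that a stem is what remains after mechanically stripping suffixes, while a lemma is the dictionary form of the word.

```python
# Toy illustration only: SUFFIXES and LEMMAS are hypothetical sample data,
# not taken from the presentation or from any real analyzer.
SUFFIXES = ("ing", "ies", "es", "ed", "s")

LEMMAS = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "went": "go",
}


def stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemma(word: str) -> str:
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ("studies", "studying", "better", "went"):
        print(w, "| stem:", stem(w), "| lemma:", lemma(w))
```

Note how the stemmer turns "studies" into the non-word "stud" and cannot relate "better" to "good", while the lemma lookup can. That difference matters when the analysis result is later used for classification.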

Document processing and pipeline

Since earlier that day I had heard a bit about “The Open Platform” built by The Guardian, I decided to skip the presentation titled “Solr in the Wild: The Guardian’s Open Content Platform API” by Graham Tackley (slides) and instead listen to a talk by Max Charas and Karl Jansson titled “Modular Document Processing for Solr/Lucene” (slides). In my opinion the presentation was more about the capabilities of the sponsor's products than anything very insightful, especially compared to the previous presentation. That was my impression afterwards, but maybe it was the result of information overload and general tiredness.

Solr in IBM

After another short break two further presentations began. Given the choice between “Make Your Domain Objects Searchable with Hibernate Search” (slides) by Gustavo Fernandes and “Social and Network Discovery (SaND) over Lucene” (slides) by Shai Erera, I chose the second one. In short, the presentation concentrated on an application created by IBM to search for information within the company: documents, people, parties, etc. The presenter showed how search was implemented (what can be found), how results can be narrowed (faceting, narrowing by date, location or source), and how relationships between individuals or documents are presented. In addition to search, we were shown a pretty interesting feature: the graph of relationships (slide 25).

From FAST to Solr

Already tired, I chose the last presentation of the day, “Key Topics When Migrating From FAST to Solr” (slides) by Jan Høydahl, over “Query by Document: When “More Like This” Is Insufficient” (slides) by Dusan Omercevic. It was interesting to hear about some solutions that have been implemented in FAST but are not in the Lucene/Solr projects. Jan Høydahl spoke quite engagingly, showed the differences between the two technologies and explained how to work around functionality missing on one side or the other. For someone with no commercial experience with FAST, like me, it was interesting to learn that most of the data processing in FAST happens at indexing time; for example, sorting must be defined at indexing time. Looking at FAST, you can see what is still missing in Solr, such as multilingual fields and a processing pipeline. In the later part of the presentation we were shown how to migrate from FAST to Solr, highly simplified of course, but very informative. Overall, a very good presentation in my opinion.

End of the first day

Then, after a 90-minute break, we had five short, less formal presentations. Given their nature, and the fact that our thoughts were already at the Czech Beer Festival, I decided only to list them:

“Social Media Scheduler based on Solr + Hadoop + Amazon EC2” (slides) led by Pablo Aragón
“Introduction to Collaborative Filtering using Mahout” (slides) led by Frank Scholten
“Enterprise Search meets Enterprise CMS – TYPO3 and Apache Solr” (slides) led by Olivier Dobberkau
“BM25 Scoring for Lucene – From Academia to Industry” (slides) led by Yuval Feinstein
“How We Scaled Solr to 3+ Billion Documents” (slides) led by Jason Rutherglen

Second day

As on the first day, the second day began with a short introduction by Grant Ingersoll.

So we start again

Just like the previous day, we started with two presentations: “Software Disruption: How Open Source, Search, Big Data and Cloud Technology are Disrupting IT” (slides) by Zack Urlocker and “Solr 1.5 and Beyond” (slides) by Yonik Seeley. The first presentation, with its marketing focus, didn't interest me much, although it did deliver interesting facts about the state of open source software development. There were also predictions about the direction of software development and its turn towards processing in the cloud. Zack Urlocker, as one of the fathers of MySQL's success, is certainly a speaker with wide knowledge of open source. The second presentation was, from my point of view, much more interesting because of its technical orientation. Yonik Seeley concentrated on several key themes for the future of Solr: the development merge, the extended Dismax parser, integration with Zookeeper (so-called Solr Cloud), spatial search, field collapsing and NRT (near real time) search. In my opinion it is worth keeping an eye on the changes in Solr, since the future of the project looks very bright, provided that everything mentioned during the presentation can be achieved; and we know that not everything was covered.

Grant Ingersoll about relevance

And again the conference split into two tracks. It seemed to me that I would benefit more from listening to Grant Ingersoll's talk “Practical Relevance” (slides) than from “Building Multilingual Search Based Applications” (slides) by Steve Kearns. I was not mistaken. Grant spoke about improving the quality of search results: how to do it, how to analyze logs, what can be drawn from search statistics and what to target during the quality improvement process. He stressed the need to collect feedback from users, because they ultimately decide the success or failure of the application. There were also details on how to quickly get satisfactory results by adding phrase boosting. Towards the end of the presentation, Grant talked about advanced methods of influencing relevance, for example by developing your own components responsible for computing a document's score. He also referred to the “Open Relevance” project, one of the projects in the Lucene ecosystem.
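
As a rough illustration of phrase boosting, one common way to get it in Solr is the DisMax query parser's `pf` (phrase fields) parameter, used next to `qf` (query fields). Whether this is exactly what was shown in the talk is an assumption on my part. The sketch below only builds the request parameters in Python; the field names `title` and `body` and the boost values are hypothetical.

```python
from urllib.parse import urlencode


def phrase_boost_params(user_query: str) -> dict:
    """Build DisMax request parameters that favor whole-phrase matches."""
    return {
        "defType": "dismax",
        "q": user_query,
        # Hypothetical fields that are searched term by term.
        "qf": "title body",
        # The same hypothetical fields, boosted when the whole query
        # appears in them as a phrase. Boost values are made up.
        "pf": "title^10 body^2",
    }


def to_query_string(params: dict) -> str:
    """Encode the parameters for use in a Solr select URL."""
    return urlencode(params)


if __name__ == "__main__":
    print(to_query_string(phrase_boost_params("apache solr")))
```

Documents that contain the query terms next to each other then score higher than documents that merely contain the same terms somewhere, without changing which documents match.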

Lucene Connectors Framework

I decided to stay in the same conference room and listen to Karl Wright, who spoke about the Lucene Connectors Framework in a presentation titled “Lucene Connectors Framework: An Introduction” (slides). At the same time conference participants could listen to Karel Braeckman presenting “Unlocking a Broadcaster Video Archive Using Solr” (slides). The LCF presentation mainly covered the assumptions and architecture of the framework. We learned that the framework is currently being migrated from a commercial to an open source form, so at this time a lot of functionality may not work, because some of it was based on commercial libraries, which for obvious reasons cannot be used in an open source project. Still, the framework will certainly be an interesting and useful tool for supplementing Solr and Lucene with capabilities such as reading from various data sources (both pull and push) and delivering data periodically. A nice thing about the Lucene Connectors Framework is its security model. In my opinion, at this point, the framework should be regarded as a curiosity, but I really hope this will change soon.

Solr + Zookeeper = Solr Cloud

After the lunch break and informal conversations with the committers of the Lucene/Solr projects, the afternoon session began. Rested and hungry for information, I went to see Mark Miller talk about “Solr in the Cloud” (slides). At the same time, in the next room, Tyler Tate and Stefan Olafsson talked about “The Path to Discovery: Facets and the Scent of Information” (slides). As you can guess from the title, Mark's presentation was about the so-called Solr Cloud, a distributed farm of Solr instances managed using the Zookeeper framework. The talk began with the master-slave architecture and what this architecture looks like when the index is distributed among many shards. Then we had a few words about index replication, from shell scripts to the new Java-based mechanism. Mark Miller then described the core assumptions behind the integration with Zookeeper: centralized configuration, fault tolerance, the ability to automatically add and remove Solr instances, and support for checking the status of each instance. He discussed what has been done so far on the Zookeeper integration, what still needs to be done, and what is planned for the future. In addition to matters related to the Zookeeper integration, Mark Miller presented new features in Solr 1.4, e.g. the LoadBalanced Solr Http Server. In conclusion, the presentation was very interesting and showed a further path for Solr's development.

A moment of grace

While we talked with other conference participants and made some contacts, two consecutive presentations took place:

“Neural Networks, Newspapers and Solr – A short tour through extending Solr for real-world text-classification” (slides), led by Sven Maurmann
“Rapid Prototyping” (slides), led by Erik Hatcher

Almost the end

The last presentation I watched at the conference was “Combating Information Overload – Search in Military Information gathering Systems” (slides) by Alexandra Larsson, Captain of the Swedish Air Force, but first I went to hear “European Language Analysis with Hunspell” by Chris Male. After a while I switched to the adjacent conference room to hear how the Swedish military uses Lucene and Solr in its systems. Because I didn't hear either presentation from beginning to end, I will not write anything more about them. I'll just say that Alexandra Larsson showed some nice examples of how the military is using Solr.

Summary

The last words at Lucene Eurocon 2010 were spoken by Grant Ingersoll. Some books and t-shirts were given away, and there was a lottery for a pass to Lucene Revolution in Boston, something fun for the audience 😉

To sum up the whole event, I’m happy that I could participate. There were interesting presentations and a sea of ideas and plans, and they confirmed to the community that Lucene and Solr are mature products that have not rested on their laurels: despite the growing interest in their present shape, there are people behind them who care, so the projects are not standing still. I hope that I will participate in Lucene Eurocon 2011.

Solr.pl