Session Abstracts | Day 1

Track Sessions



Indexing Wikipedia as a Benchmark of Single Machine Performance Limits

Presented by Paddy Mullen, Independent Contractor

This talk walks through using the wikipedia_Solr and wikipedia_elasticsearch repositories to quickly get up to speed with search at scale. When choosing a search solution, a common question is "Can this architecture handle my volume of data?" Figuring out how to answer that question without integrating with your existing document store saves a lot of time. If your document corpus is similar to Wikipedia's, you can save further time by using wikipedia_Solr/wikipedia_elasticsearch as comparison points.

Wikipedia is a great source for a tutorial such as mine because of its familiarity and free availability. The uncompressed Wikipedia data dump I used was 33GB and contained 12M documents. The documents can be further split into paragraphs and links to test search over a large number of small items. To add extra scale, prior revisions can be used, bringing the corpus size into terabytes.
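
As a rough illustration of the numbers above, and of splitting an article into smaller searchable items, here is a minimal Python sketch. It assumes decimal gigabytes, blank-line-separated paragraphs, and wikitext links of the form [[Target|label]]; the helper name split_article is hypothetical and is not taken from the wikipedia_Solr or wikipedia_elasticsearch repositories.

```python
import re

# Figures quoted in the abstract; decimal gigabytes are an assumption.
DUMP_BYTES = 33 * 10**9
DOC_COUNT = 12 * 10**6

# Average uncompressed size per document, in bytes.
AVG_DOC_BYTES = DUMP_BYTES / DOC_COUNT

# Matches the target of a wikitext link such as [[Target]] or [[Target|label]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def split_article(text):
    """Split one article into paragraphs and link targets (hypothetical helper)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    links = [m.strip() for m in LINK_RE.findall(text)]
    return paragraphs, links
```

Under these assumptions the average document is a few kilobytes, which is why splitting into paragraphs and links quickly yields a very large number of small items.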

 


Using Solr/Lucene to Surface the Big Data of Social Media

Presented by Glenn Engstrand, Zoosk, Inc

Although you need Big Data to effectively implement a large-scale social media solution, Hadoop is not always the right tool. This implementation description details how Zoosk is using Solr/Lucene as a NoSQL solution to meet the near real-time Big Data needs of a social news feed in its evolution into a Romantic Social Network.

 


Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences

Presented by Mark Davis, Kitenga

Kitenga's Analyst system uses the LucidWorks Enterprise REST API in a variety of ways, including for configuring collections and managing Solr schema. As part of the Kitenga platform, the ZettaSearch Designer empowers the end-user to dynamically drag-and-drop search widgets to create a specialized search interface. For users to effectively design search UIs that meet their needs, they need to be able to understand the available schema fields that populate a given collection. ZettaSearch Designer interrogates the Solr infrastructure using the Lucid REST API to provide an overview of the available metadata. It is then easy for the user to build rich, faceted search experiences around the metadata library indexed into the collection. In this implementation overview, I will describe the design of ZettaSearch Designer, how it interacts with big data technologies like Hadoop as part of the indexing pipeline, and how it uses the LucidWorks API to enable user discovery of the metadata needed to create novel search user interfaces on the fly.

 


How SolrCloud Changes the User Experience in a Sharded Environment

Presented by Erick Erickson, Lucid Imagination

The next major release of Solr (4.0) will include "SolrCloud", which provides new distributed capabilities for both in-house and externally-hosted Solr installations. Among the new capabilities are: Automatic Distributed Indexing, High Availability and Failover, Near Real Time searching and Fault Tolerance. This talk will focus, at a high level, on how these new capabilities impact the design of Solr-based search applications primarily from infrastructure and operational perspectives.


Using Solr/Lucene to Build CiteSeerX and Friends

Presented by C. Lee Giles, Pennsylvania State University

Cyberinfrastructure or e-science has become crucial in many areas of science as data access often defines scientific progress. Open source systems have greatly facilitated the design and implementation of supporting cyberinfrastructure. However, there exists no open source integrated system for building a search engine and digital library that focuses on all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. We propose the open source SeerSuite architecture, which is a modular, extensible system built on successful open source projects such as Lucene/Solr, and discuss its uses in building enterprise search and cyberinfrastructure for the sciences and academia. We highlight application domains with examples of specialized search engines that we have built for computer science (CiteSeerX), chemistry (ChemXSeer), archaeology (ArchSeer), acknowledgements (AckSeer), reference recommendation (RefSeer), collaboration recommendation (CollabSeer), and others, all using Solr/Lucene. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual information based feature selection, sequence mining, etc., are critical for performance.


Grouping and Joining in Lucene / Solr

Presented by Martijn van Groningen, SearchWorkings

In the real world data isn’t flat. Data is often modelled into complex models. Lucene is document oriented and doesn’t support relations natively. Until recently, the only way to index this data was to de-normalize the relations into a document with many fields and execute subsequent queries. Subsequent queries can be expensive and data gets duplicated. This isn’t always ideal. Recent versions of Solr and Lucene provide features that allow you to join and group. You can join and group on fields across documents and still have the power of Lucene’s awesome free text search. In this presentation, we’ll look at these new alternatives, their advantages and disadvantages, and how these features can be utilized.
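
The join and grouping features mentioned above are exposed in Solr as request parameters. The Python sketch below only builds the query strings and does not contact a server. The field names (parent_id, id, category) and the query text are hypothetical, and the {!join} parser and group parameters are written as they appear in Solr 4.0, so check the reference documentation for your version.

```python
from urllib.parse import urlencode


def join_query(from_field, to_field, inner_query):
    """Build params for a Solr join: match inner_query, then follow from_field -> to_field."""
    return {"q": "{!join from=%s to=%s}%s" % (from_field, to_field, inner_query)}


def group_query(query, group_field, limit=3):
    """Build params for result grouping: up to `limit` docs per distinct field value."""
    return {
        "q": query,
        "group": "true",
        "group.field": group_field,
        "group.limit": str(limit),
    }


# Hypothetical field names, for illustration only.
join_params = join_query("parent_id", "id", "comment_text:lucene")
group_params = group_query("title:solr", "category")

# URL-encoded query strings, ready to append to a /select request.
join_qs = urlencode(join_params)
group_qs = urlencode(group_params)
```

The join form returns the documents whose id matches the parent_id of a matching comment, which avoids copying comment text into every parent document.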


Building Query Auto-Completion Systems with Lucene 4.0

Presented by Sudarshan Gaikaiwari, Yelp

Query auto completion (often called suggest) is like magic for users! Seeing suggestions in the query box as soon as a user starts typing their queries dramatically changes the experience. Integrating query suggestions with a search system, however, is not an easy task. We will discuss different types of suggest systems and the most important criteria that a suggest system must satisfy. We will also look at how suggest impacts different search quality metrics as well as define metrics to evaluate the suggest system itself. Finally, we will look at implementing our own suggest system using the classes provided by Lucene, such as WFSTCompletionLookup.
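
To make the idea concrete, here is a minimal Python sketch of weighted prefix completion. It is a sorted-list stand-in, not Lucene's FST-based WFSTCompletionLookup; the class name PrefixSuggester and the sample terms and weights are hypothetical.

```python
import bisect
import heapq


class PrefixSuggester:
    """Toy weighted prefix suggester backed by a sorted list of terms."""

    def __init__(self, weighted_terms):
        # weighted_terms: iterable of (term, weight); higher weight ranks first.
        self._weights = dict(weighted_terms)
        self._terms = sorted(self._weights)

    def lookup(self, prefix, num=5):
        """Return up to `num` terms starting with `prefix`, heaviest first."""
        # All terms sharing the prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._terms, prefix)
        matches = []
        for term in self._terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        # Rank by weight descending, then alphabetically to break ties.
        return heapq.nsmallest(num, matches, key=lambda t: (-self._weights[t], t))


# Hypothetical sample data.
suggester = PrefixSuggester(
    [("pizza", 50), ("pilates", 20), ("pizzeria", 30), ("pho", 40)]
)
top_two = suggester.lookup("pi", 2)
```

A production suggester would also need to handle typos, analysis (case, accents), and latency budgets, which is where purpose-built structures such as FSTs come in.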


NetDocuments - Journey from FAST to Solr

Presented by David Hamson & Mou Nandi, NetDocuments

NetDocuments, a SaaS document management company, is migrating its large document repository from Microsoft FAST to Solr. During this presentation, the speakers will discuss the entire process, including major decision points and lessons learned. The migration is a two-phase implementation: the first phase is a shortcut that moves the FAST XML data directly to Solr to make a Solr meta-data index available quickly, and the second phase implements the full architecture, including both meta-data and full text processing and search. The presenters will talk about architecting Solr to meet the company's requirements of scaling to billions of work-product documents, low indexing latency, and high availability. NetDocuments uses the search engine to build the user experience and also for document discovery by users. Solr was architected to scale and perform in order to address these two very different needs and also to match all the features and functionality available with FAST. Finally, the presenters will share the benchmark results from tests run on various hardware configurations and on different file systems, and also share results from search quality testing as the capabilities of Solr were tested on a single server, with both a single Solr core and multiple Solr cores.


Hydra - Introducing an Open Source Document Processing Framework

Presented by Joel Westberg, Findwise AB

This presentation will detail the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, outlining that there is a missing link in the open source chain to bridge the gap between the source system and the search engine, then move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands on flexibility, robustness and ease of use. This session will end by discussing some of the possibilities that this new pipeline framework can offer, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.


Automata Invasion

Presented by Robert Muir and Michael McCandless, Lucid, IBM

Finite-state technology, including automata and weighted finite state transducers (wFSTs), provides compact data structures well suited to text processing and searching applications. Low level support for both automata and wFSTs is now available in Lucene and has recently enabled a number of surprisingly powerful improvements. In this joint talk, Robert Muir and Michael McCandless will provide an overview of finite-state technology and then describe how it's used today in Lucene: synonym filtering, fuzzy queries, respelling/suggesting, terms dictionary, in-memory postings format (MemoryPostingsFormat) and Japanese analysis (Kuromoji analyzer).


Integrating the Lucene Search Engine into a Transactional XML Database

Presented by Petr Pleshachkov, EMC

In this talk we will present an integration of the Lucene search engine with the EMC Documentum xDB database (a native XML database). We will introduce a new approach implemented in xDB 10.3 which integrates the Lucene index (used for XQuery query optimization) into the transactional xDB engine at the storage level. That is, Lucene files are stored in xDB data pages instead of the file system as in earlier releases, and Lucene accesses all the files through the xDB buffer pool instead of just the operating system buffer cache. This approach allows us to simplify the implementation of traditional database features for Lucene within xDB, such as transaction isolation, rollbacks, recovery after database crashes, snapshot construction, replication, hot backups, buffer management, etc. We cover performance analysis of the new approach for queries and ingest operations, performance tuning tips and future optimization techniques in the area. The presentation is intended as a description of an implementation and performance analysis.


Delivering on the Promise of Big Data at the "Tactical Edge"

Presented by Wes Caldwell, Chief Architect, ISS, Inc.

In U.S. military operations worldwide, the intelligence information enterprise consists of hundreds of data sources and feeds. Mostly unstructured, the collection of personal interactions, millions of documents and rich media gets indexed into relational and non-relational data stores. The ability to search, discover, and correlate information quickly is critical to "connecting the dots" between disparate sources to make the right decisions. We will describe some real-world use cases, where ISS-built capabilities (powered by Solr) enable time-critical analysis of routing through hostile urban areas, and how advanced analytical techniques in Social Network Analysis (SNA) and Text Analytics are being applied to identify complex associations between various entities. This discussion will provide a window into the world of how search and analytics are making a real difference in U.S. military operations.

 


Japanese Linguistics in Lucene and Solr

Presented by Christian Moen, Founder and CEO, Atilika Inc.

This talk gives an introduction to searching Japanese text and an overview of the new Japanese search features available out-of-the-box in Lucene and Solr.

Atilika developed a new Japanese morphological analyzer (Kuromoji) in 2010 when they couldn't find any easy-to-use, high-quality morphological analyzer in Java that was good for both search and other Japanese NLP tasks.  Kuromoji was built with the goal of donating it to the Apache Software Foundation in order to make Japanese work well for both Lucene and Solr, and is now a standard part of these software packages.

 


Big Search with Big Data Principles

Presented by Eric Pugh, Principal, OpenSource Connections

Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data, and how to apply them to Solr. This talk isn't a deep dive into SolrCloud, though we'll talk about it. It also isn't meant to be a talk on traditional scaling of Solr. Instead we'll talk about how to apply principles of big data like "Bring the code to the data, not the data to the code" to Solr. We'll cover how to answer the question "How many servers will I need?" when your volume of data is exploding, and show some examples of models for predicting server and data growth, and how to look back and see how good your models are! You'll leave this session armed with an understanding of why Big Data is the buzzword of the year, and how you can apply some of the principles to your own search environment.

 


Search is Not Enough: Using Solr for Analytics

Presented by Steve Kearns, Basis Technology

Search is everywhere, and it is a crucially important capability in any enterprise, application, or website. However, an increasingly sophisticated user base expects their search engine to bring them more than just document hits - they want the facts, answers, and context that connect the results with their workflow. In this talk, Steve Kearns will discuss and demonstrate how the combination of structured data, text analytics on unstructured data, and Solr can be used to power advanced analytics applications at scale. This includes integrating text analytics components into Solr, adjustments to the Solr Schema, as well as UI-level changes that support the integration of structured and unstructured data from several sources.

 


"Stump The Chump": Get On The Spot Solutions To Your Real Life Solr/Lucene Challenges

Presented by Chris Hostetter, Lucid Imagination

Got a tough problem with your Solr or Lucene application? Facing challenges that you'd like some advice on? Looking for new approaches to overcome a Lucene/Solr issue? Not sure how to get the results you expected? Don't know where to get started? Then this session is for you.

Now, you can get your questions answered live, in front of an audience of hundreds of Lucene Revolution attendees! Back again by popular demand, "Stump the Chump" at Lucene Revolution 2012 puts Chris Hostetter (aka Hoss) in the hot seat to tackle questions live.

All you need to do is send in your questions to us here at stump@lucenerevolution.com. You can ask anything you like, but consider topics in areas like:

  1. Data modelling
  2. Query parsing
  3. Tricky faceting
  4. Text analysis
  5. Scalability

You can email your questions to stump@lucenerevolution.com. Please describe in detail the challenge you have faced and any approach you have taken to solve the problem. Anything related to Solr/Lucene is fair game.

Our moderator, Erik Hatcher, will read the questions, and Hoss will have to formulate a solution on the spot. A panel of judges (Erick Erickson, Eric Pugh, and Grant Ingersoll) will decide if he has provided an effective answer. Prizes will be awarded by the panel for the best question - and for those deemed to have "stumped the chump".

 





Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation.