Lucene/Solr in Action

State Decoded: Empowering The Masses with Open Source State Law Search

Presented by Douglas Turnbull, Search and Big Data Architect, OpenSource Connections

The law has traditionally been a topic dominated by an elite group of experts. Watch how State Decoded has used Apache Solr to transform the law from a scary, academic topic into a friendly resource that empowers everyone. This talk is a call to action for discovery and design to break open ivory towers of expertise by baking rich discovery into your UI and data structures.


Personalized Search on the Largest Flash Sale Site in America

Presented by Adrian Trenaman, Senior Software Engineer, Gilt Groupe

Gilt Groupe is an innovative online shopping destination offering its members special access to the most inspiring merchandise, culinary offerings, and experiences every day, many at insider prices. Every day, new merchandise is offered for sale at discounts of up to 70%. Sales start at 12 noon EST, resulting in an avalanche of hits to the site, so delivering a rich user experience requires substantial technical innovation.

Implementing search for a flash-sales business, where inventory is limited and changes rapidly as our sales go live to a stampede of members every noon, poses a number of technical challenges. For example, with small quantities of fast-moving inventory, we want to be sure that search results reflect only those products we still have available for sale. Personalizing search, where search listings may contain exclusive items that are available only to certain users, was another big challenge.

Gilt has built out keyword search using Scala, Play Framework and Apache Solr / Lucene. The solution, which involves less than 4,000 lines of code, comfortably provides search results to members in under 40ms. In this talk, we'll give a tour of the logical and physical architecture of the solution, the approach to schema definition for the search index, and how we use custom filters to perform personalization and enforce product availability windows. We'll discuss lessons learnt, and describe how we plan to adopt Solr to power sale, brand, category and search listings throughout all of Gilt's estate.
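
As a rough illustration of the filtering idea, the Python sketch below builds Solr request parameters that restrict results to items inside their sale window, still in stock, and visible to the requesting member. It uses ordinary Solr filter queries (fq) rather than the custom filters the talk describes. The field names (sale_start, sale_end, units_available, visible_to) and the group-token convention are assumptions made for this example, not Gilt's schema.

```python
def build_search_params(keywords, member_groups):
    """Build Solr query parameters for a personalized, availability-aware search.

    All field names are hypothetical. `member_groups` is assumed to hold
    simple alphanumeric tokens, so no query escaping is done here.
    """
    # Every member may see items marked "all", plus items for their own groups.
    visible_to = " OR ".join(["all"] + sorted(member_groups))
    return {
        "q": keywords,
        "fq": [
            "sale_start:[* TO NOW]",      # the sale has already started
            "sale_end:[NOW TO *]",        # the sale has not ended yet
            "units_available:[1 TO *]",   # at least one unit is left
            f"visible_to:({visible_to})", # personalization filter
        ],
    }


if __name__ == "__main__":
    for name, value in build_search_params("red dress", {"vip"}).items():
        print(name, value)
```

Keeping each constraint in its own fq entry lets Solr cache the filters independently of the keyword query, which is one reason a filter-based design suits this kind of workload.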


Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled

Presented by John Berryman, Search Architect, OpenSource Connections

In a recent project with the US Patent and Trademark Office, OpenSource Connections was asked to prototype the next generation of patent search using Solr and Lucene. An important aspect of this project was the implementation of BRS, a specialized search syntax used by patent examiners during the examination process.

In this fast-paced session we will relate our experiences and describe how we used a combination of Parboiled (a Parsing Expression Grammar [PEG] parser), Lucene Queries and SpanQueries, and an extension of Solr's QParserPlugin to build BRS search functionality in Solr. First we will characterize the patent search problem and then define the BRS syntax itself. We will then introduce the Parboiled parser and discuss various considerations that one must make when designing a syntax parser.

Following this we will describe the methodology used to implement the search functionality in Lucene/Solr. Finally, we will include an overview of our syntactic and semantic testing strategies. The audience will leave this session with an understanding of how Solr, Lucene, and Parboiled may be used to implement their own custom search parser.
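
To make the parse-then-build-queries idea concrete, here is a minimal Python sketch of a hand-written parser for a toy proximity syntax. It is not BRS and not Parboiled: the two operators (ADJ for adjacent and in order, NEAR for adjacent in either order) and the tuple output format are assumptions chosen for illustration. Each output node carries the slop and in-order settings that a Lucene SpanNearQuery would need.

```python
import re

# Hypothetical operators: name -> (slop, in_order), mirroring the two
# settings a Lucene SpanNearQuery takes.
OPERATORS = {"ADJ": (0, True), "NEAR": (0, False)}
TOKEN = re.compile(r"\s*(\w+)")


def tokenize(text):
    """Split the query into word tokens, rejecting anything else."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Parse `term (OP term)*` left to right into nested span_near tuples."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("empty query")
    node = ("term", tokens[0].lower())
    i = 1
    while i < len(tokens):
        operator = tokens[i]
        if operator not in OPERATORS or i + 1 >= len(tokens):
            raise ValueError(f"expected '<operator> <term>' at token {i}")
        slop, in_order = OPERATORS[operator]
        right = ("term", tokens[i + 1].lower())
        node = ("span_near", node, right, slop, in_order)
        i += 2
    return node


if __name__ == "__main__":
    print(parse("solar ADJ panel NEAR roof"))
```

Separating the syntax tree from query construction makes it possible to test the grammar (syntactic tests) independently of the search results the generated queries return (semantic tests).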


CMS Integration of Apache Solr - How we did it.

Presented by Ingo Renner, Software Engineer, Infield Design

TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and gaining traction in the U.S., too.

TYPO3 is a good example of how to integrate Solr with a CMS. The challenges we faced are typical of any CMS integration. We came up with solutions and ideas to these challenges and our hope is that they might be of help for other CMS integrations as well.

That includes content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and faceting, access restrictions, result presentation, and how to keep all these things flexible and re-usable for many different sites.

For all these things we used a couple of additional Apache projects, and we would like to show how we use them and how we contributed back to them while building our Solr integration.


Next Generation Electronic Medical Records and Search: A Test Implementation in Radiology

Presented by David Piraino, Chief Imaging Information Officer, Imaging Institute Cleveland Clinic, Cleveland Clinic
& Daniel Palmer, Chief Imaging Information Officer, Imaging Institute Cleveland Clinic, Cleveland Clinic

Most patient-specific medical information is document oriented, with varying amounts of associated metadata. Most patient medical information is textual and semi-structured. Electronic Medical Record (EMR) systems are not optimized to present this textual information to users in the most understandable ways. Present EMRs show information to the user only in a reverse-chronological, patient-specific manner. This talk describes the construction and use of Solr search technologies to provide relevant historical information at the point of care while interpreting radiology images.

Radiology reports over a 4-year period were extracted from our Radiology Information System (RIS) and passed through a text processing engine to extract the results, impression, exam description, location, history, and date. Fifteen cases reported during clinical practice were used as test cases to determine if "similar" historical cases were found. The results were evaluated by the number of searches that returned any result in less than 3 seconds and the number of cases that illustrated the questioned diagnosis in the top 10 results returned, as determined by a bone and joint radiologist. Methods to better optimize the search results were also reviewed.

An average of 7.8 out of the 10 highest rated reports showed a similar case highly related to the present case. The best search showed 10 out of 10 cases that were good examples, and the lowest-matching search showed 2 out of 10 cases that were good examples. The talk will highlight this specific use case and the issues and advances of using Solr search technology in medicine, with a focus on point-of-care applications.
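
The evaluation described above amounts to counting relevant results in the top 10 for each test search and averaging across searches. A minimal Python sketch of that calculation follows. The relevance judgments in it are made-up placeholders to show the arithmetic, not the study's data.

```python
def relevant_in_top_k(judgments, k=10):
    """Count how many of the first k results were judged relevant.

    `judgments` is a list of booleans in ranked order, one per result.
    """
    return sum(1 for is_relevant in judgments[:k] if is_relevant)


def mean_relevant_in_top_k(all_judgments, k=10):
    """Average the per-search relevant counts over all test searches."""
    counts = [relevant_in_top_k(judgments, k) for judgments in all_judgments]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Hypothetical judgments for three searches (not taken from the study).
    example = [
        [True] * 10,
        [True] * 2 + [False] * 8,
        [True] * 6 + [False] * 4,
    ]
    print(mean_relevant_in_top_k(example))
```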


Semantic Search in the Cloud

Presented by Roberto Masiero, Vice President ADP Innovation Lab, ADP

In this presentation we will cover ADP's Semantic Search strategy and implementation, from the use cases, to the design that supports semantic searches on a vast set of data, to crawling data from hundreds of data sources. We will also cover our architecture for scaling the search service in a multi-tenant SaaS environment.


Text Tagging with Finite State Transducers

Presented by David Smiley, Software Systems Engineer, Lead, MITRE

OpenSextant is an unstructured-text geotagger. A core component of OpenSextant is a general-purpose text tagger that scans a text document for substrings matching multi-word entries from a large dictionary. By harnessing Lucene’s state-of-the-art finite state transducer (FST) technology, the text tagger cut memory use by more than 40x compared with the estimate for a leading in-memory alternative. Lucene’s FSTs are elusive due to their technical complexity, but overcoming the learning curve can pay off handsomely.
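
The dictionary-matching task itself can be sketched in a few lines of Python. The example below uses a plain word-level trie built from nested dicts, which is not an FST and has none of its memory savings. It only illustrates what the tagger computes. The longest-match policy and the sample dictionary are assumptions for this sketch.

```python
END = object()  # marker key: a dictionary phrase ends at this trie node


def build_trie(phrases):
    """Build a word-level trie from dictionary phrases."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root


def tag(text, trie):
    """Return (start_word, end_word, phrase) for each longest match."""
    words = text.lower().split()
    matches = []
    i = 0
    while i < len(words):
        node, best, j = trie, None, i
        # Walk as far as the trie allows, remembering the longest full match.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best = (i, j, node[END])
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched phrase
        else:
            i += 1
    return matches


if __name__ == "__main__":
    names = build_trie(["New York", "New York City", "York"])
    print(tag("flights from new york city to york", names))
```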


Internalizing location services with GeoNames

Presented by John Marc Imbrescia, Senior Software Engineer, Etsy.com

Etsy recently chose to bring our location services in house. We used the open source GeoNames data set and built the tools we needed to use that data to allow members to select their location, show translations of place names, and to feed data into our search database for local, regional, and country based searches.

This talk will cover the implementation details and the decisions we made along the way: how we mapped places from our old data set to the GeoNames data; the internal tools we built, including a Solr core for location place-name autosuggest; the modifications to our Listings Search and Shop Search cores; and the different ways we use location-based search around the site, both distance based and region based, using GeoNames hierarchy data.

We will also discuss our choice to release some of the tools we built for this project as open source, as well as the non-search elements of the project (display, etc.), the tools we chose for them, and why.


Building a Near Real-time Search Engine and Analytics for logs using Solr

Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd

Consolidating and indexing logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Because log events are mostly small, from around 200 bytes to a few KB, they are even more difficult to handle: the smaller the log event, the greater the number of documents to index. In this session, we will discuss the challenges we faced and the solutions we developed to overcome them. The talk will cover the following items:

  • Methods to collect logs in real time.
  • How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds.
  • Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards.
  • How choosing a layer-based partition strategy helped us to bring down the search response times.
  • Log analysis and generation of analytics using Solr.
  • Design and architecture used to build the search platform.

CommerceSearch: Moving from FAST to Solr on ATG

Presented by Ricardo Merizalde, Software Development Manager, Backcountry.com

The intent of this presentation is to describe an implementation of an open source framework for e-commerce sites. The presentation will focus on Oracle ATG, but the approach can be extended to other platforms. First, we will give a brief introduction to what CommerceSearch is (an open source integration and framework for eCommerce sites). We will then review the challenges we had with FAST Impulse, the main search artifacts merchandisers can manage through CommerceSearch, how CommerceSearch integrates with ATG to deploy changes in near real time, and the CommerceSearch integration test framework and automated test framework (Selenium). Finally, we will summarize the benefits we got by moving from FAST to Solr.


Solr at Zvents, nearly 6 years later and still going strong

Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub

Zvents has been a user of Apache Solr since 2007, when the project was still very young. Since then, the team has made extensive use of its various features and most recently completed an overhaul of the search engine to Solr 4.0. We'll touch on a variety of development/operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor using NewRelic, as well as the extensive use of virtual machines to simplify node management. We’ll also talk about application-level details such as our unique federated search product and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.


Brahe - Mass scale flexible indexing

Presented by Ben Brown, Software Architect, Cerner Corporation

Our team made their first foray into Solr building out Chart Search, an offering on top of Cerner's primary EMR to help make search over a patient's chart smarter and easier. After bringing on over 100 client hospitals and indexing many tens of billions of clinical documents and discrete results we've (thankfully) learned a couple of things.

The traditional approach of hashing document IDs over many shards, with no easily accessible source of truth, doesn't make for a flexible index.
Learn the finer points of the strategy where we shifted our source of truth to HBase: how we deploy new indexes with the click of a button, take an existing index and expand the number of shards on the fly, and enable several other fancy features.
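
One way to see why hashed document IDs make an index inflexible is to check what happens to shard assignments when the shard count changes. The Python sketch below is an illustration under simple assumptions (an MD5 hash of the ID taken modulo the shard count, with synthetic document IDs), not a description of Cerner's system.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Assign a document to a shard by hashing its ID (hypothetical scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for doc_id in doc_ids
        if shard_for(doc_id, old_shards) != shard_for(doc_id, new_shards)
    )
    return moved / len(doc_ids)


# Synthetic IDs used only for this demonstration.
DOC_IDS = [f"doc-{i}" for i in range(10000)]

if __name__ == "__main__":
    print(f"moved when growing 4 -> 5 shards: {fraction_moved(DOC_IDS, 4, 5):.0%}")
```

With modulo hashing, going from 4 to 5 shards relocates most documents, so without a source of truth to re-index from, changing the shard count is painful. Keeping the source of truth in a separate store makes rebuilding or resharding a routine operation.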


Using Solr/Lucene to Build Advertising Systems

Presented by Hideharu Hatayama, Rakuten, Inc.

I want to talk about architecture patterns for Solr-centered ad systems and the practical knowledge we gained by operating such a system with high availability for years. These topics are also applicable to other systems, such as e-commerce or restaurant recommendation sites. Through the presentation, I aim to give beginners hints on how to design a high-performance system architecture using Solr, and how to manage and operate such systems while avoiding downtime.


Multi-faceted responsive search, autocomplete, feeds engine and logging

Presented by Remi Mikalsen, Search Engineer, The Norwegian Centre for ICT in Education

Learn how utdanning.no leverages open source technologies to deliver a blazing fast multi-faceted responsive search experience and a flexible and efficient feeds engine on top of Solr 3.6. Among the key open source projects that will be covered are Solr, Ajax-Solr, SolrPHPClient, Bootstrap, jQuery and Drupal. Notable highlights are ajaxified pivot facets, multiple-parent hierarchical facets, ajax autocomplete with edge-n-gram and grouping, integrating our search widgets on any external website, custom Solr logging and using Solr to deliver Atom feeds. utdanning.no is a governmental website that collects, normalizes and publishes study information related to secondary school and higher education in Norway. With 1.2 million visitors each year and 12,000 indexed documents, we focus on precise information and a high degree of usability for students, potential students and counselors.
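
For readers unfamiliar with edge n-grams, the Python sketch below shows the idea behind that style of autocomplete: each indexed token is expanded into its leading prefixes, so whatever the user has typed so far can be looked up as an exact key. In Solr this expansion is typically done at index time by an analyzer filter such as EdgeNGramFilterFactory. The in-memory index and sample titles here are assumptions for illustration only.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading prefixes of `token` between min_gram and max_gram long."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(titles, min_gram=1, max_gram=10):
    """Map every edge n-gram of every word to the titles containing that word."""
    index = defaultdict(set)
    for title in titles:
        for token in title.lower().split():
            for gram in edge_ngrams(token, min_gram, max_gram):
                index[gram].add(title)
    return index


def suggest(index, typed):
    """Look up what the user has typed so far as an exact key."""
    return sorted(index.get(typed.lower(), set()))


if __name__ == "__main__":
    # Hypothetical study titles, not real utdanning.no data.
    index = build_prefix_index(["Nursing", "Nuclear physics", "Physics teacher"])
    print(suggest(index, "nu"))
    print(suggest(index, "phy"))
```

The trade-off is a larger index in exchange for cheap lookups: prefix matching becomes an ordinary term match at query time.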


Implementing Search with Solr at 7digital

Presented by James Atherton, Search Team Lead, 7digital

A usage case study describing our journey as we implemented Lucene/Solr and the lessons we learned along the way. We will cover how we implemented our instant search and search suggest, and how we handle indexing 400 million tracks and metadata for over 40 countries, comprising over 300GB of data and about 70GB of indexes. Finally, we will look at where we hope to go in the future.


Rapid pruning of search space through hierarchical matching

Presented by Chandra Mouleeswaran, Co Chair at Intellifest.org, ThreatMetrix

This talk will present our experiences in using Lucene/Solr for the classification of user and device data. On a daily basis, ThreatMetrix, Inc., handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned. We will present details on introducing a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.


Beyond simple search – adding business value in the enterprise

Presented by Kathy Phillips, Enterprise Search Services Manager/VP, Wells Fargo & Co.
& Tom Lutmer, eBusiness Systems Consultant, Enterprise Search Services team, Wells Fargo & Co.

What is enterprise search? Is it a single search box that spans all enterprise resources or is it much more than that? Explore how enterprise search applications can move beyond simple keyword search to add unique business value. Attendees will learn about the benefits and challenges of different types of search applications such as site search, interactive search, search as business intelligence, and niche search applications. Join the discussion about the possibilities and future direction of new business applications within the enterprise.


Beyond TF-IDF: Why, What and How

Presented by Stephen Murtagh, Etsy.com, Inc.

TF-IDF (term frequency, inverse document frequency) is a standard method of weighting query terms for scoring documents, and is the method that is used by default in Solr/Lucene. Unfortunately, TF-IDF is really only a measure of rarity, not quality or usefulness. This means it would give more weight to a useless, rare term, such as a misspelling, than to a more useful, but more common, term.

In this presentation, we will discuss our experiences replacing Lucene's TF-IDF based scoring function with a more useful one using information gain, a standard machine-learning measure that combines frequency and specificity. Information gain is much more expensive to compute, however, so this requires periodically computing the term weights outside of Solr/Lucene and making the results accessible within Solr/Lucene.
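
The contrast between rarity and usefulness can be shown with a small Python sketch. It compares a textbook IDF (not Lucene's exact formula) with information gain computed against a binary document label. The label (for example, whether a listing was purchased), the toy corpus, and the idea of measuring gain against that particular label are all assumptions made for this example. The abstract does not say which signal the talk uses.

```python
import math


def entropy(positives, total):
    """Binary entropy in bits of a set with `positives` out of `total` labeled 1."""
    if total == 0 or positives == 0 or positives == total:
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def idf(term, docs):
    """Textbook inverse document frequency; the term must occur in the corpus."""
    doc_freq = sum(1 for terms, _ in docs if term in terms)
    return math.log(len(docs) / doc_freq)


def information_gain(term, docs):
    """Reduction in label entropy from knowing whether a document contains `term`."""
    total = len(docs)
    with_term = [label for terms, label in docs if term in terms]
    without_term = [label for terms, label in docs if term not in terms]
    before = entropy(sum(label for _, label in docs), total)
    after = (
        len(with_term) / total * entropy(sum(with_term), len(with_term))
        + len(without_term) / total * entropy(sum(without_term), len(without_term))
    )
    return before - after


# Hypothetical corpus: (terms in the document, 1 if the listing was purchased).
DOCS = [
    ({"silver", "ring"}, 1),
    ({"silver", "necklace"}, 1),
    ({"silver", "earrings"}, 1),
    ({"silver", "silvr", "bracelet"}, 1),
    ({"wooden", "ring"}, 0),
    ({"plastic", "necklace"}, 0),
    ({"paper", "earrings"}, 0),
    ({"glass", "bracelet"}, 0),
]

if __name__ == "__main__":
    for term in ("silver", "silvr"):
        print(term, round(idf(term, DOCS), 3), round(information_gain(term, DOCS), 3))
```

In this toy corpus the misspelling "silvr" gets the higher IDF because it is rare, while "silver" gets the higher information gain because it separates the two classes. Since the gain depends on corpus-wide label statistics, it is natural to compute it offline and load the resulting weights into the search engine, as the abstract describes.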


Concept Search for eCommerce with Solr

Presented by Mikhail Khludnev, eCommerce Search Platform, Grid Dynamics

This talk describes our experience in eCommerce search: the challenges we’ve faced and the approaches we chose. It’s not intended to be a full description of the implementation, because too many details would need to be covered. The talk is more of a problem statement and a description of general solutions, with a number of points for technical or even academic discussion. It focuses on the text search use case; structured (or scoped) search and faceted navigation are outside its scope.


Building big social network search system using Lucene

Presented by Aleksey Shevchuk, Lead developer, odnoklassniki.ru

We will explain how the search systems of the social network Odnoklassniki work. Each day 40 million people use Odnoklassniki to communicate and entertain themselves. These activities are hard to imagine without a proper search system. A dozen big indexes and thousands of small indexes respond to more than 4000 searches per second at peak times. Users can search within specific sections of the site or across the whole site. The search system decides which indexes should be queried and which results to show. To improve relevance we use information from the social graph and various activity statistics available for indexed entities. Query log analysis? Again Lucene!