Lightning Talks
Download the Lucene Revolution e-Guide here. Slides posted as available.
TALK SUMMARIES
"HathiTrust: Indexing the full text of 8 million books with Solr"
Presented by Tom Burton-West, Information Retrieval Programmer | Hathi Trust Project/University of Michigan
HathiTrust is a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. The HathiTrust Digital Library is a digital preservation repository and highly functional access platform. The HathiTrust full-text search application uses Solr to provide full-text searching of 8 million books in about 400 languages. This talk will discuss some of the scalability challenges we have encountered, including: serving a distributed index of 6+ terabytes on 4 machines, large documents (1 MB of OCR), large indexes (350 GB), and large numbers of unique terms per index (over 2.4 billion). We will also talk briefly about our plans to improve the user experience when users search the full text of 8 million books.
Bio: Tom Burton-West is an Information Retrieval Programmer in the University of Michigan's Digital Library Production Service. He works on DLXS, the DLPS's Digital Library Software, and on the Solr-based Hathi Trust Large Scale Search project. He blogs about the Large Scale Search project at www.hathitrust.org/blogs.
"Improve Relevance by Using Morphology and Named Entity Recognition"
Presented by Christoph Goller, Director, Research | Intrafind Software AG
This talk will show how the relevance of search results can be improved by using morphology and named entity recognition. After briefly explaining the purpose of morphological analysis and of named entity recognition, we will analyze their potential advantages for search, faceting, and clustering of search results. Based on these ideas, we will briefly sketch how to implement a morphological analyzer in Lucene and how to implement a natural language question answering system based on Lucene using named entity recognition. The talk will be accompanied by a live demo of these ideas.
Bio: Christoph Goller has more than 10 years of experience in the search industry. He received a Ph.D. in computer science from the Technical University of Munich, where he worked on several research projects in artificial intelligence, machine learning and neural networks. Christoph started his career at Lernout & Hauspie. Since 2002 he has been Director of Research at Intrafind Software AG (www.intrafind.de), a German company specializing in full-text search and text mining based on Lucene and Solr. Christoph has been a Lucene committer since 2004. He has accompanied dozens of commercial projects using Lucene and Solr. Christoph is the author of more than 15 scientific papers, frequently gives presentations on search-related topics, and is responsible for partner training at Intrafind.
"Scientific Data Search in the Pharmaceutical Industry with Solr"
Presented by Jeffrey Guo, CEO | Semtific Software, Inc.
A tremendous amount of experimental information and scientific knowledge has been locked or lost in data silos, in the form of semi-structured or unstructured data, in today’s pharmaceutical industry. Out-of-the-box full-text search engines do not understand embedded scientific terms and objects or their relationships, so they cannot support context-sensitive and relevant searches. This presentation will discuss a successful implementation at a major pharmaceutical company that uses Solr as its enterprise search platform and enhances it with chemistry (molecular entities and reactions) search capabilities. The scope of the document indexing process is expanded to cover chemistry objects and terms embedded in documents, of various types such as common chemical names, corporate IDs, SMILES, and InChI. This enables scientifically aware search based on a drawn query structure or on chemical terms. Enterprise scientific search strategies and lessons learned will be discussed during the presentation.
Bio: Jeffrey Guo is the founder of Semtific Software, Inc., a company that provides products and services that streamline drug discovery workflow and enterprise search of scientific research data.
"Using Lucene's Test Framework"
Presented by Robert Muir | Lucid Imagination
The Lucene/Solr community takes testing seriously: we have a suite of over 3500 tests to ensure software quality. Over time we accumulated some useful extensions to JUnit testing, and several people found themselves using our extensions for other projects. We released this "test framework" for the first time in Lucene 3.1, and this talk is a short summary of its feature list to hopefully encourage you to go check it out for yourself. Find out how you can:
- Improve test coverage for custom Lucene components.
- Speed up your unit test suite by running tests in parallel.
- Find resource leaks, localization or timezone-sensitive bugs in your application.
- Use our extensions to make unit tests easier to write.
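One idea behind finding localization or timezone-sensitive bugs can be sketched as follows. This is a minimal illustration in Python, not the Lucene test framework's API: the helper names (`pick_environment`, `run_randomized`) and the locale and timezone lists are assumptions made up for this example. It shows a test run that picks its locale and timezone from a seed, so a failure can be replayed with the same seed.

```python
import random

# Hypothetical stand-ins for settings a test run might randomize.
LOCALES = ["en_US", "tr_TR", "de_DE", "ja_JP"]
TIMEZONES = ["UTC", "America/New_York", "Asia/Kolkata", "Australia/Lord_Howe"]


def pick_environment(seed):
    """Choose a locale and timezone deterministically from a seed."""
    rng = random.Random(seed)
    return rng.choice(LOCALES), rng.choice(TIMEZONES)


def run_randomized(test_fn, seed):
    """Run test_fn under a seeded environment; report the seed on failure."""
    locale_name, tz_name = pick_environment(seed)
    try:
        test_fn(locale_name, tz_name)
    except AssertionError as err:
        # Including the seed lets the same environment be reproduced later.
        raise AssertionError(
            f"failed with seed={seed} locale={locale_name} tz={tz_name}: {err}"
        ) from err
    return locale_name, tz_name
```

Running the same test function with many different seeds exercises it under many environments, while any single failure stays reproducible.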
Bio: Robert Muir, software engineer for Lucid Imagination, is a Lucene/Solr committer & PMC member.
"Using Apache Solr and Active Directory to unify data access across Intranet, ERP and Filesystem Cluster"
Presented by Robert Weißgraeber, Project Director | Lightwerk
Solr is tightly linked into all available data and business intelligence sources in the enterprise, indexing the TYPO3 CMS-based intranet, downloads, forms, handbooks, an Oxaion-based ERP database, and the filesystem cluster running Microsoft Distributed File System, with Tika used for full-text content extraction. All data is connected via Active Directory servers to user-based, fine-grained access control lists, which Solr evaluates in real time in early-binding mode. A worldwide Solr cluster using different shards gives additional security for worldwide deployment, e.g. by keeping confidential data inside the headquarters' own data centers.
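As a rough illustration of early-binding access control, the sketch below builds Solr request parameters that restrict a search to documents readable by the user's groups. It is not the implementation described in the talk. It assumes that each document was indexed with a multivalued field, here hypothetically named `acl_groups`, listing the groups allowed to read it, and that the user's group memberships have already been fetched from the directory. Only the standard Solr `q` and `fq` parameters are used.

```python
def quote_term(value):
    """Quote a value as a phrase, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_search_params(user_query, user_groups, acl_field="acl_groups"):
    """Return request parameters limited to documents the user may read.

    acl_field is a hypothetical multivalued field that stores, at index
    time, the groups allowed to read each document (early binding).
    """
    if not user_groups:
        # No memberships: use a filter intended to match no documents.
        fq = "-*:*"
    else:
        allowed = " OR ".join(quote_term(g) for g in sorted(user_groups))
        fq = f"{acl_field}:({allowed})"
    return {"q": user_query, "fq": fq}
```

Because the restriction travels as a filter query, it applies to every search for that user without altering relevance scoring of the main query.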
Bio: Robert Weißgraeber is Project Director at Lightwerk, specializing primarily in designing, planning and executing corporate portals.
"Thousands of Indexes in the Cloud"
Presented by Shaneal Manek, Lead search engineer | Greplin
Indexes at Greplin are strange: instead of having one giant index that is searched all the time and updated infrequently, there are thousands of relatively small indexes that are updated much more frequently than they are searched. These unorthodox requirements lead to an unorthodox architecture that uses techniques inspired by Zoie and Bobo. We will discuss techniques that allowed us to exploit the inherent shardability and access patterns of our data to build an extremely high-throughput information retrieval architecture. We will also examine some of the challenges and opportunities presented by running Lucene on Amazon's Elastic Compute Cloud.
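To make the idea of inherent shardability concrete, here is a minimal sketch of one common way to spread many small, independent indexes across a fixed number of shards with a stable hash. It is an assumption for illustration only and does not describe Greplin's actual scheme; the function names and the use of MD5 as the hash are choices made for this example.

```python
import hashlib


def shard_for(index_id, num_shards):
    """Map an index identifier to a shard number with a stable hash."""
    digest = hashlib.md5(index_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(index_ids, num_shards):
    """Group index identifiers by the shard that would host them."""
    shards = {n: [] for n in range(num_shards)}
    for index_id in index_ids:
        shards[shard_for(index_id, num_shards)].append(index_id)
    return shards
```

Since each small index is self-contained, a search or an update touches only the one shard that hosts it, which is what makes this kind of data easy to partition.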
Bio: Shaneal Manek is the lead search engineer at Greplin. He was previously the founder and CTO of Signpost.com, which built a geospatial search and recommendation engine on top of Lucene and Lisp.
"Lucene Search in the Salesforce.com Cloud"
Presented by Bill Press, Software Development Manager for Search | Salesforce.com
How do you deploy Lucene to support millions of searches per day, by hundreds of thousands of users (each with distinct privacy settings), over tens of thousands of document sets containing both structured and unstructured data, all the while indexing hundreds of millions of document updates per day? In this talk, we will talk briefly about the search architecture at salesforce.com, and the new challenges posed by new product lines, including Chatter, our new collaboration and social networking application for the enterprise.
Bio: Bill Press has over a decade in enterprise search. He got his start as a co-creator of Ripfire, a horizontally scalable enterprise search engine (long since lost in the mists of multiple acquisitions) and now runs search development at salesforce.com. He is a recovering academic and has a Ph.D. from Caltech in Computation and Neural Systems.
"OER Glue - Pervasively Mashing Up the Web for Teaching and Learning"
Presented by Joel Duffin, CEO | tatemae.com
OER Glue allows you to pervasively mash up the web for teaching and learning in formal and corporate education. It uses Lucene, Solr, Hadoop, and HBase to build massive databases of learning content for search and recommendation to authors and learners. OER Glue lets you mash up content where you find it instead of copying it into a new system. It integrates with everyone instead of re-implementing functionality. With OER Glue you can drag and drop images, videos, text, and applets from any page into any other web page. In addition, you can drag in discussion, quiz, and other widgets. New widgets can be created using a few lines of JavaScript and HTML.
Bio: Joel Duffin has developed web applications for education for the past 10 years. After building a custom recommender system for open education resources (http://www.oerrecommender.org), he happily discovered Lucene and Solr and rebuilt it using that technology. This allowed it to easily provide better support for 8 different languages and improve performance. Two years ago he co-founded Tatemae, LLC, a company that builds open source technology to support online teaching and learning (http://www.tatemae.com). Tatemae recently released OER Glue in beta (http://www.oerglue.com).
"Using HIVE for weblog analysis"
Presented by Joshua Seagroves, Senior Staff Scientist | Information Systems Worldwide Corporation (i_SW corp)
Hive, which is built on top of Hadoop, provides an easy way to enable data ETL and to store Solr/web logs within HDFS for retrieval.
Bio: Joshua specializes in the architecture, design, and development of advanced technology systems. In this capacity, he has developed the applications and components of multiple system development efforts. He is participating in the prototyping and research of Big Data/Cloud Computing activities in the development of operational systems. He has extensive experience in architecture and software modeling methodologies, and he has led and collaborated on multiple Web 2.0 technologies, including wikis, social networks and information sharing software. Mr. Seagroves' education includes a B.S. in Computer Science and multiple certifications, including Cloudera Hadoop developer and system administrator.
"Intro to index with Lucene's DocumentsWriterPerThread"
Presented by Simon Willnauer, Committer/PMC Member | Apache Lucene
A quick introduction to how Lucene's latest indexing improvements speed up indexing by 250% with DocumentsWriterPerThread (DWPT).
Bio: Simon is a Lucene core committer and PMC member. During the last couple of years he has worked on the design and implementation of scalable software systems and search infrastructure. He studied Computer Science at the University of Applied Sciences Berlin. Currently, he works as a consultant for Apache Solr, Lucene Java and Hadoop and is a co-organizer of the "BerlinBuzzwords" conference on scalability, June 2011 in Berlin (Germany).
"How to create hosts to run Solr using Chef (from Opscode)"
Presented by DH Upayavira, Consultant | Sourcesense UK
In this talk I present a way to create hosts to run Solr using Chef (from Opscode). I have created a 'cookbook' that can handle replication, distributed search, and multicore. By assigning different 'roles' to your hosts, they can automatically configure themselves as masters, or slaves, or as co-ordinators in a distributed search setup. This was a last-minute talk, and hence the presentation is sketchy. The cookbook I talk about is coded and works, but is still in need of some TLC before it is ready for public consumption. Feel free to contact me if you're interested in seeing it.
Bio: Upayavira is a consultant for Sourcesense UK, specialising in enterprise search with Apache Solr. With a substantial background in Unix, he always seems to get the systems jobs - and when doing them, he can't stop himself automating them.
Apache Lucene, Lucene, Apache Solr, Solr, Apache Hadoop and Hadoop are trademarks of The Apache Software Foundation.











