How search can solve big data problems

Session Name: The DB And The Index: Why Our Need For Search Is Shaping Our Technology.
Speakers: Announcer, Grant Ingersoll
All right. Up next, we have Mr. Grant Ingersoll. He is the CTO at LucidWorks, and he’s going to be talking about the database and the index: why our need for search is shaping our technology. Please welcome Mr. Grant Ingersoll to the stage.
Great, thank you. Anybody seen one of these before? Anybody used a search box before? In many ways, it’s why we’re all here; we’re all talking about Hadoop. Well, Hadoop was created to solve a problem at Google, or rather the precursor to Hadoop was built at Google to solve a problem, and in many ways it was built to solve the problem of creating search indexes. Yet, when I look at the landscape around big data these days, you don’t see search mentioned very often. I think it’s one of the fundamental things that can help us solve a lot of the challenges that we face in the data landscape we’re living in these days. So, that’s what I’m going to spend the next 15 minutes or so hopefully convincing you of.
So, first and foremost, I like to think of search as NoSQL before NoSQL was cool. And I want to encourage you to think about the things that search does with data that your traditional approaches, rows and columns in a database, simply do not do. So, it is very much the case that search is a system building block. It’s not just about text anymore. Search, really, at the end of the day, is a very effective set of algorithms and data structures that help you solve what I like to call the Top X problem or the Top Ten problem, or essentially any time you need to return a ranked set of results. There are many of these things that go well beyond that search box that I just showed you; things like recommendation, classification, organizing fuzzy data, returning things where you essentially need that ranked set of results. So, at the end of the day, if the algorithms fit, you should use them because they’re going to make your job a lot easier. There are a lot of great tools out there like Lucene and Solr that help you do those kinds of things and make it easy for you to integrate search into your application, such that search should be a critical part of your architecture.
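To make the Top X problem concrete, here is a minimal sketch in Python. It is not from the talk: the `top_k` helper and the sample scores are hypothetical, and a real search engine computes the scores itself rather than receiving them in a dictionary. The point is only that a ranked result set means keeping the best few candidates, not returning everything that matched.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # heapq.nlargest tracks only k candidates while scanning, so the cost
    # stays modest even when the number of scored documents is very large.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical relevance scores for four documents.
scores = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.5, "doc4": 0.7}
best = top_k(scores, 2)
print(best)
```

The same shape applies to recommendation or classification: anything that can assign a score to candidates can be served as a Top X list.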
I think the other thing that you’re really starting to see come out of search and out of the big data space is that you can really embrace fuzziness when you start thinking about problems of this Top K or this Top X kind of result, because if I’ve got 50 petabytes of data in my system or whatever it is, 100 billion documents, it doesn’t do me any good to return all of those back or even a subset of a million documents, etc. You really have a ranking problem on your hands, and search, again, is going to be really effective at dealing with that. The way search does that, by scoring those things according to some notion of similarity or relevance, can also help you solve those kinds of problems. Then last, but not least, what you’re often seeing these days out of search, thanks to many people who have been working in this space for a long, long time, is that there are lots of ways of thinking about relevance, of thinking about importance, of thinking about how we can make that fuzziness work to our advantage. So, there are a lot of scoring features out there that I want you to think about that go well beyond just the traditional search aspects of matching keywords, etc. So that’s what I’m trying to get at here with the diagram on the side here: when you really look back at how search has evolved and how we’ve interacted with text and that kind of data, we really started off at the top. We were working with content: how do we model this content, this unstructured stuff or this rich data, if you will. We looked at access, things like natural language processing of users’ queries, etc., and then this little known company called Google came along and had this idea of this PageRank thing where there are actually relationships between the content.
I think really what you’ve seen now in the last few years in the search space especially, and as it expands out into the big data space, is we’re really looking at how we can then overlay all of the user interactions with the data as well, so that, hopefully by the time you’re mashing all of this stuff together, you have both deeper insight into the content and into the way users are interacting with that content. The combination of that really provides a much more powerful system than if you’re just doing one of those things alone.
For us, one of the things that you often see come out of this is a pretty simple reference architecture, if I can highlight a few things here. Typically, I like to start here in the lower left. Of course, we’ve got to get data into this system. We want to bring in a lot of different disparate data sources. Again, I think that’s one of the big changes that’s happened with search as well. It’s not just about some small set of collections. The fuzziness and that ranking capability really allow you to combine together a lot of different data sources. So, for instance, I’ve been working with a healthcare system which has over 400 different data sources. They really need to embrace this fuzzy ranking type of problem because they simply can’t deal with all of those different data sources and know how to rank them otherwise, or know which ones are more important than others. So, we get the content in; that’s a lot of your traditional tools, etc., that you use to load stuff in. Then really you have this authoritative document store, if you will. Notice I’m not really using the word database there, but oftentimes you’ve seen, at least in the past, that a database plays there, something like HBase or Cassandra, etc. Although more and more I think that you’re also seeing that the search technology is perfectly capable of serving as your authoritative store there in the middle. Really the key part of that centerpiece of this architecture is the fact that it’s not just about keeping the content itself, but you also want to keep track of the users: their profiles, their histories, what they’ve searched on, what they like, what they don’t like, what they’ve rated, what they’ve reviewed, all of that stuff. You also want to keep track of all of that exhaust that comes out of the system, i.e., all of your logs. What are people searching for? How often are they searching for it? When are they searching for it? What’s popular this week? What was popular last week?
All of those things then become really critical aspects of having this overall view of the data and the users. And then we have a variety of services that go around that document store, such that things like your search view here and your analytics services, etc., are always providing to you this best known state of the system, i.e., what do we know about this data at this point in time, such that we can provide that to the users. Then you’re using things like your personalization, your machine learning, your classification, your discovery and enrichment tools, all of those things running usually in Hadoop, etc., or other distributed computation capabilities.
All of those are running in the background, essentially trying to enrich our understanding of the data. So, again, you have this model of best known state of the data and then in the background, we’re working like crazy to make sure that we can improve that best known state of the data by analyzing how users are interacting with that data.
So, that’s what I see as a pretty common reference architecture. Then the interesting thing that comes out of thinking about that reference architecture is you’re starting to see more and more of search playing a really important role in that. Like I said, in our particular case, we see it now as the authoritative document store, for many of our customers as well as in our products. The interesting thing that comes out of that is this notion of what I like to call ‘search abuse’, i.e., you’re using the search engine, this technology, to do things that you normally wouldn’t think of doing with a traditional search engine. I think that a lot of this is really powered by the fact that we have really good open source search capabilities, because it lowers the barrier to experimentation, so that you can take something like Lucene, which is pretty much the most widely deployed search technology on the planet, and really dig in there and be able to create things that allow you to free your mind from the traditional approaches. The other interesting thing that comes out when you look at a search engine like this is that at the end of the day, you’re essentially talking about a really large and fast sparse matrix multiplication. So, there are a lot of problems out there that fall into this category of, at the end of the day, doing sparse matrix multiplications. So, you can think of a search engine as being a pretty effective way of dealing with that.
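The sparse matrix view can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular engine is implemented: `build_index` and `score` are hypothetical helpers, raw term counts stand in for real term weights, and the three documents are made up. The inverted index stores only the nonzero entries of a term-by-document matrix, and scoring a query multiplies a sparse query vector by that matrix while touching only those nonzero entries.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: count}: the nonzero cells of a sparse term-document matrix."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def score(index, query_weights):
    """Multiply a sparse query vector by the sparse matrix, skipping every zero entry."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Only documents that actually contain the term are visited.
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += q_weight * count
    return dict(scores)

# Made-up documents for illustration.
docs = {
    "d1": "search engines rank results",
    "d2": "databases store rows and columns",
    "d3": "search search and rank",
}
index = build_index(docs)
result = score(index, {"search": 1.0, "rank": 0.5})
print(result)
```

Documents that share no terms with the query never appear in the result, which is why the work scales with the number of matching entries rather than with the size of the whole collection.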
Some of the other interesting things that I think you’ve seen come out of Lucene and Solr’s evolution, from being, when I started with the project back in 2004, pretty much a straightforward keyword based search engine, is that we’re using it in a lot of different ways. I’ve seen people use it as a key value store; it’s very, very fast as a key value store, for instance, which could perhaps allow you to avoid having two moving parts in your system, instead of having a key value store and then adding search on top of that key value store. Why not just have a search index that allows you to do very, very high rates of key value lookups and gives you the value of the secondary index as well? We have customers doing 80 to 100 thousand queries per second, key value lookups and more, all off of the search engine. They have completely eliminated their need for other key value stores, which were providing the same type of service. Then, interestingly enough, if you really look especially at the latest releases of Lucene and Solr, around Lucene 4, the 4.x line, there are a lot more database-like features in there now. We can do joins, we can do aggregates, we can do groupings; you really have a lot more of those traditional database things, and much deeper numerical support as well. So, now you can really start to combine the powers of that fuzzy matching search that you’re doing, things like faceting or guided navigation that you’ve traditionally seen out of search engines, with the number crunching to go along with it, so that we can start to look at whole document sets and drive more traditional BI tools off of the actual search results themselves, as opposed to having to go off and do all of that stuff in SQL or something like that and then try to do the fuzzy matching stuff in the search engine and join them across two different products.
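As a rough sketch of why one index can cover key value lookups, secondary lookups, and facet-style counts at once, consider the toy structure below. It is an assumption-laden illustration in Python, not the API of any real search engine: the `TinyIndex` class, its method names, and the sample records are all hypothetical, and it ignores updates, deletes, and persistence.

```python
from collections import Counter, defaultdict

class TinyIndex:
    """Toy index: primary-key lookup plus an inverted index over field values."""

    def __init__(self):
        self.by_id = {}                   # primary key -> stored document
        self.by_field = defaultdict(set)  # (field, value) -> ids of matching documents

    def add(self, doc_id, doc):
        # Assumes each id is added once; re-adding would leave stale postings.
        self.by_id[doc_id] = doc
        for field, value in doc.items():
            self.by_field[(field, value)].add(doc_id)

    def get(self, doc_id):
        """Key value lookup by primary key."""
        return self.by_id.get(doc_id)

    def find(self, field, value):
        """Secondary lookup: ids of all documents whose field equals value."""
        return sorted(self.by_field.get((field, value), set()))

    def facet(self, field):
        """Count documents per distinct value of a field, as in guided navigation."""
        return Counter(doc[field] for doc in self.by_id.values() if field in doc)

# Made-up records for illustration.
idx = TinyIndex()
idx.add("u1", {"city": "Boston", "plan": "free"})
idx.add("u2", {"city": "Austin", "plan": "paid"})
idx.add("u3", {"city": "Boston", "plan": "paid"})

print(idx.get("u2"))
print(idx.find("city", "Boston"))
print(idx.facet("plan"))
```

The design choice being illustrated is that the stored documents and the inverted postings live side by side, so a single system answers "give me record u2", "who is in Boston", and "how many users per plan" without a second store.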
We also have a number of other things, if you really want to geek out underneath the hood, that allow you to customize Lucene in a number of different ways, for performance, etc. So, for instance, things like the use of finite state transducers. I know it’s probably a really geeky term, but it really allows you to do some really interesting things in terms of dealing with text. All of that stuff is pretty powerful when you start to bring those things into other applications that have search involved or have text involved or have a combination of attributed data with text.
To the third bullet point there, this stuff really, really scales. I think you’ve all heard of Twitter. Twitter search is powered entirely by Lucene. They turn around tweets: every tweet that comes in is searchable roughly within about 50 milliseconds or so, or microseconds, of when it hits their system. They do however many index updates a day off of search and well over a billion queries a day. They’ve published all of these numbers. That’s all powered off of Lucene. So, in many ways, if you can scale the Twitter load on Lucene, I think it’s more than good enough for most of us out there.
Like I said earlier with the relevance things, the scoring features, etc., there have really been some pretty interesting changes around the way we can model relevance and importance in the search engine, such that if you’ve perhaps dealt with last generation search technology, I really think you ought to revisit this, because there are a lot more capabilities here. It’s a lot more pluggable for people who are interested, and you have a lot more control over what are good results and what are bad results. You can, for instance, inject a lot of your business needs into the search results as well, so that you have that ability to fine tune things where you want to, but you can also then leverage the system at scale to do the things that it’s really good at as well.
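One way to picture injecting business needs into search results is a multiplicative boost applied on top of the text relevance score. The Python sketch below is hypothetical throughout: the `rerank` function, the attribute names, and the boost values are invented for illustration, and real engines expose their own boosting and scoring hooks that this does not attempt to reproduce.

```python
def rerank(hits, boosts):
    """Rescore hits by business rules.

    hits:   list of (doc_id, text_score, attributes) tuples
    boosts: {(field, value): multiplier} applied when a hit has that attribute
    """
    rescored = []
    for doc_id, text_score, attrs in hits:
        multiplier = 1.0
        for field, value in attrs.items():
            # Attributes without a rule leave the score unchanged.
            multiplier *= boosts.get((field, value), 1.0)
        rescored.append((doc_id, text_score * multiplier))
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)

# Made-up hits: text relevance alone would rank b first.
hits = [
    ("a", 2.0, {"in_stock": True}),
    ("b", 3.0, {"in_stock": False}),
    ("c", 1.0, {"in_stock": True, "sponsored": True}),
]
# Hypothetical business rules: demote out-of-stock items, promote sponsored ones.
boosts = {("in_stock", False): 0.5, ("sponsored", True): 4.0}
ranked = rerank(hits, boosts)
print(ranked)
```

The text score still does most of the work; the rules only tilt the ordering where the business has an opinion.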
So, with that, I’ve got about a minute left. I’m happy to leave you with a few links and I’m also happy to take some questions if there are any out there.
There’s got to be one question. Maybe I should call on people. A good friend of mine calls on people if there are no audience questions. How many people, for instance, have used Lucene and Solr? All right, a handful. How many people are doing natural language processing or dealing with text on a daily basis? See, so if you don’t ask me questions then I’m going to ask you questions. Okay. Well, with that then, I thank you for your time and again, I would encourage you to look at Lucene and Solr, and enjoy the rest of your day.