As I blogged a couple of days ago, I just recently put up a new site at DougHughes.net. One thing I wanted to do was implement a search tool that used the Verity spider which comes with ColdFusion. After much research and hair-pulling I found out that the Verity spider’s not supported on Linux. D’oh! I pondered several solutions to the problem, including writing my own spider in Java, and then feeding the data into Verity somehow or other.
The problem I kept running into is that Verity, at least as it ships with ColdFusion, is really intended for indexing files on disk, not dynamic content served directly from websites. This meant that to index HTML and binary content, I would need a spider to cache web documents to disk and then a bulk insert file to index them with Verity.
I tried several different approaches to the problem, but I wasn’t really very happy with any of them. In the end, I did a little more research and found out about a custom tag called Lindex which was distributed on Macromedia’s DRK 3 CD and which used a search engine I’d never heard of to index content. The search engine was Lucene.
Lucene, it seems, is a “high-performance, full-featured text search engine library written entirely in Java”. It also happens to be free, open source, and published by the Apache Software Foundation. I downloaded it right away and started learning how to use it.
I’ve now built a CFC which uses Lucene to create indexes, index content, and search indexes. Currently I’m only indexing HTML content; however, I plan to extend that to PDF, DOC, XLS, PPT, and RTF, according to the FAQs I found. Content is fed to Lucene via a half-assed spider I wrote using cfhttp. Aside from some things which could use a little work on my side, I’ve been very happy with the performance and functionality of Lucene and my search component.
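For the curious, here is a rough sketch of the two jobs a spider like that has to do before anything reaches the indexer: find the links worth following, and boil a page down to plain text. It is written in plain Java rather than CFML, it is not the actual CFC, and the class and method names (SpiderSketch, extractLinks, toText) are made up for illustration. The regex-based HTML handling is deliberately naive; a real spider would want a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or ColdFusion.
class SpiderSketch {

    // Naive match for href="..." or href='...' inside an <a> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the distinct absolute links in html that stay on pageUrl's host. */
    static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        Set<String> links = new LinkedHashSet<>(); // keeps order, drops duplicates
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash); // ignore in-page fragments
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href); // relative -> absolute
                // Opaque links (mailto:, javascript:) have no host and are skipped.
                if (base.getHost() != null
                        && base.getHost().equalsIgnoreCase(resolved.getHost())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href; skip it rather than abort the crawl.
            }
        }
        return new ArrayList<>(links);
    }

    /** Strips scripts, styles, and tags so only indexable text remains. */
    static String toText(String html) {
        String s = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        s = s.replaceAll("<[^>]+>", " ");
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Read the <a href=\"/about.cfm\">about page</a>.</p>";
        System.out.println(extractLinks("http://example.com/blog/index.cfm", html));
        System.out.println(toText(html));
    }
}
```

The string returned by toText is what would be handed to the indexer as the document body, and the list from extractLinks is the queue of pages to fetch next. Restricting links to the starting host is one simple way to keep a crawl from wandering off the site.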
It’s just so much more powerful and easier to use than Verity. It’s exactly what I need. I love the fact that I have an API I can code against. (There’s really not an API for Verity.) If you’re using (or failing to use) Verity, I strongly suggest looking into Lucene. When all is said and done, I’ll probably release the search component as an Alagad product. Maybe by 2005? It could happen!
Comments on: "Searching with Lucene" (5)
Have you been able to get the Lucene Multisearcher function working from CF?
I’ve got the indexing down and can do a simple search but setting up the multisearcher is giving me a headache — I’m a CF programmer, not a Java jock.
Yes and no. I have gotten the multi searcher to work, but from Java. However, that said, it shouldn’t be that hard from ColdFusion.
Sorry, but at the moment I can’t offer much more help than that.
I figured it out — I was overthinking it trying to create a Java array of searchers when a CF array worked just fine.
Anyone want to pony up some code for this? 🙂
Did you ever get this solid and working? How about your spider? Any thoughts or recommendations for spidering sites in CF7? I’m having problems with vspider as well.
Have some ideas, but any input much appreciated!