epifonyCrawlerPlugin
0.1.2beta
for sf 1.4 and sf 1.3, MIT license
epifonyCrawler imitates a web search engine: it spiders your site and adds the pages to a Lucene index using the excellent Zend Lucene library.
The plugin lets you control how your site is crawled by excluding certain areas, such as navigation, from the index. The crawler is also i18n aware and will index your translated copy.
Developers
| Name | Status | Email |
| Pod1 Ltd | lead | symfony <<at>> pod1.com |
License
Copyright (c) 2010 Pod1
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Installation
To install the plugin for a symfony project, the usual process is to use the symfony command line:
symfony plugin:install epifonyCrawlerPlugin --stability=alpha
The plugin includes the required Zend Lucene code.
Crawling your domain
You can create your index by running:
symfony epifony:generate-index http://www.example.com --depth=2
This will crawl links up to 2 pages deep. If you omit the depth then every link will be crawled, but be warned that this could take a long time and will consume a lot of memory. I would recommend a depth of about 5 for most sites. If you have very deep navigation on your site then 8 or 9 should suffice.
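To make the effect of --depth concrete, here is a minimal sketch of a depth-limited crawl over an in-memory link graph. It is not the plugin's implementation: the function name crawlToDepth, the sample $links array, and the convention that the start URL counts as depth 0 are all assumptions made for illustration. The -1 value for "no limit" mirrors the depth default in the plugin configuration.

```php
<?php
// Hypothetical link graph standing in for real pages: URL => links found on that page.
$links = array(
    '/'            => array('/about', '/blog'),
    '/about'       => array('/team'),
    '/blog'        => array('/blog/post-1'),
    '/team'        => array('/team/alice'),
    '/blog/post-1' => array(),
    '/team/alice'  => array(),
);

/**
 * Breadth-first walk that stops following links once $maxDepth is reached.
 * Assumption: the start URL is depth 0, and -1 means "no limit".
 */
function crawlToDepth(array $links, $start, $maxDepth)
{
    $visited = array($start => 0); // URL => depth at which it was first seen
    $queue = array($start);

    while ($queue) {
        $url = array_shift($queue);
        $depth = $visited[$url];
        if ($maxDepth >= 0 && $depth >= $maxDepth) {
            continue; // keep this page, but do not follow its links
        }
        $outgoing = isset($links[$url]) ? $links[$url] : array();
        foreach ($outgoing as $next) {
            if (!isset($visited[$next])) {
                $visited[$next] = $depth + 1;
                $queue[] = $next;
            }
        }
    }

    return array_keys($visited);
}

// With a depth of 2, /team/alice (3 links away from the start) is never reached.
print_r(crawlToDepth($links, '/', 2));
```

The sketch uses a queue rather than recursion, so an unlimited crawl in this form grows the queue instead of the call stack.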
You can add more than one domain to the index:
symfony epifony:generate-index http://www.example.com,http://blog.example.com --depth=5
Or more than one language:
symfony epifony:generate-index http://en.example.com/,http://fr.example.com/ --depth=5
Or add individual pages to your index like this:
symfony epifony:add-url http://www.example.com/new-page
Now you can test your search by running a Lucene query on it:
symfony epifony:search-query "dependency injection"
results:
id:91-----------------------------
score: 1
url: http://www.example.com/dependency-injection
id:234----------------------------
score: 0.518689
url: http://www.example.com/programming-design-patterns
Configuration
You can override most classes using your app.yml:
all:
  epifonyCrawler:
    index_class: Zend_Search_Lucene
    # handles the DI for all the classes
    manager_class: epifonyCrawlerManager
    # defaults to all links - not recommended because the PHP maximum nesting level can be reached quite quickly
    depth: -1
    # takes the url, reads the headers and gets the content
    browser_class: sfWebBrowser
    # null value attempts to guess
    browser_adapter_class: sfCurlAdapter
    browser_adapter_options:
      followlocation: true
    browser_default_headers:
      User-Agent: epifonyCrawler v0.1
    # the amount of debugging shown to the user
    debug_level: E_NOTICE)):?>31
    # save the debugging to a file
    log_file_path: %sf_log_dir%/search.log
    logger: sfFileLogger
    # where to save the index
    data_dir: %sf_data_dir%/search/
    crawler_class: epifonyCrawler
    # assigns a class to read the content based on the mime type from the browser class
    extractor_factory: epifonyCrawlerExtractorFactory
    # the classes to assign to parsing a mime type
    mime_document_classes:
      text/html: epifonyCrawlerHtmlExtractor
      application/pdf: epifonyCrawlerPdfExtractor
    # you can pass constructor options to the mime_document_classes here
    document_constructor_options:
      epifonyCrawlerHtmlExtractor:
        # give boost to certain html elements
        boost:
          h1: 1.5
          description: 1.5
          h2: 1.2
          title: 1.6
Example of overriding the mime document class
Change the class in the app.yml to your own
mime_document_classes:
  text/html: myEpifonyCrawlerHtmlExtractor
Override one of the parsing options - for example to remove some text from all the page titles
class myEpifonyCrawlerHtmlExtractor extends epifonyCrawlerHtmlExtractor
{
    public function processTitle($xpath)
    {
        $docTitle = '';
        $titleNodes = $xpath->query('/html/head/title');
        foreach ($titleNodes as $titleNode) {
            // remove the url from the title
            $docTitle .= str_replace('www.example.com', '', $titleNode->nodeValue) . ' ';
        }
        $field = $this->createField($docTitle, 'title');
        $this->addField($field);
    }
}
Selective Indexing
Some areas of your site might not need to be indexed, for example the navigation, which could skew the results of the search. There are 3 ways to exclude content from being indexed and so improve the quality of your index:
- robots.txt: add the epifonyCrawler user-agent to your robots.txt and block by subdomain or subdirectory
- rel="nofollow": add this to links that you don't want the crawler to follow
- class="crawl_ignore": add this class to navigation or other HTML tags that you don't want to be indexed
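As a rough illustration of what excluding crawl_ignore elements can look like, the sketch below removes them from a page with PHP's DOM extension before the text is read. This is not the plugin's own code, and the function name textWithoutIgnored is hypothetical; it only shows the general technique.

```php
<?php
/**
 * Returns the text of an HTML document with every element carrying the
 * crawl_ignore class removed. Illustrative only; not the plugin's own code.
 */
function textWithoutIgnored($html)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match crawl_ignore as a whole word inside the class attribute.
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " crawl_ignore ")]';
    // Copy the matches to an array so removals cannot disturb the iteration.
    $ignored = iterator_to_array($xpath->query($query));
    foreach ($ignored as $node) {
        if ($node->parentNode) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collapse whitespace runs so the result is a single line of text.
    return trim(preg_replace('/\s+/', ' ', $doc->documentElement->textContent));
}

$page = '<html><body><ul class="nav crawl_ignore"><li>Home</li></ul>'
      . '<p>Dependency injection</p></body></html>';
echo textWithoutIgnored($page), "\n";
```

Matching the class as a whole word means an unrelated class such as crawl_ignored is left alone.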
I18n
By default, the HTML meta content-language value of each page is stored in the index and can be added to any search query. See the next section for an example of this.
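A small helper can keep the language filter in one place. The sketch below is an assumption for illustration, not part of the plugin: the function name buildLanguageQuery and the two-letter validation rule are hypothetical, and only the content-language field name comes from this README.

```php
<?php
/**
 * Appends a content-language filter to the user's search terms.
 * The function name and validation rule are assumptions for illustration.
 */
function buildLanguageQuery($terms, $language)
{
    $terms = trim($terms);
    if ($terms === '') {
        throw new InvalidArgumentException('Search terms must not be empty.');
    }
    // Accept only two-letter codes such as "en" or "fr" so nothing else leaks into the query.
    if (!preg_match('/^[a-z]{2}\z/', $language)) {
        throw new InvalidArgumentException('Unexpected language code: ' . $language);
    }
    // Parentheses keep a multi-word search grouped before the AND is applied.
    return '(' . $terms . ') AND content-language:' . $language;
}

echo buildLanguageQuery('dependency injection', 'en'), "\n";
```

The resulting string can be passed to the index in the same way as the hand-built query in the next section.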
Creating a search results page
I'm currently working on epifonySearchPlugin, which will handle the search results for an index. In the meantime, you can create a search in your frontend app like this.
In your action.class.php:
$crawlerManager = epifonyCrawlerManager::getInstance();
$index = $crawlerManager->openIndex();
// only search english results
$this->hits = $index->find($request->getParameter('search').' AND content-language:en');
In your template:
foreach ($hits as $hit) {
    echo $hit->score;
    echo $hit->title;
    echo $hit->url;
}
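Titles and URLs come from crawled pages, so it is usually worth escaping them before output. The following is a minimal sketch under that assumption; renderHits is a hypothetical helper, and the stand-in objects only mimic the score, title and url properties used in the loop above.

```php
<?php
/**
 * Renders hits as an HTML list, escaping every stored value.
 * $hits only needs objects exposing score, title and url.
 */
function renderHits($hits)
{
    $items = array();
    foreach ($hits as $hit) {
        $title = htmlspecialchars($hit->title, ENT_QUOTES, 'UTF-8');
        $url = htmlspecialchars($hit->url, ENT_QUOTES, 'UTF-8');
        $items[] = sprintf('<li><a href="%s">%s</a> (%.2f)</li>', $url, $title, $hit->score);
    }

    return $items ? "<ul>\n" . implode("\n", $items) . "\n</ul>" : '<p>No results.</p>';
}

// Stand-in hit objects; real hits would come from $index->find().
$demo = array(
    (object) array(
        'score' => 1,
        'title' => 'Dependency <injection>',
        'url'   => 'http://www.example.com/dependency-injection',
    ),
);
echo renderHits($demo), "\n";
```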
Helping with your index
Your index needs attention to make sure the results it returns are right for your site. Luke is an excellent tool for analysing your index and can be installed on Windows or Unix.
TODO
- Add more mime types to the extractors, e.g. Word, RSS
- Extract links from a PDF (currently returns an empty array)
- Test with different charsets (currently only UTF-8 is supported)
- Create a script to back up the index, plus roll-forward functionality so a new index can be generated without affecting the current index
- Some sort of log rotation