epifonyCrawlerPlugin - 0.1.2

Website crawler that adds your site content to a Lucene Index for easy and efficient searching


epifonyCrawlerPlugin

Introduction

The aim of epifonyCrawler is to imitate a web search engine: it spiders your site and adds the pages to a Lucene index using the excellent Zend Lucene library.

This plugin gives you control over how your site is crawled: you can exclude certain areas, such as navigation, from being indexed. The crawler is also i18n aware and will index your translated copy.

Installation

To install the plugin for a symfony project, the usual process is to use the symfony command line:

symfony plugin:install epifonyCrawlerPlugin --stability=alpha

The plugin includes the required Zend Lucene code.

Crawling your domain

You can create your index by running:

symfony epifony:generate-index http://www.example.com --depth=2

This will crawl every link to a depth of 2 pages. If you omit the depth then every link will be crawled, but be warned that this could take a long time and will consume a lot of memory. I would recommend a depth of about 5 for most sites. If you have very deep navigation on your site then 8 or 9 should suffice.
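To make the effect of the depth option more concrete, here is a minimal sketch of a depth-limited crawl. It is not the plugin's own code: the function name crawlToDepth, the in-memory $links array (standing in for real HTTP requests) and the choice to count the start page as depth 0 are all assumptions made for illustration. It also uses an iterative queue rather than recursion.

```php
<?php
// Hypothetical link graph: each URL maps to the links found on that page.
$links = array(
    '/'            => array('/about', '/blog'),
    '/about'       => array('/team'),
    '/blog'        => array('/blog/post-1'),
    '/team'        => array('/team/alice'),
    '/blog/post-1' => array(),
    '/team/alice'  => array(),
);

/**
 * Breadth-first crawl that stops following links at $maxDepth.
 * A negative $maxDepth means "no limit" (assumed to mirror depth: -1).
 * Returns an array of url => depth at which the url was discovered.
 */
function crawlToDepth(array $links, $start, $maxDepth)
{
    $visited = array($start => 0); // the start page is assumed to be depth 0
    $queue   = array($start);

    while ($queue) {
        $url   = array_shift($queue);
        $depth = $visited[$url];

        // Pages at the depth limit are kept, but their links are not followed.
        if ($maxDepth >= 0 && $depth >= $maxDepth) {
            continue;
        }

        $outgoing = isset($links[$url]) ? $links[$url] : array();
        foreach ($outgoing as $next) {
            if (!isset($visited[$next])) {
                $visited[$next] = $depth + 1;
                $queue[] = $next;
            }
        }
    }

    return $visited;
}

// Pages reachable within 2 links of the start page.
$found = crawlToDepth($links, '/', 2);
```

With a limit of 2, /team/alice (three links away from the start page) is never visited, which is why a small depth keeps both run time and memory use down.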

You can add more than one domain to the index:

symfony epifony:generate-index http://www.example.com,http://blog.example.com --depth=5

Or more than one language:

symfony epifony:generate-index http://en.example.com/,http://fr.example.com/ --depth=5

Or add individual pages to your index like this:

symfony epifony:add-url http://www.example.com/new-page

Now you can test your search by running a Lucene query on it:

symfony epifony:search-query "dependency injection"

results:
id:91-----------------------------
score: 1
url: http://www.example.com/dependency-injection
id:234----------------------------
score: 0.518689
url: http://www.example.com/programming-design-patterns

Configuration

You can override most classes using your app.yml:

all:
  epifonyCrawler:

    index_class: Zend_Search_Lucene

    # handles the DI for all the classes
    manager_class: epifonyCrawlerManager

    # defaults to all links - not recommended because the PHP maximum nesting level can be reached quite quickly
    depth: -1

    # takes the url, reads the headers and gets the content
    browser_class: sfWebBrowser

    #null value attempts to guess
    browser_adapter_class: sfCurlAdapter
    browser_adapter_options:
      followlocation: true
    browser_default_headers:
      User-Agent: epifonyCrawler v0.1

    #the amount of debugging shown to the user
    debug_level: 31

    #save the debugging to a file
    log_file_path: %sf_log_dir%/search.log
    logger: sfFileLogger

    #where to save the index
    data_dir: %sf_data_dir%/search/

    crawler_class: epifonyCrawler

    #assigns a class to read the content based on the mime type from the browser class
    extractor_factory: epifonyCrawlerExtractorFactory

    #the classes to assign to parsing a mime type
    mime_document_classes:
      text/html: epifonyCrawlerHtmlExtractor
      application/pdf: epifonyCrawlerPdfExtractor

    #you can pass constructor options to the mime_document_classes here
    document_constructor_options:
      epifonyCrawlerHtmlExtractor:
        #give boost to certain html elements
        boost: 
          h1: 1.5
          description: 1.5
          h2: 1.2
          title: 1.6
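As a rough illustration of what the extractor_factory and mime_document_classes settings describe, the sketch below picks an extractor class name from a Content-Type header. The function resolveExtractorClass is hypothetical and is not part of the plugin; the real epifonyCrawlerExtractorFactory may work differently. Only the two class names are taken from the configuration above.

```php
<?php
/**
 * Hypothetical helper: map a Content-Type header to an extractor class name.
 * Returns null when no class is registered for the mime type.
 */
function resolveExtractorClass($contentTypeHeader, array $mimeClasses)
{
    // Drop parameters such as "; charset=UTF-8" and normalise the case.
    $parts = explode(';', $contentTypeHeader, 2);
    $mime  = strtolower(trim($parts[0]));

    return isset($mimeClasses[$mime]) ? $mimeClasses[$mime] : null;
}

// Same mapping as mime_document_classes in the app.yml example.
$mimeClasses = array(
    'text/html'       => 'epifonyCrawlerHtmlExtractor',
    'application/pdf' => 'epifonyCrawlerPdfExtractor',
);

$class = resolveExtractorClass('text/html; charset=UTF-8', $mimeClasses);
```

A response with an unregistered mime type, such as image/png, resolves to null in this sketch, so the caller can skip the document instead of failing.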

Example of overriding the mime document class

Change the class in the app.yml to your own:

    mime_document_classes:
      text/html: myEpifonyCrawlerHtmlExtractor

Override one of the parsing options, for example to remove some text from all the page titles:

class myEpifonyCrawlerHtmlExtractor extends epifonyCrawlerHtmlExtractor
{

  public function processTitle($xpath)
  {

    $docTitle = '';

    $titleNodes = $xpath->query('/html/head/title');

    foreach ($titleNodes as $titleNode) {
      // remove the url from the title
      $docTitle .= str_replace('www.example.com', '', $titleNode->nodeValue) . ' ';
    }
    $field = $this->createField($docTitle, 'title');

    $this->addField($field);

  }      
}

Selective Indexing

Some areas of your site might not need to be indexed, for example the navigation, which could skew the results of the search. There are 3 ways to exclude content from the index and so improve its quality:

  1. robots.txt: add the epifonyCrawler user-agent to your robots.txt and block by subdomain or subdirectory
  2. rel="nofollow": add this to links that you don't want the crawler to follow
  3. class="crawl_ignore": add this class to navigation or other html tags that you don't want to be indexed
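The last two options can be sketched with PHP's DOM extension. This is only an illustration of the idea, not the plugin's extractor: the function extractIndexable is hypothetical, and it assumes that links inside a crawl_ignore element are dropped along with the element's text.

```php
<?php
/**
 * Hypothetical helper: return the indexable text and followable links of a page.
 * Elements with class "crawl_ignore" are removed; rel="nofollow" links are skipped.
 */
function extractIndexable($html)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so parser warnings are suppressed.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Match "crawl_ignore" as a whole word inside the class attribute.
    $ignored = $xpath->query(
        '//*[contains(concat(" ", normalize-space(@class), " "), " crawl_ignore ")]'
    );
    foreach (iterator_to_array($ignored) as $node) {
        if ($node->parentNode) {
            $node->parentNode->removeChild($node);
        }
    }

    // Collect the remaining links, skipping any marked rel="nofollow".
    $links = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $links[] = $a->getAttribute('href');
        }
    }

    $body = $xpath->query('//body')->item(0);
    $text = $body ? trim(preg_replace('/\s+/', ' ', $body->textContent)) : '';

    return array('text' => $text, 'links' => $links);
}

$page = '<html><body><div class="nav crawl_ignore"><a href="/menu">Menu</a></div>'
      . '<p>Hello world</p><a href="/a">A</a><a rel="nofollow" href="/b">B</a></body></html>';
$result = extractIndexable($page);
```

In this example the navigation block contributes neither text nor links, and only /a is left for the crawler to follow.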

I18n

By default the html meta content-language value is stored in the index and can be added onto any search query. See the next section for an example of this.
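A minimal sketch of adding the language to a query string follows. The helper withLanguageFilter is hypothetical, and restricting the language to a two or three letter code is an assumption; only the content-language field name comes from this plugin. The user's query is wrapped in parentheses so that the language filter applies to the whole query, however the parser groups multi-word input.

```php
<?php
/**
 * Hypothetical helper: restrict a user query to one content-language value.
 */
function withLanguageFilter($userQuery, $language)
{
    $userQuery = trim($userQuery);
    if ($userQuery === '') {
        throw new InvalidArgumentException('Search query must not be empty.');
    }

    // Assumption: languages are stored as plain codes such as "en" or "fr".
    if (!preg_match('/^[a-z]{2,3}$/', $language)) {
        throw new InvalidArgumentException('Unexpected language code: ' . $language);
    }

    return '(' . $userQuery . ') AND content-language:' . $language;
}

$query = withLanguageFilter('dependency injection', 'en');
```

The resulting string can then be passed to the index's find() method in place of a hand-built query.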

Creating a search results page

I'm currently working on epifonySearchPlugin, which will handle the search results for an index. In the meantime, you can create a search in your frontend app like this.

In your actions.class.php:

 $crawlerManager = epifonyCrawlerManager::getInstance();

 $index = $crawlerManager->openIndex();

 // only search english results
 $this->hits = $index->find($request->getParameter('search').' AND content-language:en');

In your template:

foreach ($hits as $hit) {
    echo $hit->score;
    echo $hit->title;
    echo $hit->url;
}

Helping with your index

Your index needs attention to make sure the results it returns are right for your site. Luke is an excellent tool for analysing your index and can be installed on Windows or Unix.

TODO

  • Add more mime types to the extractors, e.g. Word, RSS
  • Extract links from a PDF (currently returns an empty array)
  • Test with different charsets (currently only UTF-8 is supported)
  • Create a script to back up the index, with roll-forward functionality so a new index can be generated without affecting the current index
  • Some sort of log rotation