Dec 4, 2010

FAST Search Web Crawler – Part I

Many people don’t know that FAST Search for SharePoint contains two Web crawlers (I was one of them ;).
  • Web site indexing connector
    • Use when you have a limited amount of Web sites to crawl, without dynamic content.
  • FAST Search Web crawler
    • Use when you have many Web sites to crawl.
    • Use when the Web site content contains dynamic data, including JavaScript.
    • Use when the organization needs access to advanced Web crawling, configuration and scheduling options.
    • Use when you want to crawl RSS Web content.
    • Use when the Web site content uses advanced logon options.
The Web site indexing connector can be configured directly in Central Administration, but it is rather limited. I was really dissatisfied because it could not meet my requirements. What are my requirements?
I want to crawl our enterprise wikis, team sites, profiles and especially the social bookmarks. This can easily be achieved with a SharePoint content source. BUT, I also want to crawl external (internet) pages that are referenced within wikis and social bookmarks. A big part of our company knowledge consists of hyperlinks to internet resources like articles and blog posts that have been collected and verified by our employees. I want the crawler to follow the external links and crawl only that one page (1 hop, page depth 0). Additionally, I want to crawl our internet blogs, which are also an important part of our company knowledge (0 hops, full page depth). The SharePoint crawler is not able to follow external links at all. So I started to evaluate the simple Web crawler, but unfortunately with no luck. The simple Web crawler only lets you specify hops and page depth, with no further constraints. For our blogs this would be okay, but not for the wikis and the social bookmarks. The configuration for the blogs and our homepage can easily be achieved with 0 hops and a full page depth. But for the wikis and the social bookmarks I need a more fine-grained configuration: crawl the intranet with 1 hop and full page depth, and after the hop (on the external site) crawl just that page (page depth 0).
This was not possible with the simple web crawler. I couldn’t believe that FAST would not be able to satisfy this requirement…
…and finally I found the FAST Search Web Crawler.
…but I also found a lack of documentation ;-) First of all: forget the UI! This crawler can only be configured through command-line tools and XML configuration files living on the FAST servers. The most important part is to create the Web Crawler Configuration as described here.

Web Crawler Configuration

<?xml version="1.0"?>
<CrawlerConfig>
<!-- Crawl collection name, must be unique for each collection.      -->
<!-- Documents are indexed in the collection by the same name.       -->
<DomainSpecification name="sp">
<SubDomain name="intranet">
<attrib name="start_uris" type="list-string">
<member>http://intranet</member>
</attrib>
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member>http://intranet</member>
<member>http://mysite</member>
</attrib>
</section>
<section name="passwd">
<attrib name="http://intranet" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
<attrib name="http://mysite" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
</section>
<section name="crawlmode">
<!--Crawl depth (use DEPTH:n to do level crawling).-->
<attrib name="mode" type="string">FULL</attrib>
<!--Follow links from one hostname to another (interlinks).-->
<attrib name="fwdlinks" type="boolean">yes</attrib>
<!--Reset crawl level when following interlinks.-->
<attrib name="reset_level" type="boolean">no</attrib>
<attrib name="robots" type="boolean">no</attrib>
<attrib name="max_uri_recursion" type="integer">5</attrib>
</section>
</SubDomain>

<SubDomain name="full_depth_no_hops">
<attrib name="start_uris" type="list-string">
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://aknauer.blogspot.com</member>
<member>http://www.markus-alt.de</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>

</attrib>
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://aknauer.blogspot.com</member>
<member>http://www.markus-alt.de</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>
</attrib>
</section>

<section name="crawlmode">
<!--Crawl depth (use DEPTH:n to do level crawling).-->
<attrib name="mode" type="string">FULL</attrib>
<!--Follow links from one hostname to another (interlinks).-->
<attrib name="fwdlinks" type="boolean">no</attrib>
<!--Reset crawl level when following interlinks.-->
<attrib name="reset_level" type="boolean">no</attrib>
</section>
<attrib name="max_uri_recursion" type="integer">5</attrib>
</SubDomain>

<!-- List of start (seed) URIs. -->
<attrib name="start_uris" type="list-string">
<member>http://intranet</member>
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://www.markus-alt.de/blog</member>
<member>http://aknauer.blogspot.com</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>
</attrib>

<!-- Include and exclude rules. Each rule section may contain      -->
<!-- the following types: exact, prefix, suffix, regexp and file. -->
<!-- See "include domains" for an example.                        -->

<!-- Include the following hostnames in the crawl. If no hostnames -->
<!-- are specified, the crawler will crawl any hostname unless     -->
<!-- "include_uris" are specified, in which case only URIs         -->
<!-- matching those rules are crawled.                              -->
<section name="include_domains">
<attrib name="exact" type="list-string"></attrib>
<attrib name="prefix" type="list-string"></attrib>
<attrib name="suffix" type="list-string"></attrib>
<attrib name="file" type="list-string"></attrib>
</section>

<!-- Include the following URIs in the crawl. -->
<section name="include_uris"></section>

<!-- The following hostnames will be excluded from the crawl, -->
<!-- even if they were included by include rules above.       -->
<section name="exclude_domains"></section>

<!-- The following URIS will be excluded from the crawl, -->
<!-- even if they were included by include rules above.  -->
<section name="exclude_uris"></section>

<!-- Crawl Mode -->
<section name="crawlmode">
<!-- Crawl depth (use DEPTH:n to do level crawling). -->
<attrib name="mode" type="string">DEPTH:0</attrib>
<!-- Follow links from one hostname to another (interlinks). -->
<attrib name="fwdlinks" type="boolean">no</attrib>
<!-- Reset crawl level when following interlinks. -->
<attrib name="reset_level" type="boolean">no</attrib>
</section>
<section name="passwd">
<attrib name="http://intranet" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
</section>

<attrib name="robots" type="boolean">no</attrib>
<attrib name="max_uri_recursion" type="integer">5</attrib>
<!-- Delay in seconds between requests to a single site -->
<attrib name="delay" type="real">60</attrib>
<!-- Length of crawl cycle expressed in minutes -->
<attrib name="refresh" type="real">1440</attrib>
<!-- Maximum number of documents to retrieve from one site. -->
<attrib name="max_doc" type="integer">5000</attrib>
<!-- Let each Node Scheduler crawl this many sites simultaneously. -->
<attrib name="max_sites" type="integer">32</attrib>
<!-- Maximum size of a document (bytes). -->
<attrib name="cut_off" type="integer"> 5000000 </attrib>
<!-- Toggle JavaScript support (using the Browser Engine). -->
<attrib name="use_javascript" type="boolean"> no </attrib>
<!-- Toggle near duplicate detection. -->
<attrib name="near_duplicate_detection" type="boolean">no</attrib>

<!-- Inclusion and exclusion.                                   -->
<!--                                                            -->
<!-- The following section sets up what content to crawl and    -->
<!-- not to crawl.                                              -->
<!-- Only crawl HTTP/HTTPS (e.g., don't crawl FTP). -->
<attrib name="allowed_schemes" type="list-string">
<member> http </member>
<member> https </member>
</attrib>
<!-- Allow these MIME types to be retrieved. -->
<attrib name="allowed_types" type="list-string">
<member> text/* </member>
<member> application/* </member>
</attrib>

</DomainSpecification>
</CrawlerConfig>

The main domain specification for the content collection sp (sp is the default collection for SharePoint) is configured to crawl with a page depth of 0 (DEPTH:0) and not to follow hyperlinks pointing to other domains (fwdlinks=no). The start URIs contain the URLs of our intranet, homepage and blogs. At this level there is no include URI or domain pattern specified, which means any URL will match this rule. In the sub domain “intranet” I defined http://intranet as start URI and, very important, included only URIs that start with http://intranet or http://mysite in this sub domain. I also defined a full page depth and allowed the crawler to follow external URLs. Every URL that starts with one of the previously defined URL prefixes will follow this crawl rule. This means that http://intranet will be crawled with full depth and external links to other domains will be followed. But the followed domains (e.g. “http://microsoft.com”) will not match the intranet sub domain pattern and will fall back to the main domain specification, which only crawls that particular page. There is also a sub domain called “full_depth_no_hops”, which includes our blogs and homepage in the include URI patterns, with full page depth and no forwarding. Exactly what I want :-)
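To sum up, the behaviour comes from the three crawlmode sections. Here is a condensed view of the configuration above (nothing new, just the relevant attributes side by side):

<!-- Sub domain "intranet": full depth, follow links to other hosts (the 1 hop). -->
<section name="crawlmode">
<attrib name="mode" type="string">FULL</attrib>
<attrib name="fwdlinks" type="boolean">yes</attrib>
</section>

<!-- Sub domain "full_depth_no_hops" (blogs, homepage): full depth, no hops. -->
<section name="crawlmode">
<attrib name="mode" type="string">FULL</attrib>
<attrib name="fwdlinks" type="boolean">no</attrib>
</section>

<!-- Main domain specification (fallback, e.g. followed external pages): -->
<!-- crawl only the single page, do not follow any further links.        -->
<section name="crawlmode">
<attrib name="mode" type="string">DEPTH:0</attrib>
<attrib name="fwdlinks" type="boolean">no</attrib>
</section>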

Deploy the configuration

  • Log on to the FAST Search Server
  • Copy your configuration to “%FASTSEARCH%\etc\” (e.g. MyCollection.xml)
  • Start the FAST Search PowerShell Shell
  • Ensure that the crawler is started: nctrl.exe start crawler
  • Register the config: crawleradmin.exe --addconfig “%FASTSEARCH%\etc\MyCollection.xml”
  • (optional) Start a new crawl: crawleradmin.exe --refetch
  • Monitor the fetch log: “%FASTSEARCH%\var\log\crawler\node\fetch\sp” (see the consolidated script below)
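Putting these steps together, the deployment can be scripted from the FAST Search PowerShell shell. This is just a sketch, assuming the config file was copied to %FASTSEARCH%\etc\MyCollection.xml and the default sp collection is used; adjust file name, collection and paths to your environment:

# Run inside the FAST Search PowerShell shell on the FAST Search server.

# Make sure the crawler component is running.
nctrl.exe start crawler

# Register the crawler configuration.
crawleradmin.exe --addconfig "$env:FASTSEARCH\etc\MyCollection.xml"

# (Optional) force a new crawl cycle.
crawleradmin.exe --refetch

# Tail the most recent fetch log of the "sp" collection (log location as in the step above).
$log = Get-ChildItem "$env:FASTSEARCH\var\log\crawler\node\fetch\sp" |
       Sort-Object LastWriteTime | Select-Object -Last 1
Get-Content $log.FullName -Wait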
Fetch log will look like this:
2010-12-03-17:03:43 200 REDIRECT  http://ilovesharepoint.com/ Redirect URI=http://cglessner.de/
2010-12-03-17:03:44 200 NEW       http://bydprojekt.blogspot.com/
2010-12-03-17:03:46 200 REDIRECT  http://www.markus-alt.de/ Redirect URI=http://www.markus-alt.de/blog
2010-12-03-17:03:46 200 NEW       http://aknauer.blogspot.com/
2010-12-03-17:03:53 200 NEW       http://www.andreaseissmann.de/
2010-12-03-17:04:43 302 REDIRECT  http://intranet/ Redirect URI=http://intranet/Pages/Home.aspx
2010-12-03-17:04:43 301 REDIRECT  http://www.markus-alt.de/blog Redirect URI=http://www.markus-alt.de/blog/
2010-12-03-17:04:43 302 REDIRECT  http://dataone.de/Seiten/VariationRoot.aspx Redirect URI=http://dataone.de/de
….

The FAST Search Web crawler is a high-end Web crawler!

Keep in mind that you still need the SharePoint crawler for the keywords, taxonomy and security. The Web crawler will do an additional crawl, and FAST can handle these duplicates. The second crawl is a drawback, but the result is worth it.
I will post more details (hopefully) soon…

Review Collaboration Days 2010

Last week were the Collaboration Days; with about 300 attendees it was the biggest SharePoint event in Switzerland so far. I liked the fact that the event was community driven. Especially many thanks to Samuel Zürcher and Stefan Heinz for their effort and commitment, which contributed significantly to the success of the event. I believe this was a big step forward for the Swiss SharePoint community. What I personally regret is that I had to cancel two sessions (and missed the #SharePint ;) because of hoarseness. Thanks to Thorsten Hans and Nicki Borell for the great backup.

Interview with Eric Swift (Microsoft SharePoint General Manager)


Interview with Nicolette du Toit (Marketing Manager for Office at Microsoft Swiss)


I’m looking forward to the Collaboration Days 2011…

PS: Thanks to the HLMC staff for all the ginger tea. I’m almost fit :-)