Dec 4, 2010

FAST Search Web Crawler – Part I

Many people don’t know that FAST Search for SharePoint contains two Web crawlers (I was one of them ;).
  • Web site indexing connector
    • Use when you have a limited amount of Web sites to crawl, without dynamic content.
  • FAST Search Web crawler
    • Use when you have many Web sites to crawl.
    • Use when the Web site content contains dynamic data, including JavaScript.
    • Use when the organization needs access to advanced Web crawling, configuration and scheduling options.
    • Use when you want to crawl RSS Web content.
    • Use when the Web site content uses advanced logon options.
The Web site indexing connector can be configured in Central Administration, but it is rather limited. I was really dissatisfied because it could not meet my requirements. What are my requirements?
I want to crawl our enterprise wikis, team sites, profiles and especially the social bookmarks. This could easily be achieved with a SharePoint content source. BUT, I also want to crawl external (internet) pages that are referenced within wikis and social bookmarks. A big part of our company knowledge consists of hyperlinks to internet resources like articles and blog posts that have been collected and verified by our employees. I want the crawler to follow the external links and crawl only the linked page (1 hop, page depth 0). Additionally, I want to crawl our internet blogs, which are also an important part of our company knowledge (0 hops, full page depth).
The SharePoint crawler is not able to follow external links at all. So I started to evaluate the simple web crawler (the Web site indexing connector), but unfortunately with no luck. It only lets you specify hops and page depth, with no further constraints. The configuration for the blogs and our homepage could easily be achieved with 0 hops and full page depth. But for the wikis and the social bookmarks I need a more fine-grained configuration: crawl the intranet with 1 hop and full page depth, and after the hop (the external site) crawl just that one page (page depth 0).
This was not possible with the simple web crawler. I couldn’t believe that FAST would not be able to satisfy this requirement…
…and finally I found the FAST Search Web Crawler.
…but I also found a lack of documentation ;-) First of all, forget the UI! This crawler can only be configured through command-line tools and XML configuration files living on the FAST servers. The most important part is to create the Web Crawler Configuration as described here.

Web Crawler Configuration

<?xml version="1.0"?>
<CrawlerConfig>
<!-- Crawl collection name, must be unique for each collection.      -->
<!-- Documents are indexed in the collection by the same name.       -->
<DomainSpecification name="sp">
<SubDomain name="intranet">
<attrib name="start_uris" type="list-string">
<member>http://intranet</member>
</attrib>
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member>http://intranet</member>
<member>http://mysite</member>
</attrib>
</section>
<section name="passwd">
<attrib name="http://intranet" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
<attrib name="http://mysite" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
</section>
<section name="crawlmode">
<!--Crawl depth (use DEPTH:n to do level crawling).-->
<attrib name="mode" type="string">FULL</attrib>
<!--Follow links from one hostname to another (interlinks).-->
<attrib name="fwdlinks" type="boolean">yes</attrib>
<!--Reset crawl level when following interlinks.-->
<attrib name="reset_level" type="boolean">no</attrib>
<attrib name="robots" type="boolean">no</attrib>
<attrib name="max_uri_recursion" type="integer">5</attrib>
</section>
</SubDomain>

<SubDomain name="full_depth_no_hops">
<attrib name="start_uris" type="list-string">
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://aknauer.blogspot.com</member>
<member>http://www.markus-alt.de</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>

</attrib>
<section name="include_uris">
<attrib name="prefix" type="list-string">
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://aknauer.blogspot.com</member>
<member>http://www.markus-alt.de</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>
</attrib>
</section>

<section name="crawlmode">
<!--Crawl depth (use DEPTH:n to do level crawling).-->
<attrib name="mode" type="string">FULL</attrib>
<!--Follow links from one hostname to another (interlinks).-->
<attrib name="fwdlinks" type="boolean">no</attrib>
<!--Reset crawl level when following interlinks.-->
<attrib name="reset_level" type="boolean">no</attrib>
</section>
<attrib name="max_uri_recursion" type="integer">5</attrib>
</SubDomain>

<!-- List of start (seed) URIs. -->
<attrib name="start_uris" type="list-string">
<member>http://intranet</member>
<member>http://dataone.de</member>
<member>http://iLoveSharePoint.com</member>
<member>http://www.markus-alt.de/blog</member>
<member>http://aknauer.blogspot.com</member>
<member>http://bydprojekt.blogspot.com</member>
<member>http://www.andreaseissmann.de</member>
<member>http://cglessner.blogspot.com</member>
</attrib>

<!-- Include and exclude rules. Each type of rule may contain     -->
<!-- the following types: exact, prefix, suffix, regexp and file. -->
<!-- See "include domains" for an example.                        -->

<!-- Include the following hostnames in the crawl. If no hostnames -->
<!-- are specified, the crawler will crawl any hostname unless     -->
<!-- "include_uris" are specified, in which case only URIs         -->
<!-- matching those rules are crawled.                             -->
<section name="include_domains">
<attrib name="exact" type="list-string"></attrib>
<attrib name="prefix" type="list-string"></attrib>
<attrib name="suffix" type="list-string"></attrib>
<attrib name="file" type="list-string"></attrib>
</section>

<!-- Include the following URIs in the crawl. -->
<section name="include_uris"></section>

<!-- The following hostnames will be excluded from the crawl, -->
<!-- even if they were included by include rules above.       -->
<section name="exclude_domains"></section>

<!-- The following URIS will be excluded from the crawl, -->
<!-- even if they were included by include rules above.  -->
<section name="exclude_uris"></section>

<!-- Crawl Mode -->
<section name="crawlmode">
<!-- Crawl depth (use DEPTH:n to do level crawling). -->
<attrib name="mode" type="string">DEPTH:0</attrib>
<!-- Follow links from one hostname to another (interlinks). -->
<attrib name="fwdlinks" type="boolean">no</attrib>
<!-- Reset crawl level when following interlinks. -->
<attrib name="reset_level" type="boolean">no</attrib>
</section>
<section name="passwd">
<attrib name="http://intranet" type="string">
FastCrawl:pass@word1:contoso:auto
</attrib>
</section>

<attrib name="robots" type="boolean">no</attrib>
<attrib name="max_uri_recursion" type="integer">5</attrib>
<!-- Delay in seconds between requests to a single site -->
<attrib name="delay" type="real">60</attrib>
<!-- Length of crawl cycle expressed in minutes -->
<attrib name="refresh" type="real">1440</attrib>
<!-- Maximum number of documents to retrieve from one site. -->
<attrib name="max_doc" type="integer">5000</attrib>
<!-- Let each Node Scheduler crawl this many sites simultaneously. -->
<attrib name="max_sites" type="integer">32</attrib>
<!-- Maximum size of a document (bytes). -->
<attrib name="cut_off" type="integer"> 5000000 </attrib>
<!-- Toggle JavaScript support (using the Browser Engine). -->
<attrib name="use_javascript" type="boolean"> no </attrib>
<!-- Toggle near duplicate detection. -->
<attrib name="near_duplicate_detection" type="boolean">no</attrib>

<!-- Inclusion and exclusion.                                   -->
<!--                                                            -->
<!-- The following section sets up what content to crawl and    -->
<!-- not to crawl.                                              -->
<!-- Only crawl HTTP/HTTPS (e.g., don't crawl FTP). -->
<attrib name="allowed_schemes" type="list-string">
<member> http </member>
<member> https </member>
</attrib>
<!-- Allow these MIME types to be retrieved. -->
<attrib name="allowed_types" type="list-string">
<member> text/* </member>
<member> application/* </member>
</attrib>

</DomainSpecification>
</CrawlerConfig>
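
Before registering a file like this, it can help to print which crawl mode each level ends up with. The following is only a sketch in Python (not part of the FAST tooling): it assumes the element layout shown above, and the names `SAMPLE`, `crawlmode` and `summarize` are made up for this example. The embedded sample is a shortened copy of the configuration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<CrawlerConfig>
<DomainSpecification name="sp">
<SubDomain name="intranet">
<section name="crawlmode">
<attrib name="mode" type="string">FULL</attrib>
<attrib name="fwdlinks" type="boolean">yes</attrib>
</section>
</SubDomain>
<section name="crawlmode">
<attrib name="mode" type="string">DEPTH:0</attrib>
<attrib name="fwdlinks" type="boolean">no</attrib>
</section>
</DomainSpecification>
</CrawlerConfig>"""


def crawlmode(node):
    """Return the attribs of the crawlmode section that is a direct child of node."""
    section = node.find("section[@name='crawlmode']")
    if section is None:
        return {}
    return {a.get("name"): (a.text or "").strip() for a in section.findall("attrib")}


def summarize(xml_text):
    """Map 'collection' and 'collection/subdomain' to their crawlmode settings."""
    root = ET.fromstring(xml_text)
    summary = {}
    for spec in root.findall("DomainSpecification"):
        summary[spec.get("name")] = crawlmode(spec)
        for sub in spec.findall("SubDomain"):
            summary[spec.get("name") + "/" + sub.get("name")] = crawlmode(sub)
    return summary


for level, settings in summarize(SAMPLE).items():
    print(level, settings)
```

To check a real file, read it from disk and pass its text to `summarize` instead of `SAMPLE`.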

The main domain specification for the content collection sp (sp is the default collection for SharePoint) is configured to crawl with a page depth of 0 (DEPTH:0) and not to follow hyperlinks pointing to other domains (fwdlinks=no). The start URIs contain the URLs of our intranet, homepage and blogs. At this level no include URI or domain pattern is specified, which means any URL will match the rule.
In the sub domain “intranet” I defined http://intranet/ as the start URI and, very important, included only URIs that start with http://intranet/ or http://mysite/ in this sub domain. I also set the crawl mode to full depth and enabled following external links. Every URL that starts with one of these prefixes follows that crawl rule. This means that http://intranet/ will be crawled with full depth and links to other domains will be followed. But the followed domains (e.g. “http://microsoft.com”) will not match the intranet sub domain pattern and will fall back to the main domain specification, which only allows crawling that particular page.
There is also a sub domain called “full_depth_no_hops”, which lists our blogs and homepage in its include URI patterns, with full page depth and no forwarding. Exactly what I want :-)
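
The fallback behaviour described here can be modelled in a few lines of Python. This is a simplified model of what the post observes, not the crawler's actual matching algorithm: it assumes plain prefix matching in configuration order, and the names `Rule`, `SUB_DOMAINS`, `MAIN` and `rule_for` are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    prefixes: tuple   # include_uris prefixes; empty tuple = no include rule
    mode: str         # "FULL" or "DEPTH:n"
    fwdlinks: bool    # follow links to other hostnames?


# Values copied from the configuration in this post (blog list shortened).
SUB_DOMAINS = [
    Rule("intranet", ("http://intranet", "http://mysite"), "FULL", True),
    Rule("full_depth_no_hops",
         ("http://dataone.de", "http://cglessner.blogspot.com"), "FULL", False),
]
MAIN = Rule("sp", (), "DEPTH:0", False)


def rule_for(uri: str) -> Rule:
    """Pick the first sub domain whose prefix matches, else the main spec."""
    for rule in SUB_DOMAINS:
        if uri.startswith(rule.prefixes):
            return rule
    return MAIN


for uri in ("http://intranet/wiki/Home.aspx",
            "http://microsoft.com/some/article",
            "http://cglessner.blogspot.com/2010/12/post.html"):
    rule = rule_for(uri)
    print(f"{uri} -> {rule.name}: mode={rule.mode}, fwdlinks={rule.fwdlinks}")
```

An external page reached from the intranet matches no sub domain prefix, so it gets the main settings (DEPTH:0, no forwarding). That is the "1 hop, page depth 0" behaviour.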

Deploy the configuration

  • Log on to the FAST Search Server
  • Copy your configuration to “%FASTSEARCH%\etc\” (e.g. MyCollection.xml)
  • Start the FAST Search PowerShell Shell
  • Ensure that the crawler is started: nctrl.exe start crawler
  • Register the config: crawleradmin.exe --addconfig "%FASTSEARCH%\etc\MyCollection.xml"
  • (optional) Start a new crawl: crawleradmin.exe --refetch
  • Monitor the fetch log: “%FASTSEARCH%\var\log\crawler\node\fetch\sp”
Fetch log will look like this:
2010-12-03-17:03:43 200 REDIRECT  http://ilovesharepoint.com/ Redirect URI=http://cglessner.de/
2010-12-03-17:03:44 200 NEW       http://bydprojekt.blogspot.com/
2010-12-03-17:03:46 200 REDIRECT  http://www.markus-alt.de/ Redirect URI=http://www.markus-alt.de/blog
2010-12-03-17:03:46 200 NEW       http://aknauer.blogspot.com/
2010-12-03-17:03:53 200 NEW       http://www.andreaseissmann.de/
2010-12-03-17:04:43 302 REDIRECT  http://intranet/ Redirect URI=http://intranet/Pages/Home.aspx
2010-12-03-17:04:43 301 REDIRECT  http://www.markus-alt.de/blog Redirect URI=http://www.markus-alt.de/blog/
2010-12-03-17:04:43 302 REDIRECT  http://dataone.de/Seiten/VariationRoot.aspx Redirect URI=http://dataone.de/de
….
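
Judging only from the sample entries above, each fetch log entry seems to consist of a timestamp, an HTTP status, an action, the URI and an optional "Redirect URI=" target. Under that assumption (the real log may contain other entry types), a small Python sketch can split entries into fields for filtering. The names `ENTRY` and `parse_entry` are hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the sample above: timestamp, status, action, URI,
# and optionally "Redirect URI=<target>".
ENTRY = re.compile(
    r"^(\S+)\s+(\d{3})\s+(\S+)\s+(\S+)(?:\s+Redirect URI=(\S+))?$"
)


def parse_entry(line):
    """Return the fields of one fetch log entry, or None if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    stamp, status, action, uri, target = match.groups()
    return {
        "time": datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S"),
        "status": int(status),
        "action": action,
        "uri": uri,
        "redirect": target,  # None when the entry is not a redirect
    }


sample = [
    "2010-12-03-17:03:44 200 NEW       http://bydprojekt.blogspot.com/",
    "2010-12-03-17:04:43 302 REDIRECT  http://intranet/ "
    "Redirect URI=http://intranet/Pages/Home.aspx",
]
for entry in map(parse_entry, sample):
    print(entry["action"], entry["uri"], entry["redirect"])
```

Filtering on `action` or `status` then makes it easy to spot, for example, all redirects in a long log.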

The FAST Search Web crawler is a high-end Web crawler!

Keep in mind that you still need the SharePoint crawler because of the keywords, taxonomy and security. The web crawler will do an additional crawl. FAST can handle these duplicates. The second crawl is a drawback, but the result is worth it.
I will post more details (hopefully) soon…

19 comments:

domsen said...

What is the syntax of the "passwd" value? :S

FastCrawl:pass@word1:contoso:auto

Username:Password:Domain:auto?!

