Well, I’m not sure how many times I can mention KnowledgeLake and “Transactional Content Management” without getting flogged by the blog hosts for peddling our wares again… but here I go again.
So once again, I’ll set the stage with the world I work in every day. KL is all about facilitating document processing all the way from paper to grave. By grave I mean the end of a document lifecycle. So after KL Capture Server blasts a batch of documents into SharePoint we often take advantage of some form of workflow to kick off additional document/account processing.
For example, imagine a lending branch scanning in and releasing a series of documents related to a loan application. Upon receipt of the actual application document a workflow might be initiated. Here’s where it gets interesting. During loan application processing there might be several approval steps that are based on peripheral documents such as income statements and/or loan collateral documentation. If the institution is processing many loans per day, they don’t have time to wait around for an incremental crawl to take an hour or sometimes even 15 minutes.
So what can we do to really tighten down search result availability? Well, in this type of environment I would architect the farm a certain way and set up the incremental crawl for the content source to fire literally every minute. The information below outlines how I would configure the farm to squeeze the absolute most performance out of crawl processing.
Implementation:
- The farm should include a separate (and beefy) machine for the Index Server. I recommend a box with, at MINIMUM, 4 (64bit) CPU cores and 16GB RAM. The Query role should not be enabled on this server. Note that you can’t mix 32bit and 64bit WFEs in the farm, so if you’re running 32bit front ends, stick with a 32bit Index Server.
- In order to get that hefty Index Server to take advantage of available resources we need to force it to use more threads while crawling content. We can do that using 1 of 2 possible techniques:
- OPTION 1: When configuring the “Office SharePoint Server Search” role on the Index Server, set the Indexer Performance to “Maximum”:
- OPTION 2: We can create a crawler impact rule in Application Management => Manage search service => Crawler Impact Rules => Add Rule
NOTE: Crawler Impact Rules take precedence over Indexer Performance settings, and since the default number of simultaneous requests is based on the number of processors on the Index Server, it’s possible that the “Maximum” indexer performance setting could be overridden by the default crawler impact setting (even if no crawler impact rules exist).
- Then, regardless of which option is chosen, we need to set the “Target” Web Front End to be the actual Index Server itself (WFE role must be enabled) or possibly a specific “target” WFE machine that would not be used for serving content to end users.

- Finally, we set the incremental crawl schedule to fire in 1 minute increments. Navigate to the Shared Services Administration page for your SSP. Then click Search Settings => Content sources and crawl schedules => [Content Source Name]. Then click “Create[/Edit] schedule” under the Incremental Crawl field. Set the values as identified below and click OK => OK.

That should do it. You’ve just configured the search service to kick off incremental crawls in 1 minute intervals! Shortly after an incremental crawl completes, if any changes were made to any of the index files, those changes will be propagated out to the Query (Search) servers. Once that propagation has been processed, the content will be available for searching!
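To put rough numbers on what "available for searching" means here, this is a back-of-the-napkin sketch in Python. It assumes a simple model that is mine, not anything SharePoint reports: a document waits for the next scheduled crawl, then for that crawl to finish, then for propagation to the Query servers. The function name and the sample timings are hypothetical.

```python
def seconds_until_searchable(interval_s, crawl_s, propagation_s):
    """Estimate (best, worst) seconds between upload and searchability.

    Assumed model: a document waits for the next scheduled incremental
    crawl, then for that crawl to finish, then for index propagation.
    """
    # Best case: the document lands just before a crawl kicks off.
    best = crawl_s + propagation_s
    # Worst case: the document lands just after a crawl kicked off,
    # so it waits a full interval for the next one.
    worst = interval_s + crawl_s + propagation_s
    return best, worst


# Hypothetical timings: 1 minute schedule, 40 second crawl, 20 second propagation.
best, worst = seconds_until_searchable(60, 40, 20)
print(f"best case: {best}s, worst case: {worst}s")
```

Plug in timings you actually observe in your own farm. The point is only that the schedule interval is one of three pieces, so crawl duration and propagation time matter just as much.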
Monitoring Performance:
- Keep an eye on the “Manage Content Sources” page in the SSP administration site. It will tell you the indexing status.
- You want to watch the Indexing Status field. It will say “Crawling Incremental” when it’s crawling. It should say “Idle” when it is finished crawling. Refresh often to ensure that at some point during the 1 minute interval it is able to finish the incremental crawl.
- If Indexing Status never changes to Idle then unfortunately you don’t have the horsepower to maintain a 1 minute incremental crawl interval. You should increase the interval by 1 minute at a time until you verify that your crawl can complete in the allotted amount of time.
- Keep an eye on the performance of your Index Server, Target Server (if applicable), and your SQL Server. If ramping up crawl performance has created an uncomfortable increase in system resource utilization on ANY of these servers, you can either back down the crawl threads (Crawler Impact Rules/Indexer Performance) or you can increase the incremental crawl interval, or both.
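If you’d rather not find the right interval by trial and error, here is a minimal sketch of the "bump it by 1 minute until the crawl fits" rule. It assumes you have jotted down how long a handful of incremental crawls took (for example by watching the status flip from “Crawling Incremental” to “Idle”). The function name, the `headroom` parameter, and the sample durations are all hypothetical.

```python
import math


def smallest_safe_interval_minutes(crawl_seconds, headroom=1.0):
    """Smallest whole-minute interval that fits the slowest observed crawl.

    crawl_seconds: observed incremental crawl durations, in seconds.
    headroom: optional multiplier to leave slack (1.0 means none).
    """
    if not crawl_seconds:
        raise ValueError("need at least one observed crawl duration")
    slowest = max(crawl_seconds) * headroom
    # Round up to whole minutes, and never go below 1 minute.
    return max(1, math.ceil(slowest / 60))


# Hypothetical observations: the slowest crawl took 71 seconds.
observed = [38, 52, 71, 45]
print(smallest_safe_interval_minutes(observed))
```

With these sample numbers the slowest crawl overruns a 1 minute window, so the sketch suggests 2 minutes. Treat the result as a starting point and still watch the Indexing Status field to confirm it reaches Idle.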
Additional Points of Interest:
- There are many factors related to crawl performance. Everything from how powerful your Index, Target, and SQL Servers are to the I/O performance of the SQL Server databases. The SSP Search database is particularly vulnerable as it can become very large quickly.
- Not all environments are the same. Your mileage may vary. For example, KnowledgeLake solutions often revolve around high volumes of TIFF files. There is no TIFF iFilter available for MOSS out of the box so the “NULL” iFilter is used. This means that the document metadata is gathered and inserted into the property store in the SSP Search database but the actual binary file doesn’t have to be parsed. So our indexing speed is often much faster.
- With such a high load created on the Index Server and SQL Server during crawl processing, it’s recommended that any Full Crawls be scheduled during off peak times (evenings and weekends, etc). This is because the Full Crawl will obey the same threading rules used by the incremental crawl. This could yield a very high level of stress on the SQL Server over an extended period of time.
OK. That’s about all I have to say about that. Once again, the cool thing about SharePoint is that it is so configurable! If the changes I specified here don’t work for you, please don’t flame me! Just back off of the threading or put the settings back where they started and you’ll be just fine.


1 comment
Steve Garner says:
June 1, 2009 at 11:27 pm (UTC -6)
In the "Add Crawler Impact Rule" dialog, the instructions say not to include the protocol, however the example does, using "http://*". Assuming this works, then it’s too bad this wasn’t documented correctly. We have a rule for each of 50+ sites and according to the article this would mean that if we set each at 2 simultaneous requests, the cumulative effect is 100+ simultaneous requests, since rules take precedence.