
Saturday, July 30, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part II

In the first part of this posting I outlined the physical architecture of the SharePoint search engine, mostly the part played by the Sts3 protocol handler and the role of the “SiteData” web service in the crawling process. The very fact that the search engine uses a web service to access and retrieve SharePoint list and site data gave me the first clue as to how I could “hack” the search crawling process. My idea was simple – since the web service is hosted in IIS, I can use some basic URL rewriting so that the call to the real web service is “covertly” redirected to a custom web service, which would either re-implement the original logic of the standard service and add some extra functionality, or simply act as a proxy to the real web service and modify its input or output. Of these two options the first seemed overly complex, while the second was quite sufficient for the goals I had with the implementation of the “hack”.

The thing is that the output XML of the SiteData.GetContent method contains all relevant SharePoint list item and schema data – the “List” option of the ObjectType parameter returns the schema of the SharePoint list, and the “Folder” option returns the list item data (see the sample XML outputs of the web service in the first part of the posting). The problem is that the Sts3 protocol handler “interprets” this XML in its own specific way, which results in the well-known limitations of the crawled properties (and the data stored for them in the search index) that we have for SharePoint content. So what I decided to do was to create a small custom web service that implements the SiteData.GetContent and SiteData.GetChanges methods (with the exact same parameters and notation). Since I wanted to use it as a proxy to the real SiteData web service, I needed a way to pass the call on to it. The simplest option would have been to issue a second web service call from my web service, but the better solution was to instantiate the SiteData web service class directly (Microsoft.SharePoint.SoapServer.SiteData from the STSSOAP assembly, which resides in the _app_bin subfolder of the SharePoint web application) and call its corresponding method. The last trick of the “hack” was to take the XML output of the SiteData.GetContent and SiteData.GetChanges methods and modify it (actually add some additional stuff to it) so that I could get the extra crawled properties I needed into the search index.

So, before going into details about the implementation of the “hack”, I want to point out several arguments as to why you should think twice before starting to use it (it’s become a habit of mine to try to dissuade people from using my own solutions) – I would rather not recommend using it in bigger production environments:

  • It tampers with the XML output of the standard SiteData web service – this may lead to unpredictable behavior of the index engine and result in it not being able to crawl your site(s). The standard XML output of the SiteData service is itself not quite well-formed XML, so before I found the right way to modify it without losing its original formatting I kept getting crawler errors, which I could see in the crawl log of my SSP admin site.
  • There will be a serious performance penalty compared to using just the standard SiteData service. The increased processing time comes from the extra parsing of the output XML and the modifications and additions applied to it.
  • The general argument that this is indeed a hack which gets inside the standard implementation of the SharePoint search indexing – something that won’t sound good to managers and Microsoft guys alike.

Having said that (and if you are still reading) let me give you the details of the implementation itself. The solution of the “hack” can be downloaded from here (check the installation notes below).

The first thing I will start with is the URL rewriting logic that allows the custom web service to be invoked instead of the standard SiteData web service. IIS 7 has built-in support for URL rewriting, but because I was testing on a Windows 2003 server with IIS 6, and because I was a bit too lazy to implement a full proxy for the SiteData web service, I went for the other approach – using a custom .NET HTTP module (the better solution) or simply modifying the global.asax of the target SharePoint web application (the worse but easier to implement solution), which is the one I actually used. The advantage of custom URL rewriting logic over the built-in URL rewriting functionality in IIS 7 is that with the former you can additionally inspect the HTTP request data and apply the URL rewriting only for certain methods of the web service. So in the modified version of the global.asax I do an explicit check for the web service method being called and redirect to the custom web service only if I detect the GetContent or GetChanges methods (all other methods hit the standard SiteData service directly and no URL rewriting takes place). You can see the source code of the global.asax file that I used below:

<%@ Assembly Name="Microsoft.SharePoint"%><%@ Application Language="C#" Inherits="Microsoft.SharePoint.ApplicationRuntime.SPHttpApplication" %>

<script language="C#" runat="server">

protected void Application_BeginRequest(Object sender, EventArgs e)
{
    CheckRewriteSiteData();
}

// Rewrites the request path so that GetContent/GetChanges calls to SiteData.asmx
// are served by the custom StefanSiteData.asmx service instead.
protected void CheckRewriteSiteData()
{
    if (IsGetListItemsCall())
    {
        string newUrl = this.Request.Url.AbsolutePath.ToLower().Replace("/sitedata.asmx", "/stefansitedata.asmx");
        HttpContext.Current.RewritePath(newUrl);
    }
}

// Returns true only for POST requests to SiteData.asmx whose SOAPAction header indicates
// a GetContent or GetChanges call, and only if the rewrite is enabled in web.config.
protected bool IsGetListItemsCall()
{
    if (string.Compare(this.Request.ServerVariables["REQUEST_METHOD"], "post", true) != 0) return false;
    if (!this.Request.Url.AbsolutePath.EndsWith("/_vti_bin/SiteData.asmx", StringComparison.InvariantCultureIgnoreCase)) return false;

    if (string.IsNullOrEmpty(this.Request.Headers["SOAPAction"])) return false;

    string soapAction = this.Request.Headers["SOAPAction"].Trim('"').ToLower();
    if (!soapAction.EndsWith("getcontent") && !soapAction.EndsWith("getchanges")) return false;
    if (string.Compare(ConfigurationManager.AppSettings["UseSiteDataRewrite"], "true", true) != 0) return false;

    return true;
}

</script>

Note also that in the code I check a custom “appSettings” key in the web.config file to determine whether or not to apply the URL rewriting logic. This way you can easily turn the “hack” on or off with a simple tweak in the configuration file of the SharePoint web application.

And this is the code of the custom “SiteData” web service:

[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1), WebService(Namespace = "http://schemas.microsoft.com/sharepoint/soap/")]
public class SiteData
{
    [WebMethod]
    public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }

    [WebMethod]
    public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string LastChangeId, ref string CurrentChangeId, int Timeout, out bool moreChanges)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetChanges(objectType, contentDatabaseId, ref LastChangeId, ref CurrentChangeId, Timeout, out moreChanges);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }
}

As you see, the custom “SiteData” web service implements only the GetContent and GetChanges methods. We don’t need to implement the other methods of the standard SiteData web service, because the URL rewriting redirects to the custom web service only when these two methods are invoked. The two methods in the custom service have the exact same signatures as the ones in the standard SiteData web service. Their implementation is a simple delegation to the corresponding methods of a helper class: SiteDataHelper. Here is the source code of the SiteDataHelper class:

using System;
using Microsoft.SharePoint;
using SP = Microsoft.SharePoint.SoapServer;

namespace Stefan.SharePoint.SiteData
{
    public class SiteDataHelper
    {
        public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string startChangeId, ref string endChangeId, int Timeout, out bool moreChanges)
        {
            // Call the standard SiteData web service class directly (no SOAP round-trip)
            // and then post-process the XML it returns.
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetChanges(objectType, contentDatabaseId, ref startChangeId, ref endChangeId, Timeout, out moreChanges);
            try
            {
                ListItemXmlModifier modifier = new ListItemXmlModifier(new EnvData(), res);
                res = modifier.ModifyChangesXml();
            }
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }

        public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
        {
            SPWeb web = SPContext.Current.Web;
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
            try
            {
                EnvData envData = new EnvData() { SiteId = web.Site.ID, WebId = web.ID, ListId = objectId.TrimStart('{').TrimEnd('}') };
                if ((objectType == ObjectType.ListItem || objectType == ObjectType.Folder) && !securityOnly)
                {
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListItemXml();
                }
                else if (objectType == ObjectType.List)
                {
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListXml();
                }
            }
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }
    }
}

The thing to note here is that the two methods in the SiteDataHelper class create an instance of the SiteData web service class directly (note that this is not a generated proxy class, but the actual web service class implemented in the standard STSSOAP.DLL). The GetContent and GetChanges methods are called on this instance and the string result of the calls is stored in a local variable. The string value that these methods return contains the XML with the list schema or the list item data (depending on the “ObjectType” parameter being “List” or “Folder”). This XML data is then handed to an instance of the custom ListItemXmlModifier class, which handles all XML modifications for both the GetContent and GetChanges methods. Note that for the GetContent method the XML result is passed for modification only if the “ObjectType” parameter has the “ListItem”, “Folder” or “List” value.

I am not going to show the source code of the ListItemXmlModifier class directly in the posting (it is over 700 lines of code); instead I will briefly explain what changes this class makes to the XML returned by the GetContent and GetChanges methods. The modifications are actually pretty simple and semantically there are only two types of changes – they correspond to the result XML of GetContent (ObjectType=List) and GetContent (ObjectType=Folder) respectively (the result XML of the GetChanges method has a more complex structure, but it contains the same two fragments, in one or more occurrences, wherever list and list item changes are present).

Let’s start with a sample XML from the standard SiteData.GetContent(ObjectType=List) method (I’ve trimmed some of the elements for brevity):

<List>

  <Metadata ID="{1d53a556-ae9d-4fbf-8917-46c7d97ebfa5}" LastModified="2011-01-17 13:24:18Z" Title="Pages" DefaultTitle="False" Description="This system library was created by the Publishing feature to store pages that are created in this site." BaseType="DocumentLibrary" BaseTemplate="850" DefaultViewUrl="/Pages/Forms/AllItems.aspx" DefaultViewItemUrl="/Pages/Forms/DispForm.aspx" RootFolder="Pages" Author="System Account" ItemCount="4" ReadSecurity="1" AllowAnonymousAccess="False" AnonymousViewListItems="False" AnonymousPermMask="0" CRC="699748088" NoIndex="False" ScopeID="{a1372e10-8ffb-4e21-b627-bed44a5130cd}" />

  <ACL>

    <permissions>

      <permission memberid='3' mask='9223372036854775807' />

      ....

    </permissions>

  </ACL>

  <Views>

    <View URL="Pages/Forms/AllItems.aspx" ID="{771a1809-e7f3-4c52-b346-971d77ff215a}" Title="All Documents" />

    ....

  </Views>

  <Schema>

    <Field Name="FileLeafRef" Title="Name" Type="File" />

    <Field Name="Title" Title="Title" Type="Text" />

    <Field Name="Comments" Title="Description" Type="Note" />

    <Field Name="PublishingContact" Title="Contact" Type="User" />

    <Field Name="PublishingContactEmail" Title="Contact E-Mail Address" Type="Text" />

    <Field Name="PublishingContactName" Title="Contact Name" Type="Text" />

    <Field Name="PublishingContactPicture" Title="Contact Picture" Type="URL" />

    <Field Name="PublishingPageLayout" Title="Page Layout" Type="URL" />

    <Field Name="PublishingRollupImage" Title="Rollup Image" Type="Note" TypeAsString="Image" />

    <Field Name="Audience" Title="Target Audiences" Type="Note" TypeAsString="TargetTo" />

    <Field Name="ContentType" Title="Content Type" Type="Choice" />

    <Field Name="MyLookup" Title="MyLookup" Type="Lookup" />

    ....

  </Schema>

</List>

The XML contains the metadata properties of the queried SharePoint list, the most important part of which is contained in the Schema/Field elements – the simple definitions of the fields in this list. It is easy to deduce that the fields that the index engine encounters in this part of the XML will be recognized and will appear as crawled properties in the search index. So what if we start adding fields of our own? This won’t do the job by itself, because we will further need list items with values for these “added” fields (we’ll see that in the second XML sample), but it is the first required bit of the “hack”. The custom service implementation will actually add several extra “Field” elements like these:

    <Field Name='ContentTypeId.Text' Title='ContentTypeId' Type='Note' />

    <Field Name='Author.Text' Title='Created By' Type='Note' />

    <Field Name='Author.ID' Title='Created By' Type='Integer' />

    <Field Name='MyLookup.Text' Title='MyLookup' Type='Note' />

    <Field Name='MyLookup.ID' Title='MyLookup' Type='Integer' />

    <Field Name='PublishingRollupImage.Html' Title='Rollup Image' Type='Note' />

    <Field Name='PublishingPageImage.Html' Title='Page Image' Type='Note' />

    <Field Name='PublishingPageContent.Html' Title='Page Content' Type='Note' />

    <Field Name='Env.SiteId' Title='Env.SiteId' Type='Text' />

    <Field Name='Env.WebId' Title='Env.WebId' Type='Text' />

    <Field Name='Env.ListId' Title='Env.ListId' Type='Text' />

    <Field Name='Env.IsListItem' Title='Env.IsListItem' Type='Integer' />

You can immediately notice that these “new” fields are related to already existing fields of the SharePoint list whose schema XML is being modified. You can also see that I used a specific naming convention for the “Name” attribute – with a dot and a short suffix. The crawled properties that the index engine generates will also contain the dot and the suffix, so it will be easy for you to locate them in the “crawled properties” page of the SSP admin site. From the “Name” attribute you can immediately see which original field each new field relates to. In short, the rules for creating these new fields are:

  • For every original lookup field (both single and multiple lookup columns and all derived lookup columns, e.g. user fields) two additional fields are added – with the suffixes “.ID” and “.Text” and field “Type” attribute “Integer” and “Note” respectively.
  • For every original publishing “HTML” and “Image” field one extra field with the “.Html” suffix is added.
  • For all lists the “ContentTypeId.Text” extra field is added with the “Type” attribute set to “Note”.
  • For all lists the additional fields “Env.SiteId”, “Env.WebId”, “Env.ListId”, “Env.IsListItem” are added.
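
Just to make the schema part of the modification more concrete, here is a minimal sketch of what appending such derived field definitions could look like. This is purely illustrative code and not the actual ListItemXmlModifier implementation – it also assumes, for simplicity, that the “List” fragment can be loaded into an XmlDocument (as I mentioned, the real output is not always perfectly well-formed, which is one of the things the real class has to work around):

using System.Xml;

namespace Stefan.SharePoint.SiteData.Samples
{
    // Illustrative sketch only – not the actual ListItemXmlModifier code.
    public static class SchemaModificationSketch
    {
        // Appends ".Text"/".ID" derived field definitions for every lookup/user
        // field found in the Schema element of a GetContent(List) fragment.
        public static void AddDerivedLookupFields(XmlDocument listXml)
        {
            XmlElement schema = (XmlElement)listXml.SelectSingleNode("/List/Schema");
            foreach (XmlElement field in schema.SelectNodes("Field[@Type='Lookup' or @Type='User']"))
            {
                string name = field.GetAttribute("Name");
                string title = field.GetAttribute("Title");

                // ".Text" field – keeps the raw "ID;#Value" lookup string as a Note
                XmlElement textField = listXml.CreateElement("Field");
                textField.SetAttribute("Name", name + ".Text");
                textField.SetAttribute("Title", title);
                textField.SetAttribute("Type", "Note");
                schema.AppendChild(textField);

                // ".ID" field – exposes the integer part of the lookup value
                XmlElement idField = listXml.CreateElement("Field");
                idField.SetAttribute("Name", name + ".ID");
                idField.SetAttribute("Title", title);
                idField.SetAttribute("Type", "Integer");
                schema.AppendChild(idField);
            }
        }
    }
}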

So, we have already extra fields in the list schema, the next step is to have them in the list item data populated with the relevant values. Let me first show you a sample of the standard unmodified XML output of the GetContent(ObjectType=Folder) method (I trimmed some of the elements and reduced the values of some of the attributes for brevity):

<Folder>

  <Metadata>

    <scope id='{5dd2834e-902d-4db0-8db2-4a1da762a620}'>

      <permissions>

        <permission memberid='1' mask='206292717568' />

        ....

      </permissions>

    </scope>

  </Metadata>

  <xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882' xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>

    <s:Schema id='RowsetSchema'>

      <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>

        <s:AttributeType name='ows_ContentTypeId' rs:name='Content Type ID' rs:number='1'>

          <s:datatype dt:type='int' dt:maxLength='512' />

        </s:AttributeType>

        <s:AttributeType name='ows__ModerationComments' rs:name='Approver Comments' rs:number='2'>

          <s:datatype dt:type='string' dt:maxLength='1073741823' />

        </s:AttributeType>

        <s:AttributeType name='ows_FileLeafRef' rs:name='Name' rs:number='3'>

          <s:datatype dt:type='string' dt:lookup='true' dt:maxLength='512' />

        </s:AttributeType>

        ....

      </s:ElementType>

    </s:Schema>

    <scopes>

    </scopes>

    <rs:data ItemCount='2'>

      <z:row ows_ContentTypeId='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390064DEA0F50FC8C147B0B6EA0636C4A7D400E595F4AC9968CC4FAD1928288BC9885A' ows_FileLeafRef='1;#Default.aspx' ows_Modified_x0020_By='myserver\sstanev' ows_File_x0020_Type='aspx' ows_Title='Home' ows_PublishingPageLayout='http://searchtest/_catalogs/masterpage/defaultlayout.aspx, Welcome page with Web Part zones' ows_ContentType='Welcome Page' ows_PublishingPageImage='' ows_PublishingPageContent='some content' ows_ID='1' ows_Created='2010-12-20T18:53:18Z' ows_Author='1;#Stefan Stanev' ows_Modified='2010-12-26T12:45:31Z' ows_Editor='1;#Stefan Stanev' ows__ModerationStatus='0' ows_FileRef='1;#Pages/Default.aspx' ows_FileDirRef='1;#Pages' ows_Last_x0020_Modified='1;#2010-12-26T12:45:32Z' ows_Created_x0020_Date='1;#2010-12-20T18:53:19Z' ows_File_x0020_Size='1;#6000' ows_FSObjType='1;#0' ows_PermMask='0x7fffffffffffffff' ows_CheckedOutUserId='1;#' ows_IsCheckedoutToLocal='1;#0' ows_UniqueId='1;#{923EEE29-44AB-4D1B-B65B-E3ECEAE1353E}' ows_ProgId='1;#' ows_ScopeId='1;#{5DD2834E-902D-4DB0-8DB2-4A1DA762A620}' ows_VirusStatus='1;#6000' ows_CheckedOutTitle='1;#' ows__CheckinComment='1;#' ows__EditMenuTableStart='Default.aspx' ows__EditMenuTableEnd='1' ows_LinkFilenameNoMenu='Default.aspx' ows_LinkFilename='Default.aspx' ows_DocIcon='aspx' ows_ServerUrl='/Pages/Default.aspx' ows_EncodedAbsUrl='http://searchtest/Pages/Default.aspx' ows_BaseName='Default' ows_FileSizeDisplay='6000' ows_MetaInfo='...' ows__Level='1' ows__IsCurrentVersion='1' ows_SelectTitle='1' ows_SelectFilename='1' ows_Edit='0' ows_owshiddenversion='26' ows__UIVersion='6656' ows__UIVersionString='13.0' ows_Order='100.000000000000' ows_GUID='{2C80A53D-4F38-4494-855D-5B52ED1D095B}' ows_WorkflowVersion='1' ows_ParentVersionString='1;#' ows_ParentLeafName='1;#' ows_Combine='0' ows_RepairDocument='0' ows_ServerRedirected='0' />

      ....

    </rs:data>

  </xml>

</Folder>

The list item data is contained below the “rs:data” element – there is one “z:row” element for every list item. The attributes of the “z:row” element contain the field values of the corresponding list item. You can see that the attributes already have the “ows_” prefix, just like all crawled properties in the “SharePoint” category. Notice that the attributes for lookup fields contain the unmodified item field data, but the publishing “HTML” and “Image” columns are already modified – all HTML markup has been removed from them (for the “Image” type column this means that they become empty, since all the data they contain is in markup).

And let’s see the additional attributes that the custom web service adds to the “z:row” elements of the list item data XML:

ows_ContentTypeId.Text='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF3900242457EFB8B24247815D688C526CD44D0005E464F1BD83D14983E49C578030FBF6'

ows_PublishingPageImage.Html='&lt;img border="0" src="/PublishingImages/newsarticleimage.jpg" vspace="0" style="margin-top:8px" alt=""&gt;'

ows_PublishingPageContent.Html='&lt;b&gt;some content&lt;/b&gt;'

ows_Author.Text='1;#Stefan Stanev'

ows_Author.ID='1'

ows_Editor.Text='1;#Stefan Stanev'

ows_Editor.ID='1'

ows_Env.SiteId='ff96067a-accf-4763-8ec1-194f20fbf0f5'

ows_Env.WebId='b2099353-41d6-43a7-9b0d-ab6ad87fb180'

ows_Env.ListId='Pages'

ows_Env.IsListItem='1'

Here is how these fields are populated/formatted (as I mentioned above all the data is retrieved from the XML itself or from the context of the service request):

  • the lookup-derived fields with the “.ID” and “.Text” suffixes – both get their values from the “parent” lookup column: the former is populated with the starting integer value of the lookup field, the latter is set to the unmodified value of the original lookup column. When the search index generates the corresponding crawled properties, the “.ID” field can be used as a true integer property, and the “.Text” one, although containing the original lookup value, will be treated as a simple text property by the index engine (remember that in the list schema XML the type of this extra field was set to “Note”). So what is the difference between the “.Text” field and the original lookup column in the search index? The difference is that the value of the original lookup column gets trimmed in the search index and contains only the text value, without the integer part preceding it. If you issue a SQL search query against a managed property mapped to the crawled property of a lookup field, you will only be able to retrieve the textual part of the lookup value (this holds also for the filtering and sorting operations for this field type). With the “.Text” derived field, on the other hand, you have access to the original unmodified value of the lookup field.
  • the fields derived from the “HTML” and “Image” publishing field type with the “.Html” suffix – they are populated with the original values of the corresponding fields with the original markup intact. Since the values of the original fields in the list item data XML are already trimmed the original values are retrieved with a simple trick. The “z:row” element for every list item contains the “ows_MetaInfo” attribute which contains a serialized property bag with all properties of the underlying SPFile object for the current list item. This property bag happens to contain all list item field values which are non-empty. So what I do in this case is to parse the value of the ows_MetaInfo attribute and retrieve the unmodified values for all “Html” and “Image” fields that I need. An important note here – the ows_MetaInfo attribute (and its corresponding system list field – MetaInfo) is available only for document libraries and is not present in non-library lists, which means that this trick is possible only for library-type lists.
  • the ows_ContentTypeId.Text field gets populated from the value of the original ows_ContentTypeId field/attribute. The difference between the two is that the derived one is defined in the schema as a “Note” field so its value is treated by the search index as a text property.
  • the ows_Env.*, fields get populated from service contextual data (see the implementation of the SiteDataHelper class). For the implementation of the XML modifications for the SiteData.GetChanges method these values are retrieved from the result XML itself. The value of the ows_Env.IsListItem is always set to 1 (its purpose is to be used as a flag defining a superset of the standard “isdocument” managed property).
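
As a tiny illustration of the first bullet above, this is roughly what deriving the two extra values from a raw lookup string like “1;#Stefan Stanev” amounts to (illustrative code only, not the actual implementation):

// Illustrative only: a raw lookup value keeps the "ID;#Text" format.
string rawLookup = "1;#Stefan Stanev";

// The ".Text" derived attribute simply gets the unmodified value.
string textValue = rawLookup;

// The ".ID" derived attribute gets the integer part before the ";#" separator.
int separatorIndex = rawLookup.IndexOf(";#");
string idValue = separatorIndex > 0 ? rawLookup.Substring(0, separatorIndex) : rawLookup;
// idValue is now "1" and can be treated by the index engine as a true integer property.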

Installation steps for the “hack” solution

  1. Download and extract the contents of the zip archive.
  2. Build the project in Visual Studio (it is a VS 2008 project).
  3. The assembly file (Stefan.SharePoint.SiteData.dll) should be deployed to the GAC.
  4. The StefanSiteData.asmx file should be copied to the {your 12 hive root}\ISAPI folder.
  5. The global.asax file should be copied to the root folder of your SharePoint web application. Note that you will have to back up the original global.asax file before you overwrite it with this one.
  6. Open the web.config file of the target SharePoint web application and add an “appSettings” “add” element with key “UseSiteDataRewrite” and value “true”, i.e. <add key="UseSiteDataRewrite" value="true" />.
  7. Note that if you have more than one front-end server in your farm, you should repeat steps 3-6 on all machines.
  8. After the installation is complete you can start the search crawler (full crawl) from the SSP admin site. It is a good idea to have a content source just for the web application for which the custom SiteData service is enabled, so that you can immediately see the results of the custom service.
  9. After the crawling is complete you should check the crawl log for errors – look for unusual errors that were not occurring before the installation of the custom SiteData service.
  10. If there’re no errors in the crawl log you can check the “crawled properties” page in the SSP admin site – the new “dotted” crawled properties should be there and you can now create new managed properties that can be mapped to them.
  11. Note that the newly created managed properties are not ready for use before you run a second full crawl for the target content source.
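
To give you an idea of what you gain once the new managed properties are in place, here is a rough sketch of a full text SQL search query that filters on the integer part of a lookup field. The managed property name “MyLookupID” is made up for the example – it is assumed to have been created by you and mapped to the “ows_MyLookup.ID” crawled property:

using System.Data;
using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Query;
using Microsoft.SharePoint;

public static class SearchQuerySample
{
    // Sketch only: "MyLookupID" is a hypothetical managed property mapped to the
    // "ows_MyLookup.ID" crawled property produced by the modified SiteData output.
    public static DataTable QueryByLookupId(string siteUrl, int lookupId)
    {
        using (SPSite site = new SPSite(siteUrl))
        {
            FullTextSqlQuery query = new FullTextSqlQuery(ServerContext.GetContext(site));
            query.QueryText = "SELECT Title, Path, MyLookupID FROM SCOPE() WHERE MyLookupID = " + lookupId;
            query.ResultTypes = ResultType.RelevantResults;
            query.RowLimit = 50;

            ResultTableCollection results = query.Execute();
            ResultTable relevantResults = results[ResultType.RelevantResults];

            // ResultTable implements IDataReader, so it can be loaded into a DataTable.
            DataTable table = new DataTable();
            table.Load(relevantResults);
            return table;
        }
    }
}

This kind of filtering on the lookup ID part of the value is exactly what is not possible with the standard crawled properties for lookup fields.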

Tuesday, May 24, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part I

When I started preparing this posting I realized that it would be too long, so I decided to split it into two parts – the first one being more introductory and explaining some aspects of the inner workings of the SharePoint search engine, and the second one concentrating on the actual implementation of the “hack”. Then when I started the first part, which you are now reading, I felt that the posting’s title itself already raises several questions, so it would be a good idea to start with a brief Q & A which will help you get into the discussed matter. This is a short list of questions that you may have also asked yourself two sentences into the posting:

  1. What is the relation between the SharePoint search service and the SharePoint SiteData web service in the first place?
  2. Why would I need to change the working of the SharePoint search, what can be the reasons and motives for that?
  3. Is it a wise idea and would you recommend using this hack?

And the answers come promptly:

  1. To answer this one we need to have a closer look at the internal workings of the SharePoint search index engine. If you are not familiar with some core concepts and basic terminology like index engine, content sources, filter daemon, protocol handlers, IFilters, I would recommend that you first check these two MSDN articles – here (for a high level architecture overview) and here (for a high level overview of the protocol handlers). Let me start with several words about the protocol handlers – these are basically responsible for crawling the different types of content sources. They are implemented as COM components written in unmanaged code (C or C++). If you are familiar with COM Interop you will know that it is also possible to create COM components using .NET and managed code and in fact there is a sample .NET protocol handler project in CodePlex. I am not sure though how wise it is to create your own protocol handler with managed code (apart from the fact that it is quite complex to start with) knowing that all existing protocol handlers by Microsoft and third party vendors are written in unmanaged code.
    You can check the available index engine protocols and their matching protocol handlers for your SharePoint installation in the Windows registry:

    [image: the registry key listing the search protocols and their matching protocol handlers]

    You can see that there are different protocol handlers for the different types of content sources – SharePoint sites, external web sites, file shares, BDC, etc. The name of the protocol handler (the “Data” column in the image above) is actually the ProgID (in COM terms) of the COM component that implements the handler.
    In this posting we are interested in just one of the protocol handlers – the one for the Sts3 protocol, which is responsible for crawling the content of SharePoint sites. The same handler is also used for the Sts3s protocol (see the image), which again covers SharePoint sites, but ones accessed over HTTPS (SSL). And now the interesting part – how does the Sts3 protocol handler traverse the content of SharePoint? The answer is actually also the answer to the first question in the list above – it calls the standard SharePoint SiteData web service (/_vti_bin/SiteData.asmx). If you wonder why it doesn’t use, for instance, the SharePoint object model directly – the main reason, I think, is greater scalability (not to mention that it would be at best challenging to call managed from unmanaged code). The better scalability comes from the fact that the handler can be configured to call the SiteData web service on all available web front-end servers in the SharePoint farm, which distributes the workload better and makes better use of the farm’s resources. Later in the posting I will give you more details about how you can check and monitor the calls to the SiteData web service from the crawl engine, as well as some additional information about the exact methods of the SiteData service that are used for traversing the content of the SharePoint sites.
  2. As I already mentioned in the answer to the first question, this posting deals specifically with the search functionality that targets SharePoint content. So the motives behind this hack are directly related to using and querying SharePoint data. The reasons can be separated into two groups – the first one is more general: why use SharePoint search and not some other available method; the second one is more specific: what is not available or not well implemented in the SharePoint search query engine and needs to be changed or improved.
    Let me start with the first group – out of the available methods to query and aggregate SharePoint content in the form of list item and document metadata, SharePoint search doesn’t even come as the first or preferred option. Normally you would use the SharePoint object model with the SPList, SPListItem and SPQuery classes (for a single SharePoint list), or the SPSiteDataQuery class with the SPWeb.GetSiteData method (or alternatively the CrossListQueryInfo and CrossListQueryCache classes if you use the publishing infrastructure) for querying and retrieving data from many lists in one site collection (a minimal SPSiteDataQuery sketch is shown right after this Q & A list). The cross list query functionality is actually used directly in the standard SharePoint content by query web part (CQWP), so even without using custom code you may have experienced certain issues with it. Probably the biggest one is performance – maybe you’ve never seen it, or maybe you are well aware of it. It becomes a real issue only when your site collection gets very big in terms of the number of sub-sites and you use queries that aggregate data from most of the available sub-sites. Add to these two conditions the number of list items in the individual SharePoint lists, which further degrades the performance. So when does this become a visible issue? You can have various combinations of the conditions above, but if you query more than one hundred sub-sites and/or you have more than several thousand items in every list (or many of the lists), you may see page loading times ranging from several seconds to well above a minute in certain extreme cases. And … this is an issue even with the built-in caching capabilities of the cross list query classes. As to why the caching doesn’t always solve the performance issue – there are several reasons (and cases) for that: first, there are specific CAML queries for which the caching is not used at all (e.g. queries that contain the <UserID /> element); secondly, even if the caching works well, the first load that populates the cache will still be slow, etc.
    Let me now briefly explain why the cross list query has such performance issues (only in the above mentioned cases). The main reason is the fact that the content database keeps all list data (all list items in the whole site collection – it may even contain more than one site collection) in a highly denormalized table called AllUserData. This design decision was deliberate, because it allows all the flexibility that we know with SharePoint lists in terms of the ability to add, modify and customize fields – which unfortunately comes with a price in some rare cases like this one. Let’s see how the cross list query works from a database perspective with a real example – say we have a site collection with one hundred sub-sites, each containing an “announcements” list with two custom fields, “publication date” and “expiration date”. On the home page of the root site we want to place a CQWP that displays the latest five announcements (aggregated from all sub-sites) ordered by publication date and whose expiration date is in the future. Knowing that all list item data lives in a single database table, you might think that the aggregated data could be retrieved with a single SQL query but, alas, this is not the case. If you have a closer look at the AllUserData table you will find out that it contains columns whose names go: nvarchar1, nvarchar2, …, int1, int2, …, datetime1, datetime2, … – these are the underlying storage placeholders for the various types of SharePoint fields in your lists. Obviously “publication date” and “expiration date” will be stored in two of the “datetimeN” SQL columns, but the important thing is that the mappings may be totally different for the different lists, e.g. for list 1 “publication date” and “expiration date” may map to datetime1 and datetime2 respectively, whereas for list 2 they may map to datetime3 and datetime4. This heterogeneous storage pattern makes the retrieval much more complex and costly – the object model first needs to extract the metadata of all target lists in these one hundred sites (which contains the field mappings) and after that retrieve the items from all one hundred lists one by one, making a SQL union with the correct field-to-column mappings and applying the filtering and sorting after that. If you are interested in checking that yourself, you can use the SQL profiler tool that comes with the MS SQL management studio.
    Having seen the performance issues that may arise with the built-in cross list query functionality of SharePoint, it is quite natural to check what SharePoint search can offer as an alternative. It obviously performs much faster in these cases and allows data retrieval and metadata filtering, but its results and functionality are not exactly identical to those of the cross list query. And here we come to the second group of motives for implementing this kind of hack that I mentioned at the beginning of this paragraph. So let’s see some of the things that we’re missing in SharePoint search. From a data retrieval perspective – text fields, especially the ones that contain HTML, are returned by the search query with the markup stripped out (this is especially embarrassing for the publishing Image field type, whose values are stored as markup and get retrieved virtually empty by the search query); the “content type id” field is never crawled and cannot be used as a crawled and managed property; the “lookup” field type (and derived field types such as the “user” type) is retrieved as plain text, with the lookup item ID contained in the field value stripped out; etc. From a filtering and sorting perspective you have pretty much everything you need – you can perform comparison operations on the basic value types (text, date, integer and float) and get correct sorting based on the respective field type. What is missing is, for instance, filtering on “lookup” (including “user”) fields based not on the textual value but on the integer (lookup ID) value – this part of the lookup field value is simply ignored by the search crawler (we’ll come to that in the next part of the posting). For the same reason you cannot filter on the “content type id” field.
    The next question is, of course, whether it is possible to achieve these things with SharePoint search – the answer is yes, and the hack that is the subject of this posting does exactly that.
  3. And lastly the third and most serious one – most of the time I am overly critical towards my own code and solutions, so I would normally not recommend using this hack (I will publish the source code in the second part of the posting), at least not in production environments. I would suggest using it only sparingly in development/testing or small intranet environments, if at all. I suppose that the material in the posting about some of the inner workings of the indexing engine and the SiteData web service is interesting and useful by itself.
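
To make the cross list query scenario from the second answer a bit more concrete, here is a minimal sketch of the “latest five announcements” aggregation mentioned above, using the SPSiteDataQuery class (the internal field names “PublicationDate” and “ExpirationDate” are just assumptions for the example):

using System.Data;
using Microsoft.SharePoint;

public static class CrossListQuerySample
{
    // Sketch of the "latest five announcements" aggregation described above.
    // "PublicationDate" and "ExpirationDate" are assumed internal field names.
    public static DataTable GetLatestAnnouncements(SPWeb rootWeb)
    {
        SPSiteDataQuery query = new SPSiteDataQuery();
        query.Webs = "<Webs Scope='SiteCollection' />";
        query.Lists = "<Lists ServerTemplate='104' />"; // 104 = Announcements lists
        query.ViewFields = "<FieldRef Name='Title' /><FieldRef Name='PublicationDate' Nullable='TRUE' /><FieldRef Name='ExpirationDate' Nullable='TRUE' />";
        query.Query =
            "<Where><Gt><FieldRef Name='ExpirationDate' /><Value Type='DateTime'><Today /></Value></Gt></Where>" +
            "<OrderBy><FieldRef Name='PublicationDate' Ascending='FALSE' /></OrderBy>";
        query.RowLimit = 5;

        // Under the hood this still has to touch every matching list in the site collection.
        return rootWeb.GetSiteData(query);
    }
}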

So, let’s now see how the index engine, or more precisely the Sts3 protocol handler, calls the SiteData web service. Basically you can track the SiteData.asmx invocations by simply checking the IIS logs of your web front-end server or servers (you have to have IIS logging enabled beforehand). If you first run a full crawl on one of your “SharePoint Site” content sources from the SSP admin site and after it completes open the latest IIS log file, you will see many requests to _vti_bin/SiteData.asmx and also to all pages and documents available in the SharePoint sites listed in the selected content source. It is logical to conclude that the protocol handler calls the SiteData web service to traverse the existing SharePoint hierarchy and fetch the available metadata for the SharePoint list items and documents, and that it then also opens every page and document and scans/indexes their contents so that they are later available for the full text search queries.

The checking of the IIS logs was in fact the first thing that I tried when I began investigating the SiteData-SharePoint Search relation but I was also curious to find out what method or methods exactly of the SiteData web service get called when the crawler runs. If you have a look at the documentation of the SiteData web service you will see that some of its methods like GetSite, GetWeb, GetListCollection, GetList, GetListItems, etc. look like ideal candidates for traversing the SharePoint site hierarchy starting from the site collection level down to the list item level. The IIS logs couldn’t help me here because they don’t track the POST body of the HTTP requests, which is exactly the place where the XML of the SOAP request is put. So I needed a little bit more verbose tracking here and I quickly came up with a bit ugly but working solution – I simply modified the global.asax of my test SharePoint web application like this:

<%@ Assembly Name="Microsoft.SharePoint"%><%@ Application Language="C#" Inherits="Microsoft.SharePoint.ApplicationRuntime.SPHttpApplication" %>
<%@ Import Namespace="System.IO" %>

<script RunAt="server">

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        TraceUri();
    }

    protected void TraceUri()
    {
        const string path = @"c:\temp\wssiis.log";
        try
        {
            HttpRequest request = HttpContext.Current.Request;
            DateTime date = DateTime.Now;
            string httpMethod = request.HttpMethod;
            string url = request.Url.ToString();
            string soapAction = request.Headers["SoapAction"] ?? string.Empty;
            string inputStream = string.Empty;

            if (string.Compare(httpMethod, "post", true) == 0)
            {
                request.InputStream.Position = 0;
                StreamReader sr = new StreamReader(request.InputStream);
                inputStream = sr.ReadToEnd();
                request.InputStream.Position = 0;
            }

            string msg = string.Format("{0}, {1}, {2}, {3}, {4}\r\n", date, httpMethod, url, soapAction, inputStream);

            File.AppendAllText(path, msg);
        }
        catch { }
    }

</script>

The code is pretty simple – it hooks onto the BeginRequest event of the HttpApplication class which effectively enables it to track several pieces of useful information for every HTTP request made against the target SharePoint web application. So, apart from the date and time of the request, the requested URL and the HTTP method (GET, POST or some other) I also track the “SoapAction” HTTP header which contains the name of the SOAP method for a web service call and also the POST content of the HTTP request which contains the XML of the SOAP request (in the case of a web service call). The SOAP request body contains all parameters that are passed to the web service method call, so by tracking this I could have everything I wanted – the exact web service method being called and the exact values of the parameters that were being passed to it. Just to quickly make an important note about this code – don’t use it for anything serious, I created it only for testing and quick tracking purposes.

With this small custom tracking of mine enabled, I ran a full crawl of my test web application again, and after the crawl completed I opened the log file (the tracking code writes to a plain text file in a hard-coded disk location). To my surprise I saw that only two methods of the SiteData web service were called – GetContent and GetURLSegments. The real job was obviously done by the GetContent method – there were about 30-35 calls to it, and only one call to GetURLSegments. You can see the actual trace file that I got after running the full crawl here. My test web application was very small, containing only one site collection with a single site, so the trace file is small and easy to follow. The fourth column contains something that looks like a URL address, but this is in fact the value of the “SoapAction” HTTP header – the last part of this “URL” is the actual method that was called in the SiteData web service. The fifth column contains the XML of the SOAP request that was used for the web service call – inside you can see the parameters that were passed to the SiteData.GetContent method.

If you check the MSDN documentation of the SiteData.GetContent method you will see that its first parameter is of type “ObjectType”, which is an enumeration. The possible values of this enumeration are: VirtualServer, ContentDatabase, SiteCollection, Site, Folder, List, ListItem, ListItemAttachments. As one can deduce from this enumeration, the GetContent method is designed and obviously used for hierarchy traversal and metadata retrieval (the MSDN article explicitly mentions that in the yellow note box at the bottom). If you check the sample trace file from my test site again, you will see that the calls made by the crawler indeed start with a call using ObjectType.VirtualServer and continue down the hierarchy with ObjectType.ContentDatabase, ObjectType.SiteCollection, etc. You may notice something interesting – after the calls with ObjectType.List there are no calls with ObjectType.ListItem. Actually there is only one call to GetContent using ObjectType.ListItem in the trace file, and it is issued for the list item corresponding to the home (welcome) page of the site, which in my case was a publishing page. The other method of the SiteData web service – GetURLSegments – is also called for the home page only; it basically returns the containing site and list of the corresponding list item given the URL of the page.

And if you wonder which option is used for retrieving list items – it is neither ObjectType.List nor ObjectType.ListItem. The former returns an XML fragment containing mostly the list metadata, and the latter the metadata of a single list item. The option that actually returns the metadata of multiple list items is ObjectType.Folder. Even though the name is a bit misleading, this option can be used in two cases – to retrieve the files from a folder that is not in a SharePoint list or library (e.g. the root folder of a SharePoint site), or to retrieve the list items from a SharePoint list/library. If you check the sample trace file you will see that the GetContent method is not called with ObjectType.Folder for all lists – this is because the crawler is smart enough not to call it for empty lists (and most of the lists in my site were empty). The crawler knows that a particular list is empty from the preceding GetContent (ObjectType=List) call, which returns the ItemCount property of the list.

There is one other interesting thing about how the crawler uses GetContent with ObjectType.Folder – if the list contains a large number of items, the crawler doesn’t retrieve all of them with one call to GetContent, but instead reads them in chunks of two thousand items each (the logic in SharePoint 2010 is even better – it determines the number of items in a batch depending on the number of fields that the items in the particular list have). And … about the return value of the GetContent method – in all cases it is an XML document that contains the metadata of the requested object or objects. It is interesting to note that the XML also contains the permissions data associated with the object, which is obviously used by the indexing engine to maintain ACL-s for the various items in its index and allows the query engine to apply appropriate security trimming based on the permissions of the user that issues the search query. For the purposes of this posting we are mostly interested in the result XML of the GetContent calls with ObjectType=List and ObjectType=Folder – here’re two sample XML fragments from GetContent (List) and GetContent (Folder) calls. Well, indeed they seem quite … SharePoint-ish. Except for the permissions parts, GetContent (Folder) yields pretty much the same XML as the standard Lists.GetListItems web service method. Have a look at the attributes containing the field values of the list items – they start with the well-known “ows_” prefix, which is the very same prefix that we see in the crawled properties associated with SharePoint content. Another small detail to note is that the GetContent (Folder) XML is not exactly well-formed – for example, it contains improperly escaped new-line characters inside attribute values (not that this prevents it from rendering normally in IE) – I will come back to this point in the second part of this posting.
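
By the way, if you want to reproduce one of these calls yourself outside the crawler, you can simply add a web reference to /_vti_bin/SiteData.asmx and invoke GetContent with the different ObjectType values. Here is a rough sketch – the “SiteDataWs” proxy namespace, the URL and the parameter values are just placeholders:

// "SiteDataWs" is the namespace chosen when adding the web reference to SiteData.asmx.
SiteDataWs.SiteData siteData = new SiteDataWs.SiteData();
siteData.Url = "http://searchtest/_vti_bin/SiteData.asmx";
siteData.Credentials = System.Net.CredentialCache.DefaultCredentials;

// ObjectType=Folder returns the list item metadata of the list identified by objectId,
// with lastItemIdOnPage acting as the paging bookmark between the item chunks.
string lastItemIdOnPage = null;
string xml = siteData.GetContent(
    SiteDataWs.ObjectType.Folder,
    "{00000000-0000-0000-0000-000000000000}",   // the id (GUID) of the target list - placeholder
    "",                                          // folderUrl
    null,                                        // itemId
    true,                                        // retrieveChildItems
    false,                                       // securityOnly
    ref lastItemIdOnPage);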

So far so good, but the results above are from a full crawl. What happens when we run an incremental crawl? Have a look at the sample trace file that I got when I ran an incremental crawl on my test web application after I had changed several list items and had created a new sub-site and several lists in it. You can see that it again contains several calls to SiteData.GetContent, one call to SiteData.GetURLSegments and, this time, one call to SiteData.GetChanges. If you wonder why there is only one call to SiteData.GetChanges – a quick look at the result XML of this method will explain most of it. If you open the sample XML file you will see that the XML is something like a merged document from the results of the GetContent method for all levels from “ContentDatabase” down to “ListItem” … but containing only the parts of the SharePoint hierarchy whose leaf descendants (that is, list items) got changed since the time of the last crawl. So basically, with one call the crawler can get all the changes in the entire content database … well, almost. Unless there are too many changes – in that case the method is called several times, each time retrieving a certain number of changes and then continuing from the reached bookmark. If you check the documentation of the GetChanges method in MSDN you will see that its first parameter is again of type ObjectType. Unlike the GetContent method, however, you can use it here only with the “ContentDatabase” and “SiteCollection” values (the rest of the possible values of the enumeration are ignored and the returned XML is the same as with the “ContentDatabase” option). And one last thing about the incremental crawl – the calls to the GetContent method are made only for new site collections, sites and lists (which is to be expected). The metadata for new, updated and deleted list items in existing lists is retrieved with the call to the GetChanges method.

So, this was in short the mechanism of the interaction between the SharePoint Search 2007 indexing engine (the Sts3 protocol handler) and the SharePoint SiteData web service. In the second part of this posting I will continue with explaining how I got to hack the SiteData web service and what the results of this hack were for the standard SharePoint search functionality.