Stefan Stanev's SharePoint blog: MOSS 2007


Friday, January 20, 2012

SharePoint WCM HTML clean-up

SharePoint is a great web content management system: it is fast, scalable and reliable, and it comes with lots of out-of-the-box components, web parts, etc. that often make the life of content managers much easier. There are certain aspects of its WCM capabilities, though, that sometimes need a little more time or some hacking before they work properly, or the way you want them to work. One such thing is the HTML code that appears on your SharePoint pages: several things in the HTML generated by the SharePoint WCM system make it look not quite neat and tidy. The SharePoint UI is built on top of the asp.net Web Forms technology, so SharePoint inherits some of the HTML issues directly from its asp.net foundation. The problem is that with asp.net Web Forms you don't have full control over the HTML that is generated in your page. With the advent of asp.net MVC this was one of the arguments in favor of the latter, because with asp.net MVC the developer does have full control over the generated HTML. Unfortunately SharePoint doesn't use the MVC framework, so many of us have at one point or another had to struggle with the extra HTML bits that get injected into the SharePoint aspx page. Some examples of bits that come directly from asp.net are the many system hidden fields that appear in the "form" element (the infamous "ViewState" field among them, which can grow very big), the inline JavaScript blocks with "form" submit helpers, etc. Items specific to SharePoint that further inflate your HTML include the two "core" files, "core.js" and "core.css" (both quite big), and the many nested HTML "table" elements that the containing WebPartZone controls render around your web parts, which hurt most when you want your HTML to contain only nice looking "div" elements.
The dilemma here is that SharePoint uses in-place page content editing, and a single aspx file handles both the editing process and the actual display of the page to the end user, so the items (web controls in most cases) responsible for these extra (but necessary) HTML artifacts cannot be removed directly from the page. We need them for page content editing, but we also need to somehow get rid of them, hide them, or at least suppress the extra HTML that they generate, so that the page in display mode contains only the bare minimum of HTML needed to render the page contents. In the WCM context I assume that the SharePoint site is publicly accessible, or at least allows anonymous access within some internal network, so hiding the extra artifacts is necessary only when the pages are accessed anonymously. This is a pretty broad scenario, and this particular setup is quite popular in the WCM function of SharePoint. The next question is how many of the "extra" SharePoint HTML artifacts are unwanted in your scenario. For simple content pages with only SharePoint field controls or standard content editor web parts, you won't need any of the above-mentioned bits in display mode with anonymous access. This is especially true when your HTML design is very different from the standard SharePoint page design.
So, after several years and several partial solutions I decided to wrap up the whole thing in a single solution. And it turned out that the solution was pretty easy and simple to develop, and luckily - very easy to use too. It is actually a single user control that you need to place in one place only in your master page. And that's all. The control has several public properties that can be used to configure it, so that it suppresses some of the SharePoint artifacts that it can handle but not others (I will explain these in detail shortly). I chose to create the control as a user control (and there is no code behind assembly, the code is placed inline in the ascx file directly) because this way you have the two deployment options - to either place it in the TEMPLATE/CONTROLTEMPLATES folder of the SharePoint hive, or to upload it to the Master Page Gallery of your site and reference it in your master page from there (I explained this technique in this recent posting of mine). Of course, the code can be easily transferred to a simple web control and put in an assembly of yours.
You can download the user control that I named appropriately "HideSPArtifacts.ascx" from here.
Provided that you have uploaded it to the Master Page Gallery of your site collection (/_catalogs/masterpage) you will need the following lines of code to place it in your master page file:
First you need the "Register" directive at the top of your master page:

<%@ Register TagPrefix="MyControls" TagName="HideSPArtifacts" Src="~SiteCollection/_catalogs/masterpage/HideSPArtifacts.ascx" %>

The second bit is to place the control declaration in the page mark-up:

<MyControls:HideSPArtifacts runat="server" RemoveCoreJS="false" RemoveCoreCss="false" RemoveHeadCss="false" RemoveForm="false" AddBodyOnLoadDummy="false" EnablePageViewState="true" RemoveZoneHeaders="false"/>

Two very important notes here: 1) if you upload the user control to the Master Page Gallery of your site collection, you will additionally have to make certain modifications to your web.config file (check the previous posting that I mentioned above). 2) You need to place the control's declaration (MyControls:HideSPArtifacts) immediately after the opening "html" element of your page and before the "head" HTML element.
One other thing that you should check in your master page is whether you have a "head" element and whether it has the runat="server" attribute (if you have used one of the SharePoint master pages as a base for your master page you will have these). If this condition is not satisfied the control won't be able to remove some of the SharePoint artifacts from the page.
So, after you have the user control in your master page and open a page from your site anonymously (if you view the page as authenticated user the control will do nothing) and you have the values of the control's properties as they are in the snippet above you will see ... no changes in the HTML code of the page. This is because all "Remove.." properties are set to "false". I will now give you a list with the properties of the control and will briefly explain what changes in the generated HTML you will see after setting each property:

RemoveCoreJS - as the name suggests, if this property is set, the include script declaration for the SharePoint's "core.js" file is removed from the page
RemoveCoreCss - when set, this property causes the core.css style sheet include to be removed from the page. Note that if you use alternate style sheets, these won't be rendered either. This is because the HideSPArtifacts control will block the rendering of the standard SharePoint CssLink control (if available).
RemoveHeadCss - when you use certain web controls like the TreeView control, the AspMenu control and some other controls, the asp.net page generates an inline CSS block in its "head" HTML element. If you don't want this inline style sheet to appear in the page (check if this doesn't affect any of the controls that you use on the page), set this property to true.
RemoveZoneHeaders - the WebPartZone controls that contain your web parts have the bad habit of creating several nested "table" HTML elements. The web parts' chrome (frame), which you most often set to "none" because you don't need it in public WCM sites, also renders a "table" element. If you don't want any of these "table" elements and want only the HTML markup rendered directly by your web parts, set this property to true. Note that with this property set, no chrome will actually be rendered (in anonymous mode only) even if you set the "ChromeType" property of the web part.
EnablePageViewState - the default value of this property is true and in this case the control will change nothing on the page. If you set this property to false it will simply set the EnableViewState property of the containing page to false (only when the page is viewed anonymously). The net effect will be that ... you will still have the "ViewState" hidden field in your page, but it will contain only thirty or so bytes of data.
RemoveForm - when set, this property removes the "form" element from your page. Actually it does something much more radical - it also removes all system hidden fields, including the ViewState field, and all inline JavaScript blocks that were included using the methods of Page.ClientScript - e.g. RegisterClientScriptBlock, RegisterStartupScript, etc. You will get rid of a ton of HTML and JavaScript in your page, which you don't need if you have no controls or logic that do POST submits of the page. If your pages (at least the pages using the master page with the HideSPArtifacts control) contain only SharePoint field controls and content editor web parts, this will be a perfect choice. Note however that you will need to carefully check all your pages - some controls (like Button, LinkButton, etc) crash outright if there is no "form" rendered on the page. Other controls may stop functioning properly because they will miss some JavaScript code that won't get rendered. Bottom line - use cautiously.
AddBodyOnLoadDummy - the default value of this property is true. It has a visible effect only when the "RemoveForm" property is set to true. It adds a small JavaScript block with several empty JavaScript functions. One of these functions is called "_spBodyOnLoadWrapper". This function appears in the "onload" attribute of the "body" element of the default SharePoint master page. Since the "RemoveForm" property removes all inline JavaScript associated with the page's "form" element, the real definition of this JavaScript function won't be available on the page, and you will see a JavaScript error in your browser when the page loads. This is the reason why this property causes the adding of this small JavaScript block with empty definitions of this and two other system JS functions.
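As an illustration, for a master page whose pages hold only field controls and content editor web parts, the declaration could look like the sketch below. The property names are the ones listed above; the particular combination of values is only an assumption about what such a site might want, so test each switch against your own pages before relying on it.

```aspx
<%-- Hypothetical configuration: everything switched off for anonymous display mode --%>
<MyControls:HideSPArtifacts runat="server"
    RemoveCoreJS="true"
    RemoveCoreCss="true"
    RemoveHeadCss="true"
    RemoveForm="true"
    AddBodyOnLoadDummy="true"
    EnablePageViewState="false"
    RemoveZoneHeaders="true" />
```

If any page under this master page contains a Button, LinkButton or another control that posts back, RemoveForm should stay "false" there, as noted in the list.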

And now, let me briefly explain how the trick with hiding a control without hiding its contents is possible. Actually the idea is to hide the control itself (or at least parts of it) but display its child controls. This technique is used in the implementation of the "RemoveForm", "RemoveHeadCss" and "RemoveZoneHeaders" properties. The following steps are executed:

in its OnInit override the HideSPArtifacts hooks onto the parent page's InitComplete event
in the InitComplete event handler an empty control (class Control) is created in the "Controls" collection of the parent control of the control which we want to hide. The new control is inserted in the "Controls" collection of the parent control right after the target control.
The SetRenderMethodDelegate method of the new empty control is called - this method receives a single delegate parameter, which asp.net invokes to produce the control's content when the control is rendered. The idea is to use the empty control as a place-holder to inject some HTML right after the control that we want to hide.
In the "Render" method override of the HideSPArtifacts control the "Visible" property of the control that we want to hide is set to false. Since we have placed the HideSPArtifacts right after the beginning of the master page its Render method is guaranteed to be called first in the child controls' chain. This way the control whose Visible property is set to false will not get rendered.
in the render method passed as the render delegate parameter of the SetRenderMethodDelegate method, the Controls collection of the target control is iterated and all child controls are rendered using the Control.RenderControl method. This way we have the target control itself not rendered but all its children get actually rendered within the empty control that was injected right after it. This is how the goal of hiding the control itself but not its child controls is achieved.
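The steps above can be sketched as a small stand-alone control. This is not the source of HideSPArtifacts, only a minimal illustration of the same idea: the class name RenderChildrenOnly and its TargetID property are made up for the example, and the check for anonymous access is left out.

```csharp
using System;
using System.Web.UI;

// Hypothetical sketch: hides the control with the given ID but still renders its children.
public class RenderChildrenOnly : Control
{
    private string _targetID;
    private Control _target;

    // ID of the control to hide (assumed to live in the same naming container)
    public string TargetID { get { return _targetID; } set { _targetID = value; } }

    protected override void OnInit(EventArgs e)
    {
        base.OnInit(e);
        // step 1: hook onto the page's InitComplete event
        Page.InitComplete += OnPageInitComplete;
    }

    private void OnPageInitComplete(object sender, EventArgs e)
    {
        if (string.IsNullOrEmpty(_targetID)) return;
        _target = NamingContainer.FindControl(_targetID);
        if (_target == null || _target.Parent == null) return;

        // step 2: insert an empty place-holder right after the target control
        Control parent = _target.Parent;
        Control placeholder = new Control();
        parent.Controls.AddAt(parent.Controls.IndexOf(_target) + 1, placeholder);

        // step 3: let the place-holder render through our delegate
        placeholder.SetRenderMethodDelegate(RenderTargetChildren);
    }

    // step 5: emit the children of the hidden control in the place-holder's position
    private void RenderTargetChildren(HtmlTextWriter output, Control container)
    {
        foreach (Control child in _target.Controls)
        {
            child.RenderControl(output);
        }
    }

    protected override void Render(HtmlTextWriter writer)
    {
        // step 4: this control renders before the target, so the target is skipped
        if (_target != null) _target.Visible = false;
    }
}
```

The sketch relies on the same behavior as the steps above: the children are rendered explicitly with RenderControl even though their parent has been made invisible. Like the real control, it has to be declared before the target control in the page markup.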

You can check the source code of the HideSPArtifacts control for the details of the actual implementation.

Sunday, October 30, 2011

Deploy user controls in the SharePoint content database

Normally user controls (*.ascx files) in SharePoint are deployed with farm solutions, and the preferred system (hive) folder for them is CONTROLTEMPLATES. User controls can be used for various purposes – for form templates in SharePoint lists, with the new visual web part in SharePoint 2010, or simply to provide reusable visual bits that can be placed in more than one master page file or page layout. User controls can be quite handy since they provide a nice separation of the presentation logic which is not directly available in regular web controls – you can use the WebForms markup to produce your HTML output instead of having to deal with that in the code itself.

And to the subject of this posting: is it possible to have user controls (*.ascx files) directly in the content database of your SharePoint web application (site) instead of in the SharePoint hive folder (12/TEMPLATE/CONTROLTEMPLATES or 14/TEMPLATE/CONTROLTEMPLATES)? The answer is yes, and I am going to demonstrate that shortly. But before that I want to briefly discuss the motives that could justify using user controls in the content database, and some possible advantages and disadvantages of this approach. This topic is actually a bit broader than the posting’s title suggests: it is about the possibility of having a custom SharePoint application that doesn’t use custom assemblies or files deployed to the SharePoint hive folder. This sounds a bit like the new sandbox solutions available in SharePoint 2010, and to some extent it is, though the code that can be executed here uses the normal farm version of the SharePoint object model. Unlike with sandbox solutions, there is also a big security implication that I will discuss shortly. The main motivation is to have a way to place your code directly in your master pages or page layouts using, say, the SharePoint designer. So, as opposed to the usual way of deploying SharePoint farm solutions with “wsp” files, this is a sort of “SharePoint designer” development and deployment (the boundary between development and deployment with the SharePoint designer is almost non-existent). The other advantage is that you can push all your code and code updates using the built-in content deployment paths and jobs. The bottom line is that if you use the SharePoint designer extensively and have a SharePoint environment with a publishing and a production server that use content deployment paths for content synchronization, you can consider this approach.

As for the question of how to use code directly in your master page files and page layouts (maybe you’ve done that many times already with the SharePoint designer) – the answer is simple – inline code blocks:

<script runat="server">

Two big notes here – the first one is the security issue that I mentioned above. Since it is very easy to insert a code block in a page using the SharePoint designer, there is an internal protection in SharePoint – this is the so called “safe mode” for parsing and processing of un-ghosted pages (pages that are in the content database). By default if you have a code block in a page that is un-ghosted (or was created directly in the content database) you will receive an error if you try to open the page. This can be overridden by a setting in the web.config file:

<PageParserPaths>
  <PageParserPath VirtualPath="/_catalogs/masterpage/*" CompilationMode="Always" AllowServerSideScript="True" IncludeSubFolders="true" />
</PageParserPaths>

but keep in mind that this can be a serious security hole. Note the value of the “VirtualPath” attribute: it contains the location of the Master Page gallery with a wildcard, meaning that all your master page files and page layouts will be allowed to have code blocks. If you have a site collection under the server relative URL “/sites/test-site” you will have to specify the full path to its Master Page gallery: “/sites/test-site/_catalogs/masterpage/*” (unless you want to use something as insecure as “/*”). For a detailed treatment of the SharePoint “safe mode” page processing and the configuration of the “PageParserPath” elements (and also of the “SafeControl” elements) check this MSDN article.

The second note is that using code blocks, as opposed to having the code in a code-behind assembly, is not considered good coding practice. Apart from that, the SharePoint designer is far from Visual Studio in terms of IDE support for code development.

The placing of code blocks inside master pages and page layouts is maybe nothing new for most of you but the main idea of this posting is to demonstrate the using of user controls inside master pages and page layouts. And it will be the user control that will contain the inline code block in this case. Remember that the ascx file of the user control will reside in the content database (it can be uploaded to the Master Page gallery for instance), so it will be subject to the SharePoint “safe mode” processing mode too. And you will need to add some extra configuration bits to the web.config file so that your pages are allowed to use the user controls. You have two ways to enable this in the web.config – the first one is to add a “SafeControl” element like this one:

<SafeControl Src="/_catalogs/masterpage/*" IncludeSubFolders="True" Safe="True" AllowRemoteDesigner="True" SafeAgainstScript="True" />

Note that the value of the “Src” attribute should contain the full server relative URL of the target library in your site collection (e.g. for the “/sites/test-site” site collection it will be “/sites/test-site/_catalogs/masterpage/*”). The other way (much less secure) is to add an extra attribute to the “PageParserPath” element:

AllowUnsafeControls="True"

In this case you won’t need an extra “SafeControl” element.
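Put together, the second option might look like the fragment below. It is only a sketch: the site collection path is the example path used earlier in this posting, and the other attribute values are copied from the “PageParserPaths” snippet shown before.

```xml
<!-- Sketch only: allows code blocks AND unsafe controls for one Master Page gallery -->
<PageParserPaths>
  <PageParserPath VirtualPath="/sites/test-site/_catalogs/masterpage/*"
                  CompilationMode="Always"
                  AllowServerSideScript="True"
                  AllowUnsafeControls="True"
                  IncludeSubFolders="true" />
</PageParserPaths>
```

Keeping the “VirtualPath” as narrow as possible limits how many files in the content database are allowed to run server-side code.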

Note also that although you can deploy your code files (pages and user controls) directly to the content database, unlike with the SharePoint 2010 sandbox solutions you will need to modify the web.config file of the containing web application, which has the serious security implications that I already pointed out.

And let me now demonstrate a sample user control that can be deployed (uploaded) to the Master Page gallery and used by page layouts (wpzone.ascx):

<%@ Assembly Name="Microsoft.SharePoint, Version=14.0.0.0,Culture=neutral,PublicKeyToken=71e9bce111e9429c" %>
<%@ Assembly Name="Microsoft.SharePoint.Publishing, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c" %>
 
<%@ Control Language="C#"  %>
<%@ Import Namespace="System.Collections.Generic" %>
<%@ Import Namespace="Microsoft.SharePoint" %>
<%@ Import Namespace="Microsoft.SharePoint.WebPartPages" %>
 
<script runat="server">
    protected string _wpZoneID;

    // ID of the WebPartZone whose own markup should be suppressed
    public string WPZoneID { get { return _wpZoneID; } set { _wpZoneID = value; } }

    protected override void OnLoad(EventArgs args)
    {
        base.OnLoad(args);
    }

    protected override void Render(HtmlTextWriter writer)
    {
        if (string.IsNullOrEmpty(_wpZoneID)) return;

        // the zone is looked up in the same naming container as this control
        Control c = this.NamingContainer.FindControl(_wpZoneID);
        WebPartZone _zone = c as WebPartZone;
        if (_zone == null) return;

        SPWebPartManager mngr = SPWebPartManager.GetCurrentWebPartManager(this.Page) as SPWebPartManager;
        if (mngr == null) return;

        if (!mngr.GetDisplayMode().AllowPageDesign)
        {
            // if we are in display mode - hide the zone control itself
            _zone.Visible = false;

            // and render the web parts directly
            foreach (WebPart part in _zone.WebParts)
            {
                part.RenderControl(writer);
            }
        }
    }
</script>

And here is how you can use this control in a page layout (the technique is identical for master pages and regular web part pages) – first you need the control “Register” directive at the top of the page:

<%@ Register TagPrefix="Test" TagName="WPZone" Src="~SiteCollection/_catalogs/masterpage/wpzone.ascx" %>

Note the value of the “Src” attribute – you can use the handy “~SiteCollection” URL token here instead of having to hard-code the server relative URL of the containing site collection. And then the declaration of the user control’s tag inside the markup of the page:

<Test:WPZone runat="server" WPZoneID="TopZone"/>

This simple user control modifies the default rendering of the WebPartZone control whose ID is specified in its “WPZoneID” property. When the page is in display mode the control hides the web part zone and renders only the web parts that belong to this web part zone. This effectively hides the markup that is produced by the web part zone (several nested HTML table elements) and also the chrome (or frame) headers of the web parts. You should place the control just before the declaration of the web part zone control whose rendering you want to modify.

It is also possible to use user controls just as regular classes with helper methods that can be reused in various places – in this case the user control will not have visual behavior. For example you can create a user control like this (mylib.ascx):

<%@ Assembly Name="Microsoft.SharePoint, Version=14.0.0.0,Culture=neutral,PublicKeyToken=71e9bce111e9429c" %>
<%@ Assembly Name="Microsoft.SharePoint.Publishing, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c" %>
 
<%@ Control Language="C#" ClassName="MyLib" %>
<script runat="server">
 public static string SayHello(string who)
 {
  return "Hello " + who;
 }
 
 public string SayHi (string who)
 {
  return "Hi " + who;
 }
</script>

Note the “ClassName” attribute in the “Control” directive – this specifies the name of the class that will be generated from the ascx file. You will be able to use this generated class by this name in the code blocks of the pages that use the control (or in other user controls). To use the user control in this way you will only have to put the control “Register” directive at the top of the page. And then in a code block you can have something like:

<script runat="server">
protected override void OnLoad(EventArgs args)
{
 base.OnLoad(args);
 this.txtBox.Text = MyLib.SayHello("John");
}
</script>

Saturday, July 30, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part II

In the first part of this posting I outlined the physical architecture of the SharePoint search engine, mostly the part concerning the Sts3 protocol handler and the role that the “SiteData” web service plays in the crawling process. The very fact that the search engine uses a web service to access and retrieve SharePoint list and site data gave me the first clue as to how I could “hack” the search crawling process. My idea was simple: since the web service is hosted in IIS, I can use some basic URL rewriting techniques so that the call to the real web service is “covertly” redirected to a custom web service, which will either implement the original logic of the standard service and add some functionality of its own, or will simply serve as a proxy to the real web service and modify either the input or the output of the latter. Of these two options the first seemed far too complex, and the second was quite sufficient for the goals that I had for the “hack”. The thing is that the output XML of the SiteData.GetContent method contains all relevant SharePoint list item and schema data: the “List” option for the ObjectType parameter returns the schema of the SharePoint list, and the “Folder” option returns the list item data (see the sample XML outputs of the web service in the first part of the posting). The problem is that the Sts3 protocol handler “interprets” the data from the output XML in its own specific way, which results in the well-known limitations on the crawled properties, and on the data retrieved for them, that we have in the search index for SharePoint content. So what I decided to do was to create a small custom web service that implements the SiteData.GetContent and SiteData.GetChanges methods (with exactly the same parameters and signatures). Since I wanted to use it as a proxy to the real SiteData web service, I needed some way to pass the call on to it.
The simplest option would have been to issue a second web service call from my web service, but the better solution was to create an instance of the SiteData web service class (Microsoft.SharePoint.SoapServer.SiteData from the STSSOAP assembly, which is in the _app_bin subfolder of the SharePoint web application) and call its corresponding method. The last trick of the “hack” was to take the XML output of the SiteData.GetContent and SiteData.GetChanges methods and modify it (actually add some extra data to it) so that I get the extra crawled properties that I needed in the search index.

So, before going into details about the implementation of the “hack”, I want to point out several reasons why you should think twice before using it (it has become a habit of mine to try to dissuade people from using my own solutions); I would not recommend using it in bigger production environments:

It tampers with the XML output of the standard SiteData web service – this may lead to unpredictable behavior of the index engine and result in it being not able to crawl your site(s). The standard XML output of the SiteData service is itself not quite well-formed XML so before getting the right way to modify it without losing its original formatting I kept receiving crawler errors which I could find in the crawl log of my SSP admin site.
There will be a serious performance penalty compared to using just the standard SiteData service. The increased processing time comes from the added parsing of the output XML and the extra modifications and additions added to it.
The general argument that this is indeed a hack which gets inside the standard implementation of the SharePoint search indexing which won’t sound good to both managers and Microsoft guys alike.

Having said that (and if you are still reading) let me give you the details of the implementation itself. The solution of the “hack” can be downloaded from here (check the installation notes below).

The first thing that I will start with is the URL rewriting logic that allows the custom web service to be invoked instead of the standard SiteData web service. In IIS 7 there is a built-in support for URL rewriting, but because I was testing on a Windows 2003 server with IIS 6 and because I was a bit lazy to implement a full proxy for the SiteData web service I went to the other approach … Which is to use a custom .NET HTTP module (the better solution) or to simply modify the global.asax of the target SharePoint web application (the worse but easier to implement solution) – which is the one that I actually used. The advantage of using a custom URL rewriting logic as opposed to using the built in URL rewriting functionality in IIS 7 is that in the former you can additionally inspect the HTTP request data and apply the URL rewriting only for certain methods of the web service. So in the modified version of the global.asax I do an explicit check for the web service method being called and redirect to the custom web service only if I detect the GetContent or GetChanges methods (all other methods will hit directly the standard SiteData service and no URL rewriting will take place). You can see the source code of the global.asax file that I used below:

<%@ Application Language="C#" Inherits="Microsoft.SharePoint.ApplicationRuntime.SPHttpApplication" %>
<script language="C#" runat="server">
protected void Application_BeginRequest(Object sender, EventArgs e) 
{
    CheckRewriteSiteData();
}
protected void CheckRewriteSiteData()
{
    if (IsGetListItemsCall())
    {
        string newUrl = this.Request.Url.AbsolutePath.ToLower().Replace("/sitedata.asmx", "/stefansitedata.asmx");
        HttpContext.Current.RewritePath(newUrl);
    }
}
protected bool IsGetListItemsCall()
{
    if (string.Compare(this.Request.ServerVariables["REQUEST_METHOD"], "post", true) != 0) return false;
    if (!this.Request.Url.AbsolutePath.EndsWith("/_vti_bin/SiteData.asmx", StringComparison.InvariantCultureIgnoreCase)) return false;
 
    if (string.IsNullOrEmpty(this.Request.Headers["SOAPAction"])) return false;
 
    string soapAction = this.Request.Headers["SOAPAction"].Trim('"').ToLower();
    if (!soapAction.EndsWith("getcontent") && !soapAction.EndsWith("getchanges")) return false;
    if (string.Compare(ConfigurationManager.AppSettings["UseSiteDataRewrite"], "true", true) != 0) return false;
 
    return true;
}
</script>

Note also that in the code I check a custom “AppSettings” key in the web.config file to decide whether or not to apply the URL rewriting logic. This way you can easily turn the “hack” on or off with a simple tweak in the configuration file of the SharePoint web application.
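For comparison, the HTTP module variant that I mentioned as the better solution could look roughly like the sketch below. I did not use or test this version; the class name SiteDataRewriteModule and the ShouldRewrite helper are made up for the example, and the checks simply mirror the ones in the global.asax above, including the “UseSiteDataRewrite” key.

```csharp
using System;
using System.Configuration;
using System.Web;

// Hypothetical sketch of the same rewrite rule packaged as an HTTP module.
public class SiteDataRewriteModule : IHttpModule
{
    private const string OriginalSuffix = "/sitedata.asmx";
    private const string CustomSuffix = "/stefansitedata.asmx";

    public void Init(HttpApplication app)
    {
        app.BeginRequest += OnBeginRequest;
    }

    public void Dispose() { }

    private static void OnBeginRequest(object sender, EventArgs e)
    {
        HttpContext ctx = ((HttpApplication)sender).Context;
        HttpRequest req = ctx.Request;
        string path = req.Url.AbsolutePath;

        if (!ShouldRewrite(req.HttpMethod, path, req.Headers["SOAPAction"],
                           ConfigurationManager.AppSettings["UseSiteDataRewrite"]))
            return;

        // swap only the file name; unlike the global.asax version the rest of the path keeps its case
        string newPath = path.Substring(0, path.Length - OriginalSuffix.Length) + CustomSuffix;
        ctx.RewritePath(newPath);
    }

    // Pure decision logic, kept separate so it can be checked without a running web application.
    public static bool ShouldRewrite(string httpMethod, string path, string soapAction, string enabledSetting)
    {
        if (!string.Equals(enabledSetting, "true", StringComparison.OrdinalIgnoreCase)) return false;
        if (!string.Equals(httpMethod, "POST", StringComparison.OrdinalIgnoreCase)) return false;
        if (path == null || !path.EndsWith("/_vti_bin/SiteData.asmx", StringComparison.OrdinalIgnoreCase)) return false;
        if (string.IsNullOrEmpty(soapAction)) return false;

        // the SOAPAction header value is usually wrapped in double quotes
        string action = soapAction.Trim('"');
        return action.EndsWith("GetContent", StringComparison.OrdinalIgnoreCase)
            || action.EndsWith("GetChanges", StringComparison.OrdinalIgnoreCase);
    }
}
```

A module like this would have to be compiled into an assembly and registered in the httpModules section of the web application's web.config, which is exactly the extra work that the global.asax shortcut avoids.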

And this is the code of the custom “SiteData” web service:

[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1), WebService(Namespace = "http://schemas.microsoft.com/sharepoint/soap/")]
public class SiteData
{
    [WebMethod]
    public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }
    [WebMethod]
    public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string LastChangeId, ref string CurrentChangeId, int Timeout, out bool moreChanges)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetChanges(objectType, contentDatabaseId, ref LastChangeId, ref CurrentChangeId, Timeout, out moreChanges);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }
}

As you see, the custom “SiteData” web service implements only the GetContent and GetChanges methods. We don’t need to implement the other methods of the standard SiteData web service because the URL rewriting will redirect to the custom web service only when one of these two methods is invoked. The two methods in the custom service have exactly the same signatures as the ones in the standard SiteData web service. The implementation of the methods is a simple delegation to the methods with the same names in a helper class: SiteDataHelper. Here is the source code of the SiteDataHelper class:

using SP = Microsoft.SharePoint.SoapServer;
namespace Stefan.SharePoint.SiteData
{
    public class SiteDataHelper
    {
        public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string startChangeId, ref string endChangeId, int Timeout, out bool moreChanges)
        {
            // Call the standard SiteData service class directly (not a generated proxy).
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetChanges(objectType, contentDatabaseId, ref startChangeId, ref endChangeId, Timeout, out moreChanges);
            try
            {
                // The Env.* values are taken from the result XML itself, so an empty EnvData is passed.
                ListItemXmlModifier modifier = new ListItemXmlModifier(new EnvData(), res);
                res = modifier.ModifyChangesXml();
            }
            // If the modification fails, log the error and return the unmodified standard XML.
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }
 
        public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
        {
            SPWeb web = SPContext.Current.Web;
            // Call the standard SiteData service class directly (not a generated proxy).
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
            try
            {
                // Contextual data used to populate the extra "Env.*" fields.
                EnvData envData = new EnvData() { SiteId = web.Site.ID, WebId = web.ID, ListId = objectId.TrimStart('{').TrimEnd('}') };
                if ((objectType == ObjectType.ListItem || objectType == ObjectType.Folder) && !securityOnly)
                {
                    // List item data: add the extra attributes to the "z:row" elements.
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListItemXml();
                }
                else if (objectType == ObjectType.List)
                {
                    // List schema: add the extra "Field" elements.
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListXml();
                }
            }
            // If the modification fails, log the error and return the unmodified standard XML.
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }
    }
}

The thing to note here is that the two methods in the SiteDataHelper class create an instance of the SiteData web service class directly (this is not a generated proxy class, but the actual web service class implemented in the standard STSSOAP.DLL). The corresponding GetContent or GetChanges method is called on this instance and the string result of the call is stored in a local variable. This string contains the XML with the list schema or the list item data (depending on whether the “ObjectType” parameter is “List” or “Folder”). The XML is then passed to an instance of the custom ListItemXmlModifier class, which handles all XML modifications for both the GetContent and GetChanges methods. Note that for the GetContent method the XML result is passed for modification only if the “ObjectType” parameter has the “ListItem”, “Folder” or “List” value (and, for “ListItem” and “Folder”, only when the “securityOnly” parameter is false).

I am not going to show the source code of the ListItemXmlModifier class in the posting (it is over 700 lines of code); instead I will briefly explain the changes that this class makes to the XML returned by the GetContent and GetChanges methods. The modifications are actually pretty simple and semantically there are only two types of changes, corresponding to the result XML of the GetContent (ObjectType=List) and GetContent (ObjectType=Folder) calls. The result XML of the GetChanges method has a more complex structure, but it contains the same two fragments (in one or more occurrences) wherever list and list item changes are available.

Let’s start with a sample XML from the standard SiteData.GetContent(ObjectType=List) method (I’ve trimmed some of the elements for brevity):

<List>
  <Metadata ID="{1d53a556-ae9d-4fbf-8917-46c7d97ebfa5}" LastModified="2011-01-17 13:24:18Z" Title="Pages" DefaultTitle="False" Description="This system library was created by the Publishing feature to store pages that are created in this site." BaseType="DocumentLibrary" BaseTemplate="850" DefaultViewUrl="/Pages/Forms/AllItems.aspx" DefaultViewItemUrl="/Pages/Forms/DispForm.aspx" RootFolder="Pages" Author="System Account" ItemCount="4" ReadSecurity="1" AllowAnonymousAccess="False" AnonymousViewListItems="False" AnonymousPermMask="0" CRC="699748088" NoIndex="False" ScopeID="{a1372e10-8ffb-4e21-b627-bed44a5130cd}" />
  <ACL>
    <permissions>
      <permission memberid='3' mask='9223372036854775807' />
      ....
    </permissions>
  </ACL>
  <Views>
    <View URL="Pages/Forms/AllItems.aspx" ID="{771a1809-e7f3-4c52-b346-971d77ff215a}" Title="All Documents" />
    ....
  </Views>
  <Schema>
    <Field Name="FileLeafRef" Title="Name" Type="File" />
    <Field Name="Title" Title="Title" Type="Text" />
    <Field Name="Comments" Title="Description" Type="Note" />
    <Field Name="PublishingContact" Title="Contact" Type="User" />
    <Field Name="PublishingContactEmail" Title="Contact E-Mail Address" Type="Text" />
    <Field Name="PublishingContactName" Title="Contact Name" Type="Text" />
    <Field Name="PublishingContactPicture" Title="Contact Picture" Type="URL" />
    <Field Name="PublishingPageLayout" Title="Page Layout" Type="URL" />
    <Field Name="PublishingRollupImage" Title="Rollup Image" Type="Note" TypeAsString="Image" />
    <Field Name="Audience" Title="Target Audiences" Type="Note" TypeAsString="TargetTo" />
    <Field Name="ContentType" Title="Content Type" Type="Choice" />
    <Field Name="MyLookup" Title="MyLookup" Type="Lookup" />
    ....
  </Schema>
</List>

The XML contains the metadata properties of the queried SharePoint list. The most important part is the Schema/Field elements: the simple definitions of the fields in this list. It is easy to deduce that the fields that the index engine encounters in this part of the XML will be recognized and will appear as crawled properties in the search index. So what if we start adding fields of our own? This won’t be enough by itself, because we will also need list items with values for these “added” fields (we’ll see that in the second XML sample), but it is the first required bit of the “hack”. The custom service implementation actually adds several extra “Field” elements like these:

    <Field Name='ContentTypeId.Text' Title='ContentTypeId' Type='Note' />
    <Field Name='Author.Text' Title='Created By' Type='Note' />
    <Field Name='Author.ID' Title='Created By' Type='Integer' />
    <Field Name='MyLookup.Text' Title='MyLookup' Type='Note' />
    <Field Name='MyLookup.ID' Title='MyLookup' Type='Integer' />
    <Field Name='PublishingRollupImage.Html' Title='Rollup Image' Type='Note' />
    <Field Name='PublishingPageImage.Html' Title='Page Image' Type='Note' />
    <Field Name='PublishingPageContent.Html' Title='Page Content' Type='Note' />
    <Field Name='Env.SiteId' Title='Env.SiteId' Type='Text' />
    <Field Name='Env.WebId' Title='Env.WebId' Type='Text' />
    <Field Name='Env.ListId' Title='Env.ListId' Type='Text' />
    <Field Name='Env.IsListItem' Title='Env.IsListItem' Type='Integer' />

You will immediately notice that these “new” fields are actually related to fields that already exist in the list schema XML that is being modified. I used a specific naming convention for the “Name” attribute: a dot and a short suffix. The crawled properties that the index engine generates will also contain the dot and the suffix, so it will be easy for you to locate them in the “crawled properties” page in the SSP admin site. From the “Name” attribute you can also see immediately which original field each new field relates to. In short, the rules for creating these new fields are:

For every original lookup field (both single and multiple lookup columns and all derived lookup columns, e.g. user fields) two additional fields are added, with the suffixes “.ID” and “.Text” and with the field “Type” attribute set to “Integer” and “Note” respectively.
For every original publishing “HTML” and “Image” field one extra field with the “.Html” suffix is added.
For all lists the extra “ContentTypeId.Text” field is added with the “Type” attribute set to “Note”.
For all lists the additional fields “Env.SiteId”, “Env.WebId”, “Env.ListId” and “Env.IsListItem” are added.
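
The rules above can be sketched in a few lines of C#. This is only a minimal illustration using LINQ to XML, not the actual ListItemXmlModifier code (which is not shown in this posting). The class and method names are hypothetical, and the sets of field types that count as "lookup" and "markup" fields are my assumptions based on the sample XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SchemaFieldSketch
{
    // Assumption: the "Type" values that are treated as lookup-like fields.
    static readonly HashSet<string> LookupTypes =
        new HashSet<string> { "Lookup", "LookupMulti", "User", "UserMulti" };

    // Assumption: the "TypeAsString" values of the publishing HTML and Image fields
    // ("Image" appears in the sample XML, "HTML" is assumed).
    static readonly HashSet<string> MarkupTypes =
        new HashSet<string> { "HTML", "Image" };

    // Adds the extra "Field" elements to the Schema element of the list XML.
    public static string AddDerivedFields(string listXml)
    {
        XElement list = XElement.Parse(listXml);
        XElement schema = list.Element("Schema");
        if (schema == null) return listXml;

        // Snapshot the original fields so that the added ones are not processed again.
        List<XElement> originals = schema.Elements("Field").ToList();
        foreach (XElement field in originals)
        {
            string name = (string)field.Attribute("Name");
            string title = (string)field.Attribute("Title");
            string type = (string)field.Attribute("Type");
            string typeAsString = (string)field.Attribute("TypeAsString");

            if (type != null && LookupTypes.Contains(type))
            {
                schema.Add(NewField(name + ".Text", title, "Note"));
                schema.Add(NewField(name + ".ID", title, "Integer"));
            }
            else if (typeAsString != null && MarkupTypes.Contains(typeAsString))
            {
                schema.Add(NewField(name + ".Html", title, "Note"));
            }
        }

        // Fields that are added for all lists.
        schema.Add(NewField("ContentTypeId.Text", "ContentTypeId", "Note"));
        foreach (string env in new[] { "Env.SiteId", "Env.WebId", "Env.ListId" })
            schema.Add(NewField(env, env, "Text"));
        schema.Add(NewField("Env.IsListItem", "Env.IsListItem", "Integer"));

        return list.ToString(SaveOptions.DisableFormatting);
    }

    static XElement NewField(string name, string title, string type)
    {
        return new XElement("Field",
            new XAttribute("Name", name),
            new XAttribute("Title", title),
            new XAttribute("Type", type));
    }
}
```

Calling SchemaFieldSketch.AddDerivedFields with the sample list XML above would append, among others, the “MyLookup.Text”, “MyLookup.ID” and “PublishingRollupImage.Html” fields after the original ones.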

So, we now have the extra fields in the list schema; the next step is to have them populated with the relevant values in the list item data. Let me first show you a sample of the standard unmodified XML output of the GetContent(ObjectType=Folder) method (I trimmed some of the elements and shortened the values of some of the attributes for brevity):

<Folder>
  <Metadata>
    <scope id='{5dd2834e-902d-4db0-8db2-4a1da762a620}'>
      <permissions>
        <permission memberid='1' mask='206292717568' />
        ....
      </permissions>
    </scope>
  </Metadata>
  <xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882' xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>
    <s:Schema id='RowsetSchema'>
      <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
        <s:AttributeType name='ows_ContentTypeId' rs:name='Content Type ID' rs:number='1'>
          <s:datatype dt:type='int' dt:maxLength='512' />
        </s:AttributeType>
        <s:AttributeType name='ows__ModerationComments' rs:name='Approver Comments' rs:number='2'>
          <s:datatype dt:type='string' dt:maxLength='1073741823' />
        </s:AttributeType>
        <s:AttributeType name='ows_FileLeafRef' rs:name='Name' rs:number='3'>
          <s:datatype dt:type='string' dt:lookup='true' dt:maxLength='512' />
        </s:AttributeType>
        ....
      </s:ElementType>
    </s:Schema>
    <scopes>
    </scopes>
    <rs:data ItemCount='2'>
      <z:row ows_ContentTypeId='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390064DEA0F50FC8C147B0B6EA0636C4A7D400E595F4AC9968CC4FAD1928288BC9885A' ows_FileLeafRef='1;#Default.aspx' ows_Modified_x0020_By='myserver\sstanev' ows_File_x0020_Type='aspx' ows_Title='Home' ows_PublishingPageLayout='http://searchtest/_catalogs/masterpage/defaultlayout.aspx, Welcome page with Web Part zones' ows_ContentType='Welcome Page' ows_PublishingPageImage='' ows_PublishingPageContent='some content' ows_ID='1' ows_Created='2010-12-20T18:53:18Z' ows_Author='1;#Stefan Stanev' ows_Modified='2010-12-26T12:45:31Z' ows_Editor='1;#Stefan Stanev' ows__ModerationStatus='0' ows_FileRef='1;#Pages/Default.aspx' ows_FileDirRef='1;#Pages' ows_Last_x0020_Modified='1;#2010-12-26T12:45:32Z' ows_Created_x0020_Date='1;#2010-12-20T18:53:19Z' ows_File_x0020_Size='1;#6000' ows_FSObjType='1;#0' ows_PermMask='0x7fffffffffffffff' ows_CheckedOutUserId='1;#' ows_IsCheckedoutToLocal='1;#0' ows_UniqueId='1;#{923EEE29-44AB-4D1B-B65B-E3ECEAE1353E}' ows_ProgId='1;#' ows_ScopeId='1;#{5DD2834E-902D-4DB0-8DB2-4A1DA762A620}' ows_VirusStatus='1;#6000' ows_CheckedOutTitle='1;#' ows__CheckinComment='1;#' ows__EditMenuTableStart='Default.aspx' ows__EditMenuTableEnd='1' ows_LinkFilenameNoMenu='Default.aspx' ows_LinkFilename='Default.aspx' ows_DocIcon='aspx' ows_ServerUrl='/Pages/Default.aspx' ows_EncodedAbsUrl='http://searchtest/Pages/Default.aspx' ows_BaseName='Default' ows_FileSizeDisplay='6000' ows_MetaInfo='...' ows__Level='1' ows__IsCurrentVersion='1' ows_SelectTitle='1' ows_SelectFilename='1' ows_Edit='0' ows_owshiddenversion='26' ows__UIVersion='6656' ows__UIVersionString='13.0' ows_Order='100.000000000000' ows_GUID='{2C80A53D-4F38-4494-855D-5B52ED1D095B}' ows_WorkflowVersion='1' ows_ParentVersionString='1;#' ows_ParentLeafName='1;#' ows_Combine='0' ows_RepairDocument='0' ows_ServerRedirected='0' />
      ....
    </rs:data>
  </xml>
</Folder>

The list item data is contained below the “rs:data” element: there is one “z:row” element for every list item. The attributes of the “z:row” element contain the field values of the corresponding list item. You can see that the attributes already have the “ows_” prefix, like all crawled properties in the “SharePoint” category. Notice that the attributes for lookup fields contain the unmodified item field data, but the publishing “HTML” and “Image” columns are already modified: all HTML markup has been removed from them (for the “Image” type columns this means that they become empty, since all the data they contain is markup).
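
To make the structure concrete, here is a small sketch of how the “z:row” elements can be read with LINQ to XML. The class and method names are hypothetical (this is not code from the actual solution); the two namespace URIs are the ones declared on the “xml” element in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RowReaderSketch
{
    // Namespaces declared on the "xml" element of the result XML.
    static readonly XNamespace Rs = "urn:schemas-microsoft-com:rowset";
    static readonly XNamespace Z = "#RowsetSchema";

    // Returns one dictionary (attribute name -> value) for every z:row element.
    public static List<Dictionary<string, string>> ReadRows(string folderXml)
    {
        XElement folder = XElement.Parse(folderXml);
        return folder
            .Descendants(Rs + "data")
            .Elements(Z + "row")
            .Select(row => row.Attributes()
                .ToDictionary(a => a.Name.LocalName, a => a.Value))
            .ToList();
    }
}
```

Each dictionary returned by ReadRows holds the “ows_” attributes of one list item, for example “ows_Author” with the value “1;#Stefan Stanev”.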

Now let’s see the additional attributes that the custom web service adds to the “z:row” elements of the list item data XML:

ows_ContentTypeId.Text='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF3900242457EFB8B24247815D688C526CD44D0005E464F1BD83D14983E49C578030FBF6' 
ows_PublishingPageImage.Html='&lt;img border="0" src="/PublishingImages/newsarticleimage.jpg" vspace="0" style="margin-top:8px" alt=""&gt;' 
ows_PublishingPageContent.Html='&lt;b&gt;some content&lt;/b&gt;' 
ows_Author.Text='1;#Stefan Stanev' 
ows_Author.ID='1' 
ows_Editor.Text='1;#Stefan Stanev' 
ows_Editor.ID='1' 
ows_Env.SiteId='ff96067a-accf-4763-8ec1-194f20fbf0f5' 
ows_Env.WebId='b2099353-41d6-43a7-9b0d-ab6ad87fb180' 
ows_Env.ListId='Pages' 
ows_Env.IsListItem='1' 

Here is how these fields are populated/formatted (as I mentioned above all the data is retrieved from the XML itself or from the context of the service request):

The lookup-derived fields with the “.ID” and “.Text” suffixes: both get their values from the “parent” lookup column. The former is populated with the leading integer value of the lookup field; the latter is set to the unmodified value of the original lookup column. When the search index generates the corresponding crawled properties, the “.ID” field can be used as a true integer property, and the “.Text” one, although it contains the original lookup value, will be treated as a simple text property by the index engine (remember that in the list schema XML the type of this extra field was set to “Note”). So what is the difference between the “.Text” field and the original lookup column in the search index? The value of the original lookup column is trimmed in the search index and contains only the text value, without the integer part preceding it. If you issue an SQL search query against a managed property mapped to the crawled property of a lookup field, you will be able to retrieve only the textual part of the lookup value (this also holds for filtering and sorting on this field type), whereas with the “.Text” derived field you have access to the original unmodified value of the lookup field.
The fields derived from the “HTML” and “Image” publishing field types with the “.Html” suffix: they are populated with the original values of the corresponding fields, with the original markup intact. Since the values of the original fields in the list item data XML are already stripped of markup, the original values are retrieved with a simple trick. The “z:row” element for every list item has the “ows_MetaInfo” attribute, which contains a serialized property bag with all properties of the underlying SPFile object for the current list item. This property bag happens to contain all list item field values which are non-empty. So what I do in this case is parse the value of the ows_MetaInfo attribute and retrieve the unmodified values for all “Html” and “Image” fields that I need. An important note here: the ows_MetaInfo attribute (and its corresponding system list field, MetaInfo) is available only for document libraries and is not present in non-library lists, which means that this trick is possible only for library-type lists.
The ows_ContentTypeId.Text field gets populated from the value of the original ows_ContentTypeId field/attribute. The difference between the two is that the derived one is defined in the schema as a “Note” field, so its value is treated by the search index as a text property.
The ows_Env.* fields get populated from the contextual data of the service request (see the implementation of the SiteDataHelper class). For the XML modifications of the SiteData.GetChanges method these values are retrieved from the result XML itself. The value of ows_Env.IsListItem is always set to 1 (its purpose is to be used as a flag defining a superset of the standard “isdocument” managed property).
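
Here is a minimal sketch of how the lookup-derived, ContentTypeId and Env attributes could be added to the “z:row” elements. It is not the actual ListItemXmlModifier code: the class and method names are hypothetical, and passing the names of the lookup fields in as a parameter is an assumption made to keep the example short (the “.Html” fields are left out because they depend on the ows_MetaInfo format, which is not shown here).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RowAttributeSketch
{
    static readonly XNamespace Z = "#RowsetSchema";

    // Returns the leading integer of a lookup value such as "1;#Stefan Stanev",
    // or null when the value does not start with an integer.
    public static int? GetLookupId(string lookupValue)
    {
        if (string.IsNullOrEmpty(lookupValue)) return null;
        int sep = lookupValue.IndexOf(";#", StringComparison.Ordinal);
        string idPart = sep >= 0 ? lookupValue.Substring(0, sep) : lookupValue;
        int id;
        return int.TryParse(idPart, out id) ? id : (int?)null;
    }

    // lookupFields holds field names without the "ows_" prefix, e.g. "Author".
    public static string AddDerivedAttributes(string folderXml, string[] lookupFields,
        string siteId, string webId, string listId)
    {
        XElement folder = XElement.Parse(folderXml);
        foreach (XElement row in folder.Descendants(Z + "row"))
        {
            foreach (string field in lookupFields)
            {
                string value = (string)row.Attribute("ows_" + field);
                if (value == null) continue;

                // ".Text" keeps the unmodified lookup value, ".ID" only the leading integer.
                row.SetAttributeValue("ows_" + field + ".Text", value);
                int? id = GetLookupId(value);
                if (id.HasValue)
                    row.SetAttributeValue("ows_" + field + ".ID", id.Value);
            }

            string contentTypeId = (string)row.Attribute("ows_ContentTypeId");
            if (contentTypeId != null)
                row.SetAttributeValue("ows_ContentTypeId.Text", contentTypeId);

            row.SetAttributeValue("ows_Env.SiteId", siteId);
            row.SetAttributeValue("ows_Env.WebId", webId);
            row.SetAttributeValue("ows_Env.ListId", listId);
            row.SetAttributeValue("ows_Env.IsListItem", 1);
        }
        return folder.ToString(SaveOptions.DisableFormatting);
    }
}
```

For a multi-value lookup such as “1;#a;#2;#b” GetLookupId returns only the first ID, which matches the “starting integer value” rule described above.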

Installation steps for the “hack” solution

1. Download and extract the contents of the zip archive.
2. Build the project in Visual Studio (it is a VS 2008 project).
3. Deploy the assembly file (Stefan.SharePoint.SiteData.dll) to the GAC.
4. Copy the StefanSiteData.asmx file to the {your 12 hive root}\ISAPI folder.
5. Copy the global.asax file to the root folder of your SharePoint web application. Note that you should back up the original global.asax file before you overwrite it with this one.
6. Open the web.config file of the target SharePoint web application and add an “appSettings” “add” element with key “UseSiteDataRewrite” and value “true”.
7. Note that if you have more than one front-end server in your farm you should repeat steps 3-6 on all machines.
8. When the installation is complete you can start the search crawler (full crawl) from the SSP admin site. It is a good idea to have a content source only for the web application for which the custom SiteData service is enabled, so that you can immediately see the results of the custom service.
9. After the crawl is complete you should check the crawl log for errors: check whether there are unusual errors that were not occurring before the installation of the custom SiteData service.
10. If there are no errors in the crawl log you can check the “crawled properties” page in the SSP admin site: the new “dotted” crawled properties should be there and you can now create new managed properties and map them to these crawled properties.
11. Note that the newly created managed properties are not ready for use until you run a second full crawl for the target content source.
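
For reference, the web.config change from the steps above would look roughly like the fragment below. The key and value are the ones named in the installation steps; the surrounding elements are shown only for orientation, and if your web.config already has an “appSettings” section you should add just the “add” element to it rather than a second section.

```xml
<configuration>
  <appSettings>
    <add key="UseSiteDataRewrite" value="true" />
  </appSettings>
</configuration>
```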

Stefan Stanev's SharePoint blog

Friday, January 20, 2012

SharePoint WCM HTML clean-up

Sunday, October 30, 2011

Deploy user controls in the SharePoint content database

Saturday, July 30, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part II

