Pav Cherny

Code download available at: ChernySharePoint2009_06.exe (2,006 KB)

Contents

Internal Binary Storage
External Binary Storage
Building an Unmanaged EBS Provider
Building a Managed EBS Provider
Registering an EBS Provider in SharePoint
Implementing Garbage Collection
Conclusion

Microsoft estimates that as much as 80 percent of the data stored in Microsoft Windows SharePoint Services (WSS) 3.0 and Microsoft Office SharePoint Server (MOSS) 2007 content databases is non-relational binary large object (BLOB) data, such as Microsoft Office Word documents, Microsoft Office Excel spreadsheets, and Microsoft Office PowerPoint presentations. Only 20 percent is relational metadata, which implies a suboptimal use of Microsoft SQL Server resources at the database backend. SharePoint does not take advantage of recent SQL Server innovations for unstructured data introduced in SQL Server 2008, such as the FILESTREAM attribute or Remote BLOB Storage API, but provides its own options to increase the storage efficiency and manageability of massive data volumes.

Specifically, SharePoint includes an external binary storage provider API, ISPExternalBinaryProvider, which Microsoft first published as a hotfix in May 2007 and incorporated later into Service Pack 1. The ISPExternalBinaryProvider API is separate from the Remote BLOB Storage API. Third-party vendors can use this API to integrate SharePoint with advanced storage solutions, such as content-addressable storage (CAS) systems. You can also use this API to maintain SharePoint BLOB data on a central file server outside of content databases if you want to build a custom solution to increase storage efficiency and scalability in a SharePoint farm. Keep in mind, however, that this API is specific to WSS 3.0 and MOSS 2007. It will change in the next SharePoint release, which means that you will have to update your provider.

In this column, I discuss how to extend the SharePoint storage architecture using the ISPExternalBinaryProvider API, including advantages and disadvantages, implementation details, performance considerations, and garbage collection. I also discuss a 64-bit compatibility issue of Microsoft Visual Studio that can cause SharePoint to fail loading managed ISPExternalBinaryProvider components despite a correct interface implementation. Where appropriate, I refer to the ISPExternalBinaryProvider documentation in the WSS 3.0 SDK. Another reference worth mentioning is Kyle Tillman's blog.

Kyle does a great job explaining how he mastered the implementation hurdles in managed code, but neither the WSS 3.0 SDK nor Kyle's blog post includes a Visual Studio sample project, so I decided to provide ISPExternalBinaryProvider samples in both unmanaged and managed code in this column's companion material. The purpose of these samples is to help you get started if you are interested in integrating external storage solutions with SharePoint. Remember, though, that these samples are untested and not ready for production use.

Internal Binary Storage

By default, SharePoint stores BLOB data in the Content column of the AllDocStreams table in the content database. The obvious advantage of this approach is straightforward transactional consistency between relational data and the associated non-relational file contents. For example, it's not complicated to insert the metadata of a Word document along with the unstructured content into a content database, nor is it complicated to associate metadata with the corresponding unstructured content in select, update, or delete operations. However, the most obvious disadvantage of the default approach is an inefficient use of storage resources. Despite an I/O subsystem optimized for high performance, the SQL Server storage engine is not exactly a file-server replacement.

A SQL Server database consists of transaction log and data files, as illustrated in Figure 1. In order to ensure reliable transactional behavior, SQL Server first writes all transaction records to the log file before it flushes the corresponding data in 8KB pages to the data file on disk. Depending on the selected recovery model, this requires more than twice the BLOB size in storage capacity until you perform a backup and purge the transaction log. Moreover, SQL Server does not store unstructured SharePoint content directly in data pages. Instead, SQL Server uses a separate collection of text/image pages and only stores a 16-byte text pointer to the BLOB's root node in the data row. Text/image pages are organized in a balanced tree, yet there is only one collection of text/image pages for each table. For the AllDocStreams table, this means that the content of all files is spread across the same text/image page collection. A single text/image page can hold data fragments from multiple BLOBs, or it may hold intermediate nodes for BLOBs larger than 32KB in size.


Figure 1 Default SharePoint BLOB storage in SQL Server

Let's not dive too deeply into SQL Server internals, though. The point is that when reading unstructured content, SQL Server must go through the data row to get the text pointer and then through the BLOB's root node and possibly additional intermediate nodes to locate all data fragments spread across any number of text/image pages that SQL Server must load into memory in full to get all data blocks. This is because SQL Server performs I/O operations at the page level. These complexities impair file-streaming performance in comparison to direct access through the file system. SQL Server also imposes a hard size limit of 2GB on SharePoint because this is the maximum capacity of the image data type. The Content column of the AllDocStreams table is an image column, so you cannot store files larger than 2GB in a SharePoint content database.
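To put rough numbers on the page-level I/O and the size limit described above, here is a minimal Python sketch. It assumes, purely for illustration, that every 8KB page is filled entirely with BLOB data; real pages also carry headers and pointer nodes, so the actual page count is higher. The function names are hypothetical helpers, not part of any SharePoint or SQL Server API.

```python
import math

PAGE_SIZE = 8 * 1024          # SQL Server page size cited above (8KB)
IMAGE_MAX_BYTES = 2**31 - 1   # maximum capacity of the image data type (2GB)


def min_pages_touched(blob_bytes: int) -> int:
    """Lower bound on the pages SQL Server must load in full for one BLOB.

    Assumption: pages hold nothing but BLOB data, which understates the
    real number because headers and intermediate nodes are ignored.
    """
    return math.ceil(blob_bytes / PAGE_SIZE)


def fits_in_content_column(blob_bytes: int) -> bool:
    """True if a file of this size can be stored in an image column."""
    return blob_bytes <= IMAGE_MAX_BYTES


size = 5 * 1024 * 1024                 # a 5MB document
print(min_pages_touched(size))         # 640
print(fits_in_content_column(size))    # True
```

Even under this optimistic assumption, a single 5MB document spans hundreds of pages, which gives a sense of why streaming the same file from a file system can be cheaper.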

External Binary Storage

The ISPExternalBinaryProvider API offers a clever alternative to internal BLOB storage in SharePoint content databases. It is a straightforward COM interface with only two methods (StoreBinary and RetrieveBinary), which you can use to implement an External Binary Storage (EBS) provider. For architecture details, see the topic "Architecture of External BLOB Storage" in the WSS 3.0 SDK.

SharePoint loads your EBS provider when you set the ExternalBinaryStoreClassId property of the local SPFarm object (SPFarm.Local.ExternalBinaryStoreClassId) to the provider's COM class identifier (CLSID). SharePoint then calls the provider's StoreBinary method whenever you submit BLOB data, such as when you're uploading a file to a document library. The EBS provider can decide to store the BLOB in its associated external storage system and return a corresponding BLOB identifier (BLOB ID) to SharePoint, or it can set the pfAccepted parameter in the StoreBinary method to false to indicate that it did not handle the BLOB. In the latter case, SharePoint stores the BLOB in the content database as usual. On the other hand, if the EBS provider accepted the BLOB, SharePoint only inserts the BLOB ID into the Content column of the AllDocStreams table, as indicated in Figure 2. The BLOB ID can be any value that enables the EBS provider to locate the content in the external storage system, such as a filename, a file path, a globally unique identifier (GUID), or a content digest. The sample providers included in the companion material, for instance, use GUIDs as filenames for reliable identification of BLOBs on a file server.


Figure 2 Storing a SharePoint BLOB in an external storage system

SharePoint also keeps track of externally stored files by setting the highest DocFlags bit of these files to 1. DocFlags is a column of the AllDocs table. When a user requests to download an externally stored file, SharePoint checks DocFlags and passes the Content value from the AllDocStreams table to the RetrieveBinary method of the EBS provider. In response to the RetrieveBinary call, the EBS provider must retrieve the indicated BLOB from the external storage system and return the binary content to SharePoint in the form of a COM object that implements the ILockBytes interface. Note that SharePoint does not call the RetrieveBinary method for BLOBs stored directly in the content database.
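The store and retrieve flow described above can be modeled in a few lines of Python. This is only an in-memory sketch of the decision logic, not the COM interface: the class names, the min_size policy, and the dictionary standing in for the Content column are all assumptions made for illustration.

```python
import uuid


class SampleProvider:
    """Toy stand-in for an EBS provider (hypothetical, not the COM interface)."""

    def __init__(self, min_size: int = 0):
        self.store = {}           # plays the role of the external BLOB store
        self.min_size = min_size  # hypothetical policy: decline smaller BLOBs

    def store_binary(self, site_id: str, data: bytes):
        """Return (accepted, blob_id), mirroring pfAccepted and the BLOB ID."""
        if len(data) < self.min_size:
            return False, None    # declined: the BLOB stays in the database
        blob_id = str(uuid.uuid4())
        self.store[(site_id, blob_id)] = data
        return True, blob_id.encode("ascii")

    def retrieve_binary(self, site_id: str, blob_id: bytes) -> bytes:
        """Look up the content that belongs to a previously returned BLOB ID."""
        return self.store[(site_id, blob_id.decode("ascii"))]


class ContentDatabase:
    """Toy model of the Content column plus the external-storage flag."""

    def __init__(self, provider: SampleProvider):
        self.provider = provider
        self.rows = {}  # document name -> (is_external, content or BLOB ID)

    def upload(self, site_id: str, name: str, data: bytes) -> None:
        accepted, blob_id = self.provider.store_binary(site_id, data)
        self.rows[name] = (True, blob_id) if accepted else (False, data)

    def download(self, site_id: str, name: str) -> bytes:
        is_external, content = self.rows[name]
        if is_external:  # only flagged rows go through the provider
            return self.provider.retrieve_binary(site_id, content)
        return content
```

In this model, a row flagged as external holds nothing but the BLOB ID, so reading the rows dictionary directly yields an identifier rather than the file content.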

Note also that the storage and retrieval processes are transparent to the user as long as the user doesn't attempt to bypass SharePoint. So, you don't need to replace built-in Web parts with custom versions that tie metadata in a list with a document stored externally; productivity applications, such as Microsoft Office, don't need to know how to store metadata in one place and then the document in another; and Search does not need to process metadata separate from documents. Moreover, and this is one of my favorite advantages of the EBS provider architecture, the user must go through SharePoint to access externally stored BLOB data. A user bypassing SharePoint and directly accessing a content database through a SQL Server connection ends up downloading BLOB IDs instead of actual file contents, as illustrated in Figure 3. You can verify this behavior if you deploy the SQL Download Web Part (which I used in the April 2009 column to demonstrate how to bypass SharePoint AD RMS protection) in a test environment. Furthermore, users don't need—and should not have—access permissions to the external BLOB store. Only SharePoint security accounts require access because SharePoint calls the EBS provider methods in the security context of the site's application pool account.


Figure 3 The EBS provider can be a roadblock to bypassing SharePoint permissions for file downloads

Keep in mind, however, that EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store. For a good discussion of pros and cons, check out the topic "Operational Limits and Trade-Off Analysis" in the WSS 3.0 SDK. Make sure you read this very important topic before implementing an EBS provider in a SharePoint environment.

Building an Unmanaged EBS Provider

Now let's tackle the challenges of building EBS providers. The ISPExternalBinaryProvider interface is well-documented in the WSS 3.0 SDK under "The BLOB Access Interface: ISPExternalBinaryProvider." However, it seems Microsoft forgot to cover the EBS provider details. After all, we are not just consuming the interface of an existing COM server. We are tasked with building that COM server ourselves and implementing the ISPExternalBinaryProvider interface. Most importantly, the WSS 3.0 SDK fails to mention the type of COM server we are supposed to build and the required threading model. A classic COM server can run out-of-process or in-process, and it can support the single-threaded apartment (STA) model, the multithreaded apartment (MTA) model, both, or the free-threaded model. For the EBS provider to work properly, make sure you build a thread-safe in-process COM server that supports the threading model "Both" for STAs and the MTA.

You also need to think about which programming language to use. This is important because the ISPExternalBinaryProvider interface is the lowest-level API of SharePoint. Performance issues can affect the entire SharePoint farm. For this reason, I recommend using a language that enables you to build small and fast COM objects, such as Visual C++ and Active Template Library (ATL). ATL provides helpful C++ classes to simplify the development of thread-safe COM servers in unmanaged code with the correct level of threading support.

Visual Studio also includes a variety of ATL wizards. Just create an ATL project, select Dynamic-link library (DLL) for the server type, copy the ISPExternalBinaryProvider interface definition from the WSS 3.0 SDK into the interface definition language (IDL) file of your ATL project, add a new class for an ATL Simple Object, select "Both" as the threading model and no aggregation, then right-click the new class, point to Add, click Implement Interface, and select ISPExternalBinaryProvider. That's it! The Implement Interface Wizard performs all necessary plumbing, so you can focus on implementing the StoreBinary and RetrieveBinary methods.

And don't let unmanaged C++ code intimidate you. If you analyze the SampleStore.cpp file in the companion material, you can see that the StoreBinary and RetrieveBinary implementations are relatively straightforward. Essentially, the sample StoreBinary method constructs a file path based on a StorePath registry value, the Site ID passed in from SharePoint, and a GUID generated for the BLOB, and then uses the Win32 WriteFile function to save the binary data obtained from the ILockBytes instance. The sample RetrieveBinary method, on the other hand, constructs the file path based on the same StorePath registry value, the Site ID, and the BLOB ID passed in from SharePoint, and then uses the Win32 ReadFile function to retrieve the unstructured data, which the EBS provider copies into a new ILockBytes instance that it then passes back to SharePoint. Figure 4 illustrates how the EBS provider constructs the file path.


Figure 4 Constructing file paths for StoreBinary and RetrieveBinary operations in the sample EBS providers
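As a rough illustration of the path construction, the following Python sketch joins a store path, a Site ID, and a BLOB ID. The folder layout (one folder per site, with the BLOB GUID as the filename) and the D:\EbsStore location are assumptions for this example; the sample providers read the real location from the StorePath registry value and may arrange the path differently.

```python
import uuid
from pathlib import PureWindowsPath


def blob_path(store_path: str, site_id: uuid.UUID, blob_id: uuid.UUID) -> PureWindowsPath:
    """Build <StorePath>\\<SiteId>\\<BlobId> (assumed layout, for illustration)."""
    return PureWindowsPath(store_path) / str(site_id) / str(blob_id)


def new_blob_id() -> uuid.UUID:
    """Generate a fresh GUID to serve as both BLOB ID and filename."""
    return uuid.uuid4()


site = uuid.UUID("11111111-2222-3333-4444-555555555555")  # made-up Site ID
print(blob_path(r"D:\EbsStore", site, new_blob_id()))
```

Because StoreBinary and RetrieveBinary derive the path from the same three inputs, the BLOB ID alone is enough for SharePoint to hand back when it later requests the file.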

Building a Managed EBS Provider

Of course, SharePoint developers might prefer using familiar managed languages to build EBS providers, even though building managed EBS providers is not necessarily less complicated than building unmanaged providers due to the complexity of COM interoperability. Keep in mind that an application written in unmanaged code can only load one version of the common language runtime (CLR), so your code needs to work with the same version of the CLR that the rest of SharePoint is using; otherwise, you might end up with unexpected behavior. Also, you still must deal with unmanaged interfaces and the corresponding marshalling of parameters and buffers. Just compare SampleStore.cpp with SampleStore.cs in the companion material. There are no gains from using a managed language in terms of code structure or programming simplicity.

Moreover, be aware of 64-bit compatibility issues if you develop managed EBS providers on the x64 platform. Figure 5 shows a typical error that results from invalid COM registration settings on a development computer. If you enable the Register for COM Interop checkbox in the project properties in Visual Studio 2005 or Visual Studio 2008, you'll end up with COM registration settings for your provider in the registry under HKEY_CLASSES_ROOT\Wow6432Node\CLSID\<ProviderCLSID>. Visual Studio uses the 32-bit version of the Assembly Registration Tool (Regasm.exe) even on the x64 platform.


Figure 5 Due to invalid COM registration settings, a managed EBS provider could not be loaded

However, the 64-bit version of SharePoint cannot load a 32-bit COM server registered under the Wow6432Node, so you must manually register your managed EBS provider by using the 64-bit Regasm.exe version, located in the %WINDIR%\Microsoft.NET\Framework64\v2.0.50727 directory. For example, the command "%WINDIR%\Microsoft.NET\Framework64\v2.0.50727\Regasm.exe" ManagedProvider.dll creates the required registry settings for the managed sample provider under HKEY_CLASSES_ROOT\CLSID\<ProviderCLSID>. Another approach is to create a Setup program and mark the EBS provider for automatic COM registration.
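If you are unsure which registry view holds your provider's registration, a small diagnostic script can help. The following Python sketch is an assumption-laden convenience, not part of the companion material: it uses the standard winreg module (Windows only) to probe the 64-bit and 32-bit views of HKEY_CLASSES_ROOT\CLSID, and the helper names are hypothetical.

```python
def clsid_subkeys(clsid: str) -> dict:
    """Registry paths under HKEY_CLASSES_ROOT for both registry views."""
    braced = "{" + clsid.strip("{}").upper() + "}"
    return {
        "64-bit": "CLSID\\" + braced,
        "32-bit": "Wow6432Node\\CLSID\\" + braced,
    }


def registered_views(clsid: str) -> list:
    """Return the registry views ("64-bit", "32-bit") that contain the CLSID.

    Windows only: winreg is imported here so clsid_subkeys stays portable.
    """
    import winreg

    braced = "{" + clsid.strip("{}").upper() + "}"
    views = {"64-bit": winreg.KEY_WOW64_64KEY, "32-bit": winreg.KEY_WOW64_32KEY}
    found = []
    for name, flag in views.items():
        try:
            # The WOW64 flag selects the registry view explicitly.
            key = winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, "CLSID\\" + braced,
                                 0, winreg.KEY_READ | flag)
            winreg.CloseKey(key)
            found.append(name)
        except FileNotFoundError:
            pass  # not registered in this view
    return found
```

A result of ["32-bit"] on an x64 SharePoint server would match the situation described above, where the provider was registered only under the Wow6432Node.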

Remember also that managed EBS providers come with significantly more overhead and performance penalties than their unmanaged ATL counterparts. You can see this if you compare the COM registration settings in the registry. As the InProcServer32 key reveals, the COM runtime loads unmanaged EBS provider DLLs directly, while managed EBS providers rely on Mscoree.dll as the in-proc server, which is the core engine of the CLR. So, for managed providers, the COM runtime loads the CLR and then the CLR loads the EBS provider assembly as registered under the Assembly key and creates a COM Callable Wrapper (CCW) proxy to handle the interaction between the unmanaged SharePoint client (Owssvr.dll) and the managed EBS provider.

Keep in mind that the unmanaged SharePoint server does not directly interact with your managed provider. It's the CCW that marshals parameters, calls the managed methods, and handles HRESULTs. This indirection is especially apparent in the different return types of managed methods in comparison to unmanaged methods. Unmanaged methods return HRESULTs to indicate success or failure, while managed methods are supposed to have the void return type. So don't return explicit HRESULTs in managed code. You must raise system or user-defined exceptions in response to error conditions. If a managed method completes without an exception, the CCW automatically returns S_OK to the unmanaged client.

On the other hand, if a managed method raises an exception, the CCW maps error codes and messages to HRESULTs and error information. The CCW implements various error-handling interfaces for this purpose, such as ISupportErrorInfo and IErrorInfo, but SharePoint does not take advantage of these interfaces. EBS providers must implement their own error reporting through the Windows event log, SharePoint diagnostic logs, trace files, or other means. SharePoint only expects the HRESULT values S_OK for success and E_FAIL for any error. You can use the Marshal.ThrowExceptionForHR method to return E_FAIL to SharePoint, as demonstrated in SampleStore.cs.
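The error-handling contract can be illustrated with a loose Python analogy. This is not how the CLR or the CCW is implemented; it only shows the pattern of logging the details yourself and reducing the outcome to one of the two codes SharePoint expects. The wrapper name and logger name are hypothetical.

```python
import logging

S_OK = 0x00000000     # standard COM success code
E_FAIL = 0x80004005   # standard COM unspecified-failure code

log = logging.getLogger("ebs_provider")  # hypothetical logger name


def call_as_hresult(method, *args):
    """Run a provider method and reduce the outcome to S_OK or E_FAIL."""
    try:
        method(*args)
    except Exception:
        # The caller only sees the code, so the details must be logged here.
        log.exception("EBS provider call failed: %s",
                      getattr(method, "__name__", repr(method)))
        return E_FAIL
    return S_OK
```

The point of the analogy is that the caller learns nothing beyond success or failure, so any diagnostic detail has to be written to a log before the error code is returned.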

Registering an EBS Provider in SharePoint

Easily the most confusing section on ISPExternalBinaryProvider in the WSS 3.0 SDK is the topic "Installing and Configuring Your BLOB Provider." At the time of this writing, this section was filled with misleading information and errors. Even the Windows PowerShell commands were incorrect. If you assign the EBS provider to $yourProviderConfig and afterwards use $providerConfig.ProviderCLSID, don't be surprised when you receive an error stating that $providerConfig doesn't exist. Of course, you won't even reach this point because the Active and ProviderCLSID properties aren't part of the ISPExternalBinaryProvider interface. These mysterious properties belong to a dual interface that is not covered in the documentation. Just for fun, I implemented a sample version in both unmanaged and managed code, but your ISPExternalBinaryProvider implementation does not require these proprietary properties at all.

The ProviderCLSID property might be handy, but the CLSID is also available in the registry if you search for the ProgID, such as UnmanagedProvider.SampleStore or ManagedProvider.SampleStore, and you can also find the CLSIDs in the code files SampleStore.rgs and SampleStore.cs. As mentioned earlier, setting the ExternalBinaryStoreClassId property of the local SPFarm object to the CLSID registers the EBS provider. Setting the ExternalBinaryStoreClassId property of the local SPFarm object to an empty GUID ("00000000-0000-0000-0000-000000000000") removes the EBS provider registration. Don't forget to call the SPFarm object's Update method to save the changes in the configuration database and restart Internet Information Services (IIS). The following code listing illustrates how to accomplish these tasks in Windows PowerShell:

# Load the SharePoint object model ([void] suppresses the assembly output)
[void][System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SharePoint')
$farm = [Microsoft.SharePoint.Administration.SPFarm]::Local

# Registering the CLSID of an EBS provider
$farm.ExternalBinaryStoreClassId = "C4A543C2-B7DB-419F-8C79-68B8842EC005"
$farm.Update()
IISRESET

# Removing the EBS provider registration
$farm.ExternalBinaryStoreClassId = "00000000-0000-0000-0000-000000000000"
$farm.Update()
IISRESET

Implementing Garbage Collection

Another section in the WSS 3.0 SDK featuring mysterious components and critical code snippets is titled "Implementing Lazy Garbage Collection." At the time of this writing, this section contained references to another mysterious Utility class with DirFromSiteId and FileFromBlobid methods as well as an incorrect assignment of Directory.GetFiles results to a FileInfo array, but let's not be too demanding on WSS 3.0 documentation quality. The DirFromSiteId and FileFromBlobid helper methods reveal their purpose through their names and the incorrect FileInfo array is easily replaced with a string array, or you can replace the Directory.GetFiles method with a call to the GetFiles method of a DirectoryInfo object. The Garbage Collector sample program in the companion material uses the DirectoryInfo approach and follows the suggested sequence of steps for garbage collection.

An important deviation of the Garbage Collector sample from the SDK explanations concerns the handling of timing conditions. This is a critical issue because timing conditions can lead to misidentification and deletion of valid files during garbage collection. Figure 6 illustrates the approach recommended in the WSS 3.0 SDK for identifying orphaned files: enumerate all BLOB files in the EBS store, and then remove from this BLOB list every reference that is still present in the content database, as indicated through the site's ExternalBinaryIds collection. The remaining references in the BLOB list are supposed to indicate orphaned files that should be deleted.


Figure 6 Misidentification of a valid BLOB as orphaned due to a timing condition

However, the EBS provider must, of course, first finish writing BLOB data before it can return a BLOB ID to SharePoint. Depending on network bandwidth and other conditions, I/O performance can fluctuate. So, there is a chance that the EBS provider creates a new BLOB, which then appears in your BLOB list, but finishes writing the BLOB data only after you have read the ExternalBinaryIds collection, so the BLOB ID is not yet present in this collection. Accordingly, the reference to the new BLOB remains in the orphaned BLOB list, and if you purge the orphaned BLOBs at this point, you accidentally delete a valid content item and lose data! In order to avoid this problem, the sample Garbage Collector checks the file creation time and adds only those items to the BLOB list that are more than one hour old.
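The age check can be sketched in Python as follows. This is not the companion Garbage Collector, only a minimal model of the same idea under stated assumptions: BLOB files sit in one folder per site with the BLOB ID as the filename, get_referenced_ids is a hypothetical callable standing in for reading the site's ExternalBinaryIds collection, and the file modification time is used as a portable stand-in for the creation time.

```python
import time
from pathlib import Path

MIN_AGE_SECONDS = 3600  # one hour, the threshold used by the sample


def find_orphans(site_dir, get_referenced_ids, min_age=MIN_AGE_SECONDS, now=None):
    """Return BLOB files that are old enough and no longer referenced.

    get_referenced_ids is a hypothetical callable that returns the BLOB IDs
    still present in the content database for this site.
    """
    now = time.time() if now is None else now
    # Step 1: list candidate files, skipping anything written recently.
    candidates = {
        p.name: p
        for p in Path(site_dir).iterdir()
        if p.is_file() and now - p.stat().st_mtime > min_age
    }
    # Step 2: only now read the IDs the content database still references.
    referenced = set(get_referenced_ids())
    # Step 3: whatever remains has no reference and is treated as orphaned.
    return sorted(path for name, path in candidates.items() if name not in referenced)
```

A file that is still being written, or that was written shortly before the reference list was read, never enters the candidate list, so it cannot be mistaken for an orphan. The one-hour margin is a policy choice; a slower storage back end might call for a longer one.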

Conclusion

By integrating an external storage solution with SharePoint, you can increase storage efficiency, system performance, and scalability of a SharePoint farm. Another advantage is that this forces users to go through SharePoint to access unstructured content. Pulling data out of the content databases via direct SQL Server connections only yields binary BLOB identifiers instead of the actual files. However, EBS providers also have drawbacks due to the complexity of maintaining integrity between metadata in the SharePoint farm's content databases and the external BLOB store.

In order to integrate SharePoint with an external storage solution, you must build an EBS provider, which is a COM server that implements the ISPExternalBinaryProvider interface with its StoreBinary and RetrieveBinary methods. You can create unmanaged and managed EBS providers, but be aware of performance and compatibility issues if you decide to use managed code. Also keep in mind that the ISPExternalBinaryProvider interface does not include a DeleteBinary method. You must explicitly remove orphaned BLOBs through lazy garbage collection, and be careful to avoid timing conditions that can lead to the accidental deletion of valid BLOB items.

Pav Cherny is an IT expert and author specializing in Microsoft technologies for collaboration and unified communication. His publications include white papers, product manuals, and books with a focus on IT operations and system administration. Pav is President of Biblioso Corporation, a company that specializes in managed documentation and localization services.