SP2010 Scalability (3 of 4): Remote BLOB Storage

Binary Large Objects, or BLOBs as the SQL types like to call them, are the byte arrays that represent documents and other files in SharePoint.  Typically, they are stored in the SharePoint content database.  The reality is, the ECM industry has known for decades that an RDBMS is not the best place to store BLOBs.  SQL database storage needs to be high IOPS and low latency... translated... EXPENSIVE storage.  It's much more efficient if we are able to store the BLOBs on lower cost, possibly even archival-class storage while we continue to invest in high performance storage for the structured content metadata.

As of SP2007 SP1, it was possible to take advantage of an External BLOB Storage (EBS) API to get the BLOBs out of SQL Server.  Unfortunately, this method is not transactionally consistent, and it results in a high number of orphaned BLOBs in the BLOB store because a new BLOB is stored (the old one is not replaced) every time a document is updated.  We must then rely on event receivers and lazy garbage collection to clean up the orphans.  In short, EBS was a temporary solution from Microsoft.  The future is SQL RBS support in SharePoint 2010.  Fortunately, Microsoft will also provide a PowerShell-based solution for migrating from EBS to RBS!
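To make the orphan problem concrete, here is a minimal sketch in Python.  It is NOT the EBS API.  The class `AppendOnlyBlobStore`, the `content_db` dictionary and the `save_document` helper are all made-up stand-ins that only assume what is described above: an update stores a new BLOB rather than replacing the old one.

```python
import uuid


class AppendOnlyBlobStore:
    """Hypothetical external store: every save gets a brand new BLOB ID."""

    def __init__(self):
        self.blobs = {}

    def save(self, data: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self.blobs[blob_id] = data
        return blob_id


store = AppendOnlyBlobStore()
content_db = {}  # document name -> current BLOB ID (stand-in for the content database)


def save_document(name: str, data: bytes) -> None:
    # An update stores a NEW BLOB and repoints the document.
    # Nothing removes the previous BLOB, so it is left behind.
    content_db[name] = store.save(data)


save_document("report.docx", b"draft 1")
save_document("report.docx", b"draft 2")
save_document("report.docx", b"final")

# Anything in the store that no document points at is an orphan.
orphans = set(store.blobs) - set(content_db.values())
print(len(store.blobs), "blobs stored,", len(orphans), "orphaned")
```

Three saves of one document leave three BLOBs in the store and only one of them is still referenced.  Multiply that by every edit in a busy farm and you can see why the cleanup job matters.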

So I can just see the SP2010 box cover art... "Now with RBS!"  When we enable RBS this is what we get:

  1. Transactional consistency ensures that when we get a BLOB ID back from the RBS provider, we are guaranteed storage.  It also allows for traditional update capabilities.
  2. Transactional consistency also allows Write Once Read Many (WORM) mode devices to "VETO" a delete or modify operation.  This is clutch for financial institutions that are legally not allowed to delete financial records.  So they might use a storage platform such as EMC Centera or Hitachi HCAP in a sort of "create, but don't delete" mode.  If these vendors choose to write an RBS provider for their devices (and they probably will), then the actual storage subsystem itself can prevent SharePoint from allowing a document to be deleted.
  3. While orphan cleanup is much less of a concern with RBS, it still needs to be managed.  The good news is that because RBS is managed through SQL tables, RBS can take advantage of indexes to actually "query" the difference between what is in the BLOB store and what is in the SharePoint content databases.  This is a HUGE improvement compared to spinning through the object model of an extremely large content database.
  4. RBS is completely transparent to the SharePoint API.  Nothing changes.  So existing custom and 3rd Party code will continue to function as expected.
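The orphan "query" from point 3 can be sketched as a plain set difference in SQL.  The example below uses Python's built-in sqlite3 module with an in-memory database.  The table names `blob_store` and `doc_streams` and their columns are invented for illustration and are not the real RBS or SharePoint schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, NOT the real RBS schema.
    CREATE TABLE blob_store (blob_id TEXT PRIMARY KEY);   -- what the provider holds
    CREATE TABLE doc_streams (doc_id INTEGER PRIMARY KEY, blob_id TEXT);
    CREATE INDEX ix_doc_streams_blob ON doc_streams (blob_id);
""")
conn.executemany("INSERT INTO blob_store VALUES (?)",
                 [("b1",), ("b2",), ("b3",), ("b4",)])
conn.executemany("INSERT INTO doc_streams VALUES (?, ?)",
                 [(1, "b1"), (2, "b3")])


def find_orphans(connection):
    """Return BLOB IDs that exist in the store but that no document references."""
    rows = connection.execute("""
        SELECT s.blob_id
        FROM blob_store AS s
        LEFT JOIN doc_streams AS d ON d.blob_id = s.blob_id
        WHERE d.blob_id IS NULL
        ORDER BY s.blob_id
    """)
    return [row[0] for row in rows]


orphans = find_orphans(conn)
print(orphans)
```

The point is that the database engine answers the question with one indexed join.  Nobody has to walk every list item through the object model to find out which BLOBs are no longer referenced.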

So now we have the facility to get the binary data out of the content database.  That means that we're looking at having only metadata present.  SWEET!  That whole 50GB, 100GB, recommended maximum database size discussion becomes a whole lot less important!  Instead we'll be concentrating more on the recommendations of list/library sizes and that's a good thing!

Now that Microsoft is targeting 50 million items in a library, site taxonomy can look a little more like what we're used to seeing in your standard collaboration implementation instead of having to split up similar documents into multiple content databases.

Just a quick last note... RBS will require SQL 2008 Enterprise Edition.  So keep that in mind if SP2010 and RBS might possibly be in the cards for your organization.

In Part 4 of my SP2010 Scalability series, I will look into the scalability benefits of In Place Records Management.  Microsoft has removed a HUGE scalability bottleneck here! 

SP2010 Scalability (2 of 4): SharePoint Search

For the last several years, I've worked on several projects that stretch the recommended limits regarding the amount of content that SharePoint can handle.  Back in December of 2007 I started on an interesting scalability journey with a couple of awesome guys at Microsoft.  The first, Paul Learning, is a quality MCS SharePoint guy out of Detroit.  The second, Andy Hopkins, served as our red-tape bulldozer.  The three of us worked to put a small server room full of Fujitsu blades and storage arrays to good use in order to prove that SharePoint could do 50 million documents.

The result of our efforts was a very lengthy whitepaper.  I'll sum it up in the following two sentences.  The first is that SharePoint could fairly easily be architected to handle 50+ million documents consisting of over 5 terabytes of data.  The second is that search configuration and crawl processing were BY FAR our greatest challenges.  We were successful, but it took far longer to crawl 50 million documents than we would have liked.

So without further lame commentary, I'll document right now that I believe the most important new feature of SP2010 from an ECM perspective is a highly scalable search subsystem!  Check out some of these new capabilities:

  1. We can now have multiple Index Servers!   Sweet!  No more single point of failure and the scale out story gets WAY better!
  2. We can now divide the content index into multiple index partitions.  When implemented with multiple query servers we get benefits of redundancy and parallel performance.
  3. The crawl management and the property store data tables have been split into separate databases.  But they took it further! We can have multiple of each!  This opens doors to scale out even further with respect to I/O and storage as well as possibly multiple SQL Servers to handle different search subsystem component databases.
  4. Each index server can be configured to run multiple crawlers.  Multiple crawlers can crawl content in parallel!  So the process of spinning through the entire corpus is no longer a linear style operation!
  5. Index servers are now STATELESS.  The crawlers build the content index and propagate directly to the query servers.  So guess what...  If an Index Server bombs, no big deal.  Just stand up a new one and pick up where you left off!
  6. All of these improvements result in Microsoft's new target number of being able to crawl 100 million content items! 
  7. We can go well beyond 100 million with FAST for SharePoint! 
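Here is a toy sketch of the two ideas from points 2 and 4: splitting the index into partitions and letting several crawlers work through the corpus in parallel.  It is written in Python and is only an illustration.  The hash-based `partition_for` rule, the `crawl` function and the counts are my own assumptions, not how the SP2010 search service actually assigns documents or schedules crawlers.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_PARTITIONS = 3   # assumed number of index partitions
NUM_CRAWLERS = 4     # assumed number of parallel crawlers


def partition_for(doc_id: str) -> int:
    # Stable hash so a given document always lands in the same partition.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS


def crawl(doc_ids):
    # Stand-in for one crawler: "index" its slice of the corpus
    # into a partial result keyed by partition number.
    partial = {p: [] for p in range(NUM_PARTITIONS)}
    for doc_id in doc_ids:
        partial[partition_for(doc_id)].append(doc_id)
    return partial


corpus = [f"doc-{i}" for i in range(1000)]

# Give each crawler its own slice so the crawl is no longer one linear pass.
slices = [corpus[i::NUM_CRAWLERS] for i in range(NUM_CRAWLERS)]

with ThreadPoolExecutor(max_workers=NUM_CRAWLERS) as pool:
    partials = list(pool.map(crawl, slices))

# Merge the crawler output into the index partitions.
index_partitions = {p: [] for p in range(NUM_PARTITIONS)}
for partial in partials:
    for p, docs in partial.items():
        index_partitions[p].extend(docs)

print({p: len(docs) for p, docs in index_partitions.items()})
```

Each crawler only touches its own slice, and each partition only holds its own share of the documents.  That is the basic shape that lets both the crawl and the query side scale out by adding servers.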

Based on what I've seen so far, the search subsystem improvements are very promising. I believe that 100 million number is totally legit and possibly even on the conservative side!  Time will tell.

One thing is clear already. The FAST Search team has definitely had a positive impact on the already excellent Enterprise Search team at Microsoft.  In fact, some of the architecture in the new SharePoint search platform is remarkably similar to how FAST is designed!

In Part 3 of my SP2010 Scalability series, I'll talk about how Remote BLOB Storage is going to be a HUGE game changer in the SharePoint Enterprise Document Management story.

 

SP2010 Scalability (1 of 4): Introduction

I have been very fortunate over the last several years in that I've had many opportunities to architect many extremely high scale SharePoint systems.  Everything from your standard 3 million document Imaging Repository to systems with tens of millions and even more than 100 million documents (thanks to FAST ESP!).

As I look back on SharePoint 2003 and even to existing SharePoint 2007 solutions, there have definitely been several challenges as we design systems that can handle the millions of documents we throw at them.  So it is with great pleasure that I am able to present my 4 favorite improvements in SP2010 that handle virtually ALL of my previous challenges.

Oh yeah... For anyone who might have read this series introduction when I initially released it, I was originally going to include a post regarding the Managed Metadata service.  Then it occurred to me that while this is an important feature for ECM in SharePoint 2010, it doesn't really improve performance.  So... I yanked it.  I'll do a separate post on Managed Metadata at some point.

So without further delay, my four-part (ok... three if you discount the introduction!) series on SP2010 Scalability:

  1. Introduction
  2. SharePoint Search
  3. Remote BLOB Storage (RBS)
  4. In Place Records Management

I hope this provides hope for those of you who dream of massive SharePoint 2010 implementations like I do!

SPC2009 Decompression

Well, here I am, 24 hours past the SPC2009 ride.  Now I feel like I'm on a long slow decompression ascent (lame scuba reference) to the public launch of SP2010 some time next year.

Now that the SP2010 veil is lifted, I'm finally free to talk about some incredible new SP2010 features that are absolutely CLUTCH for scaling SharePoint to sizes we may only have dreamed of before!

I have all these ideas floating around for what I want to say.  So I started writing this post and it just plain grew too large.  So I decided to break it up into separate posts in an SP2010 Scalability Series.

In the coming weeks, I'll also be developing a few blog posts and possibly a whitepaper on Architecting SP2007 with SP2010 in mind.  The reality is that many organizations simply refuse to implement a greenfield SP2010 deployment or upgrade an existing SP2007 deployment for several months or even a year after SP2010 launches.  Other organizations simply can't wait to move their content off a legacy system due to maintenance costs.  So the intent is to help organizations that will definitely move to SP2010 at some point but need to start with SP2007 today.

Then, as we get closer to the SP2010 launch, I'll be releasing content around Designing an SP2010 Scalable Architecture.

So I guess it's time to jump on a new ride... the one that takes us to the launch of SP2010.