DoxoEater2

A little C# application for extracting your documents from your Windows SharePoint Service 3.0 (WSS3) Database.  Builds upon the previous work of Plaxo, with some improvement needed for larger Document Library databases.  Works very well on our primarily Document Library WSS3 site which contains ~85,000 files in ~23,000 folders, totalling over 40 GB in size.

As part of our SharePoint Sever backup routine, I like to run this periodically to ensure we can extract our documents from SharePoint Document Library backup, even in a scenario when a functional SharePoint application can not be restored.

DoxoEater 2 - C# Source Code


using System;
using System.Collections.Generic;
using System.Text;
using System.Data;
using System.Data.SqlClient;
using System.IO;
 
namespace DoxoEater2
{
    class Program
    {
        static void Main(string[] args)
        {
 
            // WSS 3.0 default DB via SQL 2005 embedded edition.
            //string DBConnString = "Data Source=\\\\.\\pipe\\MSSQL$MICROSOFT##SSEE\\sql\\query;Database=WSS_Content;Trusted_Connection=yes;Timeout=2000;MultipleActiveResultSets=True";
            
            // Or localhost, WSS_Content, typical SQL connection style, TCP/IP listener.
            string DBConnString = "Data Source=localhost;Database=WSS_Content;Trusted_Connection=yes;Timeout=2000;MultipleActiveResultSets=True";
            string DocsQuery = "select [AllDocs].[Id], [AllDocs].[DirName], [AllDocs].[LeafName] from [AllDocs] where dirname <> '' order by dirname;";
 
            Console.WriteLine("START: " + DateTime.Now.ToUniversalTime() + " UTC");
 
            SqlConnection con;
 
            try // create a DB connection
            {   
                con = new SqlConnection(DBConnString);
                con.Open();
            }
            catch (Exception e) {
                Console.WriteLine(e.Message);
                return;
            }
 
            SqlCommand com = con.CreateCommand();
            com.CommandText = DocsQuery;
 
            // readers for SQL queries: 1) which docs? 2) featch binary data per DocID
            SqlDataReader reader, reader2;
 
            try {   // execute query
                reader = com.ExecuteReader();
            }
            catch (Exception e) {
                if (con != null) { con.Close(); }
                Console.WriteLine(e.Message); 
                return;
            }
 
            while (reader.Read()) { // rows of document ID, name and directory nodes
 
                // grab the file's directory and name
                Guid FileId = (Guid)reader["Id"];
                string DirName = (string)reader["DirName"];
                string LeafName = (string)reader["LeafName"];
           
                // create directory for the file if it doesn't yet exist
                if (!Directory.Exists(DirName))
                {
                    Directory.CreateDirectory(DirName);
                    Console.WriteLine("DIR: " + DirName);
                }
 
                // check if file already exists or not
                if (File.Exists(DirName + "/" + LeafName)) {
                    Console.WriteLine("ERROR: File Already Exists: " + DirName + "/" + LeafName);
                    continue;
                }
 
                SqlCommand com2 = con.CreateCommand();
                com2.CommandText = "select content from [AllDocStreams]  where  [AllDocStreams].[Id] = '" + FileId.ToString() + "';";
                Console.WriteLine("SQL: " + com2.CommandText.ToString());
 
                try
                {   // execute file fetch query
                    reader2 = com2.ExecuteReader(CommandBehavior.SequentialAccess);
                }
                catch (Exception e)
                {
                    if (con != null) { con.Close(); }
                    Console.WriteLine(e.Message);
                    return;
                }
 
                while(reader2.Read()) {
             
                    // create a filestream to spit out the file
                    FileStream fs = new FileStream(DirName + "/" + LeafName, FileMode.Create, FileAccess.Write);
                    BinaryWriter writer = new BinaryWriter(fs);
 
 
                    // depending on the speed of your network, you may want to change the buffer size (it's in bytes)
                    int bufferSize = 1048576;
                    long startIndex = 0;
                    long retval = 0;
                    byte[] outByte = new byte[bufferSize];
 
                    // grab the file out of the db one chunk (of size bufferSize) at a time
                    do
                    {
                        retval = reader2.GetBytes(0, startIndex, outByte, 0, bufferSize);
                        startIndex += bufferSize;
 
                        writer.Write(outByte, 0, (int)retval);
                        writer.Flush();
                    } while (retval == bufferSize);
 
                    // finish writing the file
                    Console.WriteLine("FILE: " + LeafName);
                    writer.Close();
                    fs.Close();
             
                }
                reader2.Close();
 
            }
 
            // close the DB connection and whatnots
            reader.Close();
            con.Close();
 
            Console.WriteLine("DONE: " + DateTime.Now.ToUniversalTime() + " UTC");
        }
    }
}

18 comments:

  1. Good Morning

    Can you tell me what exactly I would need to put in your code to get it to run and exact my documents for my SharePoint server environment? I know very broad question. Do I just need to change this line? string DBConnString = "Data Source=localhost;Database=WSS_Content;Trusted_Connection=yes;Timeout=2000;MultipleActiveResultSets=True";

    And it so can you give me an example of the Data Source parameter?

    Also where does it save the extracted files?

    Thanks
    KO

    ReplyDelete
  2. Hi KO. The specific details depend on where your SharePoint database resides. Uncomment the following line if you are using default WSS 3.0 embedded Windows SQL database.

    string DBConnString = "Data Source=\\\\.\\pipe\\MSSQL$MICROSOFT##SSEE\\sql\\query;Database=WSS_Content;Trusted_Connection=yes;Timeout=2000;MultipleActiveResultSets=True";

    If you are using full SQL or SQL Express, this line will apply instead:
    string DBConnString = "Data Source=localhost;Database=WSS_Content;Trusted_Connection=yes;Timeout=2000;MultipleActiveResultSets=True";

    I use this site whenever I need to figure out the exact syntax of a given SQL connection string. http://www.connectionstrings.com/

    The program saved the extracted file to your current directory.

    ReplyDelete
  3. Thanks Paul. I'll give that a try and let you know how things go. I appreciate your help.

    ReplyDelete
  4. Paul

    We are still having difficulties. Our environment SharePoint 2003, MS SQL Server 2002 on Windows 2003 Server SP2. We have multiple databases in the connected to the SQL server. For the parameter Database=WSS_Content, do I have to specifiy a specific database? If so would it just be Database=DATABASENAME??

    ReplyDelete
  5. Yes, replace Database=WSS_Content with the name of your SharePoint DB. DataSource=localhost assume the SQL DB is on the machine running the code. If not, replace with the server name or IP. Trusted_connection=yes, means it will connect to SQL using the credentials of the current Windows user. Some organization don't use Windows authentication for SQL server access and prefer to setup separate SQL user name and passwords. If so, the syntax for this is different as well.

    ReplyDelete
  6. Great piece of work. I've just done a full export of 6GB of legacy information within an hour of finding this page. Worked like a dream. Thanks!

    ReplyDelete
  7. I need to filter one specific document library which contain n number of folders and files. How do I filter from database

    ReplyDelete
  8. The AllDocs table contains SiteId column, which I believe will let your limit the query to Docs within specific Sites, but not sure how to include/exclude specific Document Libraries. Unfortunately the SP WSS DB structure is not officially documented and appears to not have any PK/FK relationships at the DB level, so understanding the model is a bit of trial and error. Referential integrity must be maintained by the application code itself I imagine. Sorry I couldn't offer more insight.

    ReplyDelete
  9. Hi Paul. Sorry if seems to be a silly question to you.

    When I run the program (I've given it the name extract.cs) from a given directory in this fashion:

    C:\Users\Administrator\Desktop\out>C:\Windows\Microsoft.NET\Framework\v2.0.50727\\csc extract.cs

    it does not generate errors nor extract content although I know there are files/documents in my WSS 3.0 database. Should I conclude that my database is corrupted or I'm doing something wrong, for example when I run it?

    I'm pretty sure that it connects properly to the WSS 3.0 windows internal specific content database I've specified since it is not complaining about connection issues.

    Benoit

    ReplyDelete
  10. Hi Paul -

    Have you ever seen a situation in which this script returns files with the size of 0 KB?

    This is happening to me, I am not sure if there is a setting I could tweak to fix this or something.

    Thanks!

    ReplyDelete
  11. Here's one potential reason why:
    By default, SharePoint stores even the uploaded file data inside the SQL database. As an performance enhancement, SharePoint also allows an advanced option where some upload files are actually stored on disk in a very cryptic format, called BLOBs. The files in the BLOB store still need to be extracted via SQL mean, they are not in the format of regular usable files on disk. More info on SharePoint use of MS SQL BLOB storage can be found here: https://msdn.microsoft.com/en-us/library/bb802976.aspx

    Basically, doxoeater2 was never designed to support files stored in external file system BLOB format as it’s not SharePoints primary of file storage.
    The code above would need to be refactored to consider the two potential locations of file storage per file and then develop the specific SQL syntax to extract from BLOB location when used.

    ReplyDelete
    Replies
    1. Well - one thing that I noticed is that your version of the script actually returned less 0 KB files then Plaxo's version. Not sure if that helps narrow anything down.

      Thanks,
      Eric

      Delete
  12. Yes, I was able to fix a few bug cases I found in my environment when I first started with the Plaxo code. Working with another user of the code I was able to identify external BLOB issue but I don't have a readily available sample WSS site to be able to develop the change test at the moment.

    ReplyDelete
    Replies
    1. I can run it on my site if you want me to help test.

      Delete
    2. Let's chat, shoot me an email via pmarchant@gmail.com

      Delete