Import Wikipedia articles into SQL Server with SSIS

If you’ve ever wanted to mine Wikipedia data, it would be possible – but difficult – to scrape the whole site. Instead of performing such a slow & arduous operation, the Wikimedia Foundation has provided the contents for free, in a downloadable format. These exports can then be loaded and used for a multitude of reasons, including personal use.

Ingredients

I’ve created an SSIS package that will import the articles and pages into a SQL Server 2005 database. To do this, you’ll first need to gather a few files:

You’ll need approximately 75GB of free space – 50GB for the database and 20GB for the XML file. Also, this will likely take several hours, if not longer. If you have separate drive spindles, it would certainly help to separate the XML and database files. My example uses C:wikipedia as the working folder; if you prefer another location, we’ll configure it later. If not, this is what the structure should look like:

Directory Listing

Directory Listing

Recipe

Note: If you’re planning on importing the XML file into a remote server, I highly recommend performing all these operations on the server itself through Remote Desktop. Aside from having to re-transfer the gigantic XML dump, debugging SSIS packages is much easier when working locally.

  1. Connect to the SQL Server database and run WikipediaImportWikipediaImportDatabaseCreate.sql
    1. This creates the database (cleverly named “Wikipedia”) on C:wikipedia. If you want the data and log files located elsewhere, find 50GB of free space and update the CREATE DATABASE statement.
    2. Because we know the database is going to grow immediately, I’ve told the script to allocation 40GB for data and 10GB for log, so this step may take a while to run.
  2. Open WikipediaImportWikipediaImport.sln in Visual Studio 2005
  3. Enable the Variables window if it is not already visible
    1. Select Data Flow
    2. Select the View menu
    3. Select Other Windows
    4. Select Variables

      Enable Variables Menu

      Enable Variables Menu

  4. If the working files were placed somewhere besides C:wikipedia, you can configure that in the Variables window. Be sure to update both PageArticlesXML and PageArticlesXSD

    Set Variable Values

    Set Variable Values

  5. Additionally, if you’re not importing to localhost, configure the database connection variable (named DatabaseConnection)
  6. Verify there are no warnings or errors and build the solution (Ctrl + Shift + B or Build?Build WikipediaImport)
  7. If the build succeeds, go ahead and run it (F5 or Debug?Start Debugging)
  8. If all goes well, the XML file should now be streaming into the database. This will likely take hours, even with a fast RAID. Notice that the file only requires a single pass, rather than scanning it once per table.

    Import Progress

    Import Progress

  9. Switch back to SSMS and run WikipediaImportWikipediaImportIndexCreate.sql
    1. This step is technically optional, but is going to help speed up your queries significantly
    2. If we had created the indexes before the import, the import would have been even slower
    3. This will take a while!
  10. Run a test query
    select * from dbo.vw_Articles where title = 'Green Day'

Behind the Scenes

Getting this to work took a while of tweaking, but there are a few highlights I’d like to point out.

Data Types

XSD: The provided XML Schema Definition file does not contain any information about the intended length of the string data. Fortunately, through testing, I was able to shrink some of those sizes down, although they do not strictly conform to the official database schema. Specifically, I have shrunk page.restrictions and text.space from nvarchar(255) to nvarchar(50). Most other items conform as close as possible. In addition to these, I had to update text.text to a nvarchar(max). SSIS initially suggested an nvarchar(255), but articles are obviously much longer than this. To perform these changes, right-click the XML Source (named PageArticles) in the Import XML Data Flow Task and select “Show Advanced Editor…”

Advanced Editor Menu

Advanced Editor Menu

Expand out each of the changed columns (both External and Output) and update the DataType. For example, SQL Server-specifc nvarchar(max) is a more general “Unicode text stream [DT_NTEXT]” in SSIS. For the others, just update the length from 255 to 50.

Set Data Types

Set Data Types

Extraneous information

There are many more fields in the XML file than I’ve decided to import. Unfortunately, I can’t just turn them off completely, lest SSIS complains. Instead I have chosen to suffer the lesser fate of “Warning”. Additionally, I changed the Error Output to “Ignore failure” on Error. This screen can be accessed by double-clicking the XML Source, PageArticles, then selecting the Error Output page.

Configure Error Output

Configure Error Output

Wrap-up

SSIS is very picky about metadata, making this a somewhat difficult project to get running, however, it is actually running. This could be further extended with an automated download and increased amount of data imported, but for now it serves its purpose.

I know not everyone will get this on first run, so if you have a problem, please leave a comment below and I’ll do my best to answer them in a timely manner.

16 comments on “Import Wikipedia articles into SQL Server with SSIS

  1. BTW – will you update your solution to work with the latest XSD 0.8 file and the now massive 49GB XML of articles, for SQL Server 2012 – or 2008?

  2. How do you get the XML Source to stream the big (now 50GB) file, instead of trying to load it into memory first?????

  3. I am also using SQL Server 2012 and DatabaseCreate.sql works fine.

  4. Asim on said:

    Hi there,
    Thanks a lot for this great post!
    I’ve downloaded all required files and followed the steps, however, I was not able to run WikipediaImport.sln in Visual Studio 2012 :(
    Any advice please!

  5. @Kim The data is streamed into memory as needed instead of reading it all at once.

  6. Hi! How did you manage to load a 20GB XML file without having 20GB of RAM? (I assume you do not have)

  7. taylorgerring on said:

    @jamie

    The link to the SSIS solution is updated and should now work.

  8. Hi, the download link for the SSIS Package and db scripts doesn’t appear to be working, could you point me to where I could get these files?

  9. Thanks so much. you saved my day, I did it the hard way. SSIS is a very interesting topic to learn.

  10. Dapinder on said:

    Hi,

    Thanks for the article . This is informative. What I am looking for right now is an article which can help me to handle a xml schema using user defined data types (i.e a type defined using xs:simpletype) . SSIS does not picks up data defined using user defined type. I tried a lot and it seems everytime I miss something inthe implementation. Also it may be a case that SSIS does not support user defined types. can you help me in doing it ? Any suggestion would be of great help.

  11. Thanks a lot for this tutorial. I’m even more curious to add some other advanced interconnectivity into the SQL server but don’t know how. For example, I’d like to have all the categories information of each article be preserved in my local database. Another point is that I’d hope to import two languages of wikipedia into the local database, and be able to have interconnection for the articles talking about the same thing(just like the column “languages” on the left side of an wikipedia article). Would you please give some advice on the two points? Thanks!!

  12. @Balaji L

    All links are working for me

  13. Balaji L on said:

    Hi, Not able to download the xml file (enwiki-latest-pages-articles.xml.bz2). Seems to be URL is not working. can you pls post it again.

  14. Sergio on said:

    Yes, I tried and it worked like a charm. I even recreated the whole Wikipedia example from scratch to see if I was missing something, but it also worked.

    I’ve read in a few forums the problem might be in the XSD file, but I still haven’t found the culprit. I also tried letting SSIS generate an XSD from (a small sample) of my XML file, but that didn’t work either. I’ll play around with the XSD a bit more to see if I can make it work.

  15. @Sergio

    Did you ever try the Wikipedia XML file itself? I’m curious if you still get the OutOfMemory exception.

  16. Sergio on said:

    Hi,

    While I download the 4.x GB wikipedia file, I’m trying to reproduce this simple extract and load scenario with one of my own files (approx. 450 MB). However, I’m not getting the “streamlined” behavior from your image (all yellow blocks). SSIS starts by only processing the XML Source and after a while it crashes with an OutOfMemory exception, it never starts processing the destination.

    I can’t seem to find any options for this in the XML source properties or editors. I’m not doing any transformations.

    Any particular options I should look for?

    Thanks in advance

Leave a Reply

Your email address will not be published. Required fields are marked *

*

* Copy This Password *

* Type Or Paste Password Here *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>