Import Wikipedia articles into SQL Server with SSIS

In: SQL Server| SSIS


26 Jan 2009

If you’ve ever wanted to mine Wikipedia data, it would be possible – but difficult – to scrape the whole site. Instead of performing such a slow & arduous operation, the Wikimedia Foundation has provided the contents for free, in a downloadable format. These exports can then be loaded and used for a multitude of reasons, including personal use.

Ingredients

I’ve created an SSIS package that will import the articles and pages into a SQL Server 2005 database. To do this, you’ll first need to gather a few files:

You’ll need approximately 75GB of free space – 50GB for the database and 20GB for the XML file. Also, this will likely take several hours, if not longer. If you have separate drive spindles, it would certainly help to separate the XML and database files. My example uses C:\wikipedia\ as the working folder; if you prefer another location, we’ll configure it later. If not, this is what the structure should look like:

Directory Listing

Directory Listing

Recipe

Note: If you’re planning on importing the XML file into a remote server, I highly recommend performing all these operations on the server itself through Remote Desktop. Aside from having to re-transfer the gigantic XML dump, debugging SSIS packages is much easier when working locally.

  1. Connect to the SQL Server database and run WikipediaImport\WikipediaImport\DatabaseCreate.sql
    1. This creates the database (cleverly named “Wikipedia”) on C:\wikipedia\. If you want the data and log files located elsewhere, find 50GB of free space and update the CREATE DATABASE statement.
    2. Because we know the database is going to grow immediately, I’ve told the script to allocation 40GB for data and 10GB for log, so this step may take a while to run.
  2. Open WikipediaImport\WikipediaImport.sln in Visual Studio 2005
  3. Enable the Variables window if it is not already visible
    1. Select Data Flow
    2. Select the View menu
    3. Select Other Windows
    4. Select Variables
      Enable Variables Menu

      Enable Variables Menu

  4. If the working files were placed somewhere besides C:\wikipedia\, you can configure that in the Variables window. Be sure to update both PageArticlesXML and PageArticlesXSD

    Set Variable Values

    Set Variable Values

  5. Additionally, if you’re not importing to localhost, configure the database connection variable (named DatabaseConnection)
  6. Verify there are no warnings or errors and build the solution (Ctrl + Shift + B or Build?Build WikipediaImport)
  7. If the build succeeds, go ahead and run it (F5 or Debug?Start Debugging)
  8. If all goes well, the XML file should now be streaming into the database. This will likely take hours, even with a fast RAID. Notice that the file only requires a single pass, rather than scanning it once per table.

    Import Progress

    Import Progress

  9. Switch back to SSMS and run WikipediaImport\WikipediaImport\IndexCreate.sql
    1. This step is technically optional, but is going to help speed up your queries significantly
    2. If we had created the indexes before the import, the import would have been even slower
    3. This will take a while!
  10. Run a test query
    SELECT * FROM dbo.vw_Articles WHERE title = 'Green Day'

Behind the Scenes

Getting this to work took a while of tweaking, but there are a few highlights I’d like to point out.

Data Types

XSD: The provided XML Schema Definition file does not contain any information about the intended length of the string data. Fortunately, through testing, I was able to shrink some of those sizes down, although they do not strictly conform to the official database schema. Specifically, I have shrunk page.restrictions and text.space from nvarchar(255) to nvarchar(50). Most other items conform as close as possible. In addition to these, I had to update text.text to a nvarchar(max). SSIS initially suggested an nvarchar(255), but articles are obviously much longer than this. To perform these changes, right-click the XML Source (named PageArticles) in the Import XML Data Flow Task and select “Show Advanced Editor…”

Advanced Editor Menu

Advanced Editor Menu

Expand out each of the changed columns (both External and Output) and update the DataType. For example, SQL Server-specifc nvarchar(max) is a more general “Unicode text stream [DT_NTEXT]” in SSIS. For the others, just update the length from 255 to 50.

Set Data Types

Set Data Types

Extraneous information

There are many more fields in the XML file than I’ve decided to import. Unfortunately, I can’t just turn them off completely, lest SSIS complains. Instead I have chosen to suffer the lesser fate of “Warning”. Additionally, I changed the Error Output to “Ignore failure” on Error. This screen can be accessed by double-clicking the XML Source, PageArticles, then selecting the Error Output page.

Configure Error Output

Configure Error Output

Wrap-up

SSIS is very picky about metadata, making this a somewhat difficult project to get running, however, it is actually running. This could be further extended with an automated download and increased amount of data imported, but for now it serves its purpose.

I know not everyone will get this on first run, so if you have a problem, please leave a comment below and I’ll do my best to answer them in a timely manner.

  • Share/Bookmark

7 Responses to Import Wikipedia articles into SQL Server with SSIS

Avatar

Sergio

May 15th, 2009 at 8:54 pm

Hi,

While I download the 4.x GB wikipedia file, I’m trying to reproduce this simple extract and load scenario with one of my own files (approx. 450 MB). However, I’m not getting the “streamlined” behavior from your image (all yellow blocks). SSIS starts by only processing the XML Source and after a while it crashes with an OutOfMemory exception, it never starts processing the destination.

I can’t seem to find any options for this in the XML source properties or editors. I’m not doing any transformations.

Any particular options I should look for?

Thanks in advance

Avatar

Taylor Gerring

May 18th, 2009 at 10:53 am

@Sergio

Did you ever try the Wikipedia XML file itself? I’m curious if you still get the OutOfMemory exception.

Avatar

Sergio

May 18th, 2009 at 12:33 pm

Yes, I tried and it worked like a charm. I even recreated the whole Wikipedia example from scratch to see if I was missing something, but it also worked.

I’ve read in a few forums the problem might be in the XSD file, but I still haven’t found the culprit. I also tried letting SSIS generate an XSD from (a small sample) of my XML file, but that didn’t work either. I’ll play around with the XSD a bit more to see if I can make it work.

Avatar

Balaji L

June 5th, 2009 at 4:17 am

Hi, Not able to download the xml file (enwiki-latest-pages-articles.xml.bz2). Seems to be URL is not working. can you pls post it again.

Avatar

Taylor Gerring

June 5th, 2009 at 8:08 am

@Balaji L

All links are working for me

Avatar

Benny

September 29th, 2009 at 4:36 am

Thanks a lot for this tutorial. I’m even more curious to add some other advanced interconnectivity into the SQL server but don’t know how. For example, I’d like to have all the categories information of each article be preserved in my local database. Another point is that I’d hope to import two languages of wikipedia into the local database, and be able to have interconnection for the articles talking about the same thing(just like the column “languages” on the left side of an wikipedia article). Would you please give some advice on the two points? Thanks!!

Avatar

Dapinder

February 18th, 2010 at 5:47 am

Hi,

Thanks for the article . This is informative. What I am looking for right now is an article which can help me to handle a xml schema using user defined data types (i.e a type defined using xs:simpletype) . SSIS does not picks up data defined using user defined type. I tried a lot and it seems everytime I miss something inthe implementation. Also it may be a case that SSIS does not support user defined types. can you help me in doing it ? Any suggestion would be of great help.

Comment Form

  • ERNST: HOW TO SYNC MULTIPLE GOOGLE CALENDARS IN IPHONE: To show Multiple Google calendars in iPhone, ope [...]
  • linuxfueled: WINDOWS 7 users read..... If your main OS is Windows 7 (the host machine) and you just cannot get [...]
  • Kurt Hitchen: A friend has let me "see" there calendar in Google Calendar, but because their calendar is set to pr [...]
  • bugoy8: https://www.google.com/calendar/dav/nba_13_%4cos+%41ngeles+%4cakers#sports@group.v.calendar.google.c [...]
  • Speaker: Wasn't able to understand how to access the shared folder in the quest OS from this tutorial. Hmmm [...]


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States.