In: SQL Server|SSIS
Written by: Taylor Gerring
If you’ve ever wanted to mine Wikipedia data, it would be possible – but difficult – to scrape the whole site. Instead of performing such a slow & arduous operation, the Wikimedia Foundation has provided the contents for free, in a downloadable format. These exports can then be loaded and used for a multitude of reasons, including personal use.
I’ve created an SSIS package that will import the articles and pages into a SQL Server 2005 database. To do this, you’ll first need to gather a few files:
You’ll need approximately 75GB of free space – 50GB for the database and 20GB for the XML file. Also, this will likely take several hours, if not longer. If you have separate drive spindles, it would certainly help to separate the XML and database files. My example uses C:\wikipedia\ as the working folder; if you prefer another location, we’ll configure it later. If not, this is what the structure should look like:
Note: If you’re planning on importing the XML file into a remote server, I highly recommend performing all these operations on the server itself through Remote Desktop. Aside from having to re-transfer the gigantic XML dump, debugging SSIS packages is much easier when working locally.
select * from dbo.vw_Articles where title = 'Green Day'
Getting this to work took a while of tweaking, but there are a few highlights I’d like to point out.
XSD: The provided XML Schema Definition file does not contain any information about the intended length of the string data. Fortunately, through testing, I was able to shrink some of those sizes down, although they do not strictly conform to the official database schema. Specifically, I have shrunk page.restrictions and text.space from nvarchar(255) to nvarchar(50). Most other items conform as close as possible. In addition to these, I had to update text.text to a nvarchar(max). SSIS initially suggested an nvarchar(255), but articles are obviously much longer than this. To perform these changes, right-click the XML Source (named PageArticles) in the Import XML Data Flow Task and select “Show Advanced Editor…”
Expand out each of the changed columns (both External and Output) and update the DataType. For example, SQL Server-specifc nvarchar(max) is a more general “Unicode text stream [DT_NTEXT]” in SSIS. For the others, just update the length from 255 to 50.
There are many more fields in the XML file than I’ve decided to import. Unfortunately, I can’t just turn them off completely, lest SSIS complains. Instead I have chosen to suffer the lesser fate of “Warning”. Additionally, I changed the Error Output to “Ignore failure” on Error. This screen can be accessed by double-clicking the XML Source, PageArticles, then selecting the Error Output page.
SSIS is very picky about metadata, making this a somewhat difficult project to get running, however, it is actually running. This could be further extended with an automated download and increased amount of data imported, but for now it serves its purpose.
I know not everyone will get this on first run, so if you have a problem, please leave a comment below and I’ll do my best to answer them in a timely manner.

7 Responses to Import Wikipedia articles into SQL Server with SSIS
Sergio
May 15th, 2009 at 8:54 pm
Hi,
While I download the 4.x GB wikipedia file, I’m trying to reproduce this simple extract and load scenario with one of my own files (approx. 450 MB). However, I’m not getting the “streamlined” behavior from your image (all yellow blocks). SSIS starts by only processing the XML Source and after a while it crashes with an OutOfMemory exception, it never starts processing the destination.
I can’t seem to find any options for this in the XML source properties or editors. I’m not doing any transformations.
Any particular options I should look for?
Thanks in advance
Taylor Gerring
May 18th, 2009 at 10:53 am
@Sergio
Did you ever try the Wikipedia XML file itself? I’m curious if you still get the OutOfMemory exception.
Sergio
May 18th, 2009 at 12:33 pm
Yes, I tried and it worked like a charm. I even recreated the whole Wikipedia example from scratch to see if I was missing something, but it also worked.
I’ve read in a few forums the problem might be in the XSD file, but I still haven’t found the culprit. I also tried letting SSIS generate an XSD from (a small sample) of my XML file, but that didn’t work either. I’ll play around with the XSD a bit more to see if I can make it work.
Balaji L
June 5th, 2009 at 4:17 am
Hi, Not able to download the xml file (enwiki-latest-pages-articles.xml.bz2). Seems to be URL is not working. can you pls post it again.
Taylor Gerring
June 5th, 2009 at 8:08 am
@Balaji L
All links are working for me
Benny
September 29th, 2009 at 4:36 am
Thanks a lot for this tutorial. I’m even more curious to add some other advanced interconnectivity into the SQL server but don’t know how. For example, I’d like to have all the categories information of each article be preserved in my local database. Another point is that I’d hope to import two languages of wikipedia into the local database, and be able to have interconnection for the articles talking about the same thing(just like the column “languages” on the left side of an wikipedia article). Would you please give some advice on the two points? Thanks!!
Dapinder
February 18th, 2010 at 5:47 am
Hi,
Thanks for the article . This is informative. What I am looking for right now is an article which can help me to handle a xml schema using user defined data types (i.e a type defined using xs:simpletype) . SSIS does not picks up data defined using user defined type. I tried a lot and it seems everytime I miss something inthe implementation. Also it may be a case that SSIS does not support user defined types. can you help me in doing it ? Any suggestion would be of great help.