If you’ve ever wanted to mine Wikipedia data, you could scrape the whole site – but that would be slow and arduous. Instead, the Wikimedia Foundation provides the full contents for free in a downloadable format. These exports can then be loaded and used for a multitude of purposes, including personal use.
I’ve created an SSIS package that will import the articles and pages into a SQL Server 2005 database. To do this, you’ll first need to gather a few files:
- Latest Wikipedia pages/articles dump (download enwiki-latest-pages-articles.xml.bz2, approximately 4.1GB at time of writing)
- MediaWiki XSD (originally linked from the MediaWiki Manual:XML Import page)
- SSIS Package & Database Scripts
You’ll need approximately 75GB of free space – 50GB for the database and around 20GB for the decompressed XML file, plus room for the compressed download. Also, this will likely take several hours, if not longer. If you have separate drive spindles, it would certainly help to put the XML and database files on different drives. My example uses C:\wikipedia as the working folder; if you prefer another location, we’ll configure it later. Either way, the working folder should contain the XML dump, the XSD, and the extracted SSIS package and database scripts.
Note: If you’re planning on importing the XML file into a remote server, I highly recommend performing all these operations on the server itself through Remote Desktop. Aside from having to re-transfer the gigantic XML dump, debugging SSIS packages is much easier when working locally.
- Connect to the SQL Server database and run WikipediaImport\WikipediaImportDatabaseCreate.sql
- This creates the database (cleverly named “Wikipedia”) on C:\wikipedia. If you want the data and log files located elsewhere, find 50GB of free space and update the CREATE DATABASE statement.
- Because we know the database is going to grow immediately, I’ve told the script to allocate 40GB for data and 10GB for log, so this step may take a while to run.
- Open WikipediaImport\WikipediaImport.sln in Visual Studio 2005
- Enable the Variables window if it is not already visible: select the Data Flow tab, then choose View → Other Windows → Variables
- If the working files were placed somewhere besides C:\wikipedia, you can configure that in the Variables window. Be sure to update both PageArticlesXML and PageArticlesXSD
- Additionally, if you’re not importing to localhost, configure the database connection variable (named DatabaseConnection)
- Verify there are no warnings or errors and build the solution (Ctrl+Shift+B or Build → Build WikipediaImport)
- If the build succeeds, go ahead and run it (F5 or Debug → Start Debugging)
- If all goes well, the XML file should now be streaming into the database. This will likely take hours, even with a fast RAID. Notice that the file only requires a single pass, rather than scanning it once per table.
- Switch back to SSMS and run WikipediaImport\WikipediaImportIndexCreate.sql
- This step is technically optional, but is going to help speed up your queries significantly
- If we had created the indexes before the import, the import would have been even slower
- This will take a while!
- Run a test query
select * from dbo.vw_Articles where title = 'Green Day'
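For a sense of what the index-creation step is doing, here is a sketch in T-SQL. The table and column names below are assumptions based on the MediaWiki schema, not the exact definitions from the included script, so treat WikipediaImportIndexCreate.sql as the authority.

```sql
-- Hypothetical sketch; the real statements live in
-- WikipediaImport\WikipediaImportIndexCreate.sql.
-- Indexing the page title lets lookups like the test query
-- seek directly to matching rows instead of scanning millions of pages.
CREATE NONCLUSTERED INDEX IX_Page_Title
    ON dbo.Page (Title);

-- The test query above then becomes an index seek:
SELECT * FROM dbo.vw_Articles WHERE title = 'Green Day';
```

Creating the index after the bulk load is the usual trade-off: a single sort at the end is far cheaper than maintaining the index row-by-row during the import.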
Behind the Scenes
Getting this to work took a fair amount of tweaking, but there are a few highlights I’d like to point out.
XSD: The provided XML Schema Definition file does not contain any information about the intended length of the string data. Fortunately, through testing, I was able to shrink some of those sizes down, although they do not strictly conform to the official database schema. Specifically, I shrank page.restrictions and text.space from nvarchar(255) to nvarchar(50); most other columns conform as closely as possible. In addition, I had to update text.text to an nvarchar(max). SSIS initially suggested nvarchar(255), but articles are obviously much longer than that. To perform these changes, right-click the XML Source (named PageArticles) in the Import XML Data Flow Task and select “Show Advanced Editor…”
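To make those size choices concrete, here is a sketch of how the affected destination columns might be declared. The table layouts are illustrative assumptions; the actual DDL is in WikipediaImportDatabaseCreate.sql.

```sql
-- Illustrative only: check the included database-create script
-- for the real table definitions.
CREATE TABLE dbo.Page (
    PageId       int           NOT NULL,
    Title        nvarchar(255) NOT NULL,
    Restrictions nvarchar(50)  NULL  -- shrunk from the default nvarchar(255)
);

CREATE TABLE dbo.[Text] (
    TextId int           NOT NULL,
    Space  nvarchar(50)  NULL,       -- shrunk from the default nvarchar(255)
    [Text] nvarchar(max) NULL        -- DT_NTEXT in SSIS; article bodies far exceed 255 characters
);
```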
Expand each of the changed columns (both External and Output) and update the DataType. For example, the SQL Server-specific nvarchar(max) maps to the more general “Unicode text stream [DT_NTEXT]” in SSIS. For the others, just update the length from 255 to 50.
There are many more fields in the XML file than I’ve decided to import. Unfortunately, I can’t just turn them off completely, lest SSIS complain. Instead, I have chosen to suffer the lesser fate of a “Warning”. Additionally, I changed the Error Output to “Ignore failure” on Error. This screen can be accessed by double-clicking the XML Source, PageArticles, then selecting the Error Output page.
SSIS is very picky about metadata, which made this a somewhat difficult project to get running; however, it does run. It could be further extended with an automated download and a larger set of imported fields, but for now it serves its purpose.
I know not everyone will get this working on the first run, so if you have a problem, please leave a comment below and I’ll do my best to answer it in a timely manner.