Wikipedia.NET
Overview
Wikipedia.NET is a C# .NET API and tool set for working with Wikipedia data dumps. Wikipedia periodically dumps its entire page database to a single (very large) XML file, which can be imported into a MySQL database as described on Wikipedia's data dumps page. This data is useful for many things, for example:
- Use in a custom interface to Wikipedia content.
- Information retrieval/question answering. Wikipedia is awesome because its topical coverage is both broad and deep. This makes it an invaluable knowledge resource for various IR/QA tasks.
- Topic modeling. Think of Wikipedia as an enormous labeled corpus of topics. For each page (topic), rich models (e.g., statistical language models) can be derived. Again, this is useful for IR/QA/NLP tasks.
There are a few immediate problems with the above uses, even after the dumped data is loaded into a MySQL database:
- Pages contain all Wiki markup used for formatting.
- Even if the markup is stripped, there is no nice way to access the pages.
Wikipedia.NET partially (see the limitations) solves these problems by providing the following functionality:
- Data formatting and import: tools for removing Wiki markup and storing a "clean" copy (i.e., same page content, sans markup) in a nicely indexed database schema.
- Page-based access: simple and efficient programmatic access to pages in marked-up or clean form.
- IR indexing: dump clean page content to files in a TREC-style format. These dumps can then be indexed with popular IR engines such as Lemur.
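To make the TREC-style dump concrete, here is a minimal sketch of what writing clean pages in that layout could look like. It is not the Wikipedia.NET dump code: the `TrecWriter` class, the `wiki-N` document numbers, and the sample page text are all hypothetical, and the layout is the common `<DOC>`/`<DOCNO>`/`<TEXT>` form used for TREC text collections.

```csharp
using System;
using System.IO;

public static class TrecWriter
{
    // Appends one page to the output as a TREC-style document.
    // docNo is any unique identifier (hypothetical here, e.g. a page ID).
    public static void WriteDocument(TextWriter output, string docNo, string cleanText)
    {
        // "\n" is written explicitly so the file looks the same on every platform.
        output.Write("<DOC>\n");
        output.Write("<DOCNO>" + docNo + "</DOCNO>\n");
        output.Write("<TEXT>\n");
        output.Write(cleanText.Trim() + "\n");
        output.Write("</TEXT>\n");
        output.Write("</DOC>\n");
    }
}

public static class TrecDemo
{
    public static void Main()
    {
        // Made-up sample pages; in practice these would come from the clean database.
        var pages = new[]
        {
            new[] { "wiki-1", "East Lansing is a city in Michigan." },
            new[] { "wiki-2", "Lemur is a toolkit for information retrieval." },
        };

        using (var writer = new StringWriter())
        {
            foreach (var page in pages)
                TrecWriter.WriteDocument(writer, page[0], page[1]);
            Console.Write(writer.ToString());
        }
    }
}
```

The sketch writes the text as-is, so it assumes the markup has already been removed; any leftover angle brackets in the page text would end up inside the `<TEXT>` element unchanged.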
Limitations
Like all software, the current version of Wikipedia.NET is not perfect. Here are some shortcomings to be aware of:
- There are many ways to specify the same formatting in Wikipedia. Some users stick to the special Wiki markup syntax, some use HTML, and some combine the two. After all, the goal is a nice-looking page in the end; however, this creates problems when the markup needs to be stripped. I have covered much of the formatting markup in my filter routines, but some will inevitably slip through into the "clean" database. I would greatly appreciate reports of any formatting markup that was missed.
- Due to the markup removal, pages retrieved from the database do not contain things such as tables, pictures, or references. Anything that isn't basic page text is removed.
- Bug reports are appreciated.
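The mixed-markup problem described above can be illustrated with a small regex-based stripper. This is only a sketch under stated assumptions, not the filter routines that ship with Wikipedia.NET: the `MarkupStripper` class and its four rules are hypothetical and cover just simple templates, HTML tags, internal links, and bold/italic quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkupStripper
{
    // Removes a few common kinds of formatting; everything else is left untouched.
    public static string Strip(string wikiText)
    {
        string text = wikiText;
        // Non-nested templates such as {{stub}}.
        text = Regex.Replace(text, @"\{\{[^{}]*\}\}", "");
        // HTML tags such as <b>, </i>, or <br />.
        text = Regex.Replace(text, @"</?[A-Za-z][^>]*>", "");
        // Internal links: [[target|label]] becomes label, [[target]] becomes target.
        text = Regex.Replace(text, @"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", "$1");
        // Bold and italic markers: runs of two or more apostrophes.
        text = Regex.Replace(text, @"'{2,}", "");
        return text;
    }
}

public static class StripDemo
{
    public static void Main()
    {
        // Wiki syntax and HTML mixed in one sentence.
        string mixed = "'''East Lansing''' is a <i>city</i> in [[Michigan]], "
                     + "home to [[Michigan State University|MSU]].";
        Console.WriteLine(MarkupStripper.Strip(mixed));
        // prints: East Lansing is a city in Michigan, home to MSU.

        // Table syntax is not covered by any rule above, so it slips through.
        string table = "{| class=\"wikitable\"\n|}";
        Console.WriteLine(MarkupStripper.Strip(table) == table);
        // prints: True
    }
}
```

The second case shows why some markup inevitably reaches the "clean" copy: every construct needs its own rule, and anything without one passes through unchanged.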
Download
You can check out a local copy with the following Subversion command:
svn co http://links.cse.msu.edu:8000/svn/NLP/Source/ResourceAPIs/Wikipedia
Alternatively, point your TortoiseSVN client to the same URL.
All of my software can be used according to the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.
