Hacking Days at MIT and Wikimania at the Harvard Law School came to a close yesterday. Here is a brief summary:
Brion Vibber, Chief Technology Officer of Wikimedia, gave many talks. He discussed everything from why wiki projects are difficult to cache (since they are dynamic) to features to come, such as Semantic MediaWiki, a possible Xapian search engine, better wikitags and a better parser, possible support for PDF documents and integration with the DjVu image format along with other media and video formats, and better authentication options such as OpenID, YADIS and A-Select. There were some OLPC (One Laptop Per Child) computers outside, which can synchronize with each other over a wireless mesh network they build by themselves in order to play music or share any type of information.
Mark Bergsma talked about near-future server technology for the Wikimedia projects, such as 64-bit servers. He provided information about the geographical sites of the Wikipedia clusters, mainly located in Florida and Amsterdam. He covered the Squid-based caching architecture, object purging, and some new DNS technologies being explored, such as PowerDNS and geographical load balancing (e.g. BGP-based DNS). He announced that they were already using the HTCP inter-cache protocol, and that there is a plan to make the current single core switch/router setup more reliable. Some of the participants proposed the use of PlanetLab services (http://www.planet-lab.org), a collection of machines distributed around the globe running a common software package that includes a Linux-based OS, mechanisms for bootstrapping, distribution software, and a set of node-monitoring tools. PlanetLab is mainly devoted to research, serving as a testbed for overlay networks and giving groups the opportunity to experiment with planetary-scale services.
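To make the object-purging idea concrete, here is a minimal sketch of invalidating one cached page on a Squid front end with an HTTP PURGE request. The cache host name is made up, and I am assuming a Squid configured to accept PURGE from this client; Wikimedia's production setup (multicast HTCP) is more elaborate than this.

```python
# Minimal sketch: invalidate one URL on a Squid cache via HTTP PURGE.
# Assumes a cache at squid.example.org that accepts PURGE requests from
# this client; Wikimedia's real setup uses HTCP multicast instead.
import http.client

def purge(cache_host, url_path):
    conn = http.client.HTTPConnection(cache_host, 80, timeout=5)
    conn.request("PURGE", url_path, headers={"Host": "en.wikipedia.org"})
    resp = conn.getresponse()
    conn.close()
    # 200 means the object was purged, 404 means it was not in the cache.
    return resp.status

if __name__ == "__main__":
    print(purge("squid.example.org", "/wiki/Africa"))
```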
A later talk was about enhancing Wiktionary so it can be used as a database by external applications that need it. Right now Wiktionary can only be exploited by addressing a query directly to it. A new database structure is being developed to give Wiktionary semantic meaning (among other things, relating each word to its translations in all the other languages already in Wiktionary), which will eventually allow many new features and the generation of a full knowledge database rather than a bunch of words (about a million at the moment, across all languages) with definitions.
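As a rough illustration of what such a relational structure could look like (the table and column names below are my own invention, not the schema actually being developed), here is a small SQLite sketch that links a language-neutral meaning to its expressions in several languages and uses that to look up translations:

```python
# Hypothetical relational sketch of a machine-readable Wiktionary:
# a language-neutral "meaning" row linked to per-language expressions.
# Table and column names are illustrative, not the real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meaning (
    id INTEGER PRIMARY KEY,
    definition TEXT NOT NULL          -- short, unambiguous defining phrase
);
CREATE TABLE expression (
    meaning_id INTEGER REFERENCES meaning(id),
    language   TEXT NOT NULL,         -- e.g. 'en', 'es', 'de'
    spelling   TEXT NOT NULL
);
""")
conn.execute("INSERT INTO meaning (id, definition) VALUES "
             "(1, 'the star at the centre of our solar system')")
conn.executemany(
    "INSERT INTO expression (meaning_id, language, spelling) VALUES (?, ?, ?)",
    [(1, "en", "sun"), (1, "es", "sol"), (1, "de", "Sonne")],
)

# All translations of the English word 'sun':
rows = conn.execute("""
    SELECT e2.language, e2.spelling
    FROM expression e1 JOIN expression e2 ON e1.meaning_id = e2.meaning_id
    WHERE e1.language = 'en' AND e1.spelling = 'sun' AND e2.language != 'en'
""").fetchall()
print(rows)   # e.g. [('es', 'sol'), ('de', 'Sonne')]
```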
An interesting talk about managing discussion pages and posting to them also took place. In a nutshell, the idea was to treat each post as a wikipage in itself. Some questions were raised about the performance impact on the whole system of the huge number of new wikipages, and other security and control questions emerged too, but the idea seemed to be very well received. Finally, a nice proposal to include video and audio streaming in wikiprojects was presented.
There were several talks about WiktionaryZ during Hacking Days and Wikimania, by Erik Moeller and others. WiktionaryZ is an initiative to create the Ultimate Universal Wiktionary (pretty humble, isn't it?) by integrating semantic knowledge into Wiktionary. The project is based on defining each meaning with a short, simple phrase that describes a word clearly and unequivocally and that can be translated exactly into all the languages of the Wiktionary. The relationships between words are also recorded, thus making it possible to build a machine-readable repository of semantic knowledge.
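To make that concrete, here is a tiny sketch of a WiktionaryZ-style record: a defining phrase, its expressions in several languages, and typed relations to other meanings. The class and relation names are mine, not the project's actual data model.

```python
# Illustrative sketch of a WiktionaryZ-style record: a language-neutral
# meaning, its expressions in several languages, and typed relations to
# other meanings. Names are hypothetical, not the project's real model.
from dataclasses import dataclass, field

@dataclass
class Meaning:
    definition: str                              # short, unambiguous phrase
    expressions: dict = field(default_factory=dict)   # language -> word
    relations: list = field(default_factory=list)     # (relation_name, Meaning)

water = Meaning("clear liquid that falls as rain and fills rivers and seas",
                {"en": "water", "es": "agua", "nl": "water"})
liquid = Meaning("a substance that flows freely",
                 {"en": "liquid", "es": "líquido"})
water.relations.append(("is_a", liquid))         # hypernym-style relation

# A machine can now answer: how do you say 'water' in Spanish, and what is it?
print(water.expressions["es"], "->", [name for name, _ in water.relations])
```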
Following that, Andre Engels talked about pywikipediabot and the challenge of writing Wikipedia bots while avoiding screen scraping: relying on it makes maintaining the bots quite complicated, since they have to be changed every time the format of the articles changes. He also spoke about the dangers of using bots for big tasks, because errors in bot programming can lead to hundreds of thousands of damaged pages.
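By way of example, this is roughly what bot-style access looks like today without a dedicated API: fetching an article's raw wikitext through the normal web interface (action=raw is an existing MediaWiki feature; the rest of this toy fetcher is simplified). Any parsing of rendered HTML built on top of this breaks whenever the page format changes.

```python
# Toy illustration of bot-style access without a dedicated API:
# fetch an article's raw wikitext through the normal web interface.
from urllib.request import urlopen
from urllib.parse import urlencode

def fetch_wikitext(title, site="https://en.wikipedia.org/w/index.php"):
    query = urlencode({"title": title, "action": "raw"})
    with urlopen(f"{site}?{query}", timeout=10) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    text = fetch_wikitext("Africa")
    print(text[:200])
```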
Other talks were about OpenID, an open-source authentication system similar in purpose to Microsoft Passport in that it integrates a user's identity across several sites (Wikimedia projects, blogs, etc.) into a single ID. There are plans to integrate this feature into Wikipedia soon.
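As a rough sketch of how the OpenID 1.x discovery step works (the identity URL below is made up, and a real consumer would use a proper OpenID library for delegation, YADIS and signature checking), the consumer fetches the user's identity page and looks for the provider advertised in its HTML head:

```python
# Toy sketch of OpenID 1.x discovery: fetch the claimed identity URL and
# find the provider ("openid.server") advertised in its HTML head.
# A real consumer also handles YADIS documents, delegation, nonces and
# signature verification; use an actual OpenID library for that.
import re
from urllib.request import urlopen

def discover_provider(identity_url):
    html = urlopen(identity_url, timeout=10).read().decode("utf-8", "replace")
    match = re.search(
        r'<link[^>]+rel=["\']openid\.server["\'][^>]+href=["\']([^"\']+)["\']',
        html, re.IGNORECASE)
    return match.group(1) if match else None

# Hypothetical identity page; the consumer would then redirect the browser
# to the returned provider URL to authenticate.
print(discover_provider("https://alice.example.org/"))
```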
WYSIWYG for Wikipedia: one of the main problems with using Wikipedia is the difficulty of editing with wikitags. Although technically advanced users can easily adapt to the wikitag system, most people just can't get the hang of it. Hence the need for an easy-to-use, simple editor is evident, although the lack of a proper MediaWiki parser and the complexity of the wikitag language make such a thing hard to implement. In any case, Frederico Caldeira Knabben has created a very nice and useful WYSIWYG HTML editor called FCKeditor (www.fckeditor.org), and he is willing to join forces with MediaWiki to integrate it into Wikipedia.
There was also a panel featuring Brion Vibber, Andre Engels and Erik Moeller which addressed the possibility of a MediaWiki API. Many of the attendees were enthusiastic, and some went into a separate room to discuss the specification with Brion and Andre. They came up with a preliminary agreement that may be available on the web soon. The day ended with an enjoyable tour of MIT's Media Lab.
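As an illustration of the sort of machine-readable access being discussed, here is a hypothetical sketch of a bot querying a JSON endpoint; the api.php URL, parameter names and response shape are my assumptions for illustration, not the specification the group drafted.

```python
# Hypothetical sketch of querying a MediaWiki-style JSON API instead of
# scraping HTML. The endpoint and parameter names are assumptions made
# for illustration; they are not the spec agreed on at the panel.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def query_pages(titles, endpoint="https://en.wikipedia.org/w/api.php"):
    params = urlencode({
        "action": "query",
        "titles": "|".join(titles),
        "prop": "info",
        "format": "json",
    })
    with urlopen(f"{endpoint}?{params}", timeout=10) as resp:
        return json.load(resp)

print(query_pages(["Africa", "Europe"]))
```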
Some topics were rather dull. The Wikimedia discussion panel, for instance, was mainly about internal politics and logistics, with little of interest for a broader audience.
From the point of view of our discussions here, the most interesting topics at Wikimania related to the Semantic Web and to the reliability of wikiprojects, given that anybody can edit the entries. Jim Giles, the author of the Nature paper comparing Wikipedia and Britannica, talked about the strong and weak points of the entries and of the reviews. According to him, Britannica made the point that most of their entries that were evaluated were in the sciences, where Wikipedia is stronger since most of its contributors come from those disciplines, so this kind of comparison could not serve as an adequate measure of the entire corpus of knowledge covered by Britannica. Britannica also argued that the entries (50 in all) were badly reviewed, which would account for Wikipedia earning a rating very close to its own (3.9 errors per article on average for Wikipedia against 2.9 for Britannica). However, the author argues that the reviewers were the same for both sides, so they would on average have committed the same number of errors.
Regarding the session in which a consensus on improving Wikipedia's content reliability was expected, none was reached there. According to Martin Walker, a professor of chemistry, things gradually coalesced during discussions over the weekend, and both the German- and the English-language communities seem to have come to a similar position:
1. Set up stable (uneditable) versions of articles (static unvandalized versions). The Germans expect to be testing out this idea within a month.
2. Then set up a validation system, possibly using review teams of trusted people from wikiprojects, to check every fact (& sign off, giving the reference source). The fact checking would be made available to any user who wanted to take the time. This validated version would also be an
3. On the English Wikipedia we thought there ought to be an outside expert (with a PhD and a reputation) to sign off on the validated version, so we could say: “This page has been approved by Professor XXX from University of YYY”.
The discussion page set up on this issue is at:
Concerning the Semantic Web, there is already a working wikiproject at www.ontoworld.org. The basic idea is to tag the pieces of information contained in the articles in order to relate them to other data.
A typical example is:
'''Africa''' is a continent, south of [[Europe]] and southwest of [[Asia]].
where the tagged links to Europe and Asia express machine-readable relations to the pages carrying those tags.
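Underneath, each such annotation boils down to a subject-property-object triple. As a hedged illustration (the namespace and property names are invented, and this uses plain RDFLib rather than the actual MediaWiki extension), the Africa sentence could be represented like this:

```python
# Sketch: the annotated sentence as RDF triples, using RDFLib.
# The namespace and property names are illustrative, not the ones
# the Semantic MediaWiki extension actually emits.
from rdflib import Graph, Namespace

WIKI = Namespace("http://www.ontoworld.org/wiki/")
REL = Namespace("http://www.ontoworld.org/property/")

g = Graph()
g.add((WIKI.Africa, REL.is_a, WIKI.Continent))
g.add((WIKI.Africa, REL.south_of, WIKI.Europe))
g.add((WIKI.Africa, REL.southwest_of, WIKI.Asia))

print(g.serialize(format="turtle"))
```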
In the end, what we have is a big relational database underlying the system, which allows queries using an SQL-like query language called SPARQL, the specification for which is at:
I conducted some tests using the Africa entry: searching www.ontoworld.org, I created a new article, "Africa Population by Countries", from a query that lists each country in Africa sorted by population, under the heading "List of African countries" (a sketch of this kind of query is given below).
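Here is a hedged sketch of what such a query can look like once the wiki's data is available as RDF; the property names and the sample population figures are made up, and the SPARQL runs locally with RDFLib rather than against www.ontoworld.org itself:

```python
# Hedged sketch: "countries in Africa sorted by population" as a SPARQL
# query run locally with RDFLib. Property names and figures are invented;
# a real query would go against the wiki's exported RDF.
from rdflib import Graph, Literal, Namespace

WIKI = Namespace("http://www.ontoworld.org/wiki/")
REL = Namespace("http://www.ontoworld.org/property/")

g = Graph()
for name, pop in [("Nigeria", 140_000_000), ("Egypt", 79_000_000),
                  ("Ethiopia", 75_000_000)]:
    country = WIKI[name]
    g.add((country, REL.located_in, WIKI.Africa))
    g.add((country, REL.population, Literal(pop)))

results = g.query("""
    PREFIX rel: <http://www.ontoworld.org/property/>
    PREFIX wiki: <http://www.ontoworld.org/wiki/>
    SELECT ?country ?pop WHERE {
        ?country rel:located_in wiki:Africa ;
                 rel:population ?pop .
    } ORDER BY DESC(?pop)
""")
for country, pop in results:
    print(country, pop)
```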
The technology used is something called RDF. More on that at:
and there are some libraries in several languages for RDF access:
RDFLib (Python), from rdflib.net
RAP (RDF API for PHP), from www.wiwiss.fu-berlin.de/suhl/bizer/rdfapi
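For instance, with RDFLib one can load an RDF document and walk its triples directly; the file name below is just a placeholder for whatever RDF export is at hand:

```python
# Minimal RDFLib usage: parse an RDF document and iterate over its triples.
# The file name is a placeholder for whatever RDF export you have available.
from rdflib import Graph

g = Graph()
g.parse("africa.rdf")          # format is guessed from the extension
print(len(g), "triples loaded")

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```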
and the description of the MediaWiki extension project for the Semantic Web is at:
We also explored some browsers being developed for the Semantic Web. They are at:
The workshop on “Using Wikipedia's Knowledge in Your Applications” (http://wikimania2006.wikimedia.org/wiki/Proceedings:MK1) was very interesting. There I met the speakers (Markus Krötzsch, Denny Vrandecic), with whom I exchanged information and agreed to keep in contact to discuss my concerns about addressing queries to the URL and transforming unstructured data into semantically usable information. There was also some discussion about translating natural language into semantic information by taking clue words directly from Wikipedia articles and introducing tags and relations.
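As a toy illustration of that last idea (the pattern and the relation name are mine, and real natural-language processing needs far more than a regular expression), here is a sketch that pulls a simple "X is a Y" relation out of an article's opening sentence:

```python
# Toy sketch: turn clue words from an article's opening sentence into a
# tagged relation. Real NLP-to-semantics pipelines are far more involved;
# the pattern and relation name here are purely illustrative.
import re

SENTENCE = "Africa is a continent, south of Europe and southwest of Asia."

def extract_is_a(sentence):
    match = re.match(r"(?P<subject>\w+) is an? (?P<category>\w+)", sentence)
    if match:
        return (match.group("subject"), "is_a", match.group("category"))
    return None

print(extract_is_a(SENTENCE))   # ('Africa', 'is_a', 'continent')
```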
I met Tim Berners-Lee, who is also very interested in the Semantic Web and in the idea of creating a browser exploiting these features. I also met Betsy Megas, who is working on a Semantic Web project called WiktionaryZ (wiktionaryZ.org), which is like Wiktionary but semantic. We had a long discussion about whether a “categories” scheme is worthwhile on the Semantic Web. My point was that in most cases the categories could be built dynamically: they would be present in an abstract form without any need to define them explicitly, and a completely relational model strikes me as more interesting. People have tried to categorize everything for as long as encyclopedias have existed, but the number of possible categories can be much higher than the number of elements to categorize; because categories can be so arbitrary, in the limit there can be as many categories as there are subsets of the elements (2^n for n elements), which is not useful from my point of view.
A couple of plenary sessions were held in the main auditorium of Harvard Law School. One of them featured a lecture by Lawrence Lessig, a lawyer who has been involved in cases such as the Microsoft monopoly (antitrust) case. He is the founder of Creative Commons, whose licences are often described as a form of copyleft: some rights are reserved, but others are granted to the public so that people can be creative when reusing resources. He talked about the Read-Only (RO) society, which was how he characterized the last century, and the new Read-Write (RW) society, which is moving toward Creative Commons licences, open software, open hardware, free and open encyclopedias like Wikipedia, freedom of work (freelancers and people organizing free foundations), free access to human knowledge and communications (web access, Skype), and free and open broadcasting (podcasting), among other things.
Most of the sessions were recorded and are available at:
And the abstracts and proceedings are at:
I also found an interesting site for exploring the degrees of separation between two different topics through Wikipedia (sometimes it works):