Disaster relief for couch potatoes

Last week I took part in a map-athon for Japan, an act of humanitarian mapping with OpenStreetMap (OSM). I had never edited OSM before, and I was quite surprised at how easy it was, and how well organised.
The mapping itself is based on changing the existing OSM base map. All damaged items are marked with a special tag explaining the new situation. In a crisis, the OSM base map itself needs updating too, since every detail can save lives.
The mapping needs a little preparation in terms of registering, but once this is completed you can start right away. Use one of the many map editors available for free. Select an area that you want to work on, download the existing map data and update it based on the aerial photography that is included in the OSM editor. This YouTube video explains how this works.

A mailing list and a Twitter account (@hotosm) keep the volunteers connected.
So..everybody can do this..it’s easy and every contribution is one drop in a giant sea of relief effort.
The difference in how up to date OSM is compared to Google Maps is shown in the following comparison. – Note to self: remove link later on..
See also: Humanitarian OpenStreetMap Team

Watson, how are you?

The last few days I have been reading about the WATSON-Jeopardy! event and how this relates to my current research: a lot..;-)! In this post I explain what Watson has to do to answer a Jeopardy! question and what kind of techniques are used. I collected this information from the creators of WATSON: IBM and Carnegie Mellon University. But first of all..some videos from the event!

The game poses three important intelligence problems. First of all, the answering of questions. The questions are posed in a cryptic manner using puzzles and metaphors. Next, there is strategy. Three contestants compete in a timeslot to answer these questions. When an answer is correct, the money related to the question is earned; when it is wrong, the amount is deducted. This introduces a matter of game strategy, because a contestant who is unsure about an answer had better not answer at all. How to program that strategy is intelligence problem number two. Number three is of course finding the correct answers quickly whilst using a massive source database.
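The "better not answer when unsure" rule can be written down as a small expected-value calculation. This is only a sketch of the idea for a single question in isolation; the function names and the threshold parameter are my own invention, and Watson's real strategy was certainly more involved than this.

```python
def expected_gain(confidence: float, clue_value: int) -> float:
    """Expected change in winnings when answering.

    A correct answer earns the clue value, a wrong answer loses it.
    """
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_answer(confidence: float, clue_value: int, threshold: float = 0.0) -> bool:
    """Answer only if the expected gain is above a (hypothetical) threshold."""
    return expected_gain(confidence, clue_value) > threshold
```

With these rules the break-even point is a confidence of 50%: below that, staying silent is the better bet on average.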

It goes without saying that many years of research preceded the Jeopardy! challenge. Not only technical research, but also research into the specific ins and outs of the Jeopardy! game. A random sample of 20,000 questions was used to determine the scope of the answer database as well as the specifics of the questions. To understand the game strategy, some 2,000 historical Jeopardy! shows were analysed. From this analysis, statistics and training sets for machine learning were developed.
So first of all, Watson has to understand the question. This is done by using a series of NLP methodologies, amongst them Natural Language Parsing. With NL Parsing the question sentence is broken down into parts of speech such as noun, adverb, verb et cetera. Then, the question is decomposed into multiple parts. This is because the answer to a question often depends on a combination of two related questions, for example: “Of the four countries that the US does not have diplomatic relations with, the one that’s farthest north.” Answer categories and Lexical Answer Types were determined by IBM prior to the game, based on examination of thousands of Jeopardy! questions. It is evident that every answer category needs a different type of processing.
This is “just” determining the syntax of the language. We also need the meaning (semantics) of the question. Words that are written in the same way can have very different meanings. Watson uses semantic reference systems for this, such as ontologies and thesauri. The reference systems are used for semantic reasoning.
While searching for the right answer, Watson holds more than one option (hypothesis) in a queue and decides later which one is the best answer, based on the evidence found in metadata related to the answer source. The hypotheses are scored according to the totals of each evidence type, and this score is used in the confidence ranking.
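To make the idea of keeping several hypotheses around and ranking them later a bit more concrete, here is a minimal sketch. The evidence type names and the plain summing are assumptions of mine for illustration, not the way Watson actually stores or combines its evidence.

```python
from collections import defaultdict


def rank_hypotheses(evidence):
    """Rank candidate answers by their total evidence score.

    evidence: iterable of (candidate, evidence_type, score) tuples.
    Returns a list of (candidate, total_score), best candidate first.
    """
    # First add up the scores per candidate and per evidence type.
    totals = defaultdict(lambda: defaultdict(float))
    for candidate, evidence_type, score in evidence:
        totals[candidate][evidence_type] += score

    # Then sum over the evidence types and sort, highest total first.
    ranked = [(cand, sum(by_type.values())) for cand, by_type in totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up scores for the "farthest north" question above.
example = [
    ("North Korea", "passage_match", 0.5),
    ("North Korea", "geospatial", 0.75),
    ("Cuba", "passage_match", 0.5),
    ("Iran", "passage_match", 0.25),
]
```

Calling rank_hypotheses(example) puts "North Korea" on top, because it collects evidence of two different types.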

Once the question is clear, Watson has to search for documents that contain the answer. This involves indexing and retrieval as well as information extraction. A variety of retrieval techniques are used, including document search and knowledge base search (SPARQL). Watson uses a source database (corpus) that originally consisted of encyclopedias, thesauri, literary works, newswire articles and so on. This corpus was enriched with related content from the internet (at the time of the Jeopardy! challenge Watson was not online!) to become an expanded corpus. The related content was electronically examined, and relevant pieces of information were selected, extracted and scored based on relevance to the original seed document.
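For readers who have never seen indexing and retrieval up close, this is roughly what the simplest possible document search looks like: an inverted index from terms to documents, and a ranking by the number of query terms a document contains. It is a toy illustration with made-up documents and has nothing to do with the actual search engines used inside Watson.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """corpus: dict of doc_id -> text. Returns a dict of term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return (doc_id, matching_term_count) pairs, best match first."""
    hits = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Sort by descending count, then by doc_id to keep the order stable.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# A three-document toy corpus (invented for this example).
corpus = {
    "a": "Pyongyang is the capital of North Korea",
    "b": "Havana is the capital of Cuba",
    "c": "Tehran lies in Iran",
}
index = build_index(corpus)
```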
Next, Watson has to compare all possible answers and reason about which is the best answer. This is done by examining the context information that is available in the document containing the answer. Different scoring algorithms were used; worth mentioning are counting and the presence of a variety of relations, such as correlation with question terms and semantic, temporal and geospatial relations. It is a mixture of all these scores that finally determines Watson’s confidence in an answer.
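One common way to turn a mixture of scores into a single confidence between 0 and 1 is a weighted sum pushed through a logistic function. I am assuming that form here purely as an illustration; the feature names and weights are hypothetical, and in a real system the weights would be learned from training data.

```python
import math


def confidence(feature_scores, weights, bias=0.0):
    """Combine per-feature scores into one confidence value in (0, 1).

    feature_scores: dict of feature name -> score for one candidate answer.
    weights: dict of feature name -> weight (hypothetical values).
    """
    z = bias + sum(weights[name] * value for name, value in feature_scores.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for three of the score types mentioned above.
weights = {"term_overlap": 2.0, "temporal": 1.0, "geospatial": 1.5}
strong = {"term_overlap": 0.9, "temporal": 0.5, "geospatial": 0.8}
weak = {"term_overlap": 0.1, "temporal": 0.0, "geospatial": 0.0}
```

A candidate with strong evidence on every feature ends up with a higher confidence than one with hardly any, and a candidate without any evidence sits at exactly 0.5 when the bias is zero.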
Now an answer is not the same as “a document containing the answer”..! Usually it is just one word..so how do you get to that? For this, techniques like named entity detection (does the document contain something that is detected as a “name”) and reverse dictionary lookup (!) were used. In the contest it was clear that Watson sometimes did not succeed in this very well, since whole sentences were given as an answer while Brad and Ken gave just one name.
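To give a feeling for what "detecting a name" in a passage means, here is a deliberately naive sketch that simply picks out runs of capitalised words. Real named entity detection relies on trained models and dictionaries; the regular expression and the little list of sentence starters below are my own shortcuts for illustration.

```python
import re

# Runs of capitalised words such as "North Korea": a crude stand-in for a trained model.
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# Words that are often capitalised only because they start a sentence (illustrative list).
SENTENCE_STARTERS = {"The", "A", "An", "It", "This", "In"}


def candidate_names(passage):
    """Return the capitalised word runs in a passage, minus bare sentence starters."""
    runs = CAPITALISED_RUN.findall(passage)
    return [run for run in runs if run not in SENTENCE_STARTERS]
```

For the passage "The farthest north of the four is North Korea, bordering China." this returns the two candidates "North Korea" and "China", which is a lot closer to a one-word answer than the whole sentence.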

It is no surprise to me that the biggest technical challenge was not so much the development of the separate algorithms. It was rather “..accelerating the innovation process – making it easy to combine, weigh evaluate and evolve many different independently developed algorithms that analyze language from different perspectives.” Watson isn’t a single computer program, but a very large number of programs running simultaneously on different computers that communicate with each other. To make these programs work together and deliver a result within split seconds is the ultimate achievement that Watson has made.

Below is a list of resources used to write this post.
Article in AI magazine
IBM QA
Carnegie Mellon Post
Wikipedia