Virtual Tribunal of the Special Court for Sierra Leone

Joint Project between UC Berkeley War Crimes Studies Center and Department of Computer Science


About the Virtual Tribunal for Sierra Leone

We propose a novel way of aligning audio/video and text streams that is faster than conventional speech recognition and requires no supervision. Multimedia of this form includes news broadcasts with summaries, and parliamentary proceedings and court trials with transcripts. In addition to applications in video search using text-based indexing, we show how to annotate the video with the names of the people appearing in it to provide a better viewing experience. We test the technique on an 80-minute video segment downloaded from the ICTY’s website, together with the corresponding transcripts. The proposed technique achieves 88.49% accuracy on sentence-level alignment and 95.5% accuracy on the task of assigning names to faces.

Combining various streams of information coherently for tasks such as search and organization for better browsing is quite challenging. Large amounts of data exist on the web that contain audio, visual and text information together. These can be divided into two categories: one in which the various forms of media are synchronized with one another, and one in which they are not. Examples of the former include photos with captions and movies with subtitles. The latter include news broadcasts paired with newspaper text on the same story, and proceedings of parliaments or courts with transcripts. One can obtain these complementary sources of information from the web, for example by running video, image and text searches on the same set of keywords. Clearly there is a huge amount of data on the web for which only weak associations exist, and any method to organize it is useful.
In this work we show a technique for obtaining alignments of video and text in the setting of courtroom-style proceedings with transcripts. The alignment can then be used to annotate or label the video for various tasks. As an example, we show how to automatically annotate the faces appearing in the video with names taken from the transcripts.
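
To illustrate how an alignment might be used for annotation, here is a minimal sketch in Python. It is not the method used in this project. It assumes the alignment step has already produced, for each transcript sentence, a speaker name and a start and end time in the video, and it makes the simplifying assumption that a face detected at a given time belongs to whoever is speaking then. The names AlignedSegment and name_for_time, and the sample data, are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlignedSegment:
    """One transcript sentence with the video time span it was aligned to (hypothetical format)."""
    speaker: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def name_for_time(segments: List[AlignedSegment], t: float) -> Optional[str]:
    """Return the speaker whose aligned segment covers time t, or None if no segment does.

    Assumes segments are sorted by start time and do not overlap.
    """
    starts = [s.start for s in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i >= 0 and segments[i].start <= t < segments[i].end:
        return segments[i].speaker
    return None


# Made-up alignment output and face-detection timestamps, for illustration only.
segments = [
    AlignedSegment("JUDGE", 0.0, 4.2),
    AlignedSegment("WITNESS", 5.0, 11.5),
]
face_times = [1.0, 4.5, 7.3]

# Simplifying assumption: the face on screen is the current speaker.
labels = [name_for_time(segments, t) for t in face_times]
```

A face detected in a gap between aligned sentences (here at 4.5 seconds) gets no label, which seems preferable to guessing when the transcript offers no evidence.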

Problem Challenges: The transcripts are noisy, as they are taken down manually and are edited to remove sensitive information. Also, in many of the trials, especially in the international courts, there is a translator who translates from the native language into English. This introduces a lag and a lot of additional noise that is not present in the transcripts, rendering the task of alignment nontrivial. Face recognition itself is hard because of variation in pose, expression, etc.

For further reading:
Fast Automatic Alignment of Video and Text for Search/Names and Faces