Multi-modal Image and Video Processing

Dr. Michael Smith

SAVA Advanced Image and Video Solutions
Austin, Texas

Friday, February 1st, 3:00 PM, ENS 637

msmith@savasystems.com


Abstract

A key problem in image and video indexing is that users query for content, while most systems match only statistical features such as color and texture. Content matching attempts to correlate actual objects with a given query: the user is not limited to selections based on similar color properties, but instead retrieves a collection based on content. In this form of matching, the query may be an image or text. Content features such as captions and detected faces correspond to textual descriptions, so a query need not be an image.
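As a hedged illustration of this idea (a minimal sketch, not a system from the talk), the Python fragment below matches a text query against textual content features attached to video segments, such as caption words and names of detected faces. All segment identifiers and metadata here are hypothetical.

    # Hypothetical sketch: matching a text query against extracted content
    # features (captions, detected faces) rather than color/texture statistics.

    segments = [
        {"id": "news_001", "caption": "president addresses congress",
         "faces": ["president"]},
        {"id": "news_002", "caption": "storm damages coastal town",
         "faces": []},
    ]

    def content_match(query, segments):
        """Return ids of segments whose textual features contain every query term."""
        terms = query.lower().split()
        hits = []
        for seg in segments:
            text = seg["caption"] + " " + " ".join(seg["faces"])
            if all(term in text for term in terms):
                hits.append(seg["id"])
        return hits

    print(content_match("president", segments))  # ['news_001']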

A number of content-based image and video systems employ the features described in this lecture. In each case, the systems are designed to interpret features from multi-modal sources such as text, audio, image, and video. A feature is a descriptive parameter extracted from an image or video stream. Features may be used to interpret visual content, or as a measure of similarity in image and video databases. In this lecture, features are described in the following categories:

Statistical Features: Features extracted from an image or video sequence without regard to content are described as statistical features. These include parameters derived from algorithms such as image difference and camera motion estimation (a frame-difference sketch follows this list).

Compressed Domain Features: A feature that is extracted from a compressed image or video stream without regard to content is described as a compressed domain feature.

Content-Based Features: A feature derived to describe the actual content of an image or video stream is a content-based feature.
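To make the first category concrete, here is a minimal sketch of a statistical feature (assuming decoded grayscale frames as NumPy arrays, which is not a method from the talk): the mean absolute frame difference, a content-blind parameter often used to flag candidate shot changes. The threshold and toy data are assumptions.

    # Hypothetical sketch of a statistical feature: mean absolute frame
    # difference, computed without regard to what the frames depict.
    import numpy as np

    def frame_difference(prev_frame, next_frame):
        """Mean absolute pixel difference between two grayscale frames."""
        return np.abs(next_frame.astype(float) - prev_frame.astype(float)).mean()

    def shot_changes(frames, threshold=30.0):
        """Indices where the difference feature exceeds a chosen threshold."""
        return [i for i in range(1, len(frames))
                if frame_difference(frames[i - 1], frames[i]) > threshold]

    # Toy usage with random "frames"; real input would be decoded video.
    rng = np.random.default_rng(0)
    frames = [rng.integers(0, 256, size=(64, 64)) for _ in range(10)]
    print(shot_changes(frames))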

A specific object is usually the emphasis of a query in image retrieval. Recognition of articulated objects poses a great challenge and represents a significant step in content-based feature extraction. Many working systems have demonstrated accurate recognition of animal objects, segmented objects, and rigid objects such as planes or automobiles. This presentation is an overview of several techniques and working systems in multi-modal content analysis and their applications to video processing. It will also describe visualization technology for browsing and summarization, characterization and meta-data acquisition, and user studies to validate specific methodology. This includes a description of traditional static presentations, such as text abstracts and thumbnails, and current research in application-specific image browsing paradigms.
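As one hedged illustration of content-based feature extraction (a sketch using OpenCV's stock Haar cascade, a stand-in rather than any method from the talk), the fragment below detects frontal faces in an image so that the detections can be stored as content metadata for retrieval. The image path and detector parameters are assumptions.

    # Hypothetical sketch of a content-based feature: frontal-face detection
    # using the Haar cascade bundled with opencv-python. Detected regions can
    # be recorded as content metadata alongside captions and other text.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def face_regions(image_path):
        """Return bounding boxes (x, y, w, h) of detected frontal faces."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            raise FileNotFoundError(image_path)
        return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)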

Biography

Michael Smith, Ph.D., is the Co-Founder and Chief Scientist of SAVA Video Systems and an adjunct faculty member in the ECE Department. Dr. Smith received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University in 1997, while working with the Informedia Digital Video Library Initiative. His dissertation topic was the Integration of Image, Audio and Language Understanding for Video Characterization and Variable Skimming, and he has published papers in pattern recognition, biomedical imaging, and interactive systems. He received an M.S.E.E. degree from Stanford University in 1992 and a B.S.E.E. degree from North Carolina A&T University in 1991. Dr. Smith previously worked at AT&T Bell Laboratories in Murray Hill, New Jersey and the Duke University Medical Imaging Center.


A list of Telecommunications and Signal Processing Seminars is available from the ECE department Web pages under "Seminars". The Web address for the Telecommunications and Signal Processing Seminars is http://signal.ece.utexas.edu/seminars