Research Direction #1: Mobile Video Delivery System Design

Multimedia delivery system design encompasses a broad range of research topics. Chen's research team is working on four major topics:

1. Assessing and monitoring of video delivery quality and user experiences

2. Mobile video perception: Viewing environment sensing and adaptation

3. Battery power aware mobile display adaptation

4. Robust and adaptive Internet of Video Things (IoVT) via NOMA and edge computing

Multimedia content, especially video, has seen unprecedented growth in popularity over the last decades. The ultimate objective of any video delivery system is to ensure that end users enjoy the best possible experience. We believe a significant departure from the conventional quality-of-service paradigm is needed: a new paradigm of end-user-oriented system design that moves beyond the core delivery system itself to encompass issues of context and environment in mobile video, in order to achieve the best quality of experience.

The influence factors (IFs) for video delivery include system IFs as well as context IFs and user IFs, as illustrated below. From a system design perspective, the end-to-end ecological chain includes the host, the channels, and the clients.

Chen's research team is working on various topics related to mobile video delivery. These range from quality assessment of user-generated videos when the original videos are unavailable as references, to mobile video viewing when the ambient environment changes dynamically and requires adaptation, to IoVT applications that need new transmission strategies for 5G and future-generation mobile communication architectures, and to power-saving video display for battery-operated mobile devices such as smart phones. Such a new design paradigm encompasses issues related to either the host-channels-client partition or the system-context-user partition.

Assessing and monitoring of video delivery quality and user experiences:

Illustration of a novel pseudo-reference image assessment principle for contemporary user-generated multimedia content delivery, where the original reference images are unavailable. This is a paradigm-shifting idea in that, instead of comparing the image to a perfect-quality reference, the pseudo reference is estimated from the worst possible distortions. This idea can be implemented relatively easily as the following operational pipelines:
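The pseudo-reference principle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual pipeline: it assumes blur is the distortion of interest, uses a heavy box blur as the "worst-case" pseudo reference, and measures closeness with a gradient-similarity term chosen here for simplicity. All function names and parameters are hypothetical.

```python
import numpy as np


def box_blur(img: np.ndarray, radius: int) -> np.ndarray:
    """Separable box blur with edge padding (assumed stand-in for a worst-case blur)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" restores the input size.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )


def gradient_magnitude(img: np.ndarray) -> np.ndarray:
    """Per-pixel gradient magnitude from central differences."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)


def pseudo_reference_blur_score(img: np.ndarray, radius: int = 4,
                                eps: float = 1e-6) -> float:
    """Similarity in [0, 1] between an image and its pseudo reference.

    The pseudo reference is a heavily blurred copy of the input itself, so
    no pristine original is needed. A HIGHER score means the input is
    closer to the worst case, i.e. it is already more blurred.
    """
    img = img.astype(float)
    pseudo_ref = box_blur(img, radius)
    g_img = gradient_magnitude(img)
    g_ref = gradient_magnitude(pseudo_ref)
    similarity = (2 * g_img * g_ref + eps) / (g_img ** 2 + g_ref ** 2 + eps)
    return float(similarity.mean())


# Usage: a sharp texture versus a pre-blurred copy of the same texture.
rng = np.random.default_rng(0)
sharp = rng.random((64, 64)) * 255.0
blurred = box_blur(sharp, 4)
sharp_score = pseudo_reference_blur_score(sharp)
blurred_score = pseudo_reference_blur_score(blurred)
```

Note the inverted reading compared with full-reference metrics: the closer an image is to its pseudo reference, the lower its estimated quality.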

Mobile video perception: Viewing environment sensing and adaptation:

Conventional video perception research focuses on outcomes from controlled laboratory viewing environments that emulate living room settings. Now that smart phones have become the primary platform for viewing videos, the environments in which consumers view the video signal have changed completely. Three new primary viewing contexts have been identified: viewing distance, ambient environment, and viewer motion patterns. It is necessary to design new adaptation schemes based on these viewing contexts. New generations of smart phones are equipped with various sensors whose data can be used to estimate the viewing context. We have designed a novel adaptation scheme to maximize smart phone users' mobile video perception experience.
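To illustrate why one of these contexts, viewing distance, matters for adaptation, the sketch below computes the angular pixel density seen by the viewer. It covers only this one factor and is not the team's adaptation scheme. The 60 pixels-per-degree acuity limit (about one arcminute per pixel) is an assumed rule of thumb, and the function names are hypothetical.

```python
import math


def pixels_per_degree(ppi: float, viewing_distance_in: float) -> float:
    """Pixels subtended by one degree of visual angle.

    ppi: display pixel density in pixels per inch.
    viewing_distance_in: eye-to-screen distance in inches (in practice
    this would be estimated from on-board sensors).
    """
    inches_per_degree = 2.0 * viewing_distance_in * math.tan(math.radians(0.5))
    return inches_per_degree * ppi


def max_useful_scale(ppi: float, viewing_distance_in: float,
                     acuity_ppd: float = 60.0) -> float:
    """Fraction of native resolution the viewer can resolve.

    acuity_ppd is an ASSUMED acuity limit; above it, extra pixels are
    treated as imperceptible, so a lower-resolution stream may suffice.
    """
    return min(1.0, acuity_ppd / pixels_per_degree(ppi, viewing_distance_in))


# Usage: a 400 ppi phone held at 12 inches and at 6 inches.
ppd_far = pixels_per_degree(400, 12)
scale_far = max_useful_scale(400, 12)
scale_near = max_useful_scale(400, 6)
```

Under these assumptions, the phone held farther away exceeds the acuity limit, so a reduced-resolution stream could be delivered without visible loss, while the closer viewing distance needs the full native resolution.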

Battery power aware mobile display adaptation:

Battery power is extremely important for mobile devices, especially now that devices such as smart phones are frequently used to view video content on the go. It is well known that the brightness level of the mobile display determines how quickly the battery is drained. Coupled with the wireless transmission of the video content, video delivery to mobile devices consumes 80% or more of the power while such a service is in session. There are various on-board strategies to reduce power consumption. Many of these schemes either consume additional power for video analytics or cause undesired degradation of the video quality. We have developed a holistic approach to this systemic problem: shifting the computational burden of video analytics to the cloud and piggybacking the operational parameters required for display adaptation via negligible data augmentation. This is accomplished by an innovative solution that obtains an R-D-DE profile for a given video content.

The above illustration shows the paradigm-shifting idea of augmenting the traditional rate-distortion (R-D) principle into the new rate-distortion-display energy (R-D-DE) principle, which incorporates display energy reduction (DER) into the overall system design. This augmentation shifts per-device computation on each smart phone to cloud-based computation, saving smart phone power at a massive scale. Every smart phone that accesses video content processed by the R-D-DE mechanism is spared the video content analysis needed for DER. We have also designed several energy-saving schemes for video viewing services under various viewing conditions.
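A distortion versus display-energy trade-off of this kind can be sketched as follows. This is a simplified illustration, not the team's R-D-DE method: it omits the rate dimension, assumes an LCD-style display whose backlight power is proportional to the backlight level, and compensates dimming by boosting pixel values. All names are hypothetical. In the spirit of the text, such a profile would be computed once in the cloud and the selected level shipped to the phone as a small piece of metadata.

```python
import numpy as np


def dim_with_compensation(frame: np.ndarray, backlight: float) -> np.ndarray:
    """Perceived luminance after dimming the backlight and boosting pixels.

    frame: luminance values in [0, 1]; backlight: level in (0, 1].
    Pixel values are scaled by 1/backlight and clipped at 1.0; the
    clipping of bright pixels is the source of distortion.
    """
    compensated = np.clip(frame / backlight, 0.0, 1.0)
    return compensated * backlight


def distortion_energy_profile(frame: np.ndarray, levels):
    """Return (backlight, energy_saving, mse) tuples for each level.

    ASSUMPTION: display power is proportional to the backlight level,
    so the saving relative to full brightness is 1 - backlight.
    """
    profile = []
    for b in levels:
        perceived = dim_with_compensation(frame, b)
        mse = float(np.mean((frame - perceived) ** 2))
        profile.append((b, 1.0 - b, mse))
    return profile


def pick_backlight(profile, max_mse: float):
    """Entry with the largest energy saving whose distortion is within budget."""
    feasible = [p for p in profile if p[2] <= max_mse]
    return max(feasible, key=lambda p: p[1]) if feasible else None


# Usage: a dark frame tolerates substantial dimming without distortion.
frame = np.full((4, 4), 0.4)
profile = distortion_energy_profile(frame, [1.0, 0.75, 0.5, 0.25])
choice = pick_backlight(profile, max_mse=1e-6)
```

For this dark frame, dimming to half brightness causes no clipping and therefore no distortion, whereas dimming to a quarter clips every pixel. The selection step trades those two quantities against each other.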

Research Direction #2: Image/Video Analytics

Video analytics encompasses a broad range of research topics. Chen's research team is working on three major topics:

1. Human-Object Interaction (HOI) and Human-Human Interaction (HHI)

2. Scene graph generation (SGG) from Images and Videos

3. Storytelling from Images and Videos

These topics form different but coherent levels of semantics extraction from images and videos. Human-object interaction is a constrained analytics task that targets images and videos containing human subjects who are performing some type of action and interacting with an object or thing. Scene graph generation attempts to find all types of objects, including humans, and to determine their mutual relationships. The results of scene graph generation generally represent a relatively complete description of all objects within an image or a video. Storytelling is at a higher level of semantics extraction and is expected to tell a coherent story based on the analytics performed on a given set of images and videos. All three research topics employ different deep learning techniques and also incorporate common or prior knowledge as well as visual perception principles to achieve enhanced analytics performance for numerous downstream AI tasks.

Human-Object Interaction & Human-Human Interaction:

Many images and videos contain humans and associated objects. It is important to understand the interactions between humans and objects as well as among human subjects within an image or a video. These two types of interaction are the most fundamental for the computerized understanding of images and videos.

An illustration of knowledge-aware human-object interaction detection. Red, blue, and black lines represent functionally similar objects, behaviorally similar actions, and holistically similar interactions, respectively. We argue that successful detection of an HOI should benefit from knowledge obtained from similar objects, actions, and interactions. The same argument can also be applied to human-human interaction to understand group activities when multiple human subjects are contained within an image or a video. Our approaches focus on how to incorporate knowledge into the detection and recognition of human-object interaction and human-human interaction.

Scene Graph Generation (SGG) from images and videos:

Scene graph generation is another fundamental technique in image and video analytics. A scene graph is a structured representation of a scene that clearly expresses the objects, their attributes, and the relationships between objects in the scene. This is a higher-level computer vision technique that goes beyond simple object detection and recognition, aiming at a higher level of understanding and reasoning about the visual scene. One important attribute of a visual scene is the attention drawn to salient objects. To generate better scene graphs, we believe a saliency attention mechanism needs to be embedded into the design of SGG schemes. We have achieved improved SGG results by incorporating a saliency attention mechanism.

An illustration of human perception of a scene (a), where humans often allocate attention to the salient visual relations that are worthy of mention in a natural-language utterance. The existing scene graph generation schemes (b) fail to identify such salient relations, while the scene graph with key relations (c) better aligns with human perception by upgrading each edge with an attribute of relation saliency.
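As a data-structure sketch of "upgrading each edge with an attribute of relation saliency", the following Python code stores a saliency score on every relation and selects the key relations from it. It illustrates only the representation; how saliency scores are predicted is outside its scope. The class names, field names, example relations, and the [0, 1] score range are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Relation:
    """One scene-graph edge: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str
    saliency: float  # assumed score in [0, 1]; higher = more worth mentioning


@dataclass
class SceneGraph:
    """Objects as nodes, relations as saliency-annotated edges."""
    objects: Set[str] = field(default_factory=set)
    relations: List[Relation] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, obj: str,
                     saliency: float) -> None:
        # Adding an edge also registers both endpoints as nodes.
        self.objects.update((subject, obj))
        self.relations.append(Relation(subject, predicate, obj, saliency))

    def key_relations(self, top_k: int) -> List[Relation]:
        """The top_k relations ranked by saliency, most salient first."""
        ranked = sorted(self.relations, key=lambda r: r.saliency, reverse=True)
        return ranked[:top_k]


# Usage with made-up relations and scores.
graph = SceneGraph()
graph.add_relation("man", "riding", "horse", saliency=0.9)
graph.add_relation("man", "wearing", "hat", saliency=0.4)
graph.add_relation("tree", "behind", "fence", saliency=0.1)
key = graph.key_relations(top_k=1)
```

Keeping saliency as an edge attribute, rather than filtering edges at construction time, leaves the complete graph available to downstream tasks that still need every relation.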

Storytelling from images and videos:

Storytelling is at a higher level of semantics extraction than human-object interaction, human-human interaction, and scene graph generation. This task is expected to compose a coherent story based on the analytics performed on a given set of images and videos. In recent years, natural language description of a given image has seen significant advances through various image captioning approaches. Some dense captioning results can describe almost all subjects and objects within an image. Storytelling goes one step further and generates a coherent paragraph of several sentences that tells a story fully describing the theme of a set of related images or a video segment.

The above figure illustrates an example of visual storytelling results on the SIND dataset. Three stories are shown for each photo stream: the ground-truth story, the story generated by the baseline (Park and Kim, NIPS 2015), and the story generated by the approach developed by Chen's team, implemented as Bidirectional Attention Recurrent Neural Networks (BARNN). The colored words indicate semantic matches between the generated results and the ground truth. The proposed scheme shows better semantic alignment in storytelling.