Multimedia delivery system design encompasses a broad range of research topics. Chen's research team is working on four major topics:
1. Assessment and monitoring of video delivery quality and user experience
2. Mobile video perception: Viewing environment sensing and adaptation
3. Battery power aware mobile display adaptation
4. Robust and adaptive Internet of Video Things (IoVT) via NOMA and edge computing
Multimedia content, especially video, has seen unprecedented growth in popularity over the last decades. The ultimate objective of any video delivery system is to ensure that end users enjoy the best possible experience. We believe a significant departure from the conventional quality-of-service paradigm is needed: a new, end-user-oriented design paradigm that moves beyond the core delivery system itself to encompass the context and environment of mobile video viewing, so as to achieve the best quality of experience.
The influence factors (IFs) for video delivery include system IFs as well as context IFs and user IFs, as illustrated below. From a system design perspective, the end-to-end ecological chain includes the host, the channels, and the clients.
Chen's research team is working on various topics related to mobile video delivery, ranging from quality assessment of user-generated videos when the original videos are unavailable as references, to mobile video viewing when the ambient environment changes dynamically and requires adaptation, to IoVT applications that call for new transmission strategies for 5G and future-generation mobile communication architectures, to power-saving video display for battery-operated mobile devices such as smartphones. This new design paradigm encompasses issues spanning both the host-channels-client partition and the system-context-user partition.
Illustration of a novel pseudo-reference image assessment principle for contemporary user-generated multimedia content delivery when the original reference images are unavailable. This is a paradigm-shifting idea: instead of comparing the image to a perfect-quality reference, the pseudo reference is estimated from the worst possible distortions. The idea can be implemented straightforwardly as the following operational pipeline:
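As a toy illustration of the pseudo-reference principle, the sketch below treats heavy blur as the worst-case distortion and scores an image by how far it sits from its own blurred pseudo reference; the box filter, image size, and scoring rule are illustrative assumptions, not the published method.

```python
import numpy as np

def box_blur(img, k=9):
    """Separable box filter, standing in for a near-worst-case blur distortion."""
    kernel = np.ones(k) / k
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, tmp)

def pri_blur_score(img, k=9):
    """No-reference quality proxy: distance to the pseudo reference.

    The pseudo reference is the image pushed toward the worst blur level.
    A sharp image changes a lot under further blurring (large distance,
    high quality); an already-blurred image barely changes (low quality).
    """
    pseudo_ref = box_blur(img, k)
    return float(np.sqrt(np.mean((img - pseudo_ref) ** 2)))

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))          # high-detail test image
degraded = box_blur(sharp, k=9)       # simulated blur-distorted input
print(pri_blur_score(sharp) > pri_blur_score(degraded))  # True
```

The key design choice is that no pristine original is ever needed: both inputs are compared only against distorted versions of themselves.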
Conventional video perception research focuses on outcomes from controlled laboratory viewing environments modeled on living-room settings. As smartphones have become the primary platform for viewing videos, the environments in which consumers view video have changed completely. Three primary new viewing contexts have been identified: viewing distance, ambient environment, and viewer motion patterns. It is therefore necessary to design new adaptation schemes based on the viewing context. New generations of smartphones are equipped with various sensors whose data can be used to estimate the viewing context, and we have designed a novel adaptation scheme to maximize smartphone users' mobile video perception experience.
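A rule-based sketch of context-driven adaptation is shown below; the sensor inputs mirror the three viewing contexts above, but all thresholds, rendition names, and the brightness formula are hypothetical, not the published scheme.

```python
def adapt_playback(ambient_lux: float, viewing_distance_m: float, in_motion: bool) -> dict:
    """Map sensed viewing context to playback settings (illustrative rules only)."""
    # Beyond roughly arm's length, or while the viewer is walking, the eye
    # cannot resolve full detail on a phone-sized screen, so a lower
    # rendition suffices and saves bandwidth.
    if viewing_distance_m > 0.6 or in_motion:
        resolution = "720p"
    else:
        resolution = "1080p"
    # Brighter surroundings call for a brighter backlight to preserve contrast.
    brightness = min(1.0, 0.3 + ambient_lux / 1000.0)
    return {"resolution": resolution, "brightness": round(brightness, 2)}

print(adapt_playback(ambient_lux=50, viewing_distance_m=0.35, in_motion=False))
# → {'resolution': '1080p', 'brightness': 0.35}
```

In a real system the ambient-light sensor, front camera (for distance estimation), and accelerometer would feed these inputs continuously.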
The above illustration shows the paradigm-shifting idea of augmenting the traditional rate-distortion (R-D) principle into a new rate-distortion-display energy (R-D-DE) principle that incorporates display energy reduction (DER) into the overall system design. This augmentation shifts per-device computation on each smartphone to cloud-based computation, saving smartphone power at a massive scale: every smartphone that accesses video content processed by the R-D-DE mechanism avoids repeating the video content analysis required for DER. We have also designed several energy-saving schemes for video viewing services under various viewing conditions.
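The R-D-DE principle can be sketched as an operating-point selection problem: the classic Lagrangian cost J = D + λR is extended with a weighted display-energy term, J = D + λR + γE. The candidate points, weights, and units below are illustrative numbers, not measured data.

```python
# Candidate operating points: (rate in kbps, distortion as MSE, display energy per frame).
# The third entry is a hypothetical brightness-scaled variant: slightly worse
# distortion, but lower display energy.
candidates = [
    (500,  40.0, 12.0),
    (1500, 18.0, 12.0),
    (1500, 20.0,  7.0),
    (3000, 10.0, 12.0),
]

def rdde_select(points, lam=0.01, gamma=1.0):
    """Pick the point minimizing J = D + lam*R + gamma*E.

    With gamma = 0 this degenerates to the classic R-D Lagrangian;
    gamma > 0 trades a little distortion for display-energy savings.
    """
    return min(points, key=lambda p: p[1] + lam * p[0] + gamma * p[2])

print(rdde_select(candidates))             # energy-aware choice
print(rdde_select(candidates, gamma=0.0))  # classic R-D choice, ignoring energy
```

Comparing the two calls shows the effect of the DE term: the energy-aware selection prefers the brightness-scaled variant, while the pure R-D selection does not.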
Video analytics encompasses a broad range of research topics. Chen's research team is working on three major topics:
1. Human-Object Interaction (HOI) and Human-Human Interaction (HHI)
2. Scene graph generation (SGG) from Images and Videos
3. Storytelling from Images and Videos
These topics form different but coherent levels of extracting semantics from images and videos. Human-object interaction is a constrained analytic task targeting images and videos that contain human subjects who are performing some type of action and interacting with an object. Scene graph generation attempts to find all types of objects, including humans, and to figure out their mutual relationships; the resulting scene graph generally represents a relatively complete description of all objects within an image or a video. Storytelling sits at a higher level of semantics extraction and is expected to tell a coherent story based on the analytics performed on a given set of images and videos. All three research topics employ different deep learning techniques and also incorporate common or prior knowledge as well as visual perception principles to achieve enhanced analytics performance for numerous downstream AI tasks.
Human-Object Interaction & Human-Human Interaction:
Many images and videos contain humans and associated objects. It is important to understand the interactions between humans and objects, as well as among human subjects, within an image or a video. These two types of interaction are the most fundamental for the computerized understanding of images and videos.
An illustration of knowledge-aware human-object interaction detection. Red, blue, and black lines represent functionally similar objects, behaviorally similar actions, and holistically similar interactions, respectively. We argue that successful detection of an HOI should benefit from knowledge obtained from similar objects, actions, and interactions. The same philosophical argument can also be applied to human-human interaction to understand group activities when multiple human subjects appear within an image or a video. Our approaches focus on how to incorporate such knowledge into the detection and recognition of human-object and human-human interactions.
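The knowledge-transfer intuition can be sketched as a simple back-off prior: when an (action, object) pair was never observed in training, borrow statistics from behaviorally similar actions. The counts and the similarity table below are made up for illustration and stand in for what a real system would learn from data or a knowledge graph.

```python
# Observed (action, object) training counts; rare pairs are missing entirely.
counts = {("ride", "horse"): 120, ("ride", "bike"): 300, ("feed", "horse"): 40}

# Hypothetical behavioral-similarity table (the "blue lines" in the figure):
# "straddle" is behaviorally similar to "ride".
similar_actions = {"straddle": ["ride"]}

def hoi_prior(action, obj):
    """Return a co-occurrence prior for an (action, object) pair.

    If the exact pair is unseen, back off to the strongest count among
    behaviorally similar actions paired with the same object.
    """
    if (action, obj) in counts:
        return counts[(action, obj)]
    borrowed = [counts.get((a, obj), 0) for a in similar_actions.get(action, [])]
    return max(borrowed, default=0)

print(hoi_prior("straddle", "horse"))  # borrows from ("ride", "horse") → 120
```

A detector can combine such a prior with visual evidence so that rare but plausible interactions are not scored as impossible.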
Scene Graph Generation (SGG) from images and videos
Scene graph generation is another fundamental technique in image and video analytics. A scene graph is a structured representation of a scene that can clearly express the objects, their attributes, and the relationships between objects in the scene. This is a higher-level computer vision technique, beyond simple object detection and recognition, aimed at understanding and reasoning about the visual scene. One important attribute of a visual scene is the attention drawn to salient objects. To generate better scene graphs, we believe a saliency attention mechanism needs to be embedded into the design of SGG schemes. We have achieved improved SGG results by incorporating such a saliency attention mechanism.
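A minimal sketch of a scene graph with relation saliency: each edge is a (subject, predicate, object) triplet carrying a saliency score, and the "key relations" are simply the top-scoring edges. The example relations and scores are illustrative; in the actual scheme the saliency would come from a learned attention mechanism.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    """One edge of a scene graph, augmented with a relation-saliency score."""
    subject: str
    predicate: str
    obj: str
    saliency: float  # in [0, 1]; how attention-worthy the relation is

def key_relations(graph, top_k=2):
    """Rank edges by saliency so the relations humans would mention first
    come out on top."""
    return sorted(graph, key=lambda r: r.saliency, reverse=True)[:top_k]

scene = [
    Relation("man", "riding", "horse", 0.9),
    Relation("horse", "on", "grass", 0.4),
    Relation("tree", "behind", "fence", 0.1),
]
print([(r.subject, r.predicate, r.obj) for r in key_relations(scene)])
# → [('man', 'riding', 'horse'), ('horse', 'on', 'grass')]
```

Keeping saliency as a per-edge attribute means downstream tasks such as captioning can consume either the full graph or only its key relations.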
An illustration of human perception of a scene (a), where humans often allocate attention to the salient visual relations worth mentioning in a natural-language utterance. Existing scene graph generation schemes (b) fail to identify such salient relations, while the scene graph with key relations (c) better aligns with human perception by augmenting each edge with a relation-saliency attribute.
Storytelling from images and videos
The above figure illustrates an example of a visual storytelling result on the SIND dataset. Three stories are generated for each photo stream: the ground-truth story, the story by the baseline (Park and Kim, NIPS 2015), and the story by the approach developed by Chen's team, implemented as a Bidirectional Attention Recurrent Neural Network (BARNN). The colored words indicate semantic matches between the generated results and the ground truth. The proposed scheme shows better semantic alignment in storytelling.