National Chung Cheng University
Department of Electrical Engineering
ADVANCED MACHINE PERCEPTION MODEL FOR
ACTIVITY-AWARE APPLICATION
Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu Chiang Chen
June 2021
Acknowledgments
It would have been impossible to work on this thesis without the support of many
people, who all have my deep and sincere gratitude. First of all, I would like to thank
Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD
student and for having trusted me for years. I could not find a better adviser than
Professor Chen, who gave me ideas, advice, and motivation and, above all, let me
pursue my thoughts freely. He was an excellent theoretical point of reference and was
critical in testing my theories. He continues to impress me with his systematic style,
compassion, and humility, and, above all, he has pushed me to become a better researcher
as well as a better person. Thank you for introducing me to the world of computer vision
and research and for taking me on as a young researcher in my formative years.
Thank you to my thesis committee - Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen,
Prof. Wen-Nung Lie, Prof. Sung-Nien Yu and Prof. Rachel Chiang - for their time, kind
advice and feedback on my thesis. You have taught me a lot and guided me in
improving the quality of my research.
I was incredibly happy to have amazing mentors at CCU over the years. I am
grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn
the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-
Nien Yu, Professor Norbert Michael Mayer and Professor Rachel Chiang for taking me
through my academic courses, helping me to expand my field of study, and teaching me
to be a systematic experimentalist.
I'd like to express my gratitude to all of my friends and colleagues in the
Department of Electrical Engineering. The great staff there taught me a great deal
during my Ph.D. I didn't have as many chances as I wanted to engage with
them, but I was always motivated by their ideas and work. Also, thanks to the other great
researchers in my field: Hoang Tran, Hung Nguyen, Luan Tran, and many others; I
thank them for working with me, and I hope there will be even more opportunities to
learn from them in the future. I would like to thank my many laboratory partners in the
VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng
Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their
support and guidance helped me overcome some stressful times. Finally, thank you for
sharing your thoughts, documentation, datasets and code. None of this work
would have been possible without my many colleagues working in computer
vision and machine learning, and I want to thank my many amazing co-authors as well.
I was also incredibly lucky to have made many friends during my time at CCU.
Thanks to the badminton team in the EE department and the VSACCU badminton team for
showing me all the awesome activities along the PhD journey!
Finally, none of this would have been possible without the valuable support,
encouragement, love and patience of my family, especially my parents Ha Sinh Hai and
Tran Thi Theu. From my first years as a student, they prepared me for this by showing
me the importance of hard work, critical thinking and patience. I thank them for their
support, trust and innumerable sacrifices; I have worked on this study thousands of
kilometers from home and missed many significant life events. And speaking of love
and encouragement, I am grateful to my wife Nguyen Hong Quyen and my daughter
Ha Hai Ngan for all our many wonderful years and well-spent weekends and vacations,
and above all for staying by my side, even when a long distance was between us. Thank
you for making me who I am today and for having always been there and trusted me.
Abstract
People are among the most important entities that computer vision systems need
to understand in order to be useful and ubiquitous in various applications. Much of this
awareness rests on the recognition of human activities, for instance in homecare systems
that observe and support elderly people. Human beings do this well on their
own: we look at others and can describe each action in detail. Moreover, we can reason
about those actions over time and even predict possible actions in the future. Computer
vision algorithms, on the other hand, still lag well behind on this challenge. In this study,
my research aim is to create learning models that can automatically induce
representations of human activities, especially their structure and feature semantics, in
order to solve several higher-level action tasks and to approach a context-aware engine
for various action recognition applications.
In this dissertation, we explore techniques to improve human action
understanding from video inputs, which are common in daily settings such as
surveillance, traffic, education, movies, and sports, using challenging
large-scale benchmark datasets and our own panoramic video dataset. This dissertation
targets the action recognition and action detection of humans in videos. The most
important insight is that actions depend on global features parameterized by the scene,
objects, and other subjects, apart from their own local features parameterized by body-pose
characteristics. Additionally, modeling temporal features through the optical flow of
people and objects moving in the scene can further help in recognizing human
actions. These dependencies are exploited in five key directions: (1) Detecting moving
subjects using a background subtraction scheme, tracking the extracted subjects using a
Kalman filter, and classifying handcrafted features via traditional
machine learning (GMM, SVM); (2) Developing a computation-affordable recognition
system with a lightweight model capable of learning on a portable device; (3) Using
capsule networks and skeleton-based map generation to attend to the subjects and
build their correlation and attention context; (4) Exploring an integrated action
recognition model based on the correlations and attention of subjects and scene; (5)
Developing systems based on the refined-highway aggregation model.
In summary, this dissertation presents several novel and significant solutions for
efficient DNN architecture analysis, acquisition, and distribution on large-scale video
data. We show that DNNs using multiple streams, combined models, hybrid structures
conditioned on context, feature input representations, global features, local features,
spatiotemporal attention, and the modified-belief CapsNet efficiently achieve high-quality
results. The consistent improvements contributed by these components allow our
DNNs to achieve state-of-the-art results on popularly used datasets.
Furthermore, we observe that the largest improvements are indeed achieved in
action classes involving human-to-human and human-to-object interactions, and
visualizations of our network show that it focuses on scene context that is intuitively
relevant to action recognition.
Keywords: attention mechanism, activity recognition, action detection, deep
neural network, convolutional neural network, recurrent neural network, capsule
network, spatiotemporal attention, skeleton.
TABLE OF CONTENTS
PAGES
ACKNOWLEDGMENTS .................................................................................... i
ABSTRACT ......................................................................................................... iii
TABLE OF CONTENTS....................................................................................... v
LIST OF FIGURES ............................................................................................ viii
LIST OF TABLES ................................................................................................ xi
I. INTRODUCTION ................................................................................................. 1
II. MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON
SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE................................. 8
2.1 Introduction ................................................................................................... 9
2.2 Technical Approach ........................................................................................11
2.2.1 Handcraft feature extraction by locally body subject estimation ...... 12
2.2.2 Proposed action recognition on single subject .................................. 17
2.2.3 Proposed action recognition on multiple subjects............................. 20
2.3 Experiment Results and Discussion .............................................................. 27
2.3.1 Effectiveness of our proposal to single subject on action recognition ... 28
2.3.2 Effectiveness of our proposal to multiple subjects on action
recognition......................................................................................... 31
2.4 Summary and Discussion .............................................................................. 35
III. ACTION RECOGNITION USING A LIGHTWEIGHT MODEL....................... 38
3.1 Introduction ................................................................................................ 37
3.2 Related Work .............................................................................................. 40
3.3 Action recognition by a lightweight model................................................ 41
3.4 Experiments and Results............................................................................ 45
3.5 Summary and Discussion ........................................................................... 49
IV. ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY,
USING DNN AND CAPSNET .......................................................................... 50
4.1 Introduction................................................................................................... 51
4.2 Related Previous Work.................................................................................. 55
4.2.1 Diverse Spatio-Temporal Feature Generation................................. 55
4.2.2 Capsule Neural Network................................................................. 59
4.3 Proposed DNNs for Action Recognition....................................................... 59
4.3.1 Proposed Generic DNN with Spatiotemporal Attentions .................. 59
4.3.2 Proposed CapsNet-Based DNNs........................................................ 68
4.4 Experiments and Comparisons of the Proposed DNNs ................................ 72
4.4.1 Datasets and Parameter Setup for Simulations.................................. 72
4.4.2 Analyses and Comparisons of Experimental Results......................... 74
4.4.3 Analyses of Computational Time and Cost........................................ 84
4.4.4 Visualization....................................................................................... 85
4.5 Summary and Discussion............................................................................... 88
V. ACTION RECOGNITION ENHANCED BY CORRELATIONS AND
ATTENTION OF SUBJECTS AND SCENE..................................................... 89
5.1 Introduction................................................................................................... 89
5.2 Related work ................................................................................................. 91
5.3 Proposed DNN.............................................................................................. 92
5.3.1 Projection of SBB to ERB in the Feature Domain ............................ 92
5.3.2 Map Convolutional Fused-Depth Layer ............................................ 93
5.3.3 Attention Mechanisms in SA and TE Layers..................................... 93
5.3.4 Generation of Subject Feature Maps.................................................. 95
5.4 Experiments and Discussion......................................................................... 97
5.4.1 Datasets and Parameter Setup for Implementation Details ........................ 97
5.4.2 Analyses and Comparisons of Experimental Results................................. 98
5.5 Summary and Discussion......................................................................... 100
VI. SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON
MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO
CONTEXT........................................................................................................ 101
6.1 Introduction................................................................................................. 102
6.2 Related work ............................................................................................... 107
6.2.1 Action Recognition with DNNs....................................................... 107
6.2.2 Attention Mechanisms ..................................................................... 108
6.2.3 Bounding Boxes Detector for Action Detection .............................. 109
6.3 Proposed Methodology ................................................................................110
6.3.1 Action Refined-Highway Network ...................................................112
6.3.2 Action Detection ...............................................................................118
6.3.3 End-to-End Network Architecture on Action Detection.................. 123
6.4 Experimental Results and Discussion......................................................... 123
6.4.1 Datasets............................................................................................ 123
6.4.2 Implementation Details.................................................................... 125
6.4.3 Ablation Studies............................................................................... 127
6.5 Summary and Discussion............................................................................ 133
VII. CONCLUSION AND FUTURE WORK .......................................................... 135
REFERENCES .................................................................................................. 139
APPENDIX A.................................................................................................... 152
LIST OF FIGURES
Figures Pages
2.1 Schematic diagram of height estimation......................................................... 13
2.2 Distance estimation at the situation (I). .......................................................... 14
2.3 Estimated Distance at the situation (II)........................................................... 15
2.4 Distance curve pattern of the measurement of a standing subject.................. 16
2.5 Proposed flow chart of our action recognition system. ................................. 18
2.6 Flowchart detection of TV on/off ................................................................... 19
2.7 Proposed activity recognition system. ............................................................ 21
2.8 Example of shape BLOB generation from the foreground. ........................... 22
2.9 Illustration of tracking by the Kalman filter method ...................................... 23
2.10 Proposed FSM................................................................................................. 24
2.11 Estimates of activity states in the overlapping interval. ................................. 26
2.12 Proposed incremental majority voting ............................................................ 26
2.13 Room layout and experiment scenario............................................................ 27
2.14 Examples of five activities recorded from the panoramic camera.................. 28
2.15 Total accuracy rate (A) versus p and r............................................................. 32
2.16 Total accuracy (A) versus p at r=0.001, 0.01, 0.05, 0.1, and 0.2. ................... 32
3.1 Proposed recognition system. ......................................................................... 41
3.2 Functional blocks of the proposed MoBiG..................................................... 42
3.3 Proposed finite state machine. ........................................................................ 43
3.4 Proposed incremental majority voting. ........................................................... 44
3.5 Confusion matrix of MoBiG identifying four activities. ................................ 48
4.1 CapsNets integrated in a generic DNN........................................................... 55
4.2 Block diagram of the proposed generic DNN................................................. 58
4.3 Three skeleton channels................................................................................... 61
4.4 One example of the transformed skeleton maps from an input segment........ 63
4.5 Block diagrams of the proposed AJA and AJM. ............................................. 64
4.6 Block diagram of the proposed A_RNN......................................................... 67
4.7 Proposed capsule network for TC_DNN and MC_DNN................................ 69
4.8 Block diagrams of the proposed CapsNet-based DNNs. ................................ 71
4.9 Examples of the panoramic videos of 12 actions, where subjects are marked by
red rectangular dashed-line boxes for observation only. ................................ 74
4.10 Visualization of the outputs from the intermediate layers of the proposed
TC_DNN ........................................................................................................ 87
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs. .. 87
5.1 Block diagram of the proposed deep neural network ..................................... 92
5.2 Block diagram of the SA generation layer...................................................... 95
5.3 Performance comparison of the AFS and ROS streams for each action class ...... 98
5.4 JHMDB21 confusion matrix .................................................................................. 99
6.1 Refined highway block for 3D attention....................................................... 104
6.2 Overview of the proposed architecture for action recognition and detection 110
6.3 Temporal bilinear inception module. .............................................................113
6.4 RH block in RNN structures, like the standard RNN, LSTM, GRU and variant
RNNs..............................................................................................................114
6.5 Schematic of recurrent 3D Refined-Highway depth by three RH blocks ......116
6.6 3DConvDM layer correlating the feature map X ..........................................118
6.7 The details of GAA module ........................................................................ 122
6.8 mAP for per-category on AVA ...................................................................... 132
6.9 Possible locations associated with proposal regions of a subject ................. 132
6.10 Visualization of R_FSRH module on the UCF101-24 validation set........... 132
6.12 Qualitative results of R_FSRH on action recognition and detection on the
JHMDB21 dataset......................................................................................... 140
6.13 Qualitative results of top predictions for some of the classes using proposed
model on Ava ................................................................................................ 141
A.1 12 category visualization of panoramic camera data in two and three
dimensions with the scatter plot filter........................................................... 151
A.2 T-SNE test data visualization where each data point is represented by the
shots of a frame sequence on UCF101 ......................................................... 151
LIST OF TABLES
Tables Pages
1.1 List of published and submitted papers in this dissertation .............................. 7
2.1 Features used for posture recognition ............................................................. 18
2.2 States and transitions of our FSM .................................................................. 24
2.3 State estimates at the overlapping interval...................................................... 25
2.4 Confusion matrix of TV on/off detection........................................................ 28
2.5 Confusion matrix of activity recognition (I)................................................... 29
2.6 Comparison of features and performance of the proposed and conventional
activity recognition ......................................................................................... 30
2.7 Average accuracies of four activities at the type-I experiment....................... 33
2.8 Confusion matrix of activity recognition (II).................................................. 34
2.9 Example I of activity accuracies of two subjects at the type-II experiment. .. 34
2.10 Example II of activity accuracies of two subjects at the type-II experiment.. 34
2.11 Example III of activity accuracies of three subjects at the type-II experiment ... 35
3.1 Features of the original MobileNetV2 and MoBiG ........................................ 46
3.2 Performance comparison of pre-trained MobileNetV2 and MoBiG using the
panoramic camera dataset. .............................................................................. 47
3.3 Accuracies improved by MoBiG plus FSM and IMV. .................................. 48
3.4 Accuracy, complexity, and model size of the proposed system, and the other
DNNs. ............................................................................................................. 49
4.1 Performance of three types of DNNs using the datasets of UCF101, HMDB51,
and panoramic videos ..................................................................................... 75
4.2 Accuracies of 12 actions recognized by three types of DNNs using the
panoramic video dataset.................................................................................. 77
4.3 Accuracies of three topologies from the proposed generic DNN. .................. 78
4.4 Performance of the proposed generic DNN and three CapsNet-based DNNs
with one, two and three input streams ............................................................ 79
4.5 Average accuracies of the proposed TC_DNN at four merging models and
different values of F. ....................................................................................... 81
4.6 Performance comparisons of TC_DNN using different approaches of
generating Tske maps...................................................................................... 82
4.7 Performance comparisons of the proposed and conventional DNNs using only
an RGB stream................................................................................................ 82
4.8 Performance comparisons of the proposed and conventional DNNs using the
HMDB51 and UCF101 datasets. .................................................................... 83
4.9 Parameter amounts and inference time of the generic DNN, MC_DNN,
DC_DNN, and TC_DNN................................................................................ 84
5.1 Accuracies of 8 actions recognized by three types ....................................... 99
6.1 Comparisons of DNNs with the R_FSRH layers at different numbers of RH
modules on TP, JHMDB-21 and UCF101-24 datasets............................................ 128
6.2 Results of the DNN based on I3D+ R_FSRH using different scalar values, γ,
on JHMDB-21 and UCF101-24.................................................................... 129
6.3 Comparison of different component combinations at FRH (R=1) on 3DCNN +
FRH with mAP ....................................................................................... 130
6.4 Comparison of our proposal against other types on three datasets at video mAP ...... 130
A.1 Comparison of the features and performance of our proposed and generic
activity recognition systems on the panoramic video dataset ........................ 152
I. INTRODUCTION
The main goal of visual understanding, image classification, computer vision, and
artificial intelligence is to help people do their work more efficiently. From support
requests to assistive services, the potential impact of vision and AI on an aging society is
immeasurable, and it has grown at an unprecedented rate in recent years. While these
applications are still an active field of study, they share a common theme: they
all require systems to interact with and understand humans. Hence, developing
technologies capable of understanding people is critical to achieving the goal of
ubiquitous AI. Human understanding, however, cannot be done by observing a person
in isolation, because our actions are influenced by the objects we interact with as well
as by the environment we exist in.
The development of deep learning has led to rapid improvements in various
fundamental vision problems. Given large labeled datasets, CNNs are able to learn
robust and efficient representations that exceed human performance on video
classification and perform exceedingly well in action recognition, object detection and
keypoint estimation tasks. But what about tasks that do not have well-defined datasets
or are much harder to label, such as the 3D structures of objects or all actions afforded by
a scene? Moreover, scaling such models up to a higher-level understanding of human
intent and actions over a continuous video stream remains a key limitation and is considered here.
This thesis takes a step towards building and improving systems capable of
understanding human actions and intentions from both large-scale benchmark datasets
and our own dataset. These systems need to reason about the scene layout in
unison with the humans in order to perform well. The thesis is divided into three parts that
explore action understanding, corresponding to the five directions described previously: (i)
handcrafted features and classifiers based on traditional machine learning (see the sketch
after this list); (ii) a lightweight model
for smart devices; (iii) local (pose), temporal (optical flow), and appearance
(RGB) features with capsule networks as the classifier; (iv) an incremental action recognition
model using correlations and attention of subjects and scenes; (v) a refined-highway model
applied globally, locally, and contextually to action recognition with and without localization.
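To make direction (i) concrete, the following minimal Python sketch, which is our own
illustrative example rather than code from this thesis, converts a short sequence of body
keypoints into a fixed-length handcrafted feature vector and trains a traditional SVM
classifier on it; the feature design, joint layout, and action labels are assumptions made
for illustration only.

# Minimal sketch of direction (i): handcrafted features from a keypoint
# sequence, classified by a traditional SVM. Illustrative assumptions only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pose_sequence_to_feature(keypoints):
    """keypoints: array of shape (T, J, 2) - T frames, J joints, (x, y) each."""
    kp = np.asarray(keypoints, dtype=np.float32)
    velocity = np.diff(kp, axis=0)                 # per-joint motion between frames
    return np.concatenate([
        kp.mean(axis=0).ravel(),                   # average posture
        kp.std(axis=0).ravel(),                    # posture variability
        velocity.mean(axis=0).ravel(),             # average motion
    ])

def train_action_svm(keypoint_sequences, labels):
    """keypoint_sequences: list of (T, J, 2) arrays; labels: e.g. 'walk', 'sit', 'fall'."""
    features = np.stack([pose_sequence_to_feature(s) for s in keypoint_sequences])
    classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    classifier.fit(features, labels)
    return classifier

In this sketch, the same pose_sequence_to_feature function would be reused at test time
before calling classifier.predict on a new keypoint sequence.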
Homecare action recognition for single or multiple subjects, using
handcrafted features and traditional machine learning: The homecare of elderly people
has become an important issue that requires recognizing the activities of
multiple subjects at home. We start by understanding humans through focusing directly on
them and, especially, their body poses. Features derived from body pose, typically obtained
using a background subtraction and estimation method, provide a useful signal of a person's
external action state. As handcrafted features, just a few body keypoints extracted over
time can be enough to recognize human actions. Towards that goal, we build systems
to detect and track these features efficiently and accurately in videos. We then use those
features to directly classify and recognize the actions of single and multiple subjects. We
show that the new model using a panoramic camera can enable several novel
applications in homecare systems.
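As a minimal illustration of this detect-and-track step (assuming OpenCV is installed; the
video path is a placeholder, and this is not the implementation evaluated later in the
thesis), the Python sketch below isolates the moving subject with a background subtractor
and smooths the trajectory of its blob centroid with a constant-velocity Kalman filter.

# Minimal sketch: background subtraction to find the moving subject, then
# Kalman-filter tracking of the largest blob centroid. Placeholder video path.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Constant-velocity Kalman filter over the centroid state (x, y, dx, dy).
kalman = cv2.KalmanFilter(4, 2)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kalman.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

capture = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = capture.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    kalman.predict()                               # predict the centroid for this frame
    if contours:
        blob = max(contours, key=cv2.contourArea)  # largest moving blob = subject
        x, y, w, h = cv2.boundingRect(blob)
        measurement = np.array([[x + w / 2.0], [y + h / 2.0]], dtype=np.float32)
        kalman.correct(measurement)                # update the track with the observation
    # The tracked blob (centroid trajectory, size, aspect ratio, ...) would then feed
    # the handcrafted features used by a GMM or SVM classifier.
capture.release()

The corrected state in kalman.statePost holds the smoothed centroid and velocity that
downstream feature extraction could read at each frame.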
Computation-affordable recognition system with a lightweight model:
Activity recognition systems are widely used and growing rapidly in areas including
healthcare and security, as well as others such as entertainment. Currently, edge devices
are very resource limited and do not allow for heavy computation. Previous work has
successfully developed various DNN models for action recognition, but these models are
computationally intensive and therefore inefficient to deploy on edge devices. One of the
most promising solutions today is cloud storage, to which data can be transferred by
several methods for further analysis. However, continuously sending signals to the cloud
demands ever-increasing bandwidth. Routine