National Chung Cheng University
Department of Electrical Engineering
ADVANCED MACHINE PERCEPTION MODEL FOR
ACTIVITY-AWARE APPLICATION
Student: Manh-Hung Ha
Advisor: Professor Oscal Tzu Chiang Chen
June 2021
Acknowledgments
It would have been impossible to work on this thesis without the support of many
people, who all have my deep and sincere gratitude. First of all, I would like to thank
Professor Oscal T.-C. Chen, my academic supervisor, for having taken me on as a PhD
student and for having trusted me for years. I could not find a better adviser than
Professor Chen, who gave me ideas, advice, and motivation and, above all, let me
pursue my thoughts freely. He was an excellent theoretical point of reference and was
critical in testing my theories. He continues to impress me with his systematic style,
compassion, and humility, and, above all, he has pushed me to become a better researcher
as well as a better person. Thank you for introducing me to the world of computer vision
and research and for taking me on as a young researcher in my formative years.
Thank you to my thesis committee - Prof. Wei-Yang Lin, Prof. Oscal T.-C. Chen,
Prof. Wen-Nung Lie, Prof. Sung-Nien Yu and Prof. Rachel Chiang - for their time, kind
advice and feedback on my thesis. You have taught me a lot and guided me in
improving the quality of my research.
I was incredibly happy to have amazing mentors at CCU over the years. I am
grateful to Professor Wen-Nung Lie and Professor Gerald Rau, who helped me learn
the ropes in the first few years. I am thankful to Professor Alan Liu, Professor Sung-
Nien Yu, Professor Norbert Michael Mayer and Professor Rachel Chiang for taking me
through my academic courses, helping me to expand my field of study, and teaching me
to be a systematic experimentalist.
I'd like to express my gratitude to all of my friends and colleagues in the
Department of Electrical Engineering. The great staff there taught me a great deal
during my Ph.D. I didn't have as many chances as I wanted to engage with
them, but I was always motivated by their ideas and work. Also, thanks to the other great
researchers in my field: Hoang Tran, Hung Nguyen, Luan Tran, and many others; I
thank them for working with me, and I hope there will be even more opportunities to
learn from them in the future. I would like to thank my many laboratory partners in the
VLSI DSP group, including Wei-Chih Lai, Ching-Han Tsai, Yi-Lun Lee, Yen-Cheng
Lu, Han-Wen Liu, Yu-Ting Liu, and others. I learned many things from each of them, and their
support and guidance helped me overcome some stressful times. Finally, thank you for
sharing your thoughts, documentation, datasets and code. None of this work
would have been possible without my many colleagues working in computer
vision and machine learning, and I want to thank my many amazing co-authors as well.
I was also incredibly lucky to have made many friends during my time at CCU.
Thanks to the badminton team in the EE department and the VSACCU badminton team for
showing me all the awesome activities along the PhD journey!
Finally, none of this would have been possible without the valuable support,
encouragement, love and patience of my family, especially my parents Ha Sinh Hai and
Tran Thi Theu. From my first years as a student, they prepared me for this by showing
me the importance of hard work, critical thinking and patience. I thank them for their
support, trust and innumerable sacrifices; I have worked on this study thousands of
kilometers from home and missed many significant life events. And speaking of love
and encouragement, I am grateful to my wife Nguyen Hong Quyen and my daughter
Ha Hai Ngan for all our many wonderful years and well-spent weekends and vacations,
and above all for staying by my side, even when a long distance was between us. Thank
you for making me who I am today and for having always been there and trusted me.
Abstract
People are among the most important entities that computer vision systems need
to understand in order to be useful and ubiquitous in various applications. Much of this
awareness rests on the recognition of human activities, for instance in homecare systems
that observe and support elderly people. Human beings do this well on their
own: we look at others and can describe each action in detail. Moreover, we can reason
about those actions over time and even predict possible actions in the future. Computer
vision algorithms, on the other hand, still lag well behind on this challenge. In this study,
my research aim is to create learning models that can automatically induce
representations of human activities, especially their structure and feature semantics, in
order to solve several higher-level action tasks and to approach a context-aware engine
for various action recognition applications.
In this dissertation, we explore techniques to improve human action
understanding from video inputs, which are common in daily settings such as
surveillance, traffic, education, movies, and sports, using challenging
large-scale benchmark datasets and our own panoramic video dataset. This dissertation
targets the action recognition and action detection of humans in videos. The most
important insight is that actions depend on global features parameterized by the scene,
objects, and other subjects, apart from their own local features parameterized by body-pose
characteristics. Additionally, modeling temporal features through the optical flow of
people and objects moving in the scene can further help in recognizing human
actions. These dependencies are exploited in five key directions: (1) Detecting moving
subjects using a background subtraction scheme, tracking the extracted subjects using a
Kalman filter, and classifying handcrafted features via traditional
machine learning (GMM, SVM); (2) Developing a computation-affordable recognition
system with a lightweight model capable of learning on a portable device; (3) Using
capsule networks and skeleton-based map generation to attend to the subjects and
build their correlation and attention context; (4) Exploring an integrated action
recognition model based on the correlations and attention of subjects and scene; (5)
Developing systems based on the refined-highway aggregation model.
In summary, this dissertation presents several novel and significant solutions for
efficient DNN architecture analysis, acquisition, and distribution on large-scale video
data. We show that DNNs using multiple streams, combined models, hybrid structures
conditioned on context, feature input representations, global features, local features,
spatiotemporal attention, and the modified-belief CapsNet efficiently achieve high-quality
results. The consistent improvements contributed by these components allow our
DNNs to achieve state-of-the-art results on popularly used datasets.
Furthermore, we observe that the largest improvements are indeed achieved in
action classes involving human-to-human and human-to-object interactions, and
visualizations of our network show that it focuses on scene context that is intuitively
relevant to action recognition.
Keywords: attention mechanism, activity recognition, action detection, deep
neural network, convolutional neural network, recurrent neural network, capsule
network, spatiotemporal attention, skeleton.
TABLE OF CONTENTS
PAGES
ACKNOWLEDGMENTS .................................................................................... i
ABSTRACT ......................................................................................................... iii
TABLE OF CONTENTS....................................................................................... v
LIST OF FIGURES ............................................................................................ viii
LIST OF TABLES ................................................................................................ xi
I. INTRODUCTION ................................................................................................. 1
II. MULTI-MODAL MACHINE LEARNING APPROACHES LOCALLY ON
SINGLE OR MULTIPLE SUBJECTS FOR HOMECARE................................. 8
2.1 Introduction ................................................................................................... 9
2.2 Technical Approach ........................................................................................11
2.2.1 Handcraft feature extraction by locally body subject estimation ...... 12
2.2.2 Proposed action recognition on single subject .................................. 17
2.2.3 Proposed action recognition on multiple subjects............................. 20
2.3 Experiment Results and Discussion .............................................................. 27
2.3.1 Effectiveness of our proposal to single subject on action recognition ... 28
2.3.2 Effectiveness of our proposal to multiple subjects on action
recognition......................................................................................... 31
2.4 Summary and Discussion .............................................................................. 35
III. ACTION RECOGNITION USING A LIGHTWEIGHT MODEL....................... 38
3.1 Introduction ................................................................................................ 37
3.2 Related Work .............................................................................................. 40
3.3 Action recognition by a lightweight model................................................ 41
3.4 Experiments and Results............................................................................ 45
3.5 Summary and Discussion ........................................................................... 49
IV. ATTENTIVE RECOGNITION LOCALLY, GLOBALLY, TEMPORALLY,
USING DNN AND CAPSNET .......................................................................... 50
4.1 Introduction................................................................................................... 51
4.2 Related Previous Work.................................................................................. 55
4.2.1 Diverse Spatio-Temporal Feature Generation................................. 55
4.2.2 Capsule Neural Network................................................................. 59
4.3 Proposed DNNs for Action Recognition....................................................... 59
4.3.1 Proposed Generic DNN with Spatiotemporal Attentions .................. 59
4.3.2 Proposed CapsNet-Based DNNs........................................................ 68
4.4 Experiments and Comparisons of the Proposed DNNs ................................ 72
4.4.1 Datasets and Parameter Setup for Simulations.................................. 72
4.4.2 Analyses and Comparisons of Experimental Results......................... 74
4.4.3 Analyses of Computational Time and Cost........................................ 84
4.4.4 Visualization....................................................................................... 85
4.5 Summary and Discussion............................................................................... 88
V. ACTION RECOGNITION ENHANCED BY CORRELATIONS AND
ATTENTION OF SUBJECTS AND SCENE..................................................... 89
5.1 Introduction................................................................................................... 89
5.2 Related work ................................................................................................. 91
5.3 Proposed DNN.............................................................................................. 92
5.3.1 Projection of SBB to ERB in the Feature Domain ............................ 92
5.3.2 Map Convolutional Fused-Depth Layer ............................................ 93
5.3.3 Attention Mechanisms in SA and TE Layers..................................... 93
5.3.4 Generation of Subject Feature Maps.................................................. 95
5.4 Experiments and Discussion......................................................................... 97
5.4.1 Datasets and Parameter Setup for Implementation Details ........................ 97
5.4.2 Analyses and Comparisons of Experimental Results................................. 98
5.5 Summary and Discussion......................................................................... 100
VI. SPATIO-TEMPORALLY WITH AND WITHOUT LOCALIZATION ON
MULTIPLE LABELS FOR ACTION PERCEPTION, USING VIDEO
CONTEXT........................................................................................................ 101
6.1 Introduction................................................................................................. 102
6.2 Related work ............................................................................................... 107
6.2.1 Action Recognition with DNNs....................................................... 107
6.2.2 Attention Mechanisms ..................................................................... 108
6.2.3 Bounding Boxes Detector for Action Detection .............................. 109
6.3 Proposed Methodology ................................................................................110
6.3.1 Action Refined-Highway Network ...................................................112
6.3.2 Action Detection ...............................................................................118
6.3.3 End-to-End Network Architecture on Action Detection.................. 123
6.4 Experimental Results and Discussion......................................................... 123
6.4.1 Datasets............................................................................................ 123
6.4.2 Implementation Details.................................................................... 125
6.4.3 Ablation Studies............................................................................... 127
6.5 Summary and Discussion............................................................................ 133
VII. CONCLUSION AND FUTURE WORK .......................................................... 135
REFERENCES .................................................................................................. 139
APPENDIX A.................................................................................................... 152
LIST OF FIGURES
Figures Pages
2.1 Schematic diagram of height estimation......................................................... 13
2.2 Distance estimation at the situation (I). .......................................................... 14
2.3 Estimated Distance at the situation (II)........................................................... 15
2.4 Distance curve pattern of the measurement of a standing subject.................. 16
2.5 Proposed flow chart of our action recognition system. ................................. 18
2.6 Flowchart detection of TV on/off ................................................................... 19
2.7 Proposed activity recognition system. ............................................................ 21
2.8 Example of shape BLOB generation from the foreground. ........................... 22
2.9 Illustration of tracking by the Kalman filter method ...................................... 23
2.10 Proposed FSM................................................................................................. 24
2.11 Estimates of activity states in the overlapping interval. ................................. 26
2.12 Proposed incremental majority voting ............................................................ 26
2.13 Room layout and experiment scenario............................................................ 27
2.14 Examples of five activities recorded from the panoramic camera.................. 28
2.15 Total accuracy rate (A) versus p and r............................................................. 32
2.16 Total accuracy (A) versus p at r=0.001, 0.01, 0.05, 0.1, and 0.2. ................... 32
3.1 Proposed recognition system. ......................................................................... 41
3.2 Functional blocks of the proposed MoBiG..................................................... 42
3.3 Proposed finite state machine. ........................................................................ 43
3.4 Proposed incremental majority voting. ........................................................... 44
3.5 Confusion matrix of MoBiG identifying four activities. ................................ 48
4.1 CapsNets integrated in a generic DNN........................................................... 55
4.2 Block diagram of the proposed generic DNN................................................. 58
4.3 Three skeleton channels................................................................................... 61
4.4 One example of the transformed skeleton maps from an input segment........ 63
4.5 Block diagrams of the proposed AJA and AJM. ............................................. 64
4.6 Block diagram of the proposed A_RNN......................................................... 67
4.7 Proposed capsule network for TC_DNN and MC_DNN................................ 69
4.8 Block diagrams of the proposed CapsNet-based DNNs. ................................ 71
4.9 Examples of the panoramic videos of 12 actions, where subjects are marked by
red rectangular dashed-line boxes for observation only. ................................ 74
4.10 Visualization of the outputs from the intermediate layers of the proposed
TC_DNN ........................................................................................................ 87
4.11 Visualization of the outputs from the intermediate layers of two A_RNNs. .. 87
5.1 Block diagram of the proposed deep neural network ..................................... 92
5.2 Block diagram of the SA generation layer...................................................... 95
5.3 Performance comparison of the AFS and ROS streams for each action class ...... 98
5.4 JHMDB21 confusion matrix .................................................................................. 99
6.1 Refined highway block for 3D attention....................................................... 104
6.2 Overview of the proposed architecture for action recognition and detection 110
6.3 Temporal bilinear inception module. .............................................................113
6.4 RH block in RNN structures, like the standard RNN, LSTM, GRU and variant
RNNs..............................................................................................................114
6.5 Schematic of recurrent 3D Refined-Highway depth by three RH blocks ......116
6.6 3DConvDM layer correlating the feature map X ..........................................118
6.7 The details of GAA module ........................................................................ 122
6.8 mAP for per-category on AVA ...................................................................... 132
6.9 Possible locations associated with proposal regions of a subject ................. 132
6.10 Visualization of R_FSRH module on the UCF101-24 validation set........... 132
6.12 Qualitative results of R_FSRH on action recognition and detection on the
JHMDB21 dataset......................................................................................... 140
6.13 Qualitative results of top predictions for some of the classes using proposed
model on Ava ................................................................................................ 141
A.1 12 category visualization of panoramic camera data in two and three
dimensions with the scatter plot filter........................................................... 151
A.2 T-SNE test data visualization where each data point is represented by the
shots of a frame sequence on UCF101 ......................................................... 151
LIST OF TABLES
Tables Pages
1.1 List of published and submitted papers in this dissertation .............................. 7
2.1 Features used for posture recognition ............................................................. 18
2.2 States and transitions of our FSM .................................................................. 24
2.3 State estimates at the overlapping interval...................................................... 25
2.4 Confusion matrix of TV on/off detection........................................................ 28
2.5 Confusion matrix of activity recognition (I)................................................... 29
2.6 Comparison of features and performance of the proposed and conventional
activity recognition ......................................................................................... 30
2.7 Average accuracies of four activities at the type-I experiment....................... 33
2.8 Confusion matrix of activity recognition (II).................................................. 34
2.9 Example I of activity accuracies of two subjects at the type-II experiment. .. 34
2.10 Example II of activity accuracies of two subjects at the type-II experiment.. 34
2.11 Example III of activity accuracies of three subjects at the type-II experiment ... 35
3.1 Features of the original MobileNetV2 and MoBiG ........................................ 46
3.2 Performance comparison of pre-trained MobileNetV2 and MoBiG using the
panoramic camera dataset. .............................................................................. 47
3.3 Accuracies improved by MoBiG plus FSM and IMV. .................................. 48
3.4 Accuracy, complexity, and model size of the proposed system, and the other
DNNs. ............................................................................................................. 49
4.1 Performance of three types of DNNs using the datasets of UCF101, HMDB51,
and panoramic videos ..................................................................................... 75
4.2 Accuracies of 12 actions recognized by three types of DNNs using the
panoramic video dataset.................................................................................. 77
4.3 Accuracies of three topologies from the proposed generic DNN. .................. 78
4.4 Performance of the proposed generic DNN and three CapsNet-based DNNs
with one, two and three input streams ............................................................ 79
4.5 Average accuracies of the proposed TC_DNN at four merging models and
different values of F. ....................................................................................... 81
4.6 Performance comparisons of TC_DNN using different approaches of
generating Tske maps...................................................................................... 82
4.7 Performance comparisons of the proposed and conventional DNNs using only
an RGB stream................................................................................................ 82
4.8 Performance comparisons of the proposed and conventional DNNs using the
HMDB51 and UCF101 datasets. .................................................................... 83
4.9 Parameter amounts and inference time of the generic DNN, MC_DNN,
DC_DNN, and TC_DNN................................................................................ 84
5.1 Accuracies of 8 actions recognized by three types ....................................... 99
6.1 Comparisons of DNNs with the R_FSRH layers at different numbers of RH
modules on TP, JHMDB-21 and UCF101-24 datasets............................................ 128
6.2 Results of the DNN based on I3D+ R_FSRH using different scalar values, γ,
on JHMDB-21 and UCF101-24.................................................................... 129
6.3 Comparison of different component combinations at FRH (R=1) on 3DCNN +
FRH with mAP ....................................................................................... 130
6.4 Comparison of our proposal against other types on three datasets at video mAP ...... 130
A.1 Comparison of the features and performance of our proposed and generic
activity recognition systems on the panoramic video dataset ........................ 152
I. INTRODUCTION
The main goal of visual understanding, image classification, computer vision, and
artificial intelligence is to help people do their work more efficiently. From support
requests to assistive services, the potential impact of vision and AI on an aging society is
immeasurable, and it has grown at an unprecedented rate in recent years. While these
applications are still an active field of study, they share a common theme: they
all require systems to interact with and understand humans. Hence, developing
technologies capable of understanding people is critical to achieving the goal of
ubiquitous AI. Human understanding, however, cannot be done by observing a person
in isolation, because our actions are influenced by the objects we interact with as well
as by the environment we exist in.
The development of deep learning has led to rapid improvements in various
fundamental vision problems. Given large labeled datasets, CNNs are able to learn
robust and efficient representations that exceed human performance on video
classification and perform exceedingly well in action recognition, object detection and
keypoint estimation tasks. But what about tasks that do not have well-defined datasets
or are much harder to label, such as the 3D structures of objects or all actions afforded by
a scene? Moreover, scaling such models up to a higher-level understanding of human
intent and actions over a continuous video stream remains a key limitation and is considered here.
This thesis takes a step towards building and improving systems capable of
understanding human actions and intentions from both large-scale benchmark datasets
and our own dataset. These systems need to reason about the scene layout in
unison with the humans in order to perform well. The thesis is divided into three parts that
explore action understanding, corresponding to the five directions described previously: (i)
handcrafted features and classifiers based on traditional machine learning (see the sketch
after this list); (ii) a lightweight model
for smart devices; (iii) local (pose), temporal (optical flow), and appearance
(RGB) features with capsule networks as the classifier; (iv) an incremental action recognition
model using correlations and attention of subjects and scenes; (v) a refined-highway model
applied globally, locally, and contextually to action recognition with and without localization.
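To make direction (i) concrete, the following minimal Python sketch, which is our own
illustrative example rather than code from this thesis, converts a short sequence of body
keypoints into a fixed-length handcrafted feature vector and trains a traditional SVM
classifier on it; the feature design, joint layout, and action labels are assumptions made
for illustration only.

# Minimal sketch of direction (i): handcrafted features from a keypoint
# sequence, classified by a traditional SVM. Illustrative assumptions only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pose_sequence_to_feature(keypoints):
    """keypoints: array of shape (T, J, 2) - T frames, J joints, (x, y) each."""
    kp = np.asarray(keypoints, dtype=np.float32)
    velocity = np.diff(kp, axis=0)                 # per-joint motion between frames
    return np.concatenate([
        kp.mean(axis=0).ravel(),                   # average posture
        kp.std(axis=0).ravel(),                    # posture variability
        velocity.mean(axis=0).ravel(),             # average motion
    ])

def train_action_svm(keypoint_sequences, labels):
    """keypoint_sequences: list of (T, J, 2) arrays; labels: e.g. 'walk', 'sit', 'fall'."""
    features = np.stack([pose_sequence_to_feature(s) for s in keypoint_sequences])
    classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    classifier.fit(features, labels)
    return classifier

In this sketch, the same pose_sequence_to_feature function would be reused at test time
before calling classifier.predict on a new keypoint sequence.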
Homecare action recognition for single or multiple subjects, using
handcrafted features and traditional machine learning: The homecare of elderly people
has become an important issue that requires recognizing the activities of
multiple subjects at home. We start by understanding humans through focusing directly on
them and, especially, their body poses. Features derived from body pose, typically obtained
using a background subtraction and estimation method, provide a useful signal of a person's
external action state. As handcrafted features, just a few body keypoints extracted over
time can be enough to recognize human actions. Towards that goal, we build systems
to detect and track these features efficiently and accurately in videos. We then use those
features to directly classify and recognize the actions of single and multiple subjects. We
show that the new model using a panoramic camera can enable several novel
applications in homecare systems.
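As a minimal illustration of this detect-and-track step (assuming OpenCV is installed; the
video path is a placeholder, and this is not the implementation evaluated later in the
thesis), the Python sketch below isolates the moving subject with a background subtractor
and smooths the trajectory of its blob centroid with a constant-velocity Kalman filter.

# Minimal sketch: background subtraction to find the moving subject, then
# Kalman-filter tracking of the largest blob centroid. Placeholder video path.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Constant-velocity Kalman filter over the centroid state (x, y, dx, dy).
kalman = cv2.KalmanFilter(4, 2)
kalman.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kalman.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

capture = cv2.VideoCapture("input_video.mp4")
while True:
    ok, frame = capture.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    kalman.predict()                               # predict the centroid for this frame
    if contours:
        blob = max(contours, key=cv2.contourArea)  # largest moving blob = subject
        x, y, w, h = cv2.boundingRect(blob)
        measurement = np.array([[x + w / 2.0], [y + h / 2.0]], dtype=np.float32)
        kalman.correct(measurement)                # update the track with the observation
    # The tracked blob (centroid trajectory, size, aspect ratio, ...) would then feed
    # the handcrafted features used by a GMM or SVM classifier.
capture.release()

The corrected state in kalman.statePost holds the smoothed centroid and velocity that
downstream feature extraction could read at each frame.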
Computation-affordable recognition system with a lightweight model:
Activity recognition systems are widely used and growing rapidly in areas including
healthcare and security, as well as others such as entertainment. Currently, edge devices
are very resource limited and do not allow for heavy computation. Previous work has
successfully developed various DNN models for action recognition, but these models are
computationally intensive and therefore inefficient to deploy on edge devices. One of the
most promising solutions today is cloud storage, to which data can be transferred by
several methods for further analysis. However, continuously sending signals to the cloud
demands ever-increasing bandwidth. Routine