Machine-Learning-Assisted Intelligent Imaging Flow Cytometry: A Review

Shaobo Luo, Yuzhi Shi,* Lip Ket Chin,* Paul Edward Hutchinson, Yi Zhang, Giovanni Chierchia, Hugues Talbot, Xudong Jiang, Tarik Bourouina,* and Ai-Qun Liu*

Imaging flow cytometry has been widely adopted in numerous applications such as optical sensing, environmental monitoring, clinical diagnostics, and precision agriculture. With the assistance of machine learning, the system shows unprecedented advantages in automated image analysis, enabling high-throughput measurement, identification, and sorting of biological entities. Recently, with the burgeoning development of machine learning algorithms, deep learning has taken over most data analysis and promises tremendous performance in intelligent imaging flow cytometry. Herein, an overview is provided of the basic knowledge of intelligent imaging flow cytometry, the evolution of machine learning and its typical applications, and how machine learning can be applied to assist intelligent imaging flow cytometry. Perspectives on emerging machine learning algorithms for implementing future intelligent imaging flow cytometry are also discussed.

Author affiliations: Dr. S. Luo, Prof. G. Chierchia, Prof. T. Bourouina: ESYCOM, CNRS UMR 9007, Universite Gustave Eiffel, Noisy-le-Grand, Paris 93162, France (E-mail: tarik.bourouina@esiee.fr). Dr. S. Luo: Shanghai Gene Sense Biotech Co., Ltd, 111 Xiangke Road, Zhangjiang High-Technology Park, Pudong New District, Shanghai 201210, China; School of Microelectronics, Southern University of Science and Technology, 1088 Xueyuan Avenue, Nanshan District, Shenzhen, Guangdong 518055, China. Prof. Y. Shi: National Key Laboratory of Science and Technology on Micro/Nano Fabrication, Department of Micro/Nano Electronics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China (E-mail: yuzhi.shi@sjtu.edu.cn). Prof. Y. Shi, Dr. L. K. Chin, Dr. X. Jiang, Dr. A.-Q. Liu: School of Electrical & Electronic Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798, Singapore (E-mail: LKCHIN@mgh.harvard.edu; eaqliu@ntu.edu.sg). Dr. L. K. Chin: Center for Systems Biology, Harvard University, Massachusetts General Hospital, Boston, MA 02114, USA. Dr. P. E. Hutchinson: Life Sciences Institute, National University of Singapore, #05-02, 28 Medical Drive, Singapore 117456, Singapore. Prof. Y. Zhang: School of Mechanical & Aerospace Engineering, Nanyang Technological University, 50 Nanyang Ave, Block N3, Singapore 639798, Singapore. Prof. H. Talbot: Centre de Vision Numerique, Universite Paris-Saclay, CentraleSupelec, Saint-Aubin 91190, France.

DOI: 10.1002/aisy.202100073

1. Introduction

Imaging flow cytometry is an analytical tool extensively used to detect, sort, and count phytoplankton, cells, and other microparticles.[1–5] By combining high-throughput flow cytometry with various image acquisition technologies such as multispectral imaging,[6] imaging flow cytometry is capable of capturing thousands, even millions, of images with multiparametric morphology information, allowing automated high-throughput data collection. However, human experts are often required to perform image analysis in traditional imaging flow cytometry. Intelligent imaging flow cytometry (IIFC), as shown in Figure 1, which combines imaging flow cytometry and artificial intelligence, has been demonstrated for imaging-based high-throughput biosensing.[7–18] Artificial intelligence (particularly

deep learning) plays a critical role in IIFC by providing new approaches for image enhancement, reconstruction, and correction, and, more importantly, automated object recognition and identification of cells and other targets of interest. Advances in artificial intelligence drive the development of IIFC. Several instances of IIFC using deep-learning models, such as VGGNet, GoogleNet, ZooplanktoNet, DenseNets, and deep active learning, have been demonstrated.[19–22] A typical IIFC system is shown in Figure 1, which combines flow cytometry, image acquisition technologies (lasers/image sensors), and artificial intelligence. The system supports multiparametric analysis and high-throughput detection of single-cell properties at rates from hundreds to millions of cells per second. IIFC is widely used in clinical diagnostics,[23] environmental monitoring,[24] and other potential biosensing applications.[3,25–28]

In this Review, we focus on the recent developments in IIFC from the perspective of imaging technologies, the evolution of machine learning for computer vision, and machine learning techniques that have been developed specifically for IIFC. Emergent imaging technologies such as multispectral imaging,[6] multi-field-of-view imaging,[29] and serial time-encoded amplified microscopy (STEAM)[30,31] are discussed to reveal more distinctive features of images. To understand cytometry imaging, we introduce the fundamentals of visual understanding and the evolution of deep learning, which deepens the understanding of machine learning for visual perception. Next, we review interesting applications of machine learning in this field. Finally, we summarize the Review and give perspectives on the future development of machine-learning-assisted IIFC.

2. Imaging Technologies for Flow Cytometry

Technologies to obtain images with both high temporal and high spatial resolution are critical but challenging.[6] The fundamental trade-off in imaging technologies is among sensitivity, acquisition speed, and the amount of acquired information. There are two typical types of sensor used for imaging: 1) multipixelated imaging devices (camera-based), such as the charge-coupled device (CCD) and complementary metal–oxide–semiconductor (CMOS) sensor,[32] and 2) single-pixel photodetectors, e.g., the photomultiplier tube (PMT) and avalanche photodiode (APD).[33]

Camera-based imaging flow cytometry has a dense 2D array of CCD or CMOS sensors, such as the commercial systems ImageStream (Figure 2a) and FlowSight, both developed by Millipore.[34] They support multispectral image acquisition of up to 12 images per cell and three different imaging modes (bright-field, scattering, and fluorescence) based on the time delay and integration (TDI) technique.[35–37] A TDI sensor includes multiple rows of CCD or CMOS pixels. During imaging, the imaged objects move along the column direction and the imaging data are shifted row by row. The system can read out a weak imaging signal without motion blur even with increasing exposure time. Unfortunately, data transfer between rows without gain (e.g., electron multiplication) also restricts the system to a limit of about 3000 cells per second. To increase the throughput, multi-field-of-view imaging flow cytometry[29] was developed, as shown in Figure 2b.
This method projects multiple fields of view onto a single 2D camera, for example by microfabricating several microfluidic channels with N × M microlens arrays to capture multiple images simultaneously. Motion blur is a major problem in this kind of imaging cytometry when the targets move too fast to be resolved by the imaging sensor under a fixed exposure time. Temporally coded excitation[38] is a technique used to avoid motion blur, which uses a pseudorandom-code-modulated excitation pulse to illuminate the object.

PMT sensors[33] provide superb sensitivity for photon signals with a high dynamic range, high bandwidth, and low dark noise, which makes them excellent candidates for implementing high-throughput imaging flow cytometry. Normally, a laser scanner is used to generate images from the time-domain signals collected from PMTs, as in STEAM,[30,31] shown in Figure 2c. STEAM uses near-infrared laser light with a wide spectral bandwidth as the illumination. The broadband laser pulses are encoded into a 2D pattern with two diffraction gratings for scanning and illuminating the cell. Eventually, the rainbow signal is collected by an APD detector. STEAM can achieve a throughput of 100 000 cells per second. Other examples using PMTs include fluorescence imaging by radiofrequency-tagged emission[39,40] for high-speed fluorescence imaging, spatial-temporal transformation cytometry, etc.

Figure 1. Overview of intelligent imaging flow cytometry.

Figure 2. Optical systems of typical imaging flow cytometry. a) Optical configuration of ImageStream imaging flow cytometry. b) Multiple field-of-view imaging flow cytometer. c) Schematic illustration of the STEAM flow analyzer. Reproduced with permission.[6] Copyright 2016, Royal Society of Chemistry.

The emerging commercial imaging flow cytometers empower high-speed cell sorting and microscopic imaging. For example, the ImageStreamX Mk II uses high-resolution and high-sensitivity objective lenses to capture bright-field, dark-field, and fluorescence images.[37] The system contributes significantly to the advancement of a wide range of quantitative, statistically robust cellular analyses, cellular classification, cell-to-cell interactions, microalgae morphology, population dynamics, etc. FlowCam is another imaging flow cytometer that was originally developed by Fluid Imaging Technologies (Yarmouth, ME, USA) to study oceanic plankton.[41] It uses a camera and flash illumination to snap images of the moving particles in real time. Image-processing software with machine learning algorithms is run to generate a single grayscale or color image of each cell. The software supports the extraction of different features such as area, area-based diameter, length, width, equivalent spherical diameter, and other properties.[42] The Submersible Imaging FlowCytobot is another type of imaging flow cytometer that can be submerged in water to a depth of up to 40 m for 6 months.[43] It can also transmit acquired data to the cloud in real time. The Submersible Imaging FlowCytobot works similarly to a standard flow cytometer: it uses hydrodynamic focusing to focus the sample stream and a laser (e.g., a 635 nm red diode laser for chlorophyll) to excite the particles for light scattering and fluorescence imaging, which allows cells smaller than 150 μm to be analyzed.

3. Machine Learning

3.1. Machine Vision and Image Analysis

Imaging flow cytometry technologies enable capturing and analyzing images of cells with high quality and high throughput. In addition to the challenges in image acquisition, storage, and processing, image analysis also requires significant effort in the development of imaging flow cytometry, which promotes advances in machine vision. The working principle of a machine vision system[44] is elaborated here. First, an object is converted into an image signal through a machine vision device such as a camera. Then, the image signal is sent to a dedicated image-processing system to obtain the morphological information of the captured object. According to the pixel brightness, color, and spatial distribution, the imaging system performs various algorithms on those signals to extract the characteristics of the target object. Next, a control operation of the equipment is generated according to the result of the discrimination algorithms. The goal of computer vision is to fully understand the image formed by electromagnetic waves reflected from the object surface, mainly in the visible and infrared parts of the spectrum.

3.2. Traditional Machine Learning

Since the 1960s,[45] a theoretical framework for object recognition has been conceptualized, and several general vision theoretical frameworks, visual integration theoretical frameworks, and many other new research methods and theories have emerged. Consequently, the processing of general 2D information and research on models and algorithms for 3D images have greatly improved, and machine vision has developed vigorously with emerging new concepts and theories.
Before the invention of deep learning, the image analysis methods could be divided into the following five categories: image perception, image preprocessing, feature extraction, inference prediction, and recognition.[46] In the early-stage development of machine learning, among the


dominant statistical machine learning groups, little attention was

paid to features. The design principle in the early stage of machine learning development was to combine pixel values of the image in a statistical or nonstatistical form to express the part of the object, or the whole object, that one wants to identify or detect. In 2001, a face-detection approach capable of working in real time using Haar-like features to locate a face was launched.[47] The Viola–Jones facial detector proposed in this work is a powerful binary classifier consisting of several simple classifiers and is still widely used today. However, at its inception, the Viola–Jones detector was considered relatively time-consuming in the learning phase because adaptive boosting (AdaBoost) is used to train the cascade of simple classifiers for finding the object of interest (e.g., a face). The model needs to split the input image into multiple rectangular blocks and then submit them to the cascaded weak detectors. If a patch passes through all stages of the cascaded weak detectors, it is classified as a positive example; otherwise, the algorithm rejects the patch immediately. This whole process is repeated multiple times over a hierarchy of image scales.

In 2009, another important feature-based milestone, the deformable part model (DPM) shown in Figure 3, was developed.[48] The DPM decomposes the object into partial subobjects, following an idea on image models introduced in the 1970s, enforces a set of geometric constraints among them, and treats the potential object center as a latent variable. The DPM excels at object detection tasks (using bounding boxes to localize objects) and outperformed the template-matching-based detection methods that were popular at that time, in which the histogram of oriented gradients (HoG)[49] feature, as shown in Figure 4, was used to generate a corresponding "filter" for each object class.

Figure 3. Detections obtained with a single-component person model. The model is defined by a coarse template, several higher-resolution part templates, and a spatial model for the location of each part. Reproduced with permission.[48] Copyright 2008, IEEE.

The HoG filter records the edge and contour information of the object and is applied as a filter at various positions in different pictures. When the output response exceeds a certain threshold, the filter and the object in the picture are treated as highly matched, thus completing the detection of the object. HoG is a good feature descriptor that has been successfully deployed in human face detection problems.[50] HoG has an advantage in capturing the dense gradient information of images, similar to the scale-invariant feature transform (SIFT),[51] but it demands fewer computational resources. HoG is also resistant to lighting conditions; e.g., the gradient and histogram operations reduce the influence of shadows and other illumination variations, as well as small rotations and translations of the particle objects. As shown in Figure 4, HoG is calculated on small blocks in a window of 8 × 8 pixels. In that window, the gradient direction θ and magnitude G are calculated by

$$G_x = H_x * I(x, y), \qquad G_y = H_y * I(x, y) \tag{1a}$$

$$G = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\frac{G_y}{G_x} \tag{1b}$$

where I(x, y) is the input image, $H_x$ is the row vector [-1, 0, 1], and $H_y$ is the column vector $[-1, 0, 1]^T$. Finally, a gradient histogram of each 8 × 8 block is generated and put into nine bins. Each bin corresponds to a gradient direction of 0°, 20°, 40°, 60°, 80°, 100°, 120°, 140°, or 160°. As the gradient and magnitude of the image are sensitive to the lighting, normalization of the histogram is desirable: the normalized gradient histogram yields more robust feature sets because it suppresses the effect of varying lighting conditions. A minimal sketch of the per-cell HoG computation is shown below.
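To make the HoG computation above concrete, the following minimal Python sketch (assuming NumPy) computes the nine-bin gradient-orientation histogram of a single 8 × 8 cell following Eq. (1). It is illustrative only: it uses simple central differences for the [-1, 0, 1] kernels and omits the bin interpolation and block-level normalization of the full descriptor, and the function and variable names are hypothetical.

import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    # Gradient-orientation histogram of one 8 x 8 cell, following Eq. (1)
    cell = cell.astype(float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]   # Eq. (1a): horizontal gradient, kernel [-1, 0, 1]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]   # Eq. (1a): vertical gradient, kernel [-1, 0, 1]^T
    mag = np.sqrt(gx ** 2 + gy ** 2)           # Eq. (1b): gradient magnitude G
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation folded into [0, 180)
    # Accumulate magnitudes into nine bins covering 0, 20, ..., 160 degrees
    bin_idx = (theta // (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bin_idx.ravel(), mag.ravel())
    # Normalize so that the histogram is less sensitive to illumination changes
    return hist / (np.linalg.norm(hist) + 1e-6)

# Usage on a hypothetical 8 x 8 grayscale patch
patch = np.random.default_rng(0).integers(0, 256, size=(8, 8))
print(hog_cell_histogram(patch))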
The local binary pattern (LBP) is a popular texture feature extraction method with excellent performance in face detection.[52,53] LBP excels at differentiating bright pixels from a dark background and is used to describe edges, lines, spots, etc. The procedure of LBP feature extraction is shown in Figure 5a. First, the original input image is divided into individual small cells of 8 × 8 pixels. Then, the LBP feature of each cell is calculated by comparing the intensity of the eight neighboring pixels with that of the center pixel and generating an 8 bit binary number in which 0 or 1 indicates that the intensity of the neighboring pixel is lower or higher than that of the center pixel, respectively, as shown in Figure 5b. A per-pixel sketch of this encoding is given below.
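As a companion to Figure 5, the short Python sketch below (NumPy assumed) encodes each interior pixel of a grayscale image into an 8 bit LBP code and collects the codes of one cell into a normalized histogram. The helper names are hypothetical, and the common convention of setting a bit when the neighbor is at least as bright as the center pixel is used.

import numpy as np

def lbp_codes(img):
    # 8-neighbor LBP code for every interior pixel of a grayscale image
    img = img.astype(int)
    center = img[1:-1, 1:-1]
    # Eight neighbors ordered clockwise, one bit each
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    h, w = img.shape
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(int) << bit   # set the bit if neighbor >= center
    return codes

def lbp_histogram(cell):
    # 256-bin normalized histogram of the LBP codes of one cell (e.g., an 8 x 8 block)
    hist, _ = np.histogram(lbp_codes(cell), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)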


Figure 4. Working principle of histograms of oriented gradients.

Figure 5. Local binary patterns. a) The procedure for local binary pattern histograms. b) The steps for calculating local binary patterns.

Demonstrated examples of differentiating between Jurkat cells and white blood cells using traditional machine learning with imaging flow cytometry include an imaging flow cytometry data analysis that uses features generated from CellProfiler together with a gradient boosting (GB)[54] classifier or a random forest (RF)[55] classifier to recognize the Jurkat cells.[56] Another example identifies label-free white blood cells using features generated from CellProfiler and compares five common classifiers: K-nearest neighbors (KNN), AdaBoost, GB, RF, and the support vector machine (SVM).[57]

The SVM classifier[58] was one of the most popular discriminative classifiers before the era of deep learning. It translates the vector of training data into a high-dimensional space and performs the discrimination there. By doing this, the optimal hyperplane that splits the dataset into different classes can be generated via a training process. The SVM can be expressed as the following optimization problem

$$\min_{W \in H,\; b \in \mathbb{R},\; \xi_i \in \mathbb{R}} \;\; \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( W^T \varphi(x_i) + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n \tag{2}$$

where the two-class (binary) problem is defined by y ∈ {1, -1}, W is the weight vector, ξ_i are the slack (margin) variables, b is the bias, and C ∈ ℝ+ is the regularization constant. The function φ optionally projects the training vectors into a high-dimensional feature space H through the so-called kernel trick, where the SVM can generate the decision boundary easily. A good choice for φ is the radial basis function kernel, with K(x_i, x_j) = φ(x_i)^T φ(x_j) and K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0.

A distance-based classifier such as the Mahalanobis distance classifier is an extension of the least-squares multiclass maximum likelihood classifier that takes cross-correlations into account.[59,60] The Mahalanobis distance classifier measures the distance d, in numbers of standard deviations, from a sample x to the mean u_i of a class, where Σ_i^{-1} is the inverse covariance matrix of class i and T is the standard transpose operation. The classification result is predicted by measuring the distance from x to every class i and assigning x to the class with the minimal distance. The Mahalanobis distance reduces to the Euclidean distance when the covariance matrix is the identity matrix. The Mahalanobis distance is expressed as[60]

$$d(x, u_i) = (x - u_i)^T \Sigma_i^{-1} (x - u_i) \tag{3}$$
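As an illustration of the classical pipeline described above (morphological features followed by a discriminative classifier), the sketch below trains an RBF-kernel SVM (Eq. (2)) with scikit-learn and also evaluates the squared Mahalanobis distance of Eq. (3). The synthetic feature table and the two-class labels are placeholders standing in for features such as those exported by CellProfiler, not real data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature table: one row per cell image, columns such as area, diameter, ...
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=400) > 0).astype(int)  # 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# RBF-kernel SVM of Eq. (2): C is the slack penalty, gamma the kernel width
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))

def mahalanobis_d2(x, class_mean, inv_cov):
    # Squared Mahalanobis distance of Eq. (3) from sample x to one class
    diff = x - class_mean
    return float(diff @ inv_cov @ diff)

# Distance-based prediction: assign x to the class with the smallest distance
means = [X_train[y_train == c].mean(axis=0) for c in (0, 1)]
inv_covs = [np.linalg.inv(np.cov(X_train[y_train == c], rowvar=False)) for c in (0, 1)]
pred = np.array([np.argmin([mahalanobis_d2(x, m, S) for m, S in zip(means, inv_covs)])
                 for x in X_test])
print("Mahalanobis test accuracy:", (pred == y_test).mean())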


Machine vision is used to determine whether a set of image data contains a specific object, image feature, or motion state. This problem can sometimes be solved automatically by an algorithm, but so far there is no single method that can be widely applied and perform well in varied situations, i.e., that can identify any object in an unpredictable environment. The prior art can only

perform well on the recognition of specific targets, such as simple geometric figure recognition,[61] face recognition,[62] printed or handwritten document recognition,[63] and vehicle recognition.[64] Unfortunately, such recognition often requires specific lighting, a particular background, and designated target postures. Designing features by hand requires a lot of experience and a profound understanding of the field, and the algorithm may also require a lot of debugging. Moreover, machine vision engineers not only need to manually design features, but also need to design a classifier algorithm suited to the problem. Designing features and choosing a classifier at the same time to achieve the best results is a difficult task, requiring well-trained experts.

3.3. Deep Learning

Machine vision systems are developed such that users do not need to manually design features and choose classifiers. It is desirable for machine vision systems to learn features and classifiers simultaneously, which means that when a user designs a certain model, the input is just a picture and the output is its label. With the rapid development of deep learning, the emergence of convolutional neural networks (CNNs) has made this idea possible, and research on computer vision based on deep learning has also developed rapidly.

LeCun proposed the first CNN, LeNet,[65] in 1998, as shown in Figure 6. The input is a 32 × 32 grayscale image. The first layer applies a set of convolutions and generates six 28 × 28 feature maps (C1), which pass through a pooling layer to give six 14 × 14 feature maps (S2) and then a convolution layer to generate sixteen 10 × 10 feature maps (C3). Next, they pass through another pooling layer to generate sixteen 5 × 5 feature maps (S4). The network was used to classify handwritten digits 0–9 with two fully connected layers as the final layers.

Figure 6. The architecture of the LeNet-5 neural network. A CNN for handwritten digit recognition. Reproduced with permission.[65] Copyright 1998, IEEE.

In 2012, a deeper and wider neural network, AlexNet, was published, which achieved a breakthrough with roughly 10% higher accuracy than traditional methods in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).[66] Nowadays, deep learning has been applied to a variety of areas and huge progress has been made in those fields, including visual recognition,[67] speech recognition,[68] biomedicine,[69] and natural language processing.[70]

Deep-learning methods are well suited to constructing architectures that can be trained end to end from image data to achieve cell classification. This approach reduces the manual labor of the traditional approach, as shown in Figure 7. It can automatically build multiple levels of representation of the data with increasing abstraction. For example, the first layer learns edge or color information, the second layer learns motif information, and the third layer may learn eye and nose information; finally, the deep-learning method can learn the weights of the classifier to detect the human face. The important layers in a deep neural network are the convolutional layer, the activation layer, and the pooling layer (Figure 8), e.g., the CONV layer (convolutional layer (convolution) + ReLU layer (activation)), and the fully connected (FC) layer. The convolution function is used to extract features from the input. The basic operation of convolution is shown in Figure 9a. On the left side of the figure, the input has a dimension of 32 × 32 × 3. It is convolved with a kernel H of size 3 × 3 × 3.
Finally, a feature map with dimensions of 30 × 30 × 1 is generated, calculated by sliding the kernel from the top-left corner to the bottom-right of the input line by line; one layer of output is generated by element-by-element multiplication and accumulation with the kernel. For example, ten kernels will generate ten layers of output. The ReLU layer, as shown in Figure 9b, is a rectified linear unit activation function. It implements a nonlinear "trigger" function with the formula y = max(x, 0), where the input has the same size as the output layer. The ReLU layer outputs zero when the input is negative. Compared with other nonlinear functions such as the sigmoid, hyperbolic tangent, and absolute value of the hyperbolic tangent, networks with ReLU learn severalfold faster. The max-pooling layer, as shown in Figure 9c, is used to reduce the resolution of the features. It makes the features more robust to noise and distortion. For instance, the pooling layer downsamples an input of dimension 224 × 224 × 64 into an output of dimension 112 × 112 × 64 with a filter size of 2 × 2 and a stride of two. One or several fully connected layers are normally added as the last layers of a CNN and act as a classifier for the final decision. The fully connected layer takes a vector of m inputs (X) as the input volume and generates n outputs (Y) with a function expressed as

$$Y_n = W X_m + b \tag{4}$$

where X_m is the m-dimensional input, which is multiplied by the weight matrix W and added to a bias offset b.
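The convolution, ReLU, max-pooling, and fully connected layers described above can be assembled into a LeNet-style network in a few lines. The PyTorch sketch below is a hypothetical illustration whose layer sizes follow the 32 × 32 example of Figure 6; it is not the exact configuration of any published model.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # LeNet-style stack: convolution -> ReLU -> max-pooling, twice, then fully connected layers
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 1 x 32 x 32 -> 6 x 28 x 28 (C1)
            nn.ReLU(),                         # y = max(x, 0)
            nn.MaxPool2d(2),                   # 6 x 28 x 28 -> 6 x 14 x 14 (S2)
            nn.Conv2d(6, 16, kernel_size=5),   # 6 x 14 x 14 -> 16 x 10 x 10 (C3)
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 16 x 10 x 10 -> 16 x 5 x 5 (S4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # fully connected layer, Eq. (4)
            nn.ReLU(),
            nn.Linear(120, n_classes),         # class scores, fed to a softmax during training
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Forward pass on one hypothetical 32 x 32 grayscale cell image
logits = SmallCNN()(torch.randn(1, 1, 32, 32))
print(logits.shape)   # torch.Size([1, 10])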


Figure 7. Comparison between a) traditional machine learning and b) deep learning for classification.

Figure 8. A convolutional neural network.

Figure 9. Layers of a CNN. a) Convolutional operation. b) Rectified linear unit (ReLU). c) Max-pooling operation.



The learning and optimization process is used to generate the optimal values of the trainable parameters, such as the kernel weights in convolutional layers and the weights in dense layers. The parameters are optimized by the backpropagation algorithm, which uses a gradient descent (GD)[71] method to optimize the model iteratively by minimizing a loss function (e.g., the cross-entropy loss). The three frequently used GD variants are batch gradient descent, stochastic gradient descent, and minibatch gradient descent. Softmax regression[72] of the classification layer outputs is used to train the network, and can be written as

$$y_j = \frac{\exp\{x_j^k\}}{\sum_{i=1}^{n} \exp\{x_i^k\}}, \qquad j = 1, 2, \ldots, n \tag{5}$$

where x_i^k are the inputs and y_j is the output probability. During training, the loss is calculated from the model input by forward propagation, and the loss difference is propagated backward from the output to the input layer to generate the gradient of each layer. The parameters of every layer are updated with that gradient, and the parameters of the model converge after this iterative process.

3.4. Recent Advances in CNNs

A CNN is a powerful neural network that is widely used for image classification and segmentation. The CNN is inspired by the natural visual perception mechanism of the human visual system. An early attempt was the neocognitron system proposed in 1980.[73] By improving the structure of the neocognitron, LeCun proposed LeNet-5 to recognize handwritten digits, which established the modern framework of the CNN.[65] LeNet-5 gave the basic idea of a CNN that uses a three-tier architecture: convolution, downsampling, and nonlinear activation functions. A CNN extracts spatial features of the image using convolution and reduces the resolution by averaging in the downsampling layers. The activation function is a hyperbolic tangent or sigmoid function. A multilayer neural network as the final classifier uses sparse connection matrices between layers to avoid large computational costs. LeNet-5 can be trained using the backpropagation algorithm and derives an effective representation of the original image, which allows the CNN to recognize the object directly from the original pixels with minimal preprocessing. However, due to the lack of large-scale training data and limited computing power, LeNet-5 could not work well on complex problems.

From 1998 to 2010, the development of neural networks was intense in the machine learning community, but it was not highly visible to the computer vision community. Rich datasets, advances in deep learning theory such as improved neural architectures and optimization methods (stochastic gradient descent, Nesterov accelerated descent,[71] etc.), and hardware improvements (e.g., GPUs, low-power CPUs, and fast, low-latency storage such as solid-state drives (SSDs)) have brought cost-effective hardware to the world, making deep neural network computation affordable and opening the door for deep learning. In 2010, a GPU implementation of a neural network was published.[74] In 2012, AlexNet was published,[66] which is considerably deeper than LeNet and won first place in the 2012 ImageNet Challenge,[66] as shown in Figure 10.

Figure 10. ImageNet challenge.

AlexNet not only has a deeper neural network, but also learns more complex features from the rich image dataset than LeNet.
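Tying Eq. (4) and (5) to the training procedure described above, the short NumPy sketch below performs one minibatch gradient-descent step for a single fully connected classification layer with a softmax output and cross-entropy loss. The layer size, learning rate, and data are illustrative assumptions rather than settings from any cited work.

import numpy as np

def softmax(logits):
    # Eq. (5), with the maximum subtracted for numerical stability
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 16))   # 4 classes, 16 input features (Eq. (4))
b = np.zeros(4)

def train_step(X, labels, lr=0.1):
    # One minibatch gradient-descent update with cross-entropy loss
    global W, b
    logits = X @ W.T + b                              # forward propagation
    probs = softmax(logits)
    n = X.shape[0]
    loss = -np.log(probs[np.arange(n), labels]).mean()
    grad_logits = probs.copy()                        # backward propagation:
    grad_logits[np.arange(n), labels] -= 1.0          # d(loss)/d(logits) = (probs - one_hot) / n
    grad_logits /= n
    W -= lr * grad_logits.T @ X                       # gradient-descent parameter updates
    b -= lr * grad_logits.sum(axis=0)
    return loss

# One step on a hypothetical minibatch of 32 feature vectors
print(train_step(rng.normal(size=(32, 16)), rng.integers(0, 4, size=32)))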
AlexNet introduced the ReLU function instead of tanh as its activation function, which is convex and has no vanishing gradient for positive inputs, considerably reducing the computation time of the learning phase. Furthermore, AlexNet used the dropout technique to drop certain neurons during training to avoid overfitting. It also introduced max-pooling technology and significantly reduced training time with a GPU. After the success of AlexNet, researchers proposed other architectures, such as VGG,[75] GoogleNet,[76] the residual network (ResNet),[77] MobileNetV2,[78] SENet,[79] and BiT-L (another version of ResNet).[80]

Regarding the structure, one of the development directions of CNNs has been to increase the number of layers. As the ILSVRC 2015 champion, ResNet has 20 times more layers than AlexNet and 8 times more layers than VGGNet. By increasing the depth, the network can use the increased nonlinearity to derive an approximation of the objective function while yielding better performance than previous networks. However, this also increases the overall complexity of the network (more layers) and makes it difficult to optimize and prone to overfitting. In addition, the optimization problem becomes harder as the network becomes deeper, with a larger parameter space; therefore, simply increasing the depth of a network can result in higher training error. For example, the accuracy of a plain 56-layer network is not as good as that of a 20-layer network. In view of this depth effect, ResNet was designed with a residual module that allows deeper networks to be trained.[77] The core idea of ResNet is to add a direct connection channel (X) to the network, known as an identity shortcut connection.[81] In the traditional deep learning structure, a nonlinear transformation is performed on the input, whereas ResNet allows the original input information to be passed directly to the subsequent layers, as shown in Figure 11.

Figure 11. Residual learning building block. The core idea of ResNet is to add a direct connection channel to the network, known as an identity shortcut connection. The network structure of traditional deep learning is a nonlinear transformation performed on the input, while ResNet allows the original input information to be passed directly to the subsequent layers.

Traditional convolutional networks or fully connected networks lose information during transmission; they can also cause gradients to vanish or explode and make deep networks unable to train. ResNet solves this problem to a certain extent, as it protects the integrity of the information by directly bypassing the input information to the output. The entire network only needs to learn the difference between input and output, simplifying the learning objective and difficulty. A comparison of VGGNet (e.g., VGG-19) and ResNet is shown in Figure 12. The biggest difference between VGGNet and ResNet is the use of bypass connections to connect the input directly to subsequent layers, also called shortcut or skip connections.
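A minimal PyTorch sketch of the residual building block in Figure 11 is given below, written in the usual two-convolution form; the channel count and the use of batch normalization are assumptions made for illustration, not the exact published configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic residual block: the input is added back to the transformed branch
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                         # identity shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)     # the block only has to learn the residual

# One block applied to a hypothetical 64-channel feature map
y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56])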



Various methods have been proposed to improve network performance in various aspects. Recent improvements of CNNs concern the convolutional layer, pooling layer, activation function, loss function, regularization, optimization, and fast computing techniques. One example is the inverted residual block (IRB), first introduced by the MobileNetV2[78] architecture, which includes a 1 × 1 expansion convolutional layer, a depthwise convolution layer, and a 1 × 1 projection. The depthwise convolution layer and projection layer are referred to as the depthwise separable convolution adopted by Xception.[82] The depthwise separable convolution[83] splits the traditional convolution operation into two separate steps: a depthwise convolution and a pointwise convolution. The depthwise convolution uses a separable filter with one filter per input channel to produce the output channel, as shown in Figure 13a. The depthwise convolution is represented as

$$\hat{X}^k_{x,y,z} = \delta\!\left( \sum_{i,j} \hat{F}^{\,i,j,z}_k \cdot X^{k-1}_{x+i-1,\,y+j-1,\,z} + b_k \right) \tag{6}$$

where δ is an activation function and b_k is a bias; $\hat{F}_k$ is the depthwise filter, in which the zth channel of $\hat{F}_k$ only operates on the zth channel of $X^{k-1}$ and produces the feature $\hat{X}^k$ in the zth channel. A pointwise convolution uses a 1 × 1 filter to produce the final activation map, as shown in Figure 13b. Compared with the traditional convolution, the computational saving of the depthwise separable convolution is $\frac{1}{N} + \frac{1}{D_k^2}$, where N is the number of output channels and D_k is the kernel size. Furthermore, the IRB also increases memory efficiency with its unique architecture. In addition, the skip connection structure is introduced into the IRB, which allows the network to access features from earlier stages and leads to a deeper neural network with high efficiency.

Metric learning is used to learn a distance function that measures similarity, whereby similar targets are associated with a small distance and dissimilar ones with a large distance.[84] Deep metric learning (DML) currently mainly uses a deep-learning-based backbone network to extract embeddings, and then uses the L2 distance to measure the distance in the embedding space. In general, DML consists of three parts: a feature extraction network to map embeddings, a sampling strategy to combine the samples in a minibatch into many subsets, and finally a loss function that calculates the loss on each subset, as shown in Figure 14. For example, in deep metric learning with



