1 Introduction

Emergency response decision-making for urban fires necessitates careful consideration of numerous factors, especially with increasing urbanization in China and the growing complexity of buildings and construction [1]. Obtaining timely information about emergencies or fire management is crucial for enhancing the efficiency of emergency decision-making and rationalizing the allocation of rescue resources [2]. The high operation and maintenance costs, limited usability, information lag, and incomplete coverage of traditional fire information collection methods underscore the need for a more advanced approach [3]. In contrast, a real-time fire information monitoring system has proven invaluable for urban fire response and decision planning, incorporating real-time suggestions for rescue resource routing [4,5,6]. A real-time monitoring system with integrated deep learning for fire classification contributes to more effective and timely responses to urban fire incidents, overcoming the limitations of traditional methods [4, 5].

In recent years, rapid developments in computer vision and deep learning have led to the categorization of commonly used detection algorithms into two main types. The first category is the two-stage object detection network, exemplified by R-CNN (region-based convolutional neural network) [7], fast R-CNN [8], mask R-CNN [9], and similar algorithms [10]. This approach extracts deep features from the input image in the first stage and then generates and classifies candidate regions. While this method achieves low false-detection and missed-detection rates, its speed is relatively slow, making it unsuitable for real-time detection scenarios. The other category is the one-stage target detection network, represented by algorithms such as YOLO (you only look once) [11] and SSD (single shot multibox detector) [12]. Unlike the two-stage approach, the YOLO method directly generates target location coordinates and categories without generating candidate regions [11]. Although it has higher false-detection and missed-detection rates, it is faster and more suitable for real-time detection scenarios. An improved R-CNN network structure was proposed for fire detection and localization, achieving an impressive accuracy of 94.5% and a recall rate of 97% [13]. However, its limited frame rate of 25 frames per second (fps) renders it unsuitable for real-time detection scenes [13]. Conversely, a study utilized the SSD network structure with principal component analysis (PCA) for forest fire detection, achieving an accuracy of 88.2% and a faster identification speed of 46 fps [14]. A recent study on smart city development with an automatic fire detection system based on YOLOv6 achieved 93.48% accuracy [15]. In addition, smoke and fire detection methods were proposed using YOLOv5n with 89.3% accuracy [16] and YOLOv8 with 95.7% accuracy [4]. Although these models offer new features for smoke detection, real-time systems were not used in these recent studies. Therefore, training and testing of YOLO models in real-time applications, such as YOLOv3 with 0.24 s at 1 fps, are still needed [17].

In urban fire scenarios, decision-makers require timely and accurate information to make informed decisions [2]. However, the vast number of images and videos in the monitoring system, much of which is unrelated to fires, imposes strict demands on the accuracy and speed of image screening [5, 18]. First, the YOLO method achieves nearly double the speed of the R-CNN network, although its recognition accuracy might not be optimal [19]. Second, numerous studies have successfully applied YOLO to real-time detection scenarios across diverse fields, yielding positive outcomes in forest [18] and city [5, 20] fire detection. Third, the YOLO network performs target category determination and positioning in a single step, using only convolutional layers applied to the input image [19]. This single-step design results in an exceptionally rapid recognition speed, making it suitable for real-time detection scenarios. YOLOv5, the fifth version of the YOLO network model, surpasses its predecessors by enabling faster and more accurate detection through the porting of Darknet into PyTorch [21]; it was therefore adopted in this study.

In an environment without a GPU, the YOLOv5s algorithm achieved an average detection speed of 35 fps, displaying excellent detection results with a 1.1% increase in accuracy compared to YOLOv5. For instance, it achieved a model accuracy of 91.8% in construction-site helmet-wearing detection [22]. While R-CNN has demonstrated functionality in detecting objects, the YOLOv2 network achieved faster processing times and higher accuracy in fire and smoke detection than the fast R-CNN and R-CNN algorithms [19]. The accuracy of object detection is a primary concern in this study, and the YOLOv5 network achieves very good results, particularly in practical implementations, making it suitable for embedded deployment in real-time monitoring systems. The development of YOLOv5 with feature fusion and extraction capabilities for remote-sensing object detection resulted in a 0.6% improvement in overall performance compared to R-CNN [23]. Previous studies have consistently highlighted that the YOLOv5 network achieves outstanding performance and processing time in object detection, making its integration into real-time fire monitoring systems promising.

Of the official YOLOv5 networks, three variants, YOLOv5s, YOLOv5, and YOLOv5x, were used in this study. The YOLOv5 variants differ mainly in the width and depth of the networks [24]. The testing performances and complexities of the four YOLOv5 variants are shown in Fig. 1. YOLOv5 can be applied to small-object detection, but with low accuracy, and it is difficult to deploy on real-time platforms [25]. In YOLOv5, the neck architecture includes different modules that help capture the spatial and channel-wise relationships in the features extracted by the backbone [26]. Furthermore, YOLOv5x can process images quickly and can achieve a high level of precision on benchmark datasets [27]. YOLOv5x uses cross-stage partial (CSP) modules in the neck architecture to enhance the efficiency of deep neural networks (DNNs) [27]. As the depth and width of the network increase across the larger models, the average precision (AP) continues to improve, but the computational cost also continues to increase. Thus, the YOLOv5s model, which has the smallest width and depth, is the fastest of the variants, but its AP is relatively low. It is therefore suitable for real-time detection of large targets, and additional attention mechanism modules can enhance its AP.
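For reference, the four official variants can be loaded and compared directly from the Ultralytics repository. The following minimal sketch uses the public torch.hub entry points (the model names and pretrained flag follow the repository's documented interface) to contrast parameter counts, which drive the speed/AP trade-off shown in Fig. 1.

```python
import torch

# Load the four official YOLOv5 variants from the Ultralytics repository
# (https://github.com/ultralytics/yolov5) via torch.hub; each variant shares
# the same architecture and differs only in its depth and width multipliers.
variants = ["yolov5s", "yolov5m", "yolov5l", "yolov5x"]
models = {name: torch.hub.load("ultralytics/yolov5", name, pretrained=True)
          for name in variants}

# Rough comparison of model size: the parameter count grows with depth/width,
# which raises AP but lowers inference speed (cf. Fig. 1).
for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```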

Fig. 1
figure 1

Comparisons between YOLOv5m, YOLOv5l, YOLOv5x and YOLOv5s from https://github.com/ultralytics/yolov5

The YOLOv5 network model incorporates the spatial pyramid pooling fusion (SPPF) module [18] but not the squeeze-and-excitation network (SE) module. The SPPF module represents an optimized version of spatial pyramid pooling, allowing for the detection of multiscale representations in the input feature maps, particularly for objects of varying sizes [28]. In addition, the SE module facilitates dynamic channel-wise feature recalibration, enhancing overall performance [29]. Incorporating both the SE and SPPF modules could therefore improve real-time monitoring systems for fire detection and classification.

In this study, the YOLOv5s model with SE and SPPF modules and three types of YOLOv5 models (YOLOv5s, YOLOv5, and YOLOv5x) were compared and implemented for real-time fire detection. Annotated images from the internet were used to train and test the proposed model. In addition, the fire scenarios were classified into four different categories: car/vehicle, building, industrial, and normal. This categorization would facilitate distinct responses and planning for fires, as well as the allocation of manpower. Subsequently, the trained model was integrated into a real-time fire monitoring system to evaluate the real-time video processing time. Figure 2 illustrates the real-time fire monitoring system with rescue path responses. The main contributions of this study are as follows:

  1) The aim of this study is to demonstrate the feasibility of incorporating an attention mechanism, implemented through the SE module together with the SPPF module, to enhance the performance of the YOLOv5s model.

  2) The performance of the proposed model, i.e., the YOLOv5s model with the SE and SPPF modules, is compared with that of the YOLOv5s, YOLOv5, and YOLOv5x models.

  3) The proposed YOLOv5s model is fast and capable of real-time fire detection, making the approach suitable for smart city systems that require quick and timely detection of and response to fires.

Fig. 2
figure 2

Screenshots of the real-time smart city fire monitoring system: A the main screen, B the secure rescue route display, and C, D the fire detection screens

The proposed YOLOv5 design with the SE module is presented in the Methods section. In the Results section, the experimental evaluation of the proposed YOLOv5 model is compared with that of the YOLOv5s, YOLOv5, and YOLOv5x models. The proposed YOLOv5 model was also integrated into a smart city system, and its real-time fire detection performance was demonstrated. Finally, the results are compared with those of recent studies in the Discussion section to highlight the contributions of this work.

2 Methods

YOLOv5 comprises four similar network structures: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x [30]. These configurations differ in the depth and width of the feature maps, with YOLOv5s having the most modest width and depth. In YOLOv5s, the original 608 × 608 × 3 image is first transformed into a 304 × 304 × 12 feature map. Adding residual components and convolutional kernels progressively refines the detection accuracy, but this enhancement is counterbalanced by an increased computational cost. As Fig. 1 shows, the YOLOv5s network exhibits superior speed, albeit at the expense of the lowest AP accuracy. This attribute renders it well suited for real-time target detection scenarios, particularly those involving large targets.

The YOLOv5 model proposed in this study is structured into distinct components, as illustrated in Fig. 3: the input, backbone, neck, and prediction architectures. The key modules contributing to the architecture include the Focus, Conv-Bn-Leakyrelu (CBL), CSP, and SPPF modules. The CBL module combines a convolutional layer, a batch normalization layer, and a leaky ReLU activation function. In the YOLOv5 model, two CSP modules, known as CSP1_x and CSP2_x, are employed. The CSP1_x module, which combines x residual connection units with CBL modules, enhances the ability to capture and propagate features in the backbone architecture. In the neck architecture, the CSP module contains x CBL modules and is called the CSP2_x module. Furthermore, the SPPF module is integrated into the backbone architecture to improve the pooling performance with limited additional computation. The SPPF module performs max pooling serially and consolidates the obtained features by concatenating them. First, the SPPF module applies three 5 × 5 max-pooling layers serially to the output of a convolution + batch normalization + SiLU (CBS) block. The outputs of the CBS block and the three max-pooling layers are then concatenated to extract richer feature information. Finally, another CBS module compiles the result. In contrast, the original spatial pyramid pooling (SPP) structure applies parallel pooling procedures with kernels of different sizes, which increases program computation at the expense of performance [31]. Therefore, the proposed model employs the SPPF module.
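To make the data flow concrete, the following is a minimal PyTorch sketch of the SPPF module as described above (a CBS block, three serial 5 × 5 max-pooling layers, concatenation, and a final CBS block); the channel sizes are illustrative assumptions rather than the exact configuration of the trained model.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU, as described in the text."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Serial 5x5 max pooling; the three pooled maps are concatenated with
    the CBS output and compressed by a final CBS block, as described above."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

# Example: a 19 x 19 backbone feature map with 512 channels (illustrative sizes)
feats = torch.randn(1, 512, 19, 19)
print(SPPF(512, 512)(feats).shape)  # torch.Size([1, 512, 19, 19])
```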

Fig. 3
figure 3

Proposed YOLOv5 design with the squeeze-and-excitation network (SE) module

The backbone architecture of the YOLOv5 model is designed with the inclusion of the BottleneckCSP and Focus modules [32,33,34]. The Focus module, which is unique to the YOLOv5 model, plays a crucial role in downsampling the original 608 × 608 × 3 image into a 304 × 304 × 12 input map by slicing. The sliced map is then passed through a convolution operation with 32 kernels, resulting in a feature map of size 304 × 304 × 32.
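A minimal sketch of this slicing-plus-convolution step, assuming the standard interleaved slicing used by YOLOv5's Focus layer; the kernel size and SiLU activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice every other pixel into four sub-images, stack them on the channel
    axis (608x608x3 -> 304x304x12), then apply a 32-kernel convolution
    (304x304x12 -> 304x304x32), as described in the text."""
    def __init__(self, c_in=3, c_out=32, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in * 4, c_out, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        # Interleaved slicing halves the spatial size and quadruples the channels.
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

image = torch.randn(1, 3, 608, 608)
print(Focus()(image).shape)  # torch.Size([1, 32, 304, 304])
```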

The BottleneckCSP module performs feature extraction on the input maps. Based on the cross-stage partial network (CSPNet) structure, the BottleneckCSP module addresses the issue of redundant gradient information within the model. It integrates gradient variations into the feature maps to balance the speed-accuracy trade-off while minimizing the model size. The distinctions between YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x stem from adjustments made to the depth, width, and residual components of the BottleneckCSP module; a sketch of the cross-stage partial idea is shown below.
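In this sketch, part of the channels pass through n residual bottleneck units while the rest bypass them, and the two paths are merged. The helper function, channel split, and unit count are illustrative assumptions, not the exact layer configuration of the trained model.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # Convolution + batch normalization + SiLU (the CBS block described above).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class Bottleneck(nn.Module):
    """Residual unit: a 1x1 and a 3x3 CBS block with a shortcut connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = cbs(c, c, 1)
        self.cv2 = cbs(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class BottleneckCSP(nn.Module):
    """Cross-stage partial block: one branch passes through n residual units,
    the other bypasses them, and both are concatenated and fused. The number
    of units n and the channel width distinguish the s/m/l/x variants."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = cbs(c_in, c_hidden, 1)
        self.cv2 = cbs(c_in, c_hidden, 1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = cbs(2 * c_hidden, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))

feats = torch.randn(1, 64, 152, 152)
print(BottleneckCSP(64, 128, n=3)(feats).shape)  # torch.Size([1, 128, 152, 152])
```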

The proposed model incorporates an attention mechanism crucial for enhancing its performance. The attention mechanisms in this study are primarily categorized into spatial and channel attention methods. Spatial attention is dedicated to the pixel position in the image and its spatial relationships with neighboring pixels. In contrast, channel attention emphasizes the relationships between different color channels of each pixel, such as the red, green, and blue channels, in RGB images.

The SE module, a channel attention method from SE-ResNet, is integrated to achieve improved accuracy [29]. Guided by the loss function during model training, the SE module enhances image features by increasing the weights of informative features and reducing those of uninformative ones. This strategy takes into account the typically low resolution of input maps and potential variations in target sizes. In the proposed model, the SE module is strategically placed at the end of the backbone architecture to enhance the accuracy of urban fire detection, as depicted by the red box in Fig. 3.
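A minimal sketch of the squeeze-and-excitation operation, assuming the standard global-average-pooling squeeze and a two-layer excitation with a reduction ratio of 16 (the ratio and channel count here are illustrative assumptions).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global average pooling
    (squeeze) followed by a two-layer bottleneck that produces per-channel
    weights (excitation), which rescale the input feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight informative channels, suppress weak ones

# Placed at the end of the backbone, e.g. on a 19 x 19 x 512 feature map:
feats = torch.randn(1, 512, 19, 19)
print(SEBlock(512)(feats).shape)  # torch.Size([1, 512, 19, 19])
```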

The objective of the neck architecture is to enhance the model’s capacity to effectively detect objects across various scales, allowing it to recognize the same object at different sizes. This architecture comprises two components: a feature pyramid network (FPN) [35] and a path aggregation network (PAN) [36]. The FPN adopts a top-down approach, employing upsampling to fuse feature information and generate the final maps. In contrast, the PAN follows a bottom-up design, facilitating a second fusion of the feature information. The synergy of the FPN and PAN strengthens the model’s feature fusion capabilities, as illustrated in Fig. 3. The neck architecture involves three feature fusion layers, which produce new feature maps at three distinct scales: 76 × 76 × 255, 38 × 38 × 255, and 19 × 19 × 255. Notably, as the feature map becomes smaller, each grid cell corresponds to a larger region of the original image. This implies that the 19 × 19 × 255 feature map is suitable for detecting large objects, while the 76 × 76 × 255 feature map is optimal for detecting small objects.
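For context, the 255-channel figure corresponds to the standard YOLO head layout of 3 anchors per scale, each predicting 4 box coordinates, 1 confidence score, and one score per class (80 classes in the default COCO configuration); with the four fire categories used later in this study, the same formula gives 27 channels. A short arithmetic sketch, under those assumptions:

```python
# YOLO-style prediction head channel count: each of the 3 anchors per scale
# predicts (x, y, w, h, confidence) plus one score per class.
NUM_ANCHORS = 3

def head_channels(num_classes):
    return NUM_ANCHORS * (5 + num_classes)

print(head_channels(80))  # 255, matching the 76/38/19-scale maps listed above
print(head_channels(4))   # 27, for the four urban fire categories in this study
```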

The prediction architecture serves the purpose of predicting image features, generating boundary boxes, predicting and classifying the target into its category, and calculating the associated loss. The boundary box comprises parameters such as x, y, W, H, and confidence. x and y represent the relative coordinates of the center of the boundary box within the grid. W and H represent the width and height of the bounding box, respectively, in proportion to the overall image. The confidence indicates whether the grid contains objects and the accuracy of the boundary box coordinate predictions. Equation 1 illustrates the confidence calculation, where \({\text{IOU}}_{\text{pred}}^{\text{truth}}\) represents the intersection over union (IOU) between the predicted boundary box and the true area of the object. The variable \({P}_{\text{r}}\left(\text{Object}\right)\) equals 1 if the grid contains an object and 0 if it does not:

$$\text{Confidence}={P}_{\text{r}}\left(\text{Object}\right)*{\text{IOU}}_{\text{pred}}^{\text{truth}}$$
(1)

Equation 2 defines \({\text{IOU}}_{\text{pred}}^{\text{truth}}\) as the ratio of the intersection between the predicted boundary box and the true region of the target to the union of their areas. A denotes the area of the predicted boundary box, and B denotes the area of the true region of the target:

$${\text{IOU}}_{\text{pred}}^{\text{truth}}=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}$$
(2)
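A minimal sketch of Eq. 2 for axis-aligned boxes given as corner coordinates; the corner-format input is an assumption for illustration (YOLOv5 itself stores boxes as normalized centre/width/height values).

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates, following Eq. 2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box overlapping the ground-truth box by half its area:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.333
```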

In the prediction architecture, three types of loss functions are computed: classification loss (cls_loss), positioning loss (box_loss), and confidence loss (obj_loss). The binary cross-entropy loss function (BCE_loss), presented in Eq. 3, is used to compute the classification loss. In this equation, N represents the total number of samples, y_i denotes the category to which the i-th sample belongs, and p_i is the predicted confidence of that sample. A lower BCE_loss value indicates more accurate target classification:

$$\text{BCE}\_\text{loss}=-\frac{1}{N}\sum\limits_{i=1}^{N}\left({y}_{i}*\text{log}\left({p}_{i}\right)+\left(1-{y}_{i}\right)*\text{log}\left(1-{p}_{i}\right)\right)$$
(3)

The positioning loss is computed using the GIOU_loss function according to Eq. 4. GIOU_loss effectively addresses the loss of distance information between the predicted box and the ground-truth box when the two do not intersect. A smaller GIoU loss indicates a more accurate prediction box. A_c denotes the smallest convex region that encloses both the predicted and the true boxes.

$$\text{GIOU}\_\text{loss}={\text{IOU}}_{\text{pred}}^{\text{truth}}-\frac{\left|{A}_{\text{c}}-(A\cup B)\right|}{\left|{A}_{\text{c}}\right|}$$
(4)
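A minimal sketch of the GIoU term in Eq. 4 for corner-format boxes. The enclosing region A_c is taken here as the smallest axis-aligned box, a common simplification of the convex hull, and detection frameworks typically minimize 1 − GIoU as the actual positioning loss; both points are assumptions about implementation, not claims about the authors' exact code.

```python
def giou(box_a, box_b):
    """GIoU following Eq. 4: IoU minus the fraction of the smallest enclosing
    box A_c not covered by the union. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Smallest enclosing axis-aligned box, standing in for the convex hull A_c
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    iou_val = inter / union if union > 0 else 0.0
    return iou_val - (area_c - union) / area_c

# Non-overlapping boxes still receive a graded (negative) score,
# unlike plain IoU, which is zero for any disjoint pair:
print(giou((0, 0, 10, 10), (20, 0, 30, 10)))  # about -0.333
```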

Equation 5 gives the BCE Logits_loss, in which a sigmoid layer is combined with the binary cross-entropy calculation for multilabel classification. BCE Logits_loss can trade off recall against precision by weighting the positive samples. In Eq. 5, n indexes the samples in the batch, and c is the number of categories. When c is greater than 1, multilabel binary classification is performed; when c equals 1, single-label binary classification is performed. \({p}_{\text{c}}\) is the weight of the positive examples of category c: for \({p}_{\text{c}}\) greater than 1, the recall increases, and for \({p}_{\text{c}}\) less than 1, the precision increases. Finally, the smaller the BCE Logits_loss value, the more accurate the detected regions are:

$$\text{BCE Logits}\_\text{loss}=-w_{n,\text{c}}\left[p_{\text{c}}\,y_{n,\text{c}}*\text{log}\,\sigma\left(x_{n,\text{c}}\right)+\left(1-y_{n,\text{c}}\right)*\text{log}\left(1-\sigma\left(x_{n,\text{c}}\right)\right)\right]$$
(5)
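A minimal sketch of Eq. 5 using PyTorch's built-in weighted BCE-with-logits loss, where pos_weight plays the role of p_c; the scores, labels, and weight values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5, -0.3]])   # raw scores x_{n,c}, c = 4 classes
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0]])    # labels y_{n,c}

# pos_weight > 1 favours recall; pos_weight < 1 favours precision.
recall_biased = nn.BCEWithLogitsLoss(pos_weight=torch.full((4,), 2.0))
precision_biased = nn.BCEWithLogitsLoss(pos_weight=torch.full((4,), 0.5))

print(recall_biased(logits, targets).item())
print(precision_biased(logits, targets).item())
```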

3 Materials

A total of 3724 urban fire images were obtained through web crawling, capturing a diverse range of fire types related to urban incidents as well as flames unrelated to urban fires. Colabeler software (version 2.0.3, MacGenius, Blaze Software, CA, USA) was used to annotate the target objects and categories. The labeled categories cover building fires, industrial fires, vehicle/car fires, and normal fires (flames unrelated to urban fires). The label information is stored in a text file corresponding to each image, where each line denotes one target within the image, specifying the category of the target followed by the center coordinates of the detection frame and its width and height. The proposed YOLOv5 model was trained on images with a resolution of 608 × 608, which is the best resolution for the proposed model. However, the resolution of the input images was 640 × 640; therefore, the OpenCV framework was employed to resize the input images to 608 × 608. The target dataset was divided into a training set and a test set at a ratio of 9:1. In the model configuration, the number of classes (nc) parameter was set to 4, the number of iterations to 300, and the batch size to 16. Finally, the YOLOv5 model trained on the urban fire dataset was integrated into a monitoring system for accurate and efficient data analysis. The proposed YOLOv5 model was developed on a Linux operating system with a Python 3.7 environment, and the train.py file was modified to set the dataset and model weights. Table 1 provides details of the specific hardware configuration and development environment.
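The preprocessing pipeline can be sketched as follows; the directory layout, file extensions, and label parsing are assumptions based on the standard YOLO text-label format described above, not the exact scripts used in this study.

```python
import glob
import os
import random
import cv2

# Gather the annotated images (hypothetical paths) and split them 9:1.
image_paths = sorted(glob.glob("fire_dataset/images/*.jpg"))
random.seed(0)
random.shuffle(image_paths)
split = int(0.9 * len(image_paths))
train_paths, test_paths = image_paths[:split], image_paths[split:]

os.makedirs("fire_dataset/images_608", exist_ok=True)
for path in train_paths:
    # Resize the 640 x 640 source images to the 608 x 608 training resolution.
    img = cv2.imread(path)
    img = cv2.resize(img, (608, 608))
    cv2.imwrite(os.path.join("fire_dataset/images_608", os.path.basename(path)), img)

    # Each label line: class index, normalized centre x, centre y, width, height.
    label_path = path.replace("images", "labels").replace(".jpg", ".txt")
    with open(label_path) as f:
        targets = [line.split() for line in f]
```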

Table 1 Hardware configuration and development environment

4 Results

Figure 4 shows the feature extraction results for building fire images from the proposed YOLOv5s model and the original YOLOv5s model after training. The proposed model achieves a precision of 0.84, whereas YOLOv5s achieves only 0.65. Moreover, the training metrics of the proposed YOLOv5s model are visualized using the TensorBoard tool, as shown in Fig. 5. The plots display declining loss values for the prediction box, objectness, and classification functions, with notable stabilization after 180 iterations. The precision and recall increase gradually during the iterations, reaching stability towards the end of the training process.

Fig. 4
figure 4

Training feature extraction images on the proposed YOLOv5s and YOLOv5s models

Fig. 5
figure 5

Matrix diagram of the training results on the proposed YOLOv5s model

Figure 6 presents the corresponding confusion matrix of the training model. For the proposed YOLOv5s model, the confusion matrix indicates 87% accuracy in detecting industrial fires, with 12% falsely identified as irrelevant fires. For building fires, the detection accuracy reached 94%, with 6% mistakenly identified as unrelated fires. The accuracy in detecting vehicle fires is 98%, and the detection accuracy for unrelated fires is 100%. Among the images that did not align with the prediction results, 50% were identified as building fires, and the remaining 50% were detected as unrelated normal fires.

Fig. 6
figure 6

Confusion matrix of the training results on the proposed YOLOv5s model

The proposed YOLOv5s, YOLOv5s, YOLOv5, and YOLOv5x models were trained on 3351 images and tested on 373 images. Figure 7 shows the training and validation loss results for the target and boundary-region predictions; the results are consistent with each other. Figure 8 presents the training precision, recall rate, mAP@0.5, and mAP@0.5:0.95 results for each model. The accuracy of the proposed YOLOv5s model is 5.3% higher than that of YOLOv5s and 1.8% higher than that of YOLOv5. However, YOLOv5x outperforms the proposed YOLOv5s with a 2.6% higher accuracy. Regarding the recall rate and mAP@0.5:0.95, the proposed YOLOv5s demonstrates performance similar to that of YOLOv5. Notably, the proposed YOLOv5s exhibits an 8.3% increase in the recall rate and an 8.1% increase in mAP@0.5:0.95 compared to both the original YOLOv5s and YOLOv5x.

Fig. 7
figure 7

The loss calculation plots for A training and target features, B training and feature extraction boundary regions, C validation and target features, and D validation and feature extraction boundary regions on the corresponding proposed YOLOv5s, YOLOv5s, YOLOv5, and YOLOv5x models

Fig. 8
figure 8

Plots of A precision, B recall rate, C mAP@0.5, and D mAP@0.5:0.95 for the corresponding proposed YOLOv5s, YOLOv5s, YOLOv5, and YOLOv5x models

The evaluation metrics used for assessing the trained models include precision, recall, mean average precision (mAP), and the F1 score. Precision is the probability of actual positive samples among all samples predicted to be positive. The formula is shown in Eq. 6:

$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}*100\%$$
(6)

TP represents true-positive samples (positive samples correctly detected as positive), FN represents false-negative samples (positive samples incorrectly detected as negative), and FP represents false-positive samples (negative samples incorrectly detected as positive). Recall is the probability that an actual positive sample is predicted as positive. The formula is shown in Eq. 7:

$$\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}*100\text{\%}$$
(7)

The mean average precision (mAP) is the average of the AP values over all categories in the validation set, and the formula is shown in Eq. 8. Precision and recall often have a “dilemma” relationship. The F1 score balances the two metrics and identifies the point at which precision and recall are in equilibrium. The formula is shown in Eq. 9:

$$\text{mAP}=\frac{1}{n}\sum\limits_{k=1}^{n}{\text{AP}}_{k}$$
(8)
$$\text{F}1=\frac{2*\text{Precision}*\text{Recall}}{\text{Precision}+\text{Recall}}$$
(9)
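A minimal sketch of Eqs. 6-9 with hypothetical counts; the numbers are illustrative only, not results from Table 2.

```python
def precision(tp, fp):
    return tp / (tp + fp) * 100             # Eq. 6, in percent

def recall(tp, fn):
    return tp / (tp + fn) * 100             # Eq. 7, in percent

def mean_average_precision(ap_values):
    return sum(ap_values) / len(ap_values)  # Eq. 8, mean of per-class AP

def f1_score(p, r):
    return 2 * p * r / (p + r)              # Eq. 9

# Hypothetical counts for a single fire category:
tp, fp, fn = 94, 6, 9
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision {p:.1f}%, recall {r:.1f}%, F1 {f1_score(p, r):.1f}")
```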

Figure 9 displays the identified images with precision results, showcasing the boundary regions for the respective YOLO models in scenarios involving car/vehicle fires, building fires, industrial fires, and normal fires. Multiple boundary regions are obtained for industrial fires using the proposed YOLOv5s model, a feature not observed in the other models. In addition, the precision results of the proposed YOLOv5s are higher than those of the other models in this study. In Fig. 10, the results of vehicle fire detection demonstrate that the proposed YOLOv5s and YOLOv5x models exhibit similar performance, achieving an accuracy of approximately 98%. In contrast, YOLOv5 shows the lowest detection accuracy at 92.6%. Moreover, YOLOv5s and YOLOv5 demonstrate comparable outcomes in building fire detection, achieving accuracies of 89.3% and 90.1%, respectively. The proposed YOLOv5s surpasses the other three algorithms, attaining an accuracy of 94.3%, reflecting a 1.2% improvement compared to YOLOv5x. For industrial fire detection, the proposed YOLOv5s and the original YOLOv5s exhibit similar accuracies of 87.2% and 85.6%, respectively. Conversely, YOLOv5x achieved the highest detection rate for industrial fires, reaching 92.9%. In terms of irrelevant fire detection, YOLOv5s, YOLOv5, and YOLOv5x display comparable accuracies of 95.4%, 96.3%, and 94.1%, respectively, while the proposed YOLOv5s exhibits a slightly lower accuracy of 89.1%. These findings provide a comprehensive overview of the models’ performance across the different fire detection categories.

Fig. 9
figure 9

The recognized images with precision results and feature extraction boundary regions are presented for the corresponding YOLO models in scenarios A car/vehicle fires, B building fires, C industrial fires, and D normal fires

Fig. 10
figure 10

The bar chart illustrates the overall precision results of the proposed YOLOv5s, YOLOv5, YOLOv5x, and YOLOv5s models across scenarios, including car/vehicle fires, building fires, industrial fires, and normal fires

The precision, recall rate, average precision (mAP@0.5), and F1 score are calculated from the true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) results, as shown in Table 2. The original YOLOv5s exhibits the lowest precision, recall rate, average precision, and F1 score among the four algorithms. The accuracy of the proposed YOLOv5s is comparable to that of YOLOv5, while YOLOv5x achieves the highest accuracy. This discrepancy can be attributed to the fact that accuracy is influenced only by TP and FP. However, the recall rate of the proposed YOLOv5s outperforms that of YOLOv5x, as the recall rate depends not only on TP but also on FN. YOLOv5x exhibits a lower recall rate, average precision, and F1 score than the proposed YOLOv5s, primarily due to the smaller number of underlying feature convolution layers in YOLOv5x, resulting in lower recognition accuracy for small objects. In comparison to YOLOv5x, the proposed YOLOv5s achieves an increase of 8.3% in the recall rate, 8.1% in the average precision, and 2.5% in the F1 score.

Table 2 Precision, recall rate, mAP, F1 score, and image and 35 fps video processing times of the proposed YOLOv5s, YOLOv5s, YOLOv5, and YOLOv5x models

Consequently, compared with the other three algorithms, the proposed YOLOv5s provides useful results in urban fire detection. In addition, the processing time of the proposed YOLOv5s is significantly shorter, at 0.009 s per image and 0.19 s for a 35 fps video.

5 Discussion

This study conducted a comparative test of three models, namely YOLOv5s, YOLOv5, and YOLOv5x, using the same dataset and a hardware and development environment identical to those of the proposed YOLOv5s model. The proposed model not only accurately classifies flame images unrelated to fire scenes found on social platforms but also effectively detects various fire types in relevant fire scene images, marking their confidence levels. The results indicate that the effectiveness of the models increases with increasing depth and width. The proposed YOLOv5s outperforms the other three models in terms of detection speed on both images and 35 fps videos. The results of the proposed YOLOv5s consistently indicate superior performance compared to previous fire detection studies that implemented the YOLOv3 model [29, 37]. Moreover, the outcomes show improved performance compared to previous fire detection studies that utilized Faster R-CNN for flame detection [38, 39]. For efficient forest fire detection, a previously proposed YOLOv5 model achieved an average accuracy of 0.86 for identifying various forest fires with a processing time of 0.015 s per frame [18].

The quantitative results of fire detection are shown in Table 3, which compares recent results on fire detection performance. The proposed YOLOv5s model obtains results comparable to those of newer versions of YOLO, such as YOLOv8 [4]. The benefits of the attention mechanism are thus clearly demonstrated. However, the recall rate is lower than that of the newer YOLO version. Newer models might include more object detection modules that increase the complexity and improve the recall rate. Furthermore, the newer version is able to detect smoke and fire at the same time, which could not be tested in the proposed model. In addition, the accuracy of automatically moving labeled bounding boxes using YOLOv3 is similar to that of the proposed model [17]. However, that automatic motion-labeled bounding box mechanism is designed for microchip systems and has limited accuracy on blurred images, which is a limitation [17]. Another recent study also obtained similar results using only three layers [5]. Therefore, the proposed YOLOv5s model balances the trade-off between YOLO model complexity and design while maintaining the precision of fire detection.

Table 3 Quantitative results of fire detection

The remarkable results achieved by the proposed YOLOv5 model can be attributed to the inclusion of the SPPF and SE modules. A recent study also reported performance improvements in YOLOv5 with SPPF modules, aligning with our findings [18]. The results indicate that multiple recognized boundary regions were detected by the proposed YOLOv5 model, overcoming the limitation of detecting small objects [4, 15]. This outcome can be attributed to the incorporation of the SE module, which demonstrated enhanced performance primarily due to the introduction of attention mechanisms. This improvement is evident in the superior capture of objects in the feature maps, addressing the challenge of small object detection. Incorporating fire alerts into monitoring systems during the early stages of a fire enhances the ability to respond promptly, minimizing damage to human life and property. Placing the SE module at the end of the backbone architecture has proven beneficial for fire detection, further optimizing the model’s capabilities.

The data analysis of the experimental results revealed challenges in identifying distant car/vehicle fires. This is attributed to the YOLOv5s network compressing the image, causing the pixels representing distant car/vehicle fires to be too small for effective feature extraction, resulting in diminished recognition performance. It is crucial to address this recognition performance issue, especially as electric bikes and small cars become more commonly available on streets. Fire control of car/vehicle fires could be performed quickly and simply enough using fire extinguishers rather than firefighters. To overcome this limitation, it may be beneficial to consider increasing the width and depth of the YOLOv5s network appropriately. This adjustment can enhance its detection capabilities for distant targets, improving overall performance.

The proposed YOLOv5s model demonstrates suboptimal performance in identifying industrial fires. This limitation can be attributed to the infrequent occurrence of industrial fires compared to vehicle fires and building fires. The challenge in distinguishing between building and industrial fires arises from the similarity in the background and target object characteristics. In addition, the training dataset contains a relatively low number of images related to industrial fires, resulting in an unsatisfactory training effect and lower accuracy. To address this issue, a potential solution could involve collecting more images related to industrial fires for training to enhance the accuracy of identifying industrial fires. This approach could address the real-time allocation of manpower and firefighting equipment, leading to a more efficient response to fires.

6 Future Work

In this study, the results show that the attention mechanism could enhance model performance. However, this approach was tested only on the YOLOv5s model. For this reason, a more comprehensive study could further test and validate the attention mechanism, with or without other mechanisms, in different versions of the YOLO model. This could provide more research data on processing time and on the performance and robustness of the models for real-world applications, for example, smoke and/or fire detection. In addition, the proposed YOLOv5s model was not trained or tested on bushfires. Bushfires can cost thousands of lives and affect millions of people, and a recent study achieved reasonable accuracy in forest fire detection [18]. A real-time monitoring system could be a good solution to protect lives and reduce damage. Therefore, the proposed YOLOv5s model could also incorporate an additional function for bushfire detection.

While there is a technological trend to integrate fire detection functions into the cloud or Internet of Things (IoT) using advanced models such as YOLOv8 [4], there remains a preference for simple, portable, and high-performance solutions with fast processing speeds, especially in remote-sensing applications. Many devices in such scenarios may lack powerful GPUs and processing capabilities due to the need for portability and energy efficiency. In this context, the proposed YOLOv5s model proves to be a suitable choice for meeting the demand for efficient fire detection within reasonable budget constraints.

Finally, real-time fire detection requires fast and accurate methods, which relate directly to model performance. As a result, the fire detection model could be integrated into microchips for bushfire or forest fire applications. A small, single-chip device could be developed and installed inside national parks or forests for real-time fire detection. Moreover, an autopiloted drone could be another application of the proposed YOLOv5s model for automatic detection and fighting of bushfires or forest fires. However, integrating the proposed YOLOv5s model into a single processing microchip requires funding and researchers from different research areas to achieve successful results.

7 Limitations

The results do not include images or videos with artifacts, as obtaining fire images with various artifacts is challenging. In addition, the use of high-resolution cameras in monitoring systems may introduce artifacts, but these can be considered negligible, and the proposed model integrates modules that are capable of detecting small objects, contributing to its effectiveness. Second, the dataset could be expanded to encompass more fire scenarios, including not only vehicle and building fires but also wildfires and forest fires. The superior accuracy in vehicle fire detection may be attributed to the greater proportion of vehicle fire images in the training set, while the lower accuracy in industrial fire detection across all algorithms is linked to the scarcity of industrial fire images in the training set, coupled with the similarity of industrial fires to building fires. Finally, the detection speed depends on the complexity of the proposed model and on the hardware. It is suggested that the proposed model be integrated into a workstation rather than a desktop PC, as overall efficiency results on a PC were not obtained in this study.

8 Conclusion

This study integrates the SE module into the YOLOv5 model and conducts tests using a fire image library sourced from the internet with varying image resolutions. The dataset was manually annotated, and the images were classified into different fire categories, thereby expanding the application scope of this method for detecting both rural and urban fires. The proposed YOLOv5 network not only improves the recognition of target images, achieving faster and more accurate results, but also facilitates the classification of fires for informed decision-making during accidents. This research underscores the importance of detecting fire accidents rapidly and in real time at their early stages to enhance safety and support effective decision-making. Although newer versions of YOLO models can detect smoke and fire concurrently at the cost of increased complexity, simple, fast, and accurate fire detection YOLO models are required in real-time surveillance systems. This capability allows the proposed YOLOv5 network to meet the demands of real-world applications, including surveillance systems for fire detection. In future work, the system could be integrated with the IoT and/or microchips to detect and suppress fires.