Smart Motorcycle Helmet: Utilizing CNN for Multi-Task Learning

Enhancing road safety through education and enforcement relies on automated motorcycle helmet detection via video surveillance. However, existing methods encounter challenges such as tracking individual motorcycles and distinguishing riders from passengers. To overcome these limitations, we propose a CNN-based multi-task learning (MTL) method. Our approach focuses on identifying and tracking individual motorcycles, with particular emphasis on rider-specific helmet use. We introduce the HELMET dataset, containing 91,000 annotated frames of 10,006 motorcycles across 12 observation sites in Myanmar, serving as a benchmark for future detection techniques. Leveraging concurrent visual similarity learning and helmet use classification, our MTL approach achieves enhanced efficiency and accuracy, operating at over 8 FPS on consumer hardware. With a weighted average F-measure of 67.3% for detecting riders and helmet use, our method underscores deep learning's accuracy and resource efficiency in gathering critical road safety data. Furthermore, we present an intelligent motorcycle helmet equipped with infrared transceivers, an image sensor, an embedded computation module, a charging module, a microphone, and earphones. Designed for large vehicle approach notification, the helmet uses image recognition modes for both day and night conditions. Experimental results demonstrate successful vehicle registration plate recognition for large trucks and buses, achieving up to 75% accuracy during the day and 70% at night. The proposed intelligent motorcycle helmet detects approaching large vehicles in real time within a distance of 5 meters.


Introduction
In contemporary times, motorcycles have become widely embraced as a convenient and popular mode of transportation across various regions, including China, Indonesia, India, the Philippines, Malaysia, Taiwan, Thailand, and Vietnam.
Offering distinct advantages such as affordability, superior fuel efficiency, and a compact size suited to parking in congested areas, motorcycles dominate the transportation landscape in many Asia-Pacific regions; however, this popularity also brings significant risks and potential dangers. Globally, there are approximately 200 million motorcycles, a density of around 33 motorcycles per 1,000 people [1]. The fatality rate per vehicle mile traveled for motorcycles in 2007 was 37 times higher than that of passenger cars, according to data from the National Highway Traffic Safety Administration (NHTSA) in the United States. Disturbingly, the World Health Organization (WHO) reports an annual toll of 2.17 million motorcyclist deaths from traffic accidents, with an additional 20 to 50 million people sustaining injuries.
In response to the critical issue of road safety for motorcyclists, numerous countries have implemented laws mandating the use of helmets. Research indicates that wearing a helmet can reduce traffic accident fatalities by nearly one-third (approximately 29%). Taiwan, for instance, grapples with safety concerns given its substantial motorcycle population, which constituted 63.83% of all motor vehicles by the end of 2023 [2]. Alarming statistics reveal that over 3,500 people in Taiwan perish annually in road traffic accidents, with motorcyclists accounting for more than 70% of these incidents [3]. The risk of death per kilometer traveled on a motorcycle is 20 times higher than that for car drivers or passengers. Large vehicles, such as trucks and buses, are frequently implicated in motorcycle collisions, exacerbated by factors like blind spots and front-view dead angles [4].
To address the pressing issue of heavy-vehicle rear collisions and safeguard motorcyclists, this paper proposes an intelligent motorcycle helmet. This innovative helmet integrates infrared (IR) sensors with an image sensor, utilizing real-time image processing and recognition methodologies to identify large trucks and buses [5]. The system aims to prevent collisions and enhance road safety for motorcyclists, particularly in congested areas where traditional alerts may lose effectiveness. Moving forward, the paper introduces a novel system for helmet use detection in Intelligent Transportation Systems (ITS) using a combination of Transformer models for vision and the Cascade RCNN framework [6]. Leveraging the Swin Transformer as a feature extraction backbone, the proposed method outperforms existing CNN-based approaches in detecting helmet usage with a mean average precision (mAP) of 30.4 [7].
The study emphasizes the potential of Transformer-based models for solving computer vision problems within the realm of road safety. By introducing cutting-edge technologies and methodologies, the paper aims to contribute to ongoing efforts to mitigate the risks associated with motorcycle accidents and improve overall road safety. Traditional methods for detecting active motorcycles typically follow a standardized procedure. Initially, a background subtraction technique is employed to isolate moving objects or vehicles from video data [8]. Subsequently, a binary classifier, such as a support vector machine (SVM), is applied to identify motorcycles. The next step involves localizing the head region of motorcyclists, followed by the use of an additional classifier to differentiate between helmet use and non-helmet use.
To enhance the performance of the binary classifier, handcrafted features, such as histograms of oriented gradients (HOG) extracted from detected head regions, are commonly employed [9].
However, these methods encounter limitations when dealing with scenarios involving numerous motorcycles or multiple riders on a single motorcycle. In contrast to handcrafted feature design, deep learning methods aim to automatically derive representations from raw image data that are optimal for helmet use detection. In [10], helmet use is classified within detected head regions using a convolutional neural network (CNN). In [11] and [12], two separate CNNs are trained: one to distinguish motorcycles from other vehicles, and the other to classify helmet and non-helmet use in the head region of riders. To address the time-consuming nature of employing two separate CNNs for motorcycle and helmet use detection, [13] and [14] advocate using a single CNN for simultaneous detection of motorcycles and helmet use. However, the tracking of individual motorcycles across single frames of recorded video is incorporated in only half of the existing approaches outlined in Table 1. As video data from traffic surveillance infrastructure inherently follows a frame-based structure, helmet use data generated through automatic detection must be mapped onto individual motorcycles to enable an accurate evaluation of helmet use [15]. This necessitates remapping frame-based detection results for motorcycle and rider counts, as well as helmet use, onto individual motorcycles appearing in multiple frames.
Unfortunately, some approaches lack this cross-frame tracking element.
To address the absence of tracking, alternatives include using single-frame detection at a fixed point or line in the frame to prevent repeated detection of the same motorcycle, or collecting helmet use data in every video frame without tracking, resulting in a loss of information about the number of motorcycles at an observation site [16]. Both shortcuts compromise the quality of helmet use data and prevent the use of multiple frames of an individual motorcycle for helmet use and rider detection. Concerning rider number and position detection, only one of the approaches listed in Table 1 provides detailed information on this aspect [17]. While alternative approaches use head counts on the motorcycle as a proxy for rider numbers, this method lacks the granularity needed for accurate detection of rider number and position [18].
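The cross-frame association step discussed above can be sketched with a minimal greedy IoU tracker. This is an illustrative simplification, not the paper's actual tracking method (which uses learned visual similarity); the class and parameter names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

class IoUTracker:
    """Greedy frame-to-frame association of motorcycle detections:
    each new box joins the live track it overlaps most, or starts a new track."""
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.next_id = 0
        self.tracks = {}  # track_id -> last seen box

    def update(self, detections):
        assigned = {}
        unmatched = list(self.tracks.items())
        for box in detections:
            best = max(unmatched, key=lambda t: iou(t[1], box), default=None)
            if best is not None and iou(best[1], box) >= self.iou_threshold:
                tid = best[0]            # continue an existing track
                unmatched.remove(best)
            else:
                tid = self.next_id       # open a new track
                self.next_id += 1
            assigned[tid] = box
        self.tracks = assigned           # tracks without a match are dropped
        return assigned
```

Pooling per-track detections this way is what allows helmet-use evidence from many frames to be attributed to one motorcycle.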

Methodology
The identification of motorcycles within a single frame constitutes a traditional object detection task. For this purpose, we employed a cutting-edge object detection algorithm trained specifically for motorcycle detection in our dataset. Presently, object detection algorithms can be broadly categorized into two approaches: one-stage and two-stage. While two-stage algorithms generally exhibit superior detection accuracy, they tend to be comparatively slower, as each frame is processed twice: once to identify potential object locations and once more to detect the actual objects. In contrast, single-stage methods consolidate the steps of localizing potential objects and object detection into a single processing stage [19]. This consolidation results in a slight reduction in accuracy but a significant decrease in processing time.
A relatively recent single-stage method, RetinaNet [20], employs a multi-scale feature pyramid in conjunction with focal loss to effectively overcome limitations in detection accuracy. It achieves faster detection than two-stage methods while maintaining higher detection accuracy than comparable single-stage methods like YOLO [21]. Consequently, we opted to use a RetinaNet model for motorcycle detection. Given the similarity between motorcycle detection and other object detection tasks, we chose not to train the model from scratch. Instead, we fine-tuned a RetinaNet model using pre-trained weights obtained from the COCO dataset [22]. This approach leverages existing knowledge to enhance the model's proficiency in accurately identifying motorcycles in our specific dataset.
Within the framework of edge computing, data processing occurs on servers positioned at the edge of the network. These servers establish direct connections with a myriad of sensors and controllers, enabling them to analyze information and execute machine learning algorithms for real-time decision-making [23]. This project employs an ESP32 module equipped with an integrated camera functioning as a Wi-Fi camera. The processed data stream is securely transmitted to the Google Cloud Platform via the Cloud IoT Core, facilitated by a Raspberry Pi board serving as a local server executing the TensorFlow object detection model [24]. The data undergoes event-driven processing, triggering alerts as necessary.
Additionally, the local server provides access to a web interface for offline monitoring of cameras, while Firebase cloud functions archive data on Firebase. This archival process enables streaming of video to internet-connected users via the web interface. The Raspberry Pi gateway scans the network via mDNS to locate local cameras, identifies objects, and transmits processed data to the cloud [25]. It also hosts a web interface for local data access. Pre-made machine learning models for various data types and purposes are distributed through NPM under the @tensorflow-models scope.
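The event-driven alerting step on the gateway can be sketched as a simple filter over detection events. The event schema, field names, and the truck/bus class set below are illustrative assumptions; only the 5-meter notification range is taken from the paper.

```python
LARGE_VEHICLE_CLASSES = {"truck", "bus"}  # classes that should trigger an alert
ALERT_DISTANCE_M = 5.0                    # notification range stated in the abstract

def process_event(event, alerts):
    """Handle one detection record produced by the object detection model.
    `event` is a dict such as {"label": "truck", "distance_m": 4.2}.
    Appends a voice-alert message to `alerts` and returns True when a
    large vehicle is detected within the alert range."""
    if (event["label"] in LARGE_VEHICLE_CLASSES
            and event["distance_m"] <= ALERT_DISTANCE_M):
        alerts.append(f"ALERT: {event['label']} within {event['distance_m']:.1f} m")
        return True
    return False
```

In the deployed system, appending to the alert list would instead publish a cloud message or play the earphone notification.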

Model Training and classification
For a closed track of sufficient length, helmet use is estimated by pooling the helmet-use predictions of the cropped image patches within the track. More specifically, let (x^(n)), n = 1, ..., N, be the cropped image patches assigned to a tracked motorcycle; the track's helmet-use class is then estimated from the pooled per-patch predictions.

The Swin Transformer architecture processes a sequence of tokens as its input, with these tokens generated by applying a patch partition layer to the input image, dividing it into N patches [26]. The architecture's hidden layers consist of multiple blocks, each comprising a multi-head self-attention (MSA) module, as depicted in Figure 2. The MSA module applies an attention function to a set of query (Q), key (K), and value (V) vectors: it maps each query to a collection of key-value pairs, producing an output from the dot products of the query vector with all key vectors. A softmax function then scales the inner products and normalizes them into attention weights, as defined in Eq. 1. The W-MSA module conducts attention calculations locally, applying self-attention within non-overlapping windows. To enable cross-window self-attention computations, the SW-MSA module performs the same calculations as the W-MSA module, but after shifting the windows. This window-based attention mechanism enhances the Swin Transformer's ability to capture relationships within the input sequence.
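The track-level pooling step can be sketched in plain Python. Averaging the per-patch class probabilities and taking the argmax is one natural pooling rule, assumed here for illustration; the paper's exact pooling operation may differ.

```python
def pool_track_prediction(patch_probs):
    """Estimate a track's helmet-use class from its N cropped patches.
    `patch_probs` is a list of per-patch class-probability vectors
    (one per patch x^(n)); the track class is the argmax of the
    patch-averaged probabilities."""
    n_patches = len(patch_probs)
    n_classes = len(patch_probs[0])
    mean = [sum(p[c] for p in patch_probs) / n_patches for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean[c])
```

Pooling over all patches of a track makes the final label robust to single-frame misclassifications, e.g. when the rider's head is briefly occluded.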
Training the model involves leveraging TensorFlow's Object Detection API, which simplifies the creation, training, and deployment of object detection models. Each model within TensorFlow's repertoire is characterized by its speed, mean average precision (mAP), and output format [27]. Typically, a higher mAP corresponds to a slower model. In our case, we opted for the EfficientDet-Lite0 architecture for training, although the choice of architecture can vary depending on the priority between speed and accuracy. The EfficientDet-Lite family (Lite0 to Lite4) comprises mobile/IoT-friendly object detection models.

We conducted a more in-depth analysis of the results, categorizing them by the number of occupants per motorcycle, and observed improved outcomes for classes with up to two riders, as detailed in Table 1. This improvement is likely due to the prevalence of single- or dual-rider scenarios in real-world settings, resulting in a larger dataset for these classes. However, our scrutiny also revealed some dataset limitations, notably the presence of imbalanced classes.
Some classes lacked sufficient samples in both the training and test sets: classes 21 to 32 had no examples at all, and classes 33 to 36 had few samples in the training set and none in the test set. This imbalance posed challenges for accurate detection of these categories by the models.
Our proposed model, incorporating the Swin Transformer as a backbone and integrating it with a Cascade RCNN framework for object detection, outperformed all other models in terms of mean average precision (mAP) across the 36 classes. Notably, it surpassed the YOLOv7 [28] model, which achieved the second-highest mAP score. This superiority can be attributed to the Swin Transformer's more effective feature extraction, which leverages attention mechanisms to extract pertinent features from input images. In contrast, YOLOv7 relies on predefined anchor boxes and a CNN-based backbone for feature extraction, a method that may not be as proficient as the attention mechanism at identifying and extracting relevant features [29]. While the YOLOv7 model shows a slight edge in weighted mean average precision, it falls short of the Swin Transformer and Cascade RCNN model in overall mAP.
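For readers unfamiliar with the mAP metric compared above, the per-class average precision can be computed from ranked detections as the area under the precision-recall curve; mAP is then the mean over classes. The sketch below uses the simple all-points rectangle rule rather than the interpolated variant used by COCO-style evaluators.

```python
def average_precision(scored, n_positives):
    """Average precision for one class.
    `scored` is a list of (confidence, is_true_positive) pairs, one per
    detection; `n_positives` is the number of ground-truth boxes.
    Detections are ranked by descending confidence and the precision-recall
    curve is integrated with the rectangle rule."""
    scored = sorted(scored, key=lambda s: -s[0])
    tp = fp = 0
    ap, last_recall = 0.0, 0.0
    for _, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_positives
        precision = tp / (tp + fp)
        ap += (recall - last_recall) * precision  # area added at this rank
        last_recall = recall
    return ap
```

A perfect ranking (all true positives first, all positives found) yields AP = 1.0; each false positive ranked above a true positive lowers the score.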

Intelligent Helmet with Advanced Vehicle Recognition and Alert System
While existing intelligent motorcycle helmets excel at recognizing rear-approaching vehicles, they may encounter limitations in identifying larger vehicles. Consequently, their applicability becomes constrained, particularly in congested areas, as depicted in Fig. 3. Notably, existing studies have yet to address the need for an active intelligent helmet capable of detecting large truck/bus proximity, despite the severity of accidents involving such vehicles.
To address this gap, this paper introduces an intelligent helmet that integrates various electronic components, including miniature IR transceivers, a compact image sensor (camera), an embedded systems module, and a battery charging module. The proposed system employs an image recognition method to ascertain the presence of an approaching heavy vehicle in the motorcyclist's vicinity. When such a vehicle is detected, earphones promptly deliver a voice alert, prompting the motorcyclist to navigate away swiftly without the need for constant mirror checks. This design allows the motorcyclist to focus on the vehicles ahead without compromising safety. The integration of TensorFlow models and renewable energy into electric vehicles represents a convergence of two cutting-edge technologies with profound implications for sustainable transportation and energy efficiency. This integration harnesses machine learning and renewable energy sources to optimize the performance, range, and environmental impact of electric vehicles (EVs) [30].
At its core, TensorFlow serves as a robust framework for developing and deploying machine learning models, including those tailored for renewable energy applications. The proposed recognition method comprises two processes. The first process assesses whether there is a proportional relationship between the plate and the frame; the objective here is to discern the potential presence of a license registration plate in the image. The second process evaluates the color ratios within a rectangular frame to confirm the existence of license registration plates on larger vehicles in the image.
A standard ratio is applied to an object image with dimensions of 2.375 units in length and 1 unit in width. This ratio serves as the criterion for determining whether the candidate rectangle has the correct length and width for a license plate. If these criteria are not met, the system captures a new image for re-recognition. Conversely, if the dimensions align, the next step assesses whether the rectangular frame meets the pre-established color density threshold. If it falls short, the system prompts the capture of a new image for subsequent recognition.
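The two-stage plausibility check described above can be sketched as follows. The 2.375:1 aspect ratio comes from the text; the tolerance and color-density threshold values are illustrative assumptions, not figures from the paper.

```python
PLATE_ASPECT = 2.375  # standard length-to-width ratio of the plate rectangle

def plate_candidate(width, height, white_ratio,
                    aspect_tol=0.15, color_threshold=0.6):
    """Decide whether a detected rectangle plausibly contains a license plate.
    Stage 1: the width/height ratio must lie within `aspect_tol` (relative)
    of the standard 2.375:1 plate shape.
    Stage 2: the fraction of plate-background-colored pixels (`white_ratio`)
    must reach the color-density threshold.
    Returning False corresponds to capturing a new image and retrying."""
    if height <= 0:
        return False
    aspect_ok = abs(width / height - PLATE_ASPECT) <= aspect_tol * PLATE_ASPECT
    if not aspect_ok:
        return False                         # wrong shape: re-capture
    return white_ratio >= color_threshold    # color-density confirmation
```

Gating on geometry first is cheap and rejects most non-plate rectangles before the more expensive color analysis runs.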

Conclusion
This paper introduces an intelligent motorcycle helmet designed to enhance motorcyclist safety. The proposed helmet incorporates an image recognition approach focused on identifying large trucks and buses on the road. To achieve this, two recognition algorithms, one for daytime and another for nighttime, are developed specifically for detecting vehicle registration plates on large Taiwanese trucks and buses. The accuracy of the registration plate recognition is evaluated using a dataset of 600 images capturing rear-approaching trucks and buses on real roads, collected by 10 motorcyclists during both day and night conditions. The recognition accuracy rates are reported as approximately 78% at night and 85% during the day. Furthermore, the proposed intelligent helmet features integrated Bluetooth (BT) transmission, enabling the helmet to send notifications when a large truck or bus is approaching. Results from helmet use detection of tracked motorcycles reveal a weighted F-measure of 67.3%, demonstrating the capability of the approach to provide reliable estimates of motorcycle helmet usage, rider number, and position. The ablation study results indicate that the approach achieves notably high accuracy across the ablation experiments, albeit with some compromise on computational efficiency.
Despite this, the approach processes at a speed of more than 8 frames per second (FPS) on consumer hardware, approaching real-time efficiency for 10 FPS video data.Overall, this work demonstrates the successful implementation of all four fundamental elements of helmet use registration through a CNN-based approach that is computationally efficient on consumer hardware.

Figure 1 .
Figure 1. TensorFlow architecture overview.

Attention(Q, K, V) = Softmax(QK^T / √d_k) V    (1)

Here, d_k represents the key dimension, and normalization is carried out by dividing by √d_k. The Swin Transformer introduces two variations of the multi-head self-attention module: the window-based multi-head self-attention module (W-MSA) and the shifted-window multi-head self-attention module (SW-MSA).

Figure 2 .
Figure 2. Two consecutive Swin Transformer blocks. The first contains a regular window multi-head self-attention module (W-MSA), while the latter uses a shifted-window configuration (SW-MSA).

Figure 3 .
Figure 3.The system block diagram of the hardware implementation for the proposed intelligent helmet.

Figure 4 .
Figure 4. Flowchart of the detection mode