Deep Learning Algorithm for Detecting and Analyzing Criminal Activity

Applied across an entire field, automation and autonomous systems are among the rare technologies capable of accelerating progress at an exponential rate. Machine intelligence gives automated systems the ability to perform their tasks reliably, drastically reducing the need for human intervention in repetitive processes. Large-scale technological progress can often be traced back to tasks that were simplified and made more tractable through automation. Following this principle, we propose a product that eliminates or significantly reduces the need for human intervention in core tasks that can be automated. Today's public safety infrastructure relies on surveillance cameras, but these devices are merely video recorders; they have no intelligence of their own. Because surveillance cameras produce a massive amount of data, automated analysis of video streams is now required for event detection. The project's main objective is to increase public safety by automating the detection and review of crime using real Closed-Circuit Television (CCTV) footage. This is achieved by assigning the task of recognizing criminal behavior to a system that can do so automatically, allowing for more precise tracking. In this study, we present a model with a precision of 0.95 for assault and 0.97 for abuse.


I. INTRODUCTION
THE need to automate supervised classification of live-streaming data in real time motivates framing the task as the machine vision problem of image classification. Because the problem is relatively new, some possible approaches remain untested. Beyond that, such applications cover a wide range of uses, including the early detection of significant sports actions or everyday activities occurring in a scene, as well as various security and health activities [1]. The goal of our study is to increase public safety by developing an automated system for measuring and analyzing criminal activity. This system will be able to distinguish between normal and criminal behavior based on inferred patterns, freeing humans from the burden of identifying criminal behavior. The current state of detection technology has several drawbacks that prevent it from working with today's widely available infrastructure [2][3][4]. A major weakness of conventional surveillance systems is the need for a watchful supervisor to review footage and ensure that any unusual activity is properly detected and addressed. Inaccuracies may occur when a person reviews CCTV footage [3,38]. The proposal not only reduces human input and labor by eliminating the need for extra supervision, but also instantly recognizes the type of crime that is taking place, notes the people involved, and takes immediate steps to start mitigation strategies at the crime scene. Deep learning techniques, particularly CNN architectures such as residual models, can be used to automatically identify many crimes and to detect the location of the perpetrator and the weapon in a video. The availability of large datasets such as the UCF-Crime collection [3] and the RWF-2000 database [4] simplifies the creation of automated activity analyzers that increase public safety.

II. LITERATURE REVIEW
Numerous related papers and examples of their use in practice are discussed here. In order to better detect criminal activity, researchers have built a pipeline to recognize firearms in photographs by training classifier models on them. VGGNet-19 was used as the pre-trained model for detection, and results showed an accuracy of 69% and a recall of 75%. A technique was proposed for classifying the presence of weapons in surveillance footage and using that information to establish whether or not a crime took place. Region-based convolutional neural network (RCNN) and faster region-based convolutional network (FRCNN) models were trained on the researchers' data. This research worked with low-quality video and nonetheless predicted crimes accurately. Other researchers provide an overview of common techniques for locating and identifying actions at sporting events. The authors proposed segmenting the activity recognition pipeline into three distinct stages: feature extraction, deep learning representation of clips, and sport classification [5][6][7]. They evaluated their approach using the UCF Sports dataset as a benchmark. In an investigation of CNN structures, researchers demonstrated the effectiveness of convolutional neural networks (CNNs) for video categorization [8]. Their approach outperformed other techniques in terms of reliability and the ability to learn strong features from sparsely labeled data. Studies of transfer learning demonstrate generalization across categorization tasks and imply that the learned features are generic. Researchers pursue a similar objective by studying how to predict and classify criminal behavior using machine learning techniques such as Decision Trees and Naive Bayes classifiers. The datasets in that work make use of geographical information. For data collected in Los Angeles and Denver, they achieved an accuracy of 54% and 51%, respectively [9,39].

III. PROPOSED METHOD
One of the biggest issues with standard surveillance systems is their reliance on an attentive supervisor to watch footage and make sure that any unusual activity is properly recorded and addressed. CCTV footage needs to be reviewed by a human, which can result in mistakes [10]. The proposed work not only aims to eliminate the need for extra supervision in order to reduce manual intervention and labor, but also instantly recognizes the type of crime that is taking place, takes note of the people involved, and triggers actions that start handling the crime scene right away [11].
Identifying criminal activity in surveillance camera feeds is the primary focus of this paper. Further, a separate module uses an existing implementation of a technique called Triplet Loss to identify faces in these streams. The paper therefore proposes two modules. The first uses the techniques outlined here to identify individuals in a CCTV feed [12]. The second uses deep learning techniques for crime detection. A high-level overview of the two modules is shown in Figure 1.
The Face Recognition module uses OpenCV's built-in DNN (Deep Neural Network) module to train a model specifically for facial recognition. This is done by training embeddings with the Triplet Loss function [13,30]: faces are first located in the input data, the data are then cleaned and prepared, and finally the model is trained. After these embeddings are calculated, the model can be used to identify the faces in question. All that remains is to load the model and use a webcam to identify a face by drawing a bounding box around it and reporting a confidence value. This module was evaluated extensively in this paper. The input videos for the crime detection module are first sorted to distinguish between those containing criminal activity and those without. The module is trained with an existing ResNet architecture [14]; accuracy and other metrics are measured after basic preprocessing steps such as data augmentation and converting videos into image frames. Finally, the trained model performs crime detection on a stream simulated with a mobile phone webcam. In this paper, evaluation over a total of six classes achieves an accuracy of over 90%. Altogether, this paper proposes a full pipeline that integrates these two parts to detect the crime and identify the perpetrator simultaneously.

IV. FACE RECOGNITION
Face detection is the process of identifying and returning the position of a face within a picture or video. Face verification then checks whether the image of the face being presented matches one already in the database. To do this, we employ distance metrics such as the L2 norm or cosine similarity to determine the degree to which two faces are alike. Finally, face recognition combines both of these methods to extract salient facial features and assign those features to one or more labels from the dataset used to train the model. In this paper, we propose a way to recognize faces by detecting faces, computing face embeddings, training a Support Vector Machine (SVM) [15,40] on the resulting embeddings, and then recognizing faces in images or simulated video streams. Figure 2 shows the pipeline discussed in this paper. A Caffe model and the OpenFace model are responsible for face detection and feature extraction, respectively. OpenCV uses a Single Shot Detector (SSD) architecture [16] with a ResNet backbone to perform deep learning face detection. Single-shot detection is a method whereby multiple objects can be located in a single image with a single forward pass of one network. The image is discretized into a set of boxes generated around the regions whose feature maps yield high confidence [17,45]. After the confidence associated with each box is determined, the box sizes are adjusted until the best possible detection fit is achieved. When a face is recognized, the final bounding boxes are drawn as shown in Figure 3. Moreover, we can preprocess images and carry out face alignment on datasets, with improved results, by using the dlib library to identify facial landmarks, including the mouth, right and left eyebrows, eyes, nose, and jawline [18]. After performing basic editing and alignment on the provided face, we feed it into the proposed neural network.
Every input batch needs to contain three images: an Anchor Image (the current image of person "A"), a Positive Image (another image of person "A"), and a Negative Image (any other image that is not of person "A"). The neural network uses triplet loss to calculate the face embeddings and fine-tune the weights. As a result, the embeddings of the "Anchor" and "Positive" images lie relatively close together, whereas the embedding of the "Negative" image lies farther away. The facial recognition pipeline begins with the computation of face embeddings using a convolutional neural network (CNN) model in Caffe format; these embeddings are sufficiently distinct from one another to enable the training of a classifier on top of them (such as Random Forests, SGD classifiers, SVMs, and so on) [19].
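As a minimal sketch of the verification step described above, the two distance metrics can be written in plain Python. The function names, the toy embedding vectors, and the threshold of 0.6 are illustrative assumptions, not values taken from this study; in practice the embeddings would come from the face embedding network.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_same_face(a, b, threshold=0.6):
    """Verification rule: accept when the L2 distance is below a threshold.

    The threshold is a hypothetical value; it would need to be tuned
    on held-out pairs for a real embedding model.
    """
    return l2_distance(a, b) < threshold

# Toy 2-D "embeddings" for illustration only.
stored = [0.10, 0.20]
probe = [0.10, 0.25]
print(l2_distance(stored, probe), cosine_similarity(stored, probe))
print(is_same_face(stored, probe))
```

A smaller L2 distance, or a cosine similarity closer to 1, indicates that two faces are more alike.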
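The triplet objective can be illustrated with a short sketch. It assumes the common formulation with squared L2 distances and a margin; the margin of 0.2 and the toy vectors are assumptions for illustration, and the exact loss variant used in this study is not specified in the text.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) triple.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (a hypothetical value here).
    """
    return max(squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin, 0.0)

# Well-separated triple: positive close to anchor, negative far away.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))
# Badly ordered triple: the negative is closer than the positive.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0]))
```

Minimizing this quantity over many triples pulls the anchor and positive embeddings together and pushes the negative embedding away.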

A. DATA AUGMENTATION:
In order to make the most of a small amount of data, our research applies data augmentation techniques [20] that help the model identify patterns in the data. Images can be flipped, rotated, zoomed, translated, scaled, cropped, shifted along the x and y axes, sheared, skewed, converted to black and white, and blurred, among other effects. The purpose of this component is to identify deviations from typical behavior in video footage. The UCF Crimes Dataset is used for training [21,44] because it is the only dataset with recordings of many sorts of crimes, each of which has its own unique characteristics. The dataset includes 13 different types of incidents: accidents, fights, burglaries, shoplifting, robberies, shootings, abuse, arrests, arsons, assaults, thefts, explosions, and vandalism. In all, it contains about 1,900 real-world videos. Figure 5 and Figure 6 show sample frames. Each scenario uses one of three methods: converting, enriching, and augmenting the data. Abuse, assault, fighting, normal, robbery, and vandalism are the six broad categories used to organize the criminal acts examined in this research. The authors reduced the number of dataset classes because the provided videos were too large. With the help of video editing and trimming, five-minute-long videos were condensed to forty-five seconds, with the emphasis placed on the time of the actual incident rather than on irrelevant or misleading material. In low-resolution videos, the relevant parts of the crime scene were cropped and sharpened. Each video clip of a criminal act is individually labeled by hand. The remainder of the video feed is then treated as normal.
Finally, data augmentation was carried out to increase the variety of input data available for training the model. An illustration of data augmentation is shown in Figure 7.
 The residual network (ResNet): ResNet layers are formulated to learn residual functions with reference to the layer inputs rather than learning unreferenced functions [21][22][23][24]. The signal can be passed directly from one convolutional layer to a later one because the layers of the network (18 to 152, depending on the variant) are linked by shortcut connections. These links can carry gradient flows from the last layer back to the first layer of a network, simplifying the training of extremely deep neural networks. Figure 7 (residual block) shows how the shortcut link lets the signal bypass layers as it moves from top to bottom. The designers were able to resolve the vanishing gradient issue using skip connections [25,36,37] and the notion of a residual network. As a result, the input can be connected directly to the output of a block, skipping a few layers.
 Technique for Automatically Categorizing Videos: In order to classify crimes using the proposed pipeline, it is necessary to iterate over each frame of the input video. Each frame is fed into a convolutional neural network [26], and the results are classified independently. The model selects the highest-probability label before writing the output frame. This approach, which only takes a single frame into account, will not work for us because our problem is sequential. For the purposes of crime detection, some connection must be maintained between successive frames of a single video input.
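A few of the augmentation operations listed above can be sketched on a tiny image represented as a nested list of pixel values. This is only an illustration of the idea; the helper names are hypothetical, and the study itself relies on a library image generator rather than hand-written transforms.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate_x(image, shift, fill=0):
    """Shift pixels right by `shift` columns (left if negative), padding with `fill`."""
    width = len(image[0])
    shifted = []
    for row in image:
        if shift >= 0:
            shifted.append([fill] * min(shift, width) + row[:max(width - shift, 0)])
        else:
            shifted.append(row[-shift:] + [fill] * min(-shift, width))
    return shifted

def augment(image):
    """Return several augmented variants of one input image."""
    return {
        "flip": flip_horizontal(image),
        "rotate": rotate_90(image),
        "shift": translate_x(image, 1),
    }

frame = [[1, 2],
         [3, 4]]
for name, variant in augment(frame).items():
    print(name, variant)
```

Each variant keeps the original label, so one labeled frame yields several training samples.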
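The residual idea can be sketched as output = ReLU(F(x) + x), where F is the stack of layers being bypassed. The sketch below is a simplification under stated assumptions: it uses small fully connected layers on plain Python lists instead of the convolutional layers of an actual ResNet, and all names and weights are illustrative.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]

def linear(vector, weights, bias):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """Two-layer residual block: relu(F(x) + x), with F = linear -> relu -> linear."""
    hidden = relu(linear(x, w1, b1))
    residual = linear(hidden, w2, b2)
    # Skip connection: the input is added back before the final activation.
    return relu([r + xi for r, xi in zip(residual, x)])

# With all-zero weights, F(x) = 0 and the block simply passes x through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, [0.0, 0.0], zeros, [0.0, 0.0]))
```

Because the block reduces to the identity when F contributes nothing, adding more blocks does not have to degrade the signal, which is one intuition for why very deep residual networks remain trainable.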

Figure 7. Residual Block
This is accomplished by keeping track of the most recent "N" predictions and computing their average over that window. The pipeline [27] averages the last "N" predictions, selects the label with the highest averaged probability, and returns the result.
 Using the UCF-Crimes Dataset to train the ResNet: Here, we go over the steps involved in training and testing the ResNet. The first step is to locate the image folders used for training and testing [28]. Training parameters are also provided, including batch size, number of epochs, image width and height, and learning rate. The TensorFlow ImageDataGenerator [29,35,41,42,43] component is used to produce the training and test sets. Each epoch's training metrics are recorded and plotted afterward, and the model and its weights are then saved. The model's hyper-parameters, such as the learning rate, the number of epochs, and the batch size, are specified before training begins. When training a CNN [30,46], it is important to choose suitable values for its hyper-parameters. The final training epochs are displayed in Figure 8. The ResNet module was configured as follows [31]:
 Dataset splitting: 75% of the dataset is used for model training and the remaining 25% is kept as test data.
 Total epochs: 50
 Loss: categorical cross-entropy
 Optimizer: stochastic gradient descent (learning rate = 0.0001)
 Metric: accuracy
The following outcomes are used in the analysis and conclusion [31][32][33]:
 False positive: the model reports a crime although no criminal activity actually takes place at the location.
 False negative: a crime takes place but the model does not detect it.
 True positive: a crime takes place and the model correctly detects it.
 True negative: no crime takes place and the model reports none.
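The averaging over the last "N" predictions can be sketched with a fixed-length queue. The class name, the label set, the window size, and the per-frame probabilities below are illustrative assumptions; in the actual pipeline the probabilities would come from the trained network.

```python
from collections import deque

class RollingPredictionAverager:
    """Smooths per-frame class probabilities over the last N frames."""

    def __init__(self, labels, window_size=5):
        self.labels = list(labels)
        # deque with maxlen drops the oldest prediction automatically.
        self.history = deque(maxlen=window_size)

    def update(self, probabilities):
        """Add one frame's probabilities and return (label, averaged probabilities)."""
        self.history.append(list(probabilities))
        count = len(self.history)
        averaged = [sum(column) / count for column in zip(*self.history)]
        best = max(range(len(averaged)), key=averaged.__getitem__)
        return self.labels[best], averaged

# Made-up probabilities for three consecutive frames.
smoother = RollingPredictionAverager(["normal", "fighting"], window_size=3)
for frame_probs in ([0.9, 0.1], [0.4, 0.6], [0.1, 0.9]):
    label, _ = smoother.update(frame_probs)
    print(label)
```

In this toy run the second frame alone would be labeled "fighting", but the averaged output stays "normal" until the evidence persists, which reduces label flicker between successive frames.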
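The four outcomes listed above are the counts from which per-class precision, recall, and accuracy are usually derived. The sketch below shows that arithmetic for one class treated as "positive"; the function names and the small label lists are made up for illustration and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive_label):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive_label and truth == positive_label:
            tp += 1
        elif predicted == positive_label:
            fp += 1
        elif truth == positive_label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

# Hypothetical ground truth and predictions for five clips.
y_true = ["assault", "assault", "normal", "normal", "assault"]
y_pred = ["assault", "normal", "normal", "assault", "assault"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred, "assault")
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

Repeating the computation with each class in turn as the positive label gives per-class figures of the kind reported for abuse and assault.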

A. FACIAL RECOGNITION:
Using roughly 15 photos of each participant, we trained the facial recognition model to distinguish between three different faces. For this module, the authors had to make do with as little data as possible due to financial and computational constraints; the results are depicted in Figure 8, which shows the use of a camera to simulate a CCTV stream. The model accurately identified each face and provided a confidence score and bounding box for its classification.

B. METHOD OF CRIME SPOTTING
With respect to training, Table I and Table II display the results of testing the model over 50 epochs. Both the training loss and the validation accuracy were recorded. We evaluated the trained model with the training dataset, looking for classes such as "fighting," "vandalism," and "abuse." The model accurately predicted the subsequent events in the stream [34,47,48], as shown in Figure 9. Table I shows the accuracy metrics for the proposed system, while Table II shows the resulting performance metrics for the different classes. Figure 10 shows the recorded training loss and accuracy as well as the validation loss and accuracy.

VI. CONCLUSION
The project's face recognition module replicated a facial recognition model and demonstrated a simulated version of the final prototype. After all the data had been prepared, the Crime Detection module trained the ResNet model for several classes and showed good results. In the future, additional facial classes may be used to train the face recognition model. It would be valuable to compile a library of these faces for later use in identifying victims in real time from CCTV footage. Comparing the performance of various facial recognition models would be valuable as well. The model is trained so that the person involved in the activity can also be detected, which is useful for further investigation. It is also important to detect criminal activity in real time from CCTV footage. The paper investigates a limited number of classes due to computational constraints but aims to cover more in the future. The proposed model gives a precision of 0.97 for abuse and 0.95 for assault. The five-minute-long videos were trimmed down to forty-five seconds with the aid of video editing and cutting, with the emphasis placed on the time of the actual incident rather than on unrelated or misleading material, so that only a small amount of data is used. This is the novel approach used in this research.