Data Mining in Digital Forensics
Data mining is the analysis of observational data sets to find unsuspected relationships within the data and to summarize the findings in a way that is understandable and useful to the data owner. The data used is typically collected for some purpose other than data mining analysis, which means that data mining objectives play no role in the data collection strategy. Data mining usually focuses on large data sets, which brings new problems: how to store or access the data, how to analyze it in a reasonable amount of time, and how to decide whether an apparent relationship is merely a chance occurrence that does not reflect any underlying reality. Since data mining research began in the 1980s, it has influenced the development of machine learning, artificial intelligence, and pattern recognition. Data mining has also been adopted in other fields such as criminal forensics. Techniques used in crime data mining include entity extraction, clustering, deviation detection, classification, string comparison, and social network analysis. Entity extraction is used to identify persons, addresses, vehicles, narcotics, and personal properties from police reports. Clustering is used to automatically associate different objects (such as persons, organizations, and vehicles) in crime reports. Deviation detection is applied to fraud detection, network intrusion detection, and other crime analyses. Classification has been used to detect email spamming and to find the authors of unsolicited emails. A string comparator is used to detect deceptive information in criminal records. Social network analysis is used to analyze the roles of criminals and the associations among entities in a criminal network.
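The string comparator idea mentioned above can be illustrated with a minimal sketch. It assumes that deceptive records differ from genuine ones by small spelling changes, and it uses Levenshtein edit distance as the comparator; the field names ("name", "dob") and the averaging of field scores are hypothetical choices made for illustration, not a description of any particular police system.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def record_similarity(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Average field-level similarity (hypothetical scoring rule)."""
    scores = [similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores)


# Two records that may describe the same person under a slightly altered name.
rec_1 = {"name": "John Smith", "dob": "1980-01-02"}
rec_2 = {"name": "Jon Smith", "dob": "1980-01-02"}
score = record_similarity(rec_1, rec_2, ["name", "dob"])
```

A high score only flags a pair of records for review; the threshold at which two records are treated as the same person would have to be chosen from labeled data.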
Data mining has also been applied in intrusion detectors to analyze audit data, extract evidence of attacks, and automatically compute rules that detect them. Intrusion detection systems (IDSs) provide an extra layer of defense against malicious use of computer systems by sensing a misuse or a breach of security policy and alerting operators to an ongoing attack. IDSs usually operate in a managed network between a firewall and internal network elements. Streams of data in the network are inspected and security rules are applied to determine whether an attack is underway. The problem with current IDSs is that they are expensive and slow to develop. They are also difficult to tune to local cost-benefit parameters and are ineffective at detecting certain attacks. Better systems require new technologies that extend detection to cover both known and new attacks and provide better security at an optimal cost. At Columbia University, a team of researchers developed an approach to intrusion detection based on data mining of audit sources. Detection models are constructed using cost-sensitive machine learning algorithms to achieve optimal performance. The system finds clusters of attack signatures and normal profiles and builds one lightweight model for each cluster to maximize the utility of each model. After several years of work, the Columbia team built a data mining facility for intrusion detection that can gather data, analyze it, define costs in terms of observable features and the systems being protected, automatically compute cost-effective models of known attacks, and deploy those models into a functioning detector. This type of work has led to developments that extend intrusion detection into a new generation of systems that can better protect the critical infrastructure of complex computer networks.
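To make the notion of cost sensitivity concrete, the following is a minimal sketch and not the Columbia team's method. It assumes a detector that already assigns each event an anomaly score, and it picks the alert threshold that minimizes total cost given a hypothetical cost for each false alarm (fp_cost) and each missed attack (fn_cost). All names and numbers are illustrative assumptions.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Sum the cost of false alarms and missed attacks at a given threshold."""
    cost = 0.0
    for score, is_attack in zip(scores, labels):
        alert = score >= threshold
        if alert and not is_attack:
            cost += fp_cost      # false alarm: operator time wasted
        elif not alert and is_attack:
            cost += fn_cost      # missed attack: damage goes undetected
    return cost


def best_threshold(scores, labels, fp_cost, fn_cost):
    """Try every observed score (plus 'never alert') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost))


# Toy audit data: anomaly scores and whether each event was really an attack.
scores = [0.2, 0.5, 0.55, 0.8]
labels = [False, True, False, True]

# When missed attacks are expensive, the detector should alert more readily.
cautious = best_threshold(scores, labels, fp_cost=1, fn_cost=10)
# When false alarms are expensive, it should alert only on strong evidence.
relaxed = best_threshold(scores, labels, fp_cost=10, fn_cost=1)
```

The same scores produce different operating points depending on the costs, which is the sense in which a detector can be tuned to local cost-benefit parameters.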
Digital forensics has many applications, including law enforcement, fraud investigation, and the investigation of theft or destruction of intellectual property. Techniques used in these investigations include data mining and analysis, timeline correlation, and information hiding analysis. Because multimedia formats are widely used and easily available through the Internet, there has been an increase in criminal activity involving the transmission and use of inappropriate material such as child pornography. Digital Image Forensics (DIF) seeks evidence by using appropriate techniques based on image analysis, retrieval, and mining. In a digital forensic investigation, the data that investigators can access may be partial, hidden, or encrypted. An investigation involving digital evidence consists of a sequence of rigorous steps: extracting the data while maintaining the integrity of the original media, filtering out irrelevant data and identifying the useful data and metadata, deriving timelines, establishing relationships between disparate data, establishing causal relationships, identifying and extracting profiles, and generating a comprehensive report. To support an efficient image mining system architecture, an operational model of the digital forensic image mining process was developed. This model consists of two activities: one reduces the large amount of evidence, and the other covers the core image mining activities that deal with the actual image retrieval process for digital forensic examination.
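The first step listed above, extracting data while maintaining the integrity of the original media, is commonly checked by comparing cryptographic hashes of the original and the working copy. The sketch below assumes the evidence is available as ordinary files; the function names and the choice of SHA-256 are illustrative assumptions rather than a prescribed procedure.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large images fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_copy(original_path, copy_path):
    """Return True when the working copy is bit-for-bit identical."""
    return file_digest(original_path) == file_digest(copy_path)
```

In practice the digest of the original would be recorded at acquisition time and rechecked before and after each analysis step, so that any alteration of the working copy can be detected.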
Criminal network analysis is often applied to investigations of organized crimes such as terrorism, narcotics trafficking, fraud, and gang-related crimes. Since information can flow to and from different crime groups, criminal network analysis needs to integrate information from multiple crime incidents, or even multiple sources, and discover regular patterns in the structure, organization, operation, and information flow of criminal networks. To untangle and disrupt criminal networks, both reliable data and sophisticated techniques are needed. Today, criminal network analysis remains primarily a manual process that consumes a great deal of human time and effort. Mining law enforcement data also faces obstacles such as incomplete, incorrect, or inconsistent data. Criminal network analysis approaches can be divided into three generations. The first generation is the manual approach, in which an analyst constructs an association matrix by identifying criminal associations from raw data. Although a manual approach is helpful in crime investigations, it becomes extremely ineffective and inefficient when data sets are very large. The second generation is the graphics-based approach, in which a tool automatically produces graphical representations of criminal networks. Even though second-generation tools can visualize criminal networks in various ways, they do not provide analytical functionality. The third generation is social network analysis (SNA), which is expected to provide more advanced analytical functionality to assist crime investigation. Sophisticated analysis tools are needed to mine large volumes of data and discover useful knowledge about the structure and organization of criminal networks.
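The association matrix of the first generation and a simple SNA measure from the third can be sketched together. The example assumes that each incident report has already been reduced to a list of the people named in it; co-occurrence counting and degree centrality are illustrative choices, and the identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def association_matrix(incidents):
    """Count how often each pair of people appears in the same incident."""
    counts = defaultdict(int)
    for people in incidents:
        # sorted(set(...)) removes duplicates and gives each pair one key.
        for a, b in combinations(sorted(set(people)), 2):
            counts[(a, b)] += 1
    return dict(counts)


def degree_centrality(matrix):
    """Share of the other network members each person is directly linked to."""
    neighbours = defaultdict(set)
    for a, b in matrix:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {person: 0.0 for person in neighbours}
    return {person: len(links) / (n - 1)
            for person, links in neighbours.items()}


# Hypothetical incident reports, each listing the people involved.
incidents = [["A", "B", "C"], ["A", "B"], ["A", "D"]]
matrix = association_matrix(incidents)
centrality = degree_centrality(matrix)
```

Here person A is linked to everyone else and receives the highest centrality, which is the kind of signal an analyst might use to identify a central member of a network. Real analyses typically combine several measures, such as betweenness and closeness, rather than relying on degree alone.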
Email has become the dominant form of inter- and intra-organizational written communication for many companies. Unfortunately, email can be misused to distribute unsolicited or inappropriate messages and documents. This includes sending junk mail, conveying sensitive information, and mailing offensive or threatening material. In some misuse cases, the sender will attempt to hide their real identity to avoid detection, which can lead to the use of computer forensics to mine for email authorship. The text body of the email is not the only source of authorship evidence. Other forms of evidence include email headers, email traceroutes, email attachments, and file timestamps. Humans tend to have unique patterns of behavior, biometric attributes, and so on. Certain characteristics of language, composition, and writing, such as particular syntactic and structural layout traits, patterns in vocabulary, unusual language usage, and sub-stylistic features, will therefore remain relatively constant for a given author. Text categorization assigns each document in a set to a category based on its content or topic. Most text categorization techniques employ the “bag-of-words” or word vector space feature representation, in which each word in the text document corresponds to a single feature. Algorithms such as decision trees, neural networks, Bayesian probabilistic approaches, or support vector machines are then used to classify the text document. Machine learning methods can likewise be used to build an authorship attribution classifier with email documents as the data set.
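The bag-of-words representation and a Bayesian classifier described above can be combined in a small sketch. It assumes a set of emails with known authors is available for training, and it uses multinomial naive Bayes with add-one smoothing as one simple instance of a Bayesian probabilistic approach. The class name, tokenizer, and sample emails are hypothetical, and word counts alone ignore the structural and stylistic features that the text also mentions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesAuthorship:
    """Multinomial naive Bayes over word counts, one class per author."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # author -> word frequencies
        self.doc_counts = Counter()              # author -> number of emails
        self.vocabulary = set()

    def fit(self, emails):
        for author, text in emails:
            tokens = tokenize(text)
            self.word_counts[author].update(tokens)
            self.doc_counts[author] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_author, best_score = None, -math.inf
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Add-one smoothing avoids log(0) for unseen author/word pairs.
                score += math.log((counts[token] + 1)
                                  / (total_words + vocab_size))
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Hypothetical training emails with known authors.
training = [
    ("alice", "Kindly find the report attached, regards"),
    ("alice", "Kindly review the attached figures, regards"),
    ("bob", "hey check this out lol"),
    ("bob", "hey gonna send it later lol"),
]
model = NaiveBayesAuthorship().fit(training)
guess = model.predict("Kindly see the attached file, regards")
```

With so little training text the result is only illustrative; a usable classifier would need many emails per author and an evaluation on held-out messages.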
Data mining is a proven technology in many industrial sectors, and police forces have attempted to apply advanced computing technologies to difficult crimes. One way law enforcement has tried to use this technology is in linking crimes of a serious sexual nature. The challenge is to determine whether separate offenses may have been committed by the same offender. With databases of offenders and crimes, it is possible to model offender behavior by linking crimes, but this model has limitations. The data set may be incomplete, unsolved crimes held in the database may in fact have been committed by one of the known offenders, and it has to be assumed that the known offenders actually committed the crimes attributed to them. The quality of the data directly affects the quality of the results of the mining process.
Data mining has become a significant part of digital forensics and has been applied in several areas of computer forensics. It has advanced IDSs toward faster and more accurate detection, supported image mining to find inappropriate material on the Internet, helped break down the networks of criminal organizations through criminal network analysis, and assisted in detecting email misuse and in linking serious sexual offenses. Law enforcement and other government agencies have found data mining useful for solving crimes and learning more about crime groups. Computer forensics has evolved to a new level through the involvement of data mining and related techniques such as machine learning and artificial intelligence. As technology advances, faster computers and better systems for handling the large data sets that data mining requires can make it more powerful still. Digital forensics is likely to keep improving with the use of data mining.