From the course: Advanced Cyber Threat Intelligence

Introduction to data processing

- [Alyssa] Hello, and welcome again to the Advanced Cyber Threat Intelligence course. This video is an introduction to the second module, Data Processing. In most cases, the data that we collect from multiple sources comes in various formats, because the sources themselves differ in nature. In other words, we are combining two or more data sources, including internal and external sources, or finished reports and threat feeds. This combination is necessary to keep an eye on the full picture, the full threat landscape, but you will want to make sure that you don't generate duplicate alerts. This is why going through the processing phase is essential. In this short video, we will introduce the data processing phase, the different stages involved in the processing of data, and why it is important for threat intelligence.

Let me start with a quick definition. Data processing is the transformation of the collected data into a format usable by the organization. Almost all raw data collected needs to be processed in some manner, whether by humans or machines. Keep in mind that if you are collecting your data from multiple sources with different formats, then you'll need different processing approaches. The time needed to obtain the desired result depends on the operations that must be performed on the collected data and on the nature of the required output.

At a high level, the most common approaches used for automated processing today include basic pattern matching, such as regular expressions, to identify data that is or is not of interest; statistical or probability algorithms to identify things that are or are not similar; machine learning algorithms to provide statistical classification of what is or is not normal or expected; and natural language processing of human-produced text to extract sentiment, intent, purpose, target, or topic.
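To make the first of these approaches concrete, here is a minimal Python sketch of regular-expression matching applied to a piece of report text. It is not taken from the course: the function name extract_indicators, the patterns, and the sample sentence are assumptions chosen for illustration, and the patterns are deliberately simple (they cover only IPv4 addresses and MD5 or SHA-256 hashes).

```python
import re

# Illustrative patterns only (an assumption of this sketch, not a standard).
# Each IPv4 octet is limited to 0-255; hashes are matched by length alone.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")


def unique(values):
    """Drop repeats while keeping the order of first appearance."""
    return list(dict.fromkeys(values))


def extract_indicators(text):
    """Return the IPv4 addresses and hashes found in free text.

    Hypothetical helper: the result is a dict keyed by indicator type.
    Hashes are lowercased so the same hash is not reported twice.
    """
    return {
        "ipv4": unique(IPV4_RE.findall(text)),
        "md5": unique(h.lower() for h in MD5_RE.findall(text)),
        "sha256": unique(h.lower() for h in SHA256_RE.findall(text)),
    }


if __name__ == "__main__":
    # Made-up sample text using documentation-reserved IP addresses.
    sample = (
        "Beacon to 203.0.113.7 and 198.51.100.23; dropper hash "
        "D41D8CD98F00B204E9800998ECF8427E seen again from 203.0.113.7."
    )
    print(extract_indicators(sample))
```

A sketch like this only identifies strings that look like indicators. Deciding whether a match is actually malicious or relevant still requires the context discussed in the rest of this video.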
When it comes to limitations, even with machine learning and expert systems, there is still no replacement for the human analyst today, and thus there is no fully automated way to produce high-quality, tailored threat intelligence.

Now let's talk about the human-based approach. In this method, data is processed manually, without the use of machines. This reliance on humans as part of the process arises from a unique trait that we have over computers: our ability for adaptive reasoning, or in other words, our ability to solve problems and to think laterally. In the case of finished reports, it's difficult to build software that automates the extraction of indicators, because some of them are uncommon items. Some reports may describe incidents without explicitly mentioning IOCs. An analyst can still create an HTTP indicator based on such a report, while a tool will probably be unable to classify or normalize the threat properly. As a result, threat intelligence analysts are able to go beyond what any fully automated system can do nowadays in terms of finding related events, observables, tactics, techniques, procedures, and actors, while also providing valuable context and meaning to the business.

Data processing is a composite phase: it is considered a combination of sorting and filtering, normalization, and storage and integration. Sorting and filtering is often referred to as pre-processing, and it is the stage at which raw data is cleaned up and organized for the following stage of data processing. Basically, if you are collecting data from several sources, you will need to make sure to eliminate bad data, including duplicate, incomplete, or incorrect data. The second stage is normalization, and here we choose the standard or format that is the most suitable for our requirements. For example, if the output is an indicator that will be added to a watch list, then the format should be compatible with the SIEM solution used in our organization.
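As a rough illustration of the first two stages, the following Python sketch filters out incomplete and incorrect records, normalizes the rest, and drops duplicates. The record layout (type, value, source), the function names, and the handling of "[.]" defanging are hypothetical choices made for this example; they are not a format prescribed by the course or by any SIEM product.

```python
import ipaddress
from typing import Iterable, List, Optional


def normalize(record: dict) -> Optional[dict]:
    """Return a cleaned copy of one raw record, or None if it is unusable.

    Assumes (for this sketch only) that raw records are dicts with
    "type", "value", and "source" keys.
    """
    kind = str(record.get("type") or "").strip().lower()
    value = str(record.get("value") or "").strip().lower()
    if not kind or not value:
        return None  # incomplete data: a required field is missing or empty

    # Undo a common "defanging" convention, e.g. example[.]com
    value = value.replace("[.]", ".")

    if kind == "ipv4":
        try:
            value = str(ipaddress.IPv4Address(value))
        except ValueError:
            return None  # incorrect data: not a valid IPv4 address

    return {"type": kind, "value": value, "source": record.get("source", "unknown")}


def preprocess(records: Iterable[dict]) -> List[dict]:
    """Filter, normalize, and de-duplicate records, keeping first-seen order."""
    seen = set()
    cleaned = []
    for record in records:
        item = normalize(record)
        if item is None:
            continue
        key = (item["type"], item["value"])
        if key in seen:
            continue  # duplicate: same indicator already kept from another source
        seen.add(key)
        cleaned.append(item)
    return cleaned


if __name__ == "__main__":
    # Made-up records from three hypothetical sources.
    raw = [
        {"type": "ipv4", "value": "203.0.113.7", "source": "feed-a"},
        {"type": "IPv4", "value": " 203.0.113[.]7 ", "source": "report-b"},
        {"type": "ipv4", "value": "999.1.1.1", "source": "feed-a"},
        {"type": "domain", "value": "", "source": "feed-c"},
        {"type": "domain", "value": "Example[.]COM", "source": "feed-c"},
    ]
    for item in preprocess(raw):
        print(item)
```

De-duplicating on the normalized value rather than on the raw string is the design choice that matters here: the same indicator reported by two sources in two spellings would otherwise produce two watch-list entries and, later, duplicate alerts.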
For normalization, the threat intelligence community has defined multiple standards to describe threats and manipulate threat data. By the end of this stage, raw data takes the form of usable information. The final stage of data processing is storage and integration. We are going to see this stage in more detail in a future video. All of these stages can be done by a single piece of software or by a combination of tools, whichever is feasible or required by your company.

Nowadays, more and more data is collected from multiple sources, free and paid, including network traffic files, malware samples and sandboxing results, finished reports about incidents, lists of email addresses used for phishing campaigns, malicious domains, malicious IPs, et cetera. Dealing with unprocessed data is time-consuming, and sometimes it's difficult or even impossible for analysts to correlate events and make assessments based only on raw data. This is why processing the collected data is really, really important.

That is all for this introduction. In this video, we saw the definition of data processing, some approaches to processing, the different stages of data processing, and why it is important. This video was a quick introduction to the second module, Data Processing. In the next lesson, we are going to discover together some examples of common standards used in cyber threat intelligence.
