Issues of Sequential Search Pattern Matching & Pattern Matching Method for Improve performance

Issues of Sequential Search Pattern Matching & Pattern Matching Method for Improve performance

By: R&D Deputy Director Seung-il Song

To protect internal resources from external attacks, the internet environment utilizes various systems as a security measure. One of the most frequently used methods is called pattern matching. As attacks diversify, the number of patterns to respond to them continues to increase. Such an increase causes a decrease in system performance. In this paper, I will discuss ways to improve performance.

-Contents-

Issues of Sequential Search

Methods of Improvement

Conclusion

Issues of Sequential Search

In computer science, pattern matching is a method of specifying where the pattern is in the data, or if it is there at all. This article will cover string pattern matching to find if there is a match in the target string. To find a pattern that matches the string, you must begin from the beginning of the target string to compare it to the pattern list to look for a match. If multiple patterns need to be found, the process is repeated. This method is perfectly fine when there are only a few patterns that require low computing power. However, if you have hundreds and thousands of patterns, your security appliance operating this way can slow down the internet or even stop it from working.

[Pic] Visualization of sequential search using a pattern list

The above picture is a visualization of a sequential search using a list. It circles through the target string repeatedly to find the pattern. If a string is matched with a pattern at the beginning of the list, a search can be done quickly but if there are no matches or the match is with the pattern at the end of the list, it can take up a long time to look for the match. ■ Methods of Improvement There are a few things to consider when improving the performance of pattern matching.

Pattern changes do not occur frequently.
Implementation should be easy and simple.
The ability to use wildcard characters is a must.
Memory usage is limited. Several processes operate simultaneously and each has to load the same patterns, sharing the memory.
We create the pattern.

Aho-Corasick is widely known as a multi-string pattern matching method. This method uses a data structure called Trie, which can detect matching patterns with a single path search. It is known as the best performing string search method. However, depending on the input pattern, the memory can be overworked, and it has difficulty using wildcard characters. Therefore, we need to find a new method that uses an appropriate amount of memory and has better performance.

[Pic] Trie data structure example

The new method consists of a pre-processing step and a pattern matching step. The pre-processing step is a step of making a dedicated structure (pattern map) for improving the performance of pattern matching since pattern change does not occur frequently. Once this is completed, there is no need to repeat the process until patterns change.

The picture above depicts how the patterns are structured. The first 2 bytes of the pattern is indexed, categorized, and saved. This can be seen as an array of 64k sized list. In each list, patterns starting with the same 2 bytes are arranged in ascending order. The sorted list has a referrer, which directs to each item. The referrer is an element for binary search. Each list has a minimum pattern length of information, so if the target string is shorter than the minimum length, the matching step can be omitted. The pattern matching step finds matches using the pattern map created in the earlier step. The pattern matching step has to be faster than other methods to have improved performance.

The picture above shows the process of matching patterns in the target string. 2 bytes are extracted into the index while moving 1 byte from the beginning of the target string, the search pattern list corresponding to the index is extracted from the pattern map, and the pattern list uses the referrer to find matching patterns via binary search. This method distributes the entire pattern to one of the 64K lists based on the first 2 bytes. By doing this, you can reduce the overall number of comparisons and find them in the order they appear in the target string. Each list is sorted, so by performing a binary search, even if the number of patterns increases, the performance does not decrease significantly. Of course in the worst-case scenario, if the first two bytes of all the patterns are the same, everything will be put into a single list. Even so, a binary search is possible, so it performs better than a regular sequential search. The probability of this happening is slim to none and you can simply avoid by not creating patterns this way. ■ Conclusion The method described so far is not theoretically ideal. It definitely has its limitations and faults. However, it can be improved and is easy to implement. It performs better than Aho-Corasick on certain patterns used in Monitorapp’s products. Of course, if there are many matching patterns in the beginning, it is hard to beat Aho-Corasick. It is difficult to solve everything through programming. However, if you make a pattern according to the aforementioned structure, or if you understand the characteristics of the pattern and develop it in a form suitable for it, you can improve the performance significantly.

Products		Company	Support
WAAP	SWG	Who we are	Contact us
CDN	CASB	What we do	FAQ
DDoS Mitigation	FWaaS	Career
WMS	NGDPI
	ZTNA

AIONCLOUD BLOG