Malware Statistical analysis and countermeasures
Written By: M0SA, August 2011 m0sa.vx@gmail.com
Metamorphism is becoming complex and harder to detect, so algorithmic approaches
for detection is in turn becoming more complex and more infeasible for PCs due
to restriction in execution time and memory. The new trend in metamorphic code
detection is the statistical analysis. In this article I will give a quick
overview on statistical analysis and then explain a new approach appeared
in late 2010 called Eigenviruses, and finally, how AVers could beat those
techniques.
Here is the layout of the article:
. What is the problem?!
. Statistical Analysis
. Eigen-Viruses
. How to beat it
. Anti-anti statistical analysis
What is the problem?!
VXers not only always find new ways to beat debuggers and emulators, but also
find new effective ways to obfuscate the new generations of code to look far
different from previous generations.
There are many tricks to beat emulators and get them stuck at specific part of
your code, and so they abort detection process and your code gets executed as
you want. This tricks varying from using instructions that some emulators cannot
handle, such as MMX instructions, or using techniques like tau-obfuscation.
Consequently, emulators are becoming more complex in order to bypass such
techniques and in the same time evolution of emulators has a limit of execution
feasibility in terms of time and/or memory.
Statistical Analysis
Statistical analysis of a set of data means to deduce hidden information about
this data. Statistical analysis is used in many fields around us, such as data
mining, face recognition, handwriting recognition, voice recognition, image
recognition and classification and much more.
Mainly any statistical technique is either:
1- Speculates what is the next pattern of a set of data would be, or
2- Extracts the principal components of a set of data and determine what the common pattern among them is.
Examples of statistical analysis techniques are:
1- HMM (Hidden Markov Model) [1]:
Hidden Markov Model is already used in many applications other than malware
detection. Mark Stamp and Wing Wong introduced it as a method for metamorphic
virus detection in 2006. HMM system is mainly given a set of sample files of a
metamorphic code, and then it predicts with certain probability what could the
next file look like. Then when an unknown input file is tested against the
system, a distance of the input file and the trained model is measured, if it is
below some threshold, then the file belongs to the compared virus family,
otherwise the file is clean.
To beat it: try to look like a benign file in other words, use instructions and
structures of benign files.
2- Bayesian analysis [2]:
A naïve Bayesian classifier is a probabilistic classifier based on Bayes theorem
applied with strong independence assumptions. The technique tries to extract
features of a given virus, and gives a probability for each feature of how
strong a single feature contributes in the virus. Then if an unknown file is
given, the previously extracted features is checked against the new file, if the
file is similar to a certain extent, then the file belongs to the tested virus
family, otherwise the file is clean.
To beat it: Also try to look like a benign file.
3- Statistical analysis of Byte-level content [3]:
Mainly and roughly speaking, the technique is based on n-gram classification
approach. It builds a model of learnt features of a training set to classify an
input file as malware or benign.
To beat it: Also try to look like a benign file.
4- Entropy analysis [4]:
Mainly used to automate the process of identification of packed or encrypted
malware. Basically, entropy measures the ability or probability of independent
prediction of the next byte in a series of bytes. High entropy means difficulty
of predicting the next byte of a sequence, thus it means that the sequence
contains random (or more specifically pseudo-random) bytes.
To beat it: deceive the entropy probability so put guessable patterns among
pseudo-random bytes. That is, you can put sequence of the same byte or gradually
increasing byte values in between blocks of packed code, and maintaining the
decryption process to bypass those low frequency sequences.
5- Eigenviruses [5]: to be discussed in the next section.
VXers’ techniques to beat emulators are completely useless against statistical
technique, because the statistical analysis techniques do not rely on the
behavior of the code and don’t even try to execute it.
Eigen-Viruses
One of the most common and powerful techniques for face recognition is
“Eigenfaces”, it was proposed by Matthew Turk and Alex Pentland in the beginning
of 90s. Roughly speaking, the theory behind Eigenfaces is that although multiple
images for the same person may seem different due to age, light direction and
pose of the face, a common pattern can be extracted from these set of images so
they can all be mapped to the same person.
The same approach is applied to morphed code in Eigenviruses approach.
Eigenviruses is a new technique proposed in late 2010 and is under publication
in IET journal of information security. Eigenviruses treats the code of the
metamorphic virus as a face image. The Eigenviruses system is given a set of
sample files of a certain metamorphic code, and then the system can recognize
the common pattern among them. The more the sample files, the more accurate
information are learnt about the code.
The authors of the technique showed very good result of detecting well known
metamorphic viruses such as MetaPHOR and Zmist. However, Eigenviruses is like
any other method it can be hacked. And that’s what I will explain in the next
section.
How to beat it
The idea behind any anti-statistical analysis technique is to have a big
difference as possible of the byte values of the same part of the code across
generations. For example, look at figure 1(a) and 1(b). Consider that the first
and fourth line are meaningful instructions (they might don’t have meaning in
this example), the second and the third line are random bytes. The difference
between the byte values of the second and third line are the maximum difference
that could be.
Figure 1. Two generation of a metamorphic code
9E 87 65 7C 9B 21 35 6D 98 9A 12 9E 87 65 7C 9B 21 35 6D 98 9A 12
00 00 00 00 00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF FF FF FF
->
00 00 00 00 00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF FF FF FF
8D 93 55 46 82 6F 4D 65 5A 66 2C 8D 93 55 46 82 6F 4D 65 5A 66 2C
a) Nth generation b) (N+1)th generation
Therefore, beating a statistical analysis technique can be done by some tricks:
1- Inserting totally random bytes at random locations while maintaining the same
flow of execution.
2- Size, code size is an important feature taken care of by most of statistical
analysis techniques. The more variable code size of the samples, the more
difficult to extract a common pattern, and so the more difficult to detect the
virus. This can be done by inserting a big size of junk code – in different
areas — that might be even equal or more than the actual virus code size.
3- Code reordering is also a good trick against statistical analysis.
There can be more tricks based on the idea of byte value difference that could
be easily implemented.
Anti-anti-statistical analysis
Fortunately – from AVers point of view- the aforementioned tricks can be beaten.
An emulator or normalizer can be run on the code to clean it from random bytes
chunks or reorder the code based on its execution sequence. However, fortunately
– from VXers view this time- we get back to emulators again which we know how to
beat them and where all anti-emulators and anti-debugging technique apply.
References
[1] Wong,W., Stamp,M., “Hunting for metamorphic engines”, Journal in Computer Virology, 2(3), pp. 211–219, 2006.
[2] InSeon Yoo and Ulrich Ultes-Nitsche, “How to Predict Email Viruses Under Uncertainty”, IEEE International Conference on Performance, Computing, and Communications, 2004.
[3] SM Tabish, MZ Shafiq, “Malware detection using statistical analysis of byte-level file content”, Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, 2009.
[4] Lyda, R.; Hamrock, J., “Using Entropy Analysis to Find Encrypted and Packed Malware”, Security & Privacy, IEEE, 2007.
[5] A. Nabi, Moustafa Saleh, A. Baith, “Eigenviruses for Metamorphic Virus Recognition”, IET journal of Information security, Under publication.