VALHALLA Issue #1

2019 Re-edition

Malware Statistical analysis and countermeasures 

Written By: M0SA, August 2011 m0sa.vx@gmail.com 

Metamorphism is becoming complex and harder to detect, so algorithmic approaches
for detection is in turn becoming more complex and more  infeasible  for PCs due
to restriction in execution time and memory.   The new trend in metamorphic code
detection is the statistical analysis.  In  this  article  I will  give a  quick
overview on  statistical  analysis  and  then  explain a new  approach  appeared
in late 2010  called  Eigenviruses, and  finally, how  AVers  could  beat  those 
techniques. 

Here is the layout of the article: 

. What is the problem?! 
. Statistical Analysis 
. Eigen-Viruses 
. How to beat it 
. Anti-anti statistical analysis 


What is the problem?! 

VXers  not  only always find new ways to beat debuggers and emulators, but  also 
find  new  effective  ways  to obfuscate the new generations of code to look far
different from previous generations. 

There  are  many tricks to beat emulators and get them stuck at specific part of
your  code, and  so  they abort detection process and your code gets executed as
you want. This tricks varying from using instructions that some emulators cannot
handle, such  as MMX instructions, or  using  techniques  like  tau-obfuscation.
Consequently,  emulators  are  becoming  more  complex  in  order to bypass such
techniques and in the same time  evolution of emulators has a limit of execution 
feasibility in terms of time and/or memory. 


Statistical Analysis 

Statistical analysis  of a set of data means  to deduce hidden information about
this data.   Statistical analysis is used in many fields around us, such as data
mining,  face  recognition,  handwriting  recognition, voice  recognition, image
recognition and classification and much more. 


Mainly any statistical technique is either: 

1- Speculates what is the next pattern of a set of data would be, or 
2- Extracts the principal components of a set of data and determine what the common pattern among them is. 


Examples of statistical analysis techniques are: 

1- HMM (Hidden Markov Model) [1]: 

Hidden  Markov  Model  is already  used  in many applications other than malware 
detection.    Mark Stamp and Wing Wong introduced it as a method for metamorphic 
virus  detection in 2006.  HMM system is mainly given a set of sample files of a 
metamorphic  code, and  then it predicts with certain probability what could the
next  file  look  like.  Then  when an unknown  input file is tested against the 
system, a distance of the input file and the trained model is measured, if it is
below  some  threshold,  then  the  file  belongs to the compared  virus family, 
otherwise the file is clean.

To beat it:  try to look like a benign file in other words, use instructions and
structures of benign files. 


2- Bayesian analysis [2]: 

A naïve Bayesian classifier is a probabilistic classifier based on Bayes theorem
applied  with  strong independence assumptions.   The technique tries to extract
features of a given  virus, and gives a probability  for  each  feature  of  how
strong  a  single  feature  contributes in the virus. Then if an unknown file is
given, the previously extracted features is checked against the new file, if the
file  is  similar to a certain extent, then the file belongs to the tested virus
family, otherwise the file is clean. 

To beat it: Also try to look like a benign file. 

 

3- Statistical analysis of Byte-level content [3]: 

Mainly  and  roughly  speaking, the technique is based on  n-gram classification
approach.  It builds a model of learnt features of a training set to classify an
input file as malware or benign. 

To beat it: Also try to look like a benign file. 

4- Entropy analysis [4]: 

Mainly  used  to  automate  the process of identification of packed or encrypted 
malware.   Basically, entropy measures the ability or probability of independent
prediction of the  next byte in a series of bytes. High entropy means difficulty
of predicting the next byte  of  a  sequence, thus it means  that  the  sequence
contains random (or more specifically pseudo-random) bytes. 

To beat it:  deceive  the  entropy  probability  so put guessable patterns among
pseudo-random bytes. That is, you can put sequence of the same byte or gradually
increasing  byte  values  in between  blocks of packed code, and maintaining the
decryption process to bypass those low frequency sequences. 

 

5- Eigenviruses [5]: to be discussed in the next section. 



VXers’  techniques  to beat emulators are completely useless against statistical
technique,  because  the  statistical  analysis  techniques  do  not rely on the 
behavior of the code and don’t even try to execute it. 



Eigen-Viruses 

One  of  the  most  common  and  powerful  techniques  for  face  recognition is 
“Eigenfaces”, it was proposed by Matthew Turk and Alex Pentland in the beginning
of 90s. Roughly speaking, the theory behind Eigenfaces is that although multiple
images for the same person may seem  different  due  to age, light direction and
pose of the face, a common pattern can be extracted  from these set of images so
they can all be mapped to the same person. 

The  same  approach  is  applied  to  morphed  code  in  Eigenviruses  approach. 
Eigenviruses  is  a new technique proposed in late 2010 and is under publication
in IET journal of information security.   Eigenviruses  treats  the  code of the
metamorphic virus as a face image.   The  Eigenviruses  system is given a set of 
sample  files  of a  certain metamorphic code, and then the system can recognize
the  common  pattern  among  them.  The more the sample files, the more accurate 
information are learnt about the code. 

The  authors  of the  technique  showed very good result of detecting well known 
metamorphic  viruses  such  as MetaPHOR and Zmist. However, Eigenviruses is like
any other method it can be hacked.   And  that’s what I will explain in the next
section. 


How to beat it 

The  idea  behind  any  anti-statistical  analysis  technique  is  to have a big
difference as possible of  the  byte values of the same  part of the code across
generations. For example, look at figure 1(a)  and 1(b). Consider that the first
and fourth line are meaningful instructions (they might don’t  have  meaning  in
this example), the second and the third line are random bytes.   The  difference
between the byte values of the second and third line are the maximum  difference
that could be. 



               Figure 1. Two generation of a metamorphic code 


9E 87 65 7C 9B 21 35 6D 98 9A 12                9E 87 65 7C 9B 21 35 6D 98 9A 12 

00 00 00 00 00 00 00 00 00 00 00                FF FF FF FF FF FF FF FF FF FF FF
                                      ->
00 00 00 00 00 00 00 00 00 00 00                FF FF FF FF FF FF FF FF FF FF FF

8D 93 55 46 82 6F 4D 65 5A 66 2C                8D 93 55 46 82 6F 4D 65 5A 66 2C

       a) Nth generation                             b) (N+1)th generation 



Therefore, beating a statistical  analysis technique can be done by some tricks: 

1- Inserting totally random bytes at random locations while maintaining the same
flow of execution.

2- Size, code size is an important feature  taken care of by most of statistical
analysis techniques.  The  more  variable  code  size  of  the samples, the more
difficult to extract a  common  pattern, and so the more difficult to detect the
virus.  This  can  be  done by inserting a big size of junk code – in  different
areas — that might be even equal or more than the actual virus code size. 

3- Code reordering is also a good trick against statistical analysis. 

There can be more  tricks  based on the idea of byte value difference that could
be easily  implemented. 


Anti-anti-statistical analysis 

Fortunately – from AVers point of view- the aforementioned tricks can be beaten.
An emulator  or  normalizer can be run on the code to clean it from random bytes
chunks or reorder the code based on its execution sequence. However, fortunately
– from VXers view this time- we get back to emulators again which we know how to
beat them and where all anti-emulators and anti-debugging technique apply.

 

 

References 

[1] Wong,W., Stamp,M., “Hunting for metamorphic engines”, Journal in Computer Virology, 2(3), pp. 211–219, 2006. 
[2] InSeon Yoo and Ulrich Ultes-Nitsche, “How to Predict Email Viruses Under Uncertainty”, IEEE International Conference on Performance, Computing, and Communications, 2004. 
[3] SM Tabish, MZ Shafiq, “Malware detection using statistical analysis of byte-level file content”, Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, 2009. 
[4] Lyda, R.; Hamrock, J., “Using Entropy Analysis to Find Encrypted and Packed Malware”, Security & Privacy, IEEE, 2007. 
[5] A. Nabi, Moustafa Saleh, A. Baith, “Eigenviruses for Metamorphic Virus Recognition”, IET journal of Information security, Under publication.