Saturday, November 24, 2012

Using LCS to Search for Malware Commonalities

Written by:
InfoSecKitten@gmail.com / @InfoSecKitten

Are these the similar? Depends on who you ask.


According to diff, these are very different.  Fuzzy hashing will say the same.  I disagree.

Whenever a new sample of malware comes in, I often have the problem of relating to the rest of my dataset. Fuzzy hashing sometimes works, but sometimes it's completely off. Usually I end up relying on a combination of yara + some custom routines.  

I've always asked the question: "I have 2 malicious samples, show me everything that they have in common".

Another question.... what qualifies as "similar"?  

If I have 2 strings "LULZKITTENS" and "KITTENSLULZ" are those similar?  I think so, but some comparison code may disagree.

The Prelude:

A friend of mine works at a local hospital and I contribute some Ruby code to him once in a while to get some simple DNA sequencing done.  As always, this will lead to researching some efficient string searching algorithms. One of the algorithms in the ever famous Corman book was the "Longest Common Substring" algorithm.  The problem stuck out to me as a solution to my malware categorization problem.  This was extremely close to accomplishing exactly what I needed.

Note:  To accomplish what we're looking to do a modification needed to be made to the algorithm to find longest exact common substring.


Longest common substring in it's pure form can solve another interesting malware problem, I'll talk more about that in a future blogpost.

The Problem Statement:

Here's exactly what I was looking for:
1. Show me the longest common substring of bytes in 2 arbitrary pieces of binary data
2. Show me the next longest one that is not contained in the longest substring
3. Remove the substrings for 1 and 2 and repeat.

A Simple Problem:

Let's take 2 binary pieces of data (DNA in this case) and compare them for commonalities.  The first question is to find the greatest common substring of bytes that is in common.  We have a couple questions to answer.  How long is it?  Is there more than 1 discrete substring that exists of the same length?  How similar are these 2 binaries?

Set 1:
ACAATTGAATGAAAAGGACTGGCTTGATAATCACGTAGTGAACAACAATTGATAAGTATTTA
AGATTTACGTGTTATGATTAGAAGAAAACTTAGCTGAAAGATGTGAATAGTTGACGATAAGC
TAATTAACTATTCGTCGATGTGGATGATTAGGCATAATGTTATAAGCGATTGAAATATTCAT
ATTTTACTTTGAATAGACACGCGTAATTGAATCGTTGGAGAC

Set 2:
ACATCATGTTGTGAGGTAATTCGACTTGAAATTATATATATAGAGAAGACCAATACTGTACG
AAGAGGTAAGCTATAAGTCTATTTAAGGTTAAGTTAATATACTAATATAGATTAAGATCTAG
TGAGGGGTGATTTTACCGTATTTGATACTATATAGGACGGATACACTAACAACAATCGATTA
TAATGGACGCGGAAGTAGGCTTTATTTGTTGAAAATGTGAAA

Spoiler alert.  Solutions below.

The greatest common exact substring of these 2 pieces of data is 9 characters long and took 51984 comparisons.  That's a lot of calculations for these 2 pieces of data.  Luckily we only need to pass through once to find all common substrings and we'll just pluck them all out and then reduce.

The lcs is "AACAACAAT" for those of you taking notes. Let's highlight that, and continue down the process of finding the next largest substring.

Set 1:
ACAATTGAATGAAAAGGACTGGCTTGATAATCACGTAGTGAACAACAATTGATAAGTATTTA
AGATTTACGTGTTATGATTAGAAGAAAACTTAGCTGAAAGATGTGAATAGTTGACGATAAGC
TAATTAACTATTCGTCGATGTGGATGATTAGGCATAATGTTATAAGCGATTGAAATATTCAT
ATTTTACTTTGAATAGACACGCGTAATTGAATCGTTGGAGAC

Set 2:
ACATCATGTTGTGAGGTAATTCGACTTGAAATTATATATATAGAGAAGACCAATACTGTACG
AAGAGGTAAGCTATAAGTCTATTTAAGGTTAAGTTAATATACTAATATAGATTAAGATCTAG
TGAGGGGTGATTTTACCGTATTTGATACTATATAGGACGGATACACTAACAACAATCGATTA
TAATGGACGCGGAAGTAGGCTTTATTTGTTGAAAATGTGAAA

So the way the process works is by checking to see if another substring of 9 bytes occurs (it does not) and then will check for 8 bytes substring (that are not a subset of the 9 byte sequence)

It turns out, that indeed there is!  "TATTTAAG"  <- note: this is the only substring that is 8 bytes long

Let's highlight that and continue the process.

Set 1:
ACAATTGAATGAAAAGGACTGGCTTGATAATCACGTAGTGAACAACAATTGATAAGTATTTA
AGATTTACGTGTTATGATTAGAAGAAAACTTAGCTGAAAGATGTGAATAGTTGACGATAAGC
TAATTAACTATTCGTCGATGTGGATGATTAGGCATAATGTTATAAGCGATTGAAATATTCAT
ATTTTACTTTGAATAGACACGCGTAATTGAATCGTTGGAGAC

Set 2:
ACATCATGTTGTGAGGTAATTCGACTTGAAATTATATATATAGAGAAGACCAATACTGTACG
AAGAGGTAAGCTATAAGTCTATTTAAGGTTAAGTTAATATACTAATATAGATTAAGATCTAG
TGAGGGGTGATTTTACCGTATTTGATACTATATAGGACGGATACACTAACAACAATCGATTA
TAATGGACGCGGAAGTAGGCTTTATTTGTTGAAAATGTGAAA

Now we have found 2 high fidelity "indicators" that represent these 2 discrete pieces of code.  Let's continue down our process and go for 7 characters substrings.

Set 1:
ACAATTGAATGAAAAGGACTGGCTTGATAATCACGTAGTGAACAACAATTGATAAGTATTTA
AGATTTACGTGTTATGATTAGAAGAAAACTTAGCTGAAAGATGTGAATAGTTGACGATAAGC
TAATTAACTATTCGTCGATGTGGATGATTAGGCATAATGTTATAAGCGATTGAAATATTCAT
ATTTTACTTTGAATAGACACGCGTAATTGAATCGTTGGAGAC

Set 2:
ACATCATGTTGTGAGGTAATTCGACTTGAAATTATATATATAGAGAAGACCAATACTGTACG
AAGAGGTAAGCTATAAGTCTATTTAAGGTTAAGTTAATATACTAATATAGATTAAGATCTAG
TGAGGGGTGATTTTACCGTATTTGATACTATATAGGACGGATACACTAACAACAATCGATTA
TAATGGACGCGGAAGTAGGCTTTATTTGTTGAAAATGTGAAA

Let's continue down our process and go for 6 characters substrings and we'll stop there.


Set 1:
ACAATTGAATGAAAAGGACTGGCTTGATAATCACGTAGTGAACAACAATTGATAAGTATTTA
AGATTTACGTGTTATGATTAGAAGAAAACTTAGCTGAAAGATGTGAATAGTTGACGATAAGC
TAATTAACTATTCGTCGATGTGGATGATTAGGCATAATGTTATAAGCGATTGAAATATTCAT
ATTTTACTTTGAATAGACACGCGTAATTGAATCGTTGGAGAC

Set 2:
ACATCATGTTGTGAGGTAATTCGACTTGAAATTATATATATAGAGAAGACCAATACTGTACG
AAGAGGTAAGCTATAAGTCTATTTAAGGTTAAGTTAATATACTAATATAGATTAAGATCTAG
TGAGGGGTGATTTTACCGTATTTGATACTATATAGGACGGATACACTAACAACAATCGATTA
TAATGGACGCGGAAGTAGGCTTTATTTGTTGAAAATGTGAAA

At this point we can stop, while we could look for 5 character sequences and less, at some point you're going to down to 1 byte and eventually the both documents will converge on their commonalities.  The trend is that these 2 documents, while they might be different, they actually contain a lot of the same information. 

Additional Notes:

The reduction problem.  Binary data (depending on the type of data) has a lot of commonalities.  Tons of padding, common headers and patterns.  There is a problem of reduction.

A small example:
  • I'm comparing 2 words for common sequences.  "ABCDEFGHIJK" and "CDEFGHIJKLMO" 
  • The longest exact common substring would be "CDEFGHIJK".  
    • Technically, this would be our highest fidelity signature/description for these 2 words.  
  • Let's search for a second substring with a length of maxlength - 1. We have 2 results.
    • "DEFGHIJK"
    • "CDEFGHIJ"
  • Both of the above ones are legit answers, however they exist as a subset of another substring, so we have redundant information and they need to be filtered out.
  • Recurse

Stepping Back:

So we have a way to compare pieces of code but what about its applicability to something like malware?  Will we get junk data?  Will we get reasonable results?  The final question would be around it's uses outside of finding commonalities.

It's worth mentioning some of the commonalities we can find when running this tool against malicious samples.
  • Encryption routines shared between backdoors
    • Loops, routines (with hard coded bounds), etc...
    • Static keys that are the same across multiple samples
  • Strings (the obvious low hanging fruit)
    • Also things like control characters that are after and before
    • Unique padding between data 
  • Shared Functions
    • Using the same connect functions
    • Similar URI requests, etc....
Essentially what you'll get is a method to say "I have 2 malicious samples, show me everything that they have in common"  It's up to the analyst at that point to dive through the disassembly and figure out what truly describes the nature of the backdoor.

Sample Use Case:

We are going to take 2 different versions a backdoor from the same malware family.  Running it through the process will highlight the commonalities between the 2, but also the differences.

The first piece of code that we run will give us the location and length of the substrings in common between the 2 binaries, the second will actually go back to the binary, fseek them out and then display them as a hex representation.  The third is the most trivial piece of code.  Just to take those hex representations and then present those back to the analyst as ascii (paired up with their binary equivalent).

Enough with the chatter, let's look at some results.  (Disclaimer: these pieces of malware were found from public sources), If you've played this game before you will recognize the samples. Also of note, the results are abridged.

Samples:
https://www.virustotal.com/file/12ad85f7e94199e6bdf6e908c51f598ab70b2e05c2c662b984177dbb3196d842/analysis/
https://www.virustotal.com/file/3122694fdf0b8fa36870b2b02b3259d3e921a2725e659821051362599f613d73/analysis/

Just for reference the max string in common between the 2 has a length of 1330 bytes.  Which in my opinion is pretty significant.

The first two sequences that we get back are: (note: control characters are omitted if they cannot be displayed)

-----------------------------------
4D5A90000300000004000000FFFF0000B800000000000000400000000000000000000000000000000000000000000000000000000000000000000000
  MZ

------------------------------------
 and
-----------------------------------
0000000E1FBA0E00B409CD21B8014CCD21546869732070726F6772616D2063616E6E6F742062652072756E20696E20444F53206D6F64652E0D0D0A2400000000000000
  ยบ This cannot be run in DOS mode

-----------------------------------

This isn't surprising, comparing 2 different binaries and they have commonalities in the PE header.  Let's go a little further down the results:

-----------------------------------
006F00496E7465726E6574436F6E6E6563744100007100496E7465726E6574437261636B55726C41009200496E7465726E65744F70656E4100530048747470456E64526571756573744100BA00496E7465726E6574577269746546696C65005A004874747053656E64526571756573744578410000500048747470416464526571756573744865616465727341000057494E494E45542E646C6C005753325F33322E646C6C00000E004765744D6F64756C6546696C654E616D6545784100000400456E756D50726F636573734D6F64756C657300000500456E756D50726F6365737365730050534150492E444C4C00B202737072696E746600 InternetConnectA InternetCrackUrlA  InternetOpenA HttpEndRequestA InternetWriteFile HttpSendRequestExA WININET.dll WS2_32.dll GetModuleFileNameExA EnumProcessModules EnumProcesses PSAPI.DLL sprintf

-----------------------------------

This isn't surprising again, API calls that are in common binaries.  The ordering might be significant or unique, but in this case, I'm sure there is more interesting data in the binary.

-----------------------------------
4000000000003F3F3F0000000000000000000000000000000000000000000000000000000000434C4F53454400000000000000000000000000000000000000000000000000004C495354454E494E47000000000000000000000000000000000000000000000053594E5F53454E5400000000000000000000000000000000000000000000000053594E5F5243564400000000000000000000000000000000000000000000000045535441424C495348454400000000000000000000000000000000000000000046494E5F5741495431000000000000000000000000000000000000000000000046494E5F57414954320000000000000000000000000000000000000000000000434C4F53455F5741495400000000000000000000000000000000000000000000434C4F53494E47000000000000000000000000000000000000000000000000004C4153545F41434B00000000000000000000000000000000000000000000000054494D455F57414954000000000000000000000000000000000000000000000044454C4554455F5443420000000000000000000000000000000000000000000057696E646F77734170700000696D6167652F6769662C20696D6167652F782D786269746D61702C20696D6167652F6A7065672C20696D6167652F706A7065672C206170706C69636174696F6E2F782D73686F636B776176652D666C6173682C206170706C69636174696F6E2F766E642E6D732D657863656C2C206170706C69636174696F6E2F766E642E6D732D706F776572706F696E742C206170706C69636174696F6E2F6D73776F72642C202A2F2A000000004D6F7A696C6C612F342E302028636F6D70617469626C653B204D53494520372E303B2057696E646F7773204E5420352E31290000436F6E74656E742D547970653A206170706C69636174696F6E2F782D7777772D666F726D2D75726C656E636F6465640D0A000000436F6E6E656374696F6E3A204B6565702D416C6976650D0A000000004163636570742D436861727365743A2049534F2D383835392D312C7574662D383B713D302E372C2A3B713D302E370D0A000000004163636570742D4C616E67756167653A20656E2D75732C656E3B713D302E350D0A00000025642E25642E25642E25640049646C650000000025732020252D3773252D323473252D34307325733A25640A0000000055445000257300002A3A2A0025732020252D3773252D323473252D323473252D31367325733A25640A0000005443500025733A256400000025732020252D3773252D323473252D323473252D31367325730A000050726F746F0000004C6F63616C2041646472657373000000466F726569676E204164647265737300537461746500000050726F63657373002573416C6C6F63617465416E6447657455647045785461626C6546726F6D537461636B204572726F72212E0A000000002573416C6C6F63617465416E6447657454637045785461626C6546726F6D537461636B204572726F72210A00416C6C6F63617465416E6447657455647045785461626C6546726F6D537461636B000000416C6C6F63617465416E6447657454637045785461626C6546726F6D537461636B0000006970686C706170692E646C6C000000002573252D313075202020257320200A002573252D313073202020257320200A0050726F636573734944000000496D616765204E616D6500002573437265617465546F6F6C68656C703332536E617073686F74202166616C73650A0000534F4654574152455C4D6963726F736F66745C57696E646F77735C43757272656E7456657273696F6E5C52756E5C0000

 @
CLOSED LISTENING SYN_SENT SYN_RCVD ESTABLISHED FIN_WAIT1 FIN_WAIT2 CLOSE_WAIT CLOSING LAST_ACK TIME_WAIT DELETE_TCB WindowsApp
image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel,  application/vnd.ms-powerpoint, application/msword, */*
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
Content-Type: application/x-www-form-urlencoded
Connection: Keep-Alive
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Language: en-us,en;q=0.5
%d.%d.%d.%d
Idle
%s  %-7s%-24s%-40s%s:%d
%s  %-7s%-24s%-24s%-16s%s:%d
%s:%d
%s  %-7s%-24s%-24s%-16s%s
Proto
Local Address
Foreign Address
State
Process
%sAllocateAndGetUdpExTableFromStack Error!.
%sAllocateAndGetTcpExTableFromStack Error!
AllocateAndGetUdpExTableFromStack
AllocateAndGetTcpExTableFromStack
iphlpapi.dll
%s%-10u   %s
%s%-10s   %s
ProcessID
Image Name
%sCreateToolhelp32Snapshot !false
SOFTWARE\Microsoft\Windows\CurrentVersion\Run\
-----------------------------------

Wow, that looks pretty nice.  What this is telling me is that between these 2 copies of the backdoor, the following giant chunk of code is shared.  As a responder, visually I might have made a yara signature including some of those keywords along with the user agent, but truth is, that would be a terrible signature.

Pasted below is one of my favorite 2 sequences (in order):

-----------------------------------
002F7075742E6173703F6E6D3D00000000706820002573676574202573206572726F723A25640A00002573676574202573206F6B210A00000068747470733A2F2F7777772E
/put.asp?nm=
%sget %s error:%d
%sget %s ok!
https://www.
-----------------------------------


-----------------------------------
2E636F6D2F6765742E6173703F6E6D3D00

 .com/get.asp?nm=
-----------------------------------

What this is telling me is that the author swapped out the C2 address.  The 1st sequence stops at www. and then resumes at .com/get.asp.  The only difference in those 2 pieces of code is the everything in between (the URL).

There are many more sequences that fall out as a result of this type of analysis, I'll only mention the above ones for readability.

Not only is that a good indicator of what has changed between these 2 pieces of malware, we now have a signature that mathematically cannot be any higher fidelity.  Searching for those hex bytes will hit on at least those 2 samples sequences, but many others built with the same logic.  When running this code, there are pages and pages of results, so some smart techniques will be needed to sift though the common sequences to find good signatures.

The goal of this project is move away from signatures that are a handful of keywords that an incident responder finds in a binary and uncloak the true commonalities that exist between multiple builds.

At this point plugging those hex values into yara and scanning is the way to go.

Another use of this tool would be to look at malware over a long period of time, this will provide a simple way to determine how it's growing and changing.  If it's sharing a similar code base with other pieces of malware, etc..

How to defeat this:

This wouldn't be a security blog if people didn't read this and think "I've got a way to defeat this sort of behavior profiling".  So I did want to address this.  There is the obvious argument of packers.  They can be used to obfuscate and hide code.  The downfall is that at some point the code will need to get unpacked, and it's up to the analyst to decide when to profile the code and when to try to get it into a common format.

With that said, if we want to find a piece of code that is protected with (pick any packer), we can start sequencing all sorts of code written with the same packer.  What will eventually fall out is quirks in the header, unpacking routines, or patterns that are common in those packed binaries.  At which point those can be fed to some secondary analysis for further scrutiny.

Can it be defeated?  Yes.  Is it trivial? .... I'd argue no, but as with anything, I'm sure that there is an innovative way to do some packing that I haven't been exposed to yet. :)

Additional Uses (outside of security):

  • Plagiarism Detection  <- sorry kids!
    • Bumping an "essay" up against a wikipedia article of the same name might show a whole paragraph or even a couple sentences of commonalities. 
  • Fingerprint Detection
  • DNA Sequencing 
This is a new area of research, I hope that this code gets adopted and modified to better the area of security and malware signatures.   I also hope that this can help contribute to tracing the origins of backdoors and help relate them to their families.

References:

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring
http://20bits.com/article/introduction-to-dynamic-programming

No comments:

Post a Comment