Our methods for finding and removing website malware

12 Dec

by Thomas J. Raef

in Technology

You might imagine that finding and removing website malware is relatively straightforward, right?

Find malicious code and remove it.

Easy, right?

Most website malware removal services work on signatures. These signatures positively identify a string of text in your website files. This method is very fast.

The average WordPress website has about 1,900 files. This is regardless of how many posts you have (posts are stored in the database). Joomla websites have about 3,100 files.

A good website malware scanner can search through 1,900 files in about 10 minutes or less. This makes it incredibly fast.

However, the problem with signatures is that the malware string has to be identified first, then a signature can be created to identify it. Does this sound like playing “catch-up” with today’s cyber criminals?

It should.

Some of the malware removal services tell customers, “your website is so badly infected that we must have our team of experts manually remove the malware.” Of course, they charge extra for this “manual” removal.

Can you imagine manually reviewing 1,900 files looking for strings of text that “don’t look like they belong”?

In our never-ending fight to find and remove website malware (remediation) we’ve developed our own methods and programs. People often ask us, “What software do you use to find the malware?”

We use our own. We’ve looked at open source projects like ClamAV, and we realized it was designed to find a different kind of malware. Can it find website malware?

Yes, if you use the correct signature database.

At every step of our scanning and analysis we also include the log files from the website hosting account. This provides us with valuable insight as well.

Our primary method of finding website malware

WeWatchYourWebsite was created back in 2006. We have a copy of every malicious file we’ve ever found. A signature has been created for every piece of malicious code found. We can identify every piece of code we’ve found previously with our database of website malware signatures.

Our service can find and remove website malware extremely fast using this method. It’s incredibly accurate, but it comes with a flaw – you have to identify the malware first. Then you can create a signature to identify that malware string in other files.

The database of malware signatures is updated numerous times every day. But we need our other methods to find new strains of website malware. Once a new strain is found, we create signatures to identify it, because signature matching is the fastest method.
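As a rough illustration of signature matching (not the actual scanner described here), the Python sketch below checks file contents against a small table of known byte strings. The signature names and strings, and the `scan_bytes` and `scan_tree` helpers, are made up for the example.

```python
from pathlib import Path

# Hypothetical signatures for illustration only: name -> byte string
# that is assumed to have been seen in a known malicious file.
SIGNATURES = {
    "eval-base64-post": b"eval(base64_decode($_POST[",
    "hidden-iframe": b'<iframe style="display:none"',
}


def scan_bytes(data: bytes) -> list[str]:
    """Return the names of all signatures whose byte string occurs in data."""
    return [name for name, needle in SIGNATURES.items() if needle in data]


def scan_tree(root: str) -> dict[str, list[str]]:
    """Scan every regular file under root; map file path -> matched signature names."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            matched = scan_bytes(path.read_bytes())
            if matched:
                hits[str(path)] = matched
    return hits
```

A plain substring test like this is fast, but it shows the flaw described above: a string that is not yet in `SIGNATURES` is never reported.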

Anomaly-based website malware identification

The second stage of our malware removal uses an anomaly-based detection engine we created internally.

This engine looks for code that doesn’t appear to belong in your files.

A simple example is PHP code inside a graphic file. There are many, many rules in this engine and they have proven to be highly accurate while maintaining a false positive rate of 0.1%.

This part of our scanning falls under, “if it looks like a duck and quacks like a duck, it’s a duck.”
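The "PHP code inside a graphic file" rule can be sketched in a few lines of Python. This is only a guess at what one such rule might look like, not the engine's real rule set; the extension list and the `image_contains_php` name are assumptions.

```python
import os

# Assumed list of image extensions; a real rule set would likely be longer.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".ico"}

# Only the long opening tag is checked. The short tag "<?=" is three bytes
# and could occur by chance in binary image data.
PHP_OPEN_TAG = b"<?php"


def image_contains_php(filename: str, data: bytes) -> bool:
    """Return True if a file named like an image contains a PHP opening tag."""
    suffix = os.path.splitext(filename)[1].lower()
    if suffix not in IMAGE_EXTENSIONS:
        return False
    return PHP_OPEN_TAG in data.lower()
```

For example, `image_contains_php("logo.png", open("logo.png", "rb").read())` would flag a PNG that carries an embedded PHP block.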

An anomaly in the log files would be frequent POSTs, from various IP addresses, to a file in the wp-includes folder or the libraries folder of a Joomla site.
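A log check of this kind might look like the following Python sketch. It assumes access logs in the common Apache/Nginx "combined" layout and a site installed at the web root; the thresholds and the `suspicious_post_targets` name are illustrative choices, not values taken from our service.

```python
import re
from collections import defaultdict

# Matches the start of a combined-format access log line:
# client IP, two ignored fields, [timestamp], then "METHOD /path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)

# Folders where visitors normally have no reason to POST (assumes the
# site lives at the web root rather than in a subdirectory).
WATCHED_PREFIXES = ("/wp-includes/", "/libraries/")


def suspicious_post_targets(lines, min_posts=3, min_ips=2):
    """Map path -> POST count for watched paths hit often and from several IPs."""
    posts = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("method") != "POST":
            continue
        path = match.group("path").split("?", 1)[0]  # drop the query string
        if path.startswith(WATCHED_PREFIXES):
            posts[path].append(match.group("ip"))
    return {
        path: len(ips)
        for path, ips in posts.items()
        if len(ips) >= min_posts and len(set(ips)) >= min_ips
    }
```

Any path this returns is only a candidate; the file itself still has to be analyzed before it is called malicious.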

It has been one of our more accurate methods of finding new malware. We still add new rules to it, but the existing rule set has been very stable at detecting new website malware.

Any code found in this stage of our website malware removal service is further analyzed to verify that it’s malicious. Once that has been verified, a signature, or sometimes a group of signatures, is created so we can find it fast.

Behavior analysis

This stage of our service actually runs the code in a simulated live environment and records all activity – browser and server.

Our system is designed to look for behaviors like: remotely including files, login pages that have a password embedded in the code and other nefarious activities.

This stage often uncovers false positives raised elsewhere. Oftentimes we’ll get reports of a web host deactivating someone’s websites due to malicious activity. In this stage of our service we find that the host had deactivated the account based on a false positive.

More and more we’re seeing plugin or theme developers using a licensing scheme (not scam) to make sure people pay for their work rather than stealing it and using it for free. This licensing scheme will make a remote call to the developer’s website to request some code. Without that code, the site won’t work. It won’t display the website.

Imagine if you remove the code your hosting provider has identified as being malicious, they reactivate your website and then it doesn’t function! Our behavior analysis will determine that the included code is not malicious in this situation.

Typically in these cases we contact the hosting provider and explain why it’s not malicious and they will reactivate the account.

Our behavior analysis engine also finds things like POSTs or GETs coming from IP addresses that belong to website hosting accounts rather than to visitors. Obviously that behavior is extremely suspicious.

Artificial Intelligence Machine Learning system

This is the method and system I’m most proud of.

In order for machine learning to be effective you have to build the system to analyze every aspect of the website’s files. The average approach would be to “feed” the machine learning system as many malicious files as you can and then “feed” the system as many safe files as you are able. Then let the machine learning system learn the difference.

This is fine for some files but website files can have malicious code injected into them. The entire file isn’t malicious, sometimes just the injected code is malicious. Hackers take legitimate files and inject malicious code in them.

The machine learning system isn’t effective if it identifies an entire file as being malicious. It must find the string of malicious code buried inside all the legitimate code.

Each line of code must be analyzed. Each section of code must be analyzed. Each block of code must be analyzed.

Add to this the fact that hackers like to hide their code. They use various obfuscation techniques. Encoding techniques like base64, gzinflate, etc. help them hide their code from the average person. Sometimes if code looks like it’s trying to hide, it’s a good indicator that it’s suspicious.

However, we also see code that is obfuscated – but legitimately. Often plugin, theme and template developers use obfuscation techniques so people won’t see their code. Their code is proprietary so they obfuscate it to hide it.

If their legitimate code is removed, the website will not function.

This is why all obfuscated code must be deobfuscated before it’s “fed” into our machine learning system.
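To make the deobfuscation step concrete, here is a minimal Python sketch that unwraps two common PHP patterns, `base64_decode('...')` and `gzinflate(base64_decode('...'))`, when the payload is a string literal. It is an illustration under those assumptions, not our deobfuscator; the regular expressions and the `deobfuscate` name are invented for the example. PHP's `gzinflate` reads raw DEFLATE data, which `zlib.decompress` handles when given a negative window size.

```python
import base64
import re
import zlib

_B64 = r"['\"]([A-Za-z0-9+/=]+)['\"]"
GZ_B64 = re.compile(r"gzinflate\s*\(\s*base64_decode\s*\(\s*" + _B64 + r"\s*\)\s*\)")
PLAIN_B64 = re.compile(r"base64_decode\s*\(\s*" + _B64 + r"\s*\)")


def _inflate(match):
    """Replace gzinflate(base64_decode('...')) with the text it hides."""
    try:
        raw = base64.b64decode(match.group(1), validate=True)
        return zlib.decompress(raw, -zlib.MAX_WBITS).decode("utf-8", "replace")
    except (ValueError, zlib.error):
        return match.group(0)  # leave anything we cannot decode untouched


def _decode(match):
    """Replace base64_decode('...') with the text it hides."""
    try:
        return base64.b64decode(match.group(1), validate=True).decode("utf-8", "replace")
    except ValueError:
        return match.group(0)


def deobfuscate(source: str, max_rounds: int = 10) -> str:
    """Unwrap nested layers until nothing changes or max_rounds is reached."""
    for _ in range(max_rounds):
        expanded = PLAIN_B64.sub(_decode, GZ_B64.sub(_inflate, source))
        if expanded == source:
            break
        source = expanded
    return source
```

The result is meant for analysis only. Substituting the decoded text in place does not produce valid PHP, but it exposes the hidden lines so they can be examined like any other code.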

This section of our service does not just look at the code inside the website files. We also examine the file and folder structure and the log files.

In some of the recent phishing and spam cases we’ve seen, folders are created with a large number of .html files, or files with no extension, inside. There might be many thousands of files uploaded to a website. If those files are included in other, legitimate files, they are suspicious. But the folder structure is what we’re interested in, as well as the lines, sections and blocks of code.
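A folder-structure check along these lines could be sketched as follows. The threshold of 1,000 files and the `crowded_folders` name are assumptions made for the example, not figures from our system.

```python
import os


def crowded_folders(root: str, threshold: int = 1000) -> dict[str, int]:
    """Map folder -> count for folders that directly hold at least
    `threshold` .html or extensionless files."""
    flagged = {}
    for folder, _dirs, files in os.walk(root):
        count = sum(
            1
            for name in files
            # Skip dotfiles such as .htaccess, which splitext reports
            # as having no extension.
            if not name.startswith(".")
            and os.path.splitext(name)[1].lower() in ("", ".html")
        )
        if count >= threshold:
            flagged[folder] = count
    return flagged
```

A large count alone does not prove anything, since some legitimate sites ship many static pages, so a result like this would be one input among several rather than a verdict.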

If you’d like your files scanned with the above technology, and you agree that these methods “make sense”, sign up for our service: https://wewatchyourwebsite.com/pricing/

Thank you!
