What Do Spam Filters Look At?

Leo A. Notenboom

The Ask Leo! Podcast

What do spam filters look at?

Hi everyone, Leo Notenboom here for Askleo.com.

Someone recently commented to me that a spam filter was really useless

since the spam he was receiving kept coming from different email addresses.

The implication was that this person believed that the

from address is the only thing the spam filters check.

I suppose that's possible, but it's going to be very rare

because as he points out, that's not enough.

Not today!

These days spam filters are super complex.

They are sophisticated pieces of software

that check a lot more than you might think.

Now I have to start this discussion by saying that

there is no such thing as a single spam filter.

There's no single spam filtering technique.

There's no single spam filtering set of rules.

How your spam filter works might be very different

from the way my spam filter works.

And both might be very different

from another service, and another, and another.

That's particularly true for large email providers.

They each have their own spam filtering technology.

So let me be clear that there's not some set of rules

that will or will not cause email to be marked as spam.

If there were such a set of rules,

then spammers would just start abusing them

because they would know what to do

to get their messages through the system.

In reality, it's all about probability.

The word "rules" isn't really the right word.

There's no magic rule that once you trip

guarantees your email is going to be sent to spam, or vice versa.

This is something way, way fuzzier.

Probability.

What we look at are the characteristics of email

that add to or reduce the probability

that a message will be marked as spam.

Any single characteristic by itself is usually not enough.

Taken in combination with other characteristics, though,

a message that shows multiple characteristics of being spam

is probably going to get filtered as spam.

Think of each characteristic as a strike against the message.

Too many strikes? You're out.

Of course, not all characteristics are equal.

One characteristic might be a stronger indication

of spamminess than another.

Nor do they stay the same over time.

Spam topics come and go.

They're very reactive to current events.

For example, a word that was completely benign in email last year

might be an indicator of spam today.
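
To make the idea of weighted strikes concrete, here's a minimal sketch in Python. Everything in it is an assumption made for illustration: the characteristic names, the weights, and the starting bias are invented, and it isn't how any particular service scores mail.

```python
import math

# Hypothetical weights: positive values push toward spam, negative values away.
# A real filter would tune these and keep adjusting them over time.
WEIGHTS = {
    "display_name_mismatch": 1.5,
    "spammy_subject_word": 1.2,
    "body_is_only_a_link": 1.8,
    "sender_is_known_contact": -2.0,
}
BIAS = -3.0  # hypothetical: a message with no strikes starts out unlikely to be spam


def spam_probability(observed):
    """Combine the observed characteristics into a probability from 0 to 1."""
    score = BIAS + sum(WEIGHTS.get(name, 0.0) for name in observed)
    return 1 / (1 + math.exp(-score))  # logistic function


one_strike = spam_probability({"display_name_mismatch"})
three_strikes = spam_probability(
    {"display_name_mismatch", "spammy_subject_word", "body_is_only_a_link"}
)
```

With these made-up numbers, a single strike leaves the message under even odds of being spam, while three strikes together push it over.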

So, what are these characteristics?

Well, I've got some for you.

I don't mean it to be a comprehensive list,

because there is no comprehensive list.

Like I said earlier, different services use different characteristics

in different combinations and in different ways.

But here are some of the things a spam filter can look at.

The from line.

Sure, it's possible to block email based on the from line alone.

However, as a spam filtering technique, it's pretty useless.

The problem, of course, as our original question indicated,

is that the from address changes on every message.

So, it's kind of sort of useless for spam filtering.

It's great for blocking a specific person,

but it is definitely not something that you can really use effectively

to block spam.

Now, there is a test that spam filters can use,

and that is something like this.

If you see an email that says from someone at some random service dot com

and then askleoexample at hotmail dot com,

in angle brackets,

well, the from line has both a display name and an email address.

I've got an article, actually, on exactly what that means and why it exists.

They both look like email addresses, and they don't match in this example.

That is a common characteristic of spam.
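
As a rough illustration of that test, here's a short Python sketch. It uses the standard library's email.utils.parseaddr to split the from line into its display name and address. The function name and the simple "looks like an address" pattern are my own assumptions, not something any specific filter is known to use.

```python
import re
from email.utils import parseaddr

# Loose, hypothetical pattern for "something that looks like an email address".
ADDRESS_LIKE = re.compile(r"[^@\s<>]+@[^@\s<>]+\.[^@\s<>]+")


def display_name_mismatch(from_header):
    """True when the display name looks like an address that differs from the real one."""
    display_name, address = parseaddr(from_header)
    found = ADDRESS_LIKE.search(display_name)
    if found is None:
        return False  # the display name is an ordinary name, nothing to compare
    return found.group(0).lower() != address.lower()


suspicious = display_name_mismatch(
    '"someone@somerandomservice.com" <askleoexample@hotmail.com>'
)
ordinary = display_name_mismatch("Leo <leo@example.com>")
```

Here suspicious is True and ordinary is False. A filter would treat the first as one strike among many, not as proof.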

The to and the cc lines.

The same display name versus email address check applies to

basically any email address that

spammers might make visible in the to and the cc lines.

Other checks might include: are there many recipients?

Spam is often sent to many recipients at the same time.

Does the account you're getting it on appear on either the to or the cc line?

Perhaps you were bcc'd.

Again, spammers often use bcc to send to many more recipients

than the message might show.

Are there any recipients at all?

Spam often appears with a completely blank to or cc line.
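
Those three recipient checks can be sketched in a few lines of Python. This is only an illustration: the function name, the strike names, and the cutoff for "many" are all assumptions, and the addresses are expected to have been pulled out of the headers already.

```python
def recipient_strikes(to, cc, my_address, many=20):
    """Return the names of the recipient checks this message trips.

    `to` and `cc` are lists of the addresses visible in those lines.
    `many` is a hypothetical cutoff, not a standard value.
    """
    visible = [addr.strip().lower() for addr in (*to, *cc) if addr.strip()]
    strikes = []
    if not visible:
        strikes.append("no_visible_recipients")
    if len(visible) > many:
        strikes.append("many_recipients")
    if my_address.strip().lower() not in visible:
        strikes.append("recipient_not_listed")  # you were probably bcc'd
    return strikes
```

For example, a message with blank to and cc lines trips both "no_visible_recipients" and "recipient_not_listed", while a message addressed only to you trips nothing.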

The subject line.

Now, I'm sure there's often a curated collection

of currently common spammy subjects that increase a message's chance

of being filtered as spam.

A message having no subject at all, I suppose, might be on that list.

Similarly, words currently common in spam subject lines

could count against any message using them.

Grammar counts, though not as much as the body.

Many legitimate subject lines are grammatically incorrect.

I do it all the time.

But most spam subject lines are incorrect.

Spelling, unusual spacing, capitalization of words

can also have a negative effect.

Language, both in word choice and the set of characters used,

can be a sign of spam.

If the message originates in an English-speaking country

and is destined for an English-speaking country,

seeing it in a foreign language or seeing foreign characters

in the message could be a clue.
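
Here's a small Python sketch of a few of those subject line checks. The word list, the 70 percent capitals cutoff, and the spacing patterns are all invented for the example. A real filter's list would be much longer and would change constantly.

```python
import re

# Hypothetical word list; real lists are curated and change with current events.
SPAMMY_WORDS = {"winner", "free", "urgent"}


def subject_strikes(subject):
    """Return the names of the subject line checks this subject trips."""
    text = subject.strip()
    if not text:
        return ["empty_subject"]
    strikes = []
    words = re.findall(r"[A-Za-z']+", text)
    if any(word.lower() in SPAMMY_WORDS for word in words):
        strikes.append("spammy_word")
    letters = [ch for ch in text if ch.isalpha()]
    if letters and sum(ch.isupper() for ch in letters) / len(letters) > 0.7:
        strikes.append("mostly_uppercase")
    # Three or more spaces in a row, or letters spaced out like "F R E E".
    if re.search(r"\s{3,}", text) or re.search(r"(?:\b\w\s){4,}", text):
        strikes.append("unusual_spacing")
    return strikes
```

Note that spacing letters apart hides the word from the word list check, which is exactly why a separate spacing check is worth having.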

The message body.

The message body is where things really get interesting

and almost, I'll just say, magical.

This is where the phrase,

looks like spam, really applies at its fuzziest,

since what looks like spam to one person

might not look like spam to another.

Spam filters fight this battle every day.

Here are a few of the issues a spam filter might see.

Just a link.

If there's just a link in the body,

that's kind of suspicious.

The topic.

There are spammy topics,

like those related to body enhancement, politics,

money-making schemes, whatever,

and messages about them are likely to be filtered as spam.

Grammar and spelling.

No, you and I aren't perfect,

but most spam is worse.

The quality of the actual writing

can be factored in as a sign of potential spam.

Although I will say that over the last few years,

spammers have been getting better,

and with AI available to write their spam,

for them,

I suspect this won't be as useful as it once was.

Language and character set.

Just like the subject line,

messages in languages that are foreign

to both the sender and the receiver

can be a sign of spam.

Images.

Email messages with images are common in spam,

and thus can act as a sign of spam.

In particular,

email that is only an image,

or email that is mostly images,

geared to try to trick you into displaying images,

can be a red flag.

Spacing.

This one's really obscure,

but I see it used a lot.

The top part of a message body

might be an explicit call for some spammer's goal.

They're trying to sell you something

or get you to click on something.

But since it's so clearly spam,

they might add several blank lines below the message,

and then append non-spammy,

often random, content at the end.

The presence of non-spammy content

might tilt the balance in favor

of the message not being spam,

when it obviously is.
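
The spacing trick is simple enough to sketch in Python. This is an illustration under assumptions: the function name and the five-line cutoff are made up, and plenty of legitimate mail has gaps too, so this could only ever be one strike among many.

```python
def has_padding_gap(body, min_blank_lines=5):
    """True when text is followed by a long run of blank lines and then more text."""
    blank_run = 0
    seen_text = False
    for line in body.splitlines():
        if line.strip():
            if seen_text and blank_run >= min_blank_lines:
                return True  # text, a big gap, then more text
            seen_text = True
            blank_run = 0
        else:
            blank_run += 1
    return False


padded = "Buy now!" + "\n" * 7 + "The weather was mild in the valley."
normal = "Hi Leo,\n\nSee you Friday.\n"
```

With these samples, has_padding_gap(padded) is True and has_padding_gap(normal) is False.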

There are probably many more indications for a body,

but you get the idea.

We're taking a look at all the characteristics

of the message itself.

The unseen headers.

By now, I'm sure you've heard about these magical headers

in emails that you don't normally see.

These are lines,

much like the to and the subject line,

that include a bunch of technical information

about how the email was routed and formatted,

and many times,

what a spam filter might even have thought about the message.

Spam filters can analyze some of the headers for clues.

The most interesting is what I call the chain of custody.

The chain is nothing more than a sequence of information

that looks something like this.

I'm server A,

and I got the message from server B

to be delivered to my customer,

someone at some random service dot com.

I'm B,

and I got this from C.

I'm C,

and I got this from D.

I'm server D,

and I originated this message.

Each one of those steps is identified with an IP address

and often a name.

Now, while we can't use the IP address

to identify a specific source or person,

there are generalizations about IP addresses

in the chain of custody

that can affect the probability of the messages being spam.
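
In real email, each step in that chain is recorded in a Received header, with the most recent step at the top. Here's a rough Python sketch that pulls the chain out of a message using the standard library's email package. The sample message, the host names, and the pattern are all made up, and real Received lines vary a lot more than this simple pattern allows for.

```python
import re
from email import message_from_string

# Made-up sample message; the IP addresses come from ranges reserved for documentation.
SAMPLE_MESSAGE = """\
Received: from b.example (b.example [203.0.113.2]) by a.example; Mon, 1 Jan 2024 10:00:03 +0000
Received: from c.example (c.example [203.0.113.3]) by b.example; Mon, 1 Jan 2024 10:00:02 +0000
Received: from d.example (d.example [203.0.113.4]) by c.example; Mon, 1 Jan 2024 10:00:01 +0000
From: someone@d.example
To: someone@a.example
Subject: hello

Just a test.
"""

# Simplified pattern: "from NAME ... [IP] ... by NAME".
HOP = re.compile(r"from\s+(\S+).*?\[([0-9A-Fa-f.:]+)\].*?\bby\s+([^\s;]+)", re.S)


def custody_chain(raw_message):
    """Return one dict per Received header, newest first."""
    message = message_from_string(raw_message)
    hops = []
    for header in message.get_all("Received", []):
        found = HOP.search(header)
        if found:
            hops.append({"from": found.group(1), "ip": found.group(2), "by": found.group(3)})
    return hops


hops = custody_chain(SAMPLE_MESSAGE)
```

Each entry in hops reads as "I'm `by`, and I got this from `from` at `ip`", which is the raw material for the checks that follow.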

DNS.

DNS maps names to IP addresses.

So if an IP address doesn't have an associated name,

or it does,

but that name's IP address doesn't match

the IP address of the server,

then that's a strike against it.
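
This kind of check is often called forward-confirmed reverse DNS. Here's a minimal Python sketch using the standard socket module. The function and strike names are my own, and the two lookups are passed in as parameters purely so the logic can be tried with fake data instead of a live network.

```python
import socket


def _reverse(ip):
    """Look up the name for an IP address, or None if it has no name."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(name):
    """Look up the IPv4 addresses for a name (gethostbyname_ex is IPv4 only)."""
    try:
        return socket.gethostbyname_ex(name)[2]
    except OSError:
        return []


def dns_strike(ip, reverse=_reverse, forward=_forward):
    """Return a strike name if the IP fails the check, or None if it passes."""
    name = reverse(ip)
    if name is None:
        return "no_reverse_name"
    if ip not in forward(name):
        return "forward_mismatch"
    return None


# Fake lookups standing in for DNS, using made-up names and addresses.
fake_names = {"203.0.113.2": "mail.b.example", "203.0.113.9": "mail.b.example"}
fake_ips = {"mail.b.example": ["203.0.113.2"]}
good = dns_strike("203.0.113.2", fake_names.get, lambda n: fake_ips.get(n, []))
bad = dns_strike("203.0.113.9", fake_names.get, lambda n: fake_ips.get(n, []))
```

Here good is None, because the name maps back to the same address, and bad is "forward_mismatch", because the name it claims points somewhere else.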

IP location.

One thing IP addresses generally do tell you

is where on the planet the message came from.

Does the IP address of where the message came from,

the server,

match where the email address supposedly comes from?

Email from your local ISP's domain,

for example,

should never originate from a server in a foreign country.

IP ownership.

Does the IP address of that message actually match

the servers that are supposedly sending for that domain?

For example,

if that's a message from a Gmail account,

did it originate from a Gmail server?

Chain of custody.

Is the chain broken?

For example,

if the line "I'm C, and I got this from D" wasn't present,

then the message somehow appears to have hopped from D to B

without C recording anything.

That's highly suspicious

and often a sign of header forgery.
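
Checking for a broken chain can be sketched in Python like this. It's an illustration only: the function name is made up, and it assumes the chain has already been reduced to simple pairs of "who recorded the step" and "who they say they got it from", newest first.

```python
def find_breaks(hops):
    """Return (expected, found) pairs wherever one step doesn't connect to the next.

    `hops` is a list of (receiver, sender) pairs, newest first.
    """
    breaks = []
    for (_, sender), (older_receiver, _) in zip(hops, hops[1:]):
        if sender != older_receiver:
            breaks.append((sender, older_receiver))
    return breaks


complete = [("A", "B"), ("B", "C"), ("C", "D"), ("D", None)]
missing_c = [("A", "B"), ("B", "C"), ("D", None)]  # C never recorded anything
```

With the complete chain, find_breaks returns an empty list. With C's line missing, it reports that B claims to have heard from C, but the next record was written by D.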

Is the chain reasonable?

This one's a little bit fuzzier,

but as we travel from D to C to B to A,

is the path the message took reasonable?

Did the message appear to take an unnecessary trip

through a foreign server?

Once again,

that's a possible sign of header forgery and spam.

Again,

these are just examples and made up ones at that,

but they give you some idea of what's possible

when spam filters review the headers you don't see.

Now,

new on the scene,

at least for the past few years,

are things called SPF,

DKIM,

and DMARC.

SPF,

Sender Policy Framework,

is mostly about identifying the servers that are allowed

to originate email for a given domain.

For example,

only Yahoo servers are allowed to originate email

from Yahoo email addresses.

And Yahoo has stated,

in their SPF record,

that anything not matching that should be considered spam.
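
To give a feel for how an SPF check works, here's a deliberately simplified Python sketch. It only understands the ip4, ip6, and all parts of a record. Real SPF records also use mechanisms like include, a, and mx, which this ignores. The record and addresses are made up for the example and are not Yahoo's.

```python
import ipaddress

# How the prefix in front of a term maps to a result.
RESULTS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def spf_result(record, sender_ip):
    """Evaluate a simplified SPF record against the sending server's IP address."""
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        qualifier = "+"
        if term[0] in RESULTS:
            qualifier, term = term[0], term[1:]
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return RESULTS[qualifier]
        elif term == "all":
            return RESULTS[qualifier]
        # Anything else (include, a, mx, and so on) is skipped in this sketch.
    return "neutral"


record = "v=spf1 ip4:203.0.113.0/24 -all"  # made-up record
allowed = spf_result(record, "203.0.113.7")
forged = spf_result(record, "198.51.100.9")
```

Here allowed is "pass" because the server is in the listed range, and forged is "fail" because the record ends with "-all", which says nothing else is authorized.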

DKIM,

DomainKeys Identified Mail,

is mostly about using encryption and digital signatures

to authenticate that the claimed sender of a message

is the real sender,

and possibly also that the message content has not been altered.

If the confirmation fails,

that's a possible sign of spam.
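
Full DKIM verification needs the sender's public key from DNS and a cryptography library, which is more than a short example can show. But the "content has not been altered" part rests on a hash of the message body, and that piece can be sketched in Python. This assumes DKIM's "simple" body handling and SHA-256. The function name is my own, and the signature check itself is not shown.

```python
import base64
import hashlib


def simple_body_hash(body):
    """Hash a message body (bytes) the way DKIM's "simple" body handling does."""
    # Normalize line endings to CRLF (assumption: the input may use bare LF).
    text = body.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    # Empty lines at the end of the body are ignored.
    while text.endswith(b"\r\n\r\n"):
        text = text[:-2]
    # The body always ends with exactly one CRLF, even when it's empty.
    if not text.endswith(b"\r\n"):
        text += b"\r\n"
    return base64.b64encode(hashlib.sha256(text).digest()).decode("ascii")


original = simple_body_hash(b"Your invoice is attached.\r\n")
tampered = simple_body_hash(b"Your invoice is attached. Click here!\r\n")
```

The two hashes differ, so if the signed hash in the header matched the original, it won't match a body that was changed along the way.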

DMARC,

Domain-based Message Authentication, Reporting, and Conformance,

is a framework that

allows the apparent sending domain,

say yahoo.com,

to indicate what should happen

if the SPF and DKIM checks fail,

and provide a mechanism for reporting back,

in this case to Yahoo,

what's happening.
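
A DMARC record is a short list of tag=value pairs, and the p tag carries the policy: none, quarantine, or reject. Here's a minimal Python sketch of reading one and deciding what to do. The record, the function names, and the action names are made up. It also assumes the SPF and DKIM results have already been worked out, and it relies on the fact that DMARC treats a message as passing when at least one of the two passes for the apparent sending domain.

```python
def parse_dmarc(record):
    """Split a DMARC record into a dictionary of tags."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_action(record, spf_passed, dkim_passed):
    """Decide what to do with a message, given the domain's DMARC record."""
    if spf_passed or dkim_passed:
        return "deliver"
    policy = parse_dmarc(record).get("p", "none").lower()
    # Hypothetical action names for the three policies.
    actions = {"none": "deliver", "quarantine": "spam_folder", "reject": "reject"}
    return actions.get(policy, "deliver")


record = "v=DMARC1; p=reject; rua=mailto:reports@example.com"  # made-up record
both_failed = dmarc_action(record, spf_passed=False, dkim_passed=False)
one_passed = dmarc_action(record, spf_passed=False, dkim_passed=True)
```

Here both_failed is "reject" and one_passed is "deliver". The rua tag is where the reports would be sent, which is the reporting half of the framework.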

Training.

One of the most potentially confusing things about spam filtering

is what's spam to you

might not be spam to me.

Literally.

When we mark something as spam in many email programs

and on many email services,

what we're really saying is,

email like this

is spam to me.

Now, a sophisticated email filtering service

can then use that specific email message

that you said was spam

to do either or both of two things.

Analyze it to see what characteristics it has

and update the things that the filter looks for

for everyone.

Or it could use those characteristics,

perhaps a little more aggressively even,

to update the spam filter specifically for you.

The net result is you end up with a spam filter

customized to your indication of what is and is not spam.
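
One classic way to build a filter that learns from your choices is a technique called naive Bayes, sketched below in Python. To be clear, this is a swapped-in example of one well-known approach, not a description of what any particular email service does, and the class and method names are made up.

```python
import math
from collections import Counter


class TrainableFilter:
    """A tiny word-counting filter that learns from messages you mark."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def mark(self, text, label):
        """Record that you marked this message as "spam" or "ham"."""
        self.message_counts[label] += 1
        self.word_counts[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        """Estimate how likely this message is to be spam, from 0 to 1."""
        spam_n = self.message_counts["spam"]
        ham_n = self.message_counts["ham"]
        log_odds = math.log((spam_n + 1) / (ham_n + 1))
        for word in set(text.lower().split()):
            # Add-one smoothing so unseen words don't cause division by zero.
            p_spam = (self.word_counts["spam"][word] + 1) / (spam_n + 2)
            p_ham = (self.word_counts["ham"][word] + 1) / (ham_n + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


my_filter = TrainableFilter()
my_filter.mark("cheap pills online", "spam")
my_filter.mark("lunch on friday", "ham")
```

After just those two marks, my_filter.spam_probability("cheap pills") comes out above one half and my_filter.spam_probability("lunch friday") below it. Every message you mark nudges the counts, which is how the filter ends up tailored to you.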

Finally, it's inevitable.

This is the part of the discussion that gets disappointing.

Failure is always an option.

Spam filtering can be incredibly complex

and it can also be incredibly difficult.

It can also be incredibly wrong.

Depending on the sophistication of the spam filter,

depending on its ability to adapt not only to new spam

as spammers try to weasel their way around the filter,

but also to individual user preferences,

and depending on its ability to do its job

in a reasonable amount of time,

spam filters run the range from

pretty darn good, but not perfect,

to relatively pointless.

Some spam will almost always make it through.

And some ham, legitimate email,

the opposite of spam,

will occasionally end up in the spam folder.

My recommendation is simply this.

Train your email program's or service's spam filter.

Mark spam as spam.

Mark false positives as not spam.

Never reply to spam.

Never try to unsubscribe from spam.

If it's something you asked for by subscribing,

absolutely, it's not spam.

Unsubscribe from it.

But if it's not something you asked for,

that's the definition of spam.

Don't unsubscribe,

but hit that "This is Spam" button instead.

Above all, don't let spam stress you out.

It's a normal, everyday fact of life on today's internet.

Like I said, mark spam as spam,

and move on.

For updates, for comments,

for links related to this topic and more,

visit askleo.com/18972.

I'm Leo Notenboom,

and this is askleo.com.

Thanks for watching.
