SQL Server combining LIKE with IN (SELECT LIKE IN query)

System Requirements

  • SQL Server

 

The Problem

This was one of those “it should be obvious” moments.

Dating back to 2004 in its current iteration, the hpcfactor.com membership database has been blighted by SPAM. We are no different to any web community in this regard. A can be a particular problem as many accounts that we may miss are registered and (if they survive) become a valuable commodity to sell as trojan accounts for use in mischief at a later date.

The moderation team at HPC:Factor does a great job of ensuring that these ‘pay per post’ (usually from Indian IP’s) urchins are wasting their time, but it does not stop them from trying. At one point we were having to content with several hundred a day, and had to shutdown registration completely. A custom layer that I wrote to sit on-top of the BBS registration system has done a good job to convert that SPAM torrent into a trickle of a “couple of times a month”.

Over the years, I have added more and more banned email domains to a ban list that is maintained in the user registration database. It occurred to me that as that list has increased, the historic pool of user account has never been re-parsed against the updated banned domains list.

Sounds simple right? Fire up SQL Server Management Studio and create a SQL SELECT LIKE IN query… is it that simple?

 

The Solution

Yes actually it is.

The obvious logic that you want to follow is to take this syntax

SELECT id, username, emailaddress
FROM members
WHERE emailaddress IN (
   SELECT emailaddress FROM bannedemails
)

… and merge it with a LIKE statement to create a SELECT LIKE IN query, something like

SELECT id, username, emailaddress
FROM members
WHERE emailaddress LIKE IN (
   SELECT '%' + emailaddress FROM bannedemails
)

Unfortunately however this syntax is completely illegal.

The answer is to perform the LIKE/IN as part of an INNER JOIN

SELECT members.id, members.username, members.emailaddress
FROM members
INNER JOIN bannedemails ON members.emailaddress LIKE '%' + bannedemails.emailaddress

 

Performance

This is a not a “real time” friendly SQL expression. The inefficiency of using LIKE evaluation becomes very apparent as the record set scales. The HPC:Factor database had 2,271 banned email domains and 30,336 user. Giving a total record iteration and evaluation count of 68,893,056 rows.

The total query execution time for this on was 2 minutes 13 seconds (ouch). Having said that, on reviewing why it took that long, I realised that at some point in time I was being extremely stingy with the SQL Server VM. It only had access to 2 CPU cores and 12 GB of RAM. The hypervisor server itself has access to a 10 core hyper-threaded Xeon.

 

The Result

I have a lot of respect and sympathy for people who use VPNs and temporary email addresses and I genuinely understand why they feel the need (or in some cases must, depending on what part of the world you are in).

I saying that, I also have to deal with the issue from the other side of the equation. If the website is having a bad day, it’s usually because of something coming from one or both of these camps. As ever in life, a minority will insist on spoiling a good thing for everyone else.

After running the query, we had only missed 8 user accounts – credit to the moderation team. The oldest account was 26/09/2006 and my favourite that we probably should not have missed was “mrputin” from January 2013.