Bots are a massive issue nowadays. They infest basically every social website (even that dumpster fire you call “X”, Elon Musk. By the way, it’s Twitter), flooding people’s DMs with spam, posting cringe and lies to attract any attention, and filling up the comments of people asking for help. Even outside of social media, there are so many bot-made blogs mass-generating garbage copied content that many believe the Dark Forest Web is not an IF but a WHEN. Not to mention the bots made for scraping web pages, generating tons of load on servers.
As a website owner, even though this is a very small corner of the wide web, I too suffered from it to an extent. Bots came to post spam in the comments and created accounts with temporary e-mails to try to access more areas, see if they could exploit something, and so on. So, like anyone would, I decided to fight back, and here is what I learned.
The early part
The problem started around September last year for me. Every time I logged in to my host’s control panel, I noticed a steady increase in resource usage: CPU was going up by about 7-10% every day. My host measures CPU percentage relative to minutes per day used, based on the chosen plan (in my case, 6 minutes = 100%). This meant CPU usage reached 100% in a little over a week, and 200% (12 minutes) in about three, resulting in a few HTTP 500 errors and massive slowdowns.
As soon as I noticed it, I scrambled to find the problem, figuring it had to be a bad script or a very slow function being called whenever a new user visited. After digging through logs and .php files for days, I eventually looked at the statistics and saw that, at least according to the service I used at the time, the majority of my traffic came from bots crawling through every single page and media file. That’s when it clicked: I had never implemented anything to filter out malicious bots.
Fortunately for me, I use Cloudflare, which has a very simple and free toggle that filters out a bunch of those. This is my first tip: enable Bot Fight Mode (or the equivalent filter) on whatever CDN you use. On its own, this tool blocks an average of 200+ bot requests per day (and bots tend to stop trying to access the website as a whole, for a time, when blocked from loading a page). This helped a lot, but wasn’t quite enough.
The server load was still quite high, and the statistics indicated that a commonly vulnerable WordPress page was being requested the most: xmlrpc.php. That file dates back to when internet connections were slow and unstable, and people used XML-RPC apps to submit posts to their blogs. It is useless to me, and to the vast majority of people, so the common recommendation is to disable it. I went one step further and blocked access to it entirely using Cloudflare. Second tip: make sure you don’t have unnecessary endpoints active.
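I did the blocking at the Cloudflare level, but if you don’t use a CDN, the same effect can be sketched server-side. For example, on Apache an `.htaccess` rule like this would deny the file outright (a config sketch, not what I actually deployed):

```apache
# Deny all access to xmlrpc.php (Apache 2.4+ syntax)
<Files "xmlrpc.php">
    Require all denied
</Files>
```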
As an additional measure, which never really came into play for me but is always a good idea: rate limit any public APIs and the login page. In my case, every API I created has a rate limit of about 15 requests every 10 seconds per IP address, set up, again, on Cloudflare to avoid any extra load on my server.
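If you’d rather enforce that policy on the server itself instead of the CDN, the idea boils down to a sliding-window counter per IP. A minimal sketch in Python (the class and names are my own for illustration, not code from the site):

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window limiter: allow `limit` requests per `window` seconds per key."""

    def __init__(self, limit=15, window=10.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if this request is within the limit, recording it if so."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

A request handler would call `allow(client_ip)` before doing any real work and return HTTP 429 when it comes back `False`.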
The worst ones
The previous actions, combined with script optimizations and better caching, were enough to lower the load again and put me at ease. Sadly, as they say here in Brazil, “poor people’s happiness doesn’t last long”: a month or so after the previous issue was solved, a new one came up: bots were posting spam comments. “Well, that’s an easy fix,” I thought, restricting comments to logged-in users only. Unfortunately, this just led to bots creating accounts first and continuing all the same.
My first idea here was probably the same as everyone else’s in this situation: add a captcha so they can’t register anymore. At the time I didn’t know about Turnstile, which has better privacy policies, so I went with Google’s reCAPTCHA. According to their console, as of right now there have been 155 attempts that scored 0.3/1 or below and were therefore blocked. That’s 155 potential bots, each of which could have created hundreds of spam comments/posts in the forums, blocked in three months.
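For context, reCAPTCHA v3 hands your server a score between 0 and 1 after you verify the token, and it’s up to you to pick a cutoff. The decision itself is trivial; here is a sketch of it (the function name is hypothetical, and the dict stands in for the parsed JSON that Google’s `siteverify` endpoint returns):

```python
def passes_recaptcha(siteverify_response, threshold=0.3):
    """Decide from a parsed reCAPTCHA v3 siteverify response (a dict).

    The API returns at least {"success": bool, "score": float};
    anything scoring at or below the threshold gets rejected.
    """
    return (siteverify_response.get("success", False)
            and siteverify_response.get("score", 0.0) > threshold)
```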
However, the internet is a battlefield. Some people will constantly try to do bad things, while others try to combat and defend against them. Once I got rid of most of the automated account creation, the people behind some of these bots started manually creating new accounts for them, which then spammed again.
All the accounts had a common pattern: the name was a random word or short sentence without spaces followed by about 4 numbers, and the e-mail was temporary. New ones were made every day, with a different e-mail provider each time, which made things difficult. My first attempt at blocking these involved digging up a massive list of throwaway e-mail providers and checking new registration attempts against it. The method was slow but worked for a couple of days, until whoever was behind it found a provider that wasn’t on the list, so I had to get more creative.
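If you wanted to flag that naming pattern automatically, a simple regex heuristic would do it. This is just an illustration of the pattern described above, not something I actually ran, and like any heuristic it would also catch some legitimate names:

```python
import re

# Letters only (no spaces), followed by roughly four digits,
# e.g. "coolgadget4821" -- the shape the spam accounts all shared.
SUSPICIOUS_NAME = re.compile(r"^[A-Za-z]+\d{3,5}$")
```

A match alone shouldn’t block a registration, but it could feed into a score alongside the e-mail checks below.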
After researching a bunch about the topic of temporary e-mails, I came across some interesting things:
- Temp e-mail providers usually don’t have an MX record for their domain
- They also normally don’t publish an SPF policy (a TXT record starting with “v=spf1”)
So the solution was simple: first, check whether the domain of the e-mail used for registration has an MX record, then an SPF record, and if it has both, check it against the previously mentioned list. If everything is green and the captcha was solved just fine, hooray! It’s probably a person making an account for themselves and not a bot!
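The check chain above can be sketched like this. My actual code is PHP; this Python version is only an illustration, with the DNS lookups injected as functions (in production they would come from a resolver library such as dnspython) so the validation logic itself stays testable offline:

```python
def is_probably_real_email(domain, get_mx, get_txt, blocklist):
    """Return True only if the e-mail domain passes all three checks.

    get_mx(domain)  -> list of MX hosts for the domain
    get_txt(domain) -> list of TXT record strings for the domain
    blocklist       -> set of known throwaway-provider domains
    """
    # 1. A legitimate mail domain should publish at least one MX record.
    if not get_mx(domain):
        return False
    # 2. It should also publish an SPF policy (a TXT record starting "v=spf1").
    if not any(txt.startswith("v=spf1") for txt in get_txt(domain)):
        return False
    # 3. Finally, it must not be on the known throwaway-provider list.
    return domain.lower() not in blocklist
```

Only registrations that pass this check (and the captcha) go through.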
That was it! Once the code was running on the server, bots and fake accounts stopped showing up (at least for the past 5 months), and the issue seems fixed. On a side note, the whole thing didn’t even upset me; it was actually very entertaining. Despite the higher load, I don’t pay for usage but a fixed yearly amount, so I wasn’t badly affected by any of it, and I ended up learning quite a bit about temporary e-mails, bot behaviour and how to prevent it, all while this website is still small, which means no one was affected by the spam (mind you, it mostly wasn’t even going through; I just had to manually delete a bunch of comments stuck in moderation). The entire experience was, honestly, fun!