many overload messages #19
-
The current DDoS attack is primarily designed to shut your system down, either by pushing RAM usage to the maximum or by filling up your available ports for outgoing connections with a large number of concurrent connections. The NTor drops are a side effect of Tor not being able to handle them; the only way to get rid of them is to increase the number of CPUs until Tor is patched to handle the connections better.

I have been running and tweaking these rules for quite a while now. I believe the 12-hour timeout is a good balance between an aggressive and a lax approach: each time an IP is released from your block list, you're giving it another chance to make two more connections. I'm not sure if you also applied the tweaks I mentioned for your sysctl.conf, but all of them combined will allow your system to run for a long time without having to reboot. It will run at steady RAM and CPU usage, with occasional RAM spikes that soon recover. Ignore the NTor drops and let the system run.

Once you get the HSDir flag, which takes 4 days, the attacks will noticeably increase. My relays have been running for the past 30 days with very steady RAM usage and they still show as green, although the attacks have increased severely in the past few days: the number of blocked IP addresses has doubled, and my relays have shown as overloaded 3 times in the past 4 days. But they go back to green within 6 to 12 hours.
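For reference, a minimal sysctl.conf sketch in that spirit; the keys are real Linux tunables, but the selection and values are my assumptions, not necessarily the exact tweaks referenced above:

    # /etc/sysctl.conf -- illustrative values, tune for your own hardware
    # Widen the ephemeral port range for outgoing connections
    net.ipv4.ip_local_port_range = 1025 65535
    # Release closed connections faster so ports become reusable sooner
    net.ipv4.tcp_fin_timeout = 20
    # Let the kernel track far more concurrent connections
    net.netfilter.nf_conntrack_max = 1048576
    # Larger accept backlog for bursts of new connections
    net.core.somaxconn = 4096

Apply with sysctl -p and keep an eye on RAM, since a bigger conntrack table costs memory.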
-
I have applied all the changes in sysctl.conf and set DoSConnectionEnabled 1. However, this is not very efficient. Do you have any experience with optimizing the network adapter further?
-
The NTor drops, as I mentioned, are a problem with Tor not being able to handle the load. It's not that it won't receive the packets; it receives them but can't process them. The only way to mitigate that, up to a point, is to increase the number of CPUs. The iptables rules are designed to make sure your server runs for a long time at a manageable, steady RAM and CPU usage, but they can't solve the application's shortcomings. Search for "Enabling Multi-Queue on Network Devices", but chances are it's already enabled: ethtool -l eth0, or whatever your network interface is.
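If you want to check, a quick sketch assuming the interface is eth0 (substitute yours):

    # Show how many RX/TX queues the NIC supports and how many are active
    ethtool -l eth0
    # If "Combined" is below the pre-set maximum, raise it, e.g. to 4 queues
    sudo ethtool -L eth0 combined 4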
-
@ToterEngel the attacks seem to have calmed down a bit. How's your system performing now?
-
The CPU load has dropped a bit since yesterday. However, the relay is still displayed as overloaded.
-
What are the specs of your server, e.g. RAM, CPU, MaxAdvertisedBandwidth? Also, looking at your logs, what is the percentage of your NTor drops?
-
I have an i7-7700 with 64 GB of RAM. At least the overload is now only slight.
-
I find my NTor stats vary wildly: sometimes it's a tiny amount and sometimes it's in double figures. I have an N5105 with 8 GB RAM dedicated, and a 200 Mbit connection dedicated to this purpose. But you are right, as soon as I get the HSDir flag things go nuts DDoS-wise. I think removing the DirPort may have been intended to spread the load amongst all relays to try and mitigate this. I was worried that my CPU was not fast enough, but an N5105 with AES-NI and 8 GB RAM should be plenty. I mean, this thing can run graphical Windows, so a tiny Linux kernel, some network routing, and TLS offloading should be easy, right?
-
It varies a lot for me too. In the last few days, however, the nature of the attack has changed somewhat. The number of blocked relays has dropped from 30-40 to 12.
-
I am currently blocking 85 relays. In the tor-ddos hash I have 3,061 blocked addresses, and it's creeping up. I have reduced my bandwidth rate and burst a bit to see what happens.
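For anyone comparing numbers, a quick way to count entries, assuming your set is also named tor-ddos:

    # Print the set header, including the "Number of entries" line
    sudo ipset list tor-ddos -terse
    # Or count the member lines directly
    sudo ipset list tor-ddos | grep -c '^[0-9]'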
-
My relay has a CPU without AES-NI (it's a CPU from 2008) and it always gets the overloaded status. It has 4 cores and 4 GB RAM. Hopefully the anti-DDoS script will help to avoid those issues.
-
What bandwidth are you donating?
-
I'm donating 8 MB/s maximum, but it doesn't reach that much bandwidth. It's 5 or 6 MB/s on average.
-
I am offering 12 MBytes per second with a burst of 18 MBytes. I do think that for the level you are offering you should have an AES-NI capable CPU; it makes a big difference.
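An easy way to check for AES-NI on Linux:

    # Prints "aes" if the CPU advertises the AES-NI instruction set
    grep -m1 -o 'aes' /proc/cpuinfo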
-
The attacks have increased since yesterday and are going full force now; that's why the blocked numbers have gone up. My blocked IPs are generally in the vicinity of 600, with about 20-25 relays among them. I just checked one of my relays and I have 4,096 guests. It will pass. My two relays are assigned 12 vCPUs each, with a MaxAdvertisedBandwidth of 20 MiB (160 Mbps) each. They've been running for almost 12 days with green status, but one of them just turned yellow a few hours ago. The other one is still green. The overloaded one generally goes back to green within one or two heartbeats, rarely longer.
-
45 megabits as in 5.6 MiB? Your 8-vCPU processor should easily be able to handle that. I have a feeling there's some other kind of problem you're dealing with which I can't pinpoint. The last time I rebooted for updates, the system had been running for about 25 days without a single NTor drop in the logs, and as I said, I'm running them at a MaxAdvertisedBandwidth of 160 Mbits, or 20 MiB. The way we are throttling the connections, by only allowing 4 connections per IP, is the best reasonable compromise we can make without putting too many legitimate IPs in the block list. The same goes for the 12-hour limit on the block list: a shorter expiry puts more pressure on your system, as the IPs get released earlier and your system has to deal with them all over again.
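For illustration, a minimal ipset/iptables sketch of that scheme; the set name, the ORPort of 443, and the exact numbers are assumptions for the example, not the project's actual rules:

    # Block list whose entries expire after 12 hours (43200 seconds)
    sudo ipset create tor-ddos hash:ip timeout 43200 -exist
    # Drop anything already on the block list
    sudo iptables -A INPUT -p tcp --dport 443 -m set --match-set tor-ddos src -j DROP
    # Add any IP that opens more than 4 concurrent connections to the ORPort
    sudo iptables -A INPUT -p tcp --dport 443 --syn \
        -m connlimit --connlimit-mask 32 --connlimit-above 4 \
        -j SET --add-set tor-ddos src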
-
I meant 45 MBytes -> 360 Mbit, and with 2 relays on the machine the Gbit port is usually well utilized. I let through between 7 and 10 TB of traffic at Hetzner every day. But as previously written, without attacks on the network it also runs with more bandwidth without drops.
-
Ah, now we're talking. Yes, that's a lot of traffic. I would definitely increase NumCPUs to at least 16 in each torrc file, or even higher. Worker threads are responsible for decrypting onionskins, and Tor decides to keep or drop them based on how long it takes for a worker to become available. By increasing NumCPUs you're creating worker threads that are available when needed; until they're needed they use zero CPU, and when they are needed each uses something like 1% CPU.
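In torrc that's a one-line change; 16 here is just the starting point suggested above:

    # /etc/tor/torrc -- size of the onionskin worker thread pool
    NumCPUs 16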
-
To add to the above comment: I'd start at 16, increase gradually, and watch the NTor percentage in the logs. In my case, at 4 CPUs I had about 36% drops. At 8, it went down to 1.5-2.6%. At 12 they disappeared completely. With your bandwidth you'll need a lot more; play with the numbers to find the magic number. However, I believe you'll have to restart Tor for that to take effect. I don't think a simple -HUP will do the trick, but don't take my word for it. Verify.
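The percentage shows up in Tor's heartbeat lines; a sketch for pulling out the recent ones, assuming the log lives at /var/log/tor/notices.log:

    # Heartbeat reports e.g. "Circuit handshake stats since last time: 0/0 TAP, 5182/5183 NTor"
    grep -i 'handshake stats' /var/log/tor/notices.log | tail -n 5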
-
Yes, you're right.
-
Oh, I'm not sure you need to reboot. Did you try stopping and starting Tor again? That should do the trick.
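On a systemd-based system that would be something like (the unit name can vary by distro):

    # Full stop/start rather than a reload, so NumCPUs is picked up
    sudo systemctl restart tor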
-
In theory, this is enough.
-
Hello everyone. I changed the status of this thread from "issue" to "conversation". Feel free to continue the discussion here.
-
I have now increased NumCPUs to 20 each. So far there haven't been any further drops.
-
So after a few days of testing, I have to say that it still works best with NumCPUs at 16. I'm noticing this clearly at the moment, because my network card has been running at a load of around 900-930 Mbit for hours.
-
Glad you could finally break that spell.
-
With my relays, the type of attack has changed again in the last few hours. The system load often climbs to 6-8, which I haven't seen in the last few months. Another approach to filtering has to be found.
-
The dev branch is now fully merged into main. I will issue a new release later, but for now you can use the scripts in main. All you need is update.sh; no reboot or restart of Tor is necessary. This new method may trap a few more relays in the block list, so I suggest you run the appropriate remove.sh in the cron folder to release them periodically. Example:
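(The script path below is hypothetical; substitute the location from your install.)

    # Hypothetical path -- point this at remove.sh from the repo's cron folder
    * * * * * /opt/tor-ddos/cron/remove.sh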
To remove them every minute. Mine is currently set to every 10 minutes; I may reduce or increase the time interval based on the results.
-
I have now applied the change. Let's wait and see whether the attacks intensify again in the next few days. By the way, I also find the approach of putting all relays on a whitelist more practical than constantly comparing the lists and deleting entries again.
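A sketch of that whitelist idea, assuming an ipset named tor-relays and the public Onionoo API; this is an illustration, not necessarily how the script does it:

    # Load the addresses of all running relays into an allow set
    sudo ipset create tor-relays hash:ip -exist
    curl -s 'https://onionoo.torproject.org/summary?type=relay&running=true' \
      | jq -r '.relays[].a[0]' \
      | while read -r ip; do sudo ipset add tor-relays "$ip" -exist; done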
-
NumCPUs has a maximum of how many cores the machine has, correct?
-
I think the script is well done. At least clear enough that even a noob can understand it.
I have now tested these rules for a day, but with half the blocking time.
The number of incoming connections was also almost halved. However, I still have a lot of overload messages in the log.
Are there any other solutions to optimize this further?