many overload messages #19
-
The current DDoS attack is primarily designed to shut your system down, either by pushing RAM usage to the maximum or by filling up your available ports for outgoing connections with a large number of concurrent connections. The NTor drops are a side effect of Tor not being able to handle them; the only way to get rid of them is to increase the number of CPUs until Tor is patched to handle the connections better.

I have been running and tweaking these rules for quite a while now. I believe the 12-hour timeout is a good balance between an aggressive and a lax approach: each time an IP is released from your block list, you're giving it another chance to make two more connections. I'm not sure if you also applied the tweaks I mentioned for your sysctl.conf, but all of them combined will allow your system to run for a long time without having to reboot. It will run at steady RAM and CPU usage, with occasional RAM spikes that soon recover. Ignore the NTor drops and let the system run.

Once you get the HSDir flag, which takes 4 days, the attacks will noticeably increase. My relays have been running for the past 30 days with very steady RAM usage and they still show as green, although the attacks have increased severely in the past few days: the number of blocked IP addresses has doubled, and my relays have shown as overloaded 3 times in the past 4 days. But they go back to green within 6 to 12 hours.
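For reference, a minimal sysctl.conf sketch in that spirit; the keys are real Linux tunables, but the selection and values are my assumptions, not necessarily the exact tweaks referenced above:

    # /etc/sysctl.conf -- illustrative values, tune for your own hardware
    # Widen the ephemeral port range for outgoing connections
    net.ipv4.ip_local_port_range = 1025 65535
    # Release closed connections faster so ports become reusable sooner
    net.ipv4.tcp_fin_timeout = 20
    # Let the kernel track far more concurrent connections
    net.netfilter.nf_conntrack_max = 1048576
    # Larger accept backlog for bursts of new connections
    net.core.somaxconn = 4096

Apply with sysctl -p and keep an eye on RAM, since a bigger conntrack table costs memory.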
-
I have applied all the changes in sysctl.conf and set DoSConnectionEnabled 1. However, this is not very efficient. Do you have any experience with optimizing the network adapter further?
-
The NTor drops, as I mentioned, are a problem with Tor not being able to handle the load. It's not that it won't receive the packets; it receives them but can't process them. The only way to mitigate that, up to a point, is to increase the number of CPUs. The iptables rules are designed to make sure your server runs for a long time at a manageable, steady RAM and CPU usage, but they can't solve the application's shortcomings. Search for "Enabling Multi-Queue on Network Devices", but chances are it's already enabled: ethtool -l eth0, or whatever your network interface is.
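If you want to check, a quick sketch assuming the interface is eth0 (substitute yours):

    # Show how many RX/TX queues the NIC supports and how many are active
    ethtool -l eth0
    # If "Combined" is below the pre-set maximum, raise it, e.g. to 4 queues
    sudo ethtool -L eth0 combined 4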
-
@ToterEngel the attacks seem to have calmed down a bit. How's your system performing now?
-
The CPU load has dropped a bit since yesterday. However, the relay is still displayed as overloaded.
-
What are the specs of your server, e.g. RAM, CPU, MaxAdvertisedBandwidth? Also, looking at your logs, what is the percentage of your NTor drops?
-
I have an i7-7700 with 64 GB of RAM. At least the overload is now only slight.
-
I find my NTor stats vary wildly: sometimes it's a tiny amount and sometimes it's in double figures. I have an N5105 with 8 GB RAM dedicated, and a 200 Mbit connection dedicated to this purpose. But you are right, as soon as I get the HSDir flag things go nuts DDoS-wise. I think removing the DirPort may have been intended to spread the load amongst all relays to try and mitigate this. I was worried that my CPU was not fast enough, but an N5105 with AES-NI and 8 GB RAM should be plenty. I mean, this thing can run graphical Windows, so a tiny Linux kernel, some network routing, and TLS offloading should be easy, right?
-
It varies a lot for me too. In the last few days, however, the nature of the attack has changed somewhat. The number of blocked relays has dropped from 30-40 to 12.
-
I am currently blocking 85 relays. In the tor-ddos hash I have 3,061 blocked addresses, and it's creeping up. I have reduced my bandwidth rate and burst a bit to see what happens.
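For anyone comparing numbers, a quick way to count entries, assuming your set is also named tor-ddos:

    # Print the set header, including the "Number of entries" line
    sudo ipset list tor-ddos -terse
    # Or count the member lines directly
    sudo ipset list tor-ddos | grep -c '^[0-9]'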
-
My relay has a CPU without AES-NI (it's a CPU from 2008) and it always gets the overloaded status. It has 4 cores and 4 GB RAM. Hopefully the anti-DDoS script will help to avoid those issues.
-
What bandwidth are you donating?
-
I'm donating 8 MB/s maximum, but it doesn't reach that much bandwidth. It's 5 or 6 MB/s on average.
-
I am offering 12 MBytes per second with a burst of 18 MBytes. I do think that for the level you are offering you should have an AES-NI capable CPU; it makes a big difference.
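An easy way to check for AES-NI on Linux:

    # Prints "aes" if the CPU advertises the AES-NI instruction set
    grep -m1 -o 'aes' /proc/cpuinfo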
-
The attacks have increased since yesterday and are going full force now; that's why the blocked numbers have gone up. My blocked IPs are generally in the vicinity of 600, with about 20-25 relays among them. I just checked one of my relays and I have 4,096 guests. It will pass. My two relays are assigned 12 vCPUs each, with a MaxAdvertisedBandwidth of 20 MiB (160 Mbps) each. They've been running for almost 12 days with green status, but one of them just turned yellow a few hours ago. The other one is still green. The overloaded one generally goes back to green within one or two heartbeats, rarely longer.
-
45 megabits as in 5.6 MiB? Your 8-vCPU processor should easily be able to handle that. I have a feeling there's some other kind of problem you're dealing with which I can't pinpoint. The last time I rebooted for updates, the system had been running for about 25 days without a single NTor drop in the logs, and as I said, I'm running them at a MaxAdvertisedBandwidth of 160 Mbits, or 20 MiB. The way we are throttling the connections, by only allowing 4 connections per IP, is the best reasonable compromise we can make without putting too many legitimate IPs in the block list. The same goes for the 12-hour limit on the block list: a shorter expiry puts more pressure on your system, as the IPs get released earlier and your system has to deal with them all over again.
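For illustration, a minimal ipset/iptables sketch of that scheme; the set name, the ORPort of 443, and the exact numbers are assumptions for the example, not the project's actual rules:

    # Block list whose entries expire after 12 hours (43200 seconds)
    sudo ipset create tor-ddos hash:ip timeout 43200 -exist
    # Drop anything already on the block list
    sudo iptables -A INPUT -p tcp --dport 443 -m set --match-set tor-ddos src -j DROP
    # Add any IP that opens more than 4 concurrent connections to the ORPort
    sudo iptables -A INPUT -p tcp --dport 443 --syn \
        -m connlimit --connlimit-mask 32 --connlimit-above 4 \
        -j SET --add-set tor-ddos src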
-
I meant 45 MBytes -> 360 Mbit, and with 2 relays on the machine the Gbit port is usually well utilized. I let through between 7 and 10 TB of traffic at Hetzner every day. But as previously written, without attacks on the network it also runs with more bandwidth without drops.
-
Ah, now we're talking. Yes, that's a lot of traffic. I would definitely increase NumCPUs to at least 16 in each torrc file, or even higher. Worker threads are responsible for decrypting onionskins, and Tor decides to keep or drop them based on how long it takes for a worker to become available. By increasing NumCPUs you're creating worker threads that are available when needed; until they're needed they use zero CPU, and when they are needed each uses something like 1% CPU.
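In torrc that's a one-line change; 16 here is just the starting point suggested above:

    # /etc/tor/torrc -- size of the onionskin worker thread pool
    NumCPUs 16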
-
To add to the above comment: I'd start at 16, increase gradually, and watch the NTor percentage in the logs. In my case, at 4 CPUs I had about 36% drops. At 8, it went down to 1.5-2.6%. At 12 they disappeared completely. With your bandwidth you'll need a lot more; play with the numbers to find the magic number. However, I believe you'll have to restart Tor for that to take effect. I don't think a simple -HUP will do the trick, but don't take my word for it. Verify.
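The percentage shows up in Tor's heartbeat lines; a sketch for pulling out the recent ones, assuming the log lives at /var/log/tor/notices.log:

    # Heartbeat reports e.g. "Circuit handshake stats since last time: 0/0 TAP, 5182/5183 NTor"
    grep -i 'handshake stats' /var/log/tor/notices.log | tail -n 5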
-
Yes, you're right.
-
Oh, I'm not sure you need to reboot. Did you try stopping and starting Tor again? That should do the trick.
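On a systemd-based system that would be something like (the unit name can vary by distro):

    # Full stop/start rather than a reload, so NumCPUs is picked up
    sudo systemctl restart tor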
-
In theory, this is enough.
-
Hello everyone. I changed the status of this thread from "issue" to "conversation". Feel free to continue the discussion here.
-
I have now increased NumCPUs to 20 each. So far there haven't been any further drops.
-
So after a few days of testing, I have to say that it still works best with NumCPUs at 16. I'm noticing this clearly at the moment, because my network card has been running at a load of around 900-930 Mbit for hours.
-
Glad you could finally break that spell.
-
With my relays, the type of attack has changed again in the last few hours. The system load often climbs to 6-8, which I haven't seen in the last few months. Another approach to filtering has to be found.
-
The dev branch is now fully merged into main. I will issue a new release later, but for now you can use the scripts in main. All you need is update.sh; no reboot or restart of Tor is necessary. This new method may trap a few more relays in the block list, so I suggest you run the appropriate remove.sh in the cron folder to release them periodically. Example:
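(The script path below is hypothetical; substitute the location from your install.)

    # Hypothetical path -- point this at remove.sh from the repo's cron folder
    * * * * * /opt/tor-ddos/cron/remove.sh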
To remove them every minute. Mine is currently set to every 10 minutes; I may reduce or increase the time interval based on the results.
-
I have now applied the change. Let's wait and see whether the attacks intensify again in the next few days. By the way, I also find the approach of putting all relays on a whitelist more practical than constantly comparing the lists and deleting entries again.
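A sketch of that whitelist idea, assuming an ipset named tor-relays and the public Onionoo API; this is an illustration, not necessarily how the script does it:

    # Load the addresses of all running relays into an allow set
    sudo ipset create tor-relays hash:ip -exist
    curl -s 'https://onionoo.torproject.org/summary?type=relay&running=true' \
      | jq -r '.relays[].a[0]' \
      | while read -r ip; do sudo ipset add tor-relays "$ip" -exist; done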
-
NumCPUs has a maximum of how many cores the machine has, correct?
-
I think the script is well done. At least clear enough that even a noob can understand it.
I have now tested these rules for a day, but with half the blocking time.
The number of incoming connections was also almost halved. However, I still have a lot of overload messages in the log.
Are there any other solutions to optimize this further?