• 1 Post
  • 7 Comments
Joined 5 months ago
Cake day: June 30th, 2025


  • I found the reason: somehow setting --max_num_seqs 1 makes vLLM way more efficient.

    Not sure exactly what it does, but I think it's because vLLM reserves memory for batching many requests at once, while the API I was using with ExLlamaV3 doesn't.

    Now I'm doing 100k with vLLM too:

    (Worker_TP0_EP0 pid=99695) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
    (Worker_TP1_EP1 pid=99696) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,264 tokens
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,328 tokens
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
    
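    For reference, a launch command along these lines should give this setup. It's a sketch rather than my exact invocation: the model name is a placeholder, and the tensor-parallel size is inferred from the TP0/TP1 workers in the log; the flags themselves are standard vLLM options.

    # model name below is a placeholder, not necessarily what I'm running
    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --tensor-parallel-size 2 \
        --max-model-len 100000 \
        --max-num-seqs 1 \
        --gpu-memory-utilization 0.90
    # --max-num-seqs 1: schedule one sequence at a time, so less memory is
    # reserved for batch activations and more is left over for KV cache
    # --gpu-memory-utilization: a single fraction applied to every GPU;
    # there is no per-GPU GB setting like ExLlamaV3 has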

    I would say ExLlamaV3 is still slightly more efficient, but this explains the huge discrepancy. ExLlamaV3 also allows setting GB per GPU, which gets me a few more GB than vLLM, which spreads the allocation evenly even though a bunch of memory on GPU 0 is used for other stuff.

    As for the T/s, it's about the same, in the 80-100 range. This is what I'm getting with vLLM:

    (APIServer pid=99454) INFO 11-03 17:36:31 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
    (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
    (APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%
    [... repeated tool-parser / chat-completion / tokenize lines trimmed ...]
    (APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%

    Now that I've found this out I've switched back to vLLM, because the API I'm using with ExLlamaV3 doesn't support Qwen3 tool calls yet :(
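    For anyone else doing tool calling: those Qwen3CoderToolParser lines come from vLLM's built-in parser, and enabling it should look roughly like this (parser name taken from the qwen3coder_tool_parser.py in the log; model name is again a placeholder):

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder
    # --enable-auto-tool-choice lets the server emit structured tool calls;
    # --tool-call-parser selects the model-specific parser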

  • Yeah, I should have specified "for at home" when saying it's a scam; I honestly doubt the companies buying thousands of B200s for datacenters are even looking at the price tags lmao.

    Anyway, the end goal is to run something like Qwen3-235B at fp8. With some very rough napkin math (broken down below), ~300 GB of VRAM from the cheapest option, the 9060 XT, comes to €7126 with 18 cards, which is very affordable. But of course the fact that this is theoretically possible doesn't mean it will actually work in practice, which is what I'm curious about.
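    Spelling that napkin math out (a rough sketch: fp8 taken as 1 byte per parameter, KV cache and per-card overhead mostly ignored, and the per-card price inferred from the quoted total):

    235B params × 1 byte (fp8)      ≈ 235 GB of weights alone
    weights + KV cache + headroom   ≈ 300 GB VRAM target
    18 × 16 GB (RX 9060 XT)         = 288 GB, just under the target
    €7126 / 18 cards                ≈ €396 per card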

    The inference engine I'm using, vLLM, supports ROCm, so CUDA should not be strictly required.