• 1 Post
  • 7 Comments
Joined 5 months ago
Cake day: June 30th, 2025


  • I found the reason: somehow setting --max_num_seqs 1 makes vLLM way more efficient.

    Not sure exactly what it does, but I think it's because vLLM reserves memory for batching many requests at once, while the API I was using with ExLlamaV3 doesn't.

    Now I'm doing 100k with vLLM too:

    (Worker_TP0_EP0 pid=99695) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
    (Worker_TP1_EP1 pid=99696) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,264 tokens
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,328 tokens
    (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
    
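    For reference, a launch command along these lines should give this setup. It's a sketch rather than my exact invocation: the model name is a placeholder, and the tensor-parallel size is inferred from the TP0/TP1 workers in the log; the flags themselves are standard vLLM options.

    # model name below is a placeholder, not necessarily what I'm running
    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --tensor-parallel-size 2 \
        --max-model-len 100000 \
        --max-num-seqs 1 \
        --gpu-memory-utilization 0.90
    # --max-num-seqs 1: schedule one sequence at a time, so less memory is
    # reserved for batch activations and more is left over for KV cache
    # --gpu-memory-utilization: a single fraction applied to every GPU;
    # there is no per-GPU GB setting like ExLlamaV3 has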

    I would say ExLlamaV3 is still slightly more efficient, but this explains the huge discrepancy. ExLlamaV3 also allows setting GB per GPU, which gets me a few more GB than vLLM, which spreads the allocation evenly even though a bunch of memory on GPU 0 is used for other stuff.

    As for the T/s, it's about the same, in the 80-100 range. This is what I'm getting with vLLM:

    (APIServer pid=99454) INFO 11-03 17:36:31 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
    (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
    (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
    (APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%
    [... repeated tool-parser / chat-completion / tokenize lines trimmed ...]
    (APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%

    Now that I've found this out I've switched back to vLLM, because the API I'm using with ExLlamaV3 doesn't support Qwen3 tool calls yet :(
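    For anyone else doing tool calling: those Qwen3CoderToolParser lines come from vLLM's built-in parser, and enabling it should look roughly like this (parser name taken from the qwen3coder_tool_parser.py in the log; model name is again a placeholder):

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder
    # --enable-auto-tool-choice lets the server emit structured tool calls;
    # --tool-call-parser selects the model-specific parser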

  • Yeah, I should have specified "for at home" when saying it's a scam; I honestly doubt the companies buying thousands of B200s for datacenters are even looking at the price tags lmao.

    Anyway, the end goal is to run something like Qwen3-235B at fp8. With some very rough napkin math (broken down below), ~300 GB of VRAM from the cheapest option, the 9060 XT, comes to €7126 with 18 cards, which is very affordable. But of course the fact that this is theoretically possible doesn't mean it will actually work in practice, which is what I'm curious about.
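    Spelling that napkin math out (a rough sketch: fp8 taken as 1 byte per parameter, KV cache and per-card overhead mostly ignored, and the per-card price inferred from the quoted total):

    235B params × 1 byte (fp8)      ≈ 235 GB of weights alone
    weights + KV cache + headroom   ≈ 300 GB VRAM target
    18 × 16 GB (RX 9060 XT)         = 288 GB, just under the target
    €7126 / 18 cards                ≈ €396 per card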

    The inference engine I'm using, vLLM, supports ROCm, so CUDA should not be strictly required.