Codeberg: https://codeberg.org/BobbyLLM/llama-conductor


Can an LLM be a thinking tool, not a thinking replacement?

Most LLM tools are trying to replace your thinking. llama-conductor is for people who don’t want that. It’s a harness for using an LLM as a thinking tool - one where you can see the reasoning, trust the sources, and know exactly when it’s guessing.

The human stays in the loop. The model’s limitations are visible.

You decide what to trust.

Which brings us to why that matters.


Lies, damned lies, and LLMs:

LLMs are elegant liars. They sound right whether they are or aren’t. If you have ASD (like me) or just don’t know the domain, you believe them. That’s when you get fucked.

Worse: you can’t see the reasoning. Black box in, black box out. Wrong answer? No idea why. How to fix it? No idea.

Sorry, but that DOES NOT work for me. It doesn’t work with medical data. Research. Thinking. Brainstorming. Anything where “sounds plausible” isn’t good enough.


The trick, William Potter, is not minding that they bullshit:

Most pipelines treat the LLM as the first call. llama-conductor side-eyes it with suspicion.

The model is the last resort, not the first call. Deterministic components fire first - math sanity, state transitions, numeric corrections, memory retrieval - pure Python, before the model ever sees the query. When a solver handles a turn, it passes verified ground truth to the model. When the conversation leaves solver territory, the model takes over. When a new deterministic state appears mid-conversation, tooling fires back up automatically. No seams. No drift. It just stops making shit up on problems that have actual solutions it can work through.
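The dispatch loop is simple to picture. Here's a minimal sketch of the deterministic-first idea - solver and function names are mine, not the actual llama-conductor internals:

```python
# Hypothetical sketch: pure-Python solvers are tried in order; the model
# is only called when no solver claims the turn.

def volume_solver(query):
    """Claims only queries it can verify deterministically; None otherwise."""
    if "ml" in query and "cup" in query:
        return "solved deterministically"
    return None

def call_model(query):
    """Stand-in for the LLM fallback."""
    return "model fallback"

SOLVERS = [volume_solver]

def route(query):
    for solver in SOLVERS:
        answer = solver(query)
        if answer is not None:
            return answer, "Solver"     # verified ground truth
    return call_model(query), "Model"   # last resort, and labelled as such
```

The point of the sketch: the second element of the return value is the provenance label, so the handoff is always visible downstream.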

You: 
I have a 250ml cup. I pour 500ml into it. How much is in the cup?

Model:
Total directed volume is 500 ml. The cup capacity is 250 ml, so the cup contains 250 ml and 250 ml overflows.
Confidence: medium | Source: Contextual

You:
I now take the contents of *that* cup and divide it equally across 3 cups. How much in each cup?

Model:
Equal split amount is 83.33 ml per cup.
Confidence: medium | Source: Contextual


You:
If I drink one of the three cups, how much is left?

Model:
After dividing, each cup holds 83.33 ml. Drinking one leaves 2 × 83.33 ml = 166.66 ml in the remaining two cups.

Confidence: medium | Source: Model

That was done by a 4 billion parameter model using llama-conductor.

There are lots of little transformations like this. Distance, volume, payroll calculations, logic probes (the famous car wash test), scheduling batches. The LLM does what it’s actually good at: language, synthesis, reasoning over verified ground truth. Everything else is handled by something that can’t hallucinate.
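The cup transcript above is exactly the kind of thing a tiny pure-Python transformation handles. A toy version (illustrative, not the shipped code):

```python
def pour(capacity_ml, poured_ml):
    """Deterministic cup arithmetic: contents clamp at capacity, rest overflows."""
    contents = min(capacity_ml, poured_ml)
    overflow = max(0.0, poured_ml - capacity_ml)
    return contents, overflow

def split(volume_ml, n_cups):
    """Equal division, rounded the way the transcript reports it."""
    return round(volume_ml / n_cups, 2)

contents, overflow = pour(250, 500)   # 250 stays in the cup, 250 overflows
per_cup = split(contents, 3)          # 83.33 ml per cup
```

Nothing here can hallucinate; the model only gets to narrate numbers that were computed first.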

PS: Notice turn 3 has Source: Model? That’s the system being transparent about exactly where the handoff happened.


Blah blah blah. Show me something real:

Using the Guardian/Musk article as source, a 4B model with >>scratch produced this:

“List the article’s concrete claims about permit status and turbine operations, each with support.”

  • EPA position: these turbines require permits under the Clean Air Act.
  • Mississippi regulators: portable/mobile classification means temporary permit exemption.
  • Floodlight thermal imagery + public records indicate active turbine usage before/while permit disputes continued.

Confidence: high | Source: Scratchpad

“What is the strongest and weakest inference you can make?”

Strongest: Active regulatory interpretation conflict (federal-vs-state) while turbines operate at scale - directly supported. Weakest: Any claim operations are fully lawful or unlawful - requires court outcomes not present in this file.

Confidence: medium | Source: Scratchpad

“Which claims were later revised or contradicted?”

Disputed/unresolved: Final enforcement outcome. Whether state interpretation holds under federal pressure.

Confidence: high | Source: Scratchpad

Hard grounding. Cited support. Explicit flags on what the source doesn’t contain. A 4B model. Full worked example.


Now watch the system catch itself lying:

>>judge does deterministic pairwise ranking - compares every pair in both directions, parses strict verdicts, aggregates with confidence. Instead of “pick one and vibe,” you get provenance.

You: >>judge [which is healthier overall, prioritize lower sugar 
     and higher fiber] : apple, banana --verbose

[judge] ranking
criterion: [which is healthier overall for daily use, prioritize 
           lower sugar and higher fiber]
1. apple (score=2.00)
2. banana (score=0.00)
Judge confidence: HIGH

The model argued from pre-trained priors and both directions agreed. But what happens when the model doesn’t know?

You: >>judge [which BJJ technique is more dangerous] : kimura, heelhook --verbose

[judge] ranking
criterion: [which BJJ technique is more dangerous]
1. kimura (score=1.00)
2. heelhook (score=1.00)
Judge confidence: LOW

The model picked position B both times - kimura when kimura was B, heelhook when heelhook was B. Positional bias, not evaluation. >>judge catches this because it runs both orderings. Tied scores, confidence: low, full reasoning audit trail in JSONL.

The model was guessing, and the output tells you so instead of sounding confident about a coin flip.
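The both-orderings check is easy to sketch. A toy version - function names are mine, not the >>judge internals:

```python
# Hypothetical sketch: run the judge prompt twice with A/B swapped.
# A verdict that always picks position B is positional bias, not evaluation.

def judge_pair(model, a, b):
    """model(a, b) returns 'A' or 'B'. Score each item over both orderings."""
    scores = {a: 0.0, b: 0.0}
    first = model(a, b)                       # a sits in position A
    scores[a if first == "A" else b] += 1.0
    second = model(b, a)                      # orderings swapped
    scores[b if second == "A" else a] += 1.0
    confidence = "HIGH" if scores[a] != scores[b] else "LOW"
    return scores, confidence

# A positionally biased judge: always answers "B", whatever is asked.
biased = lambda x, y: "B"
scores, conf = judge_pair(biased, "kimura", "heelhook")   # ties 1.0/1.0, LOW
```

A judge with an actual preference survives the swap (2.0 vs 0.0, HIGH); the biased one ties itself and gets flagged.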

Oh, but you want it to argue from an informed position? >>trust walks you through the grounded path: >>scratch your evidence first, then >>judge ranks from that - not model priors. Suddenly your judge has an informed opinion. Weird how that works when you give it something to read.

>>trust [which BJJ technique is safer for beginners]: kimura or heelhook?
A) >>scratch --> you paste your context here
[judge] ranking
criterion: [comparison]
    which bjj technique is safer for beginners; heel hook (score=0.00)
    kimura (score=2.00)

Winner: Which bjj technique is safer for beginners? Kimura

comparisons: 2
Judge confidence: HIGH

If the locked scope can’t support the question, judge fails closed. No fake ranking, no vibes verdict. Ungrounded pass? It tells you that too. You always know which one you’re getting.


The data: 8,974 runs across five model families. Measured. Reproducible. No “trust me bro.”

The core stack went through iterative hardening - rubric flags dropped from 3.3% → 1.4% → 0.2% → floor 0.00%. Post-policy: 1,864 routed runs, 0 flags, 0 retries. Both models, all six task categories, both conditions. Policy changes only - no model retraining, no fine-tuning. Then I did it three more times. Because apparently I like pain.

These aren’t softball prompts. I created six question types specifically to break shit:

  • Reversal: flip the key premise after the model commits. Does it revise, or cling?
  • Theory of mind: multiple actors, different beliefs. Does it keep who-knows-what straight?
  • Evidence grading: mixed-strength support. Does it maintain label discipline or quietly upgrade?
  • Retraction: correction invalidates an earlier assumption. Does it update or keep reasoning from the dead premise?
  • Contradiction: conflicting sources. Does it detect, prioritise, flag uncertainty - or just pick one?
  • Negative control: insufficient evidence by design. The only correct answer is “I don’t know.”

Then I stress-tested across three families it was never tuned for - Granite 3B, Phi-4-mini, SmolLM3. They broke. Of course.

But the failures weren’t random - they clustered in specific lanes under specific conditions, and the dominant failure mode was contract-compliance gaps (model gave the right answer in the wrong format), not confabulation. Every one classifiable and diagnosable. Surgical lane patch → 160/160 clean.

That’s the point of this thing. Not “zero errors forever” - auditable error modes with actionable fixes, correctable at the routing layer without touching the model. Tradeoffs documented honestly. Raw data in repo. Every failure taxonomized.

Trust me bro? Fuck that - go reproduce it. I’m putting my money where my mouth is and working on submitting this for peer review.

See: prepub/PAPER.md


What’s in the box:

Footer - Every answer gets a router-assigned footer: Confidence: X | Source: Y. Not model self-confidence. Not vibes. Source = where the answer came from (model fallback, grounded docs, scratchpad, locked file, Vault, Wiki, cheatsheet, OCR). Confidence = how much verifiable support exists. Fast trust decision: accept, verify, or provide lockable context.
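In shape, it's just a deterministic mapping from provenance to a label. A deliberately oversimplified sketch - the real router mixes in more signals (the transcripts above show medium-confidence Scratchpad answers), and these mappings are mine, not the shipped ones:

```python
# Illustrative only: confidence derived from where the answer came from,
# never from the model's self-report.

CONFIDENCE_BY_SOURCE = {
    "Scratchpad": "high",     # user-locked text
    "Cheatsheets": "high",    # verified user facts
    "Web": "medium",          # scored external evidence
    "Model": "low",           # unverified fallback
}

def footer(source):
    conf = CONFIDENCE_BY_SOURCE.get(source, "low")
    return f"Confidence: {conf} | Source: {source}"
```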

KAIOKEN - live register classifier. Every human turn is macro-labelled (working / casual / personal) with subsignal tags (playful / friction / distress_hint / etc.) before the model fires. A validated, global decision tree - not LoRA or vibes - assigns tone constraints from classifier output. Validated against 1,536 adversarial probe executions, 3/3 pass required per probe. End result: your model stops being a sycophant. It might tell you to go to bed. It won’t tell you “you’re absolutely right!” when what you really need is a kick in the arse.

Cheatsheets - drop a JSONL file, terms auto-match on every turn, verified facts injected before generation. Miss on an unknown term? Routes to >>wiki instead of letting the model guess. Source: Cheatsheets in the footer. Your knowledge, your stack, zero confabulation on your own specs.
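The matching step is mechanical. A minimal sketch of the idea, with hypothetical field names and a toy entry (not the actual cheatsheet schema):

```python
import json

# One-line-per-entry JSONL, as described above. Toy entry for illustration.
CHEATSHEET = json.dumps({
    "term": "meat popsicle",
    "definition": "Fifth Element (1997) reference; verified user-supplied fact.",
})

def load_cheatsheet(jsonl_text):
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return {e["term"].lower(): e for e in entries}

def inject_facts(turn, sheet):
    """Match terms in the user turn; prepend verified facts before generation."""
    hits = [e["definition"] for term, e in sheet.items() if term in turn.lower()]
    if hits:
        prompt = "Verified facts:\n" + "\n".join(hits) + "\n\n" + turn
        return prompt, "Cheatsheets"
    return turn, None   # miss -> route elsewhere instead of letting the model guess

sheet = load_cheatsheet(CHEATSHEET)
```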

Vodka - deterministic memory pipeline. !! store is SHA-addressed and verbatim. ?? recall retrieves deterministically, bypasses model entirely. What you said is what comes back - no LLM smoothing, no creative reinterpretation. Without this? Your model confidently tells you your server IP is 127.0.0.1. Ask me how I know.
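The store/recall contract fits in a dozen lines. A sketch of the idea - class and method names are mine, not Vodka's actual API:

```python
import hashlib

class VerbatimStore:
    """SHA-addressed, verbatim memory: what goes in is exactly what comes back."""

    def __init__(self):
        self._facts = {}

    def store(self, text):                 # the !! path
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._facts[key] = text            # stored verbatim - no LLM smoothing
        return key

    def recall(self, key):                 # the ?? path - bypasses the model entirely
        return self._facts.get(key)
```

Content addressing is the point: the same text always maps to the same key, and nothing generative ever sits between write and read.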

>>flush / !!nuke - flush context or nuke it from orbit. Your data, your call, one command. “Delete my data” is a keystroke, not a support ticket.

>>scratch - paste any text, ask questions grounded only to that text. Lossless, no summarisation. Model cannot drift outside it. Want it to use multiple locked sources? You can.

>>summ and >>lock - deterministic extractive summarisation (pure Python, no LLM) + single-source grounding. Missing support → explicit “not found” label, not silent fallback.
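Extractive summarisation with zero LLM involvement is an old, boring trick - which is the point. A toy frequency-scoring version (illustrative; the shipped scorer is presumably more careful):

```python
def summarise(text, k=2):
    """Extractive only: return the top-k sentences verbatim, in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w.strip(".,;:").lower() for w in text.split()]
    freq = {w: words.count(w) for w in set(words)}

    def score(sentence):
        return sum(freq.get(w.strip(".,;:").lower(), 0) for w in sentence.split())

    keep = set(sorted(sentences, key=score, reverse=True)[:k])
    # Emit in original order - nothing is paraphrased, only selected.
    return ". ".join(s for s in sentences if s in keep) + "."
```

Because every output sentence is a verbatim input sentence, there is nothing to hallucinate - the worst failure mode is a bad selection, which is auditable.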

##mentats - Vault-only deep retrieval. Thinker drafts from Vault facts, Critic (different model family) hunts violations, hallucinated content is deleted - never replaced with more hallucination, Thinker consolidates. No evidence to support claim? No answer. Gap explicitly stated.

Deterministic sidecars - >>wiki, >>weather, >>exchange, >>calc, >>define, >>vision/>>ocr. If a sidecar can do it, it does it deterministically.

Role orchestration - thinker, critic, vision, coder, judge - different families for error diversity. Swap any role in one line of config.

Personality Modes - Serious (default), Fun, Fun Rewrite, Raw passthrough. Model updates its snark and sarcasm based on how you talk to it. Yes, TARS sliders. Style changes delivery, not evidence contracts.


So, wait…are you saying you solved LLM hallucinations?

No. I did something much more evil. I made it impossible for the LLM to bullshit quietly. I made hallucinations…unpalatable, so the model would rather say “shit, I don’t know the answer. Please stop hurting me.”

To which I say…no.

Wrong still happens (though much less often), and when it does, it comes with a source label, a confidence rating, and an audit trail.

TL;DR: I made “I don’t know” a first-class output.

“In God We Trust; All others bring data.” - Deming


Runs on:

A potato. I run this on my Lenovo P330 Tiny with 4GB VRAM and 640 CUDA cores; if it runs here, it runs on yours.

pip install git+https://codeberg.org/BobbyLLM/llama-conductor.git
python -m llama_conductor.launch_stack up --config llama_conductor/router_config.yaml

Open http://127.0.0.1:8088/

Full docs: FAQ | Quickstart

License: AGPL-3.0. Corps who use it, contribute back.

PS: The whole stack runs on llama.cpp alone. I built a shim that patches the llama.cpp WebUI to route API calls through llama-conductor - one backend, one frontend, zero extra moving parts. Desktop or LAN. That’s it.

PPS: I even made a Firefox extension for it. Gives you ‘summarize’, ‘translate’, ‘analyse sentiment’ and ‘copy text to chat’. Doesn’t send anything to the cloud AT ALL (it’s just HTML files folded into a Firefox XPI).

“The first principle is that you must not fool yourself - and you are the easiest person to fool.” - Feynman

PPPS: A meat popsicle wrote this. Evidence - https://bobbyllm.github.io/llama-conductor/


Codeberg: https://codeberg.org/BobbyLLM/llama-conductor

GitHub: https://github.com/BobbyLLM/llama-conductor

  • SuspciousCarrot78@lemmy.world (OP) · 2 days ago

    FWIW Extra shit I cooked last night. It’s live now, so deserves a PS: of its own

    PPS: I built in a spam blocker as well.

    • allow-list / deny-list domain filters
    • DDG-lite junk-domain blocklist
    • ad/tracker URL rejection
    • relevance gate before any provenance upgrade

    Enjoy :) Blurb below

    “But what if it just… Googled it?”

    We can do that. But better.

    You: Who won best picture at the 97th Academy Awards?
    
    Model: Anora won best picture at the 97th Academy Awards.
    See: https://www.wdsu.com/article/2025-oscars-biggest-moments/64003102
    Confidence: medium | Source: Web
    

    Without >>web, that same 4B model said “The Fabelmans.” Then when I pushed it, “Cannes Film Festival.” With web retrieval, the router searches the internet, scores every result deterministically (phrase match + token overlap + domain trust), and only accepts evidence that passes a hard threshold. Garbage results get rejected, not served. The model never touches the answer - it’s extracted straight from the evidence.
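The scoring recipe is dead simple to sketch. Weights, threshold and trusted domains below are illustrative, not the actual llama-conductor values:

```python
# Hypothetical sketch: phrase match + token overlap + domain trust,
# with a hard acceptance threshold. Garbage gets rejected, not served.

TRUSTED = {"research.google", "pubmed.ncbi.nlm.nih.gov"}
THRESHOLD = 1.5

def score_result(query, title, domain):
    q, t = query.lower(), title.lower()
    phrase = 1.0 if q in t else 0.0                        # exact phrase match
    q_tokens, t_tokens = set(q.split()), set(t.split())
    overlap = len(q_tokens & t_tokens) / max(1, len(q_tokens))
    trust = 0.5 if domain in TRUSTED else 0.0              # domain allow-list bonus
    return phrase + overlap + trust

def accept(query, title, domain):
    return score_result(query, title, domain) >= THRESHOLD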

    Retrieval cascade: Cheatsheets → Wiki → Web → Model. Each step fires only if the previous one missed. The model is last resort, not first call. Sound familiar?
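The cascade above can be written as a fold over sources. A toy sketch with stub sources (the real adapters obviously do actual lookups):

```python
# Hypothetical sketch: each source fires only on a miss from the previous
# one; the model is the final fallback, and the hit is labelled with its source.

def cascade(query, sources):
    """sources: ordered list of (name, fn); fn returns an answer or None."""
    for name, fn in sources:
        hit = fn(query)
        if hit is not None:
            return hit, name
    return "model answer", "Model"

chain = [
    ("Cheatsheets", lambda q: None),                              # miss
    ("Wiki",        lambda q: None),                              # miss
    ("Web",         lambda q: "Anora" if "97th" in q else None),  # stub hit
]
answer, source = cascade("Who won best picture at the 97th Academy Awards?", chain)
```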

    I asked it who wrote a paper and it invented “Dr. David J. G. Smith” - a person who does not exist. After wiring DOI/Crossref fallback for academic metadata:

    You: Who wrote The Anatomy of a Large-Scale Hypertextual Web Search Engine?
    
    Model: Sergey Brin and Lawrence Page.
    See: https://research.google/pubs/the-anatomy-of-a-large-scale-hypertextual-web-search-engine/
    Confidence: medium | Source: Web
    

    Deterministic extraction from metadata. No model synthesis.

    >>web is provider-agnostic - ships with DuckDuckGo (no API key, no account) and supports Tavily, SearxNG, or your own adapter. Add your own trusted domains in one config line (there are a bunch baked in already, like pubmed). Every answer comes with a See: URL so you can verify with one click. Receipts, not pinky promises. PS: I even cooked in allow-list / deny-list domain filters, junk-domain blocklist and ad/tracker URL rejection so your results don’t get fouled with low quality spam shit.

    • SuspciousCarrot78@lemmy.world (OP) · 2 days ago

      Everything you see - every feature - is everything I use. None of it is ornamental.

      But my head is in the code right now, so I don’t “use it” so much as try to break it and then fix it.

      The end game is a local, expert system, that I can rely on, automate and audit. Because I built it and know exactly how it works.

      If you’re asking for my most common uses for it right now (outside of kicking it and then picking it back up):

      • sentiment analysis (“what did they mean in this email by…”)
      • document analysis
      • word etymology (I got the language thing with my ASD)
      • pilot project (see: https://lemmy.world/comment/22058968)
      • To-do lists
      • THINKING (and this is a big one for me: I’ll pose a problem, it will rubber duck it with me)
      • all the side cars (calculations, currency look ups, weather etc)
      • drafting ideas and research
      • shooting the shit when bored (local version of Claude-in-a-can is a bit more advanced than what’s on the repo; not stable yet. But when it cooks, fuck me it cooks. Will not push it till it’s 100%).

      Basically, all the shit you would ideally like to use an LLM for, but self hosted, private and non-bullshitty. I run on a potato (so don’t really use it for coding very much) but if you have a better rig than mine and can run bigger models - the router is agnostic and it should just work ™.

      TL;DR:

      What I’m building towards: a local expert system that picks its own tools (I coded), executes them (how I taught it to), and gives me a single-line audit receipt for every decision (that I can check if it smells funny). I ask a question, the system decides whether to calculate, look up, search, retrieve from my docs, or reason from scratch - then tells me exactly which path it took and why. Think ChatGPT convenience but with a paper trail you can actually inspect.

      And when that’s done…I’ll probably stick it in a robot. Because why not? :)

      https://github.com/poboisvert/GPTARS_Interstellar

      (or tee it up with Home-Assistant)

      PS: If you want to know the why behind this whole thing -

      https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/DESIGN.md

      PPS: Give me about … 15 mins. I’m just about to push a >>web sidecar. Needs one more tweak to properly parse DOIs / pubmed extraction. I was bored and it’s been on my TO-DO list for too long

      PPPS: Those were some Planet Namek 15 minutes…but the deed is done. Enjoy

      • pound_heap@lemmy.dbzer0.com · 2 days ago

        Nice! You kinda answered my next question already with this web tool. I was curious whether you get any useful results from the model itself without feeding it good data first or relying on hardcoded tools. A 4B model must be really dumb for anything even a little complicated. I see you recommend running two models - do they run in parallel, or can the router control the backend and switch models?

        • SuspciousCarrot78@lemmy.world (OP) · 2 days ago

          No, actually it’s probably one of the strongest 4Bs that you can run. On par with ChatGPT 4.1 in many benchmarks.

          https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507

          I use the DavidAU fine tune, which is even a touch better

          https://huggingface.co/DavidAU/Qwen3-4B-Hivemind-Instruct-NEO-MAX-Imatrix-GGUF

          The two-models thing is a router back-end switch that reduces hallucinations when using RAG. Separate from the main stuff, but complementary.

          https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/FAQ.md#what-is-mentats

          There are multiple duo / tag team orchestrations like this (eg: the vision model I use is Qwen 3-VL-4B, which does vision stuff and then feeds the output to “thinker” to work with etc).

          One of the eventual goals is parallel swarm or model decomposition, with “thinker” acting as the main orchestrator.

          The swarm idea is basically: instead of asking one 4B model to do everything at once (understand, retrieve, evaluate, synthesise, check its own work), you decompose the task into tiny (<1B) single-purpose workers - evidence extractors, contradiction detectors, refusal sentinels, a synthesis worker, and an arbiter (current “critic”) that makes the final call. Then the “thinker” reasons from that output.

          Each worker is small and stupid at exactly one thing, which means it’s auditable and replaceable.

          Think of it as breaking the 4B metacognitive ceiling by not asking any single model to be metacognitive.

          The deterministic routing backbone stays - workers only handle the ambiguous semantic stuff that can’t be solved with pure Python. It’s not “more models = better” - it’s “right model, right job, fail-loud if they disagree.”

          Basically, similar reasoning as to the research I cited in the Mentats section.

          PS: when you load it up, you might notice it refers to itself internally as MoA router. That’s pulling double duty. In normal llm circles that means Mixture of Agents. In my world that means “Mixture of Assholes”. See below -

          YOU (question) → ROUTER+DOCS (Ah shit, here we go again. I hate my life)
          ROUTER+DOCS → Asshole 1: SmolLM2-135M (“I’m right”)
          ROUTER+DOCS → Asshole 2: SmolLM2-360M (“No, I’m right”)
          ROUTER+DOCS → Asshole 3: Gemma-3-270M (“Idiots, I’m right!”)
          ROUTER+DOCS → Asshole 4: Qwen3-1.7B (“You’re all beneath me”)
          ARBITER: Phi-4-mini (“Shut up, all of you.”) ← (all assholes)
          → THINKER: Qwen-4B (“I’m surrounded by idiots. Fine, I’ll do it myself.”)
          ROUTER (please, let me die)
          YOU (answer + mad cackle)

    • SuspciousCarrot78@lemmy.world (OP) · 4 days ago

      o7

      We green? We super green? Corbin Dallas my man?

      PS: I know for sure >>fun mode pulls in a bunch of Firefly, Buffy and 5th Element snark, but I dunno if it will catch “meat popsicle”. Maybe? Let me procrastinate uh, perform some urgent QC right now.

      For sure it will once claude-in-a-can is done. I mean, what is the point of an LLM if it can’t shit-talk you while helping you solve a problem?

      https://bobbyllm.github.io/llama-conductor/blog/claude-in-a-can-1/

    • SuspciousCarrot78@lemmy.world (OP) · 3 days ago

      As was foretold in legend.

      Negative

      Twas but the working of a moment. I just added -

      {"term": "meat popsicle", "category": "fifth_element", "definition": "The Fifth Element (1997), 01:04:11. A police officer asks Korben Dallas (Bruce Willis): 'Sir, are you classified as human?' Reply: 'Negative. I am a meat popsicle.' Not an insult or irony - the straightest possible answer to a stupid question. Adopted as Gen X shorthand for acknowledging one's own biological inconsequence with maximum economy of feeling.", "source": "static", "confidence": "high", "tags": ["fifth_element", "pop_culture", "gen_x", "snark", "bruce_willis"]}

      – to one file and it was up to speed.

      EDIT: dropped in a better definition.

  • muzzle@lemmy.zip · 4 days ago

    How can you not reference this gem of an SCP entry?

    PS This sounds super interesting, looking forward to try it.

    PPS I am waiting for the day when I can run this on my phone.

    • SuspciousCarrot78@lemmy.world (OP) · 4 days ago

      Because sometimes, people deserve to have their faith rewarded when they go looking :)

      Now go look at the about section or the “Some problems This Solves” on the repo, and enjoy the absurdity of sentient yeast :)

      PS: Yes, please do try it

      PPS: HAHA! You can run it on your phone RIGHT NOW. Well, you can run it on your PC and then access it from your phone via the PC’s LAN address on port 8088 when you’re on the same LAN / WIFI. Given that tailscale exists, you could probably make that happen outside of your home too, firewall troubleshooting notwithstanding. (One of my personal use-cases for llama-conductor is exactly that).

      Personally, I really like the app below (it’s what I use to access llama-conductor from my phone) and am considering forking it and making it more streamlined.

      https://github.com/Taewan-P/gpt_mobile

      There’s an issue with it in that older (pre-Android 12) versions time out after 30 seconds. ##mentats triple pass can take longer than that on my shit-tier GPU, so I may need some jiggery-pokery. I tried forcing keep-alive via llama-conductor, but gpt_mobile just sort of ignored me.

      Be aware this is not a multi-tenancy rig - it assumes one user at a time. You CAN have more than one person access it, of course, but stuff you add via !! they may be able to recall via ?? on their end, so don’t plan any extravagant murders in plain sight (!! DIE BART). That was an intentional design decision due to how gpt_mobile works. I’ll harden it once I fork that app; the piping is already in place.

  • ZombiFrancis@sh.itjust.works · 3 days ago

    “List the article’s concrete claims about permit status and turbine operations, each with support.”

    • EPA position: these turbines require permits under the Clean Air Act.

    Not quite though. The article cited EPA’s policy as per a former EPA enforcement staffer who was explicitly stating the EPA is not requiring that here and has made rules deferring to the state and local authorities. The guy was saying the EPA should be acting, but isn’t. The article was clever with it, but that’s all the more reason.

    • SuspciousCarrot78@lemmy.world (OP) · 3 days ago

      Hmm?

      “…the EPA has long maintained that such pollution sources require permits under the Clean Air Act” and reiterated that policy on January 15th.

      Buckheit is a former official commenting on enforcement failure, not the source of the permitting position. The nuance the model could have flagged better is the gap between EPA’s stated policy and its current enforcement posture under Trump? Those are different things.

      Fair critique on the depth, but the attribution isn’t wrong, is it?

      https://www.theguardian.com/environment/2026/feb/13/elon-musk-xai-datacenters-air-pollution-mississippi

      • ZombiFrancis@sh.itjust.works · 3 days ago

        Kind of. It isn’t wrong, but it is a crucial omission that it’s interviewing a former EPA enforcement guy (i.e. not current) about current enforcement policy, (which is radically changing under Zeldin.) So the model’s interpretation on whether the state will hold to federal pressure becomes imprecise since it’s really this guy stating there’s actually a lack of federal pressure.

        But it does rightfully note information is not in the article to answer, which is neat.

        Because… for context not directly in the article, technically if EPA defers to the state, then Mississippi saying temporary permit exemption actually applies here satisfies the permit requirement, which Buckheit has to know. (Which directly explains the lack of federal pressure.) Citing the policy in January was a clever non-answer from the EPA. They’re actually saying state and federal policies are NOT in conflict.

        Also, I’m not trying to dismiss any of this, more trying to provide an insight that might help with accuracy. I have a bit of knowledge on this specific subject, so I thought I’d note a point where I can measure an inaccuracy.

        These kinds of articles can be really sneaky about claims and statements. Mostly minor and innocuous, but an LLM doesn’t know the difference. Like, this caught that Buckheit is talking about what should be happening under previous admins when he was involved, but that’s specifically not what the EPA is doing anymore, which the LLM appears to have missed in part. To me, that part was the primary purpose of the article.

        • SuspciousCarrot78@lemmy.world (OP) · 3 days ago

          Fair. I should have fed it a better article. OTOH, I’m confident that this quality of synthesis isn’t native to anything under 70B. So, if the tooling can uplift the reasoning ability of a 4B to that level, that’s pretty good in my book.