• Artisian@lemmy.world
    11 days ago

    That’s kinda stereotyping; there are models trained only on public-domain content, for example. There are plenty of academic and non-profit providers with open datasets.

      • Artisian@lemmy.world
        4 days ago

        No. I’ll name three.

        Pleias, a family of LLMs trained on the Common Corpus, which is compliant with EU copyright and fair use law. They filtered a public domain dataset for racism and other biases, and released the results.

        Common Canvas is a suite of text-to-image models trained on data they know is well sourced.

        Apertus (Public AI) is a ChatGPT-style bot made in collaboration with the Swiss government, with a commitment to using only training data that complies with Swiss fair use. They’ve chosen a model design that lets them remove training data which is improperly labeled or is no longer accessible (e.g., after a robots.txt change).
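
        To make the robots.txt point concrete, here is a rough sketch of re-checking a dataset against current robots.txt rules and dropping anything that is now disallowed. This is not how Apertus actually does it (I don't know their pipeline); the function names, the `example-crawler` user agent, and the "no robots.txt means allowed" default are all assumptions for illustration. It only uses Python's standard `urllib.robotparser`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def still_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def filter_corpus(docs, robots_by_host, user_agent="example-crawler"):
    """Keep only documents whose host's current robots.txt still allows them.

    docs: list of dicts with a "url" key (hypothetical record format).
    robots_by_host: dict mapping hostname -> latest robots.txt text.
    """
    kept = []
    for doc in docs:
        host = urlsplit(doc["url"]).hostname
        # Assumption: a host with no known robots.txt is treated as allowed.
        robots_txt = robots_by_host.get(host, "")
        if still_allowed(robots_txt, user_agent, doc["url"]):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    robots = {"example.org": "User-agent: *\nDisallow: /private/\n"}
    docs = [
        {"url": "https://example.org/public/a"},
        {"url": "https://example.org/private/b"},
    ]
    for doc in filter_corpus(docs, robots):
        print(doc["url"])
```

        Note this only filters the dataset; getting already-learned data back out of a trained model is a separate and much harder problem.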

        Not to mention the hundreds of models academics in ML have trained using things like open diffusion and public datasets (see also these hobbyists).

        They don’t have advertising budgets (generally). But you see a steady stream of open models on arXiv.