- cross-posted to:
- programming@programming.dev
Yesterday Mistral AI released a new language model called Mistral 7B. @justnasty@lemmy.kya.moe already posted about the sliding window attention part here in LocalLLaMA yesterday. But I think the model and the company behind it are even more noteworthy, and the release of the model is worth its own post.
Mistral 7B is not based on Llama. And they claim it outperforms Llama2 13B on all benchmarks (at its size of 7B). It has additional coding abilities and an 8k sequence length. And it’s released under the Apache 2.0 license. So truly an ‘open’ model, usable without restrictions. [Edit: Unfortunately I couldn’t find the dataset or a paper. They call it ‘open-weight’. So my conclusion regarding the openness might be a bit premature. We’ll see.]
(It uses Grouped-query attention and Sliding Window Attention.)
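To make the sliding window part a bit more concrete, here is a minimal sketch of the attention mask it implies. This is not Mistral's code: the function name and the tiny sizes are made up for illustration, and I'm assuming the window counts the current token itself (the exact off-by-one convention in their implementation may differ).

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and limited to the last `window` positions, where the
    window is assumed to include position i itself.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes, purely illustrative.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the attention cost per token stays bounded instead of growing with the full sequence length.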
Also worth noting: Mistral AI (the company) is based in Paris. They are one of the few big European AI startups and collected $113 million in funding in June.
- Details are on Mistral AI’s Announcement
- techcrunch news article including information about the company
- They released a base/foundation model and an instruction-tuned one on HuggingFace
- And llama.cpp is already compatible and GGUF versions are out there.
I’ve tried it and it indeed looks promising. It certainly has features that distinguish it from Llama. And I like the competition. Our world is currently completely dominated by Meta. And if it performs exceptionally well at its size, I hope people pick up on it and fine-tune it for all kinds of specific tasks. (The lack of a dataset and of details regarding the training could be a downside, though. These were not included in this initial release of the model.)
EDIT 2023-10-12: Paper released at: https://arxiv.org/abs/2310.06825 (But I’d say there is no new information in it; they mostly copied their announcement.)
As of now, it is clear they don’t want to publish any details about the training.
I’ve been trying to train this new model, but there are still questions to be answered: what you can do with the new RotatingBufferCache, and why training takes up so much memory. Only the inference code is available.
I don’t see how the open source community will benefit from the code until it’s clarified how their sliding window is meant to be used for model training.
At least two non-Llama papers were written this year, Pythia and Phi, and both already have the tools available. I can take any of these models’ weights and put them in another open model, and then continue training there.
As an end user, I’ve already replaced my small Llama models with Mistral. However, this feels like switching from Windows to macOS. Having fine-tunes for open models, which we currently lack, would mean that we rely on the community rather than on corporate releases.
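For anyone wondering what a rotating cache does in general, here is a rough sketch of the idea as I understand it: with a window of W, the entry for position i is written to slot i mod W, so old entries get overwritten once they fall out of the window. The class below is hypothetical and is not Mistral's RotatingBufferCache; it only stores plain Python objects where a real cache would hold key/value tensors. It is an inference-time trick and says nothing about the memory use during training.

```python
class RollingBuffer:
    """Hypothetical fixed-size cache: position i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.next_pos = 0  # absolute position of the next item to store

    def append(self, item) -> None:
        # Overwrites whatever fell out of the window `window` steps ago.
        self.slots[self.next_pos % self.window] = item
        self.next_pos += 1

    def contents(self) -> list:
        """Return the cached items in chronological order (oldest first)."""
        start = max(0, self.next_pos - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_pos)]


cache = RollingBuffer(window=3)
for token_pos in range(5):
    cache.append(token_pos)
print(cache.slots)       # physical layout of the buffer
print(cache.contents())  # logical order, oldest to newest
```

Memory stays at W entries no matter how long the sequence gets, which is the point of pairing it with a sliding window.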
That Opening up ChatGPT link was helpful; I hadn’t seen it elsewhere.