I feel that this 11B model is smarter and hallucinates less than other leaders

#9
by agershun - opened

I feel that this 11B model is smarter and less hallucinate then current leaders in the leaderboard (on December 16, 2023).

I have tried these models with better current scores then upstage/SOLAR-10.7B-Instruct-v1.0:

  • rwitz2/go-bruins-v2.1.1
  • ignos/LeoScorpius-GreenNode-Alpaca-7B-v1
  • Toten5/LeoScorpius-GreenNode-7B-v1

and found that all of them hallucinate more often than the SOLAR does.

Probably, the reason is:
a) in the size of the model and it is more perspective to improve 11B models than 7B.
b) they are overtrained on the test dataset data.

agershun changed discussion title from I feel that this 11B model is smarter and less hallucinate then to I feel that this 11B model is smarter and less hallucinate than other leaders
agershun changed discussion title from I feel that this 11B model is smarter and less hallucinate than other leaders to I feel that this 11B model is smarter and hallucinates less than other leaders
deleted

@agershun I agree that it seems smarter, but have experienced it hallucinating more.

The reduction in hallucinations appears to be an illusion caused by things like its propensity to shy away from long lists filled with details, hence reducing the opportunity for hallucinations, and denying things are true, even when they are (throwing the baby out with the bath water).

When I put it to the test with tricky fringe knowledge it performed worse than all other leading 7b Mistrals. For example, when I ask about the 2 ex-wives of Alan Harper from the show Two and a Half Men this LLM got all 4 names wrong (screen and real names), while 7b Mistrals reliably get 2 of 4 right (first wife Judith). And this wasn't the exception. It reliable hallucinated more on fringe knowledge.

So ironically, despite its larger size, this LLM is far less knowledgeable than the original 7b Mistrals. Hallucinated more at the fringes, provides less information (to avoid hallucinations) and denying ~10x more things aren't true, that actually are true, in an attempt to minimize the frequency of saying things are true that aren't.

In short, it actually hallucinates more, which is why I suspect it's overly tight-lipped, brief, cynical...

@Phil337 Could you try out UNA-Solar by the UNA guy? Or better, the Frostwind tune from Sao10K?

I still have no idea what UNA is, but it seems he finally relented some details in his model card. I have more hope for the Sao10K tune, however, because it's actually trained on the base model.

I'm traveling so can't quite test at the moment. I'm rather invested in this model, due to how transparent the team has been in their testing. Seems like a rare breed in recent climate.

deleted

@Mino24she I haven't tried Sao10K yet, but I did try the UNA-Solar version of instruct and it performed slightly better on my test (e.g. got 3 of 4 names right in my aforementioned question about Alan's ex-wives from 2.5 Men). However, it's plagued by the exact same stubborn denials of facts, as well as excessive censorship and moralizing. But I guess this is expected if UNA is more about the transformers than weights.

I also tried the Uncensored version of Solaris and it performed better (less censorship and moralizing, plus longer responses), but it still performed poorly on fridge knowledge questions like the Alan question above.

@Phil337 The UNA one is based on the instruct tune, which obviously, won't be as practical. The base model looks more organic to me. The Sao10K (of Euryale fame) one (Frostwind) otherwise looks pretty good.

deleted

@Mino24she I tested Frostwind and it still has some censorship, but less than Solar instruct. It also hallucinates more than leading Mistrals at the fringes of knowledge.

So far all the Solar LLMs have been smarter than the top Mistrals, but none have been more knowledgeable, or even as knowledgeable. I don't know what up-scaling is in the context of LLMs, but it appears to only improve the transformers, not the weights.

Good model, have some problems for common sense reasoning and counterfactual reasoning, but isn't impossible to adjust. It's the wright direction, with more accuracy on MMLU benchmark I think the upstage can surpass Mistrals in the future. Have something marvelous in this model.

Agree that reducing the size of the response is an excellent way to decrease hallucinations. However, for solving specific tasks, the length of the response is not a critical factor.

This SOLAR neural network doesn't know a lot, that's true. I usually use the story "Mumu" by Ivan Turgenev for tests, and I have heard so many interesting and diverse stories from different neural networks. But for my tasks, this is not a problem, I want to further train it for my actual material (figuratively speaking, to instill in it the correct version of Mumu). For me, it's more important that it is still capable of making inferences and doesn't give random characteers like ===~=#$== as some other networks do."

This model seems to be better when it comes to RAG, it hallucinates a lot less, this is the most useful model i've loaded on my 8gb vram laptop. You can really rely on it on common, easy language tasks just like chatgpt 3.5. Of course, GPT-4 is better at everything but hey, this is a free, fast decent language model!

For some reason at 16 bit attention it increases in VRam use with each response. Its stable in 8-bit attention but that is sad because you can load it at 16-bit into 24GB VRam and its really fast. The problem is once it hits the ceiling it crashes. I use fastchat with "python -m fastchat.serve.cli --model-path I:\misc\downloaded\Ai_models\models_for_fastchat\upstageSOLAR-10.7B-Instruct-v1.0 --style rich" on a RTX Quadro 6000 Passive. Its stable with --load-8-bit. I will try the other solar models. So far it seems very fast with solid replies.

upstage org

@appvoid You are absolutely right. We designed this model to follow instructions well, including the RAG model. Thank you very much for your comment! :-)

This model seems to be better when it comes to RAG, it hallucinates a lot less, this is the most useful model i've loaded on my 8gb vram laptop. You can really rely on it on common, easy language tasks just like chatgpt 3.5. Of course, GPT-4 is better at everything but hey, this is a free, fast decent language model!

hi mate kind of irrelevant question but how do you run the .safetensor files of this model? Do you convert it to .gguf using https://github.com/ggerganov/llama.cpp/discussions/2948 first or do you have some other method?

This model seems to be better when it comes to RAG, it hallucinates a lot less, this is the most useful model i've loaded on my 8gb vram laptop. You can really rely on it on common, easy language tasks just like chatgpt 3.5. Of course, GPT-4 is better at everything but hey, this is a free, fast decent language model!

hi mate kind of irrelevant question but how do you run the .safetensor files of this model? Do you convert it to .gguf using https://github.com/ggerganov/llama.cpp/discussions/2948 first or do you have some other method?

Hi, i just used gguf version from "the bloke" on lmstudio. What really suprises me though is that i'm using a q5km quantized version and still manages to be good.

I used SOLAR with the following methods:

  1. ollama supports it out of the box
  2. vllm as well
  3. For A100/40 + Jupyter I used this code:
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer
model_name = "upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right'
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)

query = """What do I want to ask?"""
conversation = [ {'role': 'user', 'content': query} ] 
prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device) 
outputs = base_model.generate(**inputs, use_cache=True, max_length=4096)
output_text = tokenizer.decode(outputs[0]) 
output_text = output_text.split("\n### Assistant:\n", 1)[-1].replace("<s>", "").replace("</s>", "").strip()
print(output_text)

This model seems to be better when it comes to RAG, it hallucinates a lot less, this is the most useful model i've loaded on my 8gb vram laptop. You can really rely on it on common, easy language tasks just like chatgpt 3.5. Of course, GPT-4 is better at everything but hey, this is a free, fast decent language model!

hi mate kind of irrelevant question but how do you run the .safetensor files of this model? Do you convert it to .gguf using https://github.com/ggerganov/llama.cpp/discussions/2948 first or do you have some other method?

Hi, i just used gguf version from "the bloke" on lmstudio. What really suprises me though is that i'm using a q5km quantized version and still manages to be good.

thanks the reply mate, is it this one? https://huggingface.co/TheBloke/SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF

This model seems to be better when it comes to RAG, it hallucinates a lot less, this is the most useful model i've loaded on my 8gb vram laptop. You can really rely on it on common, easy language tasks just like chatgpt 3.5. Of course, GPT-4 is better at everything but hey, this is a free, fast decent language model!

hi mate kind of irrelevant question but how do you run the .safetensor files of this model? Do you convert it to .gguf using https://github.com/ggerganov/llama.cpp/discussions/2948 first or do you have some other method?

Hi, i just used gguf version from "the bloke" on lmstudio. What really suprises me though is that i'm using a q5km quantized version and still manages to be good.

thanks the reply mate, is it this one? https://huggingface.co/TheBloke/SOLAR-10.7B-Instruct-v1.0-uncensored-GGUF

This one:
https://huggingface.co/TheBloke/SOLAR-10.7B-Instruct-v1.0-GGUF

deleted

@Mino24she I remembered you thought Solar has a lot of promise and wanted to find a good Solar fine-tune. Check out the uncensored version linked below. Even when it comes to uncensored prompts it's more verbose and hallucinates less on the fringes of knowledge.

https://huggingface.co/w4r10ck/SOLAR-10.7B-Instruct-v1.0-uncensored

I have tried this model with RAG and sometimes the answers are unrelated both to the question and to the context provided. I have switched to Tulu 2 DPO, which provides less elegant answers, but I find it a lot more reliable for RAG.

Here is the code (with chatbot interface) if anyone is interested:

https://github.com/mirix/retrieval-augmented-generation

For instance, I had it read The Little Prince and asked a question that was within the context and the answer was about The Shawshank Redemption and had nothing whatsoever to do with the question.

I have tried several models with the same script and none of them hallucinated like that.

I thought it could be the template. but SOLAR seem to get most answers right.

Any ideas?

Please, disregard my previous comment. The issue seems indeed to be related to the prompt template and it is solved by using the following wrapper:

query_wrapper_prompt = PromptTemplate(
"### System:"
"Please, check if the anwser can be inferred from the pieces of context provided. If the answer cannot be inferred from the context, just state that the question is out of scope and do not provide any answer.\n"
"### User:"
"{query_str}\n"
"### Assistant:\n"
)

It would be great to see a long-context version of this model, it may also help with the multi-turn conversations that many have mentioned it seems to struggle with.

hunkim pinned discussion

Half a year later and this model is still one of the very best! There have been finetunes like Fimbulvetr in the meantime, and communities like /r/SillyTavern are really happy with the work you shared with the public. Personally I'm extremely amazed how the model adheres to previous messages, which creates both incredible consistency and flexibility for character prompts. Thanks for your fantastic work!

Sign up or log in to comment