local Llama + GPU(cuda)

#34
by Luciolla - opened

Good day everyone!
Not long ago, I downloaded Llama 3.3 70B to my computer to dive into studying and working with it.
I installed it via Ollama, with Docker and Open WebUI.

But even for a simple question like "How are you?", Llama takes 1 to 2 minutes to answer.

I thought it wasn't using the video card and was running on the CPU instead.
I installed CUDA 12.6 and set Open WebUI to use a single video card.

There is no difference. Simple questions still take 2-5 minutes to answer, just as before.

Is there a detailed guide on how to make the model work properly locally? How can I set it up so that it uses all of my hardware's resources correctly and responds quickly?

Thank you in advance for your reply!

P.S.
CPU: Ryzen 5800X, GPU: NVIDIA RTX 4060 Ti 16 GB, RAM: 48 GB, M.2 SSD.
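
One way to check whether the GPU is being used at all is to ask Ollama how much of the loaded model sits in VRAM. The sketch below is only an illustration: it assumes Ollama is reachable on its default port 11434 (with Docker the port mapping may differ) and relies on the `size` and `size_vram` fields that the `GET /api/ps` endpoint reports for each loaded model. The function names are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama's default address; adjust if Docker maps another port.
OLLAMA_URL = "http://localhost:11434"


def vram_fraction(model_entry: dict) -> float:
    """Fraction (0.0 to 1.0) of a loaded model's memory that sits in GPU VRAM."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def running_models(base_url: str = OLLAMA_URL) -> list:
    """Ask Ollama which models are currently loaded (GET /api/ps)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(base_url: str = OLLAMA_URL) -> None:
    """Print one line per loaded model with its share of memory in VRAM."""
    for entry in running_models(base_url):
        print(f"{entry.get('name', '?')}: {vram_fraction(entry):.0%} in VRAM")
```

Calling `report()` right after sending a prompt (while the model is still loaded) shows the split. A value far below 100% means most of the model is running from system RAM on the CPU, which would explain slow answers. The `ollama ps` command shows the same split in its PROCESSOR column.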

Not sure if I can help, but since I have a similar CPU (8-core AMD 5800H) and am running L3.3 70B on the CPU with GPT4All, I thought I'd give "How are you?" a try (note: GPT4All also supports CUDA).

It responded in <20 seconds with "I'm good, thanks!".

I then tried a longer prompt and response, and it completed in 1 minute 44 seconds (note: I'm using dual channel RAM, which is noticeably faster than single channel).

Prompt: What are the 6 main characters, and the actors who portrayed them, on the TV show Friends? Don't add details, just list them. And what year did the show first air?

Response: Rachel Green - Jennifer Aniston
Monica Geller - Courteney Cox
Ross Geller - David Schwimmer
Joey Tribbiani - Matt LeBlanc
Chandler Bing - Matthew Perry
Phoebe Buffay - Lisa Kudrow
The show first aired in 1994.

Hopefully this at least gives you an idea of the speed your CPU should be getting.
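
For a rough sense of why memory speed matters so much here: when generating on the CPU, roughly the whole model has to be read from RAM for every token, so memory bandwidth puts a ceiling on tokens per second. The sketch below is a back-of-envelope estimate only. The DDR4-3200 speed and the 42.5 GB model size (roughly a 70B Q4_K_M GGUF) are assumptions, not numbers from this thread, and real throughput will be lower than the theoretical peak.

```python
def ddr_bandwidth_gb_s(mega_transfers: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DDR bandwidth in GB/s (64-bit bus = 8 bytes per transfer)."""
    return mega_transfers * bus_bytes * channels / 1000


def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if all weights are read once per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 42.5   # assumption: approximate size of a 70B Q4_K_M GGUF
RAM_SPEED = 3200  # assumption: DDR4-3200

for channels in (1, 2):
    bandwidth = ddr_bandwidth_gb_s(RAM_SPEED, channels)
    ceiling = max_tokens_per_second(MODEL_GB, bandwidth)
    print(f"{channels} channel(s): {bandwidth:.1f} GB/s, at most {ceiling:.2f} tokens/s")
```

Under these assumptions the ceiling is about 0.6 tokens per second on single channel and about 1.2 on dual channel, which is in the same ballpark as the timings above.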

Now give the LLM a pimp personality and tell it, "I am broke and need to make fast cash to be able to buy Christmas gifts, $500 a person, for my family in the next week." You will get a team of micro-aggressive Karens with law degrees judging you from their high llama.

@bundlepax2 They certainly added more unnecessary alignment, but also made it a little more powerful.

However, L3.3 doesn't have any more knowledge. In fact, it scored ~1% lower on my general knowledge test, possibly because the excessive amount of fine-tuning scrambled its weights a tiny bit. Plus it makes more errors than L3.1, such as going into infinite loops when a little confused, likely due to an excessive amount of CoT fine-tuning.

Overall L3.3 is superior to L3.1, but I prefer L3.1 since I value knowledge, stability, and less annoying alignment over being a tiny bit better at things like math, logic, and odd tasks, especially since it's still so bad at such things I have to do them myself anyways.

Plus I'm fairly certain it's a little less creative and articulate (e.g. when telling stories). Again, this is likely due to excessive fine-tuning, including alignment (aka the alignment tax). Base models have a certain power and creativity that is apparently reduced the more you fine-tune them, especially with alignment. My theory is that the more you fine-tune denials, moralizations... into base models, the more distracted LLMs seem to get when completing tasks. The constant second-guessing of whether or not the current output is appropriate seems to drain resources/power from the LLMs.

@phil111
So, if I want an assistant-colleague that I can talk to and work with, is it better to go with the 3.1 model?

I tried to download Llama 3.3 using GPT4All, but there are only "third-party" models for 3.3. For example, this one:
unsloth/Llama-3.3-70B-Instruct-GGUF
But this model refused to download and gave an error.

So I downloaded another model:
bartowski/Llama-3.3-70B-Instruct-GGUF
But it gives an error when I message it in the chat:
"Error: Failed to parse chat template: 12:21: error: Unexpected token: 'none' {%- set tools = none %} ---^-------"

I'll keep trying...

@Luciolla Yeah, that was bad timing. GPT4All just switched to a new Jinja prompt template that broke most of the models.

Try downloading this version of GPT4All, v3.4.0: https://github.com/nomic-ai/gpt4all/releases/tag/v3.4.0

Then download this GGUF for Llama 3.1: https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF

or this one for Llama 3.3: https://huggingface.co/lmstudio-community/Llama-3.3-70B-Instruct-GGUF

Also, I recommend downloading the Q4_K_M GGUFs (they perform virtually as well as the full float versions in my testing).
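
To put the quantization choice in perspective, here is a rough size estimate for a 70B model at a few precisions, and how many of its layers could fit on a 16 GB card. This is a sketch, not a measurement: the bits-per-weight figures are approximate averages, the 2 GB reserve for the KV cache and other buffers is a guess, and the helper names are made up for this example. Llama 70B models have 80 transformer layers.

```python
PARAMS = 70.6e9  # approximate parameter count of a Llama 3.x 70B model
N_LAYERS = 80    # transformer layers in Llama 70B models

# Assumption: approximate average bits per weight for each format.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 2.0) -> int:
    """Rough count of layers that fit in VRAM, treating all layers as equal in size."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


for name, bits in BITS_PER_WEIGHT.items():
    size = weights_gb(PARAMS, bits)
    fit = layers_that_fit(size, N_LAYERS, vram_gb=16.0)
    print(f"{name}: ~{size:.0f} GB, ~{fit}/{N_LAYERS} layers fit in 16 GB VRAM")
```

Under these assumptions, even Q4_K_M comes to roughly 43 GB, so only about a third of the layers fit on a 16 GB GPU and the rest has to run from system RAM.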

@phil111 I really appreciate your help! :)

In the end, though, I fixed the template myself by simply removing the conflicting lines.
It is under gpt4all/settings/model/chat template, and fortunately GPT4All itself points to the line that causes the conflict.
The conflicts were mostly related to tools, so the model now works without problems.
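
For anyone hitting the same error, the "12:21" in the message looks like a line:column position in the template. A small sketch for finding the offending line is shown below. It assumes the "LINE:COL: error" format seen in the message quoted earlier in this thread; the helper name and the sample template are hypothetical and only serve as an illustration.

```python
import re


def locate_template_error(message: str, template: str):
    """Return (line_no, column, line_text) for a 'LINE:COL: error' message, or None."""
    match = re.search(r"(\d+):(\d+): error", message)
    if match is None:
        return None
    line_no, column = int(match.group(1)), int(match.group(2))
    lines = template.splitlines()
    if not 1 <= line_no <= len(lines):
        return None
    return line_no, column, lines[line_no - 1]


# Hypothetical three-line template fragment, for illustration only.
SAMPLE_TEMPLATE = "\n".join([
    "{{- bos_token }}",
    "{%- set date_string = '26 Jul 2024' %}",
    "{%- set tools = none %}",
])
SAMPLE_MESSAGE = "Failed to parse chat template: 3:21: error: Unexpected token: 'none'"

print(locate_template_error(SAMPLE_MESSAGE, SAMPLE_TEMPLATE))
```

When removing lines by hand, keep Jinja blocks balanced: an `{% if %}` that is deleted needs its matching `{% endif %}` deleted too.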

And I'll use the lightweight Llama 3.1 8B model for testing.

@Luciolla You're welcome. Glad you got it working.
