I'm slightly shocked, I admit. Yesterday I chatted for a good 40 minutes with a locally running #LLaMa #LLM bot - and the experience was pretty much flawless.
With the GPTQ-for-LLaMa project, the GPU requirement for running the 13B model drops to ~9 GB of VRAM.
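As a rough sanity check on that number (my own back-of-the-envelope sketch, not something from the project itself): 13B weights at 4 bits is about 6.5 GB, and the rest is runtime overhead.

```python
# Back-of-the-envelope VRAM estimate for a 4-bit quantized 13B model.
# The overhead figure is an assumption for illustration; the real amount
# depends on context length, batch size and the CUDA runtime.

params = 13e9            # 13B parameters
bits_per_weight = 4      # GPTQ 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9   # ~6.5 GB for the weights

overhead_gb = 2.5        # activations, KV cache, CUDA context (assumed)
print(f"~{weights_gb + overhead_gb:.1f} GB VRAM")  # ~9.0 GB
```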
The text-generation-webui project has been a proper chatbot interface for a while already, and it has now been updated to support the above project and model.
This is "state of the art" level human language #AI running locally on a pretty much normal modern workstation.
Note that this is _not_ the same as the CPU-only llama.cpp project. That one is much too slow for such an experience, and after comparing the two I also have some doubts about the quality of its inference.