Running Large Language Models on Your GPU without Ollama: The Ultimate Guide in 2025

Congratulations! By learning about Large Language Models (LLMs), you are taking a step in the right direction. The road ahead is exciting and, at times, perilous, but I can assure you it will be an amazing ride as you learn about everything happening in the world of AI.

While libraries and applications such as vLLM and Ollama let us host LLMs efficiently and without hassle, it is always good to go back to basics and learn how these large libraries may be processing things behind the scenes.

Why GPU Acceleration Matters

Modern LLMs are computational behemoths that require serious processing power. Your GPU is the secret weapon that transforms these models from theoretical concepts into practical, responsive AI assistants. By leveraging GPU acceleration, you can run state-of-the-art language models right on your personal computer, opening up a world of possibilities for developers, researchers, and AI enthusiasts.

Preparing Your Machine: Essential Prerequisites

Before diving into LLM deployment, let’s ensure your machine is ready for the challenge. I am assuming you have some programming knowledge here. Here’s a quick checklist:

Hardware Verification: You’ll need a modern GPU with sufficient VRAM (at least 10 GB). NVIDIA GPUs are typically the go-to choice for machine learning workloads.
Checking GPU Compatibility: Open your command prompt and run the nvidia-smi command. It reports your GPU model, the installed driver version, and the highest CUDA version that driver supports. The output should look something like this:

Thu Dec 28 15:58:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98 Driver Version: 535.98 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 41C P8 11W / 94W | 59MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 8092 C+G ...cal\Postman\app-10.21.0\Postman.exe N/A|
| 0 N/A N/A 8984 C+G C:\Windows\explorer.exe N/A|
+---------------------------------------------------------------------------------------+
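If you would rather check the VRAM requirement from a script than read the table by eye, nvidia-smi also offers a CSV query mode. The sketch below is a minimal example under a few assumptions: the helper names parse_gpu_csv, query_gpus, and has_enough_vram are made up for illustration, and the 10 GiB threshold simply mirrors the checklist above.

```python
import subprocess

def parse_gpu_csv(text):
    """Parse lines such as 'NVIDIA GeForce RTX 3070, 8192' into (name, MiB) pairs."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus

def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def has_enough_vram(gpus, min_mib=10 * 1024):
    """True if at least one GPU meets the (assumed) 10 GiB threshold."""
    return any(mem >= min_mib for _, mem in gpus)

if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available; check your driver installation")
    else:
        for name, mem in gpus:
            print(f"{name}: {mem} MiB")
        print("Enough VRAM:", has_enough_vram(gpus))
```

A card below the threshold is not necessarily a dead end, since quantization (covered later) reduces the memory the model weights need.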

Pro Tip: Driver Installation

If nvidia-smi doesn’t return the expected output, head to the official NVIDIA website and download the latest drivers specific to your graphics card. Precision matters here!

Download the latest official NVIDIA drivers

Setting Up Your Development Environment

Choosing Your Tools

For this journey, we’ll be using Visual Studio Code, a versatile, free code editor built on an open-source core.

What we love about Visual Studio Code

Free, with an open-source core (Code - OSS)
Maintained by Microsoft
Cross-platform compatibility
Intelligent Python integration

Navigating Windows Subsystem for Linux (WSL)

Why WSL? Libraries like accelerate and bitsandbytes prefer a Linux environment. WSL bridges this gap, allowing you to harness the full potential of these libraries on your Windows machine.

WSL Installation Steps

1. Open Command Prompt as an administrator.
2. List the available distributions:
```wsl --list --online```
3. Install your preferred distribution, for example:
```wsl --install -d Ubuntu-22.04```
4. Set up a username and password when prompted.

Now you can start WSL by opening a terminal and typing:

```wsl```
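To confirm that a Python process is really running inside WSL rather than in native Windows, a small check can help. The sketch below relies on the convention that WSL kernels include "microsoft" in their release string; that is a heuristic, not a guarantee, and the function name is_wsl is chosen here for illustration.

```python
import platform

def is_wsl(release=None):
    """Return True if the kernel release string looks like a WSL kernel."""
    if release is None:
        release = platform.uname().release
    # WSL kernel releases carry a "microsoft" tag; compare case-insensitively
    return "microsoft" in release.lower()

if __name__ == "__main__":
    print("Running inside WSL:", is_wsl())
```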

Creating the Perfect Python Environment

Miniconda: Lightweight Package Management

Download Miniconda installer
Run installation script
Pro Tip: Pay attention to installation prompts!

Create a dedicated virtual environment for your LLM project:

conda create -n testingllm python=3.11.5

conda activate testingllm
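A common slip is installing packages into the wrong environment. As a quick sanity check, the sketch below reads the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper name active_conda_env is hypothetical.

```python
import os
import sys

def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")

if __name__ == "__main__":
    print("Active conda environment:", active_conda_env())
    print("Python version:", ".".join(map(str, sys.version_info[:3])))
```

If the first line does not print testingllm, run conda activate testingllm again before installing anything.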

Essential Libraries for LLM Deployment

Your requirements.txt is your roadmap to success. Key libraries include:

transformers
torch
accelerate
bitsandbytes
fastapi

Here is the full list of pinned dependencies. Save it as a separate file named requirements.txt:

accelerate @ git+https://github.com/huggingface/accelerate.git@dab62832de44c84e80045e4db53e087b71d0fd85
aiofiles==23.2.1
aiohttp==3.8.6
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==3.7.1
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.1.0
bitsandbytes==0.41.1
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
dataclasses-json==0.6.3
datasets==2.14.5
dill==0.3.7
docker-pycreds==0.4.0
docstring-parser==0.15
easyllm==0.6.2
einops==0.7.0
exceptiongroup==1.2.0
fastapi==0.104.1
ffmpy==0.3.1
filelock==3.12.4
fonttools==4.45.1
frozenlist==1.4.0
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.37
gradio==4.7.1
gradio_client==0.7.0
greenlet==3.0.1
h11==0.14.0
httpcore==1.0.2
httpx==0.25.2
huggingface-hub==0.16.4
idna==3.4
importlib-resources==6.1.1
Jinja2==3.1.2
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
langchain==0.0.343
langchain-core==0.0.7
langsmith==0.0.67
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.20.1
matplotlib==3.8.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
nanoid==2.0.0
networkx==3.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.2.140
nvidia-nvtx-cu12==12.1.105
orjson==3.9.10
packaging==23.2
pandas==2.1.1
pathtools==0.1.2
peft @ git+https://github.com/huggingface/peft.git@aaa7e9f44a6405af819e721d7ee7fc6dd190c980
Pillow==10.1.0
protobuf==4.24.4
psutil==5.9.6
py-expression-eval==0.3.14
pyarrow==13.0.0
pydantic==2.1.1
pydantic_core==2.4.0
pydub==0.25.1
Pygments==2.16.1
pyparsing==3.1.1
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.31.1
regex==2023.10.3
requests==2.31.0
rich==13.6.0
rpds-py==0.13.2
safetensors==0.4.0
scipy==1.11.3
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.32.0
setproctitle==1.3.3
shellingham==1.5.4
shtab==1.6.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.27.0
sympy==1.12
tenacity==8.2.3
tokenizers==0.14.1
tomlkit==0.12.0
toolz==0.12.0
torch==2.1.0
tqdm==4.66.1
transformers @ git+https://github.com/huggingface/transformers.git@21dc5859421cf0d7d82d374b10f533611745a8c5
triton==2.1.0
trl==0.7.2
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.8.0
tyro==0.5.10
tzdata==2023.3
urllib3==2.0.6
uvicorn==0.24.0.post1
wandb==0.15.12
websockets==11.0.3
xformers==0.0.22.post4
xxhash==3.4.1
yarl==1.9.2

Install them using:

pip install -r requirements.txt
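Once pip finishes, it can be worth confirming that the key libraries actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is an assumption made for this example, and the list mirrors the key libraries named earlier.

```python
from importlib import metadata

KEY_LIBRARIES = ["transformers", "torch", "accelerate", "bitsandbytes", "fastapi"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions(KEY_LIBRARIES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Reading package metadata this way avoids importing torch or transformers just to print a version number, so the check stays fast.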

Downloading and Preparing Your LLM

We’ll use Mistral-7B as our example model:

Run the following commands in the WSL terminal:

git lfs install

git clone https://huggingface.co/mistralai/Mistral-7B-v0.1

Pro Tip: Model downloads and installations take time! Make some tea while the libraries install, or watch a movie while the model downloads.

You can also use two terminals: one to install the libraries and the other to download the model.
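If git lfs was not set up correctly before cloning, the large weight files arrive as tiny text pointers instead of real data, and model loading fails later with confusing errors. The sketch below scans a folder for such pointers. It relies on the fact that Git LFS pointer files begin with a fixed version line; the function names and the folder name are assumptions for illustration.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """True if the file is an un-downloaded Git LFS pointer rather than real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

def find_lfs_pointers(model_dir):
    """Return the names of files in model_dir that are still LFS pointers."""
    return sorted(
        p.name for p in Path(model_dir).iterdir()
        if p.is_file() and is_lfs_pointer(p)
    )

if __name__ == "__main__":
    model_dir = Path("Mistral-7B-v0.1")  # folder created by the git clone step
    if model_dir.is_dir():
        pointers = find_lfs_pointers(model_dir)
        print("Files still to download:", pointers or "none")
    else:
        print(f"{model_dir} not found; run this from the directory you cloned into")
```

If any files are listed, running git lfs pull inside the model folder should fetch the real contents.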

Model Loading and Quantization Techniques

Here’s a sample code snippet demonstrating efficient model loading:

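import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model(model_path="Mistral-7B-v0.1/"):
    # 4-bit NF4 quantization; matrix multiplications run in bfloat16
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    # local_files_only=True loads from the cloned folder without contacting the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        device_map="auto",  # let accelerate place the layers on the available GPU(s)
        quantization_config=nf4_config
    )
    return model, tokenizer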
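To see why 4-bit quantization matters on a consumer GPU, a quick back-of-the-envelope calculation helps. The sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of roughly 7.24 billion is an approximation assumed for illustration, and real usage will be higher because of activations, the KV cache, and quantization overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# Approximate parameter count, assumed here for illustration
MISTRAL_7B_PARAMS = 7.24e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: {weight_memory_gb(MISTRAL_7B_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights need about 14.5 GB in bfloat16 but only about 3.6 GB at 4 bits, which is what lets a 7B model fit on a card like the 8 GB RTX 3070 shown in the nvidia-smi sample.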

Building a FastAPI Endpoint

Create a simple prediction endpoint:

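from fastapi import FastAPI

app = FastAPI()

# model, tokenizer, and predict must already be defined at module level
@app.get("/predict")
async def make_prediction(query: str):
    prediction = predict(model, tokenizer, query)
    return {"prediction": prediction}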

Save the code as main.py, then start the server with:

uvicorn main:app

Full Code

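import torch
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model(model_path="Mistral-7B-v0.1/"):
    # 4-bit NF4 quantization with bfloat16 compute
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        device_map="auto",
        quantization_config=nf4_config
    )
    return model, tokenizer

def predict(model, tokenizer, query, max_new_tokens=128):
    # Minimal generation helper, sketched here as an assumption; the 128-token cap is arbitrary
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Load the model once at startup so every request reuses it
model, tokenizer = load_model()

app = FastAPI()

@app.get("/predict")
async def make_prediction(query: str):
    prediction = predict(model, tokenizer, query)
    return {"prediction": prediction}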

Now open http://127.0.0.1:8000/docs in your browser, where you will find the Swagger docs for testing the LLM.
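The endpoint can also be called from a script instead of the browser. Here is a minimal client sketch using only the standard library. It assumes the server is running on uvicorn's default address and that the endpoint returns a JSON object with a prediction key, as in the code above. The helper names build_predict_url and ask, and the generous 300-second timeout, are choices made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default host and port

def build_predict_url(query, base_url=BASE_URL):
    """Build the GET URL for /predict, URL-encoding the query string."""
    return f"{base_url}/predict?{urlencode({'query': query})}"

def ask(query, timeout=300):
    """Send the query to the local server and return the generated text."""
    with urlopen(build_predict_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["prediction"]

if __name__ == "__main__":
    try:
        print(ask("What is the capital of France?"))
    except OSError as exc:
        # Covers connection refused, timeouts, and HTTP errors
        print("Could not reach the server; is uvicorn running?", exc)
```

The long timeout is deliberate: generation on a consumer GPU can take a while, especially on the first request.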

Troubleshooting and Best Practices

Monitor GPU memory usage
Use quantization techniques
Manage your virtual environments carefully
Keep your libraries updated

Conclusion: Your Journey into Local LLM Deployment

You’ve now taken a significant step into the world of AI and machine learning. Running LLMs locally is no small feat – it requires technical skill, patience, and a passion for exploration.

Remember, every error is a learning opportunity. Each configuration challenge makes you a better developer and AI enthusiast.

Happy Coding!

Next Steps

Experiment with different models
Explore advanced quantization techniques
Build innovative applications leveraging local LLMs

Disclaimer: The AI landscape evolves rapidly. Always refer to the latest documentation and community resources.
