Congratulations. You are taking a step in the right direction by learning about Large Language Models. The road ahead is exciting, and at times perilous, but I can assure you it will be an amazing ride exploring everything happening in the wonderful world of AI.
While libraries and applications such as vLLM and Ollama let us host LLMs efficiently and without hassle, it is always good to go back to basics and learn how these libraries might be processing things behind the scenes.
Modern LLMs are computational behemoths that require serious processing power. Your GPU is the secret weapon that transforms these models from theoretical concepts into practical, responsive AI assistants. By leveraging GPU acceleration, you can run state-of-the-art language models right on your personal computer, opening up a world of possibilities for developers, researchers, and AI enthusiasts.
Before diving into LLM deployment, let's make sure your machine is ready for the challenge. First, confirm that your NVIDIA drivers are installed and your GPU is visible by running nvidia-smi:
```
Thu Dec 28 15:58:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98         CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...  WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8              11W /  94W |     59MiB / 8192MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory|
|        ID   ID                                                              Usage     |
|=======================================================================================|
|    0   N/A  N/A      8092    C+G   ...cal\Postman\app-10.21.0\Postman.exe   N/A       |
|    0   N/A  N/A      8984    C+G   C:\Windows\explorer.exe                  N/A       |
+---------------------------------------------------------------------------------------+
```
If nvidia-smi doesn’t return the expected output, head to the official NVIDIA website and download the latest drivers specific to your graphics card. Precision matters here!
Download the latest official NVIDIA drivers
For this journey, we’ll be using Visual Studio Code, a versatile, free code editor built on an open-source codebase, together with the Windows Subsystem for Linux (WSL).
Why WSL? Libraries like accelerate and bitsandbytes prefer a Linux environment. WSL bridges this gap, allowing you to harness the full potential of these libraries on your Windows machine.
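If WSL isn't set up on your machine yet, it can typically be installed from an elevated PowerShell prompt. This assumes a reasonably recent Windows 10 or Windows 11 build, and the Ubuntu distribution here is just an example:

```
wsl --install -d Ubuntu
```

A reboot is usually required before the new distribution is ready to use.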
You can then start WSL by opening a terminal and typing:
```
wsl
```
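The next steps assume conda is available inside WSL. If it isn't, a typical Miniconda setup (using the official installer from Anaconda's repository) looks roughly like this:

```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Restart the shell (or `source ~/.bashrc`) so the conda command is on PATH
```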
Create a dedicated virtual environment for your LLM project:
```
conda create -n testingllm python=3.11.5
conda activate testingllm
```
Your requirements.txt is your roadmap to success. The key libraries are transformers, accelerate, bitsandbytes, torch, fastapi, and uvicorn; the rest are supporting dependencies.
Here is the full requirements.txt (you can save it as a separate file):
```
accelerate @ git+https://github.com/huggingface/accelerate.git@dab62832de44c84e80045e4db53e087b71d0fd85
aiofiles==23.2.1
aiohttp==3.8.6
aiosignal==1.3.1
altair==5.2.0
annotated-types==0.6.0
anyio==3.7.1
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.1.0
bitsandbytes==0.41.1
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
dataclasses-json==0.6.3
datasets==2.14.5
dill==0.3.7
docker-pycreds==0.4.0
docstring-parser==0.15
easyllm==0.6.2
einops==0.7.0
exceptiongroup==1.2.0
fastapi==0.104.1
ffmpy==0.3.1
filelock==3.12.4
fonttools==4.45.1
frozenlist==1.4.0
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.37
gradio==4.7.1
gradio_client==0.7.0
greenlet==3.0.1
h11==0.14.0
httpcore==1.0.2
httpx==0.25.2
huggingface-hub==0.16.4
idna==3.4
importlib-resources==6.1.1
Jinja2==3.1.2
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
langchain==0.0.343
langchain-core==0.0.7
langsmith==0.0.67
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.20.1
matplotlib==3.8.2
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
nanoid==2.0.0
networkx==3.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.2.140
nvidia-nvtx-cu12==12.1.105
orjson==3.9.10
packaging==23.2
pandas==2.1.1
pathtools==0.1.2
peft @ git+https://github.com/huggingface/peft.git@aaa7e9f44a6405af819e721d7ee7fc6dd190c980
Pillow==10.1.0
protobuf==4.24.4
psutil==5.9.6
py-expression-eval==0.3.14
pyarrow==13.0.0
pydantic==2.1.1
pydantic_core==2.4.0
pydub==0.25.1
Pygments==2.16.1
pyparsing==3.1.1
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
referencing==0.31.1
regex==2023.10.3
requests==2.31.0
rich==13.6.0
rpds-py==0.13.2
safetensors==0.4.0
scipy==1.11.3
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.32.0
setproctitle==1.3.3
shellingham==1.5.4
shtab==1.6.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.27.0
sympy==1.12
tenacity==8.2.3
tokenizers==0.14.1
tomlkit==0.12.0
toolz==0.12.0
torch==2.1.0
tqdm==4.66.1
transformers @ git+https://github.com/huggingface/transformers.git@21dc5859421cf0d7d82d374b10f533611745a8c5
triton==2.1.0
trl==0.7.2
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.8.0
tyro==0.5.10
tzdata==2023.3
urllib3==2.0.6
uvicorn==0.24.0.post1
wandb==0.15.12
websockets==11.0.3
xformers==0.0.22.post4
xxhash==3.4.1
yarl==1.9.2
```

Install them using:

```
pip install -r requirements.txt
```
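Once the installation finishes, a quick sanity check (a minimal sketch; run it inside the activated environment) confirms that PyTorch can see your GPU through WSL:

```python
import torch

# Should print True if the CUDA driver and toolkit are visible from WSL
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # e.g. "NVIDIA GeForce RTX 3070 Laptop GPU"
    print("Device:", torch.cuda.get_device_name(0))
```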
We’ll use Mistral-7B as our example model:
Run the following commands in the WSL terminal
```
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
```
Pro Tip: Make some tea while the libraries are installing, and watch a movie while the model is downloading. Model downloads and installations take time!
You can also use two terminals: one for installing the libraries and another for downloading the model.
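When the clone finishes, it is worth confirming that the weight files were actually pulled down by Git LFS rather than left as small pointer files (the directory name here assumes the default clone location):

```
du -sh Mistral-7B-v0.1/   # should be several GB, not a few hundred KB
ls Mistral-7B-v0.1/       # look for the *.safetensors / *.bin weight shards
```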
Here’s a sample code snippet demonstrating memory-efficient model loading with 4-bit (NF4) quantization via bitsandbytes:
```python
def load_model(model_path="Mistral-7B-v0.1/"):
    # 4-bit NF4 quantization so the 7B model fits into 8 GB of VRAM
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        device_map="auto",
        quantization_config=nf4_config,
    )
    return model, tokenizer
```
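The endpoint snippets below call a predict helper that isn't shown in the original post. A minimal sketch of what it might look like, using standard transformers generation (the prompt handling and generation parameters here are assumptions):

```python
def predict(model, tokenizer, query, max_new_tokens=256):
    # Tokenize the query and move it to the same device as the model
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    # Sampled generation; tweak the parameters to taste
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
    )
    # Drop the prompt tokens and decode only the newly generated text
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```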
Create a simple prediction endpoint:
```python
# Load the model once at startup and reuse it for every request
model, tokenizer = load_model()

app = FastAPI()

@app.get("/predict")
async def make_prediction(query: str):
    prediction = predict(model, tokenizer, query)
    return {"prediction": prediction}
```
Run it with:
```
uvicorn main:app
```
Putting it all together, here is the full main.py:

```python
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from fastapi import FastAPI

# NOTE: this listing assumes a predict() helper like the sketch shown earlier

def load_model(model_path="Mistral-7B-v0.1/"):
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        device_map="auto",
        quantization_config=nf4_config,
    )
    return model, tokenizer

model, tokenizer = load_model()

app = FastAPI()

@app.get("/predict")
async def make_prediction(query: str):
    prediction = predict(model, tokenizer, query)
    return {"prediction": prediction}
```
Now open http://127.0.0.1:8000/docs in your browser, where you will find the Swagger UI to test out the LLM.
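You can also hit the endpoint directly, for example with the requests library from the same environment (the query text here is just an example):

```python
import requests

response = requests.get(
    "http://127.0.0.1:8000/predict",
    params={"query": "Explain GPU acceleration in one sentence."},
)
print(response.json()["prediction"])
```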
You’ve now taken a significant step into the world of AI and machine learning. Running LLMs locally is no small feat – it requires technical skill, patience, and a passion for exploration.
Remember, every error is a learning opportunity. Each configuration challenge makes you a better developer and AI enthusiast.
Happy Coding!
Disclaimer: The AI landscape evolves rapidly. Always refer to the latest documentation and community resources.