DeepSeek Self-Host Guide: Complete Setup and Configuration

What is DeepSeek?

DeepSeek is a series of high-performance large language models designed for code generation, reasoning, and general conversation. DeepSeek models are available for self-hosting, providing enterprises and developers with on-premises AI capabilities while maintaining data privacy and control.

Features of DeepSeek

Advanced Language Understanding

  • Code Generation: Excellent at programming tasks across multiple languages
  • Mathematical Reasoning: Strong performance on mathematical and logical problems
  • Multilingual Support: Supports Chinese, English, and other languages
  • Context Understanding: Large context windows for complex conversations

Multiple Model Variants

  • DeepSeek-Coder: Specialized for programming and software development
  • DeepSeek-Math: Optimized for mathematical reasoning and problem-solving
  • DeepSeek-V3: Latest general-purpose model with enhanced capabilities
  • DeepSeek-VL: Vision-language model for multimodal tasks

Enterprise Features

  • API Compatibility: OpenAI-compatible API for easy integration
  • High Throughput: Optimized for production workloads
  • Fine-tuning Support: Customize models for specific use cases
  • Quantization Support: Reduced memory usage with minimal quality loss

Why Self Host DeepSeek?

Self-hosting provides complete control over your AI infrastructure and data.

Benefits of Self-Hosting DeepSeek

  • Data Privacy: Keep sensitive conversations and code on your infrastructure
  • Cost Control: No per-token pricing for high-volume usage
  • Customization: Fine-tune models for your specific domain
  • Latency: Reduce response times with local deployment
  • Compliance: Meet regulatory requirements for data handling
  • Offline Operation: Work without internet connectivity

System Requirements

Minimum Requirements (7B Model)

  • GPU: NVIDIA RTX 4090 or equivalent (24GB VRAM)
  • CPU: 8 cores
  • RAM: 32GB
  • Storage: 100GB SSD
  • OS: Linux (Ubuntu 20.04+ recommended)

Recommended Requirements (67B Model)

  • GPU: 4x NVIDIA A100 (80GB each) or 8x RTX 4090
  • CPU: 32+ cores
  • RAM: 128GB+
  • Storage: 500GB+ NVMe SSD
  • Network: High-speed interconnect for multi-GPU setup

Cloud Instance Recommendations

  • AWS: p4d.24xlarge, p3.16xlarge
  • Google Cloud: a2-ultragpu-8g
  • Azure: Standard_ND96asr_v4
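
Before provisioning, it helps to confirm what the host actually exposes. The short check below is an illustrative sketch (it assumes a CUDA-enabled PyTorch install) that prints each visible GPU and its VRAM.

# Quick check of visible GPUs and their VRAM (requires PyTorch with CUDA support)
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")
else:
    print("No CUDA-capable GPU detected")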

Installation Guide

Docker Installation

  1. Install NVIDIA Container Toolkit:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
  2. Pull DeepSeek Docker image:
docker pull deepseek/deepseek-llm:latest
  3. Run DeepSeek container:
docker run -d \
  --name deepseek \
  --gpus all \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -e MODEL_PATH=/models/deepseek-llm-67b-chat \
  deepseek/deepseek-llm:latest
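
Once the container is up, a quick way to confirm it is accepting connections is to hit the health endpoint referenced later in this guide. This is a minimal sketch; adjust the path if your image exposes a different route.

# Minimal liveness check against the container
# (assumes a /health endpoint on port 8000, as used by the health monitoring script later in this guide)
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8000/health", timeout=5) as resp:
        print("DeepSeek container is reachable, HTTP", resp.status)
except Exception as exc:
    print("DeepSeek container is not reachable:", exc)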

Manual Installation

  1. Install Python dependencies:
# Create virtual environment
python3 -m venv deepseek-env
source deepseek-env/bin/activate

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install transformers and other dependencies
pip install transformers==4.36.0
pip install accelerate
pip install bitsandbytes
pip install flash-attn --no-build-isolation
  2. Download DeepSeek models:
# Using Hugging Face Hub
pip install huggingface-hub
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='deepseek-ai/deepseek-llm-67b-chat', local_dir='./models/deepseek-llm-67b-chat')"
  3. Create inference server:
# server.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model and tokenizer
model_path = "./models/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.json
    messages = data['messages']
    
    # Format messages with the model's built-in chat template and move inputs to the model device
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=data.get('max_tokens', 512),
            temperature=data.get('temperature', 0.7),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    
    return jsonify({
        "choices": [{
            "message": {"role": "assistant", "content": response},
            "index": 0,
            "finish_reason": "stop"
        }]
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
  4. Start the server:
python server.py
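
With the server running, you can exercise the endpoint directly. The snippet below is a minimal client for the Flask server defined above; it assumes the server is listening on localhost:8000.

# Minimal client for the Flask server above (standard library only)
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Write a haiku about self-hosting."}],
    "max_tokens": 128,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])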

Using vLLM (High Performance)

  1. Install vLLM:
pip install vllm
  2. Start vLLM server:
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-llm-67b-chat \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8000
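
Because vLLM serves the standard OpenAI-compatible routes, a quick sanity check from Python is to list the models it is serving. This sketch assumes the openai package (version 1.x) is installed; vLLM does not require an API key unless you start it with one, so a placeholder value is fine.

# List the models served by the vLLM instance (OpenAI-compatible /v1/models route)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)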

Configuration

Model Configuration

# config.yaml
model:
  name: "deepseek-llm-67b-chat"
  path: "/models/deepseek-llm-67b-chat"
  dtype: "bfloat16"
  trust_remote_code: true

server:
  host: "0.0.0.0"
  port: 8000
  max_concurrent_requests: 64

generation:
  max_tokens: 2048
  temperature: 0.7
  top_p: 0.9
  top_k: 50
  repetition_penalty: 1.1

performance:
  tensor_parallel_size: 4
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  swap_space: 4
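
If you consume this file from your own launcher script, loading it is straightforward. The sketch below assumes PyYAML is installed and the file sits next to the launcher.

# Load config.yaml in a custom launcher script (requires: pip install pyyaml)
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["name"], cfg["server"]["port"], cfg["performance"]["tensor_parallel_size"])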

Environment Variables

# Model Configuration
MODEL_NAME=deepseek-llm-67b-chat
MODEL_PATH=/models/deepseek-llm-67b-chat
MAX_MODEL_LEN=4096
DTYPE=bfloat16

# Server Configuration
HOST=0.0.0.0
PORT=8000
WORKERS=1
TENSOR_PARALLEL_SIZE=4

# Performance Tuning
GPU_MEMORY_UTILIZATION=0.9
MAX_NUM_BATCHED_TOKENS=8192
MAX_NUM_SEQS=64
SWAP_SPACE=4

# Logging
LOG_LEVEL=INFO
LOG_FILE=/var/log/deepseek.log

# API Configuration
API_KEY=your-secret-api-key
DISABLE_LOG_STATS=false
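
If your launcher reads these variables rather than hard-coding values, a small helper keeps the defaults in one place. This is an illustrative sketch, not part of any official tooling.

# Read deployment settings from the environment, with sensible fallbacks
import os

MODEL_PATH = os.environ.get("MODEL_PATH", "/models/deepseek-llm-67b-chat")
PORT = int(os.environ.get("PORT", "8000"))
TENSOR_PARALLEL_SIZE = int(os.environ.get("TENSOR_PARALLEL_SIZE", "1"))
GPU_MEMORY_UTILIZATION = float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.9"))

print(f"Serving {MODEL_PATH} on port {PORT} (TP={TENSOR_PARALLEL_SIZE})")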

Multi-GPU Setup

# For 4-GPU setup
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-llm-67b-chat \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --host 0.0.0.0 \
  --port 8000

API Usage

Chat Completions

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "deepseek-llm-67b-chat",
    "messages": [
      {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

Code Generation Example

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)

response = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Create a REST API using FastAPI for a todo application"}
    ],
    max_tokens=1000,
    temperature=0.3
)

print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="deepseek-llm-67b-chat",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Performance Optimization

Memory Optimization

# Enable quantization to reduce memory usage
# (--quantization awq expects an AWQ-quantized checkpoint; replace the path below with yours)
python -m vllm.entrypoints.openai.api_server \
  --model /models/deepseek-llm-67b-chat-awq \
  --quantization awq \
  --dtype half

Batch Processing

# Send requests concurrently so the server can batch them
# (vLLM applies continuous batching to in-flight requests)
from concurrent.futures import ThreadPoolExecutor

prompts = [f"Question {i}" for i in range(10)]

def complete(prompt):
    return client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        messages=[{"role": "user", "content": prompt}],
    )

with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(complete, prompts))

Caching Strategy

from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_completion(prompt: str) -> str:
    # Repeated prompts are answered from the in-process cache instead of hitting the model
    response = client.chat.completions.create(
        model="deepseek-llm-67b-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def get_completion(prompt: str) -> str:
    # Strings are hashable, so the prompt itself serves as the cache key
    return cached_completion(prompt)

Backup and Maintenance

Model Backup

#!/bin/bash
# Backup DeepSeek models and configuration

BACKUP_DATE=$(date +%Y%m%d)
BACKUP_DIR=/backup/deepseek/$BACKUP_DATE

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup models
cp -r /models "$BACKUP_DIR/"

# Backup configuration
cp config.yaml "$BACKUP_DIR/"
cp .env "$BACKUP_DIR/"

# Create tar archive
tar -czf /backup/deepseek-$BACKUP_DATE.tar.gz -C /backup/deepseek "$BACKUP_DATE"

# Cleanup old backups (keep 7 days)
find /backup -name "deepseek-*.tar.gz" -mtime +7 -delete

Health Monitoring

#!/bin/bash
# Health check script

# Check API endpoint
if curl -f http://localhost:8000/health; then
    echo "DeepSeek API is healthy"
else
    echo "DeepSeek API is down"
    # Restart service
    systemctl restart deepseek
fi

# Check GPU memory usage
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | while read mem; do
    if [ "$mem" -gt 75000 ]; then
        echo "High GPU memory usage: ${mem}MB"
    fi
done

Log Management

# Log rotation configuration
# /etc/logrotate.d/deepseek
/var/log/deepseek.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    postrotate
        systemctl reload deepseek
    endscript
}

Troubleshooting

Common Issues

Out of Memory Errors

# Reduce CUDA memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# For vLLM, lower memory pressure by shrinking the context window and
# GPU memory utilization, or serve a smaller/quantized checkpoint
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-llm-67b-chat \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.85

Slow Inference

# Check GPU utilization
nvidia-smi -l 1

# Optimize batch size
python -m vllm.entrypoints.openai.api_server \
  --max-num-batched-tokens 4096

Model Loading Issues

# Clear the Hugging Face download cache
# (recent versions store models under ~/.cache/huggingface/hub)
rm -rf ~/.cache/huggingface/hub/

# Verify model files
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('deepseek-ai/deepseek-llm-67b-chat')"

Performance Monitoring

# monitoring.py
# Requires: pip install psutil gputil
import psutil
import GPUtil
import time

def monitor_resources():
    while True:
        # CPU usage
        cpu_percent = psutil.cpu_percent()
        
        # Memory usage
        memory = psutil.virtual_memory()
        
        # GPU usage
        gpus = GPUtil.getGPUs()
        
        print(f"CPU: {cpu_percent}%, RAM: {memory.percent}%")
        for gpu in gpus:
            print(f"GPU {gpu.id}: {gpu.memoryUtil*100:.1f}% memory, {gpu.load*100:.1f}% usage")
        
        time.sleep(10)

if __name__ == "__main__":
    monitor_resources()

FAQ

Which DeepSeek model should I use?

  • DeepSeek-Coder: Best for programming and code generation tasks
  • DeepSeek-Math: Optimal for mathematical reasoning
  • DeepSeek-V3: Latest general-purpose model for diverse tasks

How much VRAM do I need?

  • 7B model: 16-24GB VRAM
  • 33B model: 64-80GB VRAM (2x A100)
  • 67B model: 128-160GB VRAM (4x A100)
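
As a rule of thumb, the weights alone take roughly two bytes per parameter at bf16/fp16, before the KV cache and runtime overhead are added. The quick calculation below shows why the figures above land where they do.

# Back-of-the-envelope VRAM needed for the weights alone at 16-bit precision
# (actual usage is higher once the KV cache and runtime overhead are included)
def weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for size in (7, 33, 67):
    print(f"{size}B parameters: ~{weight_vram_gb(size):.0f} GB of weights")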

Can I run DeepSeek on CPU only?

Yes, but it will be very slow. GPU acceleration is highly recommended for production use.

How do I fine-tune DeepSeek models?

# Assumes model and tokenizer are already loaded (as in server.py above)
# and train_dataset is a tokenized dataset prepared beforehand
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./deepseek-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()

Is DeepSeek compatible with OpenAI API?

Yes, DeepSeek can be deployed with OpenAI-compatible endpoints, making it easy to integrate with existing applications.

How do I scale DeepSeek horizontally?

Use a load balancer to distribute requests across multiple DeepSeek instances, for example with an nginx upstream block:

upstream deepseek_backend {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;
    location / {
        proxy_pass http://deepseek_backend;
    }
}

Can I use DeepSeek for commercial purposes?

Check the specific model license. Many DeepSeek models have permissive licenses allowing commercial use.