Setting up the inference service

Setting up the node infrastructure for Atoma.

To spawn the AI inference service that supports Atoma's compute requirements, we suggest that nodes rely primarily on Atoma's own AI inference stack, which is highly optimized for decentralized settings. One big advantage of Atoma's inference stack is that it is written entirely in Rust and CUDA, without relying on legacy libraries such as PyTorch, making it an ideal choice for serverless deployments.

That said, we also support other inference frameworks (such as vLLM), as long as they provide an OpenAI-compatible API.
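
For reference, vLLM ships its own OpenAI-compatible server. A typical invocation is shown below purely as an illustration; check the vLLM documentation for the current CLI and flags, and adjust the model name and port to your setup:

$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-405B-Instruct --port 8080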

Install Rust

To install Rust, follow the instructions on the official website.
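
On most Linux and macOS systems, this amounts to running the official rustup installer and loading the Cargo environment into the current shell (the commands below are taken from the official instructions; verify them there before running):

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source "$HOME/.cargo/env"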

Start the environment

The first step is to clone the repository and initialize its git submodule dependencies:

$ git clone https://github.com/atoma-network/atoma-infer.git 
$ cd atoma-infer
$ git submodule update --init --recursive
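
Before building, it can be useful to confirm that the Rust toolchain and the NVIDIA tooling are visible on the machine, since the service is compiled against CUDA (the exact CUDA version required depends on your GPU and driver):

$ rustc --version
$ cargo --version
$ nvidia-smi     # should list the GPUs you plan to reference in device_ids below
$ nvcc --version # CUDA toolkit compiler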

Configuration and .env files

For the AI inference service to run properly, a few configuration values must be specified. These include which model to run inference on, a HuggingFace API key for downloading model weights, and CPU and GPU resource utilization values.

The configuration file has the following general structure:

[inference]
api_key = "YOUR_HUGGINGFACE_API_KEY" # Your HuggingFace API key
cache_dir = "CACHE_DIRECTORY_FOR_STORING_MODEL_WEIGHTS" # Directory to store the model weights
flush_storage = true # Whether to flush the storage after the model has been loaded, or not
model_name = "HUGGING_FACE_MODEL_ID" # HuggingFace model ID, e.g. "meta-llama/Llama-3.1-405B-Instruct"
device_ids = [0] # List of GPU IDs to use; if you have multiple GPUs, provide the list of available IDs (e.g. [0, 1] for two available GPU devices, [0, 1, 2, 3] for four available GPU devices, etc.)
dtype = "bf16" # Data type to use for inference (this value is model dependent and can be found in the model's config.json file)
num_tokenizer_workers = 4 # Number of workers to use for tokenizing incoming inference requests, leveraging Round-Robin scheduling
revision = "main" # Revision of the model to use, e.g. "main" or "refs/tags/v1.0"

[cache]
block_size = 16 # Block size to use for the vLLM cache memory management (recommended values are 16, 32, 64).
cache_dtype = "bf16" # Cache data type; most often it agrees with inference.dtype above
gpu_memory_utilization = 0.7 # Fraction of the GPU memory to use for storing the KV cache (ideally between 0.7 and 0.9)
swap_space_fraction = 0.1 # Fraction of memory to reserve as swap space for KV cache blocks during inference (ideally not more than 0.4)

[scheduler]
max_num_batched_tokens = 1048576 # Maximum number of total batched tokens for the vLLM scheduler; this value depends on the node's GPU capacity and the model's maximum sequence length (from the training process)
max_num_sequences = 128 # Maximum number of batched sequences that the vLLM scheduler can handle (this value depends on the node's GPU capacity)
max_model_len = 8192 # Maximum length (in tokens) that a model sequence can have for the vLLM scheduler
delay_factor = 0.0 # Delay factor to use for the vLLM scheduler
enable_chunked_prefill = true # Whether to use the chunked prefill feature for the vLLM scheduler
block_size = 16 # Block size to use for the vLLM cache memory management

[validation]
best_of = 1 # Maximum best-of value allowed per request
max_stop_sequences = 1 # Maximum number of stop sequences allowed per request
max_top_n_tokens = 1 # Maximum number of top-n tokens allowed per request
max_input_length = 4096 # Maximum input length (in tokens) allowed per request
max_total_tokens = 8192 # Maximum total tokens (input + output) allowed per request
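
As a rough illustration of the cache settings above: on a GPU with 80 GB of memory (a hypothetical figure used only for this example), gpu_memory_utilization = 0.7 would budget approximately 0.7 × 80 GB = 56 GB for the KV cache, and swap_space_fraction = 0.1 would reserve roughly a further 8 GB as swap space.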

To serve requests through the OpenAI-compatible API, it is necessary to set up a .env file in the root folder, as follows:

# Deployment mode
DEPLOYMENT_MODE="production"

# Server
SERVER_ADDRESS="0.0.0.0"
SERVER_PORT="8080" # Available port for the API

Once this file is properly set up, the inference service can be spawned by running one of the following commands:

  1. For single-GPU serving:

$ RUST_LOG=info cargo run --release --features vllm -- --config-path PATH_TO_CONFIG

  2. For multi-GPU serving:

$ RUST_LOG=info cargo run --release --features nccl -- --config-path PATH_TO_CONFIG
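
Once the service is up, you can sanity-check it by sending a request to the OpenAI-compatible endpoint. The route and payload below are illustrative assumptions based on the standard OpenAI chat completions API; adjust the model name and port to match your configuration:

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'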
