Configure environment variables by creating a .env file:
Start by running
```bash
cp .env.example .env
```
You should then see a file of the form:
```
# Hugging Face Configuration
HF_CACHE_PATH=~/.cache/huggingface
HF_TOKEN= # Required for gated models

# Inference Server Configuration
INFERENCE_SERVER_PORT=50000 # External port for vLLM service
MODEL=meta-llama/Llama-3.1-70B-Instruct
MAX_MODEL_LEN=4096 # Context length
GPU_COUNT=1 # Number of GPUs to use
TENSOR_PARALLEL_SIZE=1 # Should be equal to GPU_COUNT

# Sui Configuration
SUI_CONFIG_PATH=~/.sui/sui_config

# Atoma Node Service Configuration
ATOMA_SERVICE_PORT=3000 # External port for Atoma service
```
You need to fill in the HF_TOKEN variable with your Hugging Face API token. See the official [documentation](https://huggingface.co/docs/hub/security-tokens) for more information on how to create one.
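Optionally, you can confirm that the token works before starting the containers. The snippet below is a quick sanity check, assuming `curl` is installed: it loads the variables from `.env` into your shell and asks the Hugging Face `whoami-v2` endpoint which account the token belongs to.

```bash
# Export the variables defined in .env into the current shell
set -a && source .env && set +a

# A valid token returns a JSON document describing your account; an invalid one returns an error
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```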
Configure config.toml, using config.example.toml as a template, by running:
```bash
cp config.example.toml config.toml
```
You should now have a config.toml file with the following contents:
```toml
[atoma-service]
inference_service_url = "http://vllm:8000" # Internal Docker network URL for inference service
embeddings_service_url = ""
multimodal_service_url = ""
models = ["meta-llama/Llama-3.1-70B-Instruct"] # Replace it with the list of models you want to deploy
revisions = [""]
service_bind_address = "0.0.0.0:3000" # Bind to all interfaces

[atoma-sui]
http_rpc_node_addr = ""
atoma_db = ""
atoma_package_id = ""
toma_package_id = ""
request_timeout = { secs = 300, nanos = 0 }
max_concurrent_requests = 10
limit = 100
node_small_ids = [0, 1, 2] # List of node IDs under control
task_small_ids = [] # List of task IDs under control
sui_config_path = "~/.sui/sui_config/client.yaml"
sui_keystore_path = "~/.sui/sui_config/sui.keystore"

[atoma-state]
database_url = "sqlite:///app/data/atoma.db"
```
You can run multiple services on the same node, such as inference (chat completions), embeddings, and multi-modal, by setting the corresponding service URLs in config.toml, as sketched below.
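For example, a node that also serves embeddings and multi-modal requests would fill in the two URLs that are left empty above. The host names and ports below are placeholders, not the actual values; use the service names and internal ports defined in your docker-compose.yml.

```toml
# Placeholder host names and ports -- replace with the values from your docker-compose.yml
embeddings_service_url = "http://embeddings:80"
multimodal_service_url = "http://multimodal:80"
```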
Create required directories
```bash
mkdir -p data logs
```
Start the containers
If you plan to run a chat completions service:
```bash
# Build and start all services
COMPOSE_PROFILES=chat_completions_vllm docker compose up --build

# Or run in detached mode
COMPOSE_PROFILES=chat_completions_vllm docker compose up -d --build
```
For text embeddings:
```bash
# Build and start all services
COMPOSE_PROFILES=embeddings_tei docker compose up --build

# Or run in detached mode
COMPOSE_PROFILES=embeddings_tei docker compose up -d --build
```
For image generation:
```bash
# Build and start all services
COMPOSE_PROFILES=image_generations_mistral docker compose up --build

# Or run in detached mode
COMPOSE_PROFILES=image_generations_mistral docker compose up -d --build
```
It is possible to run any combination of the above profiles, provided the node has enough GPU compute available. For example, to run all three services simultaneously, run:
```bash
# Build and start all services
COMPOSE_PROFILES=chat_completions_vllm,embeddings_tei,image_generations_mistral docker compose up --build

# Or run in detached mode
COMPOSE_PROFILES=chat_completions_vllm,embeddings_tei,image_generations_mistral docker compose up -d --build
```
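After the containers come up, you can verify their status and watch the logs with standard Docker Compose commands:

```bash
# List the running services and the ports they publish
docker compose ps

# Follow the logs of the running services (Ctrl+C to stop following)
docker compose logs -f
```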
Container Architecture
The deployment consists of two main services:
vLLM Service: Handles the AI model inference
Atoma Node: Manages the node operations and connects to the Atoma Network
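The two containers communicate over the internal Docker network, which is why config.toml points inference_service_url at http://vllm:8000 rather than at a localhost port. As a rough check, you can curl the vLLM health endpoint from inside the node container; the service name `atoma-node` and the presence of `curl` inside that container are assumptions, so adjust to whatever your docker-compose.yml defines.

```bash
# "atoma-node" is a hypothetical service name; vLLM's /health endpoint returns HTTP 200 once the model is loaded
docker compose exec atoma-node curl -s -o /dev/null -w "%{http_code}\n" http://vllm:8000/health
```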
Service URLs
vLLM Service: http://localhost:50000 (configured via INFERENCE_SERVER_PORT)
Atoma Node: http://localhost:3000 (configured via ATOMA_SERVICE_PORT)
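From the host, you can reach both services on their published ports. The vLLM container exposes an OpenAI-compatible API, so listing the loaded models is a convenient smoke test; the /health path on the Atoma node is an assumption, so check the Atoma node API reference for its actual endpoints. Adjust the ports if you changed INFERENCE_SERVER_PORT or ATOMA_SERVICE_PORT.

```bash
# vLLM: list the models the inference server is currently serving (OpenAI-compatible API)
curl -s http://localhost:50000/v1/models

# Atoma node: print the HTTP status code of a hypothetical /health endpoint
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/health
```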