Inference API
Anura, Lilypad's official AI inference API
Getting Started
Use Anura to start running AI inference job modules on Lilypad's decentralized compute network:
1. Get an API key from the Anura website.
2. Check which models are supported (see Get Available Models below).
3. Choose a model and customize your request, as in the sketch below.
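A minimal non-streaming request might look like the following. This is a sketch only: the base URL placeholder, the environment-variable handling, and the model choice are assumptions for illustration, not values fixed by this page.

```python
import os
import requests

# Assumptions: ANURA_API_URL points at your Anura host and ANURA_API_KEY holds
# the key obtained from the Anura website.
BASE_URL = os.environ["ANURA_API_URL"]
API_KEY = os.environ["ANURA_API_KEY"]

resp = requests.post(
    f"{BASE_URL}/api/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "deepseek-r1:7b",  # any supported model ID
        "messages": [
            {"role": "user", "content": "Explain Lilypad in one sentence."}
        ],
        "stream": False,  # ask for a single JSON object rather than a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```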
API Endpoints
API Clients
If you are using an API client such as Bruno or Postman, you can use our provided collections below.
Get Available Models
To see which models are available:
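A minimal sketch, assuming a models listing endpoint at GET /api/v1/models (check the API reference or the client collections above for the exact path) and the same environment variables as in the earlier example:

```python
import os
import requests

BASE_URL = os.environ["ANURA_API_URL"]  # assumed placeholder for the Anura host
API_KEY = os.environ["ANURA_API_KEY"]

# Assumption: the list of supported models is exposed at /api/v1/models.
resp = requests.get(
    f"{BASE_URL}/api/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```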
Chat Completions API
Streaming Chat Completions
POST /api/v1/chat/completions
This endpoint provides a streaming interface for chat completions using Server-Sent Events (SSE).
Request Headers
- Content-Type: application/json (required)
- Accept: text/event-stream (recommended for streaming)
- Authorization: Bearer YOUR_API_KEY (required)
Request Body
| Field | Type | Description |
| --- | --- | --- |
| model | string | Required. ID of the model to use (e.g., "deepseek-r1:7b") |
| messages | array | Required. Array of message objects representing the conversation |
| stream | boolean | If true, the response is delivered as a stream of JSON objects; if false, a single JSON object is returned |
| options | object | Additional model parameters, listed in the table below |
Valid Options Parameters and Default Values
| Parameter | Description | Default |
| --- | --- | --- |
| mirostat | Enable Mirostat sampling for controlling perplexity (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0). | 0 |
| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate results in slower adjustments; a higher learning rate makes the algorithm more responsive. | 0.1 |
| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value results in more focused and coherent text. | 5 |
| num_ctx | Sets the size of the context window used to generate the next token. | 2048 |
| repeat_last_n | Sets how far back the model looks to prevent repetition (0 = disabled, -1 = num_ctx). | 64 |
| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) penalizes repetitions more strongly; a lower value (e.g., 0.9) is more lenient. | 1.1 |
| temperature | The temperature of the model. Increasing the temperature makes the model answer more creatively. | 0.8 |
| seed | Sets the random number seed to use for generation. Setting this to a specific number makes the model generate the same text for the same prompt. | 0 |
| stop | Sets the stop sequences to use. When this pattern is encountered, the LLM stops generating text and returns. Multiple stop patterns may be set by specifying multiple separate stop parameters. | |
| num_predict | Maximum number of tokens to predict when generating text (-1 = infinite generation). | -1 |
| top_k | Reduces the probability of generating nonsense. A higher value (e.g., 100) gives more diverse answers; a lower value (e.g., 10) is more conservative. | 40 |
| top_p | Works together with top_k. A higher value (e.g., 0.95) leads to more diverse text; a lower value (e.g., 0.5) generates more focused and conservative text. | 0.9 |
| min_p | Alternative to top_p, aiming to ensure a balance of quality and variety. The parameter p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with p = 0.05 and the most likely token at probability 0.9, logits with a value less than 0.045 are filtered out. | 0.0 |
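To pass these parameters, include an options object in the request body. The sketch below shows a few of them; the values are arbitrary illustrations, and only the parameter names and defaults come from the table above.

```python
# Sketch of a request body that sets a few options; values are illustrative only.
body = {
    "model": "deepseek-r1:7b",
    "messages": [{"role": "user", "content": "Write a haiku about decentralized compute."}],
    "stream": False,
    "options": {
        "temperature": 0.3,  # lower than the 0.8 default for steadier output
        "num_predict": 128,  # cap the number of generated tokens
        "top_p": 0.9,
        "seed": 42,          # fixed seed for reproducible output
    },
}
```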
Response Format
The response is a stream of Server-Sent Events (SSE). The stream carries three kinds of events: processing updates, content delivery events with the generated text, and a completion marker that ends the stream. The fields carried by the content (delta) events are listed under Response Object Fields below.
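A hedged sketch of consuming the stream with Python and requests is shown below. The exact event names, payload framing, and terminal marker are not reproduced on this page; the sketch assumes data: lines carrying JSON payloads with the fields documented below, and should be adapted to the events the API actually emits.

```python
import json
import os
import requests

BASE_URL = os.environ["ANURA_API_URL"]  # assumed placeholder for the Anura host
API_KEY = os.environ["ANURA_API_KEY"]

with requests.post(
    f"{BASE_URL}/api/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "model": "deepseek-r1:7b",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "event: ..." / "data: ..." lines separated by blanks.
        if not line or not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumption: a simple terminal marker
            break
        delta = json.loads(payload)
        print(delta.get("message", {}).get("content", ""), end="", flush=True)
        if delta.get("done"):
            break
```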
Response Codes
- 200 OK: Request successful, stream begins
- 400 Bad Request: Invalid request parameters
- 401 Unauthorized: Invalid or missing API key
- 404 Not Found: Requested model not found
- 500 Internal Server Error: Server error processing request
Response Object Fields
The delta event data contains the following fields:
| Field | Description |
| --- | --- |
| model | The model used for generation |
| created_at | Timestamp when the response was created |
| message | Contains the assistant's response |
| message.role | Always "assistant" for responses |
| message.content | The generated text content |
| done_reason | Reason for completion (e.g., "stop", "length") |
| done | Boolean indicating whether generation is complete |
| total_duration | Total processing time in nanoseconds |
| load_duration | Time taken to load the model, in nanoseconds |
| prompt_eval_count | Number of tokens in the prompt |
| prompt_eval_duration | Time taken to process the prompt, in nanoseconds |
| eval_count | Number of tokens generated |
| eval_duration | Time taken for generation, in nanoseconds |
Conversation Context
The API supports multi-turn conversations by including previous messages in the request:
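For example, a follow-up turn replays the earlier exchange so the model can answer in context. A minimal sketch (the assistant content is a placeholder for the model's earlier reply):

```python
# Sketch of a multi-turn request body; earlier user and assistant turns are
# replayed so the follow-up question can be resolved in context.
body = {
    "model": "deepseek-r1:7b",
    "messages": [
        {"role": "user", "content": "What is Lilypad?"},
        {"role": "assistant", "content": "<the model's earlier answer goes here>"},
        {"role": "user", "content": "How do I run a job on it?"},
    ],
    "stream": False,
}
```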
This allows for contextual follow-up questions and maintaining conversation history.
Jobs
GET /api/v1/jobs/:id - Get status and details of a specific job
Get Status/Details of a Job
You can use another terminal to check job status while the job is running.
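For example, a sketch using the job ID returned when the job was submitted (the ID value here is a placeholder):

```python
import os
import requests

BASE_URL = os.environ["ANURA_API_URL"]  # assumed placeholder for the Anura host
API_KEY = os.environ["ANURA_API_KEY"]
job_id = "YOUR_JOB_ID"                  # ID returned when the job was submitted

resp = requests.get(
    f"{BASE_URL}/api/v1/jobs/{job_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # status and details of the job
```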
Get Outputs from a Job
Once your job has run, the response includes the job's outputs.
Cowsay
POST /api/v1/cowsay - Create a new cowsay job
Request body: {"message": "text to display"}
GET /api/v1/cowsay/:id/results - Get results of a cowsay job
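A sketch of the round trip, assuming the create call returns a job ID that can be passed to the results endpoint (the exact response shape is not shown on this page):

```python
import os
import requests

BASE_URL = os.environ["ANURA_API_URL"]  # assumed placeholder for the Anura host
API_KEY = os.environ["ANURA_API_KEY"]
auth = {"Authorization": f"Bearer {API_KEY}"}

# Create a new cowsay job.
created = requests.post(
    f"{BASE_URL}/api/v1/cowsay",
    headers={"Content-Type": "application/json", **auth},
    json={"message": "moo from Lilypad"},
    timeout=60,
)
created.raise_for_status()
job_id = created.json().get("id")  # assumption: the response carries the job ID

# Fetch the results once the job has run.
results = requests.get(
    f"{BASE_URL}/api/v1/cowsay/{job_id}/results",
    headers=auth,
    timeout=60,
)
results.raise_for_status()
print(results.json())
```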