How To Cut LLM API Costs by 50% with One Scalable Workflow 

Introduction & Motivation

As large language models (LLMs) are increasingly used for tasks like evaluation, content generation, and bulk inference, the need for efficient, scalable interaction with these models has grown. Synchronous API calls, which send one request at a time, work well for small tasks or quick testing, but they become inefficient and costly at scale. This is where asynchronous batch APIs come in: they process many requests at once at nearly half the cost.

However, while major providers like OpenAI, Claude (Anthropic), and Azure OpenAI all support batch processing, each implements it differently: for example, OpenAI and Azure use file-based formats, while Claude uses an object-based format. Their request formats, submission methods, status tracking, and result handling vary enough that you often have to write and maintain separate logic for each one.

This blog presents a solution to that problem: a unified and scalable batch inference pipeline that works seamlessly across different providers, including OpenAI, Claude (Anthropic), and Azure OpenAI. The code is available here: https://github.com/al-reemks/unified-batch-api-pipeline. The pipeline handles provider-specific logic internally, enabling users to interact with a single, unified interface, regardless of the provider. This not only reduces the complexity of managing separate workflows, but also improves consistency and saves time when working with high-volume LLM use cases. Most importantly, integrating batch APIs into the pipeline reduces costs by up to 50%. 

Understanding Batch APIs

Batch APIs allow you to submit many requests all at once and process them asynchronously in the background. Instead of waiting for each individual call to complete, you prepare your inputs, submit them in bulk, and retrieve the results when they’re ready. This approach cuts costs, reduces processing time, and eliminates the constant back-and-forth with the API. Batch processing is especially useful when: 

● Running evaluations, analysis, or content generations on large datasets 

● Working on tasks where immediate responses aren’t required 

● Automating LLM workflows and optimizing cost efficiency at scale 

Utilizing batch APIs also has several advantages: 

● Asynchronous handling: frees up computer resources 

● Parallelism: processes large volumes faster than synchronous, sequential calls 

● Bulk submission: sends many prompts in one request 

● Cost reduction: often cuts API costs by 50% compared to synchronous calls 

● Scalability: well-suited for high-volume or automated pipelines 

Despite their advantages, batch APIs have some limitations. Not all LLM models support batch processing, and there is often a delay before results are ready: small jobs may finish within a few minutes, while larger batches can take up to 24 hours. For example, a batch of 200 requests might complete in around 10-20 minutes, while larger batches take longer. Batch APIs are also not suitable for real-time or interactive use cases.
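For context, here is a minimal sketch of what submitting a batch directly to a single provider looks like, using OpenAI's Batch API (file names are illustrative; the unified pipeline described below wraps this flow and its equivalents for the other providers):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a .jsonl file where each line is one chat-completion request
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Create the batch job against the chat completions endpoint
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # keep this ID to poll the status and download results later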

Building a Unified Batch Inference API Pipeline

While the idea of batch processing is straightforward, the way it’s implemented varies significantly across providers. OpenAI and Azure use file-based submission formats (requiring .jsonl input files), while Claude (Anthropic) expects direct request objects. They also differ in how they track job status and return results. Because of these inconsistencies, developers often end up writing and maintaining separate logic for each provider, even when performing the same type of task. 

That’s where our unified batch pipeline comes in: it handles provider-specific differences internally, from batch submission to result retrieval, so you only need to interact with one consistent interface, regardless of the backend (i.e., provider) you use. 

Initial Development Phase

In the early stages of development, the pipeline was structured around separate classes for each API provider. While this made the differences between providers explicit, the code quickly became repetitive and difficult to maintain, especially since the overall logic is the same across providers apart from a few provider-specific methods.

To simplify this, we restructured the system around a single batch API handler class, BatchAPIHandler, using internal (private) methods to isolate provider-specific logic and command-line arguments to configure the system. This keeps all public, user-level interactions consistent, while internal details, such as how each provider handles batch submission, polling, or result retrieval, are handled behind the scenes. As a result, the user only interacts with one main entry point and doesn't have to worry about which provider is being used or the low-level implementation.

Pipeline Overview

The batch pipeline follows this structure: 

→ user prompt + system prompt 

→ provider-specific batch request conversion 

→ batch submission 

→ polling to check batch status 

→ results download 

Each of these steps is modularized through internal helper functions to avoid repetitive code and to support future extensions and modifications.
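In code, these stages map onto a handful of handler calls. The following condensed sketch uses the method calls shown in the steps below; the constructor is hypothetical, and the exact signatures may differ slightly from the repository:

# Condensed sketch of the full flow (the constructor shown here is hypothetical)
handler = BatchAPIHandler(provider="openai")
batch_requests = handler.convert_prompt_requests(user_questions, model_name, max_tokens, temperature)
batch_id = handler.submit_batch(batch_requests, batch_info_dir, args_dict)
batch = handler.poll_batch_status(batch_id, poll_waiting_interval, max_attempts)
handler.download_results(batch, batch_results_dir, result_filename)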

Configuration Using Argument Parsers

The system is designed to be configured entirely via command-line arguments. Below are the arguments supported by the pipeline.

Argument | Description
--user_input_file | required=True, help="Path to user input file containing user questions. Supported extensions: jsonl, txt."
--batch_api_provider | default="openai", choices=["openai", "claude", "azure"], help="Which batch API provider to use."
--api_key | default=None, help="If the API key is not provided, it'll be read from the environment variable."
--model_name | default="gpt-3.5-turbo-0125", help="Name of the model to use. Should match the provider."
--system_prompt | type=str, default="", help="System prompt to pass to the model. If not provided, it defaults to an empty string."
--user_input_key | required=False, help="Key to select the user prompt field from the jsonl file."
--temperature | type=float, default=0.3, help="Temperature to be used for the model."
--max_tokens | type=int, help="Maximum number of tokens."
--wait | action="store_true", help="Wait for the batch to finish and download results. If omitted, you can check later using an external script."
--poll_waiting_interval | type=int, default=60, help="Time (in seconds) to wait between each polling attempt."
--max_attempts | type=int, default=None, help="Maximum number of polling attempts before exiting. If not set, polling continues until the batch finishes."
--override_converted_batch_input | action="store_true", help="If set, will overwrite existing converted batch input files. If not set and the file exists, the script will raise an error."
--converted_batch_input_dir | default="batch_api_merged_general/converted_batch_input", help="Directory to save converted batch input files."
--batch_results_dir | default="batch_api_merged_general/batch_results", help="Directory to save batch results."
--batch_info_dir | default="batch_api_merged_general/batch_info", help="Directory to save submitted batch info."
--converted_txt_to_jsonl_batch_input_dir | default="batch_api_merged_general/converted_txt_to_jsonl_batch_input_dir", help="Directory to save converted txt user batch input as jsonl."

API Provider Support

The pipeline supports batch APIs from OpenAI, Claude (Anthropic), and Azure OpenAI. Users can switch between providers and models using the --batch_api_provider and --model_name command-line arguments, without needing to adjust the code.
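For example, a run against OpenAI might look like the following (the entry-point script name, run_batch_pipeline.py, is illustrative; use the script name from the repository):

python run_batch_pipeline.py \
    --user_input_file prompts.jsonl \
    --batch_api_provider openai \
    --model_name gpt-3.5-turbo-0125 \
    --system_prompt "You are a helpful assistant." \
    --max_tokens 512 \
    --wait

Switching to Claude is then just a matter of changing --batch_api_provider to claude and picking a matching --model_name.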

Installation Requirements

To run the pipeline, make sure you have Python 3.8 or higher and install the required dependencies: 

● pip install openai anthropic requests python-dotenv 

Then, set up a .env file to store your API keys: 

● OPENAI_API_KEY=your_openai_key 

● AZURE_OPENAI_API_KEY=your_azure_key 

● AZURE_OPENAI_API_VERSION=your_azure_version 

● AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com 

● ANTHROPIC_API_KEY=your_claude_key 

● ANTHROPIC_API_VERSION=your_claude_version 

Alternatively, you can pass your API key using the --api_key CLI argument. If it is not provided, the system will automatically read it from the corresponding environment variable. 
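Internally, the key lookup can be as simple as the following sketch (assuming python-dotenv and the environment-variable names listed above; the helper name is illustrative):

import os
from dotenv import load_dotenv

load_dotenv()  # pull keys from the .env file into the environment

def resolve_api_key(cli_key, provider):
    # Prefer the key passed via --api_key; otherwise fall back to the
    # provider's environment variable loaded from the .env file above.
    env_vars = {
        "openai": "OPENAI_API_KEY",
        "azure": "AZURE_OPENAI_API_KEY",
        "claude": "ANTHROPIC_API_KEY",
    }
    return cli_key or os.environ[env_vars[provider]]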

Pipeline Steps

Step 1: Input Preparation

The first step in the pipeline is preparing the input files that will be sent to the model. This involves structuring your user prompts, and optionally a system prompt, in a format that can be dynamically templated and used across different providers. 

The pipeline supports two types of user input files, passed using the --user_input_file command-line argument: 

● A .jsonl file where each line is a JSON object containing a user prompt. If there are multiple fields in each object, you can specify which key holds the user prompt using the --user_input_key argument; otherwise, the system uses the first value in the dictionary by default (see the example after this list). 

● A .txt file where each line is treated as a separate user prompt. The file is automatically converted to .jsonl internally using the convert_txt_to_jsonl() function, because OpenAI and Azure OpenAI expect batch input in .jsonl format. The converted file is saved to the directory specified by --converted_txt_to_jsonl_batch_input_dir.
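For illustration, a hypothetical two-line .jsonl input where the prompt lives under a "question" key (so you would pass --user_input_key question) could look like this:

{"id": 1, "question": "Summarize the main findings of the following abstract."}
{"id": 2, "question": "List three risks of deploying LLMs without evaluation."}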

Regardless of the input file type, all prompts are converted into a unified message format expected by most LLM APIs. This format includes: 

● A system message (containing the system prompt), which defaults to an empty string or can be passed directly via the --system_prompt argument 

● A user message (containing the user prompt) 

The final result is a templated input that looks like the following:

"templated_input": [{"role": "system", "content": "system prompt..."},{"role": "user", "content": "user prompt..."}]

Step 2: Request Conversion

Once the inputs are templated, the next step is to convert them into the appropriate request format required by the selected provider. This is handled internally by the convert_prompt_requests() method in the BatchAPIHandler class. Users don't need to write provider-specific formatting logic; they simply call the convert_prompt_requests() method.

For OpenAI and Azure, the method generates .jsonl requests with the expected endpoint and structure. For Claude (Anthropic), it generates request objects directly. Note that the maximum number of tokens and the temperature can be specified using the CLI arguments --max_tokens and --temperature.

batch_user_questions = handler.convert_prompt_requests(user_questions, args.model_name, args.max_tokens, args.temperature)

The converted requests are then saved into a .jsonl file in the converted_batch_input_dir, which is then used to create and submit the batch to the API.
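For reference, a single converted request typically looks like the following for each provider family (field values are illustrative; Azure follows the same line format as OpenAI, with the model deployment name in the model field):

OpenAI (.jsonl line):
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "system prompt..."}, {"role": "user", "content": "user prompt..."}], "max_tokens": 512, "temperature": 0.3}}

Claude (Anthropic) request object:
{"custom_id": "request-1", "params": {"model": "claude-3-haiku-20240307", "max_tokens": 512, "temperature": 0.3, "system": "system prompt...", "messages": [{"role": "user", "content": "user prompt..."}]}}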

Step 3: Batch Submission

Once the requests are prepared, you can submit the batch using a single method call.

batch_id = handler.submit_batch(batch_requests_or_converted_file_path, args.batch_info_dir, args_dict)

This function handles all the provider-specific submission logic internally. If you’re using OpenAI or Azure, the method uploads the batch input file, generates an input file ID, and submits the batch. For Claude (Anthropic), which expects request objects instead, the method sends the list of request objects directly. 

The same function also saves metadata, such as the batch info, batch ID and the arguments used, to the specified directory, so you can easily track and retrieve results later. No matter which provider you’re using, submit_batch() keeps the interaction consistent. You don’t have to write any separate logic or handle special cases. 
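Under the hood, the Claude path boils down to the Anthropic Message Batches call sketched below, while the OpenAI and Azure paths follow the upload-then-create flow shown earlier; this is a sketch of the idea, not the repository's exact code:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Claude takes the request objects directly -- no input file is uploaded.
batch = client.messages.batches.create(requests=claude_request_objects)
print(batch.id)  # an ID starting with msgbatch_, saved with the batch info metadata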

Step 4: Polling and Retrieving Results

Once the batch is submitted, you can either wait for it to finish automatically by adding the --wait flag or check its progress later using the standalone external script. 

If the --wait flag is specified, the system uses the poll_batch_status() method to periodically check the status of the batch job. 

The pipeline lets you track whether the batch has finished processing, how many requests completed, and how many failed. It also supports a configurable polling interval (--poll_waiting_interval) and a maximum number of attempts (--max_attempts) to avoid waiting indefinitely.

For example:

Status: in_progress | Completed: 420/500 | Failed: 3

The polling logic is handled internally depending on the provider. Once the batch reaches a final state (e.g., ended or completed), the pipeline automatically downloads the results using the download_results() method. Each provider exposes results in a slightly different way, so the pipeline handles this internally and saves the results to the directory specified by --batch_results_dir.

if args.wait:
    batch = handler.poll_batch_status(batch_id, args.poll_waiting_interval, args.max_attempts)
    if batch:
        if (args.batch_api_provider in ["openai", "azure"] and batch.status == "completed") or \
           (args.batch_api_provider == "claude" and getattr(batch, "processing_status", None) == "ended"):
            result_filename = f"results_{batched_file_name}_[{batch_id}].jsonl"
            handler.download_results(batch, args.batch_results_dir, result_filename)
        else:
            logger.error("Error downloading results. Please check manually.")
else:
    logger.info(f"{args.batch_api_provider} batch [{batch_id}] submitted successfully. Waiting for results is disabled. Please check manually.")

If the --wait flag is omitted, monitoring is disabled and you can check the status of the submitted batch and download the results later, using the batch ID and the external script. This is especially useful if: 

● You submitted a large batch and you don’t want to wait 

● You want to track the batch progress later without re-running the entire pipeline 

The external script supports: 

● Inferring the provider based on the batch ID (e.g., msgbatch_* is Claude) 

● Polling for completion if --wait is passed 

● Downloading results into a .jsonl file upon completion and saving them to --batch_results_dir 

To use the script, you must pass the batch ID using the --batch_id argument so that the status of the submitted batch can be tracked. The script can infer the provider from the batch ID, but for OpenAI and Azure, which share the same batch ID format, you should specify the provider using the --batch_api_provider argument; otherwise, the script will raise an error. You must also pass the --user_input_file name so that results can be saved under a matching filename, making the output files easier to track. 

The script can be configured via the following command-line arguments: 

Argument | Description
--batch_id | required=True, help="ID of the submitted batch to check."
--user_input_file | required=True, help="Name of the user input file (used for the result filename)."
--batch_api_provider | required=False, choices=["openai", "claude", "azure"], help="Which batch API provider to use."
--wait | action="store_true", help="Wait for the batch to finish."
--poll_waiting_interval | type=int, default=60, help="Time (in seconds) to wait between each polling attempt."
--max_attempts | type=int, default=None, help="Maximum number of polling attempts before exiting. If not set, polling continues until the batch finishes."
--batch_results_dir | default="batch_api/batch_results", help="Where to save results if available."
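For example, checking on a previously submitted Claude batch and waiting for it to finish might look like this (the script name check_batch_status.py and the batch ID are illustrative):

python check_batch_status.py \
    --batch_id msgbatch_0123abcd \
    --user_input_file prompts.jsonl \
    --wait \
    --batch_results_dir batch_api/batch_results

Since the ID starts with msgbatch_, the provider is inferred as Claude; for OpenAI or Azure batches you would add --batch_api_provider explicitly.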

Conclusion

Managing large-scale LLM workloads can be time-consuming, costly, and inconsistent, especially when working across different API providers with varying formats and behaviors. This unified batch inference pipeline tackles that challenge by providing a single, unified system that abstracts away the differences between OpenAI, Azure OpenAI, and Claude. 

Instead of writing separate code for each provider, you can prepare your inputs once, run your jobs in bulk, and retrieve results, all while benefiting from cost savings, automation, and consistency. The pipeline is fully configurable through command-line arguments, making it easy to adapt to your needs without modifying the core logic. 

Whether you’re evaluating prompts, generating data, or running large experiments, this system helps streamline your process and scale efficiently without the overhead of managing provider-specific details.
