Sandboxing LLM-generated code execution
Pigment provides a central AI platform to organizations for real-time business planning. Pigment AI is based on an agentic architecture that is described in more detail in this blog post.
One of our agents, the Analyst, already had multiple tools to perform simple calculations, such as contribution and variance analysis. In order to add more capabilities, we decided to leverage the code generation feature of LLMs rather than creating a dedicated tool for each capability.
LLM-generated code cannot be trusted by default. It is produced from user-controlled input, which means users may intentionally or accidentally steer the model toward unsafe behavior: reading sensitive data, calling internal services, exfiltrating data, pivoting into internal infrastructure, or exhausting compute resources. From a security perspective, the generated code has to be handled as an untrusted workload.
That requirement led us to build a sandboxed execution environment. In this article, we explain how we went from the initial risk analysis to a proof of concept, and eventually to a production-ready sandbox with support for large datasets.
Constraints
The code execution solution had to respect the following constraints:
- Execute Python code that was generated in a previous step of an agent execution.
- Allow data to be passed to the code being executed.
- Execute the code in a sandboxed environment.
- The solution must work in the local stack (Docker) as well as in staging and production (Kubernetes).
- The impact on performance should be minimal.
- We do not want to introduce a new provider, to avoid having to check it first for price, security, and compliance.
Proving the concept
At the start of the project, we were not sure that code execution was a viable solution. We needed to give the Product team a proof of concept, with minimal investment, so they could check whether this approach was relevant to our problems.
For the POC, we chose to rely on the llm-sandbox library. It supports the two backends we required (Docker and Kubernetes), manages Python code execution including pip package installation, and provides built-in security features such as package restrictions and code vetting. The library also passed our security review for external dependencies.
The main limitation of llm-sandbox was the lack of support for asynchronous operations. This required encapsulating the calls to the library in asyncio.to_thread and handling the session life cycle manually, as in this example:
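import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def _get_sandbox_session():
    # ... [snipped] create the session depending on the backend...
    # session.open is blocking, so run it in a worker thread.
    await asyncio.to_thread(session.open)
    try:
        yield session
    finally:
        # Always release the container, even if the caller raised.
        await asyncio.to_thread(session.close)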
Security for the proof of concept
The experimentation was done on our staging platform, which provides a production-like environment with a deliberately limited blast radius. It is a completely isolated environment with no production data; its access is restricted to Pigment employees, and breaking it during a deployment or a test is an annoyance, not a disaster.
We relied on two security features of llm-sandbox to implement a security baseline for experimentation in the staging environment.
First, as the library performs the pip package installation in the Python environment in the sandbox, we were able to easily whitelist the packages we needed. In our case, we allowed numpy and pandas.
Second, llm-sandbox provides a regex-based code-vetting feature through security policies. It can generate language-specific regular expressions for common cases (e.g. detecting module usage), and we can also provide our own regular expressions (for instance, to forbid imports or access to built-ins). The llm-sandbox documentation proposes a handy set of policies to start with. A pattern in a security policy looks like:
SecurityPattern(
    pattern=r"(?m)^\s*(import\s+\w|from\s+\w+\s+import)",
    description="Import statements are not allowed",
    severity=SecurityIssueSeverity.HIGH,
)
The security policies were advisory: our tool code had to explicitly run the policy check. Each policy could carry a description that was returned when a check failed. This was convenient, as we could return a message to the agent that called the tool, indicating that the code was unsafe and why, which allowed the LLM to generate new code that fit the restrictions.
The specific case of imports is worth detailing. First, imports were prevented by the policies, but we needed to import pandas and numpy. To do so, we ran the security checks on the generated code alone, then prepended a prologue containing the imports. Second, the security checks required the sandbox session to be started, which meant that the container had to be created first, which took time and used resources. We quickly noticed that, no matter how the LLM was prompted, it tended to add imports to the generated code, so we were wasting time going back and forth between the LLM and the sandbox, creating and deleting a container each time. Adding a simple regex check for imports before creating the container avoided squandering these resources.
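The early import check and the prologue step can be sketched with the standard library alone. This is a minimal illustration, not our production code: the function names, the prologue content, and the feedback message are hypothetical, and only the regular expression is taken from the policy shown earlier.

```python
import re
from typing import Optional

# Same pattern as the "Import statements are not allowed" policy shown earlier.
IMPORT_PATTERN = re.compile(r"(?m)^\s*(import\s+\w|from\s+\w+\s+import)")

# Hypothetical trusted prologue, added only after the checks have passed.
PROLOGUE = "import numpy as np\nimport pandas as pd\n"


def precheck_imports(code: str) -> Optional[str]:
    """Return a feedback message for the agent if an import is found, else None.

    This runs before any container is created, so a rejection costs nothing.
    """
    match = IMPORT_PATTERN.search(code)
    if match is None:
        return None
    # Group 1 starts at the import keyword itself, not at preceding blank lines.
    line_number = code.count("\n", 0, match.start(1)) + 1
    return (
        f"Import statements are not allowed (line {line_number}). "
        "numpy (np) and pandas (pd) are already available."
    )


def build_script(code: str) -> str:
    """Prepend the trusted prologue to code that has already been vetted."""
    return PROLOGUE + code


generated = "import pandas as pd\nprint(df1.head())"
feedback = precheck_imports(generated)
if feedback is not None:
    print(feedback)  # sent back to the LLM instead of starting a sandbox
```

Returning the message rather than raising keeps the retry loop simple: the agent receives the reason for the rejection and can regenerate the code without a container ever being scheduled.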
That said, this code vetting should be treated as a defense-in-depth guardrail, not as the security boundary itself. The problem is inherently difficult: llm-sandbox security policies rely on regex/pattern-based pre-execution checks rather than AST-level analysis, so they cannot reason precisely about every code path, alias, dynamic import, or indirect access pattern. In practice, these rules require continuous hardening as new bypass patterns are identified. For example, allowing a broad library such as pandas may still make lower-level standard-library capabilities reachable indirectly through its internal modules, such as pandas.io.common, which imports modules including os. This may be acceptable for a POC running in a restricted staging environment, but it is far from sufficient as a standalone production security boundary.
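To make the limitation concrete, here is a small standard-library sketch that runs the import pattern shown earlier against a few snippets and compares it with a basic AST walk. The snippets and helper names are illustrative assumptions, the code is only parsed and never executed, and the AST check is not something llm-sandbox provides.

```python
import ast
import re

# The import pattern from the security policy shown earlier.
IMPORT_PATTERN = re.compile(r"(?m)^\s*(import\s+\w|from\s+\w+\s+import)")

# Illustrative snippets; none of them is executed.
SNIPPETS = {
    "plain import": "import os",
    "dotted from-import": "from os.path import join",
    "import after semicolon": "x = 1; import os",
    "dynamic import": "os = __import__('os')",
    "indirect access": "pd.io.common.os.getcwd()",
}


def regex_flags(code: str) -> bool:
    """True if the regex-based policy would reject the code."""
    return IMPORT_PATTERN.search(code) is not None


def ast_flags(code: str) -> bool:
    """True if the parsed code contains an import statement anywhere."""
    tree = ast.parse(code)
    return any(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
    )


for name, code in SNIPPETS.items():
    print(f"{name:24} regex={regex_flags(code)!s:5} ast={ast_flags(code)!s:5}")
```

With this pattern, the regex flags only the plain import. The AST walk also catches the dotted and semicolon forms, but neither check sees the dynamic import or the access through pandas internals, which is why the boundary has to sit at the infrastructure level.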
Inputs and outputs
Once we had a sandbox with basic security in place, we needed to pass data to this sandbox.
First, we provided a tool to the agent that downloaded the requested data and stored it into a pandas dataframe. This dataframe lived in the execution state of the agent graph and was therefore limited to a few thousand rows of data, which was enough for the POC.
Second, once the sandbox session was created, we saved the pandas dataframe to a CSV file that was then pushed to the sandbox container, thanks to a helper provided by llm-sandbox.
Third, we updated the code prologue (that we added for imports) to load the CSV data into a pandas dataframe. The LLM prompt that was called to generate the code included the name of the variable that contained the pandas dataframe. Later, we added support for multiple data sources (multiple dataframes, multiple CSV files).
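A prologue of this kind can be generated from a mapping of variable names to CSV paths. The sketch below is an assumption about its shape rather than our actual implementation: the function name and the /sandbox/data paths are hypothetical, and the generated text only runs inside the sandbox, where pandas is installed.

```python
import keyword
from typing import Mapping


def build_prologue(csv_paths: Mapping[str, str]) -> str:
    """Build the trusted prologue: imports plus one read_csv call per data source."""
    lines = ["import numpy as np", "import pandas as pd"]
    for variable, path in csv_paths.items():
        # The variable name is interpolated into code, so validate it strictly.
        if not variable.isidentifier() or keyword.iskeyword(variable):
            raise ValueError(f"Invalid variable name: {variable!r}")
        # repr() turns the path into a correctly quoted Python string literal.
        lines.append(f"{variable} = pd.read_csv({path!r})")
    return "\n".join(lines) + "\n"


# Hypothetical paths inside the sandbox container.
prologue = build_prologue(
    {"df1": "/sandbox/data/df1.csv", "df2": "/sandbox/data/df2.csv"}
)
print(prologue)
```

The same variable names (df1, df2) are the ones given to the LLM in the prompt, so the generated code can refer to them directly.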
Last, the LLM prompt also asked the model to print the results and nothing else. The output of the execution was simply what was written to stdout. Here is an example of generated code (note the lack of imports at the start):
df = df1.copy()
# Top 3 highest population values across all Country-Year combinations
top3 = df.sort_values('Population', ascending=False).head(3).reset_index(drop=True)
total_population = df['Population'].sum()
top3_total = top3['Population'].sum()
share_top3 = top3_total / total_population * 100
print('Top 3 population values (Country-Year):')
print(top3)
print('\nTotal Population (all rows):', total_population)
print('Sum of Top 3:', top3_total)
print('Top 3 share of total:', share_top3)
Production-ready sandbox
The POC was a real success, and it even proved that the upcoming agent release would not be viable without code execution. At that moment, we only had a few weeks to go from a POC to a production-ready solution. The two main missing points were adding in-depth security and finding solutions to handle the expected heavy load.
We considered replacing llm-sandbox with other solutions or approaches (OS-based virtualization, external hosting including sandboxes from LLM providers...). None gave us the level of control we required.
In-depth security
LLM-generated code could not be treated as trusted, especially because it was derived from user-provided prompts. The regex-based code vetting provided by llm-sandbox was useful as an early rejection mechanism for obviously unsafe or unsupported code, but it was not a complete security boundary. The security boundary had to be enforced at the infrastructure layer instead, through runtime isolation, network restrictions, constrained permissions, and dedicated execution environments.
We put some strict restrictions on the workloads that were running in production. Among them:
- Runs on dedicated and isolated workers.
- Kubernetes logical isolation (namespace).
- Data isolation between executions: This ensured that we did not risk leaking data between executions.
- No communication between execution workloads.
- No network access, be it to the internal network or to the internet.
- Resistance to local privilege escalation through a restricted pod security standard.
- Enforcing resource limits.
- No write access to the root file system.
- No credentials shall be mounted.
- Credential-less, permission-less workloads.
- Minimal, distroless image.
- Resistance to kernel exploits using gVisor.
Some of these constraints were expressed in the manifest of the pod created from the image. The manifest set limits on CPU and memory, made the root file system read-only, and mounted dedicated read-write temporary volumes (to store the code and data, and to provide a temp directory). These volumes were RAM disks that disappeared at the end of the execution.
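As an illustration, such a manifest can be written as a Python dictionary before being handed to the Kubernetes backend. This is a sketch under stated assumptions, not our actual manifest: every name, size, and image reference is a hypothetical placeholder, the gvisor RuntimeClass is assumed to exist in the cluster, and the deny-all NetworkPolicy is one common way to remove network access rather than a description of our setup.

```python
import json

# All names, sizes, and the image reference below are hypothetical placeholders.
sandbox_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "sandbox-example", "namespace": "sandbox"},
    "spec": {
        "runtimeClassName": "gvisor",  # assumes a RuntimeClass with this name
        "restartPolicy": "Never",  # a pod serves a single execution
        "automountServiceAccountToken": False,  # no credentials mounted
        "securityContext": {
            "runAsNonRoot": True,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [
            {
                "name": "sandbox",
                "image": "registry.example.com/sandbox-python:1.0.0",
                "resources": {
                    "requests": {"cpu": "500m", "memory": "512Mi"},
                    "limits": {"cpu": "1", "memory": "1Gi"},
                },
                "securityContext": {
                    "readOnlyRootFilesystem": True,
                    "allowPrivilegeEscalation": False,
                    "capabilities": {"drop": ["ALL"]},
                },
                # The only writable locations are the RAM-backed volumes.
                "volumeMounts": [
                    {"name": "workdir", "mountPath": "/sandbox"},
                    {"name": "tmp", "mountPath": "/tmp"},
                ],
            }
        ],
        "volumes": [
            {"name": "workdir", "emptyDir": {"medium": "Memory", "sizeLimit": "256Mi"}},
            {"name": "tmp", "emptyDir": {"medium": "Memory", "sizeLimit": "64Mi"}},
        ],
    },
}

# Selects every pod in the namespace and allows no ingress or egress traffic.
deny_all_network = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "sandbox-deny-all", "namespace": "sandbox"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

print(json.dumps(sandbox_pod, indent=2))
```

The container-level security context fields are the ones the restricted Pod Security Standard checks for, and the emptyDir volumes with medium set to Memory are the RAM disks that vanish with the pod.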
Data isolation between executions was enforced by the lack of communication between running sandboxes but also by terminating the sandbox pod at the end of the execution: A given pod can only be used for a single execution.
Rather than letting llm-sandbox create its own image, we passed our own image (something llm-sandbox directly supports) that implements some of the security constraints mentioned above.
As mentioned, the deadline was tight, so we prioritized the security requirements by weighing the risk each one covered against its implementation cost. This allowed us to progress incrementally and to reach general availability of the feature with as many of the most important security features as possible. We covered the more straightforward security risks first and added resistance to the most complex exploits last, since those represented a smaller attack surface.
Pre-warming sandbox pods for low-latency LLM agent execution
In the POC, when an LLM agent ran tool calls, we scheduled a fresh isolated pod per execution. This added latency from two sources: cluster autoscaling (30s+ to provision a node) and image pulls on cold nodes (8-12s).
To eliminate this latency for the production release, we set up a placeholder deployment that kept multiple dummy pods permanently running. They looked identical to a real sandbox (same image, same resource reservations), but each placeholder had a lower scheduling priority than real sandboxes. When a real sandbox had to be started, Kubernetes evicted a low-priority dummy pod to make space, which took a few milliseconds. The deployment immediately recreated the evicted placeholder: this was the slow operation, but it now happened in the background. The main caveat of this approach is cost: all the placeholders were actual running pods, billed by the hosting provider.
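The priority mechanism can be sketched with two PriorityClass objects and a placeholder Deployment, written here as Python dictionaries. All names, values, replica counts, and resource sizes are hypothetical; the only property that matters is that the placeholder priority is lower than the sandbox priority and that the placeholder reserves the same resources as a real sandbox.

```python
import json

IMAGE = "registry.example.com/sandbox-python:1.0.0"  # hypothetical image
# Must match the reservations of a real sandbox pod (hypothetical values).
RESOURCES = {
    "requests": {"cpu": "500m", "memory": "512Mi"},
    "limits": {"cpu": "1", "memory": "1Gi"},
}

# Real sandbox pods reference this class through priorityClassName.
sandbox_priority = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "sandbox"},
    "value": 1000,
}

# Placeholders get a lower value, so the scheduler preempts them first.
placeholder_priority = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "sandbox-placeholder"},
    "value": -10,
    "preemptionPolicy": "Never",  # placeholders never evict other pods
}

placeholder_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "sandbox-placeholder", "namespace": "sandbox"},
    "spec": {
        "replicas": 10,  # sized from expected usage, then tuned with monitoring
        "selector": {"matchLabels": {"app": "sandbox-placeholder"}},
        "template": {
            "metadata": {"labels": {"app": "sandbox-placeholder"}},
            "spec": {
                "priorityClassName": "sandbox-placeholder",
                "terminationGracePeriodSeconds": 0,  # free the slot immediately
                "containers": [
                    {"name": "placeholder", "image": IMAGE, "resources": RESOURCES}
                ],
            },
        },
    },
}

print(json.dumps(placeholder_deployment, indent=2))
```

Because the Deployment controller restores the replica count after each eviction, the pool refills itself without any custom code.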
The placeholders also kept the sandbox image cached on every node they touched, but a brand-new node had no placeholder yet. An image-warmer DaemonSet closed that gap: an init container pulled the image, a second init container labeled the node with the image version so sandbox pods could require a warm cache via node selector, and a sleep infinity container held the reference so the garbage collector didn't evict it.
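The three containers of the image warmer can be laid out as follows. This is a rough sketch with many assumptions: the label key, the kubectl image, the service account (which would need permission to label nodes), and the commands are all hypothetical, and it assumes the sandbox image can run the no-op and sleep commands shown.

```python
import json

IMAGE_VERSION = "1.0.0"  # hypothetical
IMAGE = f"registry.example.com/sandbox-python:{IMAGE_VERSION}"  # hypothetical
NODE_LABEL = "example.com/sandbox-image"  # hypothetical label key

image_warmer = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "sandbox-image-warmer", "namespace": "sandbox"},
    "spec": {
        "selector": {"matchLabels": {"app": "sandbox-image-warmer"}},
        "template": {
            "metadata": {"labels": {"app": "sandbox-image-warmer"}},
            "spec": {
                # Assumed to be bound to a role allowed to label nodes.
                "serviceAccountName": "sandbox-image-warmer",
                "initContainers": [
                    # 1. Starting any container from the image forces the pull.
                    {
                        "name": "pull-image",
                        "image": IMAGE,
                        "command": ["python", "-c", "pass"],
                    },
                    # 2. Once the image is cached, advertise it on the node.
                    {
                        "name": "label-node",
                        "image": "registry.example.com/kubectl:stable",
                        "command": [
                            "kubectl", "label", "node", "$(NODE_NAME)",
                            f"{NODE_LABEL}={IMAGE_VERSION}", "--overwrite",
                        ],
                        "env": [
                            {
                                "name": "NODE_NAME",
                                "valueFrom": {"fieldRef": {"fieldPath": "spec.nodeName"}},
                            }
                        ],
                    },
                ],
                # 3. A long-running container keeps the image referenced.
                "containers": [
                    {"name": "hold-image", "image": IMAGE, "command": ["sleep", "infinity"]}
                ],
            },
        },
    },
}

# Added to the sandbox pod spec so it only lands on nodes with a warm cache.
sandbox_node_selector = {NODE_LABEL: IMAGE_VERSION}

print(json.dumps(image_warmer, indent=2))
```

Including the image version in the label value means that rolling out a new image automatically steers sandbox pods toward nodes that have already pulled it.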
With this approach, starting a new sandbox took 1 to 2 seconds on average. This approach only worked if we had enough placeholders in place: We started with a best guess based on the number of expected users and took a wide margin. Setting up monitoring and alerting was key (time to schedule a new sandbox, number of free slots available, etc.), allowing us to tailor the number of placeholder pods to the actual usage.
Supporting large datasets
We actually shipped the first version without support for large datasets: The code execution only had access to a few thousand rows. Surprisingly enough, this covered the majority of use cases, but some of our clients had hundreds of thousands of rows to manipulate, so we decided to add support for them.
The agentic tool (written in Python) was calling another service (written in C#) to fetch the data. We changed the C# service to store the data in Google Cloud Storage instead, and to return an ID to retrieve the data, along with the row count, the column names, and the column types.
We had multiple options to make the data available to the sandbox. Among them:
- Directly fetch data from GCS into the sandbox: This would break the sandbox isolation, giving it access to the internet to reach GCS and, furthermore, giving it access keys to our GCS environments.
- Download the data at the code execution tool level, then push it to the sandbox: This would download the data to upload it again, which seemed inefficient.
- Deploy a small in-cluster service that acts as a proxy between sandbox pods and GCS. The sandbox sends a signed URL to the proxy, which validates that the URL points to an allowed GCS bucket, fetches the data, and streams it back: This would be the most secure approach, clearly putting the sensitive operations into the dedicated proxy, unreachable to the LLM-generated code.
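For the proxy option, the validation step could look like the sketch below, which accepts only path-style HTTPS URLs on the public GCS host that target an allowed bucket. The bucket name and function name are hypothetical, the checks are not exhaustive, and since we did not ship this option the code is purely illustrative.

```python
from urllib.parse import urlsplit

ALLOWED_BUCKETS = frozenset({"example-sandbox-data"})  # hypothetical bucket
GCS_HOST = "storage.googleapis.com"


def is_allowed_gcs_url(url: str, allowed_buckets: frozenset = ALLOWED_BUCKETS) -> bool:
    """Accept only HTTPS path-style GCS URLs pointing at an allowed bucket."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme != "https" or parts.hostname != GCS_HOST:
        return False
    # Reject embedded credentials and non-standard ports.
    if parts.username or parts.password or port not in (None, 443):
        return False
    # Path-style URL: /<bucket>/<object>, with exactly one leading slash.
    if not parts.path.startswith("/"):
        return False
    bucket, _, object_name = parts.path[1:].partition("/")
    if bucket not in allowed_buckets or not object_name:
        return False
    # Refuse dot segments that an HTTP client could normalize to another bucket.
    return not any(segment in (".", "..") for segment in object_name.split("/"))


print(is_allowed_gcs_url(
    "https://storage.googleapis.com/example-sandbox-data/exports/1.csv?X-Goog-Signature=abc"
))
```

Parsing the URL and comparing the hostname exactly, rather than matching a prefix of the string, is what rejects inputs such as a trusted host name placed in the user-info part of the URL.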
Downloading then pushing data was the simplest approach, which made it the least expensive to implement, and it did not weaken the sandbox: the service executing the tools already had GCS credentials, and pushing data to the sandbox was already in place, so we were not adding vulnerabilities. The main concern was the performance of this operation.
We implemented this approach and performed various measurements.
First, we checked the cost of downloading then uploading data to the sandbox:
- We used a dataset with two columns representing a single-dimension metric.
- A dataset with 990K rows is downloaded then uploaded in less than a second.
- The download/upload to the sandbox does not vary much with data size: going from a metric with 500K elements to one with 990K elements adds 0.02 seconds.
Then, we measured the impact of the approach on a small dataset (1000 rows):
- The sandbox execution takes about 0.55s longer with the download/upload step.
- The overall analyst execution takes 0.5s more out of around 25s total (about 2%), making the download/upload the sole contributor to the slowdown.
This performance was perfectly acceptable for a production environment, which showed that the simplest solution was enough.
Conclusion
Going from an idea to a proof of concept to a production-ready sandbox in a few weeks was a demanding challenge, but the incremental approach (prioritizing security risks by impact, shipping the simplest solution that met our constraints, and deferring complexity like large dataset support) proved to be the right strategy. The result is a low-latency, fully isolated code execution environment that has meaningfully expanded the capabilities of the Pigment Analyst Agent.


