SIGGRAPH 2026
TL;DR: We introduce Prox-E, a framework performing fine-grained 3D shape editing by abstracting 3D shapes into a primitive-based proxy shape, editing that shape using a VLM, and using the edited proxy shape to drive a 3D diffusion model.
Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object’s overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision–language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
We present results on Edit3D-Bench, introduced in VoxHammer. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.
make the tail longer
make it wear a red hat
make the pot cubic
make it into a windmill
We present results on the ShapeTalk benchmark, from ShapeTalk / ChangeIt3D. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.
the legs are spaced further apart
the base has more layers
the shade is bigger
the top is more narrow
there is no stretcher connecting the front legs
the table has a sub-table underneath
Modern 2D editors are remarkably good at changing appearance or inserting new semantic content, but fine-grained structural edits pose a different challenge. These edits require reasoning about the geometry of existing parts and how they should change in 3D, often in explicitly metric terms, such as making the legs 1.5x longer or lowering the seat by a specific amount, rather than just synthesizing plausible pixels. This creates a gap between what image-based editors are optimized for and what controllable 3D editing actually demands.
Prox-E turns fine-grained text instructions into controlled 3D edits by first abstracting a shape into simple geometric primitives, editing that abstraction with a VLM, and then using it to guide 3D generation and appearance refinement.
Primitive-based abstraction. Given an input 3D shape and an editing prompt, we first convert the shape into a compact proxy made of superquadrics. This proxy provides a simple, structured representation that is easier to edit than the raw 3D geometry itself.
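To make the primitive type concrete, here is a minimal sketch of the standard superquadric inside-outside function, evaluated in a primitive's local frame. The function name and argument layout are illustrative assumptions, not the Prox-E code; only the formula itself is the textbook definition.

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps):
    """Standard superquadric inside-outside function (illustrative helper).

    points: (N, 3) array of points in the primitive's local frame.
    scale:  (a1, a2, a3) semi-axis lengths.
    eps:    (e1, e2) shape exponents; (1, 1) gives an ellipsoid,
            values near 0 give a box-like shape.
    Returns F per point: F < 1 inside, F == 1 on the surface, F > 1 outside.
    """
    # The absolute value keeps the fractional powers real for negative coordinates.
    p = np.abs(np.asarray(points, dtype=float)) / np.asarray(scale, dtype=float)
    e1, e2 = eps
    xy = (p[:, 0] ** (2.0 / e2) + p[:, 1] ** (2.0 / e2)) ** (e2 / e1)
    return xy + p[:, 2] ** (2.0 / e1)

# The same point is outside a unit sphere but inside a box-like unit superquadric.
corner = np.array([[0.9, 0.9, 0.9]])
f_sphere = superquadric_inside_outside(corner, (1.0, 1.0, 1.0), (1.0, 1.0))
f_box = superquadric_inside_outside(corner, (1.0, 1.0, 1.0), (0.1, 0.1))
```

With only two exponents and three scales per primitive (plus a pose), a whole shape reduces to a short list of numbers, which is what makes the proxy easy to edit.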
VLM-driven proxy editing. We render multi-view images of the proxy and provide them to a VLM together with an image of the original shape, the editing prompt, and the proxy in JSON format. The VLM then updates the primitive parameters directly, producing an edited proxy JSON that reflects the requested structural change.
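As a rough illustration of what a proxy JSON and a primitive-level edit might look like, the sketch below compares a proxy before and after editing and labels each primitive as unchanged, edited, or new. The schema (`id`, `scale`, `eps`, `rotation`, `translation`), the example values, and the `classify_primitives` helper are all hypothetical; the actual Prox-E JSON format is not specified here.

```python
import json
import math

# Hypothetical proxy schema: one entry per superquadric primitive.
ORIGINAL_PROXY = json.loads("""
[
  {"id": "seat",  "scale": [0.40, 0.40, 0.05], "eps": [0.2, 0.2],
   "rotation": [0, 0, 0], "translation": [0.0, 0.0, 0.50]},
  {"id": "leg_0", "scale": [0.05, 0.05, 0.25], "eps": [0.2, 1.0],
   "rotation": [0, 0, 0], "translation": [0.3, 0.3, 0.25]}
]
""")

# What an edited proxy could look like: leg_0 is 1.5x longer, armrest_0 is added.
EDITED_PROXY = json.loads("""
[
  {"id": "seat",  "scale": [0.40, 0.40, 0.05], "eps": [0.2, 0.2],
   "rotation": [0, 0, 0], "translation": [0.0, 0.0, 0.50]},
  {"id": "leg_0", "scale": [0.05, 0.05, 0.375], "eps": [0.2, 1.0],
   "rotation": [0, 0, 0], "translation": [0.3, 0.3, 0.125]},
  {"id": "armrest_0", "scale": [0.05, 0.30, 0.03], "eps": [0.2, 0.2],
   "rotation": [0, 0, 0], "translation": [0.4, 0.0, 0.75]}
]
""")

FIELDS = ("scale", "eps", "rotation", "translation")

def same_params(a, b, tol=1e-6):
    """True if two primitives agree on every numeric field within tol."""
    return all(
        len(a[f]) == len(b[f])
        and all(math.isclose(x, y, abs_tol=tol) for x, y in zip(a[f], b[f]))
        for f in FIELDS
    )

def classify_primitives(original, edited, tol=1e-6):
    """Label each primitive of the edited proxy as unchanged, edited, or new."""
    by_id = {prim["id"]: prim for prim in original}
    labels = {}
    for prim in edited:
        before = by_id.get(prim["id"])
        if before is None:
            labels[prim["id"]] = "new"
        elif same_params(before, prim, tol):
            labels[prim["id"]] = "unchanged"
        else:
            labels[prim["id"]] = "edited"
    return labels

labels = classify_primitives(ORIGINAL_PROXY, EDITED_PROXY)
```

Matching primitives by a stable `id` is a simplifying assumption of this sketch; it keeps the comparison independent of the order in which the VLM writes the primitives back.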
Iterative verification and correction. This editing process can be repeated: after obtaining an edited proxy, we render it again and feed the new views back to the VLM, allowing it to verify the edit or make further corrections when needed.
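The control flow of this loop can be sketched as follows. Everything here is an assumption made for illustration: `render_views` and `query_vlm` are hypothetical callables standing in for the renderer and the VLM, and treating "the VLM returns the proxy untouched" as the verification signal is one possible stopping rule, not necessarily the one Prox-E uses.

```python
def edit_proxy_iteratively(proxy, prompt, render_views, query_vlm, max_rounds=3):
    """Repeatedly render the proxy and ask the VLM to edit or confirm it.

    render_views(proxy) -> views           (hypothetical renderer)
    query_vlm(proxy, views, prompt) -> proxy (hypothetical VLM wrapper)
    Returns the final proxy and the number of rounds that changed it.
    """
    current = proxy
    for round_idx in range(max_rounds):
        views = render_views(current)
        proposal = query_vlm(current, views, prompt)
        if proposal == current:
            # The VLM made no further changes: treat the edit as verified.
            return current, round_idx
        current = proposal
    return current, max_rounds

# Stub components so the loop can be run without a renderer or a VLM.
def fake_render(proxy):
    return ["view of %r" % (proxy,)]

def fake_vlm(proxy, views, prompt):
    # Pretend the VLM lengthens the leg in small steps until it reaches 1.5.
    if proxy["leg_length"] < 1.5:
        return {"leg_length": proxy["leg_length"] + 0.25}
    return proxy

final_proxy, rounds_used = edit_proxy_iteratively(
    {"leg_length": 1.0}, "make the legs 1.5x longer", fake_render, fake_vlm
)
```

Capping the number of rounds bounds the cost of the loop when the VLM keeps proposing changes.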
Proxy-guided structure generation. Once the edited proxy is ready, we use it to guide TRELLIS during sparse-structure generation. To do this, we classify primitives into three categories and inject a different source of information for each one.
Unchanged regions preserve the original shape. Primitives that remain unchanged mark parts of the object that should stay intact. In these regions, we inject inverted latents from the original shape to preserve its structure and identity.
New regions follow the edited proxy. Newly added primitives indicate regions where new structure should emerge. There, we inject the edited proxy, but only for a limited number of denoising steps, so the model can move beyond the coarse primitive geometry and generate a natural result.
Edited regions are warped from the original. For primitives whose pose or scale changed, we estimate the transformation from the original primitive to the edited one, warp the original shape accordingly, and inject this warped structure. This helps preserve fine geometric details while relocating them to their new position.
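For the edited-region case, one simple way to realize such a warp is an affine map between the two primitive frames: move points into the original primitive's local frame, rescale them by the ratio of the scales, and place them in the edited primitive's frame. The sketch below assumes each primitive carries a rotation matrix, a translation, and per-axis scales. The dictionary layout and function names are hypothetical, changes to the superquadric exponents are ignored, and the transformation Prox-E actually estimates may differ.

```python
import numpy as np

def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def warp_points(points, src, dst):
    """Map points attached to primitive `src` onto primitive `dst`.

    Each primitive is a dict with 'rotation' (3x3 matrix), 'translation' (3,)
    and 'scale' (3,). This layout is an assumption of the sketch.
    """
    pts = np.asarray(points, dtype=float)
    src_r = np.asarray(src["rotation"], dtype=float)
    dst_r = np.asarray(dst["rotation"], dtype=float)
    # World -> source local frame (row vectors, so right-multiplying applies R^T).
    local = (pts - np.asarray(src["translation"], dtype=float)) @ src_r
    # Rescale per axis from the source extents to the destination extents.
    local = local / np.asarray(src["scale"], dtype=float) * np.asarray(dst["scale"], dtype=float)
    # Destination local frame -> world.
    return local @ dst_r.T + np.asarray(dst["translation"], dtype=float)

# Example: a leg made 1.5x longer while its top stays attached at z = 0.5.
leg_before = {"rotation": np.eye(3), "translation": [0.3, 0.3, 0.25], "scale": [0.05, 0.05, 0.25]}
leg_after = {"rotation": np.eye(3), "translation": [0.3, 0.3, 0.125], "scale": [0.05, 0.05, 0.375]}
top_and_bottom = np.array([[0.3, 0.3, 0.5], [0.3, 0.3, 0.0]])
warped = warp_points(top_and_bottom, leg_before, leg_after)
```

In this example the top of the leg stays where it was and the bottom moves down, so geometric detail attached to the leg is stretched with it rather than regenerated.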
Appearance refinement for final details. After generating the edited sparse structure, we refine appearance and fine details using TRELLIS's appearance branch together with a strong 2D image editor. This lets us introduce high-quality appearance changes while still preserving the original object where needed.
@misc{sella2026proxe,
title = {Prox-E: Fine-Grained {3D} Shape Editing via
Primitive-Based Abstractions},
author = {Sella, Etai and Phung, Hao and Amiel, Nitay and
Litany, Or and Patashnik, Or and Averbuch-Elor, Hadar},
year = {2026},
note = {arXiv and venue fields to be added}
}
Acknowledgements will be updated for the camera-ready version.