Prox-E

Fine-Grained 3D Shape Editing via Primitive-Based Abstractions

Tel Aviv University, Cornell University, Technion - Israel Institute of Technology

SIGGRAPH 2026


TL;DR: We introduce Prox-E, a framework for fine-grained 3D shape editing that abstracts a 3D shape into a primitive-based proxy, edits that proxy with a VLM, and uses the edited proxy to drive a 3D diffusion model.


Abstract

Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object’s overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision–language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.


Results on Edit3D-Bench

We present results on Edit3D-Bench, introduced in VoxHammer. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.

Edit prompts shown:

make the tail longer

make it wear a red hat

make the pot cubic

make it into a windmill

Results on ShapeTalk

We present results on the ShapeTalk benchmark, from ShapeTalk / ChangeIt3D. Each column shows, in order: input, original proxy, edited proxy, and output. In the proxies, unchanged primitives are gray, edited primitives are blue, and new primitives are pink.

Edit prompts shown:

the legs are spaced further apart

the base has more layers

the shade is bigger

the top is more narrow

there is no stretcher connecting the front legs

the table has a sub-table underneath

Motivation


[Figure: 2D editor failures]

Modern 2D editors are remarkably good at changing appearance or inserting new semantic content, but fine-grained structural edits pose a different challenge. These edits require reasoning about the geometry of existing parts and how they should change in 3D, often in explicitly metric terms, such as making the legs 1.5x longer or lowering the seat by a specific amount, rather than just synthesizing plausible pixels. This creates a gap between what image-based editors are optimized for and what controllable 3D editing actually demands.


How does it work?


Prox-E turns fine-grained text instructions into controlled 3D edits by first abstracting a shape into simple geometric primitives, editing that abstraction with a VLM, and then using it to guide 3D generation and appearance refinement.


🧱 Primitive-based abstraction. Given an input 3D shape and an editing prompt, we first convert the shape into a compact proxy made of superquadrics. This proxy provides a simple, structured representation that is easier to edit than the raw 3D geometry itself.
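To make the superquadric proxy more concrete, the sketch below evaluates the standard superquadric inside-outside function for a single axis-aligned primitive. The parameter names (`scale`, `exponents`, `center`) are assumptions made for this illustration, and rotation is left out for brevity; the parameterization used by Prox-E itself may differ.

```python
def superquadric_inside_outside(point, scale, exponents, center=(0.0, 0.0, 0.0)):
    """Standard superquadric inside-outside function F.

    F < 1: the point is inside, F == 1: on the surface, F > 1: outside.
    `scale` = (a1, a2, a3) are the semi-axis lengths and `exponents` = (e1, e2)
    control the roundness (e1 = e2 = 1 gives an ellipsoid, small values a box).
    Rotation is omitted to keep the sketch short (an assumption of this example).
    """
    a1, a2, a3 = scale
    e1, e2 = exponents
    # Move the point into the primitive's local frame and normalize by the scale.
    x = abs((point[0] - center[0]) / a1)
    y = abs((point[1] - center[1]) / a2)
    z = abs((point[2] - center[2]) / a3)
    xy_term = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + z ** (2.0 / e1)


# A box-like primitive (small exponents) with semi-axes 1.0 x 0.5 x 0.25.
box_like = {"scale": (1.0, 0.5, 0.25), "exponents": (0.1, 0.1)}
corner_value = superquadric_inside_outside((0.9, 0.45, 0.2), **box_like)
is_inside = corner_value < 1.0  # a near-corner point stays inside a box-like shape
```

The same near-corner point falls outside an ellipsoid with identical semi-axes, which shows how the two exponents let one primitive family cover both rounded and boxy parts.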


👀 VLM-driven proxy editing. We render multi-view images of the proxy and provide them to a VLM together with an image of the original shape, the editing prompt, and the proxy in JSON format. The VLM then updates the primitive parameters directly, producing an edited proxy JSON that reflects the requested structural change.
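The JSON exchange with the VLM could look roughly like the sketch below, which serializes a proxy and validates an edited reply. The schema is hypothetical: the field names (`id`, `scale`, `exponents`, `translation`, `rotation_euler`), the helper names, and the example reply are all invented for illustration and are not taken from the Prox-E implementation.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions.
REQUIRED_FIELDS = {"id", "scale", "exponents", "translation", "rotation_euler"}


def proxy_to_json(primitives):
    """Serialize a list of primitive dicts into the JSON text given to the VLM."""
    return json.dumps({"primitives": primitives}, indent=2)


def parse_edited_proxy(response_text):
    """Parse the VLM reply and reject primitives that are malformed."""
    primitives = json.loads(response_text)["primitives"]
    for prim in primitives:
        missing = REQUIRED_FIELDS - set(prim)
        if missing:
            raise ValueError(f"primitive {prim.get('id')} is missing {sorted(missing)}")
        if any(s <= 0 for s in prim["scale"]):
            raise ValueError(f"primitive {prim['id']} has a non-positive scale")
    return primitives


original = [{"id": "tail", "scale": [0.4, 0.05, 0.05], "exponents": [1.0, 1.0],
             "translation": [-0.6, 0.0, 0.1], "rotation_euler": [0.0, 0.0, 0.0]}]
prompt_payload = proxy_to_json(original)

# A reply a VLM might give for "make the tail longer" (made up for this example).
reply = json.dumps({"primitives": [dict(original[0], scale=[0.6, 0.05, 0.05])]})
edited = parse_edited_proxy(reply)
```

Validating the reply before using it is a design choice of this sketch: a direct parameter edit is only useful downstream if every primitive is still well-formed.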


🔁 Iterative verification and correction. This editing process can be repeated: after obtaining an edited proxy, we render it again and feed the new views back to the VLM, allowing it to verify the edit or make further corrections when needed.
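The edit, render, and verify cycle can be sketched as a small loop. The callables `edit_fn`, `render_fn`, and `verify_fn` are hypothetical stand-ins for the VLM calls and the renderer, and the cap of three rounds is an assumption of this example, not a value from the paper.

```python
def edit_with_verification(proxy, prompt, edit_fn, render_fn, verify_fn, max_rounds=3):
    """Repeat edit -> render -> verify until the edit is accepted or rounds run out."""
    views = render_fn(proxy)
    feedback = None
    for round_idx in range(max_rounds):
        proxy = edit_fn(proxy, views, prompt, feedback)
        views = render_fn(proxy)
        accepted, feedback = verify_fn(views, prompt)
        if accepted:
            return proxy, round_idx + 1
    return proxy, max_rounds


# Toy stand-ins: the "views" are just the tail length, and each edit adds 25%.
def toy_edit(proxy, views, prompt, feedback):
    return {**proxy, "tail_length": proxy["tail_length"] * 1.25}


def toy_render(proxy):
    return [proxy["tail_length"]]


def toy_verify(views, prompt):
    accepted = views[0] >= 1.5
    return accepted, None if accepted else "tail is still too short"


final_proxy, rounds_used = edit_with_verification(
    {"tail_length": 1.0}, "make the tail longer", toy_edit, toy_render, toy_verify
)
```

With these toy functions the first round is rejected and the second is accepted, so the loop stops after two rounds.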


[Figure: VLM proxy editing]

🧭 Proxy-guided structure generation. Once the edited proxy is ready, we use it to guide TRELLIS during sparse-structure generation. To do this, we classify primitives into three categories and inject a different source of information for each one.
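One simple way to obtain the three categories is to diff the original and edited proxies, as in the sketch below. It assumes that primitives carry stable `id` fields and that a small numeric tolerance separates "unchanged" from "edited"; both are assumptions of this example, as are the field names. Primitives that the edit removes are not covered here.

```python
import math

# Hypothetical parameter fields compared between the two proxies.
PARAM_KEYS = ("scale", "exponents", "translation", "rotation_euler")


def classify_primitives(original, edited, tol=1e-6):
    """Label each primitive of the edited proxy as unchanged, edited, or new."""
    original_by_id = {prim["id"]: prim for prim in original}
    labels = {}
    for prim in edited:
        before = original_by_id.get(prim["id"])
        if before is None:
            labels[prim["id"]] = "new"
        elif all(math.isclose(a, b, abs_tol=tol)
                 for key in PARAM_KEYS
                 for a, b in zip(before[key], prim[key])):
            labels[prim["id"]] = "unchanged"
        else:
            labels[prim["id"]] = "edited"
    return labels


def make_prim(prim_id, scale, translation):
    return {"id": prim_id, "scale": scale, "exponents": [1.0, 1.0],
            "translation": translation, "rotation_euler": [0.0, 0.0, 0.0]}


original = [make_prim("seat", [0.5, 0.5, 0.05], [0.0, 0.0, 0.5]),
            make_prim("leg", [0.05, 0.05, 0.25], [0.4, 0.4, 0.25])]
edited = [make_prim("seat", [0.5, 0.5, 0.05], [0.0, 0.0, 0.5]),
          make_prim("leg", [0.05, 0.05, 0.35], [0.4, 0.4, 0.25]),
          make_prim("armrest", [0.05, 0.3, 0.05], [0.5, 0.0, 0.7])]
labels = classify_primitives(original, edited)
# labels == {"seat": "unchanged", "leg": "edited", "armrest": "new"}
```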


🟑 Unchanged regions preserve the original shape. Primitives that remain unchanged mark parts of the object that should stay intact. In these regions, we inject inverted latents from the original shape to preserve its structure and identity.
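A minimal sketch of this kind of region-wise injection is shown below, using flat Python lists in place of a real latent grid. It assumes that a grid cell counts as "unchanged" when its center lies inside the axis-aligned extent of an unchanged primitive, and that injection simply overwrites those cells. How Prox-E builds its masks and applies them inside TRELLIS is not specified here, so every name in the example is hypothetical.

```python
def inside_extent(cell_center, prim):
    """Coarse test: is the cell center within the primitive's axis-aligned extent?"""
    return all(abs(c - t) <= s
               for c, t, s in zip(cell_center, prim["translation"], prim["scale"]))


def unchanged_region_mask(cell_centers, unchanged_prims):
    """True for every cell covered by at least one unchanged primitive."""
    return [any(inside_extent(center, prim) for prim in unchanged_prims)
            for center in cell_centers]


def inject_preserved_latents(current, inverted_original, mask):
    """Overwrite masked cells with the latent inverted from the original shape."""
    return [orig if keep else cur
            for cur, orig, keep in zip(current, inverted_original, mask)]


cell_centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
seat = {"translation": (0.0, 0.0, 0.0), "scale": (1.0, 1.0, 1.0)}
mask = unchanged_region_mask(cell_centers, [seat])             # [True, False]
blended = inject_preserved_latents([5.0, 6.0], [1.0, 2.0], mask)  # [1.0, 6.0]
```

Cells outside the mask keep the value being generated, so only the protected region is tied to the original shape.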


🟣 New regions follow the edited proxy. Newly added primitives indicate regions where new structure should emerge. There, we inject the edited proxy, but only for a limited number of denoising steps, so the model can move beyond the coarse primitive geometry and generate a natural result.
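The idea of injecting guidance for only the first few denoising steps can be sketched as follows. The latents are toy flat lists, `denoise_step` is a hypothetical stand-in for one step of the generative model, and injecting before each step (rather than after) is an assumption of this example.

```python
def denoise_with_early_injection(latent, proxy_latent, new_mask,
                                 denoise_step, num_steps, inject_steps):
    """Overwrite masked cells with the proxy during the first `inject_steps` steps only."""
    for step in range(num_steps):
        if step < inject_steps:
            latent = [proxy if is_new else value
                      for value, proxy, is_new in zip(latent, proxy_latent, new_mask)]
        # After the injection window the model evolves the masked cells freely.
        latent = denoise_step(latent, step)
    return latent


def toy_step(latent, step):
    """Toy 'model' that just adds 1 to every cell, so the effect is easy to trace."""
    return [value + 1.0 for value in latent]


result = denoise_with_early_injection(
    latent=[0.0, 0.0], proxy_latent=[10.0, 10.0], new_mask=[True, False],
    denoise_step=toy_step, num_steps=4, inject_steps=2,
)
# result == [13.0, 4.0]: the masked cell was reset to 10.0 twice, then left alone.
```

Choosing a small `inject_steps` anchors the coarse layout early while leaving the later steps free to add detail beyond the primitive's shape.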


🔵 Edited regions are warped from the original. For primitives whose pose or scale changed, we estimate the transformation from the original primitive to the edited one, warp the original shape accordingly, and inject this warped structure. This helps preserve fine geometric details while relocating them to their new position.
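For intuition, the sketch below estimates a per-axis scale and translation between an original and an edited primitive and applies it to points attached to the original. It deliberately ignores rotation and uses hypothetical field names, so it is a simplified illustration rather than the transformation Prox-E estimates.

```python
def estimate_primitive_transform(original, edited):
    """Return per-axis scale ratios plus the old and new primitive centers."""
    ratio = [e / o for e, o in zip(edited["scale"], original["scale"])]
    return ratio, original["translation"], edited["translation"]


def warp_point(point, transform):
    """Map a point attached to the original primitive onto the edited one."""
    ratio, old_center, new_center = transform
    # Express the point relative to the old center, rescale, then re-center.
    return [n + r * (p - o)
            for p, r, o, n in zip(point, ratio, old_center, new_center)]


# A leg made 1.5x taller while its foot stays on the ground (toy numbers).
leg_before = {"scale": [0.25, 0.25, 0.5], "translation": [0.25, 0.25, 0.5]}
leg_after = {"scale": [0.25, 0.25, 0.75], "translation": [0.25, 0.25, 0.75]}
transform = estimate_primitive_transform(leg_before, leg_after)

foot = warp_point([0.25, 0.25, 0.0], transform)  # bottom of the leg stays at z = 0
top = warp_point([0.25, 0.25, 1.0], transform)   # top of the leg moves up to z = 1.5
```

Because every point is moved with the primitive it belongs to, surface detail on the original leg is carried along instead of being regenerated.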


[Figure: proxy-guided structure generation]

🎨 Appearance refinement for final details. After generating the edited sparse structure, we refine appearance and fine details using TRELLIS’s appearance branch together with a strong 2D image editor. This lets us introduce high-quality appearance changes while still preserving the original object where needed.


BibTeX

@misc{sella2026proxe,
  title = {Prox-E: Fine-Grained {3D} Shape Editing via Primitive-Based Abstractions},
  author = {Sella, Etai and Phung, Hao and Amiel, Nitay and Litany, Or and Patashnik, Or and Averbuch-Elor, Hadar},
  year = {2026},
  note = {arXiv and venue fields to be added}
}

Acknowledgements

Acknowledgements will be updated for the camera-ready version.