Clustering is a technique for partitioning a superscalar processor's execution resources to simultaneously allow for more in-flight instructions, wider issue width, and more aggressive clock speeds. As either the size of individual clusters or the total number of clusters increases, the distance to the first-level data cache increases as well. Although clustering may expose more parallelism by allowing a greater number of instructions to be simultaneously analyzed and issued, the gains may be negated if the latencies to memory grow too large. We propose to augment each cluster with a small, fast, simple Level Zero (L0) data cache that is accessed in parallel with a traditional L1 data cache. Our solution differs from other proposed caching techniques for clustered processors in that we do not support versioning or coherence. This may occasionally cause a load instruction to read a stale value from the L0 cache, but the common case is a low-latency hit in the L0 cache. Our simulation studies show that 4KB, 2-way set associative L0 caches provide a 6.5-12.3% IPC improvement over a wide range of processor configurations.
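To make the mechanism concrete, the following is a minimal behavioral sketch of a per-cluster L0 cache that is probed alongside a shared L1 and kept non-coherent. It is an illustration under stated assumptions, not the simulator used in this work: the 32-byte line size, the LRU replacement policy, the 1-cycle and 3-cycle latencies, the line-granular values, the store policy (write to L1 and to the local L0 only), and all class and function names are hypothetical. Only the 4KB capacity and 2-way associativity come from the text above.

```python
from collections import OrderedDict

LINE_BYTES = 32                  # assumed line size (not given in the text)
L0_BYTES = 4 * 1024              # 4KB L0, as stated in the abstract
WAYS = 2                         # 2-way set associative, as stated
NUM_SETS = L0_BYTES // (LINE_BYTES * WAYS)
L0_LATENCY, L1_LATENCY = 1, 3    # assumed cycle counts, illustration only


class L0Cache:
    """Per-cluster L0 with assumed LRU replacement and no coherence."""

    def __init__(self):
        # One OrderedDict per set, mapping tag -> value (one value per line).
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def _locate(self, addr):
        line = addr // LINE_BYTES
        return line % NUM_SETS, line // NUM_SETS   # (set index, tag)

    def lookup(self, addr):
        index, tag = self._locate(addr)
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)                  # mark most recently used
            return True, ways[tag]
        return False, None

    def fill(self, addr, value):
        index, tag = self._locate(addr)
        ways = self.sets[index]
        if tag not in ways and len(ways) >= WAYS:
            ways.popitem(last=False)               # evict least recently used
        ways[tag] = value
        ways.move_to_end(tag)


class Cluster:
    """A cluster with a private L0 and a reference to the shared L1."""

    def __init__(self, l1):
        self.l0 = L0Cache()
        self.l1 = l1                               # dict: line number -> value

    def load(self, addr):
        hit, value = self.l0.lookup(addr)
        l1_value = self.l1.get(addr // LINE_BYTES, 0)  # L1 probed in parallel
        if hit:
            return value, L0_LATENCY               # may be stale: no coherence
        self.l0.fill(addr, l1_value)
        return l1_value, L1_LATENCY

    def store(self, addr, value):
        # Assumed policy: update L1 and the local L0; other L0s are untouched.
        self.l1[addr // LINE_BYTES] = value
        self.l0.fill(addr, value)


def demo():
    l1 = {}
    a, b = Cluster(l1), Cluster(l1)
    a.store(0x100, 1)
    miss = b.load(0x100)     # L0 miss in cluster b, served by L1
    hit = b.load(0x100)      # low-latency L0 hit
    a.store(0x100, 2)        # cluster b's L0 copy is not invalidated
    stale = b.load(0x100)    # fast hit, but returns the old value
    return miss, hit, stale
```

In this sketch the third load in `demo` hits in cluster b's L0 at the short latency yet returns the value written before cluster a's second store, which is the occasional stale read described above; how such cases are detected or tolerated is outside what this sketch models.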