- **Pre-allocate Slices with Known Capacity.** When the eventual size of a slice is known, pre-allocating with `make([]T, 0, capacity)` creates the underlying array a single time. This critical practice avoids multiple, inefficient reallocations and the expensive process of copying all existing elements to a new, larger array as you `append` data. A minimal sketch follows.
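  For illustration, a made-up `squares` helper that grows a result slice exactly once:

  ```go
  package main

  import "fmt"

  // squares returns the first n squares. The capacity hint means append
  // never has to grow and re-copy the backing array.
  func squares(n int) []int {
  	out := make([]int, 0, n)
  	for i := 0; i < n; i++ {
  		out = append(out, i*i)
  	}
  	return out
  }

  func main() {
  	fmt.Println(squares(5)) // [0 1 4 9 16]
  }
  ```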
- **Use the `arena` Package for Short-Lived Objects.** The experimental `arena` package (gated behind `GOEXPERIMENT=arenas` since Go 1.20) provides a way to allocate memory that can be freed all at once. This is perfect for functions that create many temporary objects (like during a single request), as it can nearly eliminate GC pressure from that workload; note the experiment is not covered by the compatibility promise and may change or be removed.
- **Minimize Escape Analysis Failures.** Escape analysis is the compiler's process to determine if a variable can be allocated on the fast stack. Returning pointers to local variables or storing them in interfaces forces them to "escape" to the slower heap, increasing GC workload. Write code that allows variables to remain on the stack.
- **Use Value Receivers for Small Structs.** For small structs (a few machine words; as a rule of thumb, under the 64 bytes of a typical cache line), passing a copy (a value receiver) is often faster than the indirection of a pointer. The value can be passed directly in CPU registers, which is much faster than fetching it from main memory via a pointer reference.
- **Align Struct Fields by Size.** Ordering struct fields from largest to smallest (e.g., `int64`, then `int32`, then `bool`) lets the compiler minimize memory waste from padding bytes. This creates a more compact memory layout, which improves data locality and CPU cache performance, especially for large arrays of structs, as the sketch below shows.
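  The sizes in the comments assume a typical 64-bit platform and can be verified with `unsafe.Sizeof`:

  ```go
  package main

  import (
  	"fmt"
  	"unsafe"
  )

  type Loose struct {
  	Done  bool  // 1 byte + 7 bytes padding before Count
  	Count int64 // 8 bytes
  	Live  bool  // 1 byte + 7 bytes trailing padding
  }

  type Packed struct {
  	Count int64 // 8 bytes
  	Done  bool  // 1 byte
  	Live  bool  // 1 byte + 6 bytes trailing padding
  }

  func main() {
  	fmt.Println(unsafe.Sizeof(Loose{}), unsafe.Sizeof(Packed{})) // 24 16
  }
  ```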
- **Avoid Unnecessary Pointer Indirection.** Accessing data directly from memory is generally faster than following a pointer to it. Each level of indirection creates a load dependency and risks a CPU cache miss, which stalls execution. Flattening nested structures or using values directly can significantly speed up data access.
- **Use Arrays for Fixed-Size Collections.** An array is a value type that has no header overhead, unlike a slice. If its size is known at compile time, it can be allocated directly on the stack, which is faster and avoids putting pressure on the garbage collector. This makes arrays ideal for small, fixed-size collections.
- **Minimize Interface Conversions.** Assigning a concrete value to an interface can require a heap allocation to store its dynamic type and value information. In performance-critical code paths, doing this repeatedly can create significant GC pressure. Using concrete types directly avoids this allocation overhead entirely.
- **Prefer `map[string]struct{}` for Sets.** To implement a set data structure, using an empty struct (`struct{}`) as the map's value type is the most memory-efficient approach. The empty struct consumes zero bytes of memory, making it far superior to a `bool` or another placeholder type that would waste space; see the sketch below.
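  For example, a hypothetical `dedupe` helper built on such a set:

  ```go
  package main

  import "fmt"

  func dedupe(words []string) []string {
  	seen := make(map[string]struct{}, len(words))
  	out := make([]string, 0, len(words))
  	for _, w := range words {
  		if _, ok := seen[w]; ok {
  			continue
  		}
  		seen[w] = struct{}{} // the value occupies zero bytes
  		out = append(out, w)
  	}
  	return out
  }

  func main() {
  	fmt.Println(dedupe([]string{"a", "b", "a"})) // [a b]
  }
  ```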
- **Use the `maps` Standard Library Package.** Since Go 1.21, the `maps` package provides optimized, generic functions for common map operations like `Clone`, `Copy`, and `DeleteFunc`. Using these functions is often more efficient than writing a manual `for` loop, as they are implemented with performance in mind.
- **Clear Pointers in Slices When Done.** If a large slice holds pointers to objects you no longer need, the GC cannot reclaim their memory as long as the references exist. By explicitly setting these slice elements to `nil`, you break the references, allowing the garbage collector to free the underlying memory sooner.
- **Return Values, Don't Pass Mutation Pointers.** Prefer returning a new value from a function (`newVal := process(oldVal)`) over passing a pointer to be modified. This functional style often leads to clearer data flow, reduces side effects, and can help variables stay on the stack instead of escaping to the heap.
- **Use `runtime.KeepAlive` with `unsafe` or `cgo`.** When passing a Go pointer or a resource derived from one to non-Go code, the garbage collector can't see that the object is still being used. `runtime.KeepAlive` acts as a directive to the compiler, ensuring the object remains reachable up to that point and preventing premature collection; see the sketch below.
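  A sketch modeled on the example in the `runtime.KeepAlive` documentation. It is Unix-only (raw `syscall` calls) and the `File` type is illustrative:

  ```go
  package file

  import (
  	"runtime"
  	"syscall"
  )

  // File wraps a raw descriptor whose finalizer closes it.
  type File struct{ fd int }

  func Open(path string) (*File, error) {
  	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
  	if err != nil {
  		return nil, err
  	}
  	f := &File{fd: fd}
  	runtime.SetFinalizer(f, func(f *File) { syscall.Close(f.fd) })
  	return f, nil
  }

  func (f *File) Read(p []byte) (int, error) {
  	n, err := syscall.Read(f.fd, p)
  	// Without this, f could become unreachable (and its finalizer close
  	// the descriptor) while the raw fd is still in use by the syscall.
  	runtime.KeepAlive(f)
  	return n, err
  }
  ```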
- **Re-evaluate `defer` in Tight Loops.** While `defer` historically had notable overhead, open-coded defers (introduced in Go 1.14) made the cost negligible for common cases. Only remove a `defer` after profiling confirms it's a specific, significant bottleneck. Also remember that deferred calls run at function exit, not at the end of each loop iteration, which matters for resources acquired per iteration.
- **Use `int` for Indexing and Counts.** For loop counters, slice indices, and general counting, use the native `int` type. It matches the target architecture's word size (32-bit or 64-bit), which makes arithmetic and memory addressing operations faster and more efficient than using fixed-size integers.
- **Trust the Default `GOGC` First.** The Go garbage collector has become highly advanced. The default `GOGC=100` setting provides a balanced trade-off for most workloads. Only consider tuning it after profiling reveals a specific need for either lower memory usage or lower GC CPU time.
- **Use `GOMEMLIMIT` in Containers.** Since its introduction in Go 1.19, `GOMEMLIMIT` has become the preferred way to manage memory in containerized environments. It provides the runtime with a clear (soft) memory ceiling, leading to smoother GC pacing and a much better defense against out-of-memory (OOM) kills.
- **Reduce Allocation Rate in Hot Paths.** The most effective way to improve GC performance is to give the collector less work to do. Use profiling tools to find the code paths that allocate the most memory and optimize them to allocate less. This directly lowers GC frequency and total GC CPU time.
- **Avoid Finalizers (`runtime.SetFinalizer`).** Finalizers are unpredictable, can introduce significant latency, and may delay garbage collection. They remain a feature to avoid. Always prefer explicit resource management with a `Close()` or `Release()` method, often combined with `defer` for safety.
- **Disable GC Temporarily with Extreme Caution.** In rare, highly critical sections that are guaranteed not to allocate memory, you can use `debug.SetGCPercent(-1)` to disable the GC. This is a very blunt instrument and must be used with extreme care, re-enabling the GC immediately afterward to avoid OOM errors.
- **Use Buffered Channels Appropriately.** Buffered channels can decouple sender and receiver goroutines, smoothing out bursts of work and improving throughput. However, an incorrectly sized buffer can introduce unwanted latency or even deadlocks. Choose the size based on the expected workload dynamics.
- **Limit Goroutine Creation with Worker Pools.** Spawning an unlimited number of goroutines can exhaust memory and CPU resources due to scheduling overhead. A worker pool, where a fixed number of goroutines process tasks from a channel, provides a robust backpressure mechanism and prevents system overload; a sketch follows.
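  A minimal pool with four workers; the squaring stands in for real work:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  )

  func main() {
  	const workers = 4
  	jobs := make(chan int)
  	results := make(chan int)

  	var wg sync.WaitGroup
  	for w := 0; w < workers; w++ {
  		wg.Add(1)
  		go func() {
  			defer wg.Done()
  			for j := range jobs {
  				results <- j * j // stand-in for real work
  			}
  		}()
  	}

  	go func() {
  		for i := 1; i <= 10; i++ {
  			jobs <- i // blocks once all workers are busy: built-in backpressure
  		}
  		close(jobs)
  		wg.Wait()
  		close(results)
  	}()

  	for r := range results {
  		fmt.Println(r)
  	}
  }
  ```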
- **Use `sync.Map` for Concurrent Read-Heavy Maps.** For maps with many concurrent reads and infrequent writes, `sync.Map` is highly optimized to avoid lock contention. Under this specific access pattern, it can significantly outperform a standard map protected by a global `sync.Mutex` by allowing mostly lock-free reads.
- **Prefer Atomic Operations Over Mutexes.** For simple, primitive operations like incrementing a counter or a compare-and-swap, functions in `sync/atomic` are much faster than using a mutex. They leverage hardware-level instructions that avoid the overhead of scheduler-managed locking, as shown below.
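  For instance, a shared counter using the typed atomics added in Go 1.19:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  	"sync/atomic"
  )

  func main() {
  	var hits atomic.Int64

  	var wg sync.WaitGroup
  	for i := 0; i < 100; i++ {
  		wg.Add(1)
  		go func() {
  			defer wg.Done()
  			hits.Add(1) // a single hardware instruction, no lock
  		}()
  	}
  	wg.Wait()
  	fmt.Println(hits.Load()) // 100
  }
  ```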
- **Use `context` for Cancellation and Timeouts.** Propagate a `context.Context` through your application's call stack to handle cancellation, timeouts, and deadlines gracefully. This ensures that when an operation is cancelled, all downstream goroutines stop their work promptly, preventing wasted computation and resource leaks.
- **Use `select` with a `default` for Non-Blocking Channel Ops.** A `select` statement that includes a `default` case executes the default path immediately if no other channel operation is ready. This is perfect for non-blocking sends and receives, or for polling a channel's status without halting execution; see the sketch below.
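  A non-blocking receive in miniature:

  ```go
  package main

  import "fmt"

  func main() {
  	ch := make(chan int, 1)
  	ch <- 42

  	for i := 0; i < 2; i++ {
  		select {
  		case v := <-ch:
  			fmt.Println("got", v)
  		default:
  			fmt.Println("nothing ready; moving on") // runs instead of blocking
  		}
  	}
  }
  ```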
- **Employ Fan-In/Fan-Out Patterns.** For parallelizing a pipeline of work, use the "fan-out" pattern to distribute tasks among multiple worker goroutines, then the "fan-in" pattern to collect the results from these workers into a single channel for aggregation or further processing.
- **Use `errgroup` for Synchronized Task Groups.** The `golang.org/x/sync/errgroup` package simplifies running a group of subtasks in separate goroutines. It provides robust error handling by propagating the first error that occurs and automatically canceling the context for the other goroutines in the group, as sketched below.
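  A sketch that fetches two pages concurrently; the URLs are placeholders:

  ```go
  package main

  import (
  	"context"
  	"fmt"
  	"net/http"

  	"golang.org/x/sync/errgroup"
  )

  func main() {
  	g, ctx := errgroup.WithContext(context.Background())
  	urls := []string{"https://example.com", "https://example.org"}

  	for _, url := range urls {
  		url := url // not needed on Go 1.22+, required on older versions
  		g.Go(func() error {
  			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  			if err != nil {
  				return err
  			}
  			resp, err := http.DefaultClient.Do(req)
  			if err != nil {
  				return err // the first error cancels ctx for the rest
  			}
  			return resp.Body.Close()
  		})
  	}
  	if err := g.Wait(); err != nil {
  		fmt.Println("fetch failed:", err)
  	}
  }
  ```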
- **Use `sync.Once` for Initialization.** To ensure a piece of code, like initializing a singleton, runs exactly once despite numerous concurrent calls, use `sync.Once` and its `Do` method. It is far more efficient and safer for one-time initialization than a mutex and a boolean flag.
- **Use Channels for Signaling.** A channel of empty structs (`chan struct{}`) is a low-overhead, idiomatic way to signal events between goroutines. Closing such a channel provides an effective broadcast mechanism, unblocking all goroutines currently waiting to receive from it, which is perfect for signaling "done" (see below).
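  The broadcast in miniature:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  )

  func main() {
  	done := make(chan struct{})
  	var wg sync.WaitGroup

  	for i := 0; i < 3; i++ {
  		wg.Add(1)
  		go func(id int) {
  			defer wg.Done()
  			<-done // blocks until the channel is closed
  			fmt.Println("worker", id, "shutting down")
  		}(i)
  	}

  	close(done) // one close unblocks every waiting receiver
  	wg.Wait()
  }
  ```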
- **Understand False Sharing.** False sharing occurs when goroutines on different CPU cores modify variables located on the same CPU cache line, forcing constant, expensive cache invalidation. Add padding to structs so that concurrently accessed fields land on different cache lines, eliminating this contention, as sketched below.
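  One common padding layout, assuming the typical 64-byte cache line (line size varies by CPU):

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  	"sync/atomic"
  )

  // atomic.Int64 is 8 bytes; the padding rounds each element up to 64,
  // so neighboring shards never share a cache line.
  type paddedCounter struct {
  	n atomic.Int64
  	_ [56]byte
  }

  func main() {
  	var shards [4]paddedCounter
  	var wg sync.WaitGroup
  	for i := range shards {
  		wg.Add(1)
  		go func(i int) {
  			defer wg.Done()
  			for j := 0; j < 1_000_000; j++ {
  				shards[i].n.Add(1) // each core hammers its own line
  			}
  		}(i)
  	}
  	wg.Wait()
  	fmt.Println(shards[0].n.Load())
  }
  ```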
- **Profile for Lock Contention.** In addition to CPU and memory, `pprof` can identify lock contention, where goroutines spend excessive time waiting for a mutex. High contention effectively serializes your parallel code; use the mutex and block profiles to find and redesign these bottlenecks, perhaps using finer-grained locking.
- **Enable Profile-Guided Optimization (PGO).** PGO, generally available since Go 1.21, uses runtime profiles to guide the compiler's optimization decisions, including patterns like interface devirtualization, and recent releases have steadily widened its reach. It is highly recommended for production builds.
- **Leverage Function Inlining.** The compiler automatically inlines small, simple functions, which eliminates function call overhead. Keeping functions short and simple increases the likelihood they will be inlined. Check the compiler's decisions with `-gcflags="-m"`.
- **Enable Bounds Check Elimination.** The compiler can remove slice bounds checks if it can prove an index is safe (e.g., in a standard `for i := range s` loop). Writing idiomatic loops helps the compiler perform this optimization, making loop-heavy code faster.
- **Use Build Tags for Platform-Specific Code.** Use build tags (e.g., `//go:build amd64`) to write optimized code for specific architectures, allowing you to leverage special instruction sets like AVX2 where available, while providing a generic fallback for others.
- **Strip Binaries for Smaller Size.** Use the linker flags `-ldflags="-s -w"` to strip the symbol table and DWARF debugging information from your final compiled binary. This significantly reduces its size, leading to faster container image builds, quicker deployments, and lower storage costs.
- **Statically Link Binaries.** Use the environment variable `CGO_ENABLED=0` during your build to create a pure Go, statically linked binary. This bundles all dependencies into a single file, which simplifies deployment in minimal environments (like a Docker `scratch` image) and avoids the performance overhead of cgo.
- **Analyze Compiler Optimizations.** Use the build flag `-gcflags="-m"` to see what the compiler decided to optimize. This output reveals which functions were inlined and which variables escaped to the heap; bounds-check elimination can be inspected separately with the `-gcflags="-d=ssa/check_bce"` debug flag. These insights are valuable for manual tuning.
- **Use `strings.Builder` for String Concatenation.** Repeatedly using the `+` operator to build a string is highly inefficient because it creates a new string and copies data at each step. `strings.Builder` minimizes allocations by using a resizable internal byte buffer, making it dramatically faster for building strings from multiple pieces (sketch below).
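  A small sketch (the `join` helper is illustrative; `strings.Join` already covers this exact case):

  ```go
  package main

  import (
  	"fmt"
  	"strings"
  )

  func join(parts []string) string {
  	var b strings.Builder
  	b.Grow(64) // optional: pre-size when the final length is roughly known
  	for i, p := range parts {
  		if i > 0 {
  			b.WriteByte(',')
  		}
  		b.WriteString(p)
  	}
  	return b.String() // hands over the buffer without another copy
  }

  func main() {
  	fmt.Println(join([]string{"a", "b", "c"})) // a,b,c
  }
  ```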
- **Choose an Appropriate Map Initial Size.** If you have an estimate of the number of items a map will hold, provide it as a capacity hint: `make(map[K]V, size)`. This pre-allocates the necessary memory upfront, preventing the map from being incrementally resized and re-hashed as you add elements.
- **Cache Computed Values (Memoization).** Memoization stores the results of expensive function calls in a cache (like a map). When the same inputs occur again, you return the cached result instead of re-computing it, trading a small amount of memory for a significant reduction in computation time.
- **Use Lookup Tables.** Instead of running a complex calculation or a multi-case `switch` statement repeatedly, pre-compute the results for all possible inputs and store them in a slice or map. This transforms an expensive computation into a very fast, constant-time memory lookup.
- **Optimize Hot Paths First.** Don't waste time optimizing code that rarely runs. Use profiling tools like `pprof` to identify the "hot paths": the small number of functions where your program spends the vast majority of its time. Focusing your optimization efforts on these specific bottlenecks will yield the greatest gains.
- **Avoid Reflection in Performance-Critical Code.** Go's `reflect` package is powerful for dynamic programming but notoriously slow, because it must inspect and manipulate types at runtime. In performance-sensitive code, avoid reflection and consider faster alternatives like code generation, type assertions, or generics.
- **Batch Operations.** Processing items in batches amortizes fixed overheads from function calls, system calls, or network round-trips. Grouping items (e.g., sending multiple database inserts in one query) is a fundamental strategy for increasing throughput in any system.
- **Use Slice Tricks for Efficient Operations.** Go's slice mechanics allow for powerful and efficient in-place operations that can avoid new memory allocations. Learning common idioms, such as in-place filtering or deleting an element from a slice, is key to writing high-performance Go code.
- **Use Bitfields for Boolean Flags.** Instead of a `struct` with multiple `bool` fields (each taking at least one byte), you can pack many boolean flags into a single integer using bitwise operations. This can dramatically reduce memory usage, especially for large collections of objects with many flags; see the sketch below.
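  For example, hypothetical user flags packed into one byte:

  ```go
  package main

  import "fmt"

  type Flags uint8

  const (
  	FlagActive Flags = 1 << iota
  	FlagAdmin
  	FlagVerified
  )

  func main() {
  	f := FlagActive | FlagVerified                     // set two flags
  	f &^= FlagActive                                   // clear one
  	fmt.Println(f&FlagVerified != 0, f&FlagAdmin != 0) // true false
  }
  ```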
- **Use Generics for Type-Safe, Allocation-Free Containers.** Since their introduction in Go 1.18, generics have become a powerful tool for writing reusable data structures (e.g., lists, trees) that are both fully type-safe and free of the heap allocations that interface-based designs previously required.
- **Hoist Calculations Out of Loops.** If a calculation inside a loop produces the same result in every iteration, it's inefficient to recompute it. Perform the calculation once before the loop begins and store the result in a variable. This guarantees the optimization and can also improve the code's clarity.
- **Use Buffered I/O.** Wrap raw `io.Reader` or `io.Writer` values with `bufio` to minimize expensive system calls. Reading or writing in larger chunks, instead of one byte or one small write at a time, dramatically improves I/O performance.
- **Reuse HTTP Clients.** A long-lived `http.Client` maintains a pool of persistent TCP connections, allowing it to reuse them for subsequent requests to the same host (HTTP keep-alive). Creating a new client for each request is inefficient, as it forces a new TCP and TLS handshake every time.
- **Set Appropriate Timeouts.** Never make network calls or perform I/O without a timeout. A slow or unresponsive remote service can cause your goroutines to hang indefinitely, consuming system resources like memory and file descriptors. Timeouts are critical for building resilient and responsive applications.
- **Use `io.Copy` with Buffer Pools.** `io.Copy` is highly optimized for transferring data between a reader and a writer. You can make it even more efficient by using `io.CopyBuffer` in conjunction with a `sync.Pool` that provides the buffer, avoiding a new allocation for every copy operation; a sketch follows.
- **Implement Connection Pooling.** Establishing network connections to databases or other services is slow and resource-intensive. A connection pool maintains a set of open, reusable connections, which amortizes the high setup cost across many requests and significantly improves application throughput and latency.
- **Use `io.ReaderFrom` and `io.WriterTo`.** When you use `io.Copy`, it internally checks whether the source or destination implements these interfaces. If so, `io.Copy` delegates to them for a more efficient copying strategy, such as using the `sendfile` syscall to avoid user-space buffers entirely.
- **Avoid `io.ReadAll` for Large Data.** Reading an entire file or HTTP request body into memory with `io.ReadAll` can easily lead to high memory consumption and out-of-memory errors. Instead, process data in a streaming fashion through the provided `io.Reader` to keep memory usage low and constant, as shown below.
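  For instance, counting lines without ever holding the whole file (the filename is a placeholder; note `bufio.Scanner` has a default per-line limit of 64 KiB):

  ```go
  package main

  import (
  	"bufio"
  	"fmt"
  	"log"
  	"os"
  )

  func main() {
  	f, err := os.Open("big.log")
  	if err != nil {
  		log.Fatal(err)
  	}
  	defer f.Close()

  	lines := 0
  	sc := bufio.NewScanner(f) // streams the file; memory stays flat
  	for sc.Scan() {
  		lines++
  	}
  	if err := sc.Err(); err != nil {
  		log.Fatal(err)
  	}
  	fmt.Println(lines, "lines")
  }
  ```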
- **Minimize File System `stat` Calls.** A `stat` system call, which checks a file's existence and metadata, can be surprisingly slow, especially on network file systems or cloud storage. If possible, cache the results of these calls or design your application to avoid redundant checks in tight loops.
- **Use the `slog` Package for Structured Logging.** The `log/slog` package, in the standard library since Go 1.21, is designed for high-performance structured logging. It avoids reflection-heavy formatting and produces machine-readable output that is easy to parse and query, reducing observability overhead.
- **Use Protocol Buffers or gRPC for Internal Services.** For server-to-server communication, text-based formats like JSON can be slow to serialize and parse. Binary protocols like Protocol Buffers are much more compact and CPU-efficient, leading to lower network latency, reduced bandwidth usage, and higher overall throughput.
- **Understand CPU Cache Lines.** Sequential memory access is fast because it maximizes the use of data automatically prefetched by the CPU into its cache. Random memory access patterns frequently lead to cache misses, which are very slow, as they force the CPU to wait for data from main memory.
- **Leverage SIMD Instructions.** For heavy numerical computations, such as in machine learning or data analysis, use libraries or Go assembly that leverage Single Instruction, Multiple Data (SIMD) instructions. This lets the CPU perform the same operation on multiple data points simultaneously, offering massive speedups.
- **Use `cgo` Sparingly.** Calling C code from Go via `cgo` has significant overhead due to the need to switch between Go's goroutine scheduler and the system's thread scheduler. If you must use `cgo`, batch calls into larger chunks to minimize the number of transitions between the two worlds.
- **Write Custom Go Assembly.** For the absolute most performance-critical parts of an application, such as cryptography or advanced mathematics, you can write functions directly in Go's assembler dialect. This gives you complete control over CPU instructions but is complex, non-portable, and hard to maintain.
- **Use `unsafe.Pointer` for Zero-Copy Conversions.** With extreme care, the `unsafe` package can reinterpret memory as a different type without making a copy, such as converting a `[]byte` to a `string`. Since Go 1.20, `unsafe.String` and `unsafe.SliceData` make such conversions slightly less error-prone, but they still bypass Go's type-safety guarantees and can lead to subtle, dangerous bugs.
- **Tune `GOMAXPROCS`.** This environment variable controls the maximum number of OS threads that can execute Go code simultaneously. The default, the number of available CPU cores, is almost always optimal. Only tune it for a specific, measurable reason, such as matching a container's CPU quota.
- **Use Memory-Mapped Files for Large Data.** Memory-mapping a file, for example with `golang.org/x/exp/mmap`, lets you access its contents as if it were an in-memory slice. The operating system loads pages of the file on demand, providing efficient random access to large files without high RAM usage.
- **Use `syscall` or `x/sys` for Direct OS Interaction.** The standard `os` package provides a portable, high-level interface to operating system functionality. For maximum performance in scenarios like high-speed networking, using the `syscall` or `golang.org/x/sys` packages directly can bypass abstractions and reduce overhead.
- **Tune Kernel TCP Settings.** For network-heavy applications, tuning kernel-level TCP settings like buffer sizes (`net.core.rmem_max`, `net.core.wmem_max`) and enabling modern congestion control algorithms like TCP BBR can significantly improve network throughput and reduce latency for your users.
- **Avoid `go:linkname`.** This compiler directive allows linking to private functions in other packages. While powerful for hacking, it is extremely brittle, breaks Go's encapsulation guarantees, and is likely to fail with new Go versions. Avoid it in production code in favor of safer alternatives.
- **Design APIs for Batching.** Provide API endpoints that accept and process multiple items or operations in a single request. This is far more efficient than forcing clients to make many small, individual requests, as it dramatically reduces network round-trip overhead and allows for more efficient server-side processing.
- **Separate Read/Write Concerns (CQRS).** In high-load systems, separating the data models and code paths for reads (queries) and writes (commands) lets you optimize each path independently. For example, you can create denormalized, pre-computed read models optimized for extremely fast lookups.
- **Design for Statelessness.** Stateless services are significantly easier to scale horizontally. By avoiding server-side session state, you can add or remove instances without worrying about session affinity or data migration, which lets load balancers distribute traffic efficiently and simplifies your architecture.
- **Implement Circuit Breakers.** In a microservices architecture, a circuit breaker prevents a client from repeatedly calling a service that is failing. This stops cascading failures across your system and gives the troubled service time to recover, improving overall resilience and performance under partial failure.
- **Cache Aggressively.** Cache data at every sensible layer: in-memory with a TTL for hot data, in a distributed cache like Redis for shared state, or at the edge with a CDN for static assets. Caching is often the single most effective performance optimization.
- **Use Message Queues for Asynchronous Work.** For long-running or resource-intensive tasks triggered by a user request, don't make the user wait. Offload the work to a background worker process via a message queue (like RabbitMQ or NATS) and return a response to the user immediately.
- **Denormalize Data for Fast Reads.** While database normalization is crucial for data integrity, it can lead to slow queries that require many joins. For performance-critical read paths, create denormalized views or tables that are pre-joined and optimized for very fast, simple lookups.
- **Implement Pagination for Large Datasets.** Never return thousands of records in a single API response. Implement pagination (offset-based or cursor-based) to return data in manageable chunks. This improves response time, reduces memory usage on both server and client, and lowers network bandwidth.
- **Prefer HTTP/2 or HTTP/3.** These newer HTTP versions multiplex many requests and responses over a single connection concurrently, which reduces latency, especially for applications that load many small assets. HTTP/2 still suffers TCP-level head-of-line blocking, which HTTP/3's QUIC transport eliminates.
- **Consider Materialized Views.** For complex, slow-running queries that execute frequently with the same parameters, a materialized view pre-computes and stores the results in a physical table. This transforms the expensive, on-the-fly query into a simple and extremely fast `SELECT` operation.
- **Break Down Large Functions.** Small, well-defined functions are easier for the Go compiler to analyze, optimize, and potentially inline. They are also significantly easier for humans to read, test, and maintain, which indirectly contributes to better-performing, more reliable code over the long term.
- **Defer `Unlock` Immediately After `Lock`.** The `mutex.Lock(); defer mutex.Unlock()` pattern is both idiomatic and safe in Go. It guarantees that the mutex is always released, even if the function panics or has multiple return paths. This is a crucial practice for preventing deadlocks in concurrent programs.
- **Use Type Switches Over Interface Method Calls in Some Cases.** In a tight loop where you need to act on an interface's underlying concrete type, a type switch can sometimes be faster than a standard method call, because the compiler may optimize it better than fully dynamic dispatch.
- **Avoid Global Variables.** Global variables, especially mutable ones, introduce hidden dependencies between different parts of your program and create hard-to-debug contention points in concurrent code. It is far better to pass dependencies explicitly to the functions that need them.
- **Write Benchmarks for Everything Critical.** Do not optimize based on assumptions. Use Go's built-in `testing` package to write benchmarks for your critical code paths. This lets you scientifically measure the impact of changes, validate improvements, and prevent future performance regressions; a minimal example follows.
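  A minimal benchmark, placed in a `_test.go` file and run with `go test -bench=. -benchmem`:

  ```go
  package demo

  import (
  	"strings"
  	"testing"
  )

  func BenchmarkJoin(b *testing.B) {
  	parts := []string{"alpha", "beta", "gamma"}
  	b.ReportAllocs() // report allocations alongside ns/op
  	for i := 0; i < b.N; i++ {
  		_ = strings.Join(parts, ",")
  	}
  }
  ```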
- **Pin Dependency Versions.** Always commit your `go.mod` and `go.sum` files to lock your project's dependency versions. This keeps builds fully reproducible and prevents unexpected performance changes or bugs introduced by a transitive dependency's silent update.
- **Use Table-Driven Tests for Comprehensive Benchmarks.** The idiomatic table-driven test pattern is also excellent for benchmarking. It makes it easy to measure your code's performance across a wide range of inputs and sizes, helping you spot edge cases where performance degrades unexpectedly.
- **Keep Your Go Version Updated.** Each new Go release brings improvements to the compiler, runtime, and garbage collector. Simply recompiling your existing code with a newer toolchain can often yield a free, effortless performance boost.
- **Write Clear, Simple Code.** Complex, clever code is hard to reason about, both for humans and for the Go compiler. Simple, idiomatic Go is often the most performant, because the compiler understands these common patterns best and can optimize them far more effectively.
- **Use Linters and Static Analysis Tools.** Tools like `staticcheck` can automatically find common performance issues and anti-patterns, such as unnecessary string-to-byte conversions inside a loop or inefficient uses of standard library functions, that are easy to miss during a manual code review.
- **Use Streaming Encoders and Decoders.** Instead of loading a large JSON payload into memory, use `json.NewEncoder(writer)` and `json.NewDecoder(reader)`. This processes the data as a stream, drastically reducing RAM usage for large datasets and improving latency by sending data as it becomes available; see the sketch below.
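  Sketched as an HTTP handler round-trip (the `Event` type is made up):

  ```go
  package main

  import (
  	"encoding/json"
  	"log"
  	"net/http"
  )

  type Event struct {
  	ID   int    `json:"id"`
  	Name string `json:"name"`
  }

  func handler(w http.ResponseWriter, r *http.Request) {
  	var ev Event
  	// Decode straight from the body; no intermediate []byte.
  	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
  		http.Error(w, err.Error(), http.StatusBadRequest)
  		return
  	}
  	// Encode straight to the ResponseWriter as well.
  	if err := json.NewEncoder(w).Encode(ev); err != nil {
  		log.Println(err)
  	}
  }

  func main() {
  	http.HandleFunc("/event", handler)
  	log.Fatal(http.ListenAndServe(":8080", nil))
  }
  ```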
- **Generate Code for Serialization.** Employ tools like `easyjson` or protobuf code generators that emit marshalling and unmarshalling code at compile time. This avoids the runtime overhead of the reflection used by standard encoders, resulting in significantly faster serialization and deserialization.
- **Use Gob for Go-to-Go Communication.** When communicating between Go services, the standard `encoding/gob` package is often much faster than JSON or XML. It's a binary encoding format designed specifically for Go data types, making it highly efficient for RPCs or for caching Go structs in systems like Redis.
- **Leverage Hardware-Accelerated Crypto.** Use modern cryptographic algorithms like AES-GCM: Go's standard library implementation takes direct advantage of hardware acceleration (AES-NI) on most modern CPUs, making encryption and decryption dramatically faster than pure-software implementations.
- **Use Streaming Ciphers for Large Data.** For encrypting large files or network streams, use a streaming construction such as `cipher.Stream` from the `crypto/cipher` package. This lets you encrypt data in small chunks as it becomes available, avoiding the need to load the entire content into memory first.
- **Reuse Cryptographic Objects.** The setup for cryptographic ciphers can be computationally expensive; creating a new `cipher.AEAD`, for instance, involves key setup. Where possible, reuse these objects for multiple encryption or decryption operations with the same key to amortize the setup cost.
- **Limit Concurrent File Descriptors.** When processing many files concurrently, use a semaphore (e.g., a buffered channel of `struct{}`) to limit the number of simultaneously open file descriptors. This prevents your application from hitting the operating system's `ulimit` and avoids performance degradation from resource exhaustion; a sketch follows.
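  A sketch with hypothetical file paths and a cap of 64 open files:

  ```go
  package main

  import (
  	"fmt"
  	"os"
  	"sync"
  )

  func main() {
  	paths := []string{"a.txt", "b.txt", "c.txt"} // placeholders
  	sem := make(chan struct{}, 64)               // at most 64 files open at once

  	var wg sync.WaitGroup
  	for _, p := range paths {
  		wg.Add(1)
  		go func(path string) {
  			defer wg.Done()
  			sem <- struct{}{}        // acquire a slot
  			defer func() { <-sem }() // release it
  			data, err := os.ReadFile(path)
  			if err != nil {
  				fmt.Println(err)
  				return
  			}
  			fmt.Println(path, len(data), "bytes")
  		}(p)
  	}
  	wg.Wait()
  }
  ```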
- **Use `os.ReadDir` over Deprecated `ioutil`.** For reading directory contents, use `os.ReadDir` (stable since Go 1.16). It returns a slice of `os.DirEntry` and is more efficient than the older, deprecated `ioutil.ReadDir` because it does not `stat` every file.
- **Choose Efficient Compression Algorithms.** Select a compression algorithm based on your specific needs for speed versus size. `zstd` (available via third-party packages) often balances compression ratio and performance better than `gzip`. The standard library's `flate`-based packages also offer compression levels, where a lower level is faster but compresses less.
- **Bypass OS Page Cache with `O_DIRECT` When Needed.** For specialized applications like databases that manage their own caching, opening files with `O_DIRECT` via syscalls bypasses the OS page cache. This avoids double-caching data in both your application and the OS and gives you more direct control over I/O, though it requires careful memory alignment.