- **Pre-allocate Slices with Known Capacity.** When the eventual size of a slice is known, pre-allocating with `make([]T, 0, capacity)` creates the underlying array a single time. This critical practice avoids multiple, inefficient reallocations and the expensive process of copying all existing elements to a new, larger array as you `append` data. A minimal sketch follows.
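  For illustration, a made-up `squares` helper that grows a result slice exactly once:

  ```go
  package main

  import "fmt"

  // squares returns the first n squares. The capacity hint means append
  // never has to grow and re-copy the backing array.
  func squares(n int) []int {
  	out := make([]int, 0, n)
  	for i := 0; i < n; i++ {
  		out = append(out, i*i)
  	}
  	return out
  }

  func main() {
  	fmt.Println(squares(5)) // [0 1 4 9 16]
  }
  ```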
- **Use the `arena` Package for Short-Lived Objects.** The experimental `arena` package (gated behind `GOEXPERIMENT=arenas` since Go 1.20) provides a way to allocate memory that can be freed all at once. This is perfect for functions that create many temporary objects (like during a single request), as it can nearly eliminate GC pressure from that workload; note the experiment is not covered by the compatibility promise and may change or be removed.
- **Minimize Escape Analysis Failures.** Escape analysis is the compiler's process to determine if a variable can be allocated on the fast stack. Returning pointers to local variables or storing them in interfaces forces them to "escape" to the slower heap, increasing GC workload. Write code that allows variables to remain on the stack.
- **Use Value Receivers for Small Structs.** For small structs (a few machine words; as a rule of thumb, under the 64 bytes of a typical cache line), passing a copy (a value receiver) is often faster than the indirection of a pointer. The value can be passed directly in CPU registers, which is much faster than fetching it from main memory via a pointer reference.
- **Align Struct Fields by Size.** Ordering struct fields from largest to smallest (e.g., `int64`, then `int32`, then `bool`) lets the compiler minimize memory waste from padding bytes. This creates a more compact memory layout, which improves data locality and CPU cache performance, especially for large arrays of structs, as the sketch below shows.
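  The sizes in the comments assume a typical 64-bit platform and can be verified with `unsafe.Sizeof`:

  ```go
  package main

  import (
  	"fmt"
  	"unsafe"
  )

  type Loose struct {
  	Done  bool  // 1 byte + 7 bytes padding before Count
  	Count int64 // 8 bytes
  	Live  bool  // 1 byte + 7 bytes trailing padding
  }

  type Packed struct {
  	Count int64 // 8 bytes
  	Done  bool  // 1 byte
  	Live  bool  // 1 byte + 6 bytes trailing padding
  }

  func main() {
  	fmt.Println(unsafe.Sizeof(Loose{}), unsafe.Sizeof(Packed{})) // 24 16
  }
  ```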
- **Avoid Unnecessary Pointer Indirection.** Accessing data directly from memory is generally faster than following a pointer to it. Each level of indirection creates a load dependency and risks a CPU cache miss, which stalls execution. Flattening nested structures or using values directly can significantly speed up data access.
- **Use Arrays for Fixed-Size Collections.** An array is a value type that has no header overhead, unlike a slice. If its size is known at compile time, it can be allocated directly on the stack, which is faster and avoids putting pressure on the garbage collector. This makes arrays ideal for small, fixed-size collections.
- **Minimize Interface Conversions.** Assigning a concrete value to an interface can require a heap allocation to store its dynamic type and value information. In performance-critical code paths, doing this repeatedly can create significant GC pressure. Using concrete types directly avoids this allocation overhead entirely.
- **Prefer `map[string]struct{}` for Sets.** To implement a set data structure, using an empty struct (`struct{}`) as the map's value type is the most memory-efficient approach. The empty struct consumes zero bytes of memory, making it far superior to a `bool` or another placeholder type that would waste space; see the sketch below.
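  For example, a hypothetical `dedupe` helper built on such a set:

  ```go
  package main

  import "fmt"

  func dedupe(words []string) []string {
  	seen := make(map[string]struct{}, len(words))
  	out := make([]string, 0, len(words))
  	for _, w := range words {
  		if _, ok := seen[w]; ok {
  			continue
  		}
  		seen[w] = struct{}{} // the value occupies zero bytes
  		out = append(out, w)
  	}
  	return out
  }

  func main() {
  	fmt.Println(dedupe([]string{"a", "b", "a"})) // [a b]
  }
  ```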
- **Use the `maps` Standard Library Package.** Since Go 1.21, the `maps` package provides optimized, generic functions for common map operations like `Clone`, `Copy`, and `DeleteFunc`. Using these functions is often more efficient than writing a manual `for` loop, as they are implemented with performance in mind.
- **Clear Pointers in Slices When Done.** If a large slice holds pointers to objects you no longer need, the GC cannot reclaim their memory as long as the references exist. By explicitly setting these slice elements to `nil`, you break the references, allowing the garbage collector to free the underlying memory sooner.
- **Return Values, Don't Pass Mutation Pointers.** Prefer returning a new value from a function (`newVal := process(oldVal)`) over passing a pointer to be modified. This functional style often leads to clearer data flow, reduces side effects, and can help variables stay on the stack instead of escaping to the heap.
- **Use `runtime.KeepAlive` with `unsafe` or `cgo`.** When passing a Go pointer or a resource derived from one to non-Go code, the garbage collector can't see that the object is still being used. `runtime.KeepAlive` acts as a directive to the compiler, ensuring the object remains reachable up to that point and preventing premature collection; see the sketch below.
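  A sketch modeled on the example in the `runtime.KeepAlive` documentation. It is Unix-only (raw `syscall` calls) and the `File` type is illustrative:

  ```go
  package file

  import (
  	"runtime"
  	"syscall"
  )

  // File wraps a raw descriptor whose finalizer closes it.
  type File struct{ fd int }

  func Open(path string) (*File, error) {
  	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
  	if err != nil {
  		return nil, err
  	}
  	f := &File{fd: fd}
  	runtime.SetFinalizer(f, func(f *File) { syscall.Close(f.fd) })
  	return f, nil
  }

  func (f *File) Read(p []byte) (int, error) {
  	n, err := syscall.Read(f.fd, p)
  	// Without this, f could become unreachable (and its finalizer close
  	// the descriptor) while the raw fd is still in use by the syscall.
  	runtime.KeepAlive(f)
  	return n, err
  }
  ```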
- **Re-evaluate `defer` in Tight Loops.** While `defer` historically had notable overhead, open-coded defers (introduced in Go 1.14) made the cost negligible for common cases. Only remove a `defer` after profiling confirms it's a specific, significant bottleneck. Also remember that deferred calls run at function exit, not at the end of each loop iteration, which matters for resources acquired per iteration.
- **Use `int` for Indexing and Counts.** For loop counters, slice indices, and general counting, use the native `int` type. It matches the target architecture's word size (32-bit or 64-bit), which makes arithmetic and memory addressing operations faster and more efficient than using fixed-size integers.
- **Trust the Default `GOGC` First.** The Go garbage collector has become highly advanced. The default `GOGC=100` setting provides a balanced trade-off for most workloads. Only consider tuning it after profiling reveals a specific need for either lower memory usage or lower GC CPU time.
- **Use `GOMEMLIMIT` in Containers.** Since its introduction in Go 1.19, `GOMEMLIMIT` has become the preferred way to manage memory in containerized environments. It provides the runtime with a clear (soft) memory ceiling, leading to smoother GC pacing and a much better defense against out-of-memory (OOM) kills.
- **Reduce Allocation Rate in Hot Paths.** The most effective way to improve GC performance is to give the collector less work to do. Use profiling tools to find the code paths that allocate the most memory and optimize them to allocate less. This directly lowers GC frequency and total GC CPU time.
- **Avoid Finalizers (`runtime.SetFinalizer`).** Finalizers are unpredictable, can introduce significant latency, and may delay garbage collection. They remain a feature to avoid. Always prefer explicit resource management with a `Close()` or `Release()` method, often combined with `defer` for safety.
- **Disable GC Temporarily with Extreme Caution.** In rare, highly critical sections that are guaranteed not to allocate memory, you can use `debug.SetGCPercent(-1)` to disable the GC. This is a very blunt instrument and must be used with extreme care, re-enabling the GC immediately afterward to avoid OOM errors.
- **Use Buffered Channels Appropriately.** Buffered channels can decouple sender and receiver goroutines, smoothing out bursts of work and improving throughput. However, an incorrectly sized buffer can introduce unwanted latency or even deadlocks. Choose the size based on the expected workload dynamics.
- **Limit Goroutine Creation with Worker Pools.** Spawning an unlimited number of goroutines can exhaust memory and CPU resources due to scheduling overhead. A worker pool, where a fixed number of goroutines process tasks from a channel, provides a robust backpressure mechanism and prevents system overload; a sketch follows.
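  A minimal pool with four workers; the squaring stands in for real work:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  )

  func main() {
  	const workers = 4
  	jobs := make(chan int)
  	results := make(chan int)

  	var wg sync.WaitGroup
  	for w := 0; w < workers; w++ {
  		wg.Add(1)
  		go func() {
  			defer wg.Done()
  			for j := range jobs {
  				results <- j * j // stand-in for real work
  			}
  		}()
  	}

  	go func() {
  		for i := 1; i <= 10; i++ {
  			jobs <- i // blocks once all workers are busy: built-in backpressure
  		}
  		close(jobs)
  		wg.Wait()
  		close(results)
  	}()

  	for r := range results {
  		fmt.Println(r)
  	}
  }
  ```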
- **Use `sync.Map` for Concurrent Read-Heavy Maps.** For maps with many concurrent reads and infrequent writes, `sync.Map` is highly optimized to avoid lock contention. Under this specific access pattern, it can significantly outperform a standard map protected by a global `sync.Mutex` by allowing mostly lock-free reads.
- **Prefer Atomic Operations Over Mutexes.** For simple, primitive operations like incrementing a counter or a compare-and-swap, functions in `sync/atomic` are much faster than using a mutex. They leverage hardware-level instructions that avoid the overhead of scheduler-managed locking, as shown below.
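  For instance, a shared counter using the typed atomics added in Go 1.19:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  	"sync/atomic"
  )

  func main() {
  	var hits atomic.Int64

  	var wg sync.WaitGroup
  	for i := 0; i < 100; i++ {
  		wg.Add(1)
  		go func() {
  			defer wg.Done()
  			hits.Add(1) // a single hardware instruction, no lock
  		}()
  	}
  	wg.Wait()
  	fmt.Println(hits.Load()) // 100
  }
  ```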
- **Use `context` for Cancellation and Timeouts.** Propagate a `context.Context` through your application's call stack to handle cancellation, timeouts, and deadlines gracefully. This ensures that when an operation is cancelled, all downstream goroutines stop their work promptly, preventing wasted computation and resource leaks.
- **Use `select` with a `default` for Non-Blocking Channel Ops.** A `select` statement that includes a `default` case executes the default path immediately if no other channel operation is ready. This is perfect for non-blocking sends and receives, or for polling a channel's status without halting execution; see the sketch below.
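  A non-blocking receive in miniature:

  ```go
  package main

  import "fmt"

  func main() {
  	ch := make(chan int, 1)
  	ch <- 42

  	for i := 0; i < 2; i++ {
  		select {
  		case v := <-ch:
  			fmt.Println("got", v)
  		default:
  			fmt.Println("nothing ready; moving on") // runs instead of blocking
  		}
  	}
  }
  ```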
- **Employ Fan-In/Fan-Out Patterns.** For parallelizing a pipeline of work, use the "fan-out" pattern to distribute tasks among multiple worker goroutines, then the "fan-in" pattern to collect the results from these workers into a single channel for aggregation or further processing.
- **Use `errgroup` for Synchronized Task Groups.** The `golang.org/x/sync/errgroup` package simplifies running a group of subtasks in separate goroutines. It provides robust error handling by propagating the first error that occurs and automatically canceling the context for the other goroutines in the group, as sketched below.
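  A sketch that fetches two pages concurrently; the URLs are placeholders:

  ```go
  package main

  import (
  	"context"
  	"fmt"
  	"net/http"

  	"golang.org/x/sync/errgroup"
  )

  func main() {
  	g, ctx := errgroup.WithContext(context.Background())
  	urls := []string{"https://example.com", "https://example.org"}

  	for _, url := range urls {
  		url := url // not needed on Go 1.22+, required on older versions
  		g.Go(func() error {
  			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
  			if err != nil {
  				return err
  			}
  			resp, err := http.DefaultClient.Do(req)
  			if err != nil {
  				return err // the first error cancels ctx for the rest
  			}
  			return resp.Body.Close()
  		})
  	}
  	if err := g.Wait(); err != nil {
  		fmt.Println("fetch failed:", err)
  	}
  }
  ```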
- **Use `sync.Once` for Initialization.** To ensure a piece of code, like initializing a singleton, runs exactly once despite numerous concurrent calls, use `sync.Once` and its `Do` method. It is far more efficient and safer for one-time initialization than a mutex and a boolean flag.
- **Use Channels for Signaling.** A channel of empty structs (`chan struct{}`) is a low-overhead, idiomatic way to signal events between goroutines. Closing such a channel provides an effective broadcast mechanism, unblocking all goroutines currently waiting to receive from it, which is perfect for signaling "done" (see below).
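  The broadcast in miniature:

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  )

  func main() {
  	done := make(chan struct{})
  	var wg sync.WaitGroup

  	for i := 0; i < 3; i++ {
  		wg.Add(1)
  		go func(id int) {
  			defer wg.Done()
  			<-done // blocks until the channel is closed
  			fmt.Println("worker", id, "shutting down")
  		}(i)
  	}

  	close(done) // one close unblocks every waiting receiver
  	wg.Wait()
  }
  ```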
- **Understand False Sharing.** False sharing occurs when goroutines on different CPU cores modify variables located on the same CPU cache line, forcing constant, expensive cache invalidation. Add padding to structs so that concurrently accessed fields land on different cache lines, eliminating this contention, as sketched below.
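  One common padding layout, assuming the typical 64-byte cache line (line size varies by CPU):

  ```go
  package main

  import (
  	"fmt"
  	"sync"
  	"sync/atomic"
  )

  // atomic.Int64 is 8 bytes; the padding rounds each element up to 64,
  // so neighboring shards never share a cache line.
  type paddedCounter struct {
  	n atomic.Int64
  	_ [56]byte
  }

  func main() {
  	var shards [4]paddedCounter
  	var wg sync.WaitGroup
  	for i := range shards {
  		wg.Add(1)
  		go func(i int) {
  			defer wg.Done()
  			for j := 0; j < 1_000_000; j++ {
  				shards[i].n.Add(1) // each core hammers its own line
  			}
  		}(i)
  	}
  	wg.Wait()
  	fmt.Println(shards[0].n.Load())
  }
  ```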
- **Profile for Lock Contention.** In addition to CPU and memory, `pprof` can identify lock contention, where goroutines spend excessive time waiting for a mutex. High contention effectively serializes your parallel code; use the mutex and block profiles to find and redesign these bottlenecks, perhaps using finer-grained locking.
- **Enable Profile-Guided Optimization (PGO).** PGO, generally available since Go 1.21, uses runtime profiles to guide the compiler's optimization decisions, including patterns like interface devirtualization, and recent releases have steadily widened its reach. It is highly recommended for production builds.
- **Leverage Function Inlining.** The compiler automatically inlines small, simple functions, which eliminates function call overhead. Keeping functions short and simple increases the likelihood they will be inlined. Check the compiler's decisions with `-gcflags="-m"`.
- **Enable Bounds Check Elimination.** The compiler can remove slice bounds checks if it can prove an index is safe (e.g., in a standard `for i := range s` loop). Writing idiomatic loops helps the compiler perform this optimization, making loop-heavy code faster.
- **Use Build Tags for Platform-Specific Code.** Use build tags (e.g., `//go:build amd64`) to write optimized code for specific architectures, allowing you to leverage special instruction sets like AVX2 where available, while providing a generic fallback for others.
- **Strip Binaries for Smaller Size.** Use the linker flags `-ldflags="-s -w"` to strip the symbol table and DWARF debugging information from your final compiled binary. This significantly reduces its size, leading to faster container image builds, quicker deployments, and lower storage costs.
- **Statically Link Binaries.** Use the environment variable `CGO_ENABLED=0` during your build to create a pure Go, statically linked binary. This bundles all dependencies into a single file, which simplifies deployment in minimal environments (like a Docker `scratch` image) and avoids the performance overhead of cgo.
- **Analyze Compiler Optimizations.** Use the build flag `-gcflags="-m"` to see what the compiler decided to optimize. This output reveals which functions were inlined and which variables escaped to the heap; bounds-check elimination can be inspected separately with the `-gcflags="-d=ssa/check_bce"` debug flag. These insights are valuable for manual tuning.
- **Use `strings.Builder` for String Concatenation.** Repeatedly using the `+` operator to build a string is highly inefficient because it creates a new string and copies data at each step. `strings.Builder` minimizes allocations by using a resizable internal byte buffer, making it dramatically faster for building strings from multiple pieces (sketch below).
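  A small sketch (the `join` helper is illustrative; `strings.Join` already covers this exact case):

  ```go
  package main

  import (
  	"fmt"
  	"strings"
  )

  func join(parts []string) string {
  	var b strings.Builder
  	b.Grow(64) // optional: pre-size when the final length is roughly known
  	for i, p := range parts {
  		if i > 0 {
  			b.WriteByte(',')
  		}
  		b.WriteString(p)
  	}
  	return b.String() // hands over the buffer without another copy
  }

  func main() {
  	fmt.Println(join([]string{"a", "b", "c"})) // a,b,c
  }
  ```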
- **Choose an Appropriate Map Initial Size.** If you have an estimate of the number of items a map will hold, provide it as a capacity hint: `make(map[K]V, size)`. This pre-allocates the necessary memory upfront, preventing the map from being incrementally resized and re-hashed as you add elements.
- **Cache Computed Values (Memoization).** Memoization stores the results of expensive function calls in a cache (like a map). When the same inputs occur again, you return the cached result instead of re-computing it, trading a small amount of memory for a significant reduction in computation time.
- **Use Lookup Tables.** Instead of running a complex calculation or a multi-case `switch` statement repeatedly, pre-compute the results for all possible inputs and store them in a slice or map. This transforms an expensive computation into a very fast, constant-time memory lookup.
- **Optimize Hot Paths First.** Don't waste time optimizing code that rarely runs. Use profiling tools like `pprof` to identify the "hot paths": the small number of functions where your program spends the vast majority of its time. Focusing your optimization efforts on these specific bottlenecks will yield the greatest gains.
- **Avoid Reflection in Performance-Critical Code.** Go's `reflect` package is powerful for dynamic programming but notoriously slow, because it must inspect and manipulate types at runtime. In performance-sensitive code, avoid reflection and consider faster alternatives like code generation, type assertions, or generics.
- **Batch Operations.** Processing items in batches amortizes fixed overheads from function calls, system calls, or network round-trips. Grouping items (e.g., sending multiple database inserts in one query) is a fundamental strategy for increasing throughput in any system.
- **Use Slice Tricks for Efficient Operations.** Go's slice mechanics allow for powerful and efficient in-place operations that can avoid new memory allocations. Learning common idioms, such as in-place filtering or deleting an element from a slice, is key to writing high-performance Go code.
- **Use Bitfields for Boolean Flags.** Instead of a `struct` with multiple `bool` fields (each taking at least one byte), you can pack many boolean flags into a single integer using bitwise operations. This can dramatically reduce memory usage, especially for large collections of objects with many flags; see the sketch below.
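  For example, hypothetical user flags packed into one byte:

  ```go
  package main

  import "fmt"

  type Flags uint8

  const (
  	FlagActive Flags = 1 << iota
  	FlagAdmin
  	FlagVerified
  )

  func main() {
  	f := FlagActive | FlagVerified                     // set two flags
  	f &^= FlagActive                                   // clear one
  	fmt.Println(f&FlagVerified != 0, f&FlagAdmin != 0) // true false
  }
  ```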
- **Use Generics for Type-Safe, Allocation-Free Containers.** Since their introduction in Go 1.18, generics have become a powerful tool for writing reusable data structures (e.g., lists, trees) that are both fully type-safe and free of the heap allocations that interface-based designs previously required.
- **Hoist Calculations Out of Loops.** If a calculation inside a loop produces the same result in every iteration, it's inefficient to recompute it. Perform the calculation once before the loop begins and store the result in a variable. This guarantees the optimization and can also improve the code's clarity.
- **Use Buffered I/O.** Wrap raw `io.Reader` or `io.Writer` values with `bufio` to minimize expensive system calls. Reading or writing in larger chunks, instead of one byte or one small write at a time, dramatically improves I/O performance.
- **Reuse HTTP Clients.** A long-lived `http.Client` maintains a pool of persistent TCP connections, allowing it to reuse them for subsequent requests to the same host (HTTP keep-alive). Creating a new client for each request is inefficient, as it forces a new TCP and TLS handshake every time.
- **Set Appropriate Timeouts.** Never make network calls or perform I/O without a timeout. A slow or unresponsive remote service can cause your goroutines to hang indefinitely, consuming system resources like memory and file descriptors. Timeouts are critical for building resilient and responsive applications.
- **Use `io.Copy` with Buffer Pools.** `io.Copy` is highly optimized for transferring data between a reader and a writer. You can make it even more efficient by using `io.CopyBuffer` in conjunction with a `sync.Pool` that provides the buffer, avoiding a new allocation for every copy operation; a sketch follows.
- **Implement Connection Pooling.** Establishing network connections to databases or other services is slow and resource-intensive. A connection pool maintains a set of open, reusable connections, which amortizes the high setup cost across many requests and significantly improves application throughput and latency.
- **Use `io.ReaderFrom` and `io.WriterTo`.** When you use `io.Copy`, it internally checks whether the source or destination implements these interfaces. If so, `io.Copy` delegates to them for a more efficient copying strategy, such as using the `sendfile` syscall to avoid user-space buffers entirely.
- **Avoid `io.ReadAll` for Large Data.** Reading an entire file or HTTP request body into memory with `io.ReadAll` can easily lead to high memory consumption and out-of-memory errors. Instead, process data in a streaming fashion through the provided `io.Reader` to keep memory usage low and constant, as shown below.
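  For instance, counting lines without ever holding the whole file (the filename is a placeholder; note `bufio.Scanner` has a default per-line limit of 64 KiB):

  ```go
  package main

  import (
  	"bufio"
  	"fmt"
  	"log"
  	"os"
  )

  func main() {
  	f, err := os.Open("big.log")
  	if err != nil {
  		log.Fatal(err)
  	}
  	defer f.Close()

  	lines := 0
  	sc := bufio.NewScanner(f) // streams the file; memory stays flat
  	for sc.Scan() {
  		lines++
  	}
  	if err := sc.Err(); err != nil {
  		log.Fatal(err)
  	}
  	fmt.Println(lines, "lines")
  }
  ```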
- **Minimize File System `stat` Calls.** A `stat` system call, which checks a file's existence and metadata, can be surprisingly slow, especially on network file systems or cloud storage. If possible, cache the results of these calls or design your application to avoid redundant checks in tight loops.
- **Use the `slog` Package for Structured Logging.** The `log/slog` package, in the standard library since Go 1.21, is designed for high-performance structured logging. It avoids reflection-heavy formatting and produces machine-readable output that is easy to parse and query, reducing observability overhead.
- **Use Protocol Buffers or gRPC for Internal Services.** For server-to-server communication, text-based formats like JSON can be slow to serialize and parse. Binary protocols like Protocol Buffers are much more compact and CPU-efficient, leading to lower network latency, reduced bandwidth usage, and higher overall throughput.
- **Understand CPU Cache Lines.** Sequential memory access is fast because it maximizes the use of data automatically prefetched by the CPU into its cache. Random memory access patterns frequently lead to cache misses, which are very slow, as they force the CPU to wait for data from main memory.
- **Leverage SIMD Instructions.** For heavy numerical computations, such as in machine learning or data analysis, use libraries or Go assembly that leverage Single Instruction, Multiple Data (SIMD) instructions. This lets the CPU perform the same operation on multiple data points simultaneously, offering massive speedups.
- **Use `cgo` Sparingly.** Calling C code from Go via `cgo` has significant overhead due to the need to switch between Go's goroutine scheduler and the system's thread scheduler. If you must use `cgo`, batch calls into larger chunks to minimize the number of transitions between the two worlds.
- **Write Custom Go Assembly.** For the absolute most performance-critical parts of an application, such as cryptography or advanced mathematics, you can write functions directly in Go's assembler dialect. This gives you complete control over CPU instructions but is complex, non-portable, and hard to maintain.
- **Use `unsafe.Pointer` for Zero-Copy Conversions.** With extreme care, the `unsafe` package can reinterpret memory as a different type without making a copy, such as converting a `[]byte` to a `string`. Since Go 1.20, `unsafe.String` and `unsafe.SliceData` make such conversions slightly less error-prone, but they still bypass Go's type-safety guarantees and can lead to subtle, dangerous bugs.
- **Tune `GOMAXPROCS`.** This environment variable controls the maximum number of OS threads that can execute Go code simultaneously. The default, the number of available CPU cores, is almost always optimal. Only tune it for a specific, measurable reason, such as matching a container's CPU quota.
- **Use Memory-Mapped Files for Large Data.** Memory-mapping a file, for example with `golang.org/x/exp/mmap`, lets you access its contents as if it were an in-memory slice. The operating system loads pages of the file on demand, providing efficient random access to large files without high RAM usage.
- **Use `syscall` or `x/sys` for Direct OS Interaction.** The standard `os` package provides a portable, high-level interface to operating system functionality. For maximum performance in scenarios like high-speed networking, using the `syscall` or `golang.org/x/sys` packages directly can bypass abstractions and reduce overhead.
- **Tune Kernel TCP Settings.** For network-heavy applications, tuning kernel-level TCP settings like buffer sizes (`net.core.rmem_max`, `net.core.wmem_max`) and enabling modern congestion control algorithms like TCP BBR can significantly improve network throughput and reduce latency for your users.
- **Avoid `go:linkname`.** This compiler directive allows linking to private functions in other packages. While powerful for hacking, it is extremely brittle, breaks Go's encapsulation guarantees, and is likely to fail with new Go versions. Avoid it in production code in favor of safer alternatives.
- **Design APIs for Batching.** Provide API endpoints that accept and process multiple items or operations in a single request. This is far more efficient than forcing clients to make many small, individual requests, as it dramatically reduces network round-trip overhead and allows for more efficient server-side processing.
- **Separate Read/Write Concerns (CQRS).** In high-load systems, separating the data models and code paths for reads (queries) and writes (commands) lets you optimize each path independently. For example, you can create denormalized, pre-computed read models optimized for extremely fast lookups.
- **Design for Statelessness.** Stateless services are significantly easier to scale horizontally. By avoiding server-side session state, you can add or remove instances without worrying about session affinity or data migration, which lets load balancers distribute traffic efficiently and simplifies your architecture.
- **Implement Circuit Breakers.** In a microservices architecture, a circuit breaker prevents a client from repeatedly calling a service that is failing. This stops cascading failures across your system and gives the troubled service time to recover, improving overall resilience and performance under partial failure.
- **Cache Aggressively.** Cache data at every sensible layer: in-memory with a TTL for hot data, in a distributed cache like Redis for shared state, or at the edge with a CDN for static assets. Caching is often the single most effective performance optimization.
- **Use Message Queues for Asynchronous Work.** For long-running or resource-intensive tasks triggered by a user request, don't make the user wait. Offload the work to a background worker process via a message queue (like RabbitMQ or NATS) and return a response to the user immediately.
- **Denormalize Data for Fast Reads.** While database normalization is crucial for data integrity, it can lead to slow queries that require many joins. For performance-critical read paths, create denormalized views or tables that are pre-joined and optimized for very fast, simple lookups.
- **Implement Pagination for Large Datasets.** Never return thousands of records in a single API response. Implement pagination (offset-based or cursor-based) to return data in manageable chunks. This improves response time, reduces memory usage on both server and client, and lowers network bandwidth.
- **Prefer HTTP/2 or HTTP/3.** These newer HTTP versions multiplex many requests and responses over a single connection concurrently, which reduces latency, especially for applications that load many small assets. HTTP/2 still suffers TCP-level head-of-line blocking, which HTTP/3's QUIC transport eliminates.
- **Consider Materialized Views.** For complex, slow-running queries that execute frequently with the same parameters, a materialized view pre-computes and stores the results in a physical table. This transforms the expensive, on-the-fly query into a simple and extremely fast `SELECT` operation.
- **Break Down Large Functions.** Small, well-defined functions are easier for the Go compiler to analyze, optimize, and potentially inline. They are also significantly easier for humans to read, test, and maintain, which indirectly contributes to better-performing, more reliable code over the long term.
- **Defer `Unlock` Immediately After `Lock`.** The `mutex.Lock(); defer mutex.Unlock()` pattern is both idiomatic and safe in Go. It guarantees that the mutex is always released, even if the function panics or has multiple return paths. This is a crucial practice for preventing deadlocks in concurrent programs.
- **Use Type Switches Over Interface Method Calls in Some Cases.** In a tight loop where you need to act on an interface's underlying concrete type, a type switch can sometimes be faster than a standard method call, because the compiler may optimize it better than fully dynamic dispatch.
- **Avoid Global Variables.** Global variables, especially mutable ones, introduce hidden dependencies between different parts of your program and create hard-to-debug contention points in concurrent code. It is far better to pass dependencies explicitly to the functions that need them.
- **Write Benchmarks for Everything Critical.** Do not optimize based on assumptions. Use Go's built-in `testing` package to write benchmarks for your critical code paths. This lets you scientifically measure the impact of changes, validate improvements, and prevent future performance regressions; a minimal example follows.
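  A minimal benchmark, placed in a `_test.go` file and run with `go test -bench=. -benchmem`:

  ```go
  package demo

  import (
  	"strings"
  	"testing"
  )

  func BenchmarkJoin(b *testing.B) {
  	parts := []string{"alpha", "beta", "gamma"}
  	b.ReportAllocs() // report allocations alongside ns/op
  	for i := 0; i < b.N; i++ {
  		_ = strings.Join(parts, ",")
  	}
  }
  ```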
- **Pin Dependency Versions.** Always commit your `go.mod` and `go.sum` files to lock your project's dependency versions. This keeps builds fully reproducible and prevents unexpected performance changes or bugs introduced by a transitive dependency's silent update.
- **Use Table-Driven Tests for Comprehensive Benchmarks.** The idiomatic table-driven test pattern is also excellent for benchmarking. It makes it easy to measure your code's performance across a wide range of inputs and sizes, helping you spot edge cases where performance degrades unexpectedly.
- **Keep Your Go Version Updated.** Each new Go release brings improvements to the compiler, runtime, and garbage collector. Simply recompiling your existing code with a newer toolchain can often yield a free, effortless performance boost.
- **Write Clear, Simple Code.** Complex, clever code is hard to reason about, both for humans and for the Go compiler. Simple, idiomatic Go is often the most performant, because the compiler understands these common patterns best and can optimize them far more effectively.
- **Use Linters and Static Analysis Tools.** Tools like `staticcheck` can automatically find common performance issues and anti-patterns, such as unnecessary string-to-byte conversions inside a loop or inefficient uses of standard library functions, that are easy to miss during a manual code review.
- **Use Streaming Encoders and Decoders.** Instead of loading a large JSON payload into memory, use `json.NewEncoder(writer)` and `json.NewDecoder(reader)`. This processes the data as a stream, drastically reducing RAM usage for large datasets and improving latency by sending data as it becomes available; see the sketch below.
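  Sketched as an HTTP handler round-trip (the `Event` type is made up):

  ```go
  package main

  import (
  	"encoding/json"
  	"log"
  	"net/http"
  )

  type Event struct {
  	ID   int    `json:"id"`
  	Name string `json:"name"`
  }

  func handler(w http.ResponseWriter, r *http.Request) {
  	var ev Event
  	// Decode straight from the body; no intermediate []byte.
  	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
  		http.Error(w, err.Error(), http.StatusBadRequest)
  		return
  	}
  	// Encode straight to the ResponseWriter as well.
  	if err := json.NewEncoder(w).Encode(ev); err != nil {
  		log.Println(err)
  	}
  }

  func main() {
  	http.HandleFunc("/event", handler)
  	log.Fatal(http.ListenAndServe(":8080", nil))
  }
  ```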
- **Generate Code for Serialization.** Employ tools like `easyjson` or protobuf code generators that emit marshalling and unmarshalling code at compile time. This avoids the runtime overhead of the reflection used by standard encoders, resulting in significantly faster serialization and deserialization.
- **Use Gob for Go-to-Go Communication.** When communicating between Go services, the standard `encoding/gob` package is often much faster than JSON or XML. It's a binary encoding format designed specifically for Go data types, making it highly efficient for RPCs or for caching Go structs in systems like Redis.
- **Leverage Hardware-Accelerated Crypto.** Use modern cryptographic algorithms like AES-GCM: Go's standard library implementation takes direct advantage of hardware acceleration (AES-NI) on most modern CPUs, making encryption and decryption dramatically faster than pure-software implementations.
- **Use Streaming Ciphers for Large Data.** For encrypting large files or network streams, use a streaming construction such as `cipher.Stream` from the `crypto/cipher` package. This lets you encrypt data in small chunks as it becomes available, avoiding the need to load the entire content into memory first.
- **Reuse Cryptographic Objects.** The setup for cryptographic ciphers can be computationally expensive; creating a new `cipher.AEAD`, for instance, involves key setup. Where possible, reuse these objects for multiple encryption or decryption operations with the same key to amortize the setup cost.
- **Limit Concurrent File Descriptors.** When processing many files concurrently, use a semaphore (e.g., a buffered channel of `struct{}`) to limit the number of simultaneously open file descriptors. This prevents your application from hitting the operating system's `ulimit` and avoids performance degradation from resource exhaustion; a sketch follows.
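  A sketch with hypothetical file paths and a cap of 64 open files:

  ```go
  package main

  import (
  	"fmt"
  	"os"
  	"sync"
  )

  func main() {
  	paths := []string{"a.txt", "b.txt", "c.txt"} // placeholders
  	sem := make(chan struct{}, 64)               // at most 64 files open at once

  	var wg sync.WaitGroup
  	for _, p := range paths {
  		wg.Add(1)
  		go func(path string) {
  			defer wg.Done()
  			sem <- struct{}{}        // acquire a slot
  			defer func() { <-sem }() // release it
  			data, err := os.ReadFile(path)
  			if err != nil {
  				fmt.Println(err)
  				return
  			}
  			fmt.Println(path, len(data), "bytes")
  		}(p)
  	}
  	wg.Wait()
  }
  ```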
- **Use `os.ReadDir` over Deprecated `ioutil`.** For reading directory contents, use `os.ReadDir` (stable since Go 1.16). It returns a slice of `os.DirEntry` and is more efficient than the older, deprecated `ioutil.ReadDir` because it does not `stat` every file.
- **Choose Efficient Compression Algorithms.** Select a compression algorithm based on your specific needs for speed versus size. `zstd` (available via third-party packages) often balances compression ratio and performance better than `gzip`. The standard library's `flate`-based packages also offer compression levels, where a lower level is faster but compresses less.
- **Bypass OS Page Cache with `O_DIRECT` When Needed.** For specialized applications like databases that manage their own caching, opening files with `O_DIRECT` via syscalls bypasses the OS page cache. This avoids double-caching data in both your application and the OS and gives you more direct control over I/O, though it requires careful memory alignment.