When Your CS Professor Was Wrong: Arrays Beat Sets (Sometimes)


Or: How I learned to stop worrying and love O(n)

Remember that smug feeling when you learned about Big O notation? When you could finally look at someone using array.includes() and think "pfft, O(n) peasant, I'll use a Set for O(1) lookup"?

Yeah, well, your CPU wants to have a word with you.


The Setup: A Simple Benchmark Gone Wrong


I was building a performance test comparing Arrays vs Sets for unique character collection. Classic textbook scenario: you need to check if elements exist in a collection. Set should dominate, right?

// The "smart" way
function findWithSet(iter) {
  let set = new Set()
  while (set.size < targetCount) {
    const char = iter()
    set.add(char)
    // ... logic
  }
}

// The "naive" way  
function findWithArray(iter) {  
  let arr = []
  while (arr.length < targetCount) {
    const char = iter()
    if (arr.includes(char)) {
      // ... reset logic
    }
    arr.push(char)
  }
}
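
To actually run these you need an iterator and a target count. Here's a minimal harness sketch with stand-ins for both; the lowercase alphabet, the target of 20, and the 10,000-run timing loop are my assumptions here, not the exact setup from the benchmark:

// Minimal harness sketch. randomChar and the counts below are assumed stand-ins.
const letters = 'abcdefghijklmnopqrstuvwxyz'
const randomChar = () => letters[Math.floor(Math.random() * letters.length)]

function time(label, fn) {
  const start = performance.now()        // global in browsers and Node 16+
  for (let i = 0; i < 10_000; i++) fn()
  console.log(label, (performance.now() - start).toFixed(2), 'ms')
}

time('Set  ', () => findWithSet(randomChar, 20))
time('Array', () => findWithArray(randomChar, 20))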


I expected Set to crush Array. Instead, Array was consistently 2-3x faster.

Record scratch. Freeze frame.


Plot Twist: Your CPU Doesn't Read Textbooks

Here's what actually happens when you search a small array versus a small Set:

Array (10 elements):

Memory: [a][b][c][d][e][f][g][h][i][j]
CPU:    "Cool, I'll load all 10 bytes in one cache line"
Search: SIMD compare across multiple elements simultaneously  
Time:   ~3 CPU cycles

Set (10 elements):

Memory: hash_table -> bucket_array -> linked_lists
CPU:    "Let me hash this, find the bucket, traverse..."
Search: hash(char) -> bucket[hash % size] -> linear probe/chain
Time:   ~15 CPU cycles + memory indirection penalties

The "algorithmically superior" Set is doing more work for small datasets.


The Hardware Reality Check


Modern CPUs are ridiculously optimized for linear memory access:

  1. Cache Lines: Your CPU loads 64 bytes at once. A 50-element array fits entirely in L1 cache.
  2. SIMD Instructions: Compare 16+ bytes simultaneously with single instructions.
  3. Prefetching: CPU predicts linear access patterns and loads ahead.
  4. Branch Prediction: Simple loops are heavily optimized.

Sets, meanwhile, involve:

  • Hash function computation (expensive for single characters)
  • Memory indirection (cache misses)
  • Bucket management overhead
  • Collision resolution complexity


When Theory Meets Practice


I ran progressive tests from 1 to 1000 elements. The results were eye-opening:

  • 1-20 elements: Array wins by 2-4x
  • 20-100 elements: Array still faster
  • 100-300 elements: Roughly equal
  • 300+ elements: Set starts winning
  • 1000+ elements: Set dominates


The crossover point depends on your data, but it's much higher than most developers expect.
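
If you want to reproduce the sweep, a sketch like the one below is enough. The sizes, the 100,000 lookups per size, and the worst-case probe are my choices rather than the exact harness from my tests, and the crossover will land in a different place on your hardware and engine:

// Rough sweep sketch (assumed sizes and iteration counts; your numbers will differ).
const sizes = [10, 50, 100, 300, 1000, 5000]

for (const size of sizes) {
  const items = Array.from({ length: size }, (_, i) => 'item' + i)
  const itemSet = new Set(items)
  const probe = 'item' + (size - 1) // worst case for the linear scan
  let hits = 0                      // keep results live so the JIT can't skip the work

  let start = performance.now()
  for (let i = 0; i < 100_000; i++) if (items.includes(probe)) hits++
  const arrayMs = performance.now() - start

  start = performance.now()
  for (let i = 0; i < 100_000; i++) if (itemSet.has(probe)) hits++
  const setMs = performance.now() - start

  console.log(size, 'array:', arrayMs.toFixed(1), 'ms', 'set:', setMs.toFixed(1), 'ms', hits)
}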


Real-World Implications


This matters more than you think:

Use Arrays for:

  • Command line flags/options
  • Small enums or constants
  • Feature toggles
  • Small lookup tables
  • Validation lists under ~100 items

Use Sets for:

  • Large collections (1000+ items)
  • Frequent add/remove operations
  • When you actually need Set semantics
  • Dynamic collections of unknown size (a quick sketch follows)
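
To make the Set side concrete, here's a hypothetical example; the event shape and names are made up for illustration. The point is a large, growing collection where you genuinely need uniqueness and cheap membership checks:

// Hypothetical: deduplicating user IDs from a large, growing event stream.
const seenUsers = new Set()

function recordVisit(event) {
  seenUsers.add(event.userId)   // no "did I already have this?" bookkeeping needed
}

function hasVisited(userId) {
  return seenUsers.has(userId)  // stays fast even with millions of entries
}

// With an array you'd pay a full scan on every add *and* every membership check.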


The JavaScript Engine Twist


V8 (Chrome/Node) heavily optimizes small array operations:

// This gets compiled to highly optimized native code
const flags = ['--verbose', '--debug', '--production']
if (flags.includes(userFlag)) { /* blazingly fast */ }

// This involves hash tables and complex data structures
const flagSet = new Set(['--verbose', '--debug', '--production'])  
if (flagSet.has(userFlag)) { /* "theoretically" faster */ }


The engine recognizes small, stable arrays and uses specialized code paths that can be faster than general-purpose data structures.


The Lesson: Profile, Don't Assume


Computer Science theory gives us mental models, but hardware reality is messier:

  • Constant factors often matter more than algorithmic complexity
  • Cache behavior trumps Big O for small datasets
  • Modern CPUs do incredible optimizations for simple patterns
  • The crossover points are data and hardware dependent

Before reaching for the "algorithmically superior" solution, ask:

  • How big is my dataset really?
  • Am I optimizing for the problem I have or the one I imagine?
  • What does the profiler actually say?


Epilogue: Embrace the Contradiction


Your CS professor wasn't wrong - Big O notation is crucial for understanding scalability. But they probably didn't mention that sometimes O(n) beats O(1) because reality is beautifully complex.

The next time someone asks why you're using an array instead of a set for a small collection, just smile and say: "Because my CPU likes its data served contiguously, thank you very much."

P.S. - This is why microbenchmarks exist. Theory guides us, but silicon gets the final vote.

Want to see this in action? I built an interactive benchmark that lets you explore the crossover points yourself. Warning: may cause existential crisis about everything you thought you knew about data structures.


©️ learned from ThePrimeAgen course