Elixir Error Handling Explained: Why ‘Let It Crash’ Works

It was 3 AM when the alerts started flooding in. The e-commerce platform was down, customers couldn’t complete purchases, and every minute meant thousands of dollars in lost revenue. The culprit? A database connection timeout that cascaded through the entire system, leaving services in inconsistent states despite hundreds of lines of carefully crafted error handling code.

This scenario plays out every day in companies worldwide. We’ve been taught that good programming means handling every possible error condition, building defensive walls around our code, and preparing for every imaginable failure scenario. But what if I told you this approach is fundamentally flawed?

Welcome to Elixir’s revolutionary “Let It Crash” philosophy — a counterintuitive approach that has enabled systems to achieve 99.999% uptime by embracing failure instead of fighting it.

The Defensive Programming Trap

Traditional programming languages have conditioned us to write defensive code. Let’s examine how a typical database service looks in Java:

public class DatabaseService {
    private Connection connection;
    private int retryCount = 0;
    private static final int MAX_RETRIES = 3;
    
    public String query(String sql) {
        try {
            // Check if connection is valid
            if (connection == null || connection.isClosed()) {
                reconnect();
            }
            
            // Validate input
            if (sql == null || sql.trim().isEmpty()) {
                throw new IllegalArgumentException("SQL cannot be empty");
            }
            
            // Execute query with retry logic
            return executeWithRetry(sql);
            
        } catch (SQLException e) {
            handleSQLException(e);
            return null;
        } catch (IllegalArgumentException e) {
            logger.error("Invalid SQL: " + sql, e);
            return null;
        } catch (Exception e) {
            logger.error("Unexpected error", e);
            // Now what? The service might be in an inconsistent state
            return null;
        }
    }
    
    private String executeWithRetry(String sql) throws SQLException {
        SQLException lastException = null;
        
        for (int i = 0; i < MAX_RETRIES; i++) {
            try {
                PreparedStatement stmt = connection.prepareStatement(sql);
                ResultSet rs = stmt.executeQuery();
                retryCount = 0; // Reset on success
                return processResultSet(rs);
                
            } catch (SQLException e) {
                lastException = e;
                retryCount++;
                
                if (i < MAX_RETRIES - 1) {
                    try {
                        Thread.sleep(1000 * (i + 1)); // Linear backoff between retries
                        reconnect();
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new SQLException("Interrupted during retry", ie);
                    }
                }
            }
        }
        
        throw lastException;
    }
    
    private void handleSQLException(SQLException e) {
        // Complex error handling logic
        switch (e.getErrorCode()) {
            case 1205: // Deadlock
                // Log and retry?
                break;
            case 2006: // MySQL server has gone away
                // Reconnect?
                break;
            default:
                // Log and hope for the best?
        }
    }
}

Look at this code. We have:

  • Complex error handling scattered throughout
  • Manual state management that’s prone to bugs
  • No clear recovery strategy for unexpected errors
  • Error handling logic that itself can fail
  • A service that becomes increasingly fragile as edge cases multiply
Go’s Slightly Better Approach

Go encourages explicit error handling, which feels safer:

type DatabaseService struct {
    conn       *sql.DB
    retryCount int
    maxRetries int
}

func (db *DatabaseService) Query(sqlQuery string) (string, error) {
    // Validate input
    if sqlQuery == "" {
        return "", fmt.Errorf("SQL query cannot be empty")
    }
    
    // Check connection
    if err := db.conn.Ping(); err != nil {
        if reconnectErr := db.reconnect(); reconnectErr != nil {
            return "", fmt.Errorf("failed to reconnect: %w", reconnectErr)
        }
    }
    
    // Execute with retry
    for i := 0; i < db.maxRetries; i++ {
        result, err := db.executeQuery(sqlQuery)
        if err == nil {
            db.retryCount = 0
            return result, nil
        }
        
        // Handle specific errors
        if isRetryableError(err) && i < db.maxRetries-1 {
            time.Sleep(time.Duration(i+1) * time.Second)
            continue
        }
        
        return "", fmt.Errorf("query failed after %d retries: %w", i+1, err)
    }
    
    return "", fmt.Errorf("unexpected: should not reach here")
}
func isRetryableError(err error) bool {
    // Complex logic to determine if error is retryable
    // What if we miss an edge case?
    return strings.Contains(err.Error(), "connection reset") ||
           strings.Contains(err.Error(), "timeout")
}

Better than exceptions, but we still have:

  • Extensive error categorization requirements
  • Manual, error-prone state management
  • No automatic recovery from unexpected scenarios
  • Bug-prone error handling logic
Rust’s Type-Safe Complexity

Rust uses Result types for safer error handling:

use std::time::Duration;
use tokio::time::sleep;

pub struct DatabaseService {
    connection: Option<DbConnection>,
    retry_count: u32,
    max_retries: u32,
}
impl DatabaseService {
    pub async fn query(&mut self, sql: &str) -> Result<String, DatabaseError> {
        // Validate input
        if sql.is_empty() {
            return Err(DatabaseError::InvalidInput("SQL cannot be empty".to_string()));
        }
        
        // Ensure connection
        self.ensure_connection().await?;
        
        // Execute with retry logic
        let mut last_error = None;
        
        for attempt in 0..self.max_retries {
            match self.execute_query(sql).await {
                Ok(result) => {
                    self.retry_count = 0;
                    return Ok(result);
                }
                Err(e) if e.is_retryable() && attempt < self.max_retries - 1 => {
                    last_error = Some(e);
                    sleep(Duration::from_secs(u64::from(attempt) + 1)).await;
                    self.reconnect().await?;
                }
                Err(e) => return Err(e),
            }
        }
        
        Err(last_error.unwrap_or(DatabaseError::Unknown))
    }
    
    // Still need complex error categorization and state management...
}
#[derive(Debug)]
pub enum DatabaseError {
    ConnectionLost,
    InvalidInput(String),
    Timeout,
    Unknown,
    // Need to handle every possible error type...
}
impl DatabaseError {
    fn is_retryable(&self) -> bool {
        matches!(self, DatabaseError::ConnectionLost | DatabaseError::Timeout)
        // What if we miss a retryable error type?
    }
}

Even with Rust’s type safety:

  • We must explicitly enumerate all possible errors
  • State management across async boundaries is complex
  • Error categorization is still manual and fallible
  • Recovery logic remains our responsibility
The Revolutionary Alternative: Elixir’s “Let It Crash”

Now, let’s see how Elixir approaches the same problem. Prepare to have your assumptions challenged:

defmodule DatabaseWorker do
  use GenServer

  # Client API - Clean and simple
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end
  # Server implementation - No error handling!
  def init(_opts) do
    {:ok, %{connection: establish_connection()}}
  end
  def handle_call({:query, sql}, _from, state) do
    # No defensive programming here - just do the work
    result = execute_dangerous_query(state.connection, sql)
    {:reply, result, state}
  end
  # Helper functions that might crash
  defp establish_connection do
    # This might fail - and that's OK!
    :database.connect("localhost", 5432)
  end
  defp execute_dangerous_query(connection, sql) do
    # This might crash - and that's OK too!
    :database.query(connection, sql)
  end
end

That’s exactly the point! This code looks “unsafe” to developers from other languages, but it’s actually more robust. Here’s the secret weapon:

The Supervisor: Your Automatic Safety Net
defmodule DatabaseSupervisor do
  use Supervisor
  
  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def init(_opts) do
    children = [
      {DatabaseWorker, []}
    ]
    # If DatabaseWorker crashes, restart it immediately
    Supervisor.init(children, strategy: :one_for_one)
  end
end

When DatabaseWorker crashes (and it will), the supervisor automatically:

  1. Detects the crash instantly (within microseconds)
  2. Starts a fresh DatabaseWorker process
  3. Provides a clean, uncorrupted state
  4. Enables clients to simply retry their requests (a small retry sketch follows this list)
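
What does “simply retry” look like from the caller’s side? Here is one minimal, hypothetical sketch (not part of the modules above): GenServer.call exits if the target process is down, so the caller catches that exit once, at the boundary, and tries again after the supervisor has restarted the worker.

defmodule DatabaseClient do
  # Hypothetical retry helper. The worker stays free of defensive code;
  # the one place that knows about retries is this thin boundary.
  def query(sql, retries \\ 3) do
    try do
      DatabaseWorker.query(sql)
    catch
      :exit, _reason when retries > 0 ->
        Process.sleep(100)   # give the supervisor a moment to restart the worker
        query(sql, retries - 1)
    end
  end
end
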
See It in Action

Here’s a little demo I made to show how Elixir recovers compared to languages that lean on defensive programming.

Making It Smarter: State Recovery

Now let’s enhance our Elixir worker to be even more resilient by adding state persistence:

defmodule SmartDatabaseWorker do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end
  def get_stats do
    GenServer.call(__MODULE__, :get_stats)
  end
  def init(_opts) do
    # Recover any previous state
    previous_stats = recover_stats()
    
    state = %{
      connection: establish_connection(),
      query_count: previous_stats.query_count,
      error_count: previous_stats.error_count
    }
    
    Logger.info("DatabaseWorker started with recovered stats: #{inspect(previous_stats)}")
    {:ok, state}
  end
  def handle_call({:query, sql}, _from, state) do
    # Still no defensive programming - just do the work
    result = execute_query(state.connection, sql)
    
    new_state = %{state | query_count: state.query_count + 1}
    backup_stats(new_state)
    
    {:reply, result, new_state}
  end
  def handle_call(:get_stats, _from, state) do
    stats = %{
      query_count: state.query_count,
      error_count: state.error_count,
      uptime: :erlang.system_time(:millisecond)
    }
    {:reply, stats, state}
  end
  # This runs when the process is about to die
  def terminate(reason, state) do
    Logger.info("DatabaseWorker terminating: #{inspect(reason)}")
    backup_stats(state)
    :ok
  end
  defp establish_connection do
    # This might fail - let it crash!
    :database.connect("localhost", 5432)
  end
  defp execute_query(connection, sql) do
    # This might fail too - let it crash!
    :database.query(connection, sql)
  end
  defp backup_stats(state) do
    stats = %{
      query_count: state.query_count,
      error_count: state.error_count,
      last_backup: :erlang.system_time(:millisecond)
    }
    :persistent_term.put({__MODULE__, :stats}, stats)
  end
  defp recover_stats do
    case :persistent_term.get({__MODULE__, :stats}, nil) do
      nil -> %{query_count: 0, error_count: 0}
      stats -> stats
    end
  end
end

Notice what happened here: We added state persistence with just a few lines of code. When the worker crashes and restarts, it seamlessly continues from where it left off. No complex error handling, no defensive programming — just clean separation of concerns.
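
To watch the recovery happen, here is a hypothetical iex session (it assumes SmartDatabaseWorker is running under a one_for_one supervisor like the one shown earlier; the values are purely illustrative):

iex> SmartDatabaseWorker.query("SELECT * FROM users")
{:ok, [%{id: 1, name: "Ada"}]}

iex> SmartDatabaseWorker.get_stats()
%{query_count: 1, error_count: 0, uptime: 1712345678901}

# Simulate a hard crash of the worker process
iex> Process.exit(Process.whereis(SmartDatabaseWorker), :kill)
true

# The supervisor has already restarted it, and the stats recovered
# from :persistent_term survived the crash
iex> SmartDatabaseWorker.get_stats()
%{query_count: 1, error_count: 0, uptime: 1712345679204}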

The Complete Resilient System
defmodule ResilientDatabaseSystem do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(_opts) do
    children = [
      # Primary database worker (explicit child id so two copies of the same
      # module can run under this supervisor)
      Supervisor.child_spec({SmartDatabaseWorker, [name: :primary]}, id: :primary_db),

      # Backup database worker
      Supervisor.child_spec({SmartDatabaseWorker, [name: :backup]}, id: :backup_db),
      
      # Load balancer that routes requests
      {DatabaseLoadBalancer, []},
      
      # Health monitor
      {DatabaseHealthMonitor, []}
    ]
    # Allow up to 5 restarts within 60 seconds before the supervisor itself gives up
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 60)
  end
end

Now we have a system where:

  • Individual workers can crash and restart instantly
  • State is preserved across restarts
  • Multiple workers provide redundancy
  • The load balancer routes around failed workers (a minimal sketch follows below)
  • Health monitoring provides observability

All with zero defensive programming in the business logic!
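
DatabaseLoadBalancer and DatabaseHealthMonitor are named above but not shown. As a rough idea of the former, a minimal sketch might be a GenServer that forwards each call to the first worker that is currently alive (this assumes each SmartDatabaseWorker registers under the name it is given, e.g. :primary and :backup):

defmodule DatabaseLoadBalancer do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end

  def init(_opts) do
    {:ok, %{workers: [:primary, :backup]}}
  end

  def handle_call({:query, sql}, _from, state) do
    # Skip workers that are mid-restart; if every worker is down, this call
    # crashes too and the supervisor restarts the load balancer itself.
    worker = Enum.find(state.workers, fn name -> Process.whereis(name) != nil end)
    {:reply, GenServer.call(worker, {:query, sql}), state}
  end
end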

The Fundamental Mindset Shift

The difference between traditional programming and Elixir’s approach is philosophical:

Traditional Approach: “I must prevent all possible errors”
  • Results in complex, brittle code
  • Error handling code often contains bugs
  • Unexpected errors still cause system-wide failures
  • Recovery is manual and slow
  • Code becomes harder to understand and maintain
Elixir Approach: “Errors will happen, design for recovery”
  • Results in simple, robust code
  • Error handling is centralized in supervisors
  • Any error triggers a clean restart
  • Recovery is automatic and lightning-fast
  • Business logic stays focused on business problems
Why This Actually Works Better
  1. Simplicity: Your business logic stays clean and focused on solving business problems, not handling every conceivable error scenario.
  2. Reliability: Fresh restarts eliminate accumulated state corruption that can plague long-running services.
  3. Speed: Supervisor restarts happen in microseconds, not seconds. Your system recovers faster than traditional error handling can even detect the problem.
  4. Isolation: One worker’s crash doesn’t affect others. Failures are contained and don’t cascade (see the short demo after this list).
  5. Observability: Clear separation between normal operation and error handling makes debugging and monitoring much easier.
  6. Maintainability: When you don’t have error handling scattered throughout your codebase, adding features becomes dramatically simpler.
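
The isolation point is easy to verify yourself: BEAM processes share no memory, so one process crashing cannot corrupt another. A tiny, self-contained demo:

# Two unrelated processes: one sleeps forever, one crashes immediately
pid_a = spawn(fn -> Process.sleep(:infinity) end)
pid_b = spawn(fn -> raise "boom" end)

Process.sleep(100)

Process.alive?(pid_a)  #=> true  - untouched by the crash next door
Process.alive?(pid_b)  #=> false
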
Real-World Impact

This isn’t just theory. Companies using this approach achieve remarkable results:

  • WhatsApp famously served hundreds of millions of users with an engineering team of roughly 50
  • Telecom systems built on Erlang/Elixir routinely achieve 99.999% uptime
  • Financial platforms process millions of transactions without the complex error handling typical in other languages
  • IoT systems self-heal from network partitions and hardware failures
Addressing Common Concerns

“But what about data loss?” 😭
The supervisor only restarts the process, not the entire application. Database transactions, file writes, and external API calls either complete successfully or fail cleanly. There’s no partial state corruption. 😝

“What about performance?” 🤔
Creating a new process on the BEAM takes microseconds and only a few kilobytes of memory. A supervisor restart is often finished before traditional error-handling code would even have run. 😅
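
You can sanity-check this yourself with a quick (and admittedly unscientific) measurement in iex:

# Spawn 100_000 throwaway processes and time it
iex> :timer.tc(fn -> for _ <- 1..100_000, do: spawn(fn -> :ok end) end)
# returns {elapsed_microseconds, result}; on a typical laptop the total is a
# fraction of a second, i.e. single-digit microseconds per process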

“What about complex business logic?” 😏
Break it into smaller, focused processes. Each handles one concern well rather than trying to anticipate every possible failure. 😜

Getting Started with “Let It Crash”

If you’re coming from traditional languages, start by:

  1. Identify your defensive code: Look for try/catch blocks, retry logic, and state validation
  2. Extract into separate processes: Move each concern into its own GenServer or supervised task (see the sketch after this list)
  3. Add supervisors: Let them handle the error recovery
  4. Simplify your business logic: Remove the defensive programming
  5. Trust the system: It takes time to believe that simpler code can be more reliable
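
As a concrete illustration of steps 1–4, here is a rough before/after sketch. PriceAPI is a made-up module standing in for any risky call, and the “after” version assumes a Task.Supervisor named MyApp.TaskSupervisor is already running in your supervision tree:

defmodule Prices do
  # Before: defensive wrapper around a risky call
  def fetch_price_defensive(item_id) do
    try do
      {:ok, PriceAPI.fetch!(item_id)}
    rescue
      _ -> {:error, :unavailable}
    end
  end

  # After: run the risky call in its own supervised task and let it crash there
  def fetch_price(item_id) do
    task =
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        PriceAPI.fetch!(item_id)   # may raise - only this task process dies
      end)

    case Task.yield(task, 5_000) || Task.shutdown(task) do
      {:ok, price} -> {:ok, price}
      _ -> {:error, :unavailable}
    end
  end
end
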
The Bottom Line

“Let It Crash” isn’t about being careless with errors — it’s about being smart about where and how you handle them. Instead of scattering error handling throughout your codebase, you centralize recovery logic in supervisors and keep your business logic clean and focused.

The result? Systems that are simultaneously simpler to understand and more reliable in production. Systems that don’t just handle failure — they thrive because of it.

When you stop fighting errors and start designing for them, you don’t just build more reliable systems. You build systems that are antifragile — they actually get stronger under stress.

This is the power of Elixir’s concurrency model. It’s not just about running multiple tasks simultaneously — it’s about creating systems that embrace failure as a natural part of complex distributed systems and turn it into a competitive advantage.

Ready to let it crash? Your future self (and your operations team) will thank you.

Next time your system goes down at 3 AM, ask yourself: are you fighting failure, or designing for it?


If you liked the blog, please consider buying a coffee 😝 https://buymeacoffee.com/y316nitka Thank you!
