Elixir Error Handling Explained: Why ‘Let It Crash’ Works

It was 3 AM when the alerts started flooding in. The e-commerce platform was down, customers couldn’t complete purchases, and every minute meant thousands of dollars in lost revenue. The culprit? A database connection timeout that cascaded through the entire system, leaving services in inconsistent states despite hundreds of lines of carefully crafted error handling code.

This scenario plays out every day in companies worldwide. We’ve been taught that good programming means handling every possible error condition, building defensive walls around our code, and preparing for every imaginable failure scenario. But what if I told you this approach is fundamentally flawed?

Welcome to Elixir’s revolutionary “Let It Crash” philosophy — a counterintuitive approach that has enabled systems to achieve 99.999% uptime by embracing failure instead of fighting it.

The Defensive Programming Trap

Traditional programming languages have conditioned us to write defensive code. Let’s examine how a typical database service looks in Java:

public class DatabaseService {
    private Connection connection;
    private int retryCount = 0;
    private static final int MAX_RETRIES = 3;
    
    public String query(String sql) {
        try {
            // Check if connection is valid
            if (connection == null || connection.isClosed()) {
                reconnect();
            }
            
            // Validate input
            if (sql == null || sql.trim().isEmpty()) {
                throw new IllegalArgumentException("SQL cannot be empty");
            }
            
            // Execute query with retry logic
            return executeWithRetry(sql);
            
        } catch (SQLException e) {
            handleSQLException(e);
            return null;
        } catch (IllegalArgumentException e) {
            logger.error("Invalid SQL: " + sql, e);
            return null;
        } catch (Exception e) {
            logger.error("Unexpected error", e);
            // Now what? The service might be in an inconsistent state
            return null;
        }
    }
    
    private String executeWithRetry(String sql) throws SQLException {
        SQLException lastException = null;
        
        for (int i = 0; i < MAX_RETRIES; i++) {
            try {
                PreparedStatement stmt = connection.prepareStatement(sql);
                ResultSet rs = stmt.executeQuery();
                retryCount = 0; // Reset on success
                return processResultSet(rs);
                
            } catch (SQLException e) {
                lastException = e;
                retryCount++;
                
                if (i < MAX_RETRIES - 1) {
                    try {
                        Thread.sleep(1000 * (i + 1)); // Linear backoff between retries
                        reconnect();
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new SQLException("Interrupted during retry", ie);
                    }
                }
            }
        }
        
        throw lastException;
    }
    
    private void handleSQLException(SQLException e) {
        // Complex error handling logic
        switch (e.getErrorCode()) {
            case 1205: // Deadlock
                // Log and retry?
                break;
            case 2006: // MySQL server has gone away
                // Reconnect?
                break;
            default:
                // Log and hope for the best?
        }
    }
}

Look at this code. We have:

  • Complex error handling scattered throughout
  • Manual state management that’s prone to bugs
  • No clear recovery strategy for unexpected errors
  • Error handling logic that itself can fail
  • A service that becomes increasingly fragile as edge cases multiply
Go’s Slightly Better Approach

Go encourages explicit error handling, which feels safer:

type DatabaseService struct {
    conn       *sql.DB
    retryCount int
    maxRetries int
}

func (db *DatabaseService) Query(sqlQuery string) (string, error) {
    // Validate input
    if sqlQuery == "" {
        return "", fmt.Errorf("SQL query cannot be empty")
    }
    
    // Check connection
    if err := db.conn.Ping(); err != nil {
        if reconnectErr := db.reconnect(); reconnectErr != nil {
            return "", fmt.Errorf("failed to reconnect: %w", reconnectErr)
        }
    }
    
    // Execute with retry
    for i := 0; i < db.maxRetries; i++ {
        result, err := db.executeQuery(sqlQuery)
        if err == nil {
            db.retryCount = 0
            return result, nil
        }
        
        // Handle specific errors
        if isRetryableError(err) && i < db.maxRetries-1 {
            time.Sleep(time.Duration(i+1) * time.Second)
            continue
        }
        
        return "", fmt.Errorf("query failed after %d retries: %w", i+1, err)
    }
    
    return "", fmt.Errorf("unexpected: should not reach here")
}
func isRetryableError(err error) bool {
    // Complex logic to determine if error is retryable
    // What if we miss an edge case?
    return strings.Contains(err.Error(), "connection reset") ||
           strings.Contains(err.Error(), "timeout")
}

Better than exceptions, but we still have:

  • Extensive error categorization requirements
  • Manual, error-prone state management
  • No automatic recovery from unexpected scenarios
  • Bug-prone error handling logic
Rust’s Type-Safe Complexity

Rust uses Result types for safer error handling:

use std::time::Duration;
use tokio::time::sleep;

pub struct DatabaseService {
    connection: Option<DbConnection>,
    retry_count: u32,
    max_retries: u32,
}
impl DatabaseService {
    pub async fn query(&mut self, sql: &str) -> Result<String, DatabaseError> {
        // Validate input
        if sql.is_empty() {
            return Err(DatabaseError::InvalidInput("SQL cannot be empty".to_string()));
        }
        
        // Ensure connection
        self.ensure_connection().await?;
        
        // Execute with retry logic
        let mut last_error = None;
        
        for attempt in 0..self.max_retries {
            match self.execute_query(sql).await {
                Ok(result) => {
                    self.retry_count = 0;
                    return Ok(result);
                }
                Err(e) if e.is_retryable() && attempt < self.max_retries - 1 => {
                    last_error = Some(e);
                    sleep(Duration::from_secs(u64::from(attempt) + 1)).await;
                    self.reconnect().await?;
                }
                Err(e) => return Err(e),
            }
        }
        
        Err(last_error.unwrap_or(DatabaseError::Unknown))
    }
    
    // Still need complex error categorization and state management...
}
#[derive(Debug)]
pub enum DatabaseError {
    ConnectionLost,
    InvalidInput(String),
    Timeout,
    Unknown,
    // Need to handle every possible error type...
}
impl DatabaseError {
    fn is_retryable(&self) -> bool {
        matches!(self, DatabaseError::ConnectionLost | DatabaseError::Timeout)
        // What if we miss a retryable error type?
    }
}

Even with Rust’s type safety:

  • We must explicitly enumerate all possible errors
  • State management across async boundaries is complex
  • Error categorization is still manual and fallible
  • Recovery logic remains our responsibility
The Revolutionary Alternative: Elixir’s “Let It Crash”

Now, let’s see how Elixir approaches the same problem. Prepare to have your assumptions challenged:

defmodule DatabaseWorker do
  use GenServer

  # Client API - Clean and simple
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end
  # Server implementation - No error handling!
  def init(_opts) do
    {:ok, %{connection: establish_connection()}}
  end
  def handle_call({:query, sql}, _from, state) do
    # No defensive programming here - just do the work
    result = execute_dangerous_query(state.connection, sql)
    {:reply, result, state}
  end
  # Helper functions that might crash
  defp establish_connection do
    # This might fail - and that's OK!
    :database.connect("localhost", 5432)
  end
  defp execute_dangerous_query(connection, sql) do
    # This might crash - and that's OK too!
    :database.query(connection, sql)
  end
end

That’s exactly the point! This code looks “unsafe” to developers from other languages, but it’s actually more robust. Here’s the secret weapon:

The Supervisor: Your Automatic Safety Net
defmodule DatabaseSupervisor do
  use Supervisor
  
  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def init(_opts) do
    children = [
      {DatabaseWorker, []}
    ]
    # If DatabaseWorker crashes, restart it immediately
    Supervisor.init(children, strategy: :one_for_one)
  end
end

When DatabaseWorker crashes (and it will), the supervisor automatically:

  1. Detects the crash instantly (within microseconds)
  2. Starts a fresh DatabaseWorker process
  3. Provides a clean, uncorrupted state
  4. Enables clients to simply retry their requests (a small retry sketch follows this list)
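
What does “simply retry” look like from the caller’s side? Here is one minimal, hypothetical sketch (not part of the modules above): GenServer.call exits if the target process is down, so the caller catches that exit once, at the boundary, and tries again after the supervisor has restarted the worker.

defmodule DatabaseClient do
  # Hypothetical retry helper. The worker stays free of defensive code;
  # the one place that knows about retries is this thin boundary.
  def query(sql, retries \\ 3) do
    try do
      DatabaseWorker.query(sql)
    catch
      :exit, _reason when retries > 0 ->
        Process.sleep(100)   # give the supervisor a moment to restart the worker
        query(sql, retries - 1)
    end
  end
end
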
See It in Action

Here’s a little demo I made to show how Elixir recovers compared to languages that lean on defensive programming.

Making It Smarter: State Recovery

Now let’s enhance our Elixir worker to be even more resilient by adding state persistence:

defmodule SmartDatabaseWorker do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end
  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end
  def get_stats do
    GenServer.call(__MODULE__, :get_stats)
  end
  def init(_opts) do
    # Recover any previous state
    previous_stats = recover_stats()
    
    state = %{
      connection: establish_connection(),
      query_count: previous_stats.query_count,
      error_count: previous_stats.error_count
    }
    
    Logger.info("DatabaseWorker started with recovered stats: #{inspect(previous_stats)}")
    {:ok, state}
  end
  def handle_call({:query, sql}, _from, state) do
    # Still no defensive programming - just do the work
    result = execute_query(state.connection, sql)
    
    new_state = %{state | query_count: state.query_count + 1}
    backup_stats(new_state)
    
    {:reply, result, new_state}
  end
  def handle_call(:get_stats, _from, state) do
    stats = %{
      query_count: state.query_count,
      error_count: state.error_count,
      uptime: :erlang.system_time(:millisecond)
    }
    {:reply, stats, state}
  end
  # This runs when the process is about to die
  def terminate(reason, state) do
    Logger.info("DatabaseWorker terminating: #{inspect(reason)}")
    backup_stats(state)
    :ok
  end
  defp establish_connection do
    # This might fail - let it crash!
    :database.connect("localhost", 5432)
  end
  defp execute_query(connection, sql) do
    # This might fail too - let it crash!
    :database.query(connection, sql)
  end
  defp backup_stats(state) do
    stats = %{
      query_count: state.query_count,
      error_count: state.error_count,
      last_backup: :erlang.system_time(:millisecond)
    }
    :persistent_term.put({__MODULE__, :stats}, stats)
  end
  defp recover_stats do
    case :persistent_term.get({__MODULE__, :stats}, nil) do
      nil -> %{query_count: 0, error_count: 0}
      stats -> stats
    end
  end
end

Notice what happened here: We added state persistence with just a few lines of code. When the worker crashes and restarts, it seamlessly continues from where it left off. No complex error handling, no defensive programming — just clean separation of concerns.
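
To watch the recovery happen, here is a hypothetical iex session (it assumes SmartDatabaseWorker is running under a one_for_one supervisor like the one shown earlier; the values are purely illustrative):

iex> SmartDatabaseWorker.query("SELECT * FROM users")
{:ok, [%{id: 1, name: "Ada"}]}

iex> SmartDatabaseWorker.get_stats()
%{query_count: 1, error_count: 0, uptime: 1712345678901}

# Simulate a hard crash of the worker process
iex> Process.exit(Process.whereis(SmartDatabaseWorker), :kill)
true

# The supervisor has already restarted it, and the stats recovered
# from :persistent_term survived the crash
iex> SmartDatabaseWorker.get_stats()
%{query_count: 1, error_count: 0, uptime: 1712345679204}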

The Complete Resilient System
defmodule ResilientDatabaseSystem do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(_opts) do
    children = [
      # Primary database worker (explicit child id so two copies of the same
      # module can run under this supervisor)
      Supervisor.child_spec({SmartDatabaseWorker, [name: :primary]}, id: :primary_db),

      # Backup database worker
      Supervisor.child_spec({SmartDatabaseWorker, [name: :backup]}, id: :backup_db),
      
      # Load balancer that routes requests
      {DatabaseLoadBalancer, []},
      
      # Health monitor
      {DatabaseHealthMonitor, []}
    ]
    # Allow up to 5 restarts within 60 seconds before the supervisor itself gives up
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 60)
  end
end

Now we have a system where:

  • Individual workers can crash and restart instantly
  • State is preserved across restarts
  • Multiple workers provide redundancy
  • The load balancer routes around failed workers (a minimal sketch follows below)
  • Health monitoring provides observability

All with zero defensive programming in the business logic!
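
DatabaseLoadBalancer and DatabaseHealthMonitor are named above but not shown. As a rough idea of the former, a minimal sketch might be a GenServer that forwards each call to the first worker that is currently alive (this assumes each SmartDatabaseWorker registers under the name it is given, e.g. :primary and :backup):

defmodule DatabaseLoadBalancer do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def query(sql) do
    GenServer.call(__MODULE__, {:query, sql})
  end

  def init(_opts) do
    {:ok, %{workers: [:primary, :backup]}}
  end

  def handle_call({:query, sql}, _from, state) do
    # Skip workers that are mid-restart; if every worker is down, this call
    # crashes too and the supervisor restarts the load balancer itself.
    worker = Enum.find(state.workers, fn name -> Process.whereis(name) != nil end)
    {:reply, GenServer.call(worker, {:query, sql}), state}
  end
end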

The Fundamental Mindset Shift

The difference between traditional programming and Elixir’s approach is philosophical:

Traditional Approach: “I must prevent all possible errors”
  • Results in complex, brittle code
  • Error handling code often contains bugs
  • Unexpected errors still cause system-wide failures
  • Recovery is manual and slow
  • Code becomes harder to understand and maintain
Elixir Approach: “Errors will happen, design for recovery”
  • Results in simple, robust code
  • Error handling is centralized in supervisors
  • Any error triggers a clean restart
  • Recovery is automatic and lightning-fast
  • Business logic stays focused on business problems
Why This Actually Works Better
  1. Simplicity: Your business logic stays clean and focused on solving business problems, not handling every conceivable error scenario.
  2. Reliability: Fresh restarts eliminate accumulated state corruption that can plague long-running services.
  3. Speed: Supervisor restarts happen in microseconds, not seconds. Your system recovers faster than traditional error handling can even detect the problem.
  4. Isolation: One worker’s crash doesn’t affect others. Failures are contained and don’t cascade (see the short demo after this list).
  5. Observability: Clear separation between normal operation and error handling makes debugging and monitoring much easier.
  6. Maintainability: When you don’t have error handling scattered throughout your codebase, adding features becomes dramatically simpler.
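
The isolation point is easy to verify yourself: BEAM processes share no memory, so one process crashing cannot corrupt another. A tiny, self-contained demo:

# Two unrelated processes: one sleeps forever, one crashes immediately
pid_a = spawn(fn -> Process.sleep(:infinity) end)
pid_b = spawn(fn -> raise "boom" end)

Process.sleep(100)

Process.alive?(pid_a)  #=> true  - untouched by the crash next door
Process.alive?(pid_b)  #=> false
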
Real-World Impact

This isn’t just theory. Companies using this approach achieve remarkable results:

  • WhatsApp famously served hundreds of millions of users with an engineering team of roughly 50
  • Telecom systems built on Erlang/Elixir routinely achieve 99.999% uptime
  • Financial platforms process millions of transactions without the complex error handling typical in other languages
  • IoT systems self-heal from network partitions and hardware failures
Addressing Common Concerns

“But what about data loss?” 😭
The supervisor only restarts the process, not the entire application. Database transactions, file writes, and external API calls either complete successfully or fail cleanly. There’s no partial state corruption. 😝

“What about performance?” 🤔
Creating a new process on the BEAM takes microseconds and only a few kilobytes of memory. A supervisor restart is often finished before traditional error-handling code would even have run. 😅
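
You can sanity-check this yourself with a quick (and admittedly unscientific) measurement in iex:

# Spawn 100_000 throwaway processes and time it
iex> :timer.tc(fn -> for _ <- 1..100_000, do: spawn(fn -> :ok end) end)
# returns {elapsed_microseconds, result}; on a typical laptop the total is a
# fraction of a second, i.e. single-digit microseconds per process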

“What about complex business logic?” 😏
Break it into smaller, focused processes. Each handles one concern well rather than trying to anticipate every possible failure. 😜

Getting Started with “Let It Crash”

If you’re coming from traditional languages, start by:

  1. Identify your defensive code: Look for try/catch blocks, retry logic, and state validation
  2. Extract into separate processes: Move each concern into its own GenServer or supervised task (see the sketch after this list)
  3. Add supervisors: Let them handle the error recovery
  4. Simplify your business logic: Remove the defensive programming
  5. Trust the system: It takes time to believe that simpler code can be more reliable
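
As a concrete illustration of steps 1–4, here is a rough before/after sketch. PriceAPI is a made-up module standing in for any risky call, and the “after” version assumes a Task.Supervisor named MyApp.TaskSupervisor is already running in your supervision tree:

defmodule Prices do
  # Before: defensive wrapper around a risky call
  def fetch_price_defensive(item_id) do
    try do
      {:ok, PriceAPI.fetch!(item_id)}
    rescue
      _ -> {:error, :unavailable}
    end
  end

  # After: run the risky call in its own supervised task and let it crash there
  def fetch_price(item_id) do
    task =
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        PriceAPI.fetch!(item_id)   # may raise - only this task process dies
      end)

    case Task.yield(task, 5_000) || Task.shutdown(task) do
      {:ok, price} -> {:ok, price}
      _ -> {:error, :unavailable}
    end
  end
end
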
The Bottom Line

“Let It Crash” isn’t about being careless with errors — it’s about being smart about where and how you handle them. Instead of scattering error handling throughout your codebase, you centralize recovery logic in supervisors and keep your business logic clean and focused.

The result? Systems that are simultaneously simpler to understand and more reliable in production. Systems that don’t just handle failure — they thrive because of it.

When you stop fighting errors and start designing for them, you don’t just build more reliable systems. You build systems that are antifragile — they actually get stronger under stress.

This is the power of Elixir’s concurrency model. It’s not just about running multiple tasks simultaneously — it’s about creating systems that embrace failure as a natural part of complex distributed systems and turn it into a competitive advantage.

Ready to let it crash? Your future self (and your operations team) will thank you.

Next time your system goes down at 3 AM, ask yourself: are you fighting failure, or designing for it?


If you liked the blog, please consider buying a coffee 😝 https://buymeacoffee.com/y316nitka Thank you!
