Building Fault-Tolerant Invoice Settlement Systems: The Elixir Way — OTP Makes It Simple — Part 3

Why Elixir Changes Everything

Before we dive into code, let’s understand what makes Elixir fundamentally different. While Go makes you build reliability mechanisms manually, Elixir was born from the Erlang ecosystem — designed for telecommunications systems that needed 99.9999999% uptime (that’s 31 milliseconds of downtime per year).

The Erlang/OTP philosophy is simple: “Let it crash, but make crashes harmless.”

The Mental Shift: Processes vs Threads

In our Go implementation, we worried about:

  • State persistence across crashes
  • Manual retry logic
  • Worker pool management
  • Health monitoring
  • Graceful error recovery

In Elixir, we think differently:

  • Each invoice job becomes a lightweight process
  • If a process crashes, its supervisor restarts it
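  • State lives inside the process and is passed through each GenServer callback (it is in memory only, so anything that must survive a crash still goes to the database)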
  • The system self-heals without manual intervention
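
To make the restart bullet concrete, here is a minimal sketch that is separate from the invoice system. A hypothetical `Demo.Counter` GenServer runs under a `:one_for_one` supervisor, gets killed, and comes back under a new pid. The module names and the polling helper are made up for illustration.

```elixir
defmodule Demo.Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
end

defmodule Demo do
  # Polls until the supervisor has registered a replacement process.
  def await_restart(name, old_pid, attempts \\ 100) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid -> pid
      _ when attempts > 0 ->
        Process.sleep(10)
        await_restart(name, old_pid, attempts - 1)
      _ -> raise "#{inspect(name)} was not restarted in time"
    end
  end
end

{:ok, _sup} = Supervisor.start_link([Demo.Counter], strategy: :one_for_one)

Demo.Counter.increment()
before_crash = Demo.Counter.increment()

old_pid = Process.whereis(Demo.Counter)
Process.exit(old_pid, :kill)          # simulate a crash
new_pid = Demo.await_restart(Demo.Counter, old_pid)

# The replacement starts from init/1 again, so the count is back at zero.
after_restart = Demo.Counter.increment()
IO.inspect({before_crash, after_restart}, label: "count before crash / after restart")
```

The counter restarts from zero because GenServer state is held in memory and dies with the process. That is why the invoice worker later in this post records job status in the database.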

Elixir Architecture: Supervision Trees

The Elixir Implementation: Let’s Build It

1. The Application Structure

In Elixir, we organize our application using OTP principles:

invoice_settlement/
├── lib/
│   ├── invoice_settlement/
│   │   ├── application.ex          # Main supervisor
│   │   ├── scheduler.ex            # Job scheduling
│   │   ├── invoice_worker.ex       # Individual job processor  
│   │   ├── services/
│   │   │   ├── pdf_service.ex
│   │   │   ├── email_service.ex
│   │   │   └── payment_service.ex
│   │   └── repo.ex                 # Database interface
│   └── invoice_settlement.ex
└── mix.exs

2. Application Supervisor — The Root of Reliability

# lib/invoice_settlement/application.ex
defmodule InvoiceSettlement.Application do
  use Application
  def start(_type, _args) do
    children = [
      # Database connection pool
      InvoiceSettlement.Repo,
      # Service processes (GenServers) - if one crashes, only that one restarts
      {InvoiceSettlement.Services.PDFService, []},
      {InvoiceSettlement.Services.EmailService, []},
      {InvoiceSettlement.Services.PaymentService, []},
      # Job scheduler - manages when to create invoice jobs
      {InvoiceSettlement.Scheduler, []},
      # Dynamic supervisor for invoice workers
      # This allows us to spawn workers on-demand
      {DynamicSupervisor, 
       name: InvoiceSettlement.WorkerSupervisor, 
       strategy: :one_for_one}
    ]
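    # :one_for_one restarts only the child that crashed; its siblings keep running.
    # The whole tree goes down only if a child keeps crashing faster than the
    # supervisor's restart limit allows, which fault isolation makes rare.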
    opts = [strategy: :one_for_one, name: InvoiceSettlement.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

What Just Happened?

  • We defined our entire system architecture in ~20 lines
  • Each component is isolated — one crash doesn’t bring down the system
  • OTP automatically handles process lifecycles
  • No manual worker pool management needed

3. The Invoice Worker — Where Magic Happens

# lib/invoice_settlement/invoice_worker.ex
defmodule InvoiceSettlement.InvoiceWorker do
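  # :transient = restart only after an abnormal exit, so a worker that
  # finishes with :normal is not started again by the DynamicSupervisor
  use GenServer, restart: :transient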
  require Logger
  alias InvoiceSettlement.{Repo, Services}
  # Client API - How other processes interact with us
  def start_link(job_params) do
    GenServer.start_link(__MODULE__, job_params)
  end
  # There is no public "process" call: init/1 kicks off the work through
  # handle_continue/2, so the worker runs as soon as it is started.
  # Server Implementation - The actual work happens here
  def init(job_params) do
    # Store our state - this is automatically managed by GenServer
    state = %{
      job_id: job_params.job_id,
      vendor_id: job_params.vendor_id,
      settlement_period: job_params.settlement_period,
      current_step: :initialized,
      invoice_data: nil,
      pdf_path: nil,
      email_id: nil,
      payment_id: nil,
      retry_count: 0
    }
    Logger.info("Invoice worker started for vendor #{job_params.vendor_id}")
    # Automatically start processing - no external coordination needed
    {:ok, state, {:continue, :start_processing}}
  end
  def handle_continue(:start_processing, state) do
    # Update job status in database
    update_job_status(state.job_id, "running")
    case process_step(state, :generate_invoice_data) do
      {:ok, new_state} ->
        {:noreply, new_state, {:continue, :continue_processing}}
      {:error, reason} ->
        handle_error(state, reason)
    end
  end
  def handle_continue(:continue_processing, state) do
    next_step = get_next_step(state.current_step)
    case next_step do
      :completed ->
        Logger.info("Invoice #{state.job_id} completed successfully")
        update_job_status(state.job_id, "completed")
        {:stop, :normal, state}
      step when is_atom(step) ->
        case process_step(state, step) do
          {:ok, new_state} ->
            {:noreply, new_state, {:continue, :continue_processing}}
          {:error, reason} ->
            handle_error(state, reason)
        end
    end
  end
  # The business logic
  defp process_step(state, :generate_invoice_data) do
    Logger.info("Generating invoice data for vendor #{state.vendor_id}")
    case generate_invoice_data(state.vendor_id, state.settlement_period) do
      {:ok, invoice_data} ->
        new_state = %{state | 
          current_step: :generate_invoice_data,
          invoice_data: invoice_data
        }
        {:ok, new_state}
      {:error, reason} ->
        {:error, "Invoice data generation failed: #{inspect(reason)}"}
    end
  end
  defp process_step(state, :generate_pdf) do
    Logger.info("Generating PDF for invoice #{state.job_id}")
    case Services.PDFService.generate_pdf(state.invoice_data) do
      {:ok, pdf_path} ->
        new_state = %{state | 
          current_step: :generate_pdf,
          pdf_path: pdf_path
        }
        {:ok, new_state}
      {:error, reason} ->
        {:error, "PDF generation failed: #{inspect(reason)}"}
    end
  end
  defp process_step(state, :send_email) do
    Logger.info("Sending email for invoice #{state.job_id}")
    case Services.EmailService.send_invoice_email(state.vendor_id, state.pdf_path) do
      {:ok, email_id} ->
        new_state = %{state | 
          current_step: :send_email,
          email_id: email_id
        }
        {:ok, new_state}
      {:error, reason} ->
        {:error, "Email sending failed: #{inspect(reason)}"}
    end
  end
  defp process_step(state, :initiate_payment) do
    Logger.info("Initiating payment for invoice #{state.job_id}")
    case Services.PaymentService.initiate_payment(state.invoice_data) do
      {:ok, payment_id} ->
        new_state = %{state | 
          current_step: :initiate_payment,
          payment_id: payment_id
        }
        {:ok, new_state}
      {:error, reason} ->
        {:error, "Payment initiation failed: #{inspect(reason)}"}
    end
  end
  # Error handling with automatic retry
  defp handle_error(state, reason) do
    Logger.error("Invoice #{state.job_id} failed at step #{state.current_step}: #{reason}")
    new_retry_count = state.retry_count + 1
    if new_retry_count <= 3 do
      Logger.info("Retrying invoice #{state.job_id} (attempt #{new_retry_count})")
      # Exponential backoff - the delay doubles each time: 5s, 10s, 20s
      retry_delay = :timer.seconds(5 * Integer.pow(2, new_retry_count - 1))
      new_state = %{state | retry_count: new_retry_count}
      # Schedule retry after delay
      Process.send_after(self(), :retry_processing, retry_delay)
      {:noreply, new_state}
    else
      Logger.error("Invoice #{state.job_id} failed permanently after #{new_retry_count} attempts")
      update_job_status(state.job_id, "failed")
      # Send to dead letter queue for manual investigation
      send_to_dead_letter_queue(state, reason)
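      # {:shutdown, _} is a clean stop: no crash report, and a :transient
      # worker is not restarted after it
      {:stop, {:shutdown, :failed}, state}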
    end
  end
  # Handle retry messages
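  def handle_info(:retry_processing, state) do
    Logger.info("Retrying processing for invoice #{state.job_id}")
    # current_step is the last step that succeeded, so resume right after it
    # instead of repeating finished work (such as re-sending the email)
    next = if state.current_step == :initialized, do: :start_processing, else: :continue_processing
    {:noreply, state, {:continue, next}}
  end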
  # Helper functions
  defp get_next_step(:generate_invoice_data), do: :generate_pdf
  defp get_next_step(:generate_pdf), do: :send_email
  defp get_next_step(:send_email), do: :initiate_payment
  defp get_next_step(:initiate_payment), do: :completed
  defp generate_invoice_data(vendor_id, _settlement_period) do
    # Your business logic here
    # This could fail, but the GenServer will handle it gracefully
    {:ok, %{vendor_id: vendor_id, total_amount: 1500.00, line_items: []}}
  end
  defp update_job_status(job_id, status) do
    # Update database - if this fails, the process will crash and restart
    # But that's okay! The supervisor will restart us and we'll retry
    Repo.update_job_status(job_id, status)
  end
  defp send_to_dead_letter_queue(state, reason) do
    # Send failed jobs for manual investigation
    Logger.error("Sending job #{state.job_id} to dead letter queue: #{reason}")
  end
end

Of course, there is plenty we could add to make this more reliable, but it is the simplest code that actually works, and it is already far more reliable than what we had with .NET back then.
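
One such addition is a capped exponential backoff helper for the retry delay. This is a minimal sketch, assuming a 5 second base and a 60 second cap. Both numbers and the `Sketch.Backoff` module are illustrative and do not come from the worker above. `Integer.pow/2` requires Elixir 1.12 or later.

```elixir
defmodule Sketch.Backoff do
  @moduledoc "Exponential backoff with a cap: base, 2 * base, 4 * base, ... up to max."

  def delay_ms(attempt, base_ms \\ 5_000, max_ms \\ 60_000)
      when is_integer(attempt) and attempt >= 1 do
    min(base_ms * Integer.pow(2, attempt - 1), max_ms)
  end
end

delays = Enum.map(1..5, &Sketch.Backoff.delay_ms/1)
IO.inspect(delays, label: "delays (ms)")
```

The cap matters once the retry limit grows. Without it, the tenth attempt would wait over 40 minutes.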

What’s Beautiful About This?

  • No Manual State Management: GenServer passes the state from one callback to the next, so there is nothing to lock or share
  • Automatic Retry Logic: Built into the error handling, with exponential backoff
  • Fault Isolation: If this worker crashes, it only affects this one invoice
  • Self-Healing: The supervisor restarts the worker if it crashes unexpectedly
  • Clean Business Logic: The actual invoice processing logic is straightforward
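
One detail behind the Self-Healing bullet is worth a sketch: the child's `restart` option decides which exits count as crashes. The hypothetical `Sketch.Job` below uses `restart: :transient`, so it is restarted only after an abnormal exit. Stopping with `:normal` or `{:shutdown, reason}` ends the job for good. All names here are illustrative.

```elixir
defmodule Sketch.Job do
  # :transient means "restart only after an abnormal exit".
  use GenServer, restart: :transient

  def start_link(args), do: GenServer.start_link(__MODULE__, args)

  @impl true
  def init(%{notify: notify} = args) do
    send(notify, {:started, self()})
    {:ok, args, {:continue, :run}}
  end

  @impl true
  def handle_continue(:run, %{outcome: :done} = state), do: {:stop, :normal, state}
  def handle_continue(:run, %{outcome: :gave_up} = state), do: {:stop, {:shutdown, :gave_up}, state}
end

defmodule Sketch.Starts do
  # Counts {:started, pid} messages until the mailbox stays quiet for 200 ms.
  def count(n \\ 0) do
    receive do
      {:started, _pid} -> count(n + 1)
    after
      200 -> n
    end
  end
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{:ok, _} = DynamicSupervisor.start_child(sup, {Sketch.Job, %{notify: self(), outcome: :done}})
starts_after_success = Sketch.Starts.count()

{:ok, _} = DynamicSupervisor.start_child(sup, {Sketch.Job, %{notify: self(), outcome: :gave_up}})
starts_after_giving_up = Sketch.Starts.count()

IO.inspect({starts_after_success, starts_after_giving_up}, label: "starts per job")
```

With the default `restart: :permanent`, both jobs would be started again after they stop. That is rarely what you want for one-shot work such as a single invoice.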

4. The Scheduler — Creating Jobs Automatically

# lib/invoice_settlement/scheduler.ex
defmodule InvoiceSettlement.Scheduler do
  use GenServer
  require Logger
  alias InvoiceSettlement.{Repo, InvoiceWorker}
  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end
  def init(state) do
    # Schedule the first check
    schedule_job_creation()
    {:ok, state}
  end
  def handle_info(:create_jobs, state) do
    Logger.info("Checking for vendors that need invoice settlement")
    # Get vendors that need invoices based on their settlement schedules
    vendors_needing_invoices = Repo.get_vendors_needing_settlement()
    Enum.each(vendors_needing_invoices, fn vendor ->
      create_invoice_job(vendor)
    end)
    # Schedule next check
    schedule_job_creation()
    {:noreply, state}
  end
  defp create_invoice_job(vendor) do
    # Create job record in database
    {:ok, job} = Repo.create_invoice_job(%{
      vendor_id: vendor.id,
      settlement_period: vendor.settlement_period,
      status: "pending"
    })
    # Start a worker process for this job
    job_params = %{
      job_id: job.id,
      vendor_id: vendor.id,
      settlement_period: vendor.settlement_period
    }
    # The DynamicSupervisor will manage this worker's lifecycle
    case DynamicSupervisor.start_child(
      InvoiceSettlement.WorkerSupervisor,
      {InvoiceWorker, job_params}
    ) do
      {:ok, _pid} ->
        Logger.info("Started invoice worker for vendor #{vendor.id}")
      {:error, reason} ->
        Logger.error("Failed to start invoice worker: #{inspect(reason)}")
    end
  end
  defp schedule_job_creation do
    # Check every hour for new jobs to create
    Process.send_after(self(), :create_jobs, :timer.hours(1))
  end
end
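
The scheduler's timing relies on a self-rescheduling message: handle it, then arm the next timer. Here is that pattern in isolation, as a sketch with a hypothetical `Sketch.Ticker` and a 50 ms interval so it finishes quickly. The `notify` option exists only so the example can observe the ticks.

```elixir
defmodule Sketch.Ticker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      interval: Keyword.fetch!(opts, :interval),
      notify: Keyword.fetch!(opts, :notify),
      ticks: 0
    }

    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    ticks = state.ticks + 1
    send(state.notify, {:tick, ticks})
    # Re-arm the timer only after the work is done, so ticks never overlap.
    schedule(state.interval)
    {:noreply, %{state | ticks: ticks}}
  end

  defp schedule(interval), do: Process.send_after(self(), :tick, interval)
end

{:ok, ticker} = Sketch.Ticker.start_link(interval: 50, notify: self())

ticks =
  for _ <- 1..3 do
    receive do
      {:tick, n} -> n
    after
      1_000 -> :timeout
    end
  end

GenServer.stop(ticker)
IO.inspect(ticks, label: "ticks received")
```

Because the next timer is armed inside the handler, the interval is measured from the end of one run to the start of the next. A slow run delays the following tick and never stacks a second run on top of it.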

5. Services — Simple and Focused

# lib/invoice_settlement/services/pdf_service.ex
defmodule InvoiceSettlement.Services.PDFService do
  use GenServer
  require Logger
  def start_link(_) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end
  def generate_pdf(invoice_data) do
    GenServer.call(__MODULE__, {:generate_pdf, invoice_data})
  end
  def init(state) do
    {:ok, state}
  end
  def handle_call({:generate_pdf, invoice_data}, _from, state) do
    Logger.info("Generating PDF for vendor #{invoice_data.vendor_id}")
    # Simulate PDF generation that might fail
    case do_generate_pdf(invoice_data) do
      {:ok, pdf_path} ->
        Logger.info("PDF generated successfully: #{pdf_path}")
        {:reply, {:ok, pdf_path}, state}
      {:error, reason} ->
        Logger.error("PDF generation failed: #{inspect(reason)}")
        {:reply, {:error, reason}, state}
    end
  end
  defp do_generate_pdf(invoice_data) do
    # Your actual PDF generation logic
    # This could call external services, use libraries like PdfGenerator, etc.
    # Simulate potential failure
    if :rand.uniform(10) == 1 do
      {:error, "PDF template rendering failed"}
    else
      pdf_path = "/tmp/invoice_#{invoice_data.vendor_id}_#{System.system_time()}.pdf"
      {:ok, pdf_path}
    end
  end
end

The Dramatic Difference: Comparing Complexity

The “Let It Crash” Philosophy in Action

Here’s what happens when things go wrong in our Elixir system:

Scenario 1: PDF Generation Fails

In Go:

  • Worker catches the error
  • Updates database with failure status
  • Implements retry logic with backoff
  • Monitors for stuck jobs
  • Manual recovery required if retries fail

In Elixir:

# A crash inside the PDF service takes down the process that was running it
def handle_call({:generate_pdf, data}, _from, state) do
  # If this raises, the PDFService process dies, and the worker blocked
  # in GenServer.call/2 exits along with it
  pdf_path = SomeLibrary.generate_pdf!(data)  # Might crash!
  {:reply, {:ok, pdf_path}, state}
end
# But that's GOOD! The supervisors restart both processes:
# 1. The application supervisor restarts PDFService
# 2. DynamicSupervisor restarts the InvoiceWorker with the same job parameters
# 3. The worker retries from the beginning
# 4. If a child keeps crashing (more than 3 restarts in 5 seconds by default),
#    its supervisor gives up and exits, and the parent supervisor takes over
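
If restarting a whole service over one bad document feels too coarse, a different technique is to run the risky call in a supervised task and turn a crash into an error tuple. This is a sketch of that alternative, not the approach used above. The `run_isolated` function and the file path are hypothetical.

```elixir
{:ok, task_sup} = Task.Supervisor.start_link()

# Runs `fun` in its own process; a crash there cannot take the caller down.
run_isolated = fn fun ->
  task = Task.Supervisor.async_nolink(task_sup, fun)

  case Task.yield(task, 1_000) || Task.shutdown(task) do
    {:ok, result} -> {:ok, result}
    {:exit, reason} -> {:error, {:crashed, reason}}
    nil -> {:error, :timeout}
  end
end

ok_result = run_isolated.(fn -> "/tmp/invoice_42.pdf" end)
crash_result = run_isolated.(fn -> raise "template rendering failed" end)

IO.inspect(ok_result, label: "success")
IO.inspect(elem(crash_result, 0), label: "crash becomes")
```

The caller survives the crash and can decide whether to retry, which fits the retry counter that the invoice worker already keeps.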

Scenario 2: Database Connection Lost

In Go:

  • Every database call needs error handling
  • Connection pool management
  • Retry logic for each operation
  • Graceful degradation strategy

In Elixir:

# Database connection is managed by a separate process
# If DB process crashes, only it restarts - workers continue
# Connection pooling is handled by DBConnection automatically
# Workers that need DB access will get errors, crash, and restart
# System self-heals when DB connection is restored

Scenario 3: Email Service Goes Down

In Go:

  • Circuit breaker pattern implementation
  • Fallback mechanisms
  • Queue management for failed emails
  • Manual monitoring and alerts

In Elixir:

# Email service is its own supervised process
# If external email API is down, EmailService crashes
# Supervisor restarts EmailService
# Invoice workers that need email will crash and retry
# When email service is restored, everything automatically resumes
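
One caveat applies to all three scenarios: supervisors do not restart forever. Each has a restart intensity (`max_restarts` within `max_seconds`, 3 and 5 by default), and a child that exceeds it takes its supervisor down. The sketch below uses a deliberately flaky `Task` as a stand-in for a service whose dependency is unavailable. The limits are lowered to 2 restarts to keep the run short.

```elixir
# Trap exits so the supervisor's shutdown arrives as a message instead of
# taking this process down with it.
Process.flag(:trap_exit, true)
parent = self()

# Hypothetical stand-in for a service whose dependency is down:
# it reports that it started, then crashes immediately.
flaky_child =
  Supervisor.child_spec(
    {Task,
     fn ->
       send(parent, :started)
       exit(:dependency_down)
     end},
    restart: :permanent
  )

{:ok, sup} =
  Supervisor.start_link([flaky_child], strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

supervisor_exit =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    2_000 -> :still_running
  end

{:messages, inbox} = Process.info(self(), :messages)
starts = Enum.count(inbox, &(&1 == :started))

IO.inspect({starts, supervisor_exit}, label: "starts / supervisor exit reason")
```

The child runs three times (the initial start plus two restarts) before the supervisor exits with `:shutdown`. For an outage that lasts minutes, that means a crash loop alone is not enough. A service that talks to a flaky API usually needs to return an error tuple or back off rather than crash on every call.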

Monitoring and Observability — Almost Free

# lib/invoice_settlement/telemetry.ex
defmodule InvoiceSettlement.Telemetry do
  require Logger
  # Assumes the :telemetry and :telemetry_poller packages are dependencies;
  # the poller is what emits the [:vm, ...] events on a timer
  def setup_metrics do
    # Process count monitoring
    :telemetry.attach("process_count", [:vm, :system_counts], &__MODULE__.handle_process_count/4, nil)
    # Memory usage monitoring
    :telemetry.attach("memory_usage", [:vm, :memory], &__MODULE__.handle_memory/4, nil)
    # Custom business metrics, emitted by our own code via :telemetry.execute/3
    :telemetry.attach("invoice_processed", [:invoice, :processed], &__MODULE__.handle_invoice_processed/4, nil)
  end
  # Handlers are public and captured with the module name, which is the
  # form :telemetry recommends over local captures
  def handle_process_count(_event, measurements, _metadata, _config) do
    # Total number of processes on the node (workers, supervisors, etc.)
    Logger.info("Active processes: #{measurements.process_count}")
  end
  def handle_memory(_event, measurements, _metadata, _config) do
    Logger.info("Total memory: #{measurements.total} bytes")
  end
  def handle_invoice_processed(_event, measurements, metadata, _config) do
    # Business metric; duration is in whatever unit the emitter reports
    Logger.info("Invoice #{metadata.job_id} processed in #{measurements.duration}ms")
  end
end

Observer Tool: Erlang/OTP ships with a built-in GUI (start it from IEx with :observer.start()) that shows you:

  • Live process tree visualization
  • Memory usage per process
  • Message queue lengths
  • Crash and restart history
  • Load distribution across CPU cores

Error Recovery Patterns

# Supervisor strategies - choose your fault tolerance level
# 1. :one_for_one (the most common choice)
# If one child crashes, restart only that child
# Children of one supervisor need unique ids, hence Supervisor.child_spec/2
children = [
  Supervisor.child_spec({InvoiceWorker, job1}, id: :job1),
  Supervisor.child_spec({InvoiceWorker, job2}, id: :job2),  # If this crashes...
  Supervisor.child_spec({InvoiceWorker, job3}, id: :job3)   # ...the others keep running
]
# 2. :one_for_all  
# If one child crashes, restart ALL children
# Use for tightly coupled processes
# 3. :rest_for_one
# If one child crashes, restart it and all children started after it
# Use for dependency chains
# 4. Dynamic supervisors
# Start and stop children on demand
DynamicSupervisor.start_child(InvoiceSettlement.WorkerSupervisor, {InvoiceWorker, job_params})
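
The `:rest_for_one` strategy is the least intuitive of these, so here is a runnable sketch. Three named `Agent` processes stand in for a dependency chain. The names `:repo_stub`, `:pdf_stub`, and `:email_stub` are hypothetical. The middle one is killed and the example checks which pids changed.

```elixir
defmodule Sketch.Tree do
  # Child spec for a named Agent; the names stand in for real services.
  def named_agent(name) do
    %{id: name, start: {Agent, :start_link, [fn -> name end, [name: name]]}}
  end

  # Polls until `name` is registered under a pid other than `old_pid`.
  def await_new_pid(name, old_pid, attempts \\ 100) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid -> pid
      _ when attempts > 0 ->
        Process.sleep(10)
        await_new_pid(name, old_pid, attempts - 1)
      _ -> raise "#{inspect(name)} was not restarted in time"
    end
  end
end

names = [:repo_stub, :pdf_stub, :email_stub]
children = Enum.map(names, &Sketch.Tree.named_agent/1)
{:ok, _sup} = Supervisor.start_link(children, strategy: :rest_for_one)

[repo, pdf, email] = Enum.map(names, &Process.whereis/1)
Process.exit(pdf, :kill)

# The crashed child and everything started after it come back as new processes.
new_pdf = Sketch.Tree.await_new_pid(:pdf_stub, pdf)
new_email = Sketch.Tree.await_new_pid(:email_stub, email)
repo_untouched? = Process.whereis(:repo_stub) == repo

IO.inspect(repo_untouched?, label: "first child kept its pid")
```

Child order is therefore part of the design. Put the processes that others depend on first in the list.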

Hot Code Updates — The Ultimate Reliability

One of the most powerful features Elixir inherits from the BEAM is hot code loading. You can load new code without stopping the system:

# Deploy new version of InvoiceWorker without downtime
# 1. Load the new version of the module
# 2. Processes still executing the old version keep running it until their
#    next fully qualified call into the module
# 3. New workers start with the new code
# 4. Graceful transition with zero downtime
# Few other runtimes support this out of the box
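
The mechanism underneath can be tried in a single script. This sketch shows plain module reloading, not a full release upgrade. A hypothetical `HotDemo.Pricing` module is compiled twice while a `HotDemo.Server` process keeps running and picks up the new version on its next call.

```elixir
# Allow redefining a module at runtime without a compiler warning.
Code.compiler_options(ignore_module_conflict: true)

# Version 1 of a hypothetical fee module: 1% of the amount, in cents.
Code.compile_string("""
defmodule HotDemo.Pricing do
  def fee(cents), do: div(cents, 100)
end
""")

defmodule HotDemo.Server do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :no_state)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:fee, cents}, _from, state) do
    # A fully qualified call always runs the newest loaded version.
    {:reply, HotDemo.Pricing.fee(cents), state}
  end
end

{:ok, server} = HotDemo.Server.start_link([])
fee_v1 = GenServer.call(server, {:fee, 1_500})

# Version 2 (2%) is loaded while the server process keeps running.
Code.compile_string("""
defmodule HotDemo.Pricing do
  def fee(cents), do: div(cents * 2, 100)
end
""")

fee_v2 = GenServer.call(server, {:fee, 1_500})
same_process_alive? = Process.alive?(server)

IO.inspect({fee_v1, fee_v2}, label: "fee before / after reload")
```

A production upgrade also has to migrate process state between versions, which is what the `code_change/3` GenServer callback is for. That is where most of the real effort goes.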

Performance Characteristics

  • 2M+: concurrent processes on a single machine (with the VM's process limit raised)
  • 2-3μs: process creation time
  • 309 words: minimum memory per process (about 2.4 KB on a 64-bit VM)
  • ~1ms: context switching overhead (a very conservative figure; the BEAM scheduler switches processes inside the VM, without an OS thread switch)

Real-world Impact: A single Elixir node can handle thousands of concurrent invoice jobs, each running in its own isolated process. If you need more capacity, Elixir scales horizontally across multiple machines with built-in distribution.
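
To get a feel for these numbers on your own machine, the sketch below spawns 100,000 idle processes and reports the time per spawn and the memory of one process. The count is an arbitrary choice that stays under the VM's default process limit. Results vary with hardware and OTP version.

```elixir
n = 100_000

# Each process simply waits for a :stop message.
{micros, pids} =
  :timer.tc(fn ->
    for _ <- 1..n do
      spawn(fn ->
        receive do
          :stop -> :ok
        end
      end)
    end
  end)

# Process.info/2 reports memory in bytes, not words.
{:memory, bytes} = Process.info(hd(pids), :memory)

IO.puts("spawned #{length(pids)} processes in #{micros} µs (#{Float.round(micros / n, 2)} µs each)")
IO.puts("one idle process uses about #{bytes} bytes")

Enum.each(pids, &send(&1, :stop))
```

Treat the output as a rough indication only. A real worker holds state and messages, so its footprint grows with the job it carries.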

The Development Experience

Testing — Much Simpler

defmodule InvoiceSettlement.InvoiceWorkerTest do
  use ExUnit.Case
  alias InvoiceSettlement.InvoiceWorker
  test "invoice worker handles PDF generation failure gracefully" do
    # Start the worker under the test supervisor, so ExUnit stops it when
    # the test ends and a crash does not take the test process down
    _pid = start_supervised!({InvoiceWorker, %{
      job_id: 123,
      vendor_id: 456,
      settlement_period: "weekly"
    }})
    # Stub the PDF service to fail (for example with a mocking library)
    # The worker schedules a retry, or crashes and is restarted
    # The test can then assert on the retry behavior
    # The restart machinery itself needs no tests:
    # OTP handles that for us!
  end
end

Debugging — Process Inspector

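# In production, you can inspect any process:
# Named processes can be looked up directly
pid = Process.whereis(InvoiceSettlement.Services.PDFService)
# Invoice workers are not registered by name; list their pids with
# DynamicSupervisor.which_children(InvoiceSettlement.WorkerSupervisor)
# See current state
:sys.get_state(pid)
# See message queue
Process.info(pid, :message_queue_len)
# See memory usage
Process.info(pid, :memory)
# Trace messages
:sys.trace(pid, true)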

Deployment and Operations

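# Creating releases is built into Elixir
mix release
# The release includes:
# - The Erlang runtime, so the target machine needs no Elixir install
# - Remote console access (the generated bin script has a remote command)
# - Graceful shutdown handling
# Not included out of the box:
# - Hot code upgrades (mix release does not generate them; they need extra tooling)
# - Health check endpoints and cluster formation (add these in your application)
# Deploy with zero downtime (needs hot upgrade tooling on top of mix release):
# 1. Upload new release
# 2. Connect to running system
# 3. Hot upgrade to new version
# 4. No restart required!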

Summary: Why Elixir Changes the Game

What We Built vs What We Got

We Wrote:

  • ~150 lines of business logic
  • ~35 lines of supervision setup
  • ~50 lines of service definitions

We Got for Free:

  • Automatic process supervision
  • Built-in retry and recovery
  • Process isolation and fault tolerance
  • Hot code updates
  • Distributed system capabilities
  • Excellent monitoring and debugging tools
  • Graceful shutdown handling
  • Memory management
  • Garbage collection per process

The Real Business Impact

Time to Market: 80% faster development
Operational Complexity: 90% reduction in infrastructure code
Debugging: Built-in tools show you exactly what’s happening
Scalability: Linear scaling across multiple machines
Reliability: System self-heals from most failure modes

What About the Trade-offs?

Learning Curve: Functional programming and OTP concepts take time to master
Ecosystem: Smaller than Go’s, but growing rapidly
Performance: Slightly lower raw throughput than Go, but excellent concurrency
Hiring: Fewer Elixir developers in the market (though this is changing)


In Part 5, we’ll do a detailed side-by-side comparison and help you decide which approach is right for your team and use case. We’ll also look at real-world examples of companies using each approach and what they learned.

The fundamental question isn’t “which is faster?” but rather “which lets you build reliable systems with less effort and stress?”

Coming up in Part 4: the same system design in Rust. In Part 5 we will do the detailed comparison, and if you still like the series, we can do some benchmarking as well. Part 4 is still in progress; once it’s here, I’ll update the link, so maybe subscribe if you are interested.
