Image Pipeline Optimization

Organization: Home Partners of America (HPA)
Timeline: Approx. 2 weeks

Results

AWS cost savings of approx. $8,000/month: compute spend dropped from ~$8,300 to less than $30/month.
Runtime of image-processing ETL batch jobs reduced by 99.5%, from ~80 hours to ~5 minutes per task.
Resolved an intermittent, customer-facing issue of production home listings missing photos, eliminating a crucial sales blocker.
Silenced noisy alarms plaguing the data engineering team, bringing peace.
Enabled machine learning initiatives (e.g., listing factor inference and enrichment) that were previously blocked.

Problem

HPA operated a complex single-family rental investment platform that relied on massive data ingestion from MLS feeds to evaluate property viability and display listings. By 2023, the engineering team had scaled from 5 to 50 MLS feeds, but the underlying image processing pipeline had not kept pace with that growth. This pipeline downloaded photos from URLs provided by the feeds, created resized versions, and saved them to internal S3 stores for website and data science use.

The legacy system processed images sequentially, creating a severe bottleneck. Tasks deployed to AWS ECS Fargate, originally intended to run for a few minutes at a time, frequently ran for over 80 hours. This caused:

Wasted Spend: Hundreds of ECS tasks were running simultaneously, resulting in approximately $8,300/month in wasted compute costs (based on 2022 rates).
Operational Risk: Website listings went live without photos for days until either a scheduled job happened to populate them, or the website team requested an ad-hoc pull for surfaced listings missing photos.
Technical Debt: The system relied on “band-aid” fixes, such as Lambda functions that forcibly terminated long-running tasks, rather than architectural improvements.

Solution

I audited the pipeline and found that the workload was almost entirely I/O-bound (network latency during downloads), not CPU- or memory-bound. Instead of rewriting the entire architecture, I optimized the logic within the existing AWS ECS Fargate environment using multithreading.

I broke the ETL process into independent units of work, one per image URL. The resulting modular function could then be submitted to a concurrent.futures thread pool, so many images could be downloaded and processed simultaneously¹, reducing runtime by ~200x.
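The per-URL approach can be illustrated with a minimal sketch. This is not the production code: `process_image`, `process_all`, the `max_workers` value, and the example URLs are all hypothetical, and the download, resize, and S3 upload steps are simulated with `time.sleep` so the example runs without network access or AWS credentials.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_image(url: str) -> str:
    """Hypothetical unit of work for one image URL.

    A real version would download the photo, create resized copies,
    and upload them to S3. Here the network wait is simulated.
    """
    time.sleep(0.05)  # stand-in for download latency; blocking waits release the GIL
    return f"processed:{url}"


def process_all(urls: list[str], max_workers: int = 32):
    """Run process_image over all URLs using a pool of worker threads."""
    results: dict[str, str] = {}
    failures: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported per image.
        futures = {pool.submit(process_image, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL should not abort the batch
                failures[url] = exc
    return results, failures


if __name__ == "__main__":
    urls = [f"https://example.com/photo_{i}.jpg" for i in range(64)]
    start = time.perf_counter()
    done, failed = process_all(urls)
    elapsed = time.perf_counter() - start
    print(f"{len(done)} ok, {len(failed)} failed in {elapsed:.2f}s")
```

Because each URL is handled independently, a failed download is recorded against that URL and the rest of the batch continues. Threads suit this workload because the time is spent waiting on the network rather than executing Python code.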

This performance gain also allowed the RAM allocated to tasks to be reduced, further minimizing costs.

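¹ Technical nit: not truly simultaneously, because the Python GIL lets only one thread execute Python bytecode at a time. Threads still overlap while they wait on network I/O, which is where this workload spent its time.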