Data Ingestion Architecture

This document describes the end-to-end data ingestion pipeline for Open Estate AI, from scraping real estate data to making it searchable through vector embeddings.

Overview

The ingestion architecture consists of two main pipelines working together:

Scraping Pipeline - Collects real estate data from various sources
Data Lake Pipeline - Processes, transforms, and indexes data for search

Complete Ingestion Flow

flowchart TB
    subgraph Scraping["📥 Scraping Layer"]
        direction TB
        User([👤 User/Client])

        subgraph AppRunner["🚀 AWS App Runner"]
            direction LR
            API["🌐 FastAPI<br/>Port 8080"]
            Agent["🤖 AI Agent<br/>Scraper"]
            MCP["⚙️ MCP Server<br/>Scraping Logic"]
            Browser["🌍 Playwright<br/>Headless Browser"]
        end

        External["🏢 Data Sources<br/><i>UP RERA, etc.</i>"]
    end

    subgraph DataLake["☁️ AWS Data Lake Infrastructure"]
        direction TB

        subgraph Storage["💾 Storage Layer"]
            direction LR
            S3Raw[("📦 S3 Raw Bucket<br/>JSON Data<br/><i>year/month/day</i>")]
            S3Vectors[("🔍 S3 Vectors<br/>Embeddings<br/><i>384 dimensions</i>")]
        end

        subgraph Processing["⚙️ Processing Pipeline"]
            direction TB
            Lambda["🔧 Lambda Function<br/>Ingest & Transform<br/><i>Python 3.12</i>"]
            SageMaker["🤖 SageMaker Endpoint<br/>all-MiniLM-L6-v2<br/><i>Embedding Model</i>"]
        end

        subgraph Search["🔎 Search Layer"]
            direction TB
            VectorIndex["📊 Vector Index<br/>property-research<br/><i>Cosine Similarity</i>"]
        end
    end

    subgraph Applications["🎯 Application Layer"]
        SearchAgent["🔍 Search Agent<br/><i>Multi-Agent System</i>"]
        UIApp["💻 UI Applications<br/><i>Web, Mobile</i>"]
    end

    User -->|"1️⃣ HTTP GET<br/>?max_projects=N"| API
    API -->|"2️⃣ Coordinate"| Agent
    Agent -->|"3️⃣ Call MCP Tool"| MCP
    MCP -->|"4️⃣ Launch"| Browser
    Browser -->|"5️⃣ Scrape HTML"| External
    External -->|"6️⃣ Return Data"| Browser
    Browser -->|"7️⃣ Parse JSON"| MCP
    MCP -->|"8️⃣ Return Results"| Agent
    Agent -->|"9️⃣ Upload NDJSON"| S3Raw

    S3Raw -->|"🔟 Trigger<br/>S3 Event"| Lambda
    Lambda -->|"1️⃣1️⃣ Read<br/>raw_text"| S3Raw
    Lambda -->|"1️⃣2️⃣ Get<br/>Embeddings"| SageMaker
    SageMaker -->|"1️⃣3️⃣ Return<br/>Vectors"| Lambda
    Lambda -->|"1️⃣4️⃣ Store<br/>with Metadata"| S3Vectors
    S3Vectors -->|"1️⃣5️⃣ Index"| VectorIndex

    SearchAgent -->|"1️⃣6️⃣ Query<br/>Natural Language"| VectorIndex
    UIApp -->|"1️⃣6️⃣ Query<br/>Natural Language"| VectorIndex
    VectorIndex -->|"1️⃣7️⃣ Return<br/>Similar Properties"| SearchAgent
    VectorIndex -->|"1️⃣7️⃣ Return<br/>Similar Properties"| UIApp

    classDef userStyle fill:#4A90E2,stroke:#2E5C8A,stroke-width:3px,color:#fff
    classDef apiStyle fill:#FF6B6B,stroke:#C92A2A,stroke-width:3px,color:#fff
    classDef agentStyle fill:#A855F7,stroke:#7C3AED,stroke-width:3px,color:#fff
    classDef mcpStyle fill:#10B981,stroke:#059669,stroke-width:3px,color:#fff
    classDef browserStyle fill:#F59E0B,stroke:#D97706,stroke-width:3px,color:#fff
    classDef storageStyle fill:#3B82F6,stroke:#1E40AF,stroke-width:3px,color:#fff
    classDef processStyle fill:#F59E0B,stroke:#D97706,stroke-width:3px,color:#fff
    classDef searchStyle fill:#8B5CF6,stroke:#6D28D9,stroke-width:3px,color:#fff
    classDef externalStyle fill:#14B8A6,stroke:#0F766E,stroke-width:3px,color:#fff
    classDef appStyle fill:#EC4899,stroke:#BE185D,stroke-width:3px,color:#fff
    classDef scrapingGroupStyle fill:#FFE6E6,stroke:#FF6B6B,stroke-width:3px,color:#C92A2A
    classDef appRunnerStyle fill:#232F3E,stroke:#FF9900,stroke-width:3px,color:#FF9900
    classDef datalakeStyle fill:#FFF9E6,stroke:#FF9900,stroke-width:3px,color:#232F3E
    classDef storageGroupStyle fill:#E6F7FF,stroke:#1890FF,stroke-width:3px,color:#003A70
    classDef processGroupStyle fill:#FFF7E6,stroke:#FA8C16,stroke-width:3px,color:#AD4E00
    classDef searchGroupStyle fill:#F0E6FF,stroke:#9254DE,stroke-width:3px,color:#531DAB
    classDef appGroupStyle fill:#FCE7F3,stroke:#EC4899,stroke-width:3px,color:#9F1239

    class User userStyle
    class API apiStyle
    class Agent agentStyle
    class MCP mcpStyle
    class Browser browserStyle
    class S3Raw,S3Vectors storageStyle
    class Lambda,SageMaker processStyle
    class VectorIndex searchStyle
    class External externalStyle
    class SearchAgent,UIApp appStyle
    class Scraping scrapingGroupStyle
    class AppRunner appRunnerStyle
    class DataLake datalakeStyle
    class Storage storageGroupStyle
    class Processing processGroupStyle
    class Search searchGroupStyle
    class Applications appGroupStyle

Pipeline Components

1. Scraping Layer

Purpose: Collect real estate data from various sources (UP RERA, property portals, etc.)

Components:

- FastAPI Service (Port 8080)
  - HTTP endpoint for triggering scrapes
  - Accepts max_projects parameter
  - Returns scraping status and metadata
- AI Agent Scraper
  - Coordinates the scraping workflow
  - Uses AWS Bedrock (Claude Haiku) for intelligent scraping
  - Manages retry logic and error handling
- MCP Server (Model Context Protocol)
  - Implements scraping logic
  - Handles pagination and data extraction
  - Returns lightweight metadata to avoid token limits
- Playwright Browser
  - Headless browser automation
  - Memory-optimized for cloud deployment
  - Extracts structured data from web pages

Infrastructure:

AWS App Runner: Serverless container service (1 vCPU, 2 GB RAM)
AWS ECR: Container image registry
IAM Roles: Bedrock and S3 access permissions

Data Format:

Each scraped record includes a raw_text field for vector search:

"Project Name: X | RERA Number: Y | Promoter: Z | Type: A | District: B | ..."
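As a minimal sketch of how such a raw_text string could be assembled, the function below joins labeled fields with " | ". The labels come from the sample above; the record keys (project_name, rera_number, and so on) and the choice to skip empty fields are assumptions made for illustration, not the scraper's actual schema.

```python
from typing import Mapping

# (label, record key) pairs. The labels mirror the sample string above;
# the record keys are hypothetical.
FIELDS = [
    ("Project Name", "project_name"),
    ("RERA Number", "rera_number"),
    ("Promoter", "promoter"),
    ("Type", "project_type"),
    ("District", "district"),
]


def build_raw_text(record: Mapping[str, object]) -> str:
    """Join the non-empty fields of a record into one 'Label: value | ...' string."""
    parts = []
    for label, key in FIELDS:
        value = record.get(key)
        # Assumption: missing or blank fields are omitted rather than emitted empty.
        if value is None or str(value).strip() == "":
            continue
        parts.append(f"{label}: {str(value).strip()}")
    return " | ".join(parts)
```

Keeping every field in one labeled string means the embedding model sees both the attribute name and its value, which is what allows a query such as "projects in district B" to match on meaning.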

2. Storage Layer

Purpose: Store raw and processed data in appropriate formats

Components:

- S3 Raw Bucket
  - Stores scraped data in NDJSON format
  - Partitioned by date: year=YYYY/month=MM/day=DD/
  - Versioning enabled for data recovery
  - Automatic encryption (SSE-S3)
- S3 Vectors Bucket
  - Specialized vector database
  - Stores 384-dimensional embeddings
  - Metadata includes project details
  - Optimized for similarity search
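To make the "384-dimensional embeddings plus metadata" pairing concrete, here is a small sketch that validates an embedding and packages it with its project details. The key/data/metadata layout and the function name are assumptions for illustration; the exact request shape should be checked against the S3 Vectors API reference.

```python
from typing import Mapping, Sequence

EMBEDDING_DIM = 384  # dimensionality stated for the S3 Vectors bucket


def make_vector_record(
    key: str, vector: Sequence[float], metadata: Mapping[str, object]
) -> dict:
    """Validate an embedding and pair it with its metadata (hypothetical layout)."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
    return {
        "key": key,
        # Coerce to plain floats so the record serializes cleanly to JSON.
        "data": {"float32": [float(x) for x in vector]},
        "metadata": dict(metadata),
    }
```

Rejecting a wrong-sized vector before it is stored surfaces a model or configuration mismatch at write time rather than at query time.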

Data Partitioning:

s3://bucket/prefix/year=2025/month=11/day=16/timestamp_runid.json
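A key following this layout could be built as in the sketch below. The timestamp format and the run_id argument are assumptions, since the document only shows the placeholder timestamp_runid; the function also expects a timezone-aware datetime so the date partition is computed in UTC.

```python
from datetime import datetime, timezone


def build_raw_key(prefix: str, run_id: str, now: datetime) -> str:
    """Build a date-partitioned object key: prefix/year=/month=/day=/<timestamp>_<runid>.json."""
    ts = now.astimezone(timezone.utc)  # 'now' is assumed to be timezone-aware
    stamp = ts.strftime("%Y%m%dT%H%M%SZ")  # hypothetical timestamp format
    return (
        f"{prefix.strip('/')}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{stamp}_{run_id}.json"
    )
```

For example, build_raw_key("prefix", "run42", datetime(2025, 11, 16, 9, 30, tzinfo=timezone.utc)) yields "prefix/year=2025/month=11/day=16/20251116T093000Z_run42.json". The key=value folder names let query engines that understand Hive-style partitions prune by date.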

3. Processing Pipeline

Purpose: Transform raw data into searchable vectors

Components:

- Lambda Function
  - Triggered automatically by S3 events
  - Runtime: Python 3.12
  - Memory: 1024 MB
  - Timeout: 300 seconds
  - Parallel processing: 10 workers (ThreadPoolExecutor)
- SageMaker Endpoint
  - Model: sentence-transformers/all-MiniLM-L6-v2
  - Instance: ml.t2.medium
  - Output: 384-dimensional vectors
  - Optimized for semantic similarity

Processing Steps:

1. Read NDJSON file from S3 Raw
2. Parse each record's raw_text field
3. Generate embeddings via SageMaker
4. Store vectors with metadata in S3 Vectors
5. Retry failed records (up to 3 attempts)
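The parse-and-embed part of these steps can be sketched as follows. The embed callable stands in for the SageMaker call and is injected so the example runs without AWS access; the function names and the output shape are assumptions, while the 10-worker ThreadPoolExecutor matches the configuration listed above.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Vector = list[float]


def parse_ndjson(body: str) -> list[dict]:
    """Parse newline-delimited JSON, ignoring blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def embed_records(
    records: list[dict],
    embed: Callable[[str], Vector],  # stand-in for the SageMaker endpoint call
    max_workers: int = 10,
) -> list[dict]:
    """Embed each record's raw_text in parallel and pair the vector with its record."""
    texts = [record["raw_text"] for record in records]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so zip() below stays aligned.
        vectors = list(pool.map(embed, texts))
    return [{"vector": v, "metadata": r} for r, v in zip(records, vectors)]
```

Threads suit this workload because each embedding request spends most of its time waiting on the network, so ten requests can be in flight at once within a single Lambda invocation.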

Error Handling:

Transient errors (timeout, throttling): Retry
Permanent errors (invalid JSON): Skip and log
Partial failures: Continue processing remaining records
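One way to express these three rules is sketched below. TransientError, the result dictionaries, and the exponential backoff are assumptions for illustration; the limit of three attempts comes from the processing steps in this document.

```python
import json
import time
from typing import Callable, Iterable


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or throttling."""


def process_line(
    line: str,
    embed: Callable[[str], list],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> dict:
    """Process one NDJSON line: skip permanent errors, retry transient ones."""
    try:
        text = json.loads(line)["raw_text"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Permanent error: retrying cannot fix malformed input, so skip and report.
        return {"status": "skipped", "reason": type(exc).__name__}
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "vector": embed(text), "attempts": attempt}
        except TransientError:
            if attempt < max_attempts:
                # Assumed backoff schedule: base_delay, then 2x, 4x, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    return {"status": "failed", "attempts": max_attempts}


def process_batch(lines: Iterable[str], embed: Callable[[str], list], **kwargs) -> list:
    """Process every non-blank line; one bad record never aborts the rest."""
    return [process_line(line, embed, **kwargs) for line in lines if line.strip()]
```

Returning a status per record instead of raising keeps partial failures visible in the logs while the rest of the file continues to be processed.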

4. Search Layer

Purpose: Enable fast semantic search over property data

Components:

- Vector Index
  - Name: property-research
  - Dimensions: 384
  - Distance Metric: Cosine similarity
  - Indexed automatically from S3 Vectors

Query Capabilities:

Natural language queries: "3 bedroom apartments in Noida"
Semantic similarity matching
Metadata filtering (price, location, type)
Top-K results with similarity scores
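The vector index performs this ranking itself. The in-memory sketch below only illustrates what cosine-similarity top-K search with an exact-match metadata filter computes. It assumes the query has already been embedded with the same model, and the item layout ("vector" plus "metadata") and the where argument are hypothetical.

```python
import math
from typing import Mapping, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    items: Sequence[Mapping],
    k: int = 5,
    where: Optional[Mapping[str, object]] = None,
) -> list:
    """Return up to k (score, item) pairs, best first, after exact-match metadata filtering."""
    scored = []
    for item in items:
        meta = item["metadata"]
        # Drop items whose metadata does not match every requested field.
        if where and any(meta.get(field) != value for field, value in where.items()):
            continue
        scored.append((cosine_similarity(query, item["vector"]), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Filtering before scoring mirrors how a metadata filter narrows the candidate set, so a constraint such as a district match is never overridden by a high similarity score.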

Data Flow Details

Step-by-Step Process

Phase 1: Data Collection (Steps 1-9)

1. User triggers scrape via HTTP API
2. FastAPI routes request to AI Agent
3. Agent invokes MCP Server tool
4. MCP launches Playwright browser
5. Browser scrapes target website (UP RERA)
6. Website returns HTML data
7. Browser parses and extracts JSON
8. MCP returns results to Agent
9. Agent uploads NDJSON to S3 Raw bucket

Phase 2: Processing (Steps 10-15)

10. S3 event triggers Lambda function
11. Lambda reads raw_text from each record
12. Lambda sends text to SageMaker endpoint
13. SageMaker returns vector embeddings
14. Lambda stores vectors with metadata in S3 Vectors
15. Vectors are automatically indexed

Phase 3: Search (Steps 16-17)

16. Applications query vector index with natural language
17. System returns similar properties based on semantic similarity

Contributing

For questions, improvements, or bug reports:

- Open an issue on GitHub
- Join our Slack community

License

See repository root for license information.