Comprehensive System Analysis
Verification-first analysis by 6-agent team · Branch: 10014_1 · 2026-03-03 · Every claim verified against code with file:line references
1. Executive Summary
The Industrial Labs Portal is a multi-tenant SaaS platform for private-fund performance analytics. It ingests fund data (Excel/CSV), runs Bayesian MCMC estimation on a PhoenixNAP Ray cluster, and delivers posterior results as interactive reports and downloadable Excel workbooks.
Architecture Overview
graph LR
Browser["Browser / Excel Add-in"] --> ALB["ALB :443\nportal.theaportfolioanalytics.com"]
ALB --> EC2a["EC2 #1\nuvicorn :8080"]
ALB --> EC2b["EC2 #2\nuvicorn :8080"]
EC2a --> RDSProxy["RDS Proxy\nportal_app + postgres pools"]
EC2b --> RDSProxy
RDSProxy --> Aurora["Aurora PostgreSQL\nus-east-2"]
EC2a --> S3["S3\nindustrial-labs-client-files"]
EC2b --> S3
EC2a --> Cognito["AWS Cognito\nTOTP MFA"]
EC2a -.->|Tailscale VPN| Cluster["PhoenixNAP\nRay Cluster"]
EC2a --> SM["Secrets Manager"]
Key Technology Stack
| Layer | Technology | Details |
| Web Framework | FastAPI + Uvicorn | main.py:26 — 1 worker per instance (not 4 as docs claim) |
| Database | Aurora PostgreSQL + RDS Proxy | Dual connection pool: superuser + RLS-enforced database.py:114-150 |
| Compute | JAX + NumPyro + Ray | MCMC NUTS sampling on bare metal model.py (~7200 lines) |
| Storage | S3 (industrial-labs-client-files) | config.py:69 |
| Auth | AWS Cognito + TOTP MFA | OAuth2 code flow with RS256 JWT auth_routes.py |
| Billing | Stripe | Checkout + webhooks + custom invoicing stripe_billing.py |
| VPN | Tailscale (WireGuard) | Portal to cluster: 100.75.86.48 |
| Admin UI | AG Grid v31.0.1 + Tailwind | Server-side pagination admin/base.html |
| Excel Add-in | Office.js | API key auth, bidirectional sync excel-addin/ |
2. Stats Dashboard
Corrections to previous docs: Old reports claimed 14 RLS tables (actual: 18), 30+ tables (actual: 33 active + 3 deprecated), 4 uvicorn workers (actual: 1), S3 bucket thea-portfolio-analytics-storage (actual: industrial-labs-client-files).
3. Security Findings
Discovered by adversarial cross-checking between 6 analysis agents. Each finding independently confirmed by 2+ agents.
| Sev | Finding | Location | Details | Confirmed By |
| P0 |
require_role("admin") passes for ANY user |
billing_routes.py:540 |
"admin" is not in ROLE_HIERARCHY (dependencies.py:185-190). get("admin",0) returns 0, so any authenticated user passes. Endpoint POST /api/admin/backfill-invoice-pdfs exposed to all. Uses superuser pool (bypasses RLS), no org_id filter. Limited blast radius: PDF regeneration only. |
verifier portal-core admin-client |
| P1 |
All billing routes lack RBAC |
billing_routes.py |
create-checkout-session, update-subscription, create-evergreen-fund, generate-forecast use session-only auth with NO role checks. Any viewer can create checkouts and modify subscriptions. |
portal-core verifier |
| P1 |
Cluster logging silently fails (missing auth token) |
run_job.py:69-70 |
_portal_log_worker() does NOT send X-Cluster-Token header but model_run_routes.py:551 requires it. All cluster log delivery fails 403. Fix: add req.add_header('X-Cluster-Token', os.getenv('CLUSTER_API_TOKEN','')) |
cluster-model verifier |
| P1 |
No CSRF middleware despite docs claiming complete |
main.py |
SCALE_PLAN Phase 1c marked "Session Timeout & CSRF" as COMPLETED. Session timeout done (max_age=3600) but NO CSRFMiddleware in main.py. |
verifier |
| P1 |
RLS startup verification gap |
main.py:160-191 |
Startup checks RLS on 12 tables but 18 actually have RLS. 6 tables from migrations 005, 012, 013, 019 never verified. |
verifier database |
| P2 |
Unauthenticated endpoints bypass RLS |
portal_routes.py:569-777 |
/api/funds/{hash}/last-posteriors and /api/funds/{hash}/market-stats have NO auth. market-stats uses superuser pool to bypass RLS. |
portal-core admin-client verifier |
| P2 |
Cluster token timing attack |
model_run_routes.py:13-20 |
Uses == not hmac.compare_digest(). Migration Lambda correctly uses hmac. |
portal-core |
| P2 |
/health/all leaks infrastructure details |
health_routes.py |
Exposes S3 bucket name and Ray address without authentication. |
infrastructure |
| P2 |
Session cookie secure flag not set |
main.py:48 |
SessionMiddleware does not set https_only=True. Cookies sent over HTTP. |
portal-core |
| P2 |
6 files bypass s3_helpers.py |
database.py, admin_routes.py, job_worker.py, portal_worker.py |
Create own boto3.client('s3') instead of using centralized s3_helpers. Inconsistent config. |
verifier |
| P2 |
Stale S3 bucket in run_job.py |
run_job.py:1958 |
Default industrial-labs-portal-uploads differs from industrial-labs-client-files everywhere else. |
portal-core |
| P2 |
Dual-timeout inconsistency |
portal_worker.py:47 vs run_job.py |
Portal: 3600s, Cluster: 7200s. Jobs can appear timed out but still complete. |
infrastructure |
| P3 |
12 undocumented API routes |
config_history_routes.py |
All check is_super_admin() — no auth gap, just docs gap. |
verifier |
Resolved (Non-Issues)
| Claim | Resolution | Resolved By |
RESOLVED 0-indexed vs 1-indexed bug in compute_estimation_status() |
NOT A BUG. Full trace: portal_worker.py:749 sets end_week = row_count (1-indexed). Passes through cluster unchanged. database.py:9151 compares last_estimated_week == current_row_count = 55 == 55 = CORRECT. SYSTEM_REFERENCE.md claim is outdated. |
database infrastructure |
4. Portal Core
Analyzed by: portal-core | Full findings: findings_portal_core.md
4.1 Application Entry Point & Startup
FastAPI app at main.py:26-30, titled "Thea Portfolio Analytics Portal" v0.1.0. Served via Uvicorn on port 8080 (run.sh:12).
Startup Sequence (main.py:125-330)
- Pool reset (
main.py:133-135) — avoids stale psycopg2 connections after fork
- Table creation (
main.py:139-156) — 15+ ensure_* functions
- RLS verification (
main.py:160-191) — checks 12 of 18 RLS tables GAP
- Invoice backfill (
main.py:201-321) — regenerates all invoice PDFs at startup
- Public IP logging (
main.py:324-329)
4.2 Middleware Chain (5 Layers)
| # | Middleware | Location | Purpose |
| 1 | AddinNoCacheMiddleware | main.py:65-76 | No-cache for /static/addin/* |
| 2 | SlowAPIMiddleware | main.py:54 | Rate limiting (100/min default) |
| 3 | LoggingMiddleware | main.py:49, middleware.py:24-71 | Request ID, user/org, timing |
| 4 | SessionMiddleware | main.py:48 | Starlette sessions (max_age=3600s) |
| 5 | CORSMiddleware | main.py:33-47 | Portal + Office/Excel origins |
4.3 Authentication (3 Mechanisms)
| Mechanism | Users | How | Location | Issues |
| Cognito Session | Browser | OAuth2 code flow, RS256 JWT, TOTP MFA, session cookie 1hr | auth_routes.py:44-427 | Cookie secure flag not set |
| API Key | Excel | sk_live_ prefix, SHA-256 hash in DB, per-key rate limits | dependencies.py:86-91 | Key in URL query param |
| Cluster Token | Cluster | X-Cluster-Token header, shared secret | model_run_routes.py:13-20 | == not hmac |
RBAC Hierarchy (dependencies.py:185-190)
| Role | Level | Access |
| super_admin | 4 | All + admin dashboard. Always verified against DB. |
| org_admin | 3 | Org management, API keys, billing |
| member | 2 | Upload, settings, billing |
| viewer | 1 | Dashboard, reports (read-only) |
4.4 Background Workers
| Worker | Mechanism | Purpose |
| PortalWorker | SSH/SCP to cluster | Production job submission |
| JobWorker | Ray direct via Tailscale | Alternative path |
| ForecastWorker | Background thread | JAX scenario forecasting |
5. Database Layer
Analyzed by: database | Full findings: findings_database.md (1400+ lines)
5.1 Dual Connection Pool
| Pool | Variable | Role | RLS | Config |
| Superuser | _pool | postgres | Bypasses | min=2, max=20 database.py:129-137 |
| Org-Scoped | _org_pool | portal_app | Enforced | min=2, max=20 database.py:140-150 |
get_org_db(org_id) sets SET LOCAL app.current_org_id per transaction (database.py:218). Multi-tenancy enforced at 3 layers: PostgreSQL RLS, application WHERE clauses, dual pool separation.
5.2 Table Inventory (33 Active + 3 Deprecated)
| Subsystem | Tables | RLS Count |
| Auth/Users | organizations, users, api_keys, password_reset_tokens, access_logs, admin_logs | 3 |
| Fund Data | uploads, upload_funds, fund_identities, fund_identity_aliases, fund_data_versions, fund_events, fund_estimation_history | 3 |
| Model Runs | model_runs, model_run_logs, model_run_results, model_jobs, model_config_history | 1 |
| Deliverables | deliverables, posterior_results | 2 |
| Billing | product_selections, custom_invoices | 2 |
| Market/Forecast | fund_market_configs, ticker_returns_cache, fund_market_config_history, simulation_results | 2 |
| Cluster | compute_clusters, cluster_operations | 0 |
| Evergreen | evergreen_funds, evergreen_fund_members | 2 |
| Misc | organization_videos, organization_branding | 1 |
Deprecated: subscriptions → organizations.subscription_status, funds → fund_identities, organization_fund_access → product_selections. migration 010
No stored procedures/triggers — all logic in Python (database.py ~12,400 lines). No automated migration runner — 20 SQL files applied manually.
5.3 Identity Hash System (3+3 Tiers)
| Hash | Formula | Purpose | Location |
file_identity_hash | SHA256("file|{org_id.lower().strip()}|{filename.strip()}") | File within org | database.py:2572-2586 |
fund_identity_hash | SHA256("fund|{org_id.lower().strip()}|{filename.strip()}|{fund_name.strip()}") | Primary identifier: uploads→versions→posteriors→deliverables | database.py:2589-2608 |
data_hash | SHA256("shape:{shape}|cols:{sorted_cols}|"+csv_data) | Content deduplication | database.py:2541-2569 |
Extended: evergreen_identity_hash, eg_subfund_identity_hash, config_hash (SOC 2 tamper detection).
fund_identities registry: PK: fund_identity_hash VARCHAR(64). 6 FK constraints. Alias system for renames (max chain 5). migration 005
5.4 RLS System (18 Tables)
All use: USING (organization_id::text = current_setting('app.current_org_id', true)). Fail-safe: unset = empty = matches nothing.
Migration 003 (12): users, uploads, subscriptions, deliverables, api_keys, access_logs, organization_videos, organization_fund_access, fund_data_versions, product_selections, custom_invoices, model_jobs
Later (6): posterior_results & fund_identities (005), fund_market_configs (012), simulation_results (013), evergreen_funds & evergreen_fund_members (019)
Note: evergreen_fund_members uses subquery through evergreen_funds (different pattern).
6. Cluster & Model Engine
Analyzed by: cluster-model | Full findings: findings_cluster_model.md
6.1 Codebase (~25,000 lines)
| File | Lines | Purpose |
| model/model.py | ~7200 | Core NumPyro MCMC model |
| model/model_hierarchical.py | ~6000 | Joint multi-fund (WIP) |
| run_job.py | ~3100 | Headless execution entry point |
| fund_processor.py | ~3100 | Preprocessing, OLS, GARCH |
| model/model_evergreen.py | ~1700 | Staggered entry/exit |
| fund_rollup_utils.py | ~1600 | Rollup fund synthesis |
| cluster_post_processor.py | ~1200 | Posteriors, S3, finalize |
| internal_report_generator.py | ~600 | Plotly HTML reports |
6.2 Model Modes
Routing at run_job.py:931:
| Mode | Description |
| single | Per-fund independent estimation (default) |
| rollup_only | Aggregate into synthetic Fund_0 |
| empirical_bayes | Fund_0 priors inform individual fits |
| joint_global | True hierarchical joint estimation via plate |
| evergreen | Staggered entry/exit model |
Portal always forces full parallelization: override_external=True, weeks_before_batch=9999 (run_job.py:1216-1226).
6.3 NumPyro MCMC Model
- Priors: 2 categories + initial estimates + per-parameter overrides
- 5 Likelihoods: fund return, comparable asset, distributions, GP NAVs (t-2 lag), secondary+third-party
- Data scoping: Closure pattern (NOT scan carry)
- 2 Return forms: R_to_r (log), direct_r
- 4+ Noise scaling: F*sqrt(h_t), sigma_F, kappa constant, kappa time-varying
- Conditional likelihoods:
handlers.mask()
6.4 Cluster-Portal Communication
| Channel | Direction | Auth | Status |
| Logging API | Cluster→Portal | BROKEN Missing token | Fails 403 run_job.py:69 |
| Finalize API | Cluster→Portal | X-Cluster-Token | Working cluster_post_processor.py:1095 |
| S3 Direct | Cluster→S3 | IAM keys | Working |
| Pre-signed URLs | Cluster→Portal→S3 | X-Cluster-Token | Results pickle only (NOT HTML) |
7. Infrastructure
Analyzed by: infrastructure | Full findings: findings_infrastructure.md (1000+ lines)
7.1 AWS Architecture
| Service | Details | Source |
| Region | us-east-2 (NOT us-east-1 as aws_setup.md claims) | config.py:68 |
| ALB | HTTPS:443 → EC2:8080, ACM cert | deploy-to-asg.sh:425 |
| ASG | portal-asg, template lt-00c1bb1f01be3e06a | deploy-to-asg.sh:20,307 |
| Aurora | industrial-labs-portal.cvowe8wi8fif.us-east-2.rds.amazonaws.com | .env.example:2 |
| RDS Proxy | industrial-labs-portal-proxy.proxy-... | deploy-to-asg.sh:259 |
| S3 | industrial-labs-client-files | config.py:69 |
| Cognito | TOTP MFA, hosted UI, RS256 JWT | config.py:72-76 |
| Secrets Manager | 3 secrets: portal (13 keys), phoenixnap, tailscale | config.py:17 |
7.2 Deployment (deploy-to-asg.sh)
- Get ASG instance IPs (
:137-150)
- SQL migrations on first instance via direct Aurora (
:76-128)
- SCP portal code (
:163-195)
- Rsync cluster to PhoenixNAP — IP from DB dynamically (
:279-298)
- Rolling restart with ALB health check (
:200-260)
- Bake Golden AMI, keep 3 most recent (
:306-413)
No CI/CD. deploy-to-ec2.sh is legacy (still in repo).
7.3 Tailscale & PhoenixNAP
EC2 → PhoenixNAP via Tailscale WireGuard. IP: 100.75.86.48. Ray ports 6379/10001/8265 via VPN only.
PhoenixNAP: bare metal, Terraform-provisioned. 8-80 CPUs, 32-256GB RAM ($0.35-$1.09/hr).
Jobs: SCP pickle → SSH run_job.py. Systemd depends on tailscaled.
8. Admin & Client
Analyzed by: admin-client | Full findings: findings_admin_client.md
8.1 Templates (27 total)
Two bases: base.html (client, Tailwind) and admin/base.html (dark, AG Grid v31.0.1). 10 admin, 11 client, 6 standalone auth, 1 partial.
Notable: dashboard.html 351KB, cluster.html 2.1MB inline JS. cluster.html.bak 25K+ lines still present.
8.2 Admin Dashboard
9 pages, 20+ endpoints. All require super_admin. AG Grid with server-side pagination via createPaginatedGrid() (admin/base.html:72-265).
- Dashboard: stats + fund selections grid
- Organizations: list + detail with user role management
- Jobs: model run CRUD + cancel
- Cluster: Ray provisioning, worker management
- Deliverables: pipeline management
- Fund Timeline: re-versioning, purge
- Audit Log: all actions with IP/user-agent
8.3 Excel Add-in & Stripe Billing
Excel Add-in
Office.js TaskPaneApp. sk_live_* API key in localStorage. 4 manifests (v1.1.0.0). Bidirectional sync, auto-pull posteriors, regime builder.
Stripe Billing
SOC 2 compliant (no card data stored). Single line item (20-item cap workaround). Custom cumulative invoice PDFs. Webhooks: checkout.completed, subscription.updated/deleted, invoice.paid/failed. Evergreen fund CRUD.
9. Flow: Upload → Deliverable
graph TD
A["User uploads Excel\nupload_routes.py:149"] --> B["S3 storage\nindustrial-labs-client-files"]
B --> C["fund_data_versions\ndatabase.py:2435"]
C --> D["Identity hashes\ndatabase.py:2572-2608"]
D --> E["model_run created\nmodel_run_routes.py:24"]
E --> F["PortalWorker\nportal_worker.py:309"]
F --> G["SCP pickle to cluster"]
G --> H["run_job.py:2757\nRay init, FAIL LOUDLY"]
H --> I["fund_processor.py\nOLS, GARCH"]
I --> J["model.py MCMC\nNUTS sampling"]
J --> K["cluster_post_processor.py\nPosteriors to S3 parquet"]
K --> L["Finalize API\nmodel_run_routes.py:683"]
L --> M["posterior_results\n+ deliverables rows"]
Critical: Finalize API is the primary path from cluster to DB. Failure = data loss. Secondary idempotent fallback at portal_worker.py:1553.
10. Flow: Auth & RBAC
Full trace: flow_trace_2_auth_flow.md (771 lines)
graph TD
A["Login\nauth_routes.py:44"] --> B["Cognito OAuth2\nTOTP MFA"]
B --> C["Callback\nauth_routes.py:331"]
C --> D["JWT RS256 verify\nauth.py:46-59"]
D --> E["Session set\nuser, role, org_id"]
E --> F["require_auth\ndependencies.py:168"]
F --> G["require_role\ndependencies.py:193"]
G --> H["SET LOCAL org_id\ndatabase.py:218"]
H --> I["RLS scoped query"]
11. Flow: Excel Add-in
graph TD
A["Excel loads add-in\nmanifest.xml"] --> B["TaskPane\ntaskpane.html"]
B --> C["API key entered\nsk_live_*"]
C --> D["Key validated\nSHA-256 lookup"]
D --> E["Upload workbook\nexcel_api_routes.py:129"]
E --> F["S3 + hashes\ndatabase.py"]
F --> G["Poll status\nexcel_api_routes.py:482"]
G --> H["Pull posteriors\nauto-fill sheets"]
12. Flow: Admin Run
graph TD
A["Admin submits\nadmin/jobs.html"] --> B["model_run created\nstatus: pending"]
B --> C["process-pending\nbackground thread"]
C --> D["PortalWorker\nSCP + SSH"]
D --> E["run_job.py\nRay + MCMC"]
E --> F["Finalize API\nX-Cluster-Token"]
F --> G["DB writes:\nposteriors, deliverables"]
G --> H["Visible in\nadmin + client"]
Timeouts: Portal polls 3600s, cluster watchdog 7200s. GAP Jobs can appear timed out but still complete.
13. Doc vs Code Verification
160+ claims verified across SYSTEM_REFERENCE.md, system_architecture_report.html, SCALE_PLAN_1_26_2026.md, RECENT_TASKS_2_6_2026.md
Most reliable: system_architecture_report.html. Least reliable: SYSTEM_REFERENCE.md (wrong S3 bucket, wrong hash formula, stale deploy script, all line numbers wrong).
14. Known Issues & Discrepancies
Doc Fixes Needed
| # | Doc | Issue | Fix |
| 1 | SYSTEM_REFERENCE.md | Wrong S3 bucket | industrial-labs-client-files |
| 2 | SYSTEM_REFERENCE.md | Wrong hash formula | Add fund| prefix + normalization |
| 3 | SYSTEM_REFERENCE.md | Stale deploy script | Remove or link to actual |
| 4 | aws_setup.md | Wrong region | us-east-2 |
| 5 | system_architecture_report | 14 RLS tables | 18 |
| 6 | system_architecture_report | 4 uvicorn workers | 1 |
| 7 | SCALE_PLAN | CSRF complete | Not implemented |
Code Fixes Needed
| # | Sev | Fix | Location |
| 1 | P0 | Change require_role("admin") to require_super_admin() | billing_routes.py:540 |
| 2 | P1 | Add RBAC to billing routes | billing_routes.py |
| 3 | P1 | Add X-Cluster-Token to log worker | run_job.py:70 |
| 4 | P1 | Add CSRF middleware | main.py |
| 5 | P1 | Add 6 missing tables to RLS startup check | main.py:160 |
| 6 | P2 | Use hmac.compare_digest() for cluster token | model_run_routes.py:20 |
| 7 | P2 | Set https_only=True on SessionMiddleware | main.py:48 |
| 8 | P2 | Fix stale S3 bucket name | run_job.py:1958 |
15. Methodology
Verification-First Approach
- Phase 1 — Raw Discovery: 5 domain analysts independently documented their code sections
- Phase 2 — Adversarial Verification: Verifier checked every doc claim. Domain analysts cross-verified.
- Phase 3 — Cross-Checking: Agents challenged each other at system boundaries. Disagreements resolved by tracing data flows.
- Phase 4 — Synthesis: All findings compiled into this document.
Agent Team
| Agent | Scope | Key Outputs |
| portal-core | Routes, middleware, auth, RBAC | findings_portal_core.md, flow_trace_2_auth_flow.md |
| database | database.py, migrations, RLS, hashes | findings_database.md (1400+ lines) |
| cluster-model | Cluster app, MCMC, Ray | findings_cluster_model.md, findings_end_to_end_flows.md |
| infrastructure | Deploy scripts, AWS, Tailscale | findings_infrastructure.md (1000+ lines) |
| admin-client | Templates, Excel, Stripe | findings_admin_client.md, flow_trace_3_excel_addin.md |
| verifier | All docs vs all code | findings_verification.md, findings_boundary_crosscheck.md |
Principle: The CODE is the source of truth. Docs are hypotheses. Every claim verified with file:line references. Disagreements resolved by tracing actual data flows.
Source Files (UNDERSTANDING_THIS_PROJECT/)
- findings_portal_core.md, findings_database.md, findings_cluster_model.md
- findings_infrastructure.md, findings_admin_client.md, findings_verification.md
- findings_boundary_crosscheck.md, findings_end_to_end_flows.md
- flow_trace_2_auth_flow.md, flow_trace_3_excel_addin.md, flow_trace_4_admin_run.md