Spaces: moccaram / DataSynthis_ML_JobTask

moccaram committed 6 days ago

Commit 8ba081b · verified · 1 Parent(s): 93c1d7b

Replace v1 demo with v2 XGBoost-backed Gradio app (reference-backed rebuild)

Upgrades the Space to the v2 pipeline from github.com/moccaram/DataSynth. Real Gradio inference (not the hello-world template), XGBoost trained on triple-barrier labels + fractionally-differenced features, prominent caveat about ~36% directional accuracy when acting.

Files changed (31)

.gitattributes +1 -0
DataSynthis_ML_JobTask.ipynb +0 -0
README.md +16 -36
X_test_lstm.npy +0 -3
X_train_lstm.npy +0 -3
X_val_lstm.npy +0 -3
app.py +13 -15
feature_scaler.pkl → app_screenshot.png +2 -2
arima_model.pkl +0 -3
arima_order.pkl +0 -3
data/raw/AAPL_stock_data_2010_2024.csv +0 -0
data/raw/SPY_stock_data_2010_2024.csv +0 -0
data_preparation_metadata.json +0 -59
lstm_model.h5 +0 -3
requirements.txt +7 -0
src/__init__.py +7 -0
src/app.py +190 -0
src/cv.py +81 -0
src/data.py +52 -0
src/eval.py +76 -0
src/features.py +109 -0
src/labeling.py +219 -0
src/models/__init__.py +0 -0
src/models/arima_model.py +64 -0
src/models/baselines.py +89 -0
src/models/lstm_model.py +179 -0
src/models/xgb_model.py +59 -0
src/train.py +102 -0
y_test_lstm.npy +0 -3
y_train_lstm.npy +0 -3
y_val_lstm.npy +0 -3

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+app_screenshot.png filter=lfs diff=lfs merge=lfs -text

DataSynthis_ML_JobTask.ipynb DELETED

The diff for this file is too large to render. See raw diff

README.md CHANGED

@@ -1,47 +1,27 @@
 ---
-title: DataSynthis ML JobTask
-emoji: 🐢
-colorFrom: green
 colorTo: gray
 sdk: gradio
-sdk_version: 5.48.0
 app_file: app.py
 pinned: false
-license: apache-2.0
-short_description: Stock price forecasting ML demo for DataSynthis internship
 ---
-# 📈 DataSynthis ML JobTask
-Stock Price Forecasting with Baseline, Statistical, and ML Models
-## 🚀 Project Overview
-This project demonstrates a complete **time-series forecasting pipeline** using daily stock price data (2010–2024). It was developed as part of the **DataSynthis ML Internship Task**.
-We cover the full workflow:
-1. **Baseline Models** → Naïve Forecast, Simple Exponential Smoothing (SES)
-2. **Statistical Model** → ARIMA
-3. **ML / DL Models** → Prophet, LSTM
-4. **Evaluation** → Rolling-window accuracy metrics (RMSE, MAPE)
-5. **Deployment** → Interactive demo with Gradio (via Hugging Face Spaces)
-## 🛠️ Features
-- Data preprocessing & feature engineering (lags, volatility, RSI, MACD, Bollinger Bands, etc.)
-- Feature validation & pruning (correlation, VIF, outlier checks)
-- Unified comparison of models with a performance summary table
-- Visualizations: trends, normalized comparisons, total returns
-- Exportable datasets for reproducibility
-## 📊 Deliverables
-- **Notebook**: End-to-end workflow (data → models → evaluation)
-- **Models**: Naïve, SES, ARIMA, Prophet, LSTM
-- **Visualizations**: stock trends, indicators, correlations, performance plots
-- **Deployment**: Hugging Face Space with Gradio app
-## 📂 Repository Structure
-📁 DataSynthis_ML_JobTask
-├── app.py # Gradio demo app
-├── data/ # Preprocessed & engineered datasets
-├── notebooks/ # Jupyter notebooks with full pipeline
-├── models/ # Trained ARIMA / Prophet / LSTM models
-├── outputs/ # Plots, summary tables, feature files
-├── README.md # This file

 ---
+title: AAPL Triple-Barrier Direction Classifier
+emoji: 📊
+colorFrom: blue
 colorTo: gray
 sdk: gradio
+sdk_version: "4.44.0"
 app_file: app.py
 pinned: false
+license: mit
 ---
+# AAPL Triple-Barrier Direction Classifier (educational)
+Reference-backed financial-ML demo. XGBoost classifier trained on
+fractionally-differenced features and triple-barrier labels (López de Prado,
+*Advances in Financial Machine Learning*, Ch.3 + Ch.5).
+**This is an educational portfolio artifact, not a trading signal.**
+Test-set accuracy ~38% on a 3-class label set (random = 33%, p<0.05 in 3 of 5
+purged folds). Directional accuracy *when the model picks a side* is ~36% —
+worse than coin-flip. Do not trade real money on this.
+![Gradio interface](app_screenshot.png)
+Full source, technical writeup, and lessons-learned:
+[github.com/moccaram/DataSynth](https://github.com/moccaram/DataSynth).

X_test_lstm.npy DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:28e28884d7ade2318c01ffa836f14fe66dad42ffd29bcf7c39c589bc9d2ff5b4
-size 2739488

X_train_lstm.npy DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2bd3342b5569749c14cba69cbc1aae53369ccaaaf0502fc74de1a84c7495788c
-size 17228768

X_val_lstm.npy DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:88464f5ecbfcb80a36d3b7113599d3088f25bc11a38317e343f9868ed907704a
-size 1191968

app.py CHANGED

@@ -1,15 +1,13 @@
-import gradio as gr
-def greet(name):
-    return "Hello " + name + "!!"
-demo = gr.Interface(
-    fn=greet,
-    inputs="text",
-    outputs="text",
-    title="👋 Greeting Demo",
-    description="Enter your name to receive a warm greeting."
-)
-if __name__ == "__main__":
-    demo.launch()

+"""Hugging Face Spaces entry point. Delegates to src.app for the real interface."""
+import sys
+from pathlib import Path
+# Make src/ importable when the Space launches this file from the repo root.
+sys.path.insert(0, str(Path(__file__).resolve().parent))
+from src.app import build_interface
+if __name__ == "__main__":
+    demo = build_interface()
+    demo.launch()

feature_scaler.pkl → app_screenshot.png RENAMED

File without changes

arima_model.pkl DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d6787effc883e371477f02eecc8f5e48e9148a6b286e48af1eeee4f072eb04d9
-size 5295051

arima_order.pkl DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:efc90090103f31c21431c8a3d1ae6c66ca453551649bbf4488b706172c4277a4
-size 20

data/raw/AAPL_stock_data_2010_2024.csv ADDED

The diff for this file is too large to render. See raw diff

data/raw/SPY_stock_data_2010_2024.csv ADDED

The diff for this file is too large to render. See raw diff

data_preparation_metadata.json DELETED

@@ -1,59 +0,0 @@
-{
-  "dataset": {
-    "total_days": 3572,
-    "date_range": "2010-10-19 to 2024-12-27",
-    "features": 13,
-    "target": "target_return"
-  },
-  "split": {
-    "train_days": 2821,
-    "val_days": 251,
-    "test_days": 499,
-    "train_pct": 78.97536394176932,
-    "val_pct": 7.026875699888017,
-    "test_pct": 13.96976483762598
-  },
-  "features": [
-    "hl_range",
-    "log_return",
-    "spy_return",
-    "co_range",
-    "return_lag2",
-    "return_lag5",
-    "volatility_20d",
-    "volume_change",
-    "day_cos",
-    "day_of_week",
-    "day_sin",
-    "month_cos",
-    "rolling_beta"
-  ],
-  "prophet_regressors": [
-    "hl_range",
-    "spy_return",
-    "volatility_20d",
-    "rolling_beta",
-    "volume_change",
-    "co_range",
-    "day_cos",
-    "day_sin"
-  ],
-  "lstm_sequence_length": 60,
-  "last_prices": {
-    "train": 178.08999633789062,
-    "val": 128.41000366210938,
-    "test": 257.8299865722656
-  },
-  "files_created": [
-    "feature_scaler.pkl",
-    "train_prophet.csv",
-    "val_prophet.csv",
-    "test_prophet.csv",
-    "X_train_lstm.npy",
-    "y_train_lstm.npy",
-    "X_val_lstm.npy",
-    "y_val_lstm.npy",
-    "X_test_lstm.npy",
-    "y_test_lstm.npy"
-  ]
-}

lstm_model.h5 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d2e60dea878818cb88f7cd864b68daf9be6c10c80cea8ab0537e3662c48ed041
-size 1535336

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+gradio>=4.0
+matplotlib>=3.8
+numpy>=1.26,<3
+pandas>=2.1
+scikit-learn>=1.3
+scipy>=1.11
+xgboost>=2.0

src/__init__.py ADDED

	@@ -0,0 +1,7 @@

+"""DataSynth — reference-backed stock forecasting pipeline.
+Anchored to:
+- AFML (López de Prado) Ch.3 (labeling), Ch.5 (FFD), Ch.7 (purged CV)
+- Goodfellow et al. Ch.10 §10.11 (RNN optimization)
+- Jansen, *Machine Learning for Algorithmic Trading* Ch.19 (RNNs for time series)
+"""

src/app.py ADDED

	@@ -0,0 +1,190 @@

+"""Gradio demo — AAPL triple-barrier direction classifier (educational).
+Loads the XGBoost model (the headline winner in this study, mean test accuracy
+~38% vs 33% random) and lets the user pick any date in the available range to
+inspect the next-10-day direction prediction with class probabilities.
+This is a *portfolio artifact*. The directional accuracy when the model
+actually picks a side is ~36% — worse than random. Do not trade on this.
+"""
+from __future__ import annotations
+import io
+import sys
+import warnings
+from pathlib import Path
+warnings.filterwarnings("ignore")
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(ROOT))
+from src.data import load_aapl_with_spy, get_daily_vol
+from src.features import frac_diff_ffd
+from src.labeling import cusum_filter, get_events, get_bins, drop_labels
+from src.models.xgb_model import XGBTripleBarrier
+CLASS_LABELS = {-1: "DOWN (stop-loss first)", 0: "FLAT (time-out, no signal)", 1: "UP (profit-taking first)"}
+def build_features_and_labels():
+    """Rebuild the full feature matrix + triple-barrier labels at startup."""
+    df = load_aapl_with_spy()
+    close = df["Adj Close"]
+    log_returns = np.log(close).diff().dropna()
+    daily_vol = get_daily_vol(close, span=100)
+    features = pd.DataFrame(index=df.index)
+    features["frac_diff_close"] = frac_diff_ffd(np.log(close).to_frame("c"), 0.4, thres=1e-5)["c"]
+    features["frac_diff_volume"] = frac_diff_ffd(
+        np.log(df["Volume"].replace(0, np.nan)).to_frame("v"), 0.4, thres=1e-5
+    )["v"]
+    features["hl_range"] = (df["High"] - df["Low"]) / df["Close"]
+    features["spy_return"] = np.log(df["SPY_Close"]).diff()
+    features["volatility_20d"] = log_returns.rolling(20).std()
+    features["rolling_beta"] = (
+        log_returns.rolling(30).cov(features["spy_return"])
+        / features["spy_return"].rolling(30).var()
+    )
+    features["day_of_week"] = df.index.dayofweek
+    features["vol_regime"] = daily_vol / daily_vol.rolling(252, min_periods=60).median()
+    features = features.dropna()
+    t_events = cusum_filter(np.log(close), threshold=float(daily_vol.median()))
+    events = get_events(
+        close=close, t_events=t_events, pt_sl=(2.0, 2.0),
+        target=daily_vol, min_ret=0.005, num_days=10,
+    )
+    labels = get_bins(events, close)
+    events_with_labels = events.join(labels[["bin"]])
+    events_with_labels = drop_labels(events_with_labels, min_pct=0.05)
+    labels = labels.loc[events_with_labels.index]
+    aligned = features.index.intersection(labels.index)
+    return df, close, features, labels.loc[aligned, "bin"].astype(int), features.loc[aligned]
+print("Loading data and training XGBoost (one-time, ~10 sec)...")
+DF, CLOSE, FEATURES_FULL, Y_TRAIN, X_TRAIN_ALIGNED = build_features_and_labels()
+from sklearn.preprocessing import StandardScaler
+SCALER = StandardScaler().fit(X_TRAIN_ALIGNED.values)
+MODEL = XGBTripleBarrier(random_state=42)
+MODEL.fit(
+    pd.DataFrame(SCALER.transform(X_TRAIN_ALIGNED.values), index=X_TRAIN_ALIGNED.index, columns=X_TRAIN_ALIGNED.columns),
+    Y_TRAIN.values,
+)
+print(f"Model trained on {len(X_TRAIN_ALIGNED)} labeled events. Ready.")
+VALID_DATES = FEATURES_FULL.index
+DEFAULT_DATE = VALID_DATES[-1]
+def predict(date_str: str):
+    try:
+        date = pd.Timestamp(date_str)
+    except Exception:
+        return "Invalid date format. Use YYYY-MM-DD.", None, None
+    available = FEATURES_FULL.index[FEATURES_FULL.index <= date]
+    if len(available) == 0:
+        return f"No features available on or before {date.date()}. Try a later date.", None, None
+    use_date = available[-1]
+    x_row = FEATURES_FULL.loc[[use_date]]
+    x_scaled = pd.DataFrame(SCALER.transform(x_row.values), index=x_row.index, columns=x_row.columns)
+    proba = MODEL.predict_proba(x_scaled)[0]
+    pred_class = int(MODEL.classes_[np.argmax(proba)])
+    proba_df = pd.DataFrame(
+        {"class": [CLASS_LABELS[c] for c in MODEL.classes_], "probability": [f"{p:.1%}" for p in proba]}
+    )
+    end_idx = DF.index.get_loc(use_date)
+    start_idx = max(0, end_idx - 59)
+    chart_data = DF["Adj Close"].iloc[start_idx : end_idx + 1]
+    fig, ax = plt.subplots(figsize=(8, 3.5))
+    ax.plot(chart_data.index, chart_data.values, color="black", lw=1.0)
+    ax.scatter([chart_data.index[-1]], [chart_data.iloc[-1]], color="red", s=40, zorder=3, label=f"As-of: {use_date.date()}")
+    ax.set_title(f"AAPL adjusted close — 60 days ending {use_date.date()}")
+    ax.set_ylabel("Price ($)")
+    ax.legend(loc="best")
+    ax.grid(alpha=0.3)
+    plt.tight_layout()
+    summary = (
+        f"**As-of date:** {use_date.date()}  \n"
+        f"**Last close:** ${chart_data.iloc[-1]:.2f}  \n"
+        f"**Prediction (next 10 calendar days):** {CLASS_LABELS[pred_class]}  \n"
+        f"**Confidence (max class probability):** {proba.max():.1%}"
+    )
+    return summary, proba_df, fig
+def build_interface():
+    import gradio as gr
+    caveat = """
+> ⚠️ **This is an educational portfolio artifact, NOT a trading signal.**
+>
+> Under 5-fold purged k-fold cross-validation (López de Prado, *AFML*, Ch.7), this XGBoost
+> classifier reaches mean accuracy ~38% on a 3-class triple-barrier label set (random baseline
+> = 33%, p<0.05 in 3 of 5 folds). However, **directional accuracy *when the model picks a side*
+> is ~36% — worse than coin flip**. The model is mildly informative about "will something
+> happen vs nothing" but uninformative about "up vs down." Do not trade real money on this.
+"""
+    with gr.Blocks(title="AAPL Triple-Barrier Direction Classifier") as demo:
+        gr.Markdown("# AAPL Triple-Barrier Direction Classifier (educational)")
+        gr.Markdown(caveat)
+        gr.Markdown(
+            "Reference-backed financial-ML pipeline: triple-barrier labeling "
+            "(AFML Ch.3), fractional differentiation (Ch.5), purged k-fold CV (Ch.7), "
+            "XGBoost classifier. Repo: github.com/moccaram/DataSynth."
+        )
+        with gr.Row():
+            with gr.Column(scale=1):
+                date_input = gr.Textbox(
+                    label="As-of date (YYYY-MM-DD)",
+                    value=str(DEFAULT_DATE.date()),
+                    info=f"Valid range: {VALID_DATES[0].date()} → {VALID_DATES[-1].date()}",
+                )
+                predict_btn = gr.Button("Predict next 10-day direction", variant="primary")
+                summary_md = gr.Markdown()
+                proba_table = gr.Dataframe(headers=["class", "probability"], label="Class probabilities")
+            with gr.Column(scale=2):
+                chart = gr.Plot(label="60-day price context")
+        predict_btn.click(
+            fn=predict, inputs=[date_input], outputs=[summary_md, proba_table, chart]
+        )
+        gr.Markdown(
+            "---\n"
+            "Headline result table (mean over 5 purged folds):\n\n"
+            "| Model     | Accuracy | Beat random (p<0.05) | Dir.acc when acting |\n"
+            "|-----------|----------|----------------------|---------------------|\n"
+            "| Majority  | 35.0%    | 0/5 folds            | N/A                 |\n"
+            "| SES       | 36.8%    | 2/5 folds            | always abstains     |\n"
+            "| ARIMA     | 36.8%    | 2/5 folds            | always abstains     |\n"
+            "| LSTM      | 35.8%    | 2/5 folds            | 33% (worse than 50%) |\n"
+            "| **XGBoost** | **37.8%** | **3/5 folds**     | 36% (worse than 50%) |\n"
+        )
+    return demo
+if __name__ == "__main__":
+    app = build_interface()
+    app.launch(server_name="127.0.0.1", server_port=7860, inbrowser=False, share=False)
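
The `get_events(...)` call in `build_features_and_labels` uses symmetric barriers (`pt_sl=(2.0, 2.0)`) scaled by daily volatility, plus a 10-day time limit, and `CLASS_LABELS` names the three outcomes. A minimal sketch of that rule for a single event follows. It assumes a plain list of prices and counts the time limit in rows rather than calendar days; `triple_barrier_label` and its parameters are illustrative names, not part of `src/labeling.py`.

```python
def triple_barrier_label(prices, start, vol, pt_mult=2.0, sl_mult=2.0, max_steps=10):
    """Return +1, -1 or 0 depending on which barrier the price path hits first.

    Hypothetical helper for illustration only:
    - upper barrier at +pt_mult * vol (profit-taking)
    - lower barrier at -sl_mult * vol (stop-loss)
    - vertical barrier after max_steps rows (time-out)
    """
    entry = prices[start]
    upper = pt_mult * vol
    lower = -sl_mult * vol
    last = min(start + max_steps, len(prices) - 1)
    for t in range(start + 1, last + 1):
        ret = prices[t] / entry - 1.0  # simple return since the event start
        if ret > upper:
            return 1
        if ret < lower:
            return -1
    return 0  # neither horizontal barrier was touched before the time limit


up_path = [100.0, 101.0, 103.0, 99.0]
down_path = [100.0, 99.0, 97.0, 104.0]
flat_path = [100.0, 100.5, 100.2, 99.8]

labels = [triple_barrier_label(p, start=0, vol=0.01) for p in (up_path, down_path, flat_path)]
print(labels)
```

With `vol=0.01` the barriers sit at ±2%, so the first path is labeled `+1` at the third price, the second `-1`, and the third times out as `0`.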

src/cv.py ADDED

	@@ -0,0 +1,81 @@

+"""Purged k-fold cross-validation — AFML Ch.7 (BonusPDF pp.62-67).
+Standard k-fold leaks information in finance because labels span time intervals.
+If a training label's interval ``[t_i, t1_i]`` overlaps a test label's interval
+``[t_j, t1_j]``, the two share underlying price information and the train/test
+boundary is fictitious. ``PurgedKFold`` drops the offending training samples;
+an additional ``pct_embargo`` buffer drops samples immediately *after* each test
+fold to prevent reverse leakage from the test set into a later train fold.
+This is a port of AFML Snippets 7.2-7.3 (BonusPDF pp.65-66). The canonical class
+inherits from sklearn's ``_BaseKFold`` so it works as a drop-in replacement.
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from scipy import stats
+from sklearn.model_selection._split import _BaseKFold
+class PurgedKFold(_BaseKFold):
+    """K-fold CV with purging + optional embargo. AFML Snippet 7.3 (BonusPDF p.66)."""
+    def __init__(self, n_splits: int = 5, t1: pd.Series | None = None, pct_embargo: float = 0.0):
+        if not isinstance(t1, pd.Series):
+            raise ValueError("`t1` must be a pd.Series of label-end timestamps")
+        super().__init__(n_splits, shuffle=False, random_state=None)
+        self.t1 = t1
+        self.pct_embargo = pct_embargo
+    def split(self, X, y=None, groups=None):
+        if not X.index.equals(self.t1.index):
+            raise ValueError("X.index must equal t1.index")
+        indices = np.arange(X.shape[0])
+        embargo_size = int(X.shape[0] * self.pct_embargo)
+        test_ranges = [(arr[0], arr[-1] + 1) for arr in np.array_split(indices, self.n_splits)]
+        for i, j in test_ranges:
+            t0 = self.t1.index[i]
+            test_indices = indices[i:j]
+            max_t1_in_test = self.t1.iloc[test_indices].max()
+            max_t1_pos = self.t1.index.searchsorted(max_t1_in_test)
+            # left train: rows whose label ended before test starts
+            left_train = self.t1.index.searchsorted(self.t1[self.t1 <= t0].index)
+            # right train: rows starting after max-t1 + embargo
+            if max_t1_pos < X.shape[0]:
+                right_train = indices[max_t1_pos + embargo_size :]
+            else:
+                right_train = np.array([], dtype=int)
+            train_indices = np.concatenate([left_train, right_train])
+            yield train_indices, test_indices
+def get_embargo_times(times: pd.DatetimeIndex, pct_embargo: float) -> pd.Series:
+    """AFML Snippet 7.2 (BonusPDF p.65). Map each timestamp to its embargo end."""
+    step = int(times.shape[0] * pct_embargo)
+    if step == 0:
+        return pd.Series(times, index=times)
+    embargo = pd.Series(times[step:], index=times[:-step])
+    return pd.concat([embargo, pd.Series(times[-1], index=times[-step:])])
+def binomial_pvalue(n_correct: int, n_total: int, p_null: float = 0.5) -> float:
+    """One-sided binomial p-value: ``P(X >= n_correct | n=n_total, p=p_null)``.
+    Used to test whether observed accuracy or directional accuracy exceeds the
+    null. For three-class targets, pass ``p_null=1/3``; for binary direction
+    after dropping 0-labels, pass ``p_null=0.5``.
+    """
+    return float(stats.binomtest(n_correct, n_total, p=p_null, alternative="greater").pvalue)
+def proportion_ci(n_correct: int, n_total: int, alpha: float = 0.05) -> tuple[float, float]:
+    """Wilson score CI at level ``1 - alpha`` (95% by default). More accurate than normal-approx for small n."""
+    if n_total == 0:
+        return (np.nan, np.nan)
+    ci = stats.binomtest(n_correct, n_total).proportion_ci(
+        confidence_level=1 - alpha, method="wilson"
+    )
+    return float(ci.low), float(ci.high)
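
The purging rule described in the module docstring can be sketched on integer timestamps, without pandas or scikit-learn. This is a simplified illustration under stated assumptions, not the `PurgedKFold` class: `purged_train_indices` is a hypothetical helper, each label spans `[starts[i], ends[i]]`, and the embargo is given in time units rather than as a fraction of the sample count.

```python
def purged_train_indices(starts, ends, test_lo, test_hi, embargo=0):
    """Training positions for a test fold covering positions [test_lo, test_hi).

    A sample is kept only if its label interval ended at or before the test
    fold starts, or if it starts at or after the latest test label end plus
    the embargo. Everything that overlaps the test labels is purged.
    """
    test_start = starts[test_lo]
    test_end = max(ends[test_lo:test_hi])
    train = []
    for i, (t0, t1) in enumerate(zip(starts, ends)):
        if test_lo <= i < test_hi:
            continue  # the test samples themselves
        ended_before_test = t1 <= test_start
        starts_after_test = t0 >= test_end + embargo
        if ended_before_test or starts_after_test:
            train.append(i)
    return train


# Ten daily events, each with a label that spans two further days.
starts = list(range(10))
ends = [t + 2 for t in starts]

train = purged_train_indices(starts, ends, test_lo=4, test_hi=6, embargo=1)
print(train)  # [0, 1, 2, 8, 9]
```

Samples 3, 6 and 7 are dropped here: 3 and 6 have labels that overlap the test labels, and 7 falls inside the one-step embargo.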

src/data.py ADDED

	@@ -0,0 +1,52 @@

+"""Data loaders for the AAPL/SPY pipeline + EWM daily volatility (AFML Snippet 3.1).
+The CSVs under ``data/raw/`` have a column-header bug: the header reads
+``Open,High,Low,Close,Adj Close,Volume`` but the underlying yfinance frame was
+saved after a ``sort_index(axis=1)`` so the actual column order is alphabetical:
+``Adj Close, Close, High, Low, Open, Volume``. We override the headers on load.
+"""
+from __future__ import annotations
+from pathlib import Path
+import numpy as np
+import pandas as pd
+DATA_DIR = Path(__file__).resolve().parent.parent / "data" / "raw"
+ACTUAL_COLUMN_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume", "company_name"]
+def load_ohlcv(ticker: str, data_dir: Path | None = None) -> pd.DataFrame:
+    """Load a single-ticker OHLCV CSV from ``data/raw/``, fixing the column order."""
+    data_dir = data_dir or DATA_DIR
+    path = data_dir / f"{ticker}_stock_data_2010_2024.csv"
+    df = pd.read_csv(path, header=0, names=ACTUAL_COLUMN_ORDER, skiprows=1)
+    df["Date"] = pd.to_datetime(df["Date"])
+    df = df.set_index("Date").sort_index()
+    return df[["Open", "High", "Low", "Close", "Adj Close", "Volume"]]
+def load_aapl_with_spy() -> pd.DataFrame:
+    """Merged AAPL + SPY frame for market-relative features. Index = trading dates."""
+    aapl = load_ohlcv("AAPL")
+    spy = load_ohlcv("SPY")[["Adj Close", "Volume"]].rename(
+        columns={"Adj Close": "SPY_Close", "Volume": "SPY_Volume"}
+    )
+    return aapl.join(spy, how="inner")
+def get_daily_vol(close: pd.Series, span: int = 100) -> pd.Series:
+    """EWM daily-return volatility — AFML Snippet 3.1 (BonusPDF p.26).
+    Used to set the horizontal barrier widths in triple-barrier labeling. Output
+    is forward-fill safe: NaNs only at the leading edge before EWM warmup.
+    """
+    returns = close.pct_change()
+    return returns.ewm(span=span).std()
+def cumulative_returns_path(close: pd.Series, t0, t1) -> pd.Series:
+    """Return path from t0 to t1 expressed as ``close/close[t0] - 1``."""
+    return close.loc[t0:t1] / close.loc[t0] - 1
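
The header override described in the module docstring can be illustrated with the standard library alone. The sketch below uses a made-up one-row CSV (the values are invented for the example, not taken from `data/raw/`) whose header claims `Open,High,Low,...` while the values are actually stored in alphabetical column order. `read_with_override` and `read_trusting_header` are hypothetical helpers.

```python
import csv
import io

# Column order the values are actually stored in, per the docstring
# (company_name omitted to keep the example short).
ACTUAL_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume"]

# Invented sample: the header is wrong, the values follow ACTUAL_ORDER.
RAW = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2024-01-02,185.0,186.0,188.0,183.0,187.0,1000\n"
)


def read_with_override(text, names):
    """Discard the header row and attach the known-correct column names."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the misleading header
    return [dict(zip(names, row)) for row in reader]


def read_trusting_header(text):
    """Naive read that believes whatever the header says."""
    return list(csv.DictReader(io.StringIO(text)))


fixed = read_with_override(RAW, ACTUAL_ORDER)[0]
naive = read_trusting_header(RAW)[0]

print(fixed["High"], fixed["Low"])  # 188.0 183.0
print(naive["High"], naive["Low"])  # 186.0 188.0
```

A cheap sanity check for this kind of bug is that `High >= Low` must hold on every row. The naive read violates it on the sample row.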

src/eval.py ADDED

	@@ -0,0 +1,76 @@

+"""Evaluation metrics with statistical significance — triple-barrier era.
+The original notebook reported directional accuracy without binomial p-values;
+49.9% over 499 days is statistically indistinguishable from 50%. This module
+makes that explicit by attaching a p-value to every accuracy figure.
+Metric conventions
+------------------
+- For 3-class labels ``{-1, 0, +1}``, the null is uniform random: ``p_null=1/3``.
+- For *directional accuracy when acting*, restrict to predictions ``in {-1, +1}``
+  (i.e. ignore "no-action" 0 predictions), compare to ``p_null=1/2``.
+- Both metrics use a one-sided binomial test (we only care if it beats chance).
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from sklearn.metrics import accuracy_score, confusion_matrix
+from .cv import binomial_pvalue
+def directional_accuracy_when_acting(
+    y_true: np.ndarray, y_pred: np.ndarray
+) -> tuple[float, int, int]:
+    """Accuracy conditioned on the model predicting a non-zero direction.
+    Returns ``(accuracy, n_correct, n_acting)``. If ``n_acting`` is 0, returns
+    ``(nan, 0, 0)``.
+    """
+    acting_mask = y_pred != 0
+    n_acting = int(acting_mask.sum())
+    if n_acting == 0:
+        return float("nan"), 0, 0
+    correct = int(((y_pred == y_true) & acting_mask).sum())
+    return correct / n_acting, correct, n_acting
+def fold_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
+    """Per-fold metric bundle. Designed to be one row in the comparison CSV."""
+    y_true = np.asarray(y_true)
+    y_pred = np.asarray(y_pred)
+    n = len(y_true)
+    acc = accuracy_score(y_true, y_pred)
+    n_acc_correct = int((y_true == y_pred).sum())
+    dir_acc, n_dir_correct, n_acting = directional_accuracy_when_acting(y_true, y_pred)
+    return {
+        "n_test": n,
+        "accuracy": acc,
+        "binom_p_acc": binomial_pvalue(n_acc_correct, n, p_null=1 / 3),
+        "n_acting": n_acting,
+        "dir_acc_when_acting": dir_acc,
+        "binom_p_dir": (
+            binomial_pvalue(n_dir_correct, n_acting, p_null=0.5) if n_acting > 0 else float("nan")
+        ),
+    }
+def summarize_results(results: pd.DataFrame) -> pd.DataFrame:
+    """Aggregate per-fold rows to per-model summary with mean ± std."""
+    keep = ["accuracy", "binom_p_acc", "dir_acc_when_acting", "binom_p_dir"]
+    grouped = results.groupby("model")[keep]
+    summary = grouped.agg(["mean", "std"])
+    summary.columns = [f"{c}_{stat}" for c, stat in summary.columns]
+    summary["n_folds"] = results.groupby("model").size()
+    return summary.reset_index()
+def confusion_table(y_true: np.ndarray, y_pred: np.ndarray, labels=(-1, 0, 1)) -> pd.DataFrame:
+    """Confusion matrix as a labeled DataFrame (rows=true, cols=pred)."""
+    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
+    return pd.DataFrame(
+        cm, index=[f"true_{c}" for c in labels], columns=[f"pred_{c}" for c in labels]
+    )
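
The two metric conventions in the module docstring can be checked by hand with a standard-library sketch. It assumes plain Python lists instead of NumPy arrays and sums the binomial tail directly instead of calling `scipy.stats.binomtest`. `acting_accuracy` and `binomial_pvalue_greater` are illustrative names, and the direct summation is only suitable for moderate `n_total`.

```python
import math


def binomial_pvalue_greater(n_correct, n_total, p_null):
    """One-sided P(X >= n_correct) for X ~ Binomial(n_total, p_null)."""
    return sum(
        math.comb(n_total, k) * p_null**k * (1.0 - p_null) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def acting_accuracy(y_true, y_pred):
    """Accuracy over the predictions that pick a side (y_pred != 0)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p != 0]
    if not pairs:
        return float("nan"), 0, 0
    n_correct = sum(1 for t, p in pairs if t == p)
    return n_correct / len(pairs), n_correct, len(pairs)


y_true = [1, -1, 0, 1, -1, 0]
y_pred = [1, 1, 0, 0, -1, -1]
acc, n_correct, n_acting = acting_accuracy(y_true, y_pred)
print(acc, n_correct, n_acting)  # 0.5 2 4

# The docstring's example: 249 correct out of 499 days (49.9%) against p_null = 0.5.
p_value = binomial_pvalue_greater(249, 499, 0.5)
print(round(p_value, 3))
```

The p-value for 249 of 499 is above 0.5, which is the sense in which 49.9% is indistinguishable from a coin flip.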

src/features.py ADDED

	@@ -0,0 +1,109 @@

+"""Fractional differentiation — AFML Ch.5 §5.4 (BonusPDF p.46).
+Why this module exists
+----------------------
+Log-returns achieve stationarity but destroy memory: the binomial weights
+``(1-B)^d`` collapse to ``[1, -1, 0, 0, ...]`` at ``d=1``. For ``d ∈ (0, 1)``
+the weights decay as a long power-law tail, so the series stays stationary
+while retaining a long memory of past prices (Table 5.1 in AFML shows most
+liquid futures reach ADF stationarity at ``d < 0.6``, and the majority at
+``d < 0.3``).
+This is a port of AFML Snippets 5.1, 5.3, 5.4 (BonusPDF pp.48, 51, 53).
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from scipy.special import gamma
+def get_ffd_weights(d: float, thres: float = 1e-5, max_size: int = 1024) -> np.ndarray:
+    """Binomial-series weights for the fractional-differencing operator ``(1-B)^d``.
+    Cuts the series off once ``|w_k| < thres``. Uses ``scipy.special.gamma`` for
+    a vectorized closed form rather than the recursive loop in AFML Snippet 5.1.
+    Caveat: ``gamma(k + 1)`` overflows to ``inf`` for ``k >= 171``, so those
+    weights evaluate to 0 and the window is capped at 171 weights for any
+    positive ``thres``.
+    Returns
+    -------
+    np.ndarray of shape ``(n,)`` ordered from oldest to newest:
+        ``[w_{n-1}, w_{n-2}, ..., w_1, w_0]`` so the dot product with
+        ``series[t-n+1 : t+1]`` is the differenced value at ``t``.
+    """
+    k = np.arange(max_size)
+    with np.errstate(invalid="ignore", divide="ignore"):
+        w = (-1) ** k * gamma(d + 1) / (gamma(k + 1) * gamma(d - k + 1))
+    w = np.nan_to_num(w, nan=0.0, posinf=0.0, neginf=0.0)
+    cutoff = np.argmax(np.abs(w) < thres) if np.any(np.abs(w) < thres) else max_size
+    if cutoff == 0:
+        cutoff = max_size
+    return w[:cutoff][::-1]
+def frac_diff_ffd(series: pd.Series | pd.DataFrame, d: float, thres: float = 1e-5) -> pd.DataFrame:
+    """Fixed-width fractional differencing — AFML Snippet 5.3 (BonusPDF p.51).
+    The fixed-width window keeps weights stable through time (unlike the
+    expanding-window variant in Snippet 5.2 which downweights early observations).
+    """
+    if isinstance(series, pd.Series):
+        series = series.to_frame()
+    w = get_ffd_weights(d, thres=thres)  # shape (width+1,)
+    width = len(w) - 1
+    out = {}
+    for col in series.columns:
+        s = series[[col]].ffill().dropna()
+        if len(s) <= width:
+            out[col] = pd.Series(index=s.index[width:], dtype=float)
+            continue
+        values = s[col].to_numpy()
+        # Vectorized: build a (n_out, width+1) sliding-window matrix and dot with w
+        from numpy.lib.stride_tricks import sliding_window_view
+        windows = sliding_window_view(values, width + 1)
+        diffed = windows @ w
+        out[col] = pd.Series(diffed, index=s.index[width:])
+    return pd.concat(out, axis=1)
+def find_min_d(series: pd.Series, d_range=(0.0, 1.0), n_steps: int = 11, thres: float = 1e-5) -> pd.DataFrame:
+    """Sweep ``d`` and return ADF stat + correlation — AFML Snippet 5.4 (BonusPDF p.53).
+    Use to pick the smallest ``d`` for which the FFD-differenced log-price passes
+    the ADF stationarity test at 95% (statistic < critical value ≈ -2.86).
+    Returns a frame indexed by ``d`` with columns: ``adf_stat, p_value, n_obs,
+    crit_95, corr_with_original``.
+    """
+    from statsmodels.tsa.stattools import adfuller
+    log_series = np.log(series.dropna()).to_frame(name=series.name or "value")
+    results = {}
+    for d in np.linspace(d_range[0], d_range[1], n_steps):
+        diffed = frac_diff_ffd(log_series, d, thres=thres).dropna()
+        if len(diffed) < 50:
+            continue
+        col = diffed.columns[0]
+        adf = adfuller(diffed[col], maxlag=1, regression="c", autolag=None)
+        aligned = log_series.loc[diffed.index, col]
+        corr = float(aligned.corr(diffed[col]))
+        results[round(d, 3)] = {
+            "adf_stat": adf[0],
+            "p_value": adf[1],
+            "n_obs": adf[3],
+            "crit_95": adf[4]["5%"],
+            "corr_with_original": corr,
+        }
+    return pd.DataFrame(results).T.rename_axis("d")
+def rolling_zscore(series: pd.Series, window: int = 252, min_periods: int | None = None) -> pd.Series:
+    """Rolling z-score with leak-free statistics (uses only the trailing window).
+    Stronger than a single fit-on-train ``StandardScaler`` because regime shifts
+    don't carry stale means forward into the test set.
+    """
+    min_periods = min_periods or max(window // 4, 20)
+    mu = series.rolling(window=window, min_periods=min_periods).mean()
+    sd = series.rolling(window=window, min_periods=min_periods).std()
+    return (series - mu) / sd.replace(0, np.nan)
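
For comparison with the closed form in `get_ffd_weights`, the recursive form mentioned in its docstring can be sketched in pure Python. The recursion `w_k = -w_{k-1} * (d - k + 1) / k` never evaluates a gamma function, so it is not affected by overflow at large `k`. `ffd_weights` and `frac_diff` are hypothetical helpers operating on plain lists, and the weights are ordered newest lag first (the reverse of `get_ffd_weights`).

```python
def ffd_weights(d, thres=1e-5, max_size=1024):
    """Weights of (1 - B)^d, newest lag first, cut off once |w_k| < thres."""
    weights = [1.0]
    for k in range(1, max_size):
        w_k = -weights[-1] * (d - k + 1) / k
        if abs(w_k) < thres:
            break
        weights.append(w_k)
    return weights


def frac_diff(values, d, thres=1e-5):
    """Fixed-width fractional differencing of a list of floats."""
    w = ffd_weights(d, thres)
    width = len(w) - 1
    return [
        sum(w[k] * values[t - k] for k in range(width + 1))
        for t in range(width, len(values))
    ]


print(ffd_weights(1.0))             # [1.0, -1.0]: ordinary first difference
print(ffd_weights(0.5, thres=0.1))  # [1.0, -0.5, -0.125]
print(frac_diff([1.0, 2.0, 4.0, 7.0], d=1.0))  # [1.0, 2.0, 3.0]
```

At `d=1` the weights collapse to `[1, -1]`, matching the docstring's point that integer differencing discards all memory beyond one lag. Fractional `d` keeps a slowly decaying tail.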

src/labeling.py ADDED

	@@ -0,0 +1,219 @@

+"""Triple-barrier labeling — AFML Ch.3 (BonusPDF pp.26-34).
+The triple-barrier method assigns each event one of three labels based on which
+of three barriers is hit first:
+- ``+1`` — upper (profit-taking) horizontal barrier hit first
+- ``-1`` — lower (stop-loss) horizontal barrier hit first
+- ``0``  — vertical (max holding period) barrier hit first
+The horizontal barriers are scaled by a per-event volatility estimate (typically
+EWM daily vol, ``get_daily_vol`` in ``src/data.py``). This is a port of AFML
+Snippets 3.2-3.5 and Rambo's cleaner ``get_triple_barrier_label`` (his repo,
+``Chapter_3.py``).
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+def apply_pt_sl_on_t1(
+    close: pd.Series, events: pd.DataFrame, pt_sl: tuple[float, float]
+) -> pd.DataFrame:
+    """AFML Snippet 3.2 (BonusPDF p.27). Find time of first barrier touch.
+    Parameters
+    ----------
+    close : pd.Series
+        Closing-price series, indexed by date.
+    events : pd.DataFrame
+        Required columns: ``t1`` (vertical-barrier date or NaT), ``target``
+        (vol estimate at the event), ``side`` (+1 for long, -1 for short; if
+        we don't know side, pass +1 for all).
+    pt_sl : (float, float)
+        Profit-taking and stop-loss multipliers of ``target``. Pass 0 to disable
+        a barrier.
+    Returns
+    -------
+    pd.DataFrame indexed like ``events`` with columns ``t1, pt, sl`` containing
+    the first-touch timestamps (NaT if never touched).
+    """
+    out = events[["t1"]].copy()
+    pt = pt_sl[0] * events["target"] if pt_sl[0] > 0 else pd.Series(np.nan, index=events.index)
+    sl = -pt_sl[1] * events["target"] if pt_sl[1] > 0 else pd.Series(np.nan, index=events.index)
+    for t0, t1 in events["t1"].fillna(close.index[-1]).items():
+        path_prices = close.loc[t0:t1]
+        path_returns = (path_prices / close.loc[t0] - 1) * events.at[t0, "side"]
+        sl_hits = path_returns[path_returns < sl[t0]]
+        pt_hits = path_returns[path_returns > pt[t0]]
+        out.at[t0, "sl"] = sl_hits.index.min() if len(sl_hits) else pd.NaT
+        out.at[t0, "pt"] = pt_hits.index.min() if len(pt_hits) else pd.NaT
+    return out
+def add_vertical_barrier(
+    close: pd.Series, t_events: pd.DatetimeIndex, num_days: int
+) -> pd.Series:
+    """AFML Snippet 3.4 (BonusPDF p.30). Vertical (time-limit) barriers.
+    Returns a Series indexed by ``t_events`` whose values are ``num_days`` later,
+    snapped to the next available trading day; events too close to the end of
+    the series are dropped.
+    """
+    t1 = close.index.searchsorted(t_events + pd.Timedelta(days=num_days))
+    t1 = t1[t1 < close.shape[0]]
+    return pd.Series(close.index[t1], index=t_events[: len(t1)])
+def get_events(
+    close: pd.Series,
+    t_events: pd.DatetimeIndex,
+    pt_sl: tuple[float, float],
+    target: pd.Series,
+    min_ret: float,
+    num_days: int | None = None,
+    side: pd.Series | None = None,
+) -> pd.DataFrame:
+    """AFML Snippet 3.3 (BonusPDF p.29). Run triple-barrier for a batch of events.
+    Returns a DataFrame indexed by event start time with columns:
+    - ``t1`` (timestamp of the *first* barrier hit — earliest of vertical/pt/sl)
+    - ``vertical_t1`` (the original vertical-barrier date)
+    - ``barrier_hit`` (one of ``"vertical"`` / ``"pt"`` / ``"sl"`` — what was hit
+      first; used by ``get_bins`` to produce the {-1, 0, +1} label)
+    - ``target`` (vol estimate at the event)
+    If ``side`` is provided, it is propagated for downstream meta-labeling.
+    """
+    target = target.reindex(t_events).dropna()
+    target = target[target > min_ret]
+    if num_days is not None:
+        vertical_t1 = add_vertical_barrier(close, target.index, num_days)
+    else:
+        vertical_t1 = pd.Series(pd.NaT, index=target.index)
+    if side is None:
+        side_ = pd.Series(1.0, index=target.index)
+    else:
+        side_ = side.reindex(target.index).fillna(1.0)
+    events = pd.concat(
+        {"t1": vertical_t1, "target": target, "side": side_}, axis=1
+    ).dropna(subset=["target"])
+    touches = apply_pt_sl_on_t1(close, events, pt_sl)
+    # Drop events where no barrier ever fires (can't happen with a vertical
+    # barrier present, but defensive against future config changes).
+    touches = touches.dropna(subset=["t1", "pt", "sl"], how="all")
+    events = events.loc[touches.index]
+    # Earliest touch among (vertical, pt, sl); record which barrier won.
+    all_touches = touches[["t1", "pt", "sl"]]
+    earliest = all_touches.min(axis=1)
+    # Row-wise argmin over the touch times; NaT is filled below so idxmin is defined.
+    pt_arr = all_touches["pt"]
+    sl_arr = all_touches["sl"]
+    vert_arr = all_touches["t1"]
+    # Replace NaT with a very large date for comparison purposes
+    far = pd.Timestamp.max
+    cmp = pd.DataFrame(
+        {
+            "pt": pt_arr.fillna(far),
+            "sl": sl_arr.fillna(far),
+            "vertical": vert_arr.fillna(far),
+        }
+    )
+    barrier_hit = cmp.idxmin(axis=1)
+    events["vertical_t1"] = events["t1"]
+    events["t1"] = earliest
+    events["barrier_hit"] = barrier_hit.astype(str)
+    if side is None:
+        events = events.drop("side", axis=1)
+    return events.dropna(subset=["t1"])
+def get_bins(events: pd.DataFrame, close: pd.Series) -> pd.DataFrame:
+    """AFML Snippet 3.5 (BonusPDF p.30). Convert event outcomes to {-1, 0, +1}.
+    Full triple-barrier semantics: the label depends on which barrier was hit
+    *first*:
+    - ``barrier_hit == "pt"``  → ``+1`` (profit-taking, scaled by ``side``)
+    - ``barrier_hit == "sl"``  → ``-1`` (stop-loss, scaled by ``side``)
+    - ``barrier_hit == "vertical"`` → ``0`` (no signal; the time limit ran out
+      before either horizontal barrier was hit)
+    If meta-labeling (``side`` column present), maps to ``{0, 1}`` for
+    "don't act" vs "act on this side".
+    """
+    events_ = events.dropna(subset=["t1"]).copy()
+    px_idx = events_.index.union(events_["t1"].values).unique()
+    px = close.reindex(px_idx, method="bfill")
+    out = pd.DataFrame(index=events_.index)
+    out["ret"] = px.loc[events_["t1"].values].values / px.loc[events_.index].values - 1
+    if "side" in events_.columns:
+        out["ret"] *= events_["side"].values
+    if "barrier_hit" in events_.columns:
+        # Full triple-barrier: 0 when the vertical barrier (time limit) wins.
+        out["bin"] = 0
+        out.loc[events_["barrier_hit"] == "pt", "bin"] = 1
+        out.loc[events_["barrier_hit"] == "sl", "bin"] = -1
+        if "side" in events_.columns:
+            # meta-labeling: collapse to {0, 1} = "don't act / act"
+            out.loc[out["ret"] <= 0, "bin"] = 0
+            out.loc[out["bin"] != 0, "bin"] = 1
+    else:
+        # Fallback to AFML Snippet 3.5 default (sign of return)
+        out["bin"] = np.sign(out["ret"]).astype(int)
+    out["bin"] = out["bin"].astype(int)
+    return out
+def drop_labels(events: pd.DataFrame, min_pct: float = 0.05) -> pd.DataFrame:
+    """AFML Snippet 3.8 (BonusPDF p.34). Drop labels with < ``min_pct`` support.
+    Repeats until every remaining label has at least ``min_pct`` of observations
+    or fewer than 3 classes remain.
+    """
+    while True:
+        counts = events["bin"].value_counts(normalize=True)
+        if counts.min() > min_pct or len(counts) < 3:
+            break
+        smallest = counts.idxmin()
+        events = events[events["bin"] != smallest]
+        print(f"Dropped label {smallest}: {100 * counts.min():.2f}% of observations")
+    return events
+def cusum_filter(series: pd.Series, threshold: float) -> pd.DatetimeIndex:
+    """Symmetric CUSUM filter — AFML §2.5.2 (general technique).
+    Generates event start times where the cumulative sum of first differences of
+    ``series`` (in either direction) exceeds ``threshold``; pass log-prices if
+    the threshold is meant in return units. Resets after each event. Returns a
+    DatetimeIndex of event-trigger timestamps.
+    Avoids the "predict on every bar" inefficiency by only labeling at
+    statistically interesting moments.
+    """
+    t_events, s_pos, s_neg = [], 0.0, 0.0
+    diff = series.diff().fillna(0)
+    for t, d in diff.items():
+        s_pos = max(0.0, s_pos + d)
+        s_neg = min(0.0, s_neg + d)
+        if s_neg < -threshold:
+            s_neg = 0.0
+            t_events.append(t)
+        elif s_pos > threshold:
+            s_pos = 0.0
+            t_events.append(t)
+    return pd.DatetimeIndex(t_events)
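
Review note: the first-touch rule described in the module docstring can be illustrated on a single event without pandas. This is a minimal sketch, not the code in `src/labeling.py`; `first_touch_label` and the toy prices are hypothetical, and it assumes a long-only event with one volatility estimate for the whole path.

```python
from __future__ import annotations


def first_touch_label(
    prices: list[float], vol: float, pt_mult: float, sl_mult: float, max_hold: int
) -> tuple[int, int]:
    """Label one event that starts at prices[0] (hypothetical helper).

    Returns (label, bars_held): +1 if the upper barrier is crossed first,
    -1 if the lower barrier is crossed first, 0 if max_hold bars elapse first.
    """
    upper = pt_mult * vol   # profit-taking barrier, expressed as a return
    lower = -sl_mult * vol  # stop-loss barrier, expressed as a return
    last = min(max_hold, len(prices) - 1)
    for i in range(1, last + 1):
        ret = prices[i] / prices[0] - 1.0
        if ret > upper:
            return 1, i
        if ret < lower:
            return -1, i
    return 0, last  # vertical barrier: time ran out


# Made-up paths, vol = 2%, barriers at +/- 1 x vol, 3-bar holding limit.
up = first_touch_label([100.0, 100.5, 102.5, 99.0], 0.02, 1.0, 1.0, 3)
down = first_touch_label([100.0, 99.5, 97.0, 103.0], 0.02, 1.0, 1.0, 3)
flat = first_touch_label([100.0, 100.5, 99.5, 100.2, 110.0], 0.02, 1.0, 1.0, 3)
print(up, down, flat)
```

In the third path the jump to 110 arrives after the 3-bar limit, so the event is labeled `0` even though the upper barrier would eventually have been crossed.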

src/models/__init__.py ADDED Viewed

File without changes

src/models/arima_model.py ADDED Viewed

	@@ -0,0 +1,64 @@

+"""ARIMA wrapper for triple-barrier classification.
+ARIMA forecasts a continuous next-step return; we threshold it into ``{-1, 0, +1}``
+using ``±k·σ`` where ``σ`` is the daily-vol estimate at the event time. The
+``k`` factor matches the profit-taking / stop-loss multiplier used for labeling
+so that the discretization is consistent with the label scheme.
+"""
+from __future__ import annotations
+import warnings
+import numpy as np
+import pandas as pd
+from statsmodels.tsa.arima.model import ARIMA
+class ARIMAClassifier:
+    """Wraps statsmodels ARIMA so it can sit in the same fit/predict loop as XGB/LSTM.
+    The model is fit on the ``frac_diff_close`` column of the training rows, which
+    stands in for the underlying log-price level we want to forecast.
+    Required X columns: ``frac_diff_close`` (the series ARIMA is fit on) and
+    ``target_vol`` (per-event vol used to set the ±k·σ threshold).
+    """
+    def __init__(self, order: tuple[int, int, int] = (1, 1, 1), threshold_k: float = 0.5):
+        self.order = order
+        self.threshold_k = threshold_k
+        self.fitted_ = None
+        self.train_tail_value_: float = 0.0
+        self.classes_: np.ndarray = np.array([-1, 0, 1])
+    def fit(self, X, y, sample_weight=None):
+        series = X["frac_diff_close"].astype(float).to_numpy()
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            self.fitted_ = ARIMA(series, order=self.order).fit()
+        self.train_tail_value_ = float(series[-1])
+        return self
+    def predict(self, X):
+        n = len(X)
+        forecast = self.fitted_.forecast(steps=n)
+        # forecasts are levels; difference them (anchored at the last training
+        # value) to get the per-step change that is compared against the band
+        last = self.train_tail_value_
+        per_step_return = np.diff(np.concatenate([[last], np.asarray(forecast)]))
+        thresholds = self.threshold_k * X["target_vol"].astype(float).to_numpy()
+        preds = np.zeros(n, dtype=int)
+        preds[per_step_return > thresholds] = 1
+        preds[per_step_return < -thresholds] = -1
+        return preds
+    def predict_proba(self, X):
+        # ARIMA isn't probabilistic in the triple-barrier sense; collapse hard
+        # predictions into a one-hot for log-loss calculation.
+        preds = self.predict(X)
+        proba = np.zeros((len(preds), 3))
+        for i, c in enumerate(self.classes_):
+            proba[preds == c, i] = 1.0
+        return proba
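
Review note: the ±k·σ discretization can be checked in isolation, without fitting ARIMA. The sketch below assumes the forecast is a path of levels and that the volatility values are in the same units as the step-to-step change; `threshold_forecast` and the numbers are hypothetical.

```python
import numpy as np


def threshold_forecast(
    forecast: np.ndarray, last_value: float, vol: np.ndarray, k: float = 0.5
) -> np.ndarray:
    """Map a path of forecast levels to {-1, 0, +1} using a +/- k * vol band."""
    # Step-to-step change of the forecast path, anchored at the last training value.
    step_change = np.diff(np.concatenate([[last_value], forecast]))
    band = k * vol
    labels = np.zeros(len(forecast), dtype=int)
    labels[step_change > band] = 1
    labels[step_change < -band] = -1
    return labels


forecast = np.array([0.03, 0.03, 0.00])  # made-up forecast levels
vol = np.array([0.02, 0.02, 0.02])       # made-up per-event vol
labels = threshold_forecast(forecast, last_value=0.0, vol=vol, k=0.5)
print(labels)
```

Only the first and last steps move by more than the 0.01 band, so the middle step is labeled `0`. Multi-step forecasts from a low-order ARIMA typically flatten out quickly, so rows deep into a test window are likely to fall inside the band.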

src/models/baselines.py ADDED Viewed

	@@ -0,0 +1,89 @@

+"""Naïve and SES baselines for triple-barrier classification.
+The original notebook found SES beat the LSTM under rolling evaluation, which
+was the most interesting result. We keep both baselines under the new label
+scheme to see whether that finding survives a fair (purged-CV) comparison.
+Both classes follow a uniform fit/predict_proba/predict interface so the
+training driver can iterate over models polymorphically.
+"""
+from __future__ import annotations
+import numpy as np
+import pandas as pd
+from statsmodels.tsa.holtwinters import SimpleExpSmoothing
+class MajorityClassClassifier:
+    """Predicts the most common class from the training set every time.
+    The honest "do nothing" baseline. A useful sanity check: any model that
+    fails to beat this on accuracy isn't doing anything.
+    """
+    def __init__(self):
+        self.majority_class_: int | None = None
+        self.classes_: np.ndarray | None = None
+    def fit(self, X, y, sample_weight=None):
+        y = np.asarray(y)
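+        # Count per observed class so counts stay aligned with classes_ even
+        # when a label is absent (e.g. only {-1, +1} remain after drop_labels).
+        self.classes_, counts = np.unique(y, return_counts=True)
+        self.majority_class_ = int(self.classes_[np.argmax(counts)])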
+        return self
+    def predict(self, X):
+        return np.full(len(X), self.majority_class_, dtype=int)
+    def predict_proba(self, X):
+        n = len(X)
+        proba = np.zeros((n, len(self.classes_)))
+        idx = int(np.where(self.classes_ == self.majority_class_)[0][0])
+        proba[:, idx] = 1.0
+        return proba
+class SESClassifier:
+    """Simple exponential smoothing applied to the *label series*, then rounded to the nearest class.
+    Approach: fit ``SimpleExpSmoothing`` on the train labels (treated as a
+    continuous signal in ``{-1, 0, +1}``), forecast next-step level, and round
+    back to the nearest class. Not a real classifier — a sanity check that the
+    label sequence has any short-horizon autocorrelation at all.
+    """
+    def __init__(self, smoothing_level: float | None = None):
+        self.smoothing_level = smoothing_level
+        self.model_ = None
+        self.last_forecast_: float = 0.0
+        self.classes_: np.ndarray | None = None
+    def fit(self, X, y, sample_weight=None):
+        y = np.asarray(y, dtype=float)
+        self.classes_ = np.array(sorted(np.unique(y.astype(int))))
+        self.model_ = SimpleExpSmoothing(y, initialization_method="estimated").fit(
+            smoothing_level=self.smoothing_level, optimized=self.smoothing_level is None
+        )
+        fc = self.model_.forecast(1)
+        self.last_forecast_ = float(fc[0] if hasattr(fc, "__getitem__") else fc)
+        return self
+    def predict(self, X):
+        # SES gives a single forecast; broadcast it across the test window.
+        # A constant prediction is intentional: this is a sanity baseline that
+        # probes whether the label series has any structure at all.
+        n = len(X)
+        forecast = self.last_forecast_
+        return np.full(n, self._nearest_class(forecast), dtype=int)
+    def predict_proba(self, X):
+        n = len(X)
+        pred_class = self._nearest_class(self.last_forecast_)
+        proba = np.zeros((n, len(self.classes_)))
+        idx = int(np.where(self.classes_ == pred_class)[0][0])
+        proba[:, idx] = 1.0
+        return proba
+    def _nearest_class(self, value: float) -> int:
+        return int(self.classes_[np.argmin(np.abs(self.classes_ - value))])
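
Review note: a majority-class baseline has to keep its class counts aligned with `classes_`, which matters when a fold is missing one of the three labels. A minimal sketch using `np.unique(..., return_counts=True)`; `majority_class` is a hypothetical helper and the label array is made up.

```python
import numpy as np


def majority_class(y: np.ndarray) -> tuple[np.ndarray, int]:
    """Return (classes, majority_class) with counts aligned to classes."""
    classes, counts = np.unique(y, return_counts=True)
    return classes, int(classes[np.argmax(counts)])


# A fold where the 0 label is absent (e.g. after rare labels were dropped).
y = np.array([-1, 1, 1, 1, -1])
classes, majority = majority_class(y)
print(classes, majority)

# An offset-based bincount leaves a slot for the missing 0 class, so its
# argmax is a position within {-1, 0, 1}, not a position within `classes`.
offset_counts = np.bincount(y - y.min())
print(offset_counts, int(np.argmax(offset_counts)))
```

Here `classes` has two entries but the offset-based argmax is `2`, so indexing `classes` with it would raise an `IndexError`.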

src/models/lstm_model.py ADDED Viewed

	@@ -0,0 +1,179 @@

+"""Refined LSTM for triple-barrier classification.
+Architectural choices vs the original notebook
+-----------------------------------------------
+The original model (128→64 units, MSE, no clipnorm) collapsed to predicting the
+mean. The refinements here are each anchored to a specific reference:
+- **Smaller**: 32→16 units, ~10× fewer params. Jansen's univariate-LSTM notebook
+  uses 10 units on S&P daily. Karpathy warns that over-parameterized RNNs
+  *"do not always show convincing signs of generalizing in the correct way."*
+- **Gradient clipping** (``clipnorm=1.0``): Goodfellow §10.11.1, eq 10.48-49
+  (PDF p.414). Without it, the 60-step BPTT chain has the "cliff" landscape
+  shown in figure 10.17 and SGD updates can be catastrophically large.
+- **Recurrent dropout** (``recurrent_dropout=0.1``): Goodfellow §10.11.2
+  (PDF p.415). Drops the time-axis connections, which is where the
+  generalization problem lives — sequence-level Dropout drops feature
+  dimensions and misses this.
+- **Softmax over 3 classes** with ``categorical_crossentropy``: aligns the
+  loss with the directional-accuracy metric, fixing the original's MSE-vs-
+  direction mismatch.
+- **Forget-gate bias = 1**: Keras default (``unit_forget_bias=True``), kept
+  explicit so a reader sees Goodfellow §10.10.2 (PDF p.412) is honored.
+Sample weighting
+----------------
+AFML Pitfall #7 (Table 1.2): non-IID samples need uniqueness weighting. The
+training driver passes a ``sample_weight`` array if available;
+``categorical_crossentropy`` honors it natively via Keras.
+"""
+from __future__ import annotations
+import numpy as np
+def build_lstm(
+    sequence_length: int,
+    n_features: int,
+    n_classes: int = 3,
+    lstm_units: tuple[int, int] = (32, 16),
+    dropout: float = 0.2,
+    recurrent_dropout: float = 0.1,
+    learning_rate: float = 1e-3,
+    clipnorm: float = 1.0,
+):
+    """Build the refined LSTM. Import inside the function so tensorflow doesn't load at module import."""
+    from tensorflow.keras.layers import LSTM, Dense, Dropout, Input
+    from tensorflow.keras.models import Sequential
+    from tensorflow.keras.optimizers import Adam
+    model = Sequential(
+        [
+            Input(shape=(sequence_length, n_features)),
+            LSTM(
+                lstm_units[0],
+                return_sequences=True,
+                recurrent_dropout=recurrent_dropout,
+                unit_forget_bias=True,  # Goodfellow §10.10.2 (PDF p.412)
+            ),
+            Dropout(dropout),
+            LSTM(
+                lstm_units[1],
+                return_sequences=False,
+                recurrent_dropout=recurrent_dropout,
+                unit_forget_bias=True,
+            ),
+            Dropout(dropout),
+            Dense(n_classes, activation="softmax"),
+        ]
+    )
+    model.compile(
+        optimizer=Adam(learning_rate=learning_rate, clipnorm=clipnorm),
+        loss="categorical_crossentropy",
+        metrics=["accuracy"],
+    )
+    return model
+def build_sequences(
+    X: np.ndarray, y: np.ndarray, sequence_length: int
+) -> tuple[np.ndarray, np.ndarray]:
+    """Convert ``(n_obs, n_features)`` into ``(n_seq, sequence_length, n_features)``.
+    The target at sequence index ``i`` is ``y[i + sequence_length - 1]`` — the
+    model predicts the label at the END of each window, not the next step
+    beyond it (the next-step view is handled at the event level by the
+    triple-barrier ``t1``).
+    """
+    n = len(X) - sequence_length + 1
+    if n <= 0:
+        return np.empty((0, sequence_length, X.shape[1])), np.empty((0,))
+    X_seq = np.stack([X[i : i + sequence_length] for i in range(n)])
+    y_seq = y[sequence_length - 1 :]
+    return X_seq, y_seq
+class LSTMTripleBarrier:
+    """Wraps the refined LSTM with the same fit/predict interface as other models.
+    Owns label encoding ``{-1, 0, +1} -> {0, 1, 2}`` and sequence construction
+    so the training driver doesn't have to special-case it.
+    """
+    def __init__(
+        self,
+        sequence_length: int = 60,
+        n_features: int = 8,
+        epochs: int = 50,
+        batch_size: int = 64,
+        patience: int = 15,
+        verbose: int = 0,
+        random_state: int = 42,
+    ):
+        self.sequence_length = sequence_length
+        self.n_features = n_features
+        self.epochs = epochs
+        self.batch_size = batch_size
+        self.patience = patience
+        self.verbose = verbose
+        self.random_state = random_state
+        self.model = None
+        self.classes_ = np.array([-1, 0, 1])
+        self.history_ = None
+    def fit(self, X, y, sample_weight=None):
+        import tensorflow as tf
+        from tensorflow.keras.callbacks import EarlyStopping
+        from tensorflow.keras.utils import to_categorical
+        tf.random.set_seed(self.random_state)
+        np.random.seed(self.random_state)
+        X_arr = np.asarray(X)
+        y_enc = np.asarray(y).astype(int) + 1
+        X_seq, y_seq = build_sequences(X_arr, y_enc, self.sequence_length)
+        if len(X_seq) == 0:
+            raise ValueError(f"Not enough rows ({len(X_arr)}) for sequence_length={self.sequence_length}")
+        y_onehot = to_categorical(y_seq, num_classes=3)
+        sw_seq = None
+        if sample_weight is not None:
+            sw_arr = np.asarray(sample_weight)
+            sw_seq = sw_arr[self.sequence_length - 1 :]
+        self.model = build_lstm(
+            sequence_length=self.sequence_length,
+            n_features=X_arr.shape[1],
+        )
+        callbacks = [
+            EarlyStopping(monitor="loss", patience=self.patience, restore_best_weights=True)
+        ]
+        self.history_ = self.model.fit(
+            X_seq,
+            y_onehot,
+            sample_weight=sw_seq,
+            epochs=self.epochs,
+            batch_size=self.batch_size,
+            verbose=self.verbose,
+            callbacks=callbacks,
+            shuffle=False,
+        )
+        return self
+    def predict_proba(self, X):
+        X_arr = np.asarray(X)
+        # Always pad the start with `sequence_length - 1` copies of the first row
+        # so the output has exactly one prediction per input row. (Without this
+        # pad we'd lose the first 59 rows of every test fold.)
+        n_pad = self.sequence_length - 1
+        pad = np.tile(X_arr[:1], (n_pad, 1))
+        X_padded = np.vstack([pad, X_arr])
+        X_seq, _ = build_sequences(
+            X_padded, np.zeros(len(X_padded)), self.sequence_length
+        )
+        return self.model.predict(X_seq, verbose=0)
+    def predict(self, X):
+        proba = self.predict_proba(X)
+        return self.classes_[np.argmax(proba, axis=1)]
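
Review note: the windowing and the inference-time left padding are easier to follow on a tiny array. The sketch below mirrors the shapes described in `build_sequences` and `predict_proba` but is independent of TensorFlow; `make_windows` is a hypothetical name and the data is synthetic.

```python
import numpy as np


def make_windows(X: np.ndarray, sequence_length: int) -> np.ndarray:
    """Stack overlapping windows: (n_obs, n_feat) -> (n_seq, sequence_length, n_feat)."""
    n = len(X) - sequence_length + 1
    if n <= 0:
        return np.empty((0, sequence_length, X.shape[1]))
    return np.stack([X[i : i + sequence_length] for i in range(n)])


X = np.arange(10, dtype=float).reshape(5, 2)  # 5 rows, 2 features
seq_len = 3

# Training-style windows: the first seq_len - 1 rows have no full window.
train_windows = make_windows(X, seq_len)
print(train_windows.shape)

# Inference-style left padding: repeat the first row seq_len - 1 times so
# every input row gets exactly one window that ends on it.
pad = np.tile(X[:1], (seq_len - 1, 1))
test_windows = make_windows(np.vstack([pad, X]), seq_len)
print(test_windows.shape)
```

The padded windows for the earliest rows are made mostly of repeated copies of row 0, so predictions for the first `sequence_length - 1` rows of a fold rest on less real history than later ones.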

src/models/xgb_model.py ADDED Viewed

	@@ -0,0 +1,59 @@

+"""XGBoost classifier for triple-barrier labels.
+Per Jansen Ch.12, gradient-boosted trees are the natural baseline for tabular
+financial features and routinely beat LSTMs on these problems. The hyper-
+parameters here are conservative (shallow trees, moderate n_estimators) to
+avoid overfitting on small per-fold training sets in the purged CV scheme.
+"""
+from __future__ import annotations
+import numpy as np
+from xgboost import XGBClassifier
+def build_xgb_classifier(random_state: int = 42) -> XGBClassifier:
+    """Returns a fresh XGBClassifier for one CV fold.
+    Output classes use the XGBoost-internal indexing ``{0, 1, 2}`` for
+    ``{-1, 0, +1}`` since XGBoost requires non-negative integer labels. The
+    ``XGBTripleBarrier`` wrapper below owns that encoding.
+    """
+    return XGBClassifier(
+        objective="multi:softprob",
+        num_class=3,
+        max_depth=4,
+        n_estimators=300,
+        learning_rate=0.05,
+        subsample=0.8,
+        colsample_bytree=0.8,
+        reg_lambda=1.0,
+        eval_metric="mlogloss",
+        random_state=random_state,
+        n_jobs=-1,
+        tree_method="hist",
+    )
+class XGBTripleBarrier:
+    """Thin wrapper that owns the label encoding from ``{-1, 0, 1}`` ↔ ``{0, 1, 2}``."""
+    def __init__(self, random_state: int = 42):
+        self.model = build_xgb_classifier(random_state=random_state)
+        self.classes_ = np.array([-1, 0, 1])
+    def fit(self, X, y, sample_weight=None):
+        y_enc = np.asarray(y).astype(int) + 1  # {-1, 0, 1} -> {0, 1, 2}
+        self.model.fit(X, y_enc, sample_weight=sample_weight)
+        return self
+    def predict(self, X):
+        y_pred_enc = self.model.predict(X)
+        return y_pred_enc - 1
+    def predict_proba(self, X):
+        return self.model.predict_proba(X)
+    @property
+    def feature_importances_(self):
+        return self.model.feature_importances_
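
Review note: the label round trip `{-1, 0, +1}` to `{0, 1, 2}` and back is worth spelling out, because `predict_proba` columns follow the encoded order. A minimal sketch with NumPy only; `encode`, `decode_proba`, and the probability rows are hypothetical and not taken from a fitted model.

```python
import numpy as np

CLASSES = np.array([-1, 0, 1])  # column j of a probability matrix maps to CLASSES[j]


def encode(y: np.ndarray) -> np.ndarray:
    """{-1, 0, +1} -> {0, 1, 2}."""
    return np.asarray(y).astype(int) + 1


def decode_proba(proba: np.ndarray) -> np.ndarray:
    """Pick the most probable column and map it back to {-1, 0, +1}."""
    return CLASSES[np.argmax(proba, axis=1)]


y = np.array([-1, 0, 1, 1])
y_enc = encode(y)

proba = np.array(
    [
        [0.6, 0.3, 0.1],  # most mass on column 0 -> label -1
        [0.2, 0.5, 0.3],  # most mass on column 1 -> label 0
        [0.1, 0.2, 0.7],  # most mass on column 2 -> label +1
    ]
)
decoded = decode_proba(proba)
print(y_enc, decoded)
```

Subtracting 1 from the encoded prediction and indexing `CLASSES` with the argmax give the same result as long as all three classes were seen during training.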

src/train.py ADDED Viewed

	@@ -0,0 +1,102 @@

+"""CV-aware training driver — one harness for all five models.
+The driver expects each model to expose ``fit(X, y, sample_weight=None)``,
+``predict(X)``, and (optionally) ``predict_proba(X)``. The triple-barrier label
+``{-1, 0, +1}`` is shared across all of them.
+Sample weights come from AFML Ch.4 — observations whose label intervals overlap
+contribute less unique information, so they should count less in the loss. The
+simplest implementation is to weight inversely by the number of overlapping
+labels (Snippet 4.1); for now the driver supports passing pre-computed weights
+or falling back to uniform.
+"""
+from __future__ import annotations
+from collections.abc import Callable
+from typing import Any
+import numpy as np
+import pandas as pd
+from sklearn.preprocessing import StandardScaler
+from .cv import PurgedKFold
+from .eval import fold_metrics
+def fit_predict_one_fold(
+    model_builder: Callable[[], Any],
+    X_train: pd.DataFrame,
+    y_train: pd.Series,
+    X_test: pd.DataFrame,
+    sample_weight_train: np.ndarray | None = None,
+    standardize: bool = True,
+) -> tuple[np.ndarray, Any]:
+    """Fit on the train fold, predict on the test fold. Returns (y_pred, fitted_model)."""
+    if standardize:
+        scaler = StandardScaler().fit(X_train.values)
+        X_train_s = pd.DataFrame(
+            scaler.transform(X_train.values), index=X_train.index, columns=X_train.columns
+        )
+        X_test_s = pd.DataFrame(
+            scaler.transform(X_test.values), index=X_test.index, columns=X_test.columns
+        )
+    else:
+        X_train_s, X_test_s = X_train, X_test
+    model = model_builder()
+    model.fit(X_train_s, y_train.values, sample_weight=sample_weight_train)
+    return model.predict(X_test_s), model
+def run_cv(
+    model_name: str,
+    model_builder: Callable[[], Any],
+    X: pd.DataFrame,
+    y: pd.Series,
+    cv: PurgedKFold,
+    sample_weight: pd.Series | None = None,
+    standardize: bool = True,
+    extra_columns: dict | None = None,
+) -> pd.DataFrame:
+    """Run a model across all CV folds. Returns one row per fold."""
+    rows = []
+    for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X)):
+        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
+        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
+        sw_train = sample_weight.iloc[train_idx].values if sample_weight is not None else None
+        y_pred, _ = fit_predict_one_fold(
+            model_builder=model_builder,
+            X_train=X_train,
+            y_train=y_train,
+            X_test=X_test,
+            sample_weight_train=sw_train,
+            standardize=standardize,
+        )
+        metrics = fold_metrics(y_test.values, y_pred)
+        row = {"model": model_name, "fold": fold_idx, **metrics}
+        if extra_columns:
+            row.update(extra_columns)
+        rows.append(row)
+    return pd.DataFrame(rows)
+def uniqueness_weights(t1: pd.Series) -> pd.Series:
+    """Approximate AFML Ch.4 sample-uniqueness weights.
+    For each event, count how many events (itself included) have overlapping
+    ``[start, t1]`` intervals, and weight inversely. Not the rigorous Snippet
+    4.1 (which counts overlap proportionally), but the right order of magnitude
+    and much faster.
+    """
+    weights = pd.Series(1.0, index=t1.index)
+    t1_arr = t1.values
+    start_arr = t1.index.values
+    n = len(t1)
+    for i in range(n):
+        overlap = np.sum((start_arr <= t1_arr[i]) & (t1_arr >= start_arr[i]))
+        weights.iloc[i] = 1.0 / max(overlap, 1)
+    # normalize so the weights sum to n (mean weight = 1)
+    weights *= n / weights.sum()
+    return weights
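
Review note: the overlap-count weighting can be written without the Python loop by broadcasting the interval test. This is a sketch of the same approximation on integer timestamps, not the proportional weighting of AFML Snippet 4.1; `overlap_weights` is a hypothetical helper, and the boolean matrix costs O(n²) memory.

```python
import numpy as np


def overlap_weights(start: np.ndarray, end: np.ndarray) -> np.ndarray:
    """Weight each interval by 1 / (number of intervals it overlaps, itself included)."""
    # overlaps[i, j] is True when [start_i, end_i] and [start_j, end_j] intersect.
    overlaps = (start[None, :] <= end[:, None]) & (end[None, :] >= start[:, None])
    counts = overlaps.sum(axis=1)
    weights = 1.0 / counts
    return weights * len(weights) / weights.sum()  # rescale so the mean weight is 1


# Three mutually overlapping events and one isolated event (made-up times).
start = np.array([0, 1, 2, 10])
end = np.array([3, 4, 5, 12])
w = overlap_weights(start, end)
print(w)
```

The isolated event ends up with three times the weight of each crowded one, and the weights sum to the number of events.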

y_test_lstm.npy DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:0b51bdd96ea2313f3a6566cd272c918161d87a704a5b4d7fdab557dca65bdac7
-size 3640

y_train_lstm.npy DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:9ddf60272a325d8ac0328830ed5a7c326a2f4553506a8d75587c8580f025b847
-size 22216

y_val_lstm.npy DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a7518e2c4f02f2496ac9f5043a9eded9d7fb0e5b37c6c40f5c66dfee6a7bfef4
-size 1656