Replace v1 demo with v2 XGBoost-backed Gradio app (reference-backed rebuild)
Upgrades the Space to the v2 pipeline from github.com/moccaram/DataSynth: real Gradio inference (not the hello-world template), an XGBoost classifier trained on triple-barrier labels and fractionally-differenced features, and a prominent caveat that directional accuracy when the model acts is only ~36%.
- .gitattributes +1 -0
- DataSynthis_ML_JobTask.ipynb +0 -0
- README.md +16 -36
- X_test_lstm.npy +0 -3
- X_train_lstm.npy +0 -3
- X_val_lstm.npy +0 -3
- app.py +13 -15
- feature_scaler.pkl → app_screenshot.png +2 -2
- arima_model.pkl +0 -3
- arima_order.pkl +0 -3
- data/raw/AAPL_stock_data_2010_2024.csv +0 -0
- data/raw/SPY_stock_data_2010_2024.csv +0 -0
- data_preparation_metadata.json +0 -59
- lstm_model.h5 +0 -3
- requirements.txt +7 -0
- src/__init__.py +7 -0
- src/app.py +190 -0
- src/cv.py +81 -0
- src/data.py +52 -0
- src/eval.py +76 -0
- src/features.py +109 -0
- src/labeling.py +219 -0
- src/models/__init__.py +0 -0
- src/models/arima_model.py +64 -0
- src/models/baselines.py +89 -0
- src/models/lstm_model.py +179 -0
- src/models/xgb_model.py +59 -0
- src/train.py +102 -0
- y_test_lstm.npy +0 -3
- y_train_lstm.npy +0 -3
- y_val_lstm.npy +0 -3
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
app_screenshot.png filter=lfs diff=lfs merge=lfs -text
|
DataSynthis_ML_JobTask.ipynb
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
README.md
CHANGED
|
@@ -1,47 +1,27 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
colorTo: gray
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
-
license:
|
| 11 |
-
short_description: Stock price forecasting ML demo for DataSynthis internship
|
| 12 |
---
|
| 13 |
|
| 14 |
-
#
|
| 15 |
-
Stock Price Forecasting with Baseline, Statistical, and ML Models
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
4. **Evaluation** → Rolling-window accuracy metrics (RMSE, MAPE)
|
| 25 |
-
5. **Deployment** → Interactive demo with Gradio (via Hugging Face Spaces)
|
| 26 |
|
| 27 |
-
|
| 28 |
-
- Data preprocessing & feature engineering (lags, volatility, RSI, MACD, Bollinger Bands, etc.)
|
| 29 |
-
- Feature validation & pruning (correlation, VIF, outlier checks)
|
| 30 |
-
- Unified comparison of models with a performance summary table
|
| 31 |
-
- Visualizations: trends, normalized comparisons, total returns
|
| 32 |
-
- Exportable datasets for reproducibility
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
- **Models**: Naïve, SES, ARIMA, Prophet, LSTM
|
| 37 |
-
- **Visualizations**: stock trends, indicators, correlations, performance plots
|
| 38 |
-
- **Deployment**: Hugging Face Space with Gradio app
|
| 39 |
-
|
| 40 |
-
## 📂 Repository Structure
|
| 41 |
-
📁 DataSynthis_ML_JobTask
|
| 42 |
-
├── app.py # Gradio demo app
|
| 43 |
-
├── data/ # Preprocessed & engineered datasets
|
| 44 |
-
├── notebooks/ # Jupyter notebooks with full pipeline
|
| 45 |
-
├── models/ # Trained ARIMA / Prophet / LSTM models
|
| 46 |
-
├── outputs/ # Plots, summary tables, feature files
|
| 47 |
-
├── README.md # This file
|
|
|
|
| 1 |
---
|
| 2 |
+
title: AAPL Triple-Barrier Direction Classifier
|
| 3 |
+
emoji: 📊
|
| 4 |
+
colorFrom: blue
|
| 5 |
colorTo: gray
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: "4.44.0"
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: mit
|
|
|
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# AAPL Triple-Barrier Direction Classifier (educational)
|
|
|
|
| 14 |
|
| 15 |
+
Reference-backed financial-ML demo. XGBoost classifier trained on
|
| 16 |
+
fractionally-differenced features and triple-barrier labels (López de Prado,
|
| 17 |
+
*Advances in Financial Machine Learning*, Ch.3 + Ch.5).
|
| 18 |
|
| 19 |
+
**This is an educational portfolio artifact, not a trading signal.**
|
| 20 |
+
Test-set accuracy ~38% on a 3-class label set (random = 33%, p<0.05 in 3 of 5
|
| 21 |
+
purged folds). Directional accuracy *when the model picks a side* is ~36% —
|
| 22 |
+
worse than coin-flip. Do not trade real money on this.
|
|
|
|
|
|
|
| 23 |
|
| 24 |
+

|
| 25 |
|
| 26 |
+
Full source, technical writeup, and lessons-learned:
|
| 27 |
+
[github.com/moccaram/DataSynth](https://github.com/moccaram/DataSynth).
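The fractional differencing mentioned in the README relies on the binomial-series weights of `(1-B)^d`. A minimal sketch of the standard weight recursion follows, assuming a cutoff like the `thres=1e-5` passed in `src/app.py`. The name `ffd_weights` is illustrative and this is not the repository's implementation; the weights here are returned newest lag first.

```python
def ffd_weights(d: float, thres: float = 1e-5, max_size: int = 1024) -> list[float]:
    """Weights of (1 - B)^d, newest lag first, truncated once |w_k| < thres.

    Illustrative helper (hypothetical name), using the recursion
    w_k = -w_{k-1} * (d - k + 1) / k with w_0 = 1.
    """
    weights = [1.0]
    k = 1
    while k < max_size:
        w_k = -weights[-1] * (d - k + 1) / k
        if abs(w_k) < thres:
            break  # remaining weights are below the threshold
        weights.append(w_k)
        k += 1
    return weights


w_first_diff = ffd_weights(1.0)  # d = 1 reduces to an ordinary first difference
w_frac = ffd_weights(0.4)        # d = 0.4 keeps a long tail of small negative weights
print(w_first_diff)
print([round(w, 4) for w in w_frac[:4]], len(w_frac))
```

At `d = 1` only two weights survive, so all memory beyond one lag is discarded; at `d = 0.4` the tail decays slowly, which is why a threshold or size cap is needed to bound the window.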
|
X_test_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:28e28884d7ade2318c01ffa836f14fe66dad42ffd29bcf7c39c589bc9d2ff5b4
|
| 3 |
-
size 2739488
|
X_train_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:2bd3342b5569749c14cba69cbc1aae53369ccaaaf0502fc74de1a84c7495788c
|
| 3 |
-
size 17228768
|
X_val_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:88464f5ecbfcb80a36d3b7113599d3088f25bc11a38317e343f9868ed907704a
|
| 3 |
-
size 1191968
|
app.py
CHANGED
|
@@ -1,15 +1,13 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
)
|
| 13 |
-
|
| 14 |
-
if __name__ == "__main__":
|
| 15 |
-
demo.launch()
|
|
|
|
| 1 |
+
"""Hugging Face Spaces entry point. Delegates to src.app for the real interface."""
|
| 2 |
+
|
| 3 |
+
import sys
|
| 4 |
+
from pathlib import Path
|
| 5 |
+
|
| 6 |
+
# Make src/ importable when the Space launches this file from the repo root.
|
| 7 |
+
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
| 8 |
+
|
| 9 |
+
from src.app import build_interface
|
| 10 |
+
|
| 11 |
+
if __name__ == "__main__":
|
| 12 |
+
demo = build_interface()
|
| 13 |
+
demo.launch()
|
|
|
|
|
|
feature_scaler.pkl → app_screenshot.png
RENAMED
|
File without changes
|
arima_model.pkl
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d6787effc883e371477f02eecc8f5e48e9148a6b286e48af1eeee4f072eb04d9
|
| 3 |
-
size 5295051
|
arima_order.pkl
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:efc90090103f31c21431c8a3d1ae6c66ca453551649bbf4488b706172c4277a4
|
| 3 |
-
size 20
|
data/raw/AAPL_stock_data_2010_2024.csv
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/raw/SPY_stock_data_2010_2024.csv
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data_preparation_metadata.json
DELETED
|
@@ -1,59 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"dataset": {
|
| 3 |
-
"total_days": 3572,
|
| 4 |
-
"date_range": "2010-10-19 to 2024-12-27",
|
| 5 |
-
"features": 13,
|
| 6 |
-
"target": "target_return"
|
| 7 |
-
},
|
| 8 |
-
"split": {
|
| 9 |
-
"train_days": 2821,
|
| 10 |
-
"val_days": 251,
|
| 11 |
-
"test_days": 499,
|
| 12 |
-
"train_pct": 78.97536394176932,
|
| 13 |
-
"val_pct": 7.026875699888017,
|
| 14 |
-
"test_pct": 13.96976483762598
|
| 15 |
-
},
|
| 16 |
-
"features": [
|
| 17 |
-
"hl_range",
|
| 18 |
-
"log_return",
|
| 19 |
-
"spy_return",
|
| 20 |
-
"co_range",
|
| 21 |
-
"return_lag2",
|
| 22 |
-
"return_lag5",
|
| 23 |
-
"volatility_20d",
|
| 24 |
-
"volume_change",
|
| 25 |
-
"day_cos",
|
| 26 |
-
"day_of_week",
|
| 27 |
-
"day_sin",
|
| 28 |
-
"month_cos",
|
| 29 |
-
"rolling_beta"
|
| 30 |
-
],
|
| 31 |
-
"prophet_regressors": [
|
| 32 |
-
"hl_range",
|
| 33 |
-
"spy_return",
|
| 34 |
-
"volatility_20d",
|
| 35 |
-
"rolling_beta",
|
| 36 |
-
"volume_change",
|
| 37 |
-
"co_range",
|
| 38 |
-
"day_cos",
|
| 39 |
-
"day_sin"
|
| 40 |
-
],
|
| 41 |
-
"lstm_sequence_length": 60,
|
| 42 |
-
"last_prices": {
|
| 43 |
-
"train": 178.08999633789062,
|
| 44 |
-
"val": 128.41000366210938,
|
| 45 |
-
"test": 257.8299865722656
|
| 46 |
-
},
|
| 47 |
-
"files_created": [
|
| 48 |
-
"feature_scaler.pkl",
|
| 49 |
-
"train_prophet.csv",
|
| 50 |
-
"val_prophet.csv",
|
| 51 |
-
"test_prophet.csv",
|
| 52 |
-
"X_train_lstm.npy",
|
| 53 |
-
"y_train_lstm.npy",
|
| 54 |
-
"X_val_lstm.npy",
|
| 55 |
-
"y_val_lstm.npy",
|
| 56 |
-
"X_test_lstm.npy",
|
| 57 |
-
"y_test_lstm.npy"
|
| 58 |
-
]
|
| 59 |
-
}
|
lstm_model.h5
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d2e60dea878818cb88f7cd864b68daf9be6c10c80cea8ab0537e3662c48ed041
|
| 3 |
-
size 1535336
|
requirements.txt
ADDED
|
@@ -0,0 +1,7 @@
|
| 1 |
+
gradio>=4.0
|
| 2 |
+
matplotlib>=3.8
|
| 3 |
+
numpy>=1.26,<3
|
| 4 |
+
pandas>=2.1
|
| 5 |
+
scikit-learn>=1.3
|
| 6 |
+
scipy>=1.11
|
| 7 |
+
xgboost>=2.0
|
src/__init__.py
ADDED
|
@@ -0,0 +1,7 @@
|
| 1 |
+
"""DataSynth — reference-backed stock forecasting pipeline.
|
| 2 |
+
|
| 3 |
+
Anchored to:
|
| 4 |
+
- AFML (López de Prado) Ch.3 (labeling), Ch.5 (FFD), Ch.7 (purged CV)
|
| 5 |
+
- Goodfellow et al. Ch.10 §10.11 (RNN optimization)
|
| 6 |
+
- Jansen, *Machine Learning for Algorithmic Trading* Ch.19 (RNNs for time series)
|
| 7 |
+
"""
|
src/app.py
ADDED
|
@@ -0,0 +1,190 @@
|
| 1 |
+
"""Gradio demo — AAPL triple-barrier direction classifier (educational).
|
| 2 |
+
|
| 3 |
+
Loads the XGBoost model (the headline winner in this study, mean test accuracy
|
| 4 |
+
~38% vs 33% random) and lets the user pick any date in the available range to
|
| 5 |
+
inspect the next-10-day direction prediction with class probabilities.
|
| 6 |
+
|
| 7 |
+
This is a *portfolio artifact*. The directional accuracy when the model
|
| 8 |
+
actually picks a side is ~36% — worse than random. Do not trade on this.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import io
|
| 14 |
+
import sys
|
| 15 |
+
import warnings
|
| 16 |
+
from pathlib import Path
|
| 17 |
+
|
| 18 |
+
warnings.filterwarnings("ignore")
|
| 19 |
+
|
| 20 |
+
import matplotlib
|
| 21 |
+
matplotlib.use("Agg")
|
| 22 |
+
import matplotlib.pyplot as plt
|
| 23 |
+
import numpy as np
|
| 24 |
+
import pandas as pd
|
| 25 |
+
|
| 26 |
+
ROOT = Path(__file__).resolve().parent.parent
|
| 27 |
+
sys.path.insert(0, str(ROOT))
|
| 28 |
+
|
| 29 |
+
from src.data import load_aapl_with_spy, get_daily_vol
|
| 30 |
+
from src.features import frac_diff_ffd
|
| 31 |
+
from src.labeling import cusum_filter, get_events, get_bins, drop_labels
|
| 32 |
+
from src.models.xgb_model import XGBTripleBarrier
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
CLASS_LABELS = {-1: "DOWN (stop-loss first)", 0: "FLAT (time-out, no signal)", 1: "UP (profit-taking first)"}
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def build_features_and_labels():
|
| 39 |
+
"""Rebuild the full feature matrix + triple-barrier labels at startup."""
|
| 40 |
+
df = load_aapl_with_spy()
|
| 41 |
+
close = df["Adj Close"]
|
| 42 |
+
log_returns = np.log(close).diff().dropna()
|
| 43 |
+
daily_vol = get_daily_vol(close, span=100)
|
| 44 |
+
|
| 45 |
+
features = pd.DataFrame(index=df.index)
|
| 46 |
+
features["frac_diff_close"] = frac_diff_ffd(np.log(close).to_frame("c"), 0.4, thres=1e-5)["c"]
|
| 47 |
+
features["frac_diff_volume"] = frac_diff_ffd(
|
| 48 |
+
np.log(df["Volume"].replace(0, np.nan)).to_frame("v"), 0.4, thres=1e-5
|
| 49 |
+
)["v"]
|
| 50 |
+
features["hl_range"] = (df["High"] - df["Low"]) / df["Close"]
|
| 51 |
+
features["spy_return"] = np.log(df["SPY_Close"]).diff()
|
| 52 |
+
features["volatility_20d"] = log_returns.rolling(20).std()
|
| 53 |
+
features["rolling_beta"] = (
|
| 54 |
+
log_returns.rolling(30).cov(features["spy_return"])
|
| 55 |
+
/ features["spy_return"].rolling(30).var()
|
| 56 |
+
)
|
| 57 |
+
features["day_of_week"] = df.index.dayofweek
|
| 58 |
+
features["vol_regime"] = daily_vol / daily_vol.rolling(252, min_periods=60).median()
|
| 59 |
+
features = features.dropna()
|
| 60 |
+
|
| 61 |
+
t_events = cusum_filter(np.log(close), threshold=float(daily_vol.median()))
|
| 62 |
+
events = get_events(
|
| 63 |
+
close=close, t_events=t_events, pt_sl=(2.0, 2.0),
|
| 64 |
+
target=daily_vol, min_ret=0.005, num_days=10,
|
| 65 |
+
)
|
| 66 |
+
labels = get_bins(events, close)
|
| 67 |
+
events_with_labels = events.join(labels[["bin"]])
|
| 68 |
+
events_with_labels = drop_labels(events_with_labels, min_pct=0.05)
|
| 69 |
+
labels = labels.loc[events_with_labels.index]
|
| 70 |
+
|
| 71 |
+
aligned = features.index.intersection(labels.index)
|
| 72 |
+
return df, close, features, labels.loc[aligned, "bin"].astype(int), features.loc[aligned]
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
print("Loading data and training XGBoost (one-time, ~10 sec)...")
|
| 76 |
+
DF, CLOSE, FEATURES_FULL, Y_TRAIN, X_TRAIN_ALIGNED = build_features_and_labels()
|
| 77 |
+
|
| 78 |
+
from sklearn.preprocessing import StandardScaler
|
| 79 |
+
SCALER = StandardScaler().fit(X_TRAIN_ALIGNED.values)
|
| 80 |
+
MODEL = XGBTripleBarrier(random_state=42)
|
| 81 |
+
MODEL.fit(
|
| 82 |
+
pd.DataFrame(SCALER.transform(X_TRAIN_ALIGNED.values), index=X_TRAIN_ALIGNED.index, columns=X_TRAIN_ALIGNED.columns),
|
| 83 |
+
Y_TRAIN.values,
|
| 84 |
+
)
|
| 85 |
+
print(f"Model trained on {len(X_TRAIN_ALIGNED)} labeled events. Ready.")
|
| 86 |
+
|
| 87 |
+
VALID_DATES = FEATURES_FULL.index
|
| 88 |
+
DEFAULT_DATE = VALID_DATES[-1]
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def predict(date_str: str):
|
| 92 |
+
try:
|
| 93 |
+
date = pd.Timestamp(date_str)
|
| 94 |
+
except Exception:
|
| 95 |
+
return "Invalid date format. Use YYYY-MM-DD.", None, None
|
| 96 |
+
|
| 97 |
+
available = FEATURES_FULL.index[FEATURES_FULL.index <= date]
|
| 98 |
+
if len(available) == 0:
|
| 99 |
+
return f"No features available on or before {date.date()}. Try a later date.", None, None
|
| 100 |
+
use_date = available[-1]
|
| 101 |
+
|
| 102 |
+
x_row = FEATURES_FULL.loc[[use_date]]
|
| 103 |
+
x_scaled = pd.DataFrame(SCALER.transform(x_row.values), index=x_row.index, columns=x_row.columns)
|
| 104 |
+
proba = MODEL.predict_proba(x_scaled)[0]
|
| 105 |
+
pred_class = int(MODEL.classes_[np.argmax(proba)])
|
| 106 |
+
|
| 107 |
+
proba_df = pd.DataFrame(
|
| 108 |
+
{"class": [CLASS_LABELS[c] for c in MODEL.classes_], "probability": [f"{p:.1%}" for p in proba]}
|
| 109 |
+
)
|
| 110 |
+
|
| 111 |
+
end_idx = DF.index.get_loc(use_date)
|
| 112 |
+
start_idx = max(0, end_idx - 59)
|
| 113 |
+
chart_data = DF["Adj Close"].iloc[start_idx : end_idx + 1]
|
| 114 |
+
|
| 115 |
+
fig, ax = plt.subplots(figsize=(8, 3.5))
|
| 116 |
+
ax.plot(chart_data.index, chart_data.values, color="black", lw=1.0)
|
| 117 |
+
ax.scatter([chart_data.index[-1]], [chart_data.iloc[-1]], color="red", s=40, zorder=3, label=f"As-of: {use_date.date()}")
|
| 118 |
+
ax.set_title(f"AAPL adjusted close — 60 days ending {use_date.date()}")
|
| 119 |
+
ax.set_ylabel("Price ($)")
|
| 120 |
+
ax.legend(loc="best")
|
| 121 |
+
ax.grid(alpha=0.3)
|
| 122 |
+
plt.tight_layout()
|
| 123 |
+
|
| 124 |
+
summary = (
|
| 125 |
+
f"**As-of date:** {use_date.date()} \n"
|
| 126 |
+
f"**Last close:** ${chart_data.iloc[-1]:.2f} \n"
|
| 127 |
+
f"**Prediction (next 10 trading days):** {CLASS_LABELS[pred_class]} \n"
|
| 128 |
+
f"**Confidence (max class probability):** {proba.max():.1%}"
|
| 129 |
+
)
|
| 130 |
+
return summary, proba_df, fig
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def build_interface():
|
| 134 |
+
import gradio as gr
|
| 135 |
+
|
| 136 |
+
caveat = """
|
| 137 |
+
> ⚠️ **This is an educational portfolio artifact, NOT a trading signal.**
|
| 138 |
+
>
|
| 139 |
+
> Under 5-fold purged k-fold cross-validation (López de Prado, *AFML*, Ch.7), this XGBoost
|
| 140 |
+
> classifier reaches mean accuracy ~38% on a 3-class triple-barrier label set (random baseline
|
| 141 |
+
> = 33%, p<0.05 in 3 of 5 folds). However, **directional accuracy *when the model picks a side*
|
| 142 |
+
> is ~36% — worse than coin flip**. The model is mildly informative about "will something
|
| 143 |
+
> happen vs nothing" but uninformative about "up vs down." Do not trade real money on this.
|
| 144 |
+
"""
|
| 145 |
+
|
| 146 |
+
with gr.Blocks(title="AAPL Triple-Barrier Direction Classifier") as demo:
|
| 147 |
+
gr.Markdown("# AAPL Triple-Barrier Direction Classifier (educational)")
|
| 148 |
+
gr.Markdown(caveat)
|
| 149 |
+
gr.Markdown(
|
| 150 |
+
"Reference-backed financial-ML pipeline: triple-barrier labeling "
|
| 151 |
+
"(AFML Ch.3), fractional differentiation (Ch.5), purged k-fold CV (Ch.7), "
|
| 152 |
+
"XGBoost classifier. Source: github.com/moccaram/DataSynth."
|
| 153 |
+
)
|
| 154 |
+
|
| 155 |
+
with gr.Row():
|
| 156 |
+
with gr.Column(scale=1):
|
| 157 |
+
date_input = gr.Textbox(
|
| 158 |
+
label="As-of date (YYYY-MM-DD)",
|
| 159 |
+
value=str(DEFAULT_DATE.date()),
|
| 160 |
+
info=f"Valid range: {VALID_DATES[0].date()} → {VALID_DATES[-1].date()}",
|
| 161 |
+
)
|
| 162 |
+
predict_btn = gr.Button("Predict next 10-day direction", variant="primary")
|
| 163 |
+
summary_md = gr.Markdown()
|
| 164 |
+
proba_table = gr.Dataframe(headers=["class", "probability"], label="Class probabilities")
|
| 165 |
+
|
| 166 |
+
with gr.Column(scale=2):
|
| 167 |
+
chart = gr.Plot(label="60-day price context")
|
| 168 |
+
|
| 169 |
+
predict_btn.click(
|
| 170 |
+
fn=predict, inputs=[date_input], outputs=[summary_md, proba_table, chart]
|
| 171 |
+
)
|
| 172 |
+
|
| 173 |
+
gr.Markdown(
|
| 174 |
+
"---\n"
|
| 175 |
+
"Headline result table (mean over 5 purged folds):\n\n"
|
| 176 |
+
"| Model | Accuracy | Beat random (p<0.05) | Dir.acc when acting |\n"
|
| 177 |
+
"|-----------|----------|----------------------|---------------------|\n"
|
| 178 |
+
"| Majority | 35.0% | 0/5 folds | N/A |\n"
|
| 179 |
+
"| SES | 36.8% | 2/5 folds | always abstains |\n"
|
| 180 |
+
"| ARIMA | 36.8% | 2/5 folds | always abstains |\n"
|
| 181 |
+
"| LSTM | 35.8% | 2/5 folds | 33% (worse than 50%) |\n"
|
| 182 |
+
"| **XGBoost** | **37.8%** | **3/5 folds** | 36% (worse than 50%) |\n"
|
| 183 |
+
)
|
| 184 |
+
|
| 185 |
+
return demo
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
if __name__ == "__main__":
|
| 189 |
+
app = build_interface()
|
| 190 |
+
app.launch(server_name="127.0.0.1", server_port=7860, inbrowser=False, share=False)
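The labels the app trains on come from the triple-barrier method, whose implementation (`src/labeling.py`) is not shown in this part of the diff. The idea can be sketched on a toy price path as follows, assuming symmetric barriers at `pt_sl` times a volatility estimate and a 10-day vertical barrier, as in the `get_events` call above. The function `triple_barrier_label` is hypothetical and simplified; it labels a time-out as 0 to match the app's `CLASS_LABELS`.

```python
def triple_barrier_label(prices, start, vol, pt_sl=(2.0, 2.0), num_days=10):
    """Return +1, -1, or 0 for the event that starts at index `start`.

    Hypothetical, simplified helper:
      +1 if the profit-taking barrier is touched first,
      -1 if the stop-loss barrier is touched first,
       0 if neither is touched before the vertical (time) barrier.
    """
    entry = prices[start]
    upper = pt_sl[0] * vol    # profit-taking barrier, expressed as a return
    lower = -pt_sl[1] * vol   # stop-loss barrier, expressed as a return
    last = min(start + num_days, len(prices) - 1)
    for t in range(start + 1, last + 1):
        ret = prices[t] / entry - 1.0
        if ret >= upper:
            return 1
        if ret <= lower:
            return -1
    return 0  # time-out: no horizontal barrier was touched


up = triple_barrier_label([100, 101, 103, 105, 104], start=0, vol=0.02)
down = triple_barrier_label([100, 99, 95.5, 97], start=0, vol=0.02)
flat = triple_barrier_label([100, 100.5, 99.8, 100.2], start=0, vol=0.02)
print(up, down, flat)
```

Scaling the barriers by volatility means the same price move can be a signal in a calm regime and noise in a turbulent one, which is the motivation for passing `target=daily_vol` rather than a fixed percentage.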
|
src/cv.py
ADDED
|
@@ -0,0 +1,81 @@
|
| 1 |
+
"""Purged k-fold cross-validation — AFML Ch.7 (BonusPDF pp.62-67).
|
| 2 |
+
|
| 3 |
+
Standard k-fold leaks information in finance because labels span time intervals.
|
| 4 |
+
If a training label's interval ``[t_i, t1_i]`` overlaps a test label's interval
|
| 5 |
+
``[t_j, t1_j]``, the two share underlying price information and the train/test
|
| 6 |
+
boundary is fictitious. ``PurgedKFold`` drops the offending training samples;
|
| 7 |
+
an additional ``pct_embargo`` buffer drops samples immediately *after* each test
|
| 8 |
+
fold to prevent reverse leakage from the test set into a later train fold.
|
| 9 |
+
|
| 10 |
+
This is a port of AFML Snippets 7.2-7.3 (BonusPDF pp.65-66). The canonical class
|
| 11 |
+
inherits from sklearn's ``_BaseKFold`` so it works as a drop-in replacement.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
import numpy as np
|
| 17 |
+
import pandas as pd
|
| 18 |
+
from scipy import stats
|
| 19 |
+
from sklearn.model_selection._split import _BaseKFold
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
class PurgedKFold(_BaseKFold):
|
| 23 |
+
"""K-fold CV with purging + optional embargo. AFML Snippet 7.3 (BonusPDF p.66)."""
|
| 24 |
+
|
| 25 |
+
def __init__(self, n_splits: int = 5, t1: pd.Series | None = None, pct_embargo: float = 0.0):
|
| 26 |
+
if not isinstance(t1, pd.Series):
|
| 27 |
+
raise ValueError("`t1` must be a pd.Series of label-end timestamps")
|
| 28 |
+
super().__init__(n_splits, shuffle=False, random_state=None)
|
| 29 |
+
self.t1 = t1
|
| 30 |
+
self.pct_embargo = pct_embargo
|
| 31 |
+
|
| 32 |
+
def split(self, X, y=None, groups=None):
|
| 33 |
+
if not X.index.equals(self.t1.index):
|
| 34 |
+
raise ValueError("X.index must equal t1.index")
|
| 35 |
+
indices = np.arange(X.shape[0])
|
| 36 |
+
embargo_size = int(X.shape[0] * self.pct_embargo)
|
| 37 |
+
test_ranges = [(arr[0], arr[-1] + 1) for arr in np.array_split(indices, self.n_splits)]
|
| 38 |
+
|
| 39 |
+
for i, j in test_ranges:
|
| 40 |
+
t0 = self.t1.index[i]
|
| 41 |
+
test_indices = indices[i:j]
|
| 42 |
+
max_t1_in_test = self.t1.iloc[test_indices].max()
|
| 43 |
+
max_t1_pos = self.t1.index.searchsorted(max_t1_in_test)
|
| 44 |
+
# left train: rows whose label ended at or before the test start
|
| 45 |
+
left_train = self.t1.index.searchsorted(self.t1[self.t1 <= t0].index)
|
| 46 |
+
# right train: rows starting at or after max-t1, shifted forward by the embargo
|
| 47 |
+
if max_t1_pos < X.shape[0]:
|
| 48 |
+
right_train = indices[max_t1_pos + embargo_size :]
|
| 49 |
+
else:
|
| 50 |
+
right_train = np.array([], dtype=int)
|
| 51 |
+
train_indices = np.concatenate([left_train, right_train])
|
| 52 |
+
yield train_indices, test_indices
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def get_embargo_times(times: pd.DatetimeIndex, pct_embargo: float) -> pd.Series:
|
| 56 |
+
"""AFML Snippet 7.2 (BonusPDF p.65). Map each timestamp to its embargo end."""
|
| 57 |
+
step = int(times.shape[0] * pct_embargo)
|
| 58 |
+
if step == 0:
|
| 59 |
+
return pd.Series(times, index=times)
|
| 60 |
+
embargo = pd.Series(times[step:], index=times[:-step])
|
| 61 |
+
return pd.concat([embargo, pd.Series(times[-1], index=times[-step:])])
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def binomial_pvalue(n_correct: int, n_total: int, p_null: float = 0.5) -> float:
|
| 65 |
+
"""One-sided binomial p-value: ``P(X >= n_correct | n=n_total, p=p_null)``.
|
| 66 |
+
|
| 67 |
+
Used to test whether observed accuracy or directional accuracy exceeds the
|
| 68 |
+
null. For three-class targets, pass ``p_null=1/3``; for binary direction
|
| 69 |
+
after dropping 0-labels, pass ``p_null=0.5``.
|
| 70 |
+
"""
|
| 71 |
+
return float(stats.binomtest(n_correct, n_total, p=p_null, alternative="greater").pvalue)
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def proportion_ci(n_correct: int, n_total: int, alpha: float = 0.05) -> tuple[float, float]:
|
| 75 |
+
"""Wilson CI at level ``1 - alpha`` (95% by default). More accurate than normal-approx for small n."""
|
| 76 |
+
if n_total == 0:
|
| 77 |
+
return (np.nan, np.nan)
|
| 78 |
+
ci = stats.binomtest(n_correct, n_total).proportion_ci(
|
| 79 |
+
confidence_level=1 - alpha, method="wilson"
|
| 80 |
+
)
|
| 81 |
+
return float(ci.low), float(ci.high)
|
src/data.py
ADDED
|
@@ -0,0 +1,52 @@
|
| 1 |
+
"""Data loaders for the AAPL/SPY pipeline + EWM daily volatility (AFML Snippet 3.1).
|
| 2 |
+
|
| 3 |
+
The CSVs under ``data/raw/`` have a column-header bug: the header reads
|
| 4 |
+
``Open,High,Low,Close,Adj Close,Volume`` but the underlying yfinance frame was
|
| 5 |
+
saved after a ``sort_index(axis=1)`` so the actual column order is alphabetical:
|
| 6 |
+
``Adj Close, Close, High, Low, Open, Volume``. We override the headers on load.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
|
| 13 |
+
import numpy as np
|
| 14 |
+
import pandas as pd
|
| 15 |
+
|
| 16 |
+
DATA_DIR = Path(__file__).resolve().parent.parent / "data" / "raw"
|
| 17 |
+
|
| 18 |
+
ACTUAL_COLUMN_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume", "company_name"]
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def load_ohlcv(ticker: str, data_dir: Path | None = None) -> pd.DataFrame:
|
| 22 |
+
"""Load a single-ticker OHLCV CSV from ``data/raw/``, fixing the column order."""
|
| 23 |
+
data_dir = data_dir or DATA_DIR
|
| 24 |
+
path = data_dir / f"{ticker}_stock_data_2010_2024.csv"
|
| 25 |
+
df = pd.read_csv(path, header=0, names=ACTUAL_COLUMN_ORDER, skiprows=1)
|
| 26 |
+
df["Date"] = pd.to_datetime(df["Date"])
|
| 27 |
+
df = df.set_index("Date").sort_index()
|
| 28 |
+
return df[["Open", "High", "Low", "Close", "Adj Close", "Volume"]]
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def load_aapl_with_spy() -> pd.DataFrame:
|
| 32 |
+
"""Merged AAPL + SPY frame for market-relative features. Index = trading dates."""
|
| 33 |
+
aapl = load_ohlcv("AAPL")
|
| 34 |
+
spy = load_ohlcv("SPY")[["Adj Close", "Volume"]].rename(
|
| 35 |
+
columns={"Adj Close": "SPY_Close", "Volume": "SPY_Volume"}
|
| 36 |
+
)
|
| 37 |
+
return aapl.join(spy, how="inner")
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def get_daily_vol(close: pd.Series, span: int = 100) -> pd.Series:
|
| 41 |
+
"""EWM daily-return volatility — AFML Snippet 3.1 (BonusPDF p.26).
|
| 42 |
+
|
| 43 |
+
Used to set the horizontal barrier widths in triple-barrier labeling. Output
|
| 44 |
+
is forward-fill safe: NaNs only at the leading edge before EWM warmup.
|
| 45 |
+
"""
|
| 46 |
+
returns = close.pct_change()
|
| 47 |
+
return returns.ewm(span=span).std()
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def cumulative_returns_path(close: pd.Series, t0, t1) -> pd.Series:
|
| 51 |
+
"""Return path from t0 to t1 expressed as ``close/close[t0] - 1``."""
|
| 52 |
+
return close.loc[t0:t1] / close.loc[t0] - 1
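The header bug described in the `src/data.py` docstring is easy to reproduce on a one-row toy file. The sketch below uses only the standard library and made-up prices; the column lists omit the `company_name` field for brevity, and `read_rows` and `is_consistent` are hypothetical helpers, not part of the repository.

```python
import csv
import io

HEADER_AS_WRITTEN = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
ACTUAL_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume"]

# Hypothetical file: the header claims OHLC order, the values are in alphabetical order.
raw = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2024-01-02,184.5,185.6,188.4,183.9,187.2,82000000\n"
)


def read_rows(text, names):
    """Parse the CSV, discard its header line, and label the columns with `names`."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the (incorrect) header
    return [dict(zip(names, row)) for row in reader]


def is_consistent(row):
    """A sane bar has Low <= Open <= High and Low <= Close <= High."""
    high, low = float(row["High"]), float(row["Low"])
    return low <= float(row["Open"]) <= high and low <= float(row["Close"]) <= high


trusting = read_rows(raw, HEADER_AS_WRITTEN)[0]
fixed = read_rows(raw, ACTUAL_ORDER)[0]
print(is_consistent(trusting), is_consistent(fixed))
```

A sanity check of this kind is a cheap way to catch mislabeled OHLC columns before they feed features such as `hl_range`.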
|
src/eval.py
ADDED
|
@@ -0,0 +1,76 @@
|
| 1 |
+
"""Evaluation metrics with statistical significance — triple-barrier era.
|
| 2 |
+
|
| 3 |
+
The original notebook reported directional accuracy without binomial p-values;
|
| 4 |
+
49.9% over 499 days is statistically indistinguishable from 50%. This module
|
| 5 |
+
makes that explicit by attaching a p-value to every accuracy figure.
|
| 6 |
+
|
| 7 |
+
Metric conventions
|
| 8 |
+
------------------
|
| 9 |
+
- For 3-class labels ``{-1, 0, +1}``, the null is uniform random: ``p_null=1/3``.
|
| 10 |
+
- For *directional accuracy when acting*, restrict to predictions in ``{-1, +1}``
|
| 11 |
+
(i.e. ignore "no-action" 0 predictions), compare to ``p_null=1/2``.
|
| 12 |
+
- Both metrics use a one-sided binomial test (we only care if it beats chance).
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from __future__ import annotations
|
| 16 |
+
|
| 17 |
+
import numpy as np
|
| 18 |
+
import pandas as pd
|
| 19 |
+
from sklearn.metrics import accuracy_score, confusion_matrix
|
| 20 |
+
|
| 21 |
+
from .cv import binomial_pvalue
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def directional_accuracy_when_acting(
|
| 25 |
+
y_true: np.ndarray, y_pred: np.ndarray
|
| 26 |
+
) -> tuple[float, int, int]:
|
| 27 |
+
"""Accuracy conditioned on the model predicting a non-zero direction.
|
| 28 |
+
|
| 29 |
+
Returns ``(accuracy, n_correct, n_acting)``. If ``n_acting`` is 0, returns
|
| 30 |
+
``(nan, 0, 0)``.
|
| 31 |
+
"""
|
| 32 |
+
acting_mask = y_pred != 0
|
| 33 |
+
n_acting = int(acting_mask.sum())
|
| 34 |
+
if n_acting == 0:
|
| 35 |
+
return float("nan"), 0, 0
|
| 36 |
+
correct = int(((y_pred == y_true) & acting_mask).sum())
|
| 37 |
+
return correct / n_acting, correct, n_acting
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def fold_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
|
| 41 |
+
"""Per-fold metric bundle. Designed to be one row in the comparison CSV."""
|
| 42 |
+
y_true = np.asarray(y_true)
|
| 43 |
+
y_pred = np.asarray(y_pred)
|
| 44 |
+
n = len(y_true)
|
| 45 |
+
acc = accuracy_score(y_true, y_pred)
|
| 46 |
+
n_acc_correct = int((y_true == y_pred).sum())
|
| 47 |
+
dir_acc, n_dir_correct, n_acting = directional_accuracy_when_acting(y_true, y_pred)
|
| 48 |
+
|
| 49 |
+
return {
|
| 50 |
+
"n_test": n,
|
| 51 |
+
"accuracy": acc,
|
| 52 |
+
"binom_p_acc": binomial_pvalue(n_acc_correct, n, p_null=1 / 3),
|
| 53 |
+
"n_acting": n_acting,
|
| 54 |
+
"dir_acc_when_acting": dir_acc,
|
| 55 |
+
"binom_p_dir": (
|
| 56 |
+
binomial_pvalue(n_dir_correct, n_acting, p_null=0.5) if n_acting > 0 else float("nan")
|
| 57 |
+
),
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def summarize_results(results: pd.DataFrame) -> pd.DataFrame:
|
| 62 |
+
"""Aggregate per-fold rows to per-model summary with mean ± std."""
|
| 63 |
+
keep = ["accuracy", "binom_p_acc", "dir_acc_when_acting", "binom_p_dir"]
|
| 64 |
+
grouped = results.groupby("model")[keep]
|
| 65 |
+
summary = grouped.agg(["mean", "std"])
|
| 66 |
+
summary.columns = [f"{c}_{stat}" for c, stat in summary.columns]
|
| 67 |
+
summary["n_folds"] = results.groupby("model").size()
|
| 68 |
+
return summary.reset_index()
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
def confusion_table(y_true: np.ndarray, y_pred: np.ndarray, labels=(-1, 0, 1)) -> pd.DataFrame:
|
| 72 |
+
"""Confusion matrix as a labeled DataFrame (rows=true, cols=pred)."""
|
| 73 |
+
cm = confusion_matrix(y_true, y_pred, labels=list(labels))
|
| 74 |
+
return pd.DataFrame(
|
| 75 |
+
cm, index=[f"true_{c}" for c in labels], columns=[f"pred_{c}" for c in labels]
|
| 76 |
+
)
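The two headline metrics in `src/eval.py` can be checked by hand on a small example. The sketch below re-derives them with the standard library only: an exact one-sided binomial tail via `math.comb` (standing in for `scipy.stats.binomtest`) and the accuracy conditioned on non-zero predictions. The function names and the toy label vectors are illustrative assumptions.

```python
from math import comb


def binomial_tail(n_correct, n_total, p_null):
    """Exact P(X >= n_correct) for X ~ Binomial(n_total, p_null)."""
    return sum(
        comb(n_total, k) * p_null**k * (1 - p_null) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def dir_acc_when_acting(y_true, y_pred):
    """Accuracy over the subset where the prediction is -1 or +1."""
    acting = [(t, p) for t, p in zip(y_true, y_pred) if p != 0]
    if not acting:
        return float("nan"), 0, 0
    correct = sum(t == p for t, p in acting)
    return correct / len(acting), correct, len(acting)


# Toy labels: the model acts on 5 of 8 events and is right on 2 of them.
y_true = [1, -1, 0, 1, -1, 0, 1, -1]
y_pred = [1, 1, 0, 0, -1, 1, -1, 0]

dir_acc, n_correct, n_acting = dir_acc_when_acting(y_true, y_pred)
p_dir = binomial_tail(n_correct, n_acting, p_null=0.5)

# The docstring's example: 49.9% over 499 days, i.e. 249 correct calls.
p_notebook = binomial_tail(249, 499, p_null=0.5)
print(dir_acc, n_acting, round(p_dir, 4), round(p_notebook, 3))
```

Both p-values are far above 0.05, which is the docstring's point: an accuracy near 50% on a few hundred observations gives no evidence of beating chance.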
|
src/features.py
ADDED
|
@@ -0,0 +1,109 @@
|
| 1 |
+
"""Fractional differentiation — AFML Ch.5 §5.4 (BonusPDF p.46).
|
| 2 |
+
|
| 3 |
+
Why this module exists
|
| 4 |
+
----------------------
|
| 5 |
+
Log-returns achieve stationarity but destroy memory: the binomial weights
|
| 6 |
+
``(1-B)^d`` collapse to ``[1, -1, 0, 0, ...]`` at ``d=1``. For ``d ∈ (0, 1)``
|
| 7 |
+
the weights decay as a long power-law tail, so the series stays stationary
|
| 8 |
+
while retaining a long memory of past prices (Table 5.1 in AFML shows most
|
| 9 |
+
liquid futures reach ADF stationarity at ``d < 0.6``, and the majority at
|
| 10 |
+
``d < 0.3``).
|
| 11 |
+
|
| 12 |
+
This is a port of AFML Snippets 5.1, 5.3, 5.4 (BonusPDF pp.48, 51, 53).
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from __future__ import annotations
|
| 16 |
+
|
| 17 |
+
import numpy as np
|
| 18 |
+
import pandas as pd
|
| 19 |
+
from scipy.special import gamma
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def get_ffd_weights(d: float, thres: float = 1e-5, max_size: int = 1024) -> np.ndarray:
|
| 23 |
+
"""Binomial-series weights for the fractional-differencing operator ``(1-B)^d``.
|
| 24 |
+
|
| 25 |
+
Cuts the series off once ``|w_k| < thres``. Uses ``scipy.special.gamma`` for
|
| 26 |
+
a vectorized closed form rather than the recursive loop in AFML Snippet 5.1.
|
| 27 |
+
Caveat: ``gamma(k + 1)`` overflows float64 for ``k >= 171``, so those weights become 0 and the window is capped at 171 terms.
|
| 28 |
+
|
| 29 |
+
Returns
|
| 30 |
+
-------
|
| 31 |
+
np.ndarray of shape ``(n,)`` ordered from oldest to newest:
|
| 32 |
+
``[w_{n-1}, w_{n-2}, ..., w_1, w_0]`` so the dot product with
|
| 33 |
+
``series[t-n+1 : t+1]`` is the differenced value at ``t``.
|
| 34 |
+
"""
|
| 35 |
+
k = np.arange(max_size)
|
| 36 |
+
with np.errstate(invalid="ignore", divide="ignore"):
|
| 37 |
+
w = (-1) ** k * gamma(d + 1) / (gamma(k + 1) * gamma(d - k + 1))
|
| 38 |
+
w = np.nan_to_num(w, nan=0.0, posinf=0.0, neginf=0.0)
|
| 39 |
+
cutoff = np.argmax(np.abs(w) < thres) if np.any(np.abs(w) < thres) else max_size
|
| 40 |
+
if cutoff == 0:
|
| 41 |
+
cutoff = max_size
|
| 42 |
+
return w[:cutoff][::-1]
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def frac_diff_ffd(series: pd.Series | pd.DataFrame, d: float, thres: float = 1e-5) -> pd.DataFrame:
|
| 46 |
+
"""Fixed-width fractional differencing — AFML Snippet 5.3 (BonusPDF p.51).
|
| 47 |
+
|
| 48 |
+
The fixed-width window keeps weights stable through time (unlike the
|
| 49 |
+
expanding-window variant in Snippet 5.2 which downweights early observations).
|
| 50 |
+
"""
|
| 51 |
+
if isinstance(series, pd.Series):
|
| 52 |
+
series = series.to_frame()
|
| 53 |
+
w = get_ffd_weights(d, thres=thres) # shape (width+1,)
|
| 54 |
+
width = len(w) - 1
|
| 55 |
+
out = {}
|
| 56 |
+
for col in series.columns:
|
| 57 |
+
s = series[[col]].ffill().dropna()
|
| 58 |
+
if len(s) <= width:
|
| 59 |
+
out[col] = pd.Series(index=s.index[width:], dtype=float)
|
| 60 |
+
continue
|
| 61 |
+
values = s[col].to_numpy()
|
| 62 |
+
# Vectorized: build a (n_out, width+1) sliding-window matrix and dot with w
|
| 63 |
+
from numpy.lib.stride_tricks import sliding_window_view
|
| 64 |
+
windows = sliding_window_view(values, width + 1)
|
| 65 |
+
diffed = windows @ w
|
| 66 |
+
out[col] = pd.Series(diffed, index=s.index[width:])
|
| 67 |
+
return pd.concat(out, axis=1)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def find_min_d(series: pd.Series, d_range=(0.0, 1.0), n_steps: int = 11, thres: float = 1e-5) -> pd.DataFrame:
|
| 71 |
+
"""Sweep ``d`` and return ADF stat + correlation — AFML Snippet 5.4 (BonusPDF p.53).
|
| 72 |
+
|
| 73 |
+
Use to pick the smallest ``d`` for which the FFD-differenced log-price passes
|
| 74 |
+
the ADF stationarity test at 95% (statistic < critical value ≈ -2.86).
|
| 75 |
+
Returns a frame indexed by ``d`` with columns: ``adf_stat, p_value, n_obs,
|
| 76 |
+
crit_95, corr_with_original``.
|
| 77 |
+
"""
|
| 78 |
+
from statsmodels.tsa.stattools import adfuller
|
| 79 |
+
|
| 80 |
+
log_series = np.log(series.dropna()).to_frame(name=series.name or "value")
|
| 81 |
+
results = {}
|
| 82 |
+
for d in np.linspace(d_range[0], d_range[1], n_steps):
|
| 83 |
+
diffed = frac_diff_ffd(log_series, d, thres=thres).dropna()
|
| 84 |
+
if len(diffed) < 50:
|
| 85 |
+
continue
|
| 86 |
+
col = diffed.columns[0]
|
| 87 |
+
adf = adfuller(diffed[col], maxlag=1, regression="c", autolag=None)
|
| 88 |
+
aligned = log_series.loc[diffed.index, col]
|
| 89 |
+
corr = float(aligned.corr(diffed[col]))
|
| 90 |
+
results[round(d, 3)] = {
|
| 91 |
+
"adf_stat": adf[0],
|
| 92 |
+
"p_value": adf[1],
|
| 93 |
+
"n_obs": adf[3],
|
| 94 |
+
"crit_95": adf[4]["5%"],
|
| 95 |
+
"corr_with_original": corr,
|
| 96 |
+
}
|
| 97 |
+
return pd.DataFrame(results).T.rename_axis("d")
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def rolling_zscore(series: pd.Series, window: int = 252, min_periods: int | None = None) -> pd.Series:
|
| 101 |
+
"""Rolling z-score with leak-free statistics (uses only the trailing window).
|
| 102 |
+
|
| 103 |
+
Stronger than a single fit-on-train ``StandardScaler`` because regime shifts
|
| 104 |
+
don't carry stale means forward into the test set.
|
| 105 |
+
"""
|
| 106 |
+
min_periods = min_periods or min(window, max(window // 4, 20))  # pandas requires min_periods <= window
|
| 107 |
+
mu = series.rolling(window=window, min_periods=min_periods).mean()
|
| 108 |
+
sd = series.rolling(window=window, min_periods=min_periods).std()
|
| 109 |
+
return (series - mu) / sd.replace(0, np.nan)
|
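The weight recurrence behind `(1-B)^d` can be sketched on its own, independent of the module above. This is a minimal illustration, assuming the AFML Snippet 5.1 recurrence `w_k = -w_{k-1} * (d - k + 1) / k`; the function name `ffd_weights_recursive` and the toy prices are hypothetical. The recurrence never forms a factorial, so it does not hit the float64 overflow that a gamma-function closed form runs into beyond `k = 171`.

```python
import numpy as np


def ffd_weights_recursive(d: float, thres: float = 1e-5, max_size: int = 1024) -> np.ndarray:
    """Weights of (1-B)^d, newest first: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    weights = [1.0]
    for k in range(1, max_size):
        w_k = -weights[-1] * (d - k + 1) / k
        if abs(w_k) < thres:  # stop once the tail is negligible
            break
        weights.append(w_k)
    return np.array(weights)


w_int = ffd_weights_recursive(1.0)   # integer differencing: memory is gone after one lag
w_frac = ffd_weights_recursive(0.4)  # fractional: long, slowly decaying tail

# With d = 1 the operator reduces to a plain first difference of log-prices.
x = np.log(np.array([100.0, 101.0, 99.0, 102.0]))
n_w = len(w_int)
diffed = np.array([w_int @ x[t - n_w + 1 : t + 1][::-1] for t in range(n_w - 1, len(x))])
```

Here the weights are ordered newest first, so each window is reversed before the dot product; the module above stores them oldest first and skips the reversal.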
src/labeling.py
ADDED
|
@@ -0,0 +1,219 @@
|
| 1 |
+
"""Triple-barrier labeling — AFML Ch.3 (BonusPDF pp.26-34).
|
| 2 |
+
|
| 3 |
+
The triple-barrier method assigns each event one of three labels based on which
|
| 4 |
+
of three barriers is hit first:
|
| 5 |
+
|
| 6 |
+
- ``+1`` — upper (profit-taking) horizontal barrier hit first
|
| 7 |
+
- ``-1`` — lower (stop-loss) horizontal barrier hit first
|
| 8 |
+
- ``0`` — vertical (max holding period) barrier hit first
|
| 9 |
+
|
| 10 |
+
The horizontal barriers are scaled by a per-event volatility estimate (typically
|
| 11 |
+
EWM daily vol, ``get_daily_vol`` in ``src/data.py``). This is a port of AFML
|
| 12 |
+
Snippets 3.2-3.5 and Rambo's cleaner ``get_triple_barrier_label`` (his repo,
|
| 13 |
+
``Chapter_3.py``).
|
| 14 |
+
"""
|
| 15 |
+
|
| 16 |
+
from __future__ import annotations
|
| 17 |
+
|
| 18 |
+
import numpy as np
|
| 19 |
+
import pandas as pd
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def apply_pt_sl_on_t1(
|
| 23 |
+
close: pd.Series, events: pd.DataFrame, pt_sl: tuple[float, float]
|
| 24 |
+
) -> pd.DataFrame:
|
| 25 |
+
"""AFML Snippet 3.2 (BonusPDF p.27). Find time of first barrier touch.
|
| 26 |
+
|
| 27 |
+
Parameters
|
| 28 |
+
----------
|
| 29 |
+
close : pd.Series
|
| 30 |
+
Closing-price series, indexed by date.
|
| 31 |
+
events : pd.DataFrame
|
| 32 |
+
Required columns: ``t1`` (vertical-barrier date or NaT), ``target``
|
| 33 |
+
(vol estimate at the event), ``side`` (+1 for long, -1 for short; if
|
| 34 |
+
we don't know side, pass +1 for all).
|
| 35 |
+
pt_sl : (float, float)
|
| 36 |
+
Profit-taking and stop-loss multipliers of ``target``. Pass 0 to disable
|
| 37 |
+
a barrier.
|
| 38 |
+
|
| 39 |
+
Returns
|
| 40 |
+
-------
|
| 41 |
+
pd.DataFrame indexed like ``events`` with columns ``t1, pt, sl`` containing
|
| 42 |
+
the first-touch timestamps (NaT if never touched).
|
| 43 |
+
"""
|
| 44 |
+
out = events[["t1"]].copy()
|
| 45 |
+
pt = pt_sl[0] * events["target"] if pt_sl[0] > 0 else pd.Series(np.nan, index=events.index)
|
| 46 |
+
sl = -pt_sl[1] * events["target"] if pt_sl[1] > 0 else pd.Series(np.nan, index=events.index)
|
| 47 |
+
|
| 48 |
+
for t0, t1 in events["t1"].fillna(close.index[-1]).items():
|
| 49 |
+
path_prices = close.loc[t0:t1]
|
| 50 |
+
path_returns = (path_prices / close.loc[t0] - 1) * events.at[t0, "side"]
|
| 51 |
+
sl_hits = path_returns[path_returns < sl[t0]]
|
| 52 |
+
pt_hits = path_returns[path_returns > pt[t0]]
|
| 53 |
+
out.at[t0, "sl"] = sl_hits.index.min() if len(sl_hits) else pd.NaT
|
| 54 |
+
out.at[t0, "pt"] = pt_hits.index.min() if len(pt_hits) else pd.NaT
|
| 55 |
+
return out
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def add_vertical_barrier(
|
| 59 |
+
close: pd.Series, t_events: pd.DatetimeIndex, num_days: int
|
| 60 |
+
) -> pd.Series:
|
| 61 |
+
"""AFML Snippet 3.4 (BonusPDF p.30). Vertical (time-limit) barriers.
|
| 62 |
+
|
| 63 |
+
Returns a Series indexed by ``t_events`` whose values are ``num_days`` later,
|
| 64 |
+
snapped to the next available trading day; events too close to the end of
|
| 65 |
+
the series are dropped.
|
| 66 |
+
"""
|
| 67 |
+
t1 = close.index.searchsorted(t_events + pd.Timedelta(days=num_days))
|
| 68 |
+
t1 = t1[t1 < close.shape[0]]
|
| 69 |
+
return pd.Series(close.index[t1], index=t_events[: len(t1)])
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def get_events(
|
| 73 |
+
close: pd.Series,
|
| 74 |
+
t_events: pd.DatetimeIndex,
|
| 75 |
+
pt_sl: tuple[float, float],
|
| 76 |
+
target: pd.Series,
|
| 77 |
+
min_ret: float,
|
| 78 |
+
num_days: int | None = None,
|
| 79 |
+
side: pd.Series | None = None,
|
| 80 |
+
) -> pd.DataFrame:
|
| 81 |
+
"""AFML Snippet 3.3 (BonusPDF p.29). Run triple-barrier for a batch of events.
|
| 82 |
+
|
| 83 |
+
Returns a DataFrame indexed by event start time with columns:
|
| 84 |
+
|
| 85 |
+
- ``t1`` (timestamp of the *first* barrier hit — earliest of vertical/pt/sl)
|
| 86 |
+
- ``vertical_t1`` (the original vertical-barrier date)
|
| 87 |
+
- ``barrier_hit`` (one of ``"vertical"`` / ``"pt"`` / ``"sl"`` — what was hit
|
| 88 |
+
first; used by ``get_bins`` to produce the {-1, 0, +1} label)
|
| 89 |
+
- ``target`` (vol estimate at the event)
|
| 90 |
+
|
| 91 |
+
If ``side`` is provided, it is propagated for downstream meta-labeling.
|
| 92 |
+
"""
|
| 93 |
+
target = target.reindex(t_events).dropna()
|
| 94 |
+
target = target[target > min_ret]
|
| 95 |
+
|
| 96 |
+
if num_days is not None:
|
| 97 |
+
vertical_t1 = add_vertical_barrier(close, target.index, num_days)
|
| 98 |
+
else:
|
| 99 |
+
vertical_t1 = pd.Series(pd.NaT, index=target.index)
|
| 100 |
+
|
| 101 |
+
if side is None:
|
| 102 |
+
side_ = pd.Series(1.0, index=target.index)
|
| 103 |
+
else:
|
| 104 |
+
side_ = side.reindex(target.index).fillna(1.0)
|
| 105 |
+
|
| 106 |
+
events = pd.concat(
|
| 107 |
+
{"t1": vertical_t1, "target": target, "side": side_}, axis=1
|
| 108 |
+
).dropna(subset=["target"])
|
| 109 |
+
touches = apply_pt_sl_on_t1(close, events, pt_sl)
|
| 110 |
+
|
| 111 |
+
# Drop events where no barrier ever fires (can't happen with a vertical
|
| 112 |
+
# barrier present, but defensive against future config changes).
|
| 113 |
+
touches = touches.dropna(subset=["t1", "pt", "sl"], how="all")
|
| 114 |
+
events = events.loc[touches.index]
|
| 115 |
+
|
| 116 |
+
# Earliest touch among (vertical, pt, sl); record which barrier won.
|
| 117 |
+
all_touches = touches[["t1", "pt", "sl"]]
|
| 118 |
+
earliest = all_touches.min(axis=1)
|
| 119 |
+
# Row-wise argmin via a far-future sentinel: pandas' idxmin chokes on all-NaT slices.
|
| 120 |
+
barrier_hit = pd.Series("vertical", index=all_touches.index)
|
| 121 |
+
pt_arr = all_touches["pt"]
|
| 122 |
+
sl_arr = all_touches["sl"]
|
| 123 |
+
vert_arr = all_touches["t1"]
|
| 124 |
+
# Replace NaT with a very large date for comparison purposes
|
| 125 |
+
far = pd.Timestamp.max
|
| 126 |
+
cmp = pd.DataFrame(
|
| 127 |
+
{
|
| 128 |
+
"pt": pt_arr.fillna(far),
|
| 129 |
+
"sl": sl_arr.fillna(far),
|
| 130 |
+
"vertical": vert_arr.fillna(far),
|
| 131 |
+
}
|
| 132 |
+
)
|
| 133 |
+
barrier_hit = cmp.idxmin(axis=1)
|
| 134 |
+
|
| 135 |
+
events["vertical_t1"] = events["t1"]
|
| 136 |
+
events["t1"] = earliest
|
| 137 |
+
events["barrier_hit"] = barrier_hit.astype(str)
|
| 138 |
+
if side is None:
|
| 139 |
+
events = events.drop("side", axis=1)
|
| 140 |
+
return events.dropna(subset=["t1"])
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
def get_bins(events: pd.DataFrame, close: pd.Series) -> pd.DataFrame:
|
| 144 |
+
"""AFML Snippet 3.5 (BonusPDF p.30). Convert event outcomes to {-1, 0, +1}.
|
| 145 |
+
|
| 146 |
+
Full triple-barrier semantics: the label depends on which barrier was hit
|
| 147 |
+
*first*:
|
| 148 |
+
|
| 149 |
+
- ``barrier_hit == "pt"`` → ``+1`` (profit-taking, scaled by ``side``)
|
| 150 |
+
- ``barrier_hit == "sl"`` → ``-1`` (stop-loss, scaled by ``side``)
|
| 151 |
+
- ``barrier_hit == "vertical"`` → ``0`` (no signal; the time limit ran out
|
| 152 |
+
before either horizontal barrier was hit)
|
| 153 |
+
|
| 154 |
+
If meta-labeling (``side`` column present), maps to ``{0, 1}`` for
|
| 155 |
+
"don't act" vs "act on this side".
|
| 156 |
+
"""
|
| 157 |
+
events_ = events.dropna(subset=["t1"]).copy()
|
| 158 |
+
px_idx = events_.index.union(events_["t1"].values).unique()
|
| 159 |
+
px = close.reindex(px_idx, method="bfill")
|
| 160 |
+
|
| 161 |
+
out = pd.DataFrame(index=events_.index)
|
| 162 |
+
out["ret"] = px.loc[events_["t1"].values].values / px.loc[events_.index].values - 1
|
| 163 |
+
if "side" in events_.columns:
|
| 164 |
+
out["ret"] *= events_["side"].values
|
| 165 |
+
|
| 166 |
+
if "barrier_hit" in events_.columns:
|
| 167 |
+
# Full triple-barrier: 0 when the vertical barrier (time limit) wins.
|
| 168 |
+
out["bin"] = 0
|
| 169 |
+
out.loc[events_["barrier_hit"] == "pt", "bin"] = 1
|
| 170 |
+
out.loc[events_["barrier_hit"] == "sl", "bin"] = -1
|
| 171 |
+
if "side" in events_.columns:
|
| 172 |
+
# meta-labeling: collapse to {0, 1} = "don't act / act"
|
| 173 |
+
out.loc[out["ret"] <= 0, "bin"] = 0
|
| 174 |
+
out.loc[out["bin"] != 0, "bin"] = 1
|
| 175 |
+
else:
|
| 176 |
+
# Fallback to AFML Snippet 3.5 default (sign of return)
|
| 177 |
+
out["bin"] = np.sign(out["ret"]).astype(int)
|
| 178 |
+
out["bin"] = out["bin"].astype(int)
|
| 179 |
+
return out
|
| 180 |
+
|
| 181 |
+
|
| 182 |
+
def drop_labels(events: pd.DataFrame, min_pct: float = 0.05) -> pd.DataFrame:
|
| 183 |
+
"""AFML Snippet 3.8 (BonusPDF p.34). Drop labels with < ``min_pct`` support.
|
| 184 |
+
|
| 185 |
+
Repeats until every remaining label has at least ``min_pct`` of observations
|
| 186 |
+
or fewer than 3 classes remain.
|
| 187 |
+
"""
|
| 188 |
+
while True:
|
| 189 |
+
counts = events["bin"].value_counts(normalize=True)
|
| 190 |
+
if counts.min() > min_pct or len(counts) < 3:
|
| 191 |
+
break
|
| 192 |
+
smallest = counts.idxmin()
|
| 193 |
+
events = events[events["bin"] != smallest]
|
| 194 |
+
print(f"Dropped label {smallest}: {100 * counts.min():.2f}% of observations")
|
| 195 |
+
return events
|
| 196 |
+
|
| 197 |
+
|
| 198 |
+
def cusum_filter(series: pd.Series, threshold: float) -> pd.DatetimeIndex:
|
| 199 |
+
"""Symmetric CUSUM filter — AFML §2.5.2 (general technique).
|
| 200 |
+
|
| 201 |
+
Generates event start times where the cumulative sum of returns (in either
|
| 202 |
+
direction) exceeds ``threshold``. Resets after each event. Returns a
|
| 203 |
+
DatetimeIndex of event-trigger timestamps.
|
| 204 |
+
|
| 205 |
+
Avoids the "predict on every bar" inefficiency by only labeling at
|
| 206 |
+
statistically interesting moments.
|
| 207 |
+
"""
|
| 208 |
+
t_events, s_pos, s_neg = [], 0.0, 0.0
|
| 209 |
+
diff = series.diff().fillna(0)
|
| 210 |
+
for t, d in diff.items():
|
| 211 |
+
s_pos = max(0.0, s_pos + d)
|
| 212 |
+
s_neg = min(0.0, s_neg + d)
|
| 213 |
+
if s_neg < -threshold:
|
| 214 |
+
s_neg = 0.0
|
| 215 |
+
t_events.append(t)
|
| 216 |
+
elif s_pos > threshold:
|
| 217 |
+
s_pos = 0.0
|
| 218 |
+
t_events.append(t)
|
| 219 |
+
return pd.DatetimeIndex(t_events)
|
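To make the first-touch rule concrete, here is a small self-contained sketch for a single event on a toy price path. It is an illustration under simplifying assumptions, not the module's implementation: the barriers are passed directly as return thresholds (the module scales them by a per-event volatility estimate), the horizon is counted in bars rather than calendar days, and the name `triple_barrier_label` and the prices are hypothetical.

```python
def triple_barrier_label(prices, start, horizon, upper, lower):
    """Label one event by whichever barrier is touched first.

    +1: return rises above `upper` first
    -1: return falls below `-lower` first
     0: neither happens within `horizon` bars (vertical barrier wins)
    """
    p0 = prices[start]
    for p in prices[start + 1 : start + 1 + horizon]:
        ret = p / p0 - 1
        if ret > upper:
            return 1
        if ret < -lower:
            return -1
    return 0


prices = [100.0, 101.0, 103.0, 99.0, 97.0, 100.0]
label_up = triple_barrier_label(prices, start=0, horizon=3, upper=0.02, lower=0.02)
label_down = triple_barrier_label(prices, start=2, horizon=3, upper=0.02, lower=0.02)
label_timeout = triple_barrier_label(prices, start=0, horizon=1, upper=0.02, lower=0.02)
```

The strict inequalities mirror the comparisons in `apply_pt_sl_on_t1`; a return that lands exactly on a barrier does not count as a touch.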
src/models/__init__.py
ADDED
|
File without changes
|
src/models/arima_model.py
ADDED
|
@@ -0,0 +1,64 @@
|
| 1 |
+
"""ARIMA wrapper for triple-barrier classification.
|
| 2 |
+
|
| 3 |
+
ARIMA forecasts a continuous next-step return; we threshold it into ``{-1, 0, +1}``
|
| 4 |
+
using ``±k·σ`` where ``σ`` is the daily-vol estimate at the event time. The
|
| 5 |
+
``k`` factor matches the profit-taking / stop-loss multiplier used for labeling
|
| 6 |
+
so that the discretization is consistent with the label scheme.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import warnings
|
| 12 |
+
|
| 13 |
+
import numpy as np
|
| 14 |
+
import pandas as pd
|
| 15 |
+
from statsmodels.tsa.arima.model import ARIMA
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class ARIMAClassifier:
|
| 19 |
+
"""Wraps statsmodels ARIMA so it can sit in the same fit/predict loop as XGB/LSTM.
|
| 20 |
+
|
| 21 |
+
The model is fit on the log-price series implied by the training rows (the
|
| 22 |
+
feature matrix carries the volatility estimate per row, used to threshold).
|
| 23 |
+
|
| 24 |
+
Required X columns: ``frac_diff_close`` (used as a proxy for the underlying
|
| 25 |
+
log-price level we want to forecast) and ``target_vol`` (per-event vol used
|
| 26 |
+
to set the ±k·σ threshold).
|
| 27 |
+
"""
|
| 28 |
+
|
| 29 |
+
def __init__(self, order: tuple[int, int, int] = (1, 1, 1), threshold_k: float = 0.5):
|
| 30 |
+
self.order = order
|
| 31 |
+
self.threshold_k = threshold_k
|
| 32 |
+
self.fitted_ = None
|
| 33 |
+
self.train_tail_value_: float = 0.0
|
| 34 |
+
self.classes_: np.ndarray = np.array([-1, 0, 1])
|
| 35 |
+
|
| 36 |
+
def fit(self, X, y, sample_weight=None):
|
| 37 |
+
series = X["frac_diff_close"].astype(float).to_numpy()
|
| 38 |
+
with warnings.catch_warnings():
|
| 39 |
+
warnings.simplefilter("ignore")
|
| 40 |
+
self.fitted_ = ARIMA(series, order=self.order).fit()
|
| 41 |
+
self.train_tail_value_ = float(series[-1])
|
| 42 |
+
return self
|
| 43 |
+
|
| 44 |
+
def predict(self, X):
|
| 45 |
+
n = len(X)
|
| 46 |
+
forecast = self.fitted_.forecast(steps=n)
|
| 47 |
+
# convert forecast levels to per-step changes, anchored at the last training value
|
| 48 |
+
last = self.train_tail_value_
|
| 49 |
+
per_step_return = np.diff(np.concatenate([[last], np.asarray(forecast)]))
|
| 50 |
+
|
| 51 |
+
thresholds = self.threshold_k * X["target_vol"].astype(float).to_numpy()
|
| 52 |
+
preds = np.zeros(n, dtype=int)
|
| 53 |
+
preds[per_step_return > thresholds] = 1
|
| 54 |
+
preds[per_step_return < -thresholds] = -1
|
| 55 |
+
return preds
|
| 56 |
+
|
| 57 |
+
def predict_proba(self, X):
|
| 58 |
+
# ARIMA isn't probabilistic in the triple-barrier sense; collapse hard
|
| 59 |
+
# predictions into a one-hot for log-loss calculation.
|
| 60 |
+
preds = self.predict(X)
|
| 61 |
+
proba = np.zeros((len(preds), 3))
|
| 62 |
+
for i, c in enumerate(self.classes_):
|
| 63 |
+
proba[preds == c, i] = 1.0
|
| 64 |
+
return proba
|
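The discretization step in `predict` can be looked at in isolation. The sketch below assumes the forecast is a sequence of levels and that `vol` holds one volatility estimate per step; the function name `threshold_forecast` and the numbers are hypothetical and only illustrate the `±k·σ` rule.

```python
import numpy as np


def threshold_forecast(last_value, forecast_levels, vol, k=0.5):
    """Map forecast levels to {-1, 0, +1} using per-step changes and a ±k·σ band."""
    # Per-step change: first step is measured against the last training value.
    steps = np.diff(np.concatenate([[last_value], forecast_levels]))
    labels = np.zeros(len(steps), dtype=int)
    labels[steps > k * vol] = 1
    labels[steps < -k * vol] = -1
    return labels


labels = threshold_forecast(
    last_value=0.0,
    forecast_levels=np.array([0.02, 0.021, 0.0]),
    vol=np.array([0.02, 0.02, 0.02]),
)
```

With `k = 0.5` the band is `±0.01`, so the first step counts as up, the second stays inside the band, and the third counts as down.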
src/models/baselines.py
ADDED
|
@@ -0,0 +1,89 @@
|
| 1 |
+
"""Naïve and SES baselines for triple-barrier classification.
|
| 2 |
+
|
| 3 |
+
The original notebook found SES beat the LSTM under rolling evaluation, which
|
| 4 |
+
was the most interesting result. We keep both baselines under the new label
|
| 5 |
+
scheme to see whether that finding survives a fair (purged-CV) comparison.
|
| 6 |
+
|
| 7 |
+
Both classes follow a uniform fit/predict_proba/predict interface so the
|
| 8 |
+
training driver can iterate over models polymorphically.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import numpy as np
|
| 14 |
+
import pandas as pd
|
| 15 |
+
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class MajorityClassClassifier:
|
| 19 |
+
"""Predicts the most common class from the training set every time.
|
| 20 |
+
|
| 21 |
+
The honest "do nothing" baseline. A useful sanity check: any model that
|
| 22 |
+
fails to beat this on accuracy isn't doing anything.
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
def __init__(self):
|
| 26 |
+
self.majority_class_: int | None = None
|
| 27 |
+
self.classes_: np.ndarray | None = None
|
| 28 |
+
|
| 29 |
+
def fit(self, X, y, sample_weight=None):
|
| 30 |
+
y = np.asarray(y)
|
| 31 |
+
self.classes_ = np.unique(y)
|
| 32 |
+
counts = np.array([(y == c).sum() for c in self.classes_])  # stays aligned with classes_ even if a label (e.g. 0) is absent
|
| 33 |
+
self.majority_class_ = int(self.classes_[np.argmax(counts)])
|
| 34 |
+
return self
|
| 35 |
+
|
| 36 |
+
def predict(self, X):
|
| 37 |
+
return np.full(len(X), self.majority_class_, dtype=int)
|
| 38 |
+
|
| 39 |
+
def predict_proba(self, X):
|
| 40 |
+
n = len(X)
|
| 41 |
+
proba = np.zeros((n, len(self.classes_)))
|
| 42 |
+
idx = int(np.where(self.classes_ == self.majority_class_)[0][0])
|
| 43 |
+
proba[:, idx] = 1.0
|
| 44 |
+
return proba
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
class SESClassifier:
|
| 48 |
+
"""Simple exponential smoothing applied to the *label series*, then sign-mapped.
|
| 49 |
+
|
| 50 |
+
Approach: fit ``SimpleExpSmoothing`` on the train labels (treated as a
|
| 51 |
+
continuous signal in ``{-1, 0, +1}``), forecast next-step level, and round
|
| 52 |
+
back to the nearest class. Not a real classifier — a sanity check that the
|
| 53 |
+
label sequence has any short-horizon autocorrelation at all.
|
| 54 |
+
"""
|
| 55 |
+
|
| 56 |
+
def __init__(self, smoothing_level: float | None = None):
|
| 57 |
+
self.smoothing_level = smoothing_level
|
| 58 |
+
self.model_ = None
|
| 59 |
+
self.last_forecast_: float = 0.0
|
| 60 |
+
self.classes_: np.ndarray | None = None
|
| 61 |
+
|
| 62 |
+
def fit(self, X, y, sample_weight=None):
|
| 63 |
+
y = np.asarray(y, dtype=float)
|
| 64 |
+
self.classes_ = np.array(sorted(np.unique(y.astype(int))))
|
| 65 |
+
self.model_ = SimpleExpSmoothing(y, initialization_method="estimated").fit(
|
| 66 |
+
smoothing_level=self.smoothing_level, optimized=self.smoothing_level is None
|
| 67 |
+
)
|
| 68 |
+
fc = self.model_.forecast(1)
|
| 69 |
+
self.last_forecast_ = float(fc[0] if hasattr(fc, "__getitem__") else fc)
|
| 70 |
+
return self
|
| 71 |
+
|
| 72 |
+
def predict(self, X):
|
| 73 |
+
# SES gives a single forecast; broadcast it across the test window.
|
| 74 |
+
# The "label series has very weak structure" finding is intentional —
|
| 75 |
+
# this is meant to be a sanity baseline.
|
| 76 |
+
n = len(X)
|
| 77 |
+
forecast = self.last_forecast_
|
| 78 |
+
return np.full(n, self._nearest_class(forecast), dtype=int)
|
| 79 |
+
|
| 80 |
+
def predict_proba(self, X):
|
| 81 |
+
n = len(X)
|
| 82 |
+
pred_class = self._nearest_class(self.last_forecast_)
|
| 83 |
+
proba = np.zeros((n, len(self.classes_)))
|
| 84 |
+
idx = int(np.where(self.classes_ == pred_class)[0][0])
|
| 85 |
+
proba[:, idx] = 1.0
|
| 86 |
+
return proba
|
| 87 |
+
|
| 88 |
+
def _nearest_class(self, value: float) -> int:
|
| 89 |
+
return int(self.classes_[np.argmin(np.abs(self.classes_ - value))])
|
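Both baselines reduce to two small operations: picking the most frequent class and rounding a continuous forecast to the nearest class. A minimal sketch of each follows, assuming a fold in which the `0` label happens to be absent; the variable names and toy labels are hypothetical. `np.unique(..., return_counts=True)` returns counts in the same order as the class array, so the argmax indexes it safely even when a label is missing.

```python
import numpy as np

# A fold where the 0 class never occurs.
y_train = np.array([-1, 1, 1, -1, 1])
classes, counts = np.unique(y_train, return_counts=True)
majority = int(classes[np.argmax(counts)])


def nearest_class(value: float, classes: np.ndarray) -> int:
    """Round a continuous forecast to the closest available class label."""
    return int(classes[np.argmin(np.abs(classes - value))])


all_classes = np.array([-1, 0, 1])
rounded_low = nearest_class(0.3, all_classes)
rounded_high = nearest_class(0.6, all_classes)
```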
src/models/lstm_model.py
ADDED
|
@@ -0,0 +1,179 @@
|
| 1 |
+
"""Refined LSTM for triple-barrier classification.
|
| 2 |
+
|
| 3 |
+
Architectural choices vs the original notebook
|
| 4 |
+
-----------------------------------------------
|
| 5 |
+
The original model (128→64 units, MSE, no clipnorm) collapsed to predicting the
|
| 6 |
+
mean. The refinements here are each anchored to a specific reference:
|
| 7 |
+
|
| 8 |
+
- **Smaller**: 32→16 units, ~10× fewer params. Jansen's univariate-LSTM notebook
|
| 9 |
+
uses 10 units on S&P daily. Karpathy warns that over-parameterized RNNs
|
| 10 |
+
*"do not always show convincing signs of generalizing in the correct way."*
|
| 11 |
+
- **Gradient clipping** (``clipnorm=1.0``): Goodfellow §10.11.1, eq 10.48-49
|
| 12 |
+
(PDF p.414). Without it, the 60-step BPTT chain has the "cliff" landscape
|
| 13 |
+
shown in figure 10.17 and SGD updates can be catastrophically large.
|
| 14 |
+
- **Recurrent dropout** (``recurrent_dropout=0.1``): Goodfellow §10.11.2
|
| 15 |
+
(PDF p.415). Drops the time-axis connections, which is where the
|
| 16 |
+
generalization problem lives — sequence-level Dropout drops feature
|
| 17 |
+
dimensions and misses this.
|
| 18 |
+
- **Softmax over 3 classes** with ``categorical_crossentropy``: aligns the
|
| 19 |
+
loss with the directional-accuracy metric, fixing the original's MSE-vs-
|
| 20 |
+
direction mismatch.
|
| 21 |
+
- **Forget-gate bias = 1**: Keras default (``unit_forget_bias=True``), kept
|
| 22 |
+
explicit so a reader sees Goodfellow §10.10.2 (PDF p.412) is honored.
|
| 23 |
+
|
| 24 |
+
Sample weighting
|
| 25 |
+
----------------
|
| 26 |
+
AFML Pitfall #7 (Table 1.2): non-IID samples need uniqueness weighting. The
|
| 27 |
+
training driver passes a ``sample_weight`` array if available, and Keras
|
| 28 |
+
applies it natively to the ``categorical_crossentropy`` loss.
|
| 29 |
+
"""
|
| 30 |
+
|
| 31 |
+
from __future__ import annotations
|
| 32 |
+
|
| 33 |
+
import numpy as np
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
def build_lstm(
|
| 37 |
+
sequence_length: int,
|
| 38 |
+
n_features: int,
|
| 39 |
+
n_classes: int = 3,
|
| 40 |
+
lstm_units: tuple[int, int] = (32, 16),
|
| 41 |
+
dropout: float = 0.2,
|
| 42 |
+
recurrent_dropout: float = 0.1,
|
| 43 |
+
learning_rate: float = 1e-3,
|
| 44 |
+
clipnorm: float = 1.0,
|
| 45 |
+
):
|
| 46 |
+
"""Build the refined LSTM. Import inside the function so tensorflow doesn't load at module import."""
|
| 47 |
+
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input
|
| 48 |
+
from tensorflow.keras.models import Sequential
|
| 49 |
+
from tensorflow.keras.optimizers import Adam
|
| 50 |
+
|
| 51 |
+
model = Sequential(
|
| 52 |
+
[
|
| 53 |
+
Input(shape=(sequence_length, n_features)),
|
| 54 |
+
LSTM(
|
| 55 |
+
lstm_units[0],
|
| 56 |
+
return_sequences=True,
|
| 57 |
+
recurrent_dropout=recurrent_dropout,
|
| 58 |
+
unit_forget_bias=True, # Goodfellow §10.10.2 (PDF p.412)
|
| 59 |
+
),
|
| 60 |
+
Dropout(dropout),
|
| 61 |
+
LSTM(
|
| 62 |
+
lstm_units[1],
|
| 63 |
+
return_sequences=False,
|
| 64 |
+
recurrent_dropout=recurrent_dropout,
|
| 65 |
+
unit_forget_bias=True,
|
| 66 |
+
),
|
| 67 |
+
Dropout(dropout),
|
| 68 |
+
Dense(n_classes, activation="softmax"),
|
| 69 |
+
]
|
| 70 |
+
)
|
| 71 |
+
model.compile(
|
| 72 |
+
optimizer=Adam(learning_rate=learning_rate, clipnorm=clipnorm),
|
| 73 |
+
loss="categorical_crossentropy",
|
| 74 |
+
metrics=["accuracy"],
|
| 75 |
+
)
|
| 76 |
+
return model
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def build_sequences(
|
| 80 |
+
X: np.ndarray, y: np.ndarray, sequence_length: int
|
| 81 |
+
) -> tuple[np.ndarray, np.ndarray]:
|
| 82 |
+
"""Convert ``(n_obs, n_features)`` into ``(n_seq, sequence_length, n_features)``.
|
| 83 |
+
|
| 84 |
+
The target at sequence index ``i`` is ``y[i + sequence_length - 1]`` — the
|
| 85 |
+
model predicts the label at the END of each window, not the next step
|
| 86 |
+
beyond it (the next-step view is handled at the event level by the
|
| 87 |
+
triple-barrier ``t1``).
|
| 88 |
+
"""
|
| 89 |
+
n = len(X) - sequence_length + 1
|
| 90 |
+
if n <= 0:
|
| 91 |
+
return np.empty((0, sequence_length, X.shape[1])), np.empty((0,))
|
| 92 |
+
X_seq = np.stack([X[i : i + sequence_length] for i in range(n)])
|
| 93 |
+
y_seq = y[sequence_length - 1 :]
|
| 94 |
+
return X_seq, y_seq
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
class LSTMTripleBarrier:
|
| 98 |
+
"""Wraps the refined LSTM with the same fit/predict interface as other models.
|
| 99 |
+
|
| 100 |
+
Owns label encoding ``{-1, 0, +1} -> {0, 1, 2}`` and sequence construction
|
| 101 |
+
so the training driver doesn't have to special-case it.
|
| 102 |
+
"""
|
| 103 |
+
|
| 104 |
+
def __init__(
|
| 105 |
+
self,
|
| 106 |
+
sequence_length: int = 60,
|
| 107 |
+
n_features: int = 8,
|
| 108 |
+
epochs: int = 50,
|
| 109 |
+
batch_size: int = 64,
|
| 110 |
+
patience: int = 15,
|
| 111 |
+
verbose: int = 0,
|
| 112 |
+
random_state: int = 42,
|
| 113 |
+
):
|
| 114 |
+
self.sequence_length = sequence_length
|
| 115 |
+
self.n_features = n_features
|
| 116 |
+
self.epochs = epochs
|
| 117 |
+
self.batch_size = batch_size
|
| 118 |
+
self.patience = patience
|
| 119 |
+
self.verbose = verbose
|
| 120 |
+
self.random_state = random_state
|
| 121 |
+
self.model = None
|
| 122 |
+
self.classes_ = np.array([-1, 0, 1])
|
| 123 |
+
self.history_ = None
|
| 124 |
+
|
| 125 |
+
def fit(self, X, y, sample_weight=None):
|
| 126 |
+
import tensorflow as tf
|
| 127 |
+
from tensorflow.keras.callbacks import EarlyStopping
|
| 128 |
+
from tensorflow.keras.utils import to_categorical
|
| 129 |
+
|
| 130 |
+
tf.random.set_seed(self.random_state)
|
| 131 |
+
np.random.seed(self.random_state)
|
| 132 |
+
|
| 133 |
+
X_arr = np.asarray(X)
|
| 134 |
+
y_enc = np.asarray(y).astype(int) + 1
|
| 135 |
+
X_seq, y_seq = build_sequences(X_arr, y_enc, self.sequence_length)
|
| 136 |
+
if len(X_seq) == 0:
|
| 137 |
+
raise ValueError(f"Not enough rows ({len(X_arr)}) for sequence_length={self.sequence_length}")
|
| 138 |
+
y_onehot = to_categorical(y_seq, num_classes=3)
|
| 139 |
+
|
| 140 |
+
sw_seq = None
|
| 141 |
+
if sample_weight is not None:
|
| 142 |
+
sw_arr = np.asarray(sample_weight)
|
| 143 |
+
sw_seq = sw_arr[self.sequence_length - 1 :]
|
| 144 |
+
|
| 145 |
+
self.model = build_lstm(
|
| 146 |
+
sequence_length=self.sequence_length,
|
| 147 |
+
n_features=X_arr.shape[1],
|
| 148 |
+
)
|
| 149 |
+
callbacks = [
|
| 150 |
+
EarlyStopping(monitor="loss", patience=self.patience, restore_best_weights=True)
|
| 151 |
+
]
|
| 152 |
+
self.history_ = self.model.fit(
|
| 153 |
+
X_seq,
|
| 154 |
+
y_onehot,
|
| 155 |
+
sample_weight=sw_seq,
|
| 156 |
+
epochs=self.epochs,
|
| 157 |
+
batch_size=self.batch_size,
|
| 158 |
+
verbose=self.verbose,
|
| 159 |
+
callbacks=callbacks,
|
| 160 |
+
shuffle=False,
|
| 161 |
+
)
|
| 162 |
+
return self
|
| 163 |
+
|
| 164 |
+
def predict_proba(self, X):
|
| 165 |
+
X_arr = np.asarray(X)
|
| 166 |
+
# Always pad the start with `sequence_length - 1` copies of the first row
|
| 167 |
+
# so the output has exactly one prediction per input row. (Without this
|
| 168 |
+
# pad we'd lose the first `sequence_length - 1` rows (59 by default) of every test fold.)
|
| 169 |
+
n_pad = self.sequence_length - 1
|
| 170 |
+
pad = np.tile(X_arr[:1], (n_pad, 1))
|
| 171 |
+
X_padded = np.vstack([pad, X_arr])
|
| 172 |
+
X_seq, _ = build_sequences(
|
| 173 |
+
X_padded, np.zeros(len(X_padded)), self.sequence_length
|
| 174 |
+
)
|
| 175 |
+
return self.model.predict(X_seq, verbose=0)
|
| 176 |
+
|
| 177 |
+
def predict(self, X):
|
| 178 |
+
proba = self.predict_proba(X)
|
| 179 |
+
return self.classes_[np.argmax(proba, axis=1)]
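The windowing and front-padding used by this file can be checked on a toy array. The following is a minimal sketch, not part of the commit: `build_sequences` is repeated here only so the snippet runs on its own, and `pad_front` is a hypothetical helper name for the padding that `predict_proba` does inline.

```python
import numpy as np


def build_sequences(X, y, sequence_length):
    """Stack overlapping windows; the target of window i is y[i + sequence_length - 1]."""
    n = len(X) - sequence_length + 1
    if n <= 0:
        return np.empty((0, sequence_length, X.shape[1])), np.empty((0,))
    X_seq = np.stack([X[i : i + sequence_length] for i in range(n)])
    return X_seq, y[sequence_length - 1 :]


def pad_front(X, sequence_length):
    """Hypothetical helper: repeat the first row so windowing yields one sequence per input row."""
    pad = np.tile(X[:1], (sequence_length - 1, 1))
    return np.vstack([pad, X])


X = np.arange(12, dtype=float).reshape(6, 2)  # 6 observations, 2 features
y = np.array([-1, 0, 1, 1, 0, -1])

# Training view: 6 rows with a window of 3 give 4 sequences, labelled by the last row of each window.
X_seq, y_seq = build_sequences(X, y, sequence_length=3)

# Prediction view: padding 2 copies of the first row restores one sequence per input row.
X_pad = pad_front(X, sequence_length=3)
X_seq_pad, _ = build_sequences(X_pad, np.zeros(len(X_pad)), sequence_length=3)

print(X_seq.shape, y_seq, X_seq_pad.shape)
```

One consequence of this padding is that the earliest predictions in each test fold are made from windows that consist mostly of repeated first-row values, so they carry less real history than later ones.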
|
src/models/xgb_model.py
ADDED
|
@@ -0,0 +1,59 @@
|
| 1 |
+
"""XGBoost classifier for triple-barrier labels.
|
| 2 |
+
|
| 3 |
+
Per Jansen Ch.12, gradient-boosted trees are the natural baseline for tabular
|
| 4 |
+
financial features and routinely beat LSTMs on these problems. The hyper-
|
| 5 |
+
parameters here are conservative (shallow trees, moderate n_estimators) to
|
| 6 |
+
avoid overfitting on small per-fold training sets in the purged CV scheme.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import numpy as np
|
| 12 |
+
from xgboost import XGBClassifier
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def build_xgb_classifier(random_state: int = 42) -> XGBClassifier:
|
| 16 |
+
"""Returns a fresh XGBClassifier for one CV fold.
|
| 17 |
+
|
| 18 |
+
Output classes use the XGBoost-internal indexing ``{0, 1, 2}`` for
|
| 19 |
+
``{-1, 0, +1}`` since XGBoost requires non-negative integer labels. The
|
| 20 |
+
training driver wraps this with an encoder.
|
| 21 |
+
"""
|
| 22 |
+
return XGBClassifier(
|
| 23 |
+
objective="multi:softprob",
|
| 24 |
+
num_class=3,
|
| 25 |
+
max_depth=4,
|
| 26 |
+
n_estimators=300,
|
| 27 |
+
learning_rate=0.05,
|
| 28 |
+
subsample=0.8,
|
| 29 |
+
colsample_bytree=0.8,
|
| 30 |
+
reg_lambda=1.0,
|
| 31 |
+
eval_metric="mlogloss",
|
| 32 |
+
random_state=random_state,
|
| 33 |
+
n_jobs=-1,
|
| 34 |
+
tree_method="hist",
|
| 35 |
+
)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class XGBTripleBarrier:
|
| 39 |
+
"""Thin wrapper that owns the label encoding from ``{-1, 0, 1}`` ↔ ``{0, 1, 2}``."""
|
| 40 |
+
|
| 41 |
+
def __init__(self, random_state: int = 42):
|
| 42 |
+
self.model = build_xgb_classifier(random_state=random_state)
|
| 43 |
+
self.classes_ = np.array([-1, 0, 1])
|
| 44 |
+
|
| 45 |
+
def fit(self, X, y, sample_weight=None):
|
| 46 |
+
y_enc = np.asarray(y).astype(int) + 1 # {-1, 0, 1} -> {0, 1, 2}
|
| 47 |
+
self.model.fit(X, y_enc, sample_weight=sample_weight)
|
| 48 |
+
return self
|
| 49 |
+
|
| 50 |
+
def predict(self, X):
|
| 51 |
+
y_pred_enc = self.model.predict(X)
|
| 52 |
+
return y_pred_enc - 1
|
| 53 |
+
|
| 54 |
+
def predict_proba(self, X):
|
| 55 |
+
return self.model.predict_proba(X)
|
| 56 |
+
|
| 57 |
+
@property
|
| 58 |
+
def feature_importances_(self):
|
| 59 |
+
return self.model.feature_importances_
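The label round trip that `XGBTripleBarrier` owns can be illustrated without training a model. This is a sketch under stated assumptions: the probability matrix below is made up for illustration, and its columns are assumed to follow the encoded order `{0, 1, 2}`, i.e. `{-1, 0, +1}`.

```python
import numpy as np

classes = np.array([-1, 0, 1])  # triple-barrier labels, in encoded-column order

y = np.array([1, -1, 0, 0, 1])
y_enc = y.astype(int) + 1       # {-1, 0, 1} -> {0, 1, 2}, as in fit()

# Hypothetical class probabilities: one row per sample, one column per encoded class.
proba = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])

# Decoding: take the most likely encoded class, then map it back to {-1, 0, +1}.
y_pred = classes[np.argmax(proba, axis=1)]

print(y_enc, y_pred)
```

Indexing `classes` with the argmax is equivalent to subtracting 1 from the encoded prediction, which is what `predict` does above.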
|
src/train.py
ADDED
|
@@ -0,0 +1,102 @@
|
| 1 |
+
"""CV-aware training driver — one harness for all five models.
|
| 2 |
+
|
| 3 |
+
The driver expects each model to expose ``fit(X, y, sample_weight=None)``,
|
| 4 |
+
``predict(X)``, and (optionally) ``predict_proba(X)``. The triple-barrier label
|
| 5 |
+
``{-1, 0, +1}`` is shared across all of them.
|
| 6 |
+
|
| 7 |
+
Sample weights come from AFML Ch.4 — observations whose label intervals overlap
|
| 8 |
+
contribute less unique information, so they should count less in the loss. The
|
| 9 |
+
simplest implementation is to weight inversely by the number of overlapping
|
| 10 |
+
labels (Snippet 4.1); for now the driver supports passing pre-computed weights
|
| 11 |
+
or falling back to uniform.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
from collections.abc import Callable
|
| 17 |
+
from typing import Any
|
| 18 |
+
|
| 19 |
+
import numpy as np
|
| 20 |
+
import pandas as pd
|
| 21 |
+
from sklearn.preprocessing import StandardScaler
|
| 22 |
+
|
| 23 |
+
from .cv import PurgedKFold
|
| 24 |
+
from .eval import fold_metrics
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def fit_predict_one_fold(
|
| 28 |
+
model_builder: Callable[[], Any],
|
| 29 |
+
X_train: pd.DataFrame,
|
| 30 |
+
y_train: pd.Series,
|
| 31 |
+
X_test: pd.DataFrame,
|
| 32 |
+
sample_weight_train: np.ndarray | None = None,
|
| 33 |
+
standardize: bool = True,
|
| 34 |
+
) -> tuple[np.ndarray, Any]:
|
| 35 |
+
"""Fit on the train fold, predict on the test fold. Returns (y_pred, fitted_model)."""
|
| 36 |
+
if standardize:
|
| 37 |
+
scaler = StandardScaler().fit(X_train.values)
|
| 38 |
+
X_train_s = pd.DataFrame(
|
| 39 |
+
scaler.transform(X_train.values), index=X_train.index, columns=X_train.columns
|
| 40 |
+
)
|
| 41 |
+
X_test_s = pd.DataFrame(
|
| 42 |
+
scaler.transform(X_test.values), index=X_test.index, columns=X_test.columns
|
| 43 |
+
)
|
| 44 |
+
else:
|
| 45 |
+
X_train_s, X_test_s = X_train, X_test
|
| 46 |
+
model = model_builder()
|
| 47 |
+
model.fit(X_train_s, y_train.values, sample_weight=sample_weight_train)
|
| 48 |
+
return model.predict(X_test_s), model
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def run_cv(
|
| 52 |
+
model_name: str,
|
| 53 |
+
model_builder: Callable[[], Any],
|
| 54 |
+
X: pd.DataFrame,
|
| 55 |
+
y: pd.Series,
|
| 56 |
+
cv: PurgedKFold,
|
| 57 |
+
sample_weight: pd.Series | None = None,
|
| 58 |
+
standardize: bool = True,
|
| 59 |
+
extra_columns: dict | None = None,
|
| 60 |
+
) -> pd.DataFrame:
|
| 61 |
+
"""Run a model across all CV folds. Returns one row per fold."""
|
| 62 |
+
rows = []
|
| 63 |
+
for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X)):
|
| 64 |
+
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
|
| 65 |
+
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
|
| 66 |
+
sw_train = sample_weight.iloc[train_idx].values if sample_weight is not None else None
|
| 67 |
+
|
| 68 |
+
y_pred, _ = fit_predict_one_fold(
|
| 69 |
+
model_builder=model_builder,
|
| 70 |
+
X_train=X_train,
|
| 71 |
+
y_train=y_train,
|
| 72 |
+
X_test=X_test,
|
| 73 |
+
sample_weight_train=sw_train,
|
| 74 |
+
standardize=standardize,
|
| 75 |
+
)
|
| 76 |
+
|
| 77 |
+
metrics = fold_metrics(y_test.values, y_pred)
|
| 78 |
+
row = {"model": model_name, "fold": fold_idx, **metrics}
|
| 79 |
+
if extra_columns:
|
| 80 |
+
row.update(extra_columns)
|
| 81 |
+
rows.append(row)
|
| 82 |
+
return pd.DataFrame(rows)
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def uniqueness_weights(t1: pd.Series) -> pd.Series:
|
| 86 |
+
"""Approximate AFML Ch.4 sample-uniqueness weights.
|
| 87 |
+
|
| 88 |
+
For each event, count how many events (itself included) have overlapping
|
| 89 |
+
``[start, t1]`` intervals, and weight inversely. Not the rigorous Snippet
|
| 90 |
+
4.1 (which counts overlap proportionally), but the right order of magnitude
|
| 91 |
+
and much faster.
|
| 92 |
+
"""
|
| 93 |
+
weights = pd.Series(1.0, index=t1.index)
|
| 94 |
+
t1_arr = t1.values
|
| 95 |
+
start_arr = t1.index.values
|
| 96 |
+
n = len(t1)
|
| 97 |
+
for i in range(n):
|
| 98 |
+
overlap = np.sum((start_arr <= t1_arr[i]) & (t1_arr >= start_arr[i]))
|
| 99 |
+
weights.iloc[i] = 1.0 / max(overlap, 1)
|
| 100 |
+
# normalize so the weights sum to n (mean weight = 1)
|
| 101 |
+
weights *= n / weights.sum()
|
| 102 |
+
return weights
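A small worked example makes the overlap count concrete. The sketch below is a vectorized variant of the loop above, not the code from the commit; the name `uniqueness_weights_vectorized` and the four toy events are assumptions made for illustration. It builds an n-by-n boolean matrix, so it trades memory for speed.

```python
import numpy as np
import pandas as pd


def uniqueness_weights_vectorized(t1: pd.Series) -> pd.Series:
    """Weight each event by 1 / (number of events whose [start, t1] interval overlaps it)."""
    start = t1.index.values
    end = t1.values
    # overlaps[i, j] is True when event j starts before event i ends and ends after event i starts.
    overlaps = (start[None, :] <= end[:, None]) & (end[None, :] >= start[:, None])
    counts = overlaps.sum(axis=1)  # each event overlaps itself, so counts >= 1
    weights = 1.0 / np.maximum(counts, 1)
    weights *= len(t1) / weights.sum()  # normalize so the mean weight is 1
    return pd.Series(weights, index=t1.index)


# Three events that overlap one another and one isolated event.
starts = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-10"])
ends = pd.to_datetime(["2024-01-03", "2024-01-04", "2024-01-05", "2024-01-12"])
t1 = pd.Series(ends, index=starts)

w = uniqueness_weights_vectorized(t1)
print(w)
```

The three overlapping events each get a raw weight of 1/3 and the isolated one gets 1; after normalization these become 2/3 and 2, so the isolated event counts three times as much in the loss.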
|
y_test_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:0b51bdd96ea2313f3a6566cd272c918161d87a704a5b4d7fdab557dca65bdac7
|
| 3 |
-
size 3640
|
y_train_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:9ddf60272a325d8ac0328830ed5a7c326a2f4553506a8d75587c8580f025b847
|
| 3 |
-
size 22216
|
y_val_lstm.npy
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:a7518e2c4f02f2496ac9f5043a9eded9d7fb0e5b37c6c40f5c66dfee6a7bfef4
|
| 3 |
-
size 1656
|