Commit 8ba081b (verified) by moccaram · 1 Parent(s): 93c1d7b

Replace v1 demo with v2 XGBoost-backed Gradio app (reference-backed rebuild)

Upgrades the Space to the v2 pipeline from github.com/moccaram/DataSynth. Real Gradio inference (not the hello-world template), XGBoost trained on triple-barrier labels + fractionally-differenced features, prominent caveat about ~36% directional accuracy when acting.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ app_screenshot.png filter=lfs diff=lfs merge=lfs -text
DataSynthis_ML_JobTask.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,47 +1,27 @@
1
  ---
2
- title: DataSynthis ML JobTask
3
- emoji: 🐢
4
- colorFrom: green
5
  colorTo: gray
6
  sdk: gradio
7
- sdk_version: 5.48.0
8
  app_file: app.py
9
  pinned: false
10
- license: apache-2.0
11
- short_description: Stock price forecasting ML demo for DataSynthis internship
12
  ---
13
 
14
- # 📈 DataSynthis ML JobTask
15
- Stock Price Forecasting with Baseline, Statistical, and ML Models
16
 
17
- ## 🚀 Project Overview
18
- This project demonstrates a complete **time-series forecasting pipeline** using daily stock price data (2010–2024). It was developed as part of the **DataSynthis ML Internship Task**.
 
19
 
20
- We cover the full workflow:
21
- 1. **Baseline Models** Naïve Forecast, Simple Exponential Smoothing (SES)
22
- 2. **Statistical Model** ARIMA
23
- 3. **ML / DL Models** Prophet, LSTM
24
- 4. **Evaluation** → Rolling-window accuracy metrics (RMSE, MAPE)
25
- 5. **Deployment** → Interactive demo with Gradio (via Hugging Face Spaces)
26
 
27
- ## 🛠️ Features
28
- - Data preprocessing & feature engineering (lags, volatility, RSI, MACD, Bollinger Bands, etc.)
29
- - Feature validation & pruning (correlation, VIF, outlier checks)
30
- - Unified comparison of models with a performance summary table
31
- - Visualizations: trends, normalized comparisons, total returns
32
- - Exportable datasets for reproducibility
33
 
34
- ## 📊 Deliverables
35
- - **Notebook**: End-to-end workflow (data → models → evaluation)
36
- - **Models**: Naïve, SES, ARIMA, Prophet, LSTM
37
- - **Visualizations**: stock trends, indicators, correlations, performance plots
38
- - **Deployment**: Hugging Face Space with Gradio app
39
-
40
- ## 📂 Repository Structure
41
- 📁 DataSynthis_ML_JobTask
42
- ├── app.py # Gradio demo app
43
- ├── data/ # Preprocessed & engineered datasets
44
- ├── notebooks/ # Jupyter notebooks with full pipeline
45
- ├── models/ # Trained ARIMA / Prophet / LSTM models
46
- ├── outputs/ # Plots, summary tables, feature files
47
- ├── README.md # This file
 
1
  ---
2
+ title: AAPL Triple-Barrier Direction Classifier
3
+ emoji: 📊
4
+ colorFrom: blue
5
  colorTo: gray
6
  sdk: gradio
7
+ sdk_version: "4.44.0"
8
  app_file: app.py
9
  pinned: false
10
+ license: mit
 
11
  ---
12
 
13
+ # AAPL Triple-Barrier Direction Classifier (educational)
 
14
 
15
+ Reference-backed financial-ML demo. XGBoost classifier trained on
16
+ fractionally-differenced features and triple-barrier labels (López de Prado,
17
+ *Advances in Financial Machine Learning*, Ch.3 + Ch.5).
18
 
19
+ **This is an educational portfolio artifact, not a trading signal.**
20
+ Test-set accuracy ~38% on a 3-class label set (random = 33%, p<0.05 in 3 of 5
21
+ purged folds). Directional accuracy *when the model picks a side* is ~36% —
22
+ worse than coin-flip. Do not trade real money on this.
 
 
23
 
24
+ ![Gradio interface](app_screenshot.png)
 
 
 
 
 
25
 
26
+ Full source, technical writeup, and lessons-learned:
27
+ [github.com/moccaram/DataSynth](https://github.com/moccaram/DataSynth).
X_test_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:28e28884d7ade2318c01ffa836f14fe66dad42ffd29bcf7c39c589bc9d2ff5b4
3
- size 2739488
 
 
 
 
X_train_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:2bd3342b5569749c14cba69cbc1aae53369ccaaaf0502fc74de1a84c7495788c
3
- size 17228768
 
 
 
 
X_val_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:88464f5ecbfcb80a36d3b7113599d3088f25bc11a38317e343f9868ed907704a
3
- size 1191968
 
 
 
 
app.py CHANGED
@@ -1,15 +1,13 @@
- import gradio as gr
-
- def greet(name):
-     return "Hello " + name + "!!"
-
- demo = gr.Interface(
-     fn=greet,
-     inputs="text",
-     outputs="text",
-     title="👋 Greeting Demo",
-     description="Enter your name to receive a warm greeting."
- )
-
- if __name__ == "__main__":
-     demo.launch()
 
+ """Hugging Face Spaces entry point. Delegates to src.app for the real interface."""
+
+ import sys
+ from pathlib import Path
+
+ # Make src/ importable when the Space launches this file from the repo root.
+ sys.path.insert(0, str(Path(__file__).resolve().parent))
+
+ from src.app import build_interface
+
+ if __name__ == "__main__":
+     demo = build_interface()
+     demo.launch()
 
 
feature_scaler.pkl → app_screenshot.png RENAMED
File without changes
arima_model.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6787effc883e371477f02eecc8f5e48e9148a6b286e48af1eeee4f072eb04d9
3
- size 5295051
 
 
 
 
arima_order.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:efc90090103f31c21431c8a3d1ae6c66ca453551649bbf4488b706172c4277a4
3
- size 20
 
 
 
 
data/raw/AAPL_stock_data_2010_2024.csv ADDED
The diff for this file is too large to render. See raw diff
 
data/raw/SPY_stock_data_2010_2024.csv ADDED
The diff for this file is too large to render. See raw diff
 
data_preparation_metadata.json DELETED
@@ -1,59 +0,0 @@
1
- {
2
- "dataset": {
3
- "total_days": 3572,
4
- "date_range": "2010-10-19 to 2024-12-27",
5
- "features": 13,
6
- "target": "target_return"
7
- },
8
- "split": {
9
- "train_days": 2821,
10
- "val_days": 251,
11
- "test_days": 499,
12
- "train_pct": 78.97536394176932,
13
- "val_pct": 7.026875699888017,
14
- "test_pct": 13.96976483762598
15
- },
16
- "features": [
17
- "hl_range",
18
- "log_return",
19
- "spy_return",
20
- "co_range",
21
- "return_lag2",
22
- "return_lag5",
23
- "volatility_20d",
24
- "volume_change",
25
- "day_cos",
26
- "day_of_week",
27
- "day_sin",
28
- "month_cos",
29
- "rolling_beta"
30
- ],
31
- "prophet_regressors": [
32
- "hl_range",
33
- "spy_return",
34
- "volatility_20d",
35
- "rolling_beta",
36
- "volume_change",
37
- "co_range",
38
- "day_cos",
39
- "day_sin"
40
- ],
41
- "lstm_sequence_length": 60,
42
- "last_prices": {
43
- "train": 178.08999633789062,
44
- "val": 128.41000366210938,
45
- "test": 257.8299865722656
46
- },
47
- "files_created": [
48
- "feature_scaler.pkl",
49
- "train_prophet.csv",
50
- "val_prophet.csv",
51
- "test_prophet.csv",
52
- "X_train_lstm.npy",
53
- "y_train_lstm.npy",
54
- "X_val_lstm.npy",
55
- "y_val_lstm.npy",
56
- "X_test_lstm.npy",
57
- "y_test_lstm.npy"
58
- ]
59
- }
lstm_model.h5 DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d2e60dea878818cb88f7cd864b68daf9be6c10c80cea8ab0537e3662c48ed041
3
- size 1535336
 
 
 
 
requirements.txt ADDED
@@ -0,0 +1,7 @@
1
+ gradio>=4.0
2
+ matplotlib>=3.8
3
+ numpy>=1.26,<3
4
+ pandas>=2.1
5
+ scikit-learn>=1.3
6
+ scipy>=1.11
7
+ xgboost>=2.0
src/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ """DataSynth — reference-backed stock forecasting pipeline.
2
+
3
+ Anchored to:
4
+ - AFML (López de Prado) Ch.3 (labeling), Ch.5 (FFD), Ch.7 (purged CV)
5
+ - Goodfellow et al. Ch.10 §10.11 (RNN optimization)
6
+ - Jansen, *Machine Learning for Algorithmic Trading* Ch.19 (RNNs for time series)
7
+ """
src/app.py ADDED
@@ -0,0 +1,190 @@
+ """Gradio demo — AAPL triple-barrier direction classifier (educational).
+
+ Loads the XGBoost model (the headline winner in this study, mean test accuracy
+ ~38% vs 33% random) and lets the user pick any date in the available range to
+ inspect the next-10-day direction prediction with class probabilities.
+
+ This is a *portfolio artifact*. The directional accuracy when the model
+ actually picks a side is ~36% — worse than random. Do not trade on this.
+ """
+
+ from __future__ import annotations
+
+ import io
+ import sys
+ import warnings
+ from pathlib import Path
+
+ warnings.filterwarnings("ignore")
+
+ import matplotlib
+ matplotlib.use("Agg")
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import pandas as pd
+
+ ROOT = Path(__file__).resolve().parent.parent
+ sys.path.insert(0, str(ROOT))
+
+ from src.data import load_aapl_with_spy, get_daily_vol
+ from src.features import frac_diff_ffd
+ from src.labeling import cusum_filter, get_events, get_bins, drop_labels
+ from src.models.xgb_model import XGBTripleBarrier
+
+
+ CLASS_LABELS = {-1: "DOWN (stop-loss first)", 0: "FLAT (time-out, no signal)", 1: "UP (profit-taking first)"}
+
+
+ def build_features_and_labels():
+     """Rebuild the full feature matrix + triple-barrier labels at startup."""
+     df = load_aapl_with_spy()
+     close = df["Adj Close"]
+     log_returns = np.log(close).diff().dropna()
+     daily_vol = get_daily_vol(close, span=100)
+
+     features = pd.DataFrame(index=df.index)
+     features["frac_diff_close"] = frac_diff_ffd(np.log(close).to_frame("c"), 0.4, thres=1e-5)["c"]
+     features["frac_diff_volume"] = frac_diff_ffd(
+         np.log(df["Volume"].replace(0, np.nan)).to_frame("v"), 0.4, thres=1e-5
+     )["v"]
+     features["hl_range"] = (df["High"] - df["Low"]) / df["Close"]
+     features["spy_return"] = np.log(df["SPY_Close"]).diff()
+     features["volatility_20d"] = log_returns.rolling(20).std()
+     features["rolling_beta"] = (
+         log_returns.rolling(30).cov(features["spy_return"])
+         / features["spy_return"].rolling(30).var()
+     )
+     features["day_of_week"] = df.index.dayofweek
+     features["vol_regime"] = daily_vol / daily_vol.rolling(252, min_periods=60).median()
+     features = features.dropna()
+
+     t_events = cusum_filter(np.log(close), threshold=float(daily_vol.median()))
+     events = get_events(
+         close=close, t_events=t_events, pt_sl=(2.0, 2.0),
+         target=daily_vol, min_ret=0.005, num_days=10,
+     )
+     labels = get_bins(events, close)
+     events_with_labels = events.join(labels[["bin"]])
+     events_with_labels = drop_labels(events_with_labels, min_pct=0.05)
+     labels = labels.loc[events_with_labels.index]
+
+     aligned = features.index.intersection(labels.index)
+     return df, close, features, labels.loc[aligned, "bin"].astype(int), features.loc[aligned]
+
+
+ print("Loading data and training XGBoost (one-time, ~10 sec)...")
+ DF, CLOSE, FEATURES_FULL, Y_TRAIN, X_TRAIN_ALIGNED = build_features_and_labels()
+
+ from sklearn.preprocessing import StandardScaler
+ SCALER = StandardScaler().fit(X_TRAIN_ALIGNED.values)
+ MODEL = XGBTripleBarrier(random_state=42)
+ MODEL.fit(
+     pd.DataFrame(SCALER.transform(X_TRAIN_ALIGNED.values), index=X_TRAIN_ALIGNED.index, columns=X_TRAIN_ALIGNED.columns),
+     Y_TRAIN.values,
+ )
+ print(f"Model trained on {len(X_TRAIN_ALIGNED)} labeled events. Ready.")
+
+ VALID_DATES = FEATURES_FULL.index
+ DEFAULT_DATE = VALID_DATES[-1]
+
+
+ def predict(date_str: str):
+     try:
+         date = pd.Timestamp(date_str)
+     except Exception:
+         return "Invalid date format. Use YYYY-MM-DD.", None, None
+
+     available = FEATURES_FULL.index[FEATURES_FULL.index <= date]
+     if len(available) == 0:
+         return f"No features available on or before {date.date()}. Try a later date.", None, None
+     use_date = available[-1]
+
+     x_row = FEATURES_FULL.loc[[use_date]]
+     x_scaled = pd.DataFrame(SCALER.transform(x_row.values), index=x_row.index, columns=x_row.columns)
+     proba = MODEL.predict_proba(x_scaled)[0]
+     pred_class = int(MODEL.classes_[np.argmax(proba)])
+
+     proba_df = pd.DataFrame(
+         {"class": [CLASS_LABELS[c] for c in MODEL.classes_], "probability": [f"{p:.1%}" for p in proba]}
+     )
+
+     end_idx = DF.index.get_loc(use_date)
+     start_idx = max(0, end_idx - 59)
+     chart_data = DF["Adj Close"].iloc[start_idx : end_idx + 1]
+
+     fig, ax = plt.subplots(figsize=(8, 3.5))
+     ax.plot(chart_data.index, chart_data.values, color="black", lw=1.0)
+     ax.scatter([chart_data.index[-1]], [chart_data.iloc[-1]], color="red", s=40, zorder=3, label=f"As-of: {use_date.date()}")
+     ax.set_title(f"AAPL adjusted close — 60 days ending {use_date.date()}")
+     ax.set_ylabel("Price ($)")
+     ax.legend(loc="best")
+     ax.grid(alpha=0.3)
+     plt.tight_layout()
+
+     summary = (
+         f"**As-of date:** {use_date.date()} \n"
+         f"**Last close:** ${chart_data.iloc[-1]:.2f} \n"
+         f"**Prediction (next 10 trading days):** {CLASS_LABELS[pred_class]} \n"
+         f"**Confidence (max class probability):** {proba.max():.1%}"
+     )
+     return summary, proba_df, fig
+
+
+ def build_interface():
+     import gradio as gr
+
+     caveat = """
+ > ⚠️ **This is an educational portfolio artifact, NOT a trading signal.**
+ >
+ > Under 5-fold purged k-fold cross-validation (López de Prado, *AFML*, Ch.7), this XGBoost
+ > classifier reaches mean accuracy ~38% on a 3-class triple-barrier label set (random baseline
+ > = 33%, p<0.05 in 3 of 5 folds). However, **directional accuracy *when the model picks a side*
+ > is ~36% — worse than coin flip**. The model is mildly informative about "will something
+ > happen vs nothing" but uninformative about "up vs down." Do not trade real money on this.
+ """
+
+     with gr.Blocks(title="AAPL Triple-Barrier Direction Classifier") as demo:
+         gr.Markdown("# AAPL Triple-Barrier Direction Classifier (educational)")
+         gr.Markdown(caveat)
+         gr.Markdown(
+             "Reference-backed financial-ML pipeline: triple-barrier labeling "
+             "(AFML Ch.3), fractional differentiation (Ch.5), purged k-fold CV (Ch.7), "
+             "XGBoost classifier. Repo: this folder."
+         )
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 date_input = gr.Textbox(
+                     label="As-of date (YYYY-MM-DD)",
+                     value=str(DEFAULT_DATE.date()),
+                     info=f"Valid range: {VALID_DATES[0].date()} → {VALID_DATES[-1].date()}",
+                 )
+                 predict_btn = gr.Button("Predict next 10-day direction", variant="primary")
+                 summary_md = gr.Markdown()
+                 proba_table = gr.Dataframe(headers=["class", "probability"], label="Class probabilities")
+
+             with gr.Column(scale=2):
+                 chart = gr.Plot(label="60-day price context")
+
+         predict_btn.click(
+             fn=predict, inputs=[date_input], outputs=[summary_md, proba_table, chart]
+         )
+
+         gr.Markdown(
+             "---\n"
+             "Headline result table (mean over 5 purged folds):\n\n"
+             "| Model | Accuracy | Beat random (p<0.05) | Dir.acc when acting |\n"
+             "|-----------|----------|----------------------|---------------------|\n"
+             "| Majority | 35.0% | 0/5 folds | N/A |\n"
+             "| SES | 36.8% | 2/5 folds | always abstains |\n"
+             "| ARIMA | 36.8% | 2/5 folds | always abstains |\n"
+             "| LSTM | 35.8% | 2/5 folds | 33% (worse than 50%) |\n"
+             "| **XGBoost** | **37.8%** | **3/5 folds** | 36% (worse than 50%) |\n"
+         )
+
+     return demo
+
+
+ if __name__ == "__main__":
+     app = build_interface()
+     app.launch(server_name="127.0.0.1", server_port=7860, inbrowser=False, share=False)
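The three entries in `CLASS_LABELS` correspond to which barrier a price path touches first. As a rough illustration only, the sketch below labels a single path using fixed fractional barriers and the standard library. It is not the repo's `get_events` / `get_bins` code (which scales barriers by EWM volatility); the function name `first_touch_label` and its parameters are hypothetical.

```python
def first_touch_label(prices, upper, lower, max_steps):
    """Return +1, -1, or 0 for the path that starts at prices[0].

    upper / lower are fractional returns (for example 0.02 and -0.02);
    max_steps is the vertical barrier, counted in bars after the entry bar.
    """
    entry = prices[0]
    for price in prices[1 : max_steps + 1]:
        ret = price / entry - 1.0
        if ret > upper:
            return 1   # profit-taking barrier touched first
        if ret < lower:
            return -1  # stop-loss barrier touched first
    return 0           # time limit reached with neither barrier touched


up = first_touch_label([100, 101, 103], upper=0.02, lower=-0.02, max_steps=10)
down = first_touch_label([100, 99, 97], upper=0.02, lower=-0.02, max_steps=10)
flat = first_touch_label([100, 100.5, 99.5, 105], upper=0.02, lower=-0.02, max_steps=2)
```

In the last call the jump to 105 happens on the third bar, after the two-bar time limit, so the path is labeled as a time-out rather than a profit-taking hit.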
src/cv.py ADDED
@@ -0,0 +1,81 @@
+ """Purged k-fold cross-validation — AFML Ch.7 (BonusPDF pp.62-67).
+
+ Standard k-fold leaks information in finance because labels span time intervals.
+ If a training label's interval ``[t_i, t1_i]`` overlaps a test label's interval
+ ``[t_j, t1_j]``, the two share underlying price information and the train/test
+ boundary is fictitious. ``PurgedKFold`` drops the offending training samples;
+ an additional ``pctEmbargo`` buffer drops samples immediately *after* each test
+ fold to prevent reverse leakage from the test set into a later train fold.
+
+ This is a port of AFML Snippets 7.2-7.3 (BonusPDF pp.65-66). The canonical class
+ inherits from sklearn's ``_BaseKFold`` so it works as a drop-in replacement.
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+ import pandas as pd
+ from scipy import stats
+ from sklearn.model_selection._split import _BaseKFold
+
+
+ class PurgedKFold(_BaseKFold):
+     """K-fold CV with purging + optional embargo. AFML Snippet 7.3 (BonusPDF p.66)."""
+
+     def __init__(self, n_splits: int = 5, t1: pd.Series | None = None, pct_embargo: float = 0.0):
+         if not isinstance(t1, pd.Series):
+             raise ValueError("`t1` must be a pd.Series of label-end timestamps")
+         super().__init__(n_splits, shuffle=False, random_state=None)
+         self.t1 = t1
+         self.pct_embargo = pct_embargo
+
+     def split(self, X, y=None, groups=None):
+         if not X.index.equals(self.t1.index):
+             raise ValueError("X.index must equal t1.index")
+         indices = np.arange(X.shape[0])
+         embargo_size = int(X.shape[0] * self.pct_embargo)
+         test_ranges = [(arr[0], arr[-1] + 1) for arr in np.array_split(indices, self.n_splits)]
+
+         for i, j in test_ranges:
+             t0 = self.t1.index[i]
+             test_indices = indices[i:j]
+             max_t1_in_test = self.t1.iloc[test_indices].max()
+             max_t1_pos = self.t1.index.searchsorted(max_t1_in_test)
+             # left train: rows whose label ended before test starts
+             left_train = self.t1.index.searchsorted(self.t1[self.t1 <= t0].index)
+             # right train: rows starting after max-t1 + embargo
+             if max_t1_pos < X.shape[0]:
+                 right_train = indices[max_t1_pos + embargo_size :]
+             else:
+                 right_train = np.array([], dtype=int)
+             train_indices = np.concatenate([left_train, right_train])
+             yield train_indices, test_indices
+
+
+ def get_embargo_times(times: pd.DatetimeIndex, pct_embargo: float) -> pd.Series:
+     """AFML Snippet 7.2 (BonusPDF p.65). Map each timestamp to its embargo end."""
+     step = int(times.shape[0] * pct_embargo)
+     if step == 0:
+         return pd.Series(times, index=times)
+     embargo = pd.Series(times[step:], index=times[:-step])
+     return pd.concat([embargo, pd.Series(times[-1], index=times[-step:])])
+
+
+ def binomial_pvalue(n_correct: int, n_total: int, p_null: float = 0.5) -> float:
+     """One-sided binomial p-value: ``P(X >= n_correct | n=n_total, p=p_null)``.
+
+     Used to test whether observed accuracy or directional accuracy exceeds the
+     null. For three-class targets, pass ``p_null=1/3``; for binary direction
+     after dropping 0-labels, pass ``p_null=0.5``.
+     """
+     return float(stats.binomtest(n_correct, n_total, p=p_null, alternative="greater").pvalue)
+
+
+ def proportion_ci(n_correct: int, n_total: int, alpha: float = 0.05) -> tuple[float, float]:
+     """Wilson 95% CI for an accuracy proportion. More accurate than normal-approx for small n."""
+     if n_total == 0:
+         return (np.nan, np.nan)
+     ci = stats.binomtest(n_correct, n_total).proportion_ci(
+         confidence_level=1 - alpha, method="wilson"
+     )
+     return float(ci.low), float(ci.high)
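The purging rule in the module docstring can be illustrated without pandas. The sketch below is a toy under stated assumptions, not the `PurgedKFold` class itself: labels are integer `(start, end)` intervals, intervals that only touch at an endpoint count as non-overlapping, and there is no embargo. The name `purge_train` is hypothetical.

```python
def purge_train(intervals, test_ids):
    """Drop training samples whose label interval overlaps the test span.

    intervals: list of (start, end) label intervals, one per sample.
    test_ids:  indices of the samples that form the test fold.
    Returns the indices that remain usable for training.
    """
    test_set = set(test_ids)
    test_start = min(intervals[i][0] for i in test_ids)
    test_end = max(intervals[i][1] for i in test_ids)
    keep = []
    for i, (start, end) in enumerate(intervals):
        if i in test_set:
            continue
        # Overlap means the label shares price information with the test fold.
        overlaps = start < test_end and end > test_start
        if not overlaps:
            keep.append(i)
    return keep


# Six samples; samples 2 and 3 form the test fold, spanning time 4 to 7.
intervals = [(0, 2), (1, 5), (4, 6), (5, 7), (6, 9), (9, 11)]
kept = purge_train(intervals, test_ids=[2, 3])
```

Samples 1 and 4 are purged because their labels reach into the test span, even though they start outside it. That is the leakage a plain k-fold split would miss.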
src/data.py ADDED
@@ -0,0 +1,52 @@
+ """Data loaders for the AAPL/SPY pipeline + EWM daily volatility (AFML Snippet 3.1).
+
+ The CSVs under ``data/raw/`` have a column-header bug: the header reads
+ ``Open,High,Low,Close,Adj Close,Volume`` but the underlying yfinance frame was
+ saved after a ``sort_index(axis=1)`` so the actual column order is alphabetical:
+ ``Adj Close, Close, High, Low, Open, Volume``. We override the headers on load.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import numpy as np
+ import pandas as pd
+
+ DATA_DIR = Path(__file__).resolve().parent.parent / "data" / "raw"
+
+ ACTUAL_COLUMN_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume", "company_name"]
+
+
+ def load_ohlcv(ticker: str, data_dir: Path | None = None) -> pd.DataFrame:
+     """Load a single-ticker OHLCV CSV from ``data/raw/``, fixing the column order."""
+     data_dir = data_dir or DATA_DIR
+     path = data_dir / f"{ticker}_stock_data_2010_2024.csv"
+     df = pd.read_csv(path, header=0, names=ACTUAL_COLUMN_ORDER, skiprows=1)
+     df["Date"] = pd.to_datetime(df["Date"])
+     df = df.set_index("Date").sort_index()
+     return df[["Open", "High", "Low", "Close", "Adj Close", "Volume"]]
+
+
+ def load_aapl_with_spy() -> pd.DataFrame:
+     """Merged AAPL + SPY frame for market-relative features. Index = trading dates."""
+     aapl = load_ohlcv("AAPL")
+     spy = load_ohlcv("SPY")[["Adj Close", "Volume"]].rename(
+         columns={"Adj Close": "SPY_Close", "Volume": "SPY_Volume"}
+     )
+     return aapl.join(spy, how="inner")
+
+
+ def get_daily_vol(close: pd.Series, span: int = 100) -> pd.Series:
+     """EWM daily-return volatility — AFML Snippet 3.1 (BonusPDF p.26).
+
+     Used to set the horizontal barrier widths in triple-barrier labeling. Output
+     is forward-fill safe: NaNs only at the leading edge before EWM warmup.
+     """
+     returns = close.pct_change()
+     return returns.ewm(span=span).std()
+
+
+ def cumulative_returns_path(close: pd.Series, t0, t1) -> pd.Series:
+     """Return path from t0 to t1 expressed as ``close/close[t0] - 1``."""
+     return close.loc[t0:t1] / close.loc[t0] - 1
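The header override described in the module docstring amounts to discarding the written header and assigning the real column order by position. A stdlib-only sketch of that idea follows. It is not the `load_ohlcv` implementation: the helper name `read_with_true_header` is hypothetical, the sample row uses made-up numbers, and the `company_name` column is left out for brevity.

```python
import csv
import io

# Column order the data rows are assumed to follow (alphabetical after Date).
ACTUAL_ORDER = ["Date", "Adj Close", "Close", "High", "Low", "Open", "Volume"]


def read_with_true_header(text, names=ACTUAL_ORDER):
    """Parse CSV text, ignoring its header row and naming fields by position."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # discard the misleading header row
    return [dict(zip(names, row)) for row in reader]


# Header claims Open comes first, but the first value is really Adj Close.
sample = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2010-01-04,6.4,7.6,7.7,7.5,7.6,493729600\n"
)
rows = read_with_true_header(sample)
```

Reading the same text with the written header would report 6.4 as the open price; with the positional names it is correctly treated as the adjusted close.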
src/eval.py ADDED
@@ -0,0 +1,76 @@
+ """Evaluation metrics with statistical significance — triple-barrier era.
+
+ The original notebook reported directional accuracy without binomial p-values;
+ 49.9% over 499 days is statistically indistinguishable from 50%. This module
+ makes that explicit by attaching a p-value to every accuracy figure.
+
+ Metric conventions
+ ------------------
+ - For 3-class labels ``{-1, 0, +1}``, the null is uniform random: ``p_null=1/3``.
+ - For *directional accuracy when acting*, restrict to predictions ``in {-1, +1}``
+   (i.e. ignore "no-action" 0 predictions), compare to ``p_null=1/2``.
+ - Both metrics use a one-sided binomial test (we only care if it beats chance).
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+ import pandas as pd
+ from sklearn.metrics import accuracy_score, confusion_matrix
+
+ from .cv import binomial_pvalue
+
+
+ def directional_accuracy_when_acting(
+     y_true: np.ndarray, y_pred: np.ndarray
+ ) -> tuple[float, int, int]:
+     """Accuracy conditioned on the model predicting a non-zero direction.
+
+     Returns ``(accuracy, n_correct, n_acting)``. If ``n_acting`` is 0, returns
+     ``(nan, 0, 0)``.
+     """
+     acting_mask = y_pred != 0
+     n_acting = int(acting_mask.sum())
+     if n_acting == 0:
+         return float("nan"), 0, 0
+     correct = int(((y_pred == y_true) & acting_mask).sum())
+     return correct / n_acting, correct, n_acting
+
+
+ def fold_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
+     """Per-fold metric bundle. Designed to be one row in the comparison CSV."""
+     y_true = np.asarray(y_true)
+     y_pred = np.asarray(y_pred)
+     n = len(y_true)
+     acc = accuracy_score(y_true, y_pred)
+     n_acc_correct = int((y_true == y_pred).sum())
+     dir_acc, n_dir_correct, n_acting = directional_accuracy_when_acting(y_true, y_pred)
+
+     return {
+         "n_test": n,
+         "accuracy": acc,
+         "binom_p_acc": binomial_pvalue(n_acc_correct, n, p_null=1 / 3),
+         "n_acting": n_acting,
+         "dir_acc_when_acting": dir_acc,
+         "binom_p_dir": (
+             binomial_pvalue(n_dir_correct, n_acting, p_null=0.5) if n_acting > 0 else float("nan")
+         ),
+     }
+
+
+ def summarize_results(results: pd.DataFrame) -> pd.DataFrame:
+     """Aggregate per-fold rows to per-model summary with mean ± std."""
+     keep = ["accuracy", "binom_p_acc", "dir_acc_when_acting", "binom_p_dir"]
+     grouped = results.groupby("model")[keep]
+     summary = grouped.agg(["mean", "std"])
+     summary.columns = [f"{c}_{stat}" for c, stat in summary.columns]
+     summary["n_folds"] = results.groupby("model").size()
+     return summary.reset_index()
+
+
+ def confusion_table(y_true: np.ndarray, y_pred: np.ndarray, labels=(-1, 0, 1)) -> pd.DataFrame:
+     """Confusion matrix as a labeled DataFrame (rows=true, cols=pred)."""
+     cm = confusion_matrix(y_true, y_pred, labels=list(labels))
+     return pd.DataFrame(
+         cm, index=[f"true_{c}" for c in labels], columns=[f"pred_{c}" for c in labels]
+     )
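The "directional accuracy when acting" convention and its one-sided test can be worked through by hand on a tiny example. The sketch below uses only the standard library: `binomial_tail` computes the exact tail sum that `scipy.stats.binomtest(..., alternative="greater")` is expected to match, and `accuracy_when_acting` mirrors the masking logic with plain lists. Both names and the six-sample data are hypothetical.

```python
from math import comb


def binomial_tail(n_correct, n_total, p_null):
    """Exact one-sided P(X >= n_correct) for X ~ Binomial(n_total, p_null)."""
    return sum(
        comb(n_total, k) * p_null**k * (1 - p_null) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def accuracy_when_acting(y_true, y_pred):
    """Accuracy over the samples where the prediction is non-zero."""
    acted = [(t, p) for t, p in zip(y_true, y_pred) if p != 0]
    if not acted:
        return float("nan"), 0, 0
    n_correct = sum(t == p for t, p in acted)
    return n_correct / len(acted), n_correct, len(acted)


y_true = [1, -1, 0, 1, -1, 0]
y_pred = [1, 1, 0, 0, -1, 1]
acc, n_correct, n_acting = accuracy_when_acting(y_true, y_pred)
p_value = binomial_tail(n_correct, n_acting, p_null=0.5)
```

Here the model acts on four of six samples and gets two right, so the conditional accuracy is 50% and the p-value is 11/16. Note that a true label of 0 counts as a miss whenever the model acts, which is one reason this metric can fall below 50%.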
src/features.py ADDED
@@ -0,0 +1,109 @@
+ """Fractional differentiation — AFML Ch.5 §5.4 (BonusPDF p.46).
+
+ Why this module exists
+ ----------------------
+ Log-returns achieve stationarity but destroy memory: the binomial weights
+ ``(1-B)^d`` collapse to ``[1, -1, 0, 0, ...]`` at ``d=1``. For ``d ∈ (0, 1)``
+ the weights decay as a long power-law tail, so the series stays stationary
+ while retaining a long memory of past prices (Table 5.1 in AFML shows most
+ liquid futures reach ADF stationarity at ``d < 0.6``, and the majority at
+ ``d < 0.3``).
+
+ This is a port of AFML Snippets 5.1, 5.3, 5.4 (BonusPDF pp.48, 51, 53).
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+ import pandas as pd
+ from scipy.special import gamma
+
+
+ def get_ffd_weights(d: float, thres: float = 1e-5, max_size: int = 1024) -> np.ndarray:
+     """Binomial-series weights for the fractional-differencing operator ``(1-B)^d``.
+
+     Cuts the series off once ``|w_k| < thres``. Uses ``scipy.special.gamma`` for
+     a vectorized closed form rather than the recursive loop in AFML Snippet 5.1
+     — same values, faster and avoids accumulated float error in long series.
+
+     Returns
+     -------
+     np.ndarray of shape ``(n,)`` ordered from oldest to newest:
+     ``[w_{n-1}, w_{n-2}, ..., w_1, w_0]`` so the dot product with
+     ``series[t-n+1 : t+1]`` is the differenced value at ``t``.
+     """
+     k = np.arange(max_size)
+     with np.errstate(invalid="ignore", divide="ignore"):
+         w = (-1) ** k * gamma(d + 1) / (gamma(k + 1) * gamma(d - k + 1))
+     w = np.nan_to_num(w, nan=0.0, posinf=0.0, neginf=0.0)
+     cutoff = np.argmax(np.abs(w) < thres) if np.any(np.abs(w) < thres) else max_size
+     if cutoff == 0:
+         cutoff = max_size
+     return w[:cutoff][::-1]
+
+
+ def frac_diff_ffd(series: pd.Series | pd.DataFrame, d: float, thres: float = 1e-5) -> pd.DataFrame:
+     """Fixed-width fractional differencing — AFML Snippet 5.3 (BonusPDF p.51).
+
+     The fixed-width window keeps weights stable through time (unlike the
+     expanding-window variant in Snippet 5.2 which downweights early observations).
+     """
+     if isinstance(series, pd.Series):
+         series = series.to_frame()
+     w = get_ffd_weights(d, thres=thres)  # shape (width+1,)
+     width = len(w) - 1
+     out = {}
+     for col in series.columns:
+         s = series[[col]].ffill().dropna()
+         if len(s) <= width:
+             out[col] = pd.Series(index=s.index[width:], dtype=float)
+             continue
+         values = s[col].to_numpy()
+         # Vectorized: build a (n_out, width+1) sliding-window matrix and dot with w
+         from numpy.lib.stride_tricks import sliding_window_view
+         windows = sliding_window_view(values, width + 1)
+         diffed = windows @ w
+         out[col] = pd.Series(diffed, index=s.index[width:])
+     return pd.concat(out, axis=1)
+
+
+ def find_min_d(series: pd.Series, d_range=(0.0, 1.0), n_steps: int = 11, thres: float = 1e-5) -> pd.DataFrame:
+     """Sweep ``d`` and return ADF stat + correlation — AFML Snippet 5.4 (BonusPDF p.53).
+
+     Use to pick the smallest ``d`` for which the FFD-differenced log-price passes
+     the ADF stationarity test at 95% (statistic < critical value ≈ -2.86).
+     Returns a frame indexed by ``d`` with columns: ``adf_stat, p_value, n_obs,
+     crit_95, corr_with_original``.
+     """
+     from statsmodels.tsa.stattools import adfuller
+
+     log_series = np.log(series.dropna()).to_frame(name=series.name or "value")
+     results = {}
+     for d in np.linspace(d_range[0], d_range[1], n_steps):
+         diffed = frac_diff_ffd(log_series, d, thres=thres).dropna()
+         if len(diffed) < 50:
+             continue
+         col = diffed.columns[0]
+         adf = adfuller(diffed[col], maxlag=1, regression="c", autolag=None)
+         aligned = log_series.loc[diffed.index, col]
+         corr = float(aligned.corr(diffed[col]))
+         results[round(d, 3)] = {
+             "adf_stat": adf[0],
+             "p_value": adf[1],
+             "n_obs": adf[3],
+             "crit_95": adf[4]["5%"],
+             "corr_with_original": corr,
+         }
+     return pd.DataFrame(results).T.rename_axis("d")
+
+
+ def rolling_zscore(series: pd.Series, window: int = 252, min_periods: int | None = None) -> pd.Series:
+     """Rolling z-score with leak-free statistics (uses only the trailing window).
+
+     Stronger than a single fit-on-train ``StandardScaler`` because regime shifts
+     don't carry stale means forward into the test set.
+     """
+     min_periods = min_periods or max(window // 4, 20)
+     mu = series.rolling(window=window, min_periods=min_periods).mean()
+     sd = series.rolling(window=window, min_periods=min_periods).std()
+     return (series - mu) / sd.replace(0, np.nan)
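For comparison with the gamma-based closed form in `get_ffd_weights`, the recursive definition of the weights, w_0 = 1 and w_k = -w_{k-1} (d - k + 1) / k, can be sketched in plain Python. This is an illustrative stand-in, not the repo's function: the name `ffd_weights_recursive` and the `max_size` default are assumptions, and the weights are returned newest first rather than reversed. One point worth checking against the closed form: `gamma(k + 1)` exceeds the float64 range once `k` passes roughly 171, so terms beyond that index appear likely to be zeroed by `nan_to_num`, which would cut the window earlier than `thres` alone implies. The recursion does not have that limit.

```python
def ffd_weights_recursive(d, thres=1e-5, max_size=10_000):
    """Weights of (1 - B)^d, newest first, truncated once |w_k| < thres."""
    weights = [1.0]
    k = 1
    while k < max_size:
        # Each weight follows from the previous one; no factorials are needed.
        w_k = -weights[-1] * (d - k + 1) / k
        if abs(w_k) < thres:
            break
        weights.append(w_k)
        k += 1
    return weights


first_diff = ffd_weights_recursive(1.0)  # ordinary first difference
frac = ffd_weights_recursive(0.4)        # the d used for the app's features
```

At `d=1.0` the weights collapse to `[1.0, -1.0]`, which is the ordinary first difference described in the module docstring. At `d=0.4` the tail decays slowly and the list runs well past 171 entries before dropping under `1e-5`.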
src/labeling.py ADDED
@@ -0,0 +1,219 @@
+ """Triple-barrier labeling — AFML Ch.3 (BonusPDF pp.26-34).
+
+ The triple-barrier method assigns each event one of three labels based on which
+ of three barriers is hit first:
+
+ - ``+1`` — upper (profit-taking) horizontal barrier hit first
+ - ``-1`` — lower (stop-loss) horizontal barrier hit first
+ - ``0`` — vertical (max holding period) barrier hit first
+
+ The horizontal barriers are scaled by a per-event volatility estimate (typically
+ EWM daily vol, ``get_daily_vol`` in ``src/data.py``). This is a port of AFML
+ Snippets 3.2-3.5 and Rambo's cleaner ``get_triple_barrier_label`` (his repo,
+ ``Chapter_3.py``).
+ """
+
+ from __future__ import annotations
+
+ import numpy as np
+ import pandas as pd
+
+
+ def apply_pt_sl_on_t1(
+     close: pd.Series, events: pd.DataFrame, pt_sl: tuple[float, float]
+ ) -> pd.DataFrame:
+     """AFML Snippet 3.2 (BonusPDF p.27). Find time of first barrier touch.
+
+     Parameters
+     ----------
+     close : pd.Series
+         Closing-price series, indexed by date.
+     events : pd.DataFrame
+         Required columns: ``t1`` (vertical-barrier date or NaT), ``target``
+         (vol estimate at the event), ``side`` (+1 for long, -1 for short; if
+         we don't know side, pass +1 for all).
+     pt_sl : (float, float)
+         Profit-taking and stop-loss multipliers of ``target``. Pass 0 to disable
+         a barrier.
+
+     Returns
+     -------
+     pd.DataFrame indexed like ``events`` with columns ``t1, pt, sl`` containing
+     the first-touch timestamps (NaT if never touched).
+     """
+     out = events[["t1"]].copy()
+     pt = pt_sl[0] * events["target"] if pt_sl[0] > 0 else pd.Series(np.nan, index=events.index)
+     sl = -pt_sl[1] * events["target"] if pt_sl[1] > 0 else pd.Series(np.nan, index=events.index)
+
+     for t0, t1 in events["t1"].fillna(close.index[-1]).items():
+         path_prices = close.loc[t0:t1]
+         path_returns = (path_prices / close.loc[t0] - 1) * events.at[t0, "side"]
+         sl_hits = path_returns[path_returns < sl[t0]]
+         pt_hits = path_returns[path_returns > pt[t0]]
+         out.at[t0, "sl"] = sl_hits.index.min() if len(sl_hits) else pd.NaT
+         out.at[t0, "pt"] = pt_hits.index.min() if len(pt_hits) else pd.NaT
+     return out
+
+
+ def add_vertical_barrier(
+     close: pd.Series, t_events: pd.DatetimeIndex, num_days: int
+ ) -> pd.Series:
+     """AFML Snippet 3.4 (BonusPDF p.30). Vertical (time-limit) barriers.
+
+     Returns a Series indexed by ``t_events`` whose values are ``num_days`` later,
64
+ snapped to the next available trading day; events too close to the end of
65
+ the series are dropped.
66
+ """
67
+ t1 = close.index.searchsorted(t_events + pd.Timedelta(days=num_days))
68
+ t1 = t1[t1 < close.shape[0]]
69
+ return pd.Series(close.index[t1], index=t_events[: len(t1)])
70
+
71
+
72
+ def get_events(
73
+ close: pd.Series,
74
+ t_events: pd.DatetimeIndex,
75
+ pt_sl: tuple[float, float],
76
+ target: pd.Series,
77
+ min_ret: float,
78
+ num_days: int | None = None,
79
+ side: pd.Series | None = None,
80
+ ) -> pd.DataFrame:
81
+ """AFML Snippet 3.3 (BonusPDF p.29). Run triple-barrier for a batch of events.
82
+
83
+ Returns a DataFrame indexed by event start time with columns:
84
+
85
+ - ``t1`` (timestamp of the *first* barrier hit — earliest of vertical/pt/sl)
86
+ - ``vertical_t1`` (the original vertical-barrier date)
87
+ - ``barrier_hit`` (one of ``"vertical"`` / ``"pt"`` / ``"sl"`` — what was hit
88
+ first; used by ``get_bins`` to produce the {-1, 0, +1} label)
89
+ - ``target`` (vol estimate at the event)
90
+
91
+ If ``side`` is provided, it is propagated for downstream meta-labeling.
92
+ """
93
+ target = target.reindex(t_events).dropna()
94
+ target = target[target > min_ret]
95
+
96
+ if num_days is not None:
97
+ vertical_t1 = add_vertical_barrier(close, target.index, num_days)
98
+ else:
99
+ vertical_t1 = pd.Series(pd.NaT, index=target.index)
100
+
101
+ if side is None:
102
+ side_ = pd.Series(1.0, index=target.index)
103
+ else:
104
+ side_ = side.reindex(target.index).fillna(1.0)
105
+
106
+ events = pd.concat(
107
+ {"t1": vertical_t1, "target": target, "side": side_}, axis=1
108
+ ).dropna(subset=["target"])
109
+ touches = apply_pt_sl_on_t1(close, events, pt_sl)
110
+
111
+ # Drop events where no barrier ever fires (can't happen with a vertical
112
+ # barrier present, but defensive against future config changes).
113
+ touches = touches.dropna(subset=["t1", "pt", "sl"], how="all")
114
+ events = events.loc[touches.index]
115
+
116
+ # Earliest touch among (vertical, pt, sl); record which barrier won.
117
+ all_touches = touches[["t1", "pt", "sl"]]
118
+ earliest = all_touches.min(axis=1)
119
+ # Row-wise argmin over the three touch times. NaT is swapped for a far-future
120
+ # sentinel first; ties resolve in column order (pt, then sl, then vertical).
121
+ pt_arr = all_touches["pt"]
122
+ sl_arr = all_touches["sl"]
123
+ vert_arr = all_touches["t1"]
124
+ # Replace NaT with a very large date for comparison purposes
125
+ far = pd.Timestamp.max
126
+ cmp = pd.DataFrame(
127
+ {
128
+ "pt": pt_arr.fillna(far),
129
+ "sl": sl_arr.fillna(far),
130
+ "vertical": vert_arr.fillna(far),
131
+ }
132
+ )
133
+ barrier_hit = cmp.idxmin(axis=1)
134
+
135
+ events["vertical_t1"] = events["t1"]
136
+ events["t1"] = earliest
137
+ events["barrier_hit"] = barrier_hit.astype(str)
138
+ if side is None:
139
+ events = events.drop("side", axis=1)
140
+ return events.dropna(subset=["t1"])
141
+
142
+
143
+ def get_bins(events: pd.DataFrame, close: pd.Series) -> pd.DataFrame:
144
+ """AFML Snippet 3.5 (BonusPDF p.30). Convert event outcomes to {-1, 0, +1}.
145
+
146
+ Full triple-barrier semantics: the label depends on which barrier was hit
147
+ *first*:
148
+
149
+ - ``barrier_hit == "pt"`` → ``+1`` (profit-taking, scaled by ``side``)
150
+ - ``barrier_hit == "sl"`` → ``-1`` (stop-loss, scaled by ``side``)
151
+ - ``barrier_hit == "vertical"`` → ``0`` (no signal; the time limit ran out
152
+ before either horizontal barrier was hit)
153
+
154
+ If meta-labeling (``side`` column present), maps to ``{0, 1}`` for
155
+ "don't act" vs "act in this side".
156
+ """
157
+ events_ = events.dropna(subset=["t1"]).copy()
158
+ px_idx = events_.index.union(events_["t1"].values).unique()
159
+ px = close.reindex(px_idx, method="bfill")
160
+
161
+ out = pd.DataFrame(index=events_.index)
162
+ out["ret"] = px.loc[events_["t1"].values].values / px.loc[events_.index].values - 1
163
+ if "side" in events_.columns:
164
+ out["ret"] *= events_["side"].values
165
+
166
+ if "barrier_hit" in events_.columns:
167
+ # Full triple-barrier: 0 when the vertical barrier (time limit) wins.
168
+ out["bin"] = 0
169
+ out.loc[events_["barrier_hit"] == "pt", "bin"] = 1
170
+ out.loc[events_["barrier_hit"] == "sl", "bin"] = -1
171
+ if "side" in events_.columns:
172
+ # meta-labeling: collapse to {0, 1} = "don't act / act"
173
+ out.loc[out["ret"] <= 0, "bin"] = 0
174
+ out.loc[out["bin"] != 0, "bin"] = 1
175
+ else:
176
+ # Fallback to AFML Snippet 3.5 default (sign of return)
177
+ out["bin"] = np.sign(out["ret"]).astype(int)
178
+ out["bin"] = out["bin"].astype(int)
179
+ return out
180
+
181
+
182
+ def drop_labels(events: pd.DataFrame, min_pct: float = 0.05) -> pd.DataFrame:
183
+ """AFML Snippet 3.8 (BonusPDF p.34). Drop labels with < ``min_pct`` support.
184
+
185
+ Repeats until every remaining label has at least ``min_pct`` of observations
186
+ or fewer than 3 classes remain.
187
+ """
188
+ while True:
189
+ counts = events["bin"].value_counts(normalize=True)
190
+ if counts.min() > min_pct or len(counts) < 3:
191
+ break
192
+ smallest = counts.idxmin()
193
+ events = events[events["bin"] != smallest]
194
+ print(f"Dropped label {smallest}: {100 * counts.min():.2f}% of observations")
195
+ return events
196
+
197
+
198
+ def cusum_filter(series: pd.Series, threshold: float) -> pd.DatetimeIndex:
199
+ """Symmetric CUSUM filter — AFML §2.5.2 (general technique).
200
+
201
+ Generates event start times where the cumulative sum of returns (in either
202
+ direction) exceeds ``threshold``. Resets after each event. Returns a
203
+ DatetimeIndex of event-trigger timestamps.
204
+
205
+ Avoids the "predict on every bar" inefficiency by only labeling at
206
+ statistically interesting moments.
207
+ """
208
+ t_events, s_pos, s_neg = [], 0.0, 0.0
209
+ diff = series.diff().fillna(0)
210
+ for t, d in diff.items():
211
+ s_pos = max(0.0, s_pos + d)
212
+ s_neg = min(0.0, s_neg + d)
213
+ if s_neg < -threshold:
214
+ s_neg = 0.0
215
+ t_events.append(t)
216
+ elif s_pos > threshold:
217
+ s_pos = 0.0
218
+ t_events.append(t)
219
+ return pd.DatetimeIndex(t_events)
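A small worked example may help when reading the labeling module above. The sketch below labels a single long position on a toy price list; `triple_barrier_label`, its parameters, and the prices are hypothetical and simplified (plain lists, bar counts instead of calendar days), so treat it as an illustration of the rule rather than the module's implementation.

```python
def triple_barrier_label(prices, start, vol, pt_mult, sl_mult, max_hold):
    """Return (label, exit_index) for a long position opened at prices[start]."""
    upper = pt_mult * vol   # profit-taking barrier, expressed as a return
    lower = -sl_mult * vol  # stop-loss barrier, expressed as a return
    end = min(start + max_hold, len(prices) - 1)  # vertical (time-limit) barrier
    for i in range(start + 1, end + 1):
        ret = prices[i] / prices[start] - 1
        if ret > upper:
            return 1, i   # upper barrier touched first
        if ret < lower:
            return -1, i  # lower barrier touched first
    return 0, end         # neither horizontal barrier touched in time


prices = [100.0, 101.0, 99.0, 103.0, 102.0, 100.5, 100.2]

hit_profit = triple_barrier_label(prices, start=0, vol=0.02, pt_mult=1, sl_mult=1, max_hold=5)
timed_out = triple_barrier_label(prices, start=0, vol=0.02, pt_mult=1, sl_mult=1, max_hold=2)
hit_stop = triple_barrier_label(prices, start=3, vol=0.02, pt_mult=1, sl_mult=1, max_hold=3)
print(hit_profit, timed_out, hit_stop)
```

The same entry point gets a different label when only the holding period changes, which is why the vertical barrier is part of the label definition and not a post-processing step.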
src/models/__init__.py ADDED
File without changes
src/models/arima_model.py ADDED
@@ -0,0 +1,64 @@
1
+ """ARIMA wrapper for triple-barrier classification.
2
+
3
+ ARIMA forecasts a continuous next-step return; we threshold it into ``{-1, 0, +1}``
4
+ using ``±k·σ`` where ``σ`` is the daily-vol estimate at the event time. The
5
+ ``k`` factor matches the profit-taking / stop-loss multiplier used for labeling
6
+ so that the discretization is consistent with the label scheme.
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ import warnings
12
+
13
+ import numpy as np
14
+ import pandas as pd
15
+ from statsmodels.tsa.arima.model import ARIMA
16
+
17
+
18
+ class ARIMAClassifier:
19
+ """Wraps statsmodels ARIMA so it can sit in the same fit/predict loop as XGB/LSTM.
20
+
21
+ The model is fit on the log-price series implied by the training rows (the
22
+ feature matrix carries the volatility estimate per row, used to threshold).
23
+
24
+ Required X columns: ``frac_diff_close`` (used as a proxy for the underlying
25
+ log-price level we want to forecast) and ``target_vol`` (per-event vol used
26
+ to set the ±k·σ threshold).
27
+ """
28
+
29
+ def __init__(self, order: tuple[int, int, int] = (1, 1, 1), threshold_k: float = 0.5):
30
+ self.order = order
31
+ self.threshold_k = threshold_k
32
+ self.fitted_ = None
33
+ self.train_tail_value_: float = 0.0
34
+ self.classes_: np.ndarray = np.array([-1, 0, 1])
35
+
36
+ def fit(self, X, y, sample_weight=None):
37
+ series = X["frac_diff_close"].astype(float).to_numpy()
38
+ with warnings.catch_warnings():
39
+ warnings.simplefilter("ignore")
40
+ self.fitted_ = ARIMA(series, order=self.order).fit()
41
+ self.train_tail_value_ = float(series[-1])
42
+ return self
43
+
44
+ def predict(self, X):
45
+ n = len(X)
46
+ forecast = self.fitted_.forecast(steps=n)
47
+ # convert forecast deltas back to per-step returns vs the tail of training
48
+ last = self.train_tail_value_
49
+ per_step_return = np.diff(np.concatenate([[last], np.asarray(forecast)]))
50
+
51
+ thresholds = self.threshold_k * X["target_vol"].astype(float).to_numpy()
52
+ preds = np.zeros(n, dtype=int)
53
+ preds[per_step_return > thresholds] = 1
54
+ preds[per_step_return < -thresholds] = -1
55
+ return preds
56
+
57
+ def predict_proba(self, X):
58
+ # ARIMA isn't probabilistic in the triple-barrier sense; collapse hard
59
+ # predictions into a one-hot for log-loss calculation.
60
+ preds = self.predict(X)
61
+ proba = np.zeros((len(preds), 3))
62
+ for i, c in enumerate(self.classes_):
63
+ proba[preds == c, i] = 1.0
64
+ return proba
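The ±k·σ discretization described in the ARIMA module docstring can be shown in isolation. This is a minimal sketch assuming the forecast has already been converted to per-step returns; `threshold_forecast` and the sample numbers are made up for illustration and are not part of the repository.

```python
def threshold_forecast(returns, vols, k=0.5):
    """Map each forecast return to +1, -1 or 0 using a band of width k * vol around zero."""
    labels = []
    for r, v in zip(returns, vols):
        band = k * v
        if r > band:
            labels.append(1)
        elif r < -band:
            labels.append(-1)
        else:
            labels.append(0)  # forecast too small relative to current volatility
    return labels


forecast_returns = [0.012, -0.003, -0.020, 0.004]
daily_vols = [0.010, 0.010, 0.015, 0.010]
labels = threshold_forecast(forecast_returns, daily_vols)
print(labels)
```

Scaling the band by the per-event volatility means the same forecast return can be a signal in a calm regime and noise in a volatile one.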
src/models/baselines.py ADDED
@@ -0,0 +1,89 @@
1
+ """Naïve and SES baselines for triple-barrier classification.
2
+
3
+ The original notebook found SES beat the LSTM under rolling evaluation, which
4
+ was the most interesting result. We keep both baselines under the new label
5
+ scheme to see whether that finding survives a fair (purged-CV) comparison.
6
+
7
+ Both classes follow a uniform fit/predict_proba/predict interface so the
8
+ training driver can iterate over models polymorphically.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import numpy as np
14
+ import pandas as pd
15
+ from statsmodels.tsa.holtwinters import SimpleExpSmoothing
16
+
17
+
18
+ class MajorityClassClassifier:
19
+ """Predicts the most common class from the training set every time.
20
+
21
+ The honest "do nothing" baseline. A useful sanity check: any model that
22
+ fails to beat this on accuracy isn't doing anything.
23
+ """
24
+
25
+ def __init__(self):
26
+ self.majority_class_: int | None = None
27
+ self.classes_: np.ndarray | None = None
28
+
29
+ def fit(self, X, y, sample_weight=None):
30
+ y = np.asarray(y)
31
+ self.classes_, counts = np.unique(y, return_counts=True)
32
+ # counts is aligned with classes_, so a missing label (e.g. no 0s in a fold) cannot misindex.
33
+ self.majority_class_ = int(self.classes_[np.argmax(counts)])
34
+ return self
35
+
36
+ def predict(self, X):
37
+ return np.full(len(X), self.majority_class_, dtype=int)
38
+
39
+ def predict_proba(self, X):
40
+ n = len(X)
41
+ proba = np.zeros((n, len(self.classes_)))
42
+ idx = int(np.where(self.classes_ == self.majority_class_)[0][0])
43
+ proba[:, idx] = 1.0
44
+ return proba
45
+
46
+
47
+ class SESClassifier:
48
+ """Simple exponential smoothing applied to the *label series*, then sign-mapped.
49
+
50
+ Approach: fit ``SimpleExpSmoothing`` on the train labels (treated as a
51
+ continuous signal in ``{-1, 0, +1}``), forecast next-step level, and round
52
+ back to the nearest class. Not a real classifier — a sanity check that the
53
+ label sequence has any short-horizon autocorrelation at all.
54
+ """
55
+
56
+ def __init__(self, smoothing_level: float | None = None):
57
+ self.smoothing_level = smoothing_level
58
+ self.model_ = None
59
+ self.last_forecast_: float = 0.0
60
+ self.classes_: np.ndarray | None = None
61
+
62
+ def fit(self, X, y, sample_weight=None):
63
+ y = np.asarray(y, dtype=float)
64
+ self.classes_ = np.array(sorted(np.unique(y.astype(int))))
65
+ self.model_ = SimpleExpSmoothing(y, initialization_method="estimated").fit(
66
+ smoothing_level=self.smoothing_level, optimized=self.smoothing_level is None
67
+ )
68
+ fc = self.model_.forecast(1)
69
+ self.last_forecast_ = float(fc[0] if hasattr(fc, "__getitem__") else fc)
70
+ return self
71
+
72
+ def predict(self, X):
73
+ # SES gives a single forecast; broadcast it across the test window.
74
+ # The "label series has very weak structure" finding is intentional —
75
+ # this is meant to be a sanity baseline.
76
+ n = len(X)
77
+ forecast = self.last_forecast_
78
+ return np.full(n, self._nearest_class(forecast), dtype=int)
79
+
80
+ def predict_proba(self, X):
81
+ n = len(X)
82
+ pred_class = self._nearest_class(self.last_forecast_)
83
+ proba = np.zeros((n, len(self.classes_)))
84
+ idx = int(np.where(self.classes_ == pred_class)[0][0])
85
+ proba[:, idx] = 1.0
86
+ return proba
87
+
88
+ def _nearest_class(self, value: float) -> int:
89
+ return int(self.classes_[np.argmin(np.abs(self.classes_ - value))])
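The SES baseline reduces to two steps: smooth the label sequence, then snap the final level to the nearest class. The sketch below does both by hand. It assumes a fixed `alpha` and initializes the level with the first label, whereas the class above lets statsmodels estimate both; `ses_level` and `nearest_class` are illustrative names.

```python
def ses_level(values, alpha):
    """Simple exponential smoothing; the final level is also the one-step-ahead forecast."""
    level = values[0]  # assumption: start from the first observation
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def nearest_class(value, classes=(-1, 0, 1)):
    """Snap a continuous forecast to the closest label."""
    return min(classes, key=lambda c: abs(c - value))


labels = [1, 1, 0, 1, -1, 1, 1]
level = ses_level(labels, alpha=0.5)
prediction = nearest_class(level)
print(level, prediction)
```

Since the forecast is a single number, every row in the test window receives the same predicted class, which is why this baseline only tests for short-horizon persistence in the labels.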
src/models/lstm_model.py ADDED
@@ -0,0 +1,179 @@
1
+ """Refined LSTM for triple-barrier classification.
2
+
3
+ Architectural choices vs the original notebook
4
+ -----------------------------------------------
5
+ The original model (128→64 units, MSE, no clipnorm) collapsed to predicting the
6
+ mean. The refinements here are each anchored to a specific reference:
7
+
8
+ - **Smaller**: 32→16 units, ~10× fewer params. Jansen's univariate-LSTM notebook
9
+ uses 10 units on S&P daily. Karpathy warns that over-parameterized RNNs
10
+ *"do not always show convincing signs of generalizing in the correct way."*
11
+ - **Gradient clipping** (``clipnorm=1.0``): Goodfellow §10.11.1, eq 10.48-49
12
+ (PDF p.414). Without it, the 60-step BPTT chain has the "cliff" landscape
13
+ shown in figure 10.17 and SGD updates can be catastrophically large.
14
+ - **Recurrent dropout** (``recurrent_dropout=0.1``): Goodfellow §10.11.2
15
+ (PDF p.415). Drops the time-axis connections, which is where the
16
+ generalization problem lives — sequence-level Dropout drops feature
17
+ dimensions and misses this.
18
+ - **Softmax over 3 classes** with ``categorical_crossentropy``: aligns the
19
+ loss with the directional-accuracy metric, fixing the original's
20
+ MSE-vs-direction mismatch.
21
+ - **Forget-gate bias = 1**: Keras default (``unit_forget_bias=True``), kept
22
+ explicit so a reader sees Goodfellow §10.10.2 (PDF p.412) is honored.
23
+
24
+ Sample weighting
25
+ ----------------
26
+ AFML Pitfall #7 (Table 1.2) — non-IID samples need uniqueness weighting. The
27
+ training driver passes a ``sample_weight`` array if available;
28
+ ``categorical_crossentropy`` honors it natively via Keras.
29
+ """
30
+
31
+ from __future__ import annotations
32
+
33
+ import numpy as np
34
+
35
+
36
+ def build_lstm(
37
+ sequence_length: int,
38
+ n_features: int,
39
+ n_classes: int = 3,
40
+ lstm_units: tuple[int, int] = (32, 16),
41
+ dropout: float = 0.2,
42
+ recurrent_dropout: float = 0.1,
43
+ learning_rate: float = 1e-3,
44
+ clipnorm: float = 1.0,
45
+ ):
46
+ """Build the refined LSTM. Import inside the function so tensorflow doesn't load at module import."""
47
+ from tensorflow.keras.layers import LSTM, Dense, Dropout, Input
48
+ from tensorflow.keras.models import Sequential
49
+ from tensorflow.keras.optimizers import Adam
50
+
51
+ model = Sequential(
52
+ [
53
+ Input(shape=(sequence_length, n_features)),
54
+ LSTM(
55
+ lstm_units[0],
56
+ return_sequences=True,
57
+ recurrent_dropout=recurrent_dropout,
58
+ unit_forget_bias=True, # Goodfellow §10.10.2 (PDF p.412)
59
+ ),
60
+ Dropout(dropout),
61
+ LSTM(
62
+ lstm_units[1],
63
+ return_sequences=False,
64
+ recurrent_dropout=recurrent_dropout,
65
+ unit_forget_bias=True,
66
+ ),
67
+ Dropout(dropout),
68
+ Dense(n_classes, activation="softmax"),
69
+ ]
70
+ )
71
+ model.compile(
72
+ optimizer=Adam(learning_rate=learning_rate, clipnorm=clipnorm),
73
+ loss="categorical_crossentropy",
74
+ metrics=["accuracy"],
75
+ )
76
+ return model
77
+
78
+
79
+ def build_sequences(
80
+ X: np.ndarray, y: np.ndarray, sequence_length: int
81
+ ) -> tuple[np.ndarray, np.ndarray]:
82
+ """Convert ``(n_obs, n_features)`` into ``(n_seq, sequence_length, n_features)``.
83
+
84
+ The target at sequence index ``i`` is ``y[i + sequence_length - 1]`` — the
85
+ model predicts the label at the END of each window, not the next step
86
+ beyond it (the next-step view is handled at the event level by the
87
+ triple-barrier ``t1``).
88
+ """
89
+ n = len(X) - sequence_length + 1
90
+ if n <= 0:
91
+ return np.empty((0, sequence_length, X.shape[1])), np.empty((0,))
92
+ X_seq = np.stack([X[i : i + sequence_length] for i in range(n)])
93
+ y_seq = y[sequence_length - 1 :]
94
+ return X_seq, y_seq
95
+
96
+
97
+ class LSTMTripleBarrier:
98
+ """Wraps the refined LSTM with the same fit/predict interface as other models.
99
+
100
+ Owns label encoding ``{-1, 0, +1} -> {0, 1, 2}`` and sequence construction
101
+ so the training driver doesn't have to special-case it.
102
+ """
103
+
104
+ def __init__(
105
+ self,
106
+ sequence_length: int = 60,
107
+ n_features: int = 8,
108
+ epochs: int = 50,
109
+ batch_size: int = 64,
110
+ patience: int = 15,
111
+ verbose: int = 0,
112
+ random_state: int = 42,
113
+ ):
114
+ self.sequence_length = sequence_length
115
+ self.n_features = n_features
116
+ self.epochs = epochs
117
+ self.batch_size = batch_size
118
+ self.patience = patience
119
+ self.verbose = verbose
120
+ self.random_state = random_state
121
+ self.model = None
122
+ self.classes_ = np.array([-1, 0, 1])
123
+ self.history_ = None
124
+
125
+ def fit(self, X, y, sample_weight=None):
126
+ import tensorflow as tf
127
+ from tensorflow.keras.callbacks import EarlyStopping
128
+ from tensorflow.keras.utils import to_categorical
129
+
130
+ tf.random.set_seed(self.random_state)
131
+ np.random.seed(self.random_state)
132
+
133
+ X_arr = np.asarray(X)
134
+ y_enc = np.asarray(y).astype(int) + 1
135
+ X_seq, y_seq = build_sequences(X_arr, y_enc, self.sequence_length)
136
+ if len(X_seq) == 0:
137
+ raise ValueError(f"Not enough rows ({len(X_arr)}) for sequence_length={self.sequence_length}")
138
+ y_onehot = to_categorical(y_seq, num_classes=3)
139
+
140
+ sw_seq = None
141
+ if sample_weight is not None:
142
+ sw_arr = np.asarray(sample_weight)
143
+ sw_seq = sw_arr[self.sequence_length - 1 :]
144
+
145
+ self.model = build_lstm(
146
+ sequence_length=self.sequence_length,
147
+ n_features=X_arr.shape[1],
148
+ )
149
+ callbacks = [
150
+ EarlyStopping(monitor="loss", patience=self.patience, restore_best_weights=True)
151
+ ]
152
+ self.history_ = self.model.fit(
153
+ X_seq,
154
+ y_onehot,
155
+ sample_weight=sw_seq,
156
+ epochs=self.epochs,
157
+ batch_size=self.batch_size,
158
+ verbose=self.verbose,
159
+ callbacks=callbacks,
160
+ shuffle=False,
161
+ )
162
+ return self
163
+
164
+ def predict_proba(self, X):
165
+ X_arr = np.asarray(X)
166
+ # Always pad the start with `sequence_length - 1` copies of the first row
167
+ # so the output has exactly one prediction per input row. (Without this pad
168
+ # we'd lose the first `sequence_length - 1` rows, 59 by default, of every test fold.)
169
+ n_pad = self.sequence_length - 1
170
+ pad = np.tile(X_arr[:1], (n_pad, 1))
171
+ X_padded = np.vstack([pad, X_arr])
172
+ X_seq, _ = build_sequences(
173
+ X_padded, np.zeros(len(X_padded)), self.sequence_length
174
+ )
175
+ return self.model.predict(X_seq, verbose=0)
176
+
177
+ def predict(self, X):
178
+ proba = self.predict_proba(X)
179
+ return self.classes_[np.argmax(proba, axis=1)]
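The windowing and front-padding logic in the LSTM wrapper is easier to check on a tiny input. This sketch uses plain lists in place of arrays; `build_windows` and `pad_front` are hypothetical stand-ins that mirror the behaviour described in the docstrings above.

```python
def build_windows(rows, length):
    """All contiguous windows of `length` rows; window i ends at row i + length - 1."""
    return [rows[i : i + length] for i in range(len(rows) - length + 1)]


def pad_front(rows, length):
    """Repeat the first row so that every original row ends exactly one window."""
    return [rows[0]] * (length - 1) + rows


rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]

plain = build_windows(rows, 3)                 # 5 rows -> 3 windows
padded = build_windows(pad_front(rows, 3), 3)  # 5 rows -> 5 windows

# After padding, window i ends on original row i.
aligned = all(window[-1] == rows[i] for i, window in enumerate(padded))
print(len(plain), len(padded), aligned)
```

The padded early windows contain repeated copies of the first row, so the first `length - 1` predictions are made on partly synthetic history and deserve less trust than the rest.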
src/models/xgb_model.py ADDED
@@ -0,0 +1,59 @@
1
+ """XGBoost classifier for triple-barrier labels.
2
+
3
+ Per Jansen Ch.12, gradient-boosted trees are the natural baseline for tabular
4
+ financial features and routinely beat LSTMs on these problems. The
5
+ hyperparameters here are conservative (shallow trees, moderate n_estimators) to
6
+ avoid overfitting on small per-fold training sets in the purged CV scheme.
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ import numpy as np
12
+ from xgboost import XGBClassifier
13
+
14
+
15
+ def build_xgb_classifier(random_state: int = 42) -> XGBClassifier:
16
+ """Returns a fresh XGBClassifier for one CV fold.
17
+
18
+ Output classes use the XGBoost-internal indexing ``{0, 1, 2}`` for
19
+ ``{-1, 0, +1}`` since XGBoost requires non-negative integer labels. The
20
+ training driver wraps this with an encoder.
21
+ """
22
+ return XGBClassifier(
23
+ objective="multi:softprob",
24
+ num_class=3,
25
+ max_depth=4,
26
+ n_estimators=300,
27
+ learning_rate=0.05,
28
+ subsample=0.8,
29
+ colsample_bytree=0.8,
30
+ reg_lambda=1.0,
31
+ eval_metric="mlogloss",
32
+ random_state=random_state,
33
+ n_jobs=-1,
34
+ tree_method="hist",
35
+ )
36
+
37
+
38
+ class XGBTripleBarrier:
39
+ """Thin wrapper that owns the label encoding from ``{-1, 0, 1}`` ↔ ``{0, 1, 2}``."""
40
+
41
+ def __init__(self, random_state: int = 42):
42
+ self.model = build_xgb_classifier(random_state=random_state)
43
+ self.classes_ = np.array([-1, 0, 1])
44
+
45
+ def fit(self, X, y, sample_weight=None):
46
+ y_enc = np.asarray(y).astype(int) + 1 # {-1, 0, 1} -> {0, 1, 2}
47
+ self.model.fit(X, y_enc, sample_weight=sample_weight)
48
+ return self
49
+
50
+ def predict(self, X):
51
+ y_pred_enc = self.model.predict(X)
52
+ return y_pred_enc - 1
53
+
54
+ def predict_proba(self, X):
55
+ return self.model.predict_proba(X)
56
+
57
+ @property
58
+ def feature_importances_(self):
59
+ return self.model.feature_importances_
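The label round trip the wrapper owns is small enough to show on its own. In this sketch `encode` and `decode_proba` are hypothetical helpers, and the probability rows are invented; the only assumption carried over from the code above is that probability column `j` corresponds to `classes_[j]`.

```python
CLASSES = (-1, 0, 1)


def encode(labels):
    """Shift {-1, 0, +1} to the non-negative indices {0, 1, 2}."""
    return [y + 1 for y in labels]


def decode_proba(proba_rows):
    """Pick the most probable column in each row and map it back to a label."""
    return [CLASSES[max(range(len(row)), key=row.__getitem__)] for row in proba_rows]


encoded = encode([-1, 0, 1, 1])
proba = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
decoded = decode_proba(proba)
print(encoded, decoded)
```

Keeping the shift inside the wrapper means the rest of the pipeline only ever sees the original `{-1, 0, +1}` labels.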
src/train.py ADDED
@@ -0,0 +1,102 @@
1
+ """CV-aware training driver — one harness for all five models.
2
+
3
+ The driver expects each model to expose ``fit(X, y, sample_weight=None)``,
4
+ ``predict(X)``, and (optionally) ``predict_proba(X)``. The triple-barrier label
5
+ ``{-1, 0, +1}`` is shared across all of them.
6
+
7
+ Sample weights come from AFML Ch.4 — observations whose label intervals overlap
8
+ contribute less unique information, so they should count less in the loss. The
9
+ simplest implementation is to weight inversely by the number of overlapping
10
+ labels (Snippet 4.1); for now the driver supports passing pre-computed weights
11
+ or falling back to uniform.
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ from collections.abc import Callable
17
+ from typing import Any
18
+
19
+ import numpy as np
20
+ import pandas as pd
21
+ from sklearn.preprocessing import StandardScaler
22
+
23
+ from .cv import PurgedKFold
24
+ from .eval import fold_metrics
25
+
26
+
27
+ def fit_predict_one_fold(
28
+ model_builder: Callable[[], Any],
29
+ X_train: pd.DataFrame,
30
+ y_train: pd.Series,
31
+ X_test: pd.DataFrame,
32
+ sample_weight_train: np.ndarray | None = None,
33
+ standardize: bool = True,
34
+ ) -> tuple[np.ndarray, Any]:
35
+ """Fit on the train fold, predict on the test fold. Returns (y_pred, fitted_model)."""
36
+ if standardize:
37
+ scaler = StandardScaler().fit(X_train.values)
38
+ X_train_s = pd.DataFrame(
39
+ scaler.transform(X_train.values), index=X_train.index, columns=X_train.columns
40
+ )
41
+ X_test_s = pd.DataFrame(
42
+ scaler.transform(X_test.values), index=X_test.index, columns=X_test.columns
43
+ )
44
+ else:
45
+ X_train_s, X_test_s = X_train, X_test
46
+ model = model_builder()
47
+ model.fit(X_train_s, y_train.values, sample_weight=sample_weight_train)
48
+ return model.predict(X_test_s), model
49
+
50
+
51
+ def run_cv(
52
+ model_name: str,
53
+ model_builder: Callable[[], Any],
54
+ X: pd.DataFrame,
55
+ y: pd.Series,
56
+ cv: PurgedKFold,
57
+ sample_weight: pd.Series | None = None,
58
+ standardize: bool = True,
59
+ extra_columns: dict | None = None,
60
+ ) -> pd.DataFrame:
61
+ """Run a model across all CV folds. Returns one row per fold."""
62
+ rows = []
63
+ for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X)):
64
+ X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
65
+ y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
66
+ sw_train = sample_weight.iloc[train_idx].values if sample_weight is not None else None
67
+
68
+ y_pred, _ = fit_predict_one_fold(
69
+ model_builder=model_builder,
70
+ X_train=X_train,
71
+ y_train=y_train,
72
+ X_test=X_test,
73
+ sample_weight_train=sw_train,
74
+ standardize=standardize,
75
+ )
76
+
77
+ metrics = fold_metrics(y_test.values, y_pred)
78
+ row = {"model": model_name, "fold": fold_idx, **metrics}
79
+ if extra_columns:
80
+ row.update(extra_columns)
81
+ rows.append(row)
82
+ return pd.DataFrame(rows)
83
+
84
+
85
+ def uniqueness_weights(t1: pd.Series) -> pd.Series:
86
+ """Approximate AFML Ch.4 sample-uniqueness weights.
87
+
88
+ For each event, count the events whose ``[start, t1]`` interval overlaps
89
+ its own (the event itself included, so the count is at least 1) and weight
90
+ by the inverse. Not the rigorous Snippet 4.1 (which counts overlap
91
+ proportionally), but the right order of magnitude and much faster.
92
+ """
93
+ weights = pd.Series(1.0, index=t1.index)
94
+ t1_arr = t1.values
95
+ start_arr = t1.index.values
96
+ n = len(t1)
97
+ for i in range(n):
98
+ overlap = np.sum((start_arr <= t1_arr[i]) & (t1_arr >= start_arr[i]))
99
+ weights.iloc[i] = 1.0 / max(overlap, 1)
100
+ # normalize so the weights sum to n (mean weight = 1)
101
+ weights *= n / weights.sum()
102
+ return weights
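The overlap-count weighting can be traced by hand on a few intervals. The sketch below uses integer `(start, end)` pairs in place of timestamps; the standalone function and the sample events are assumptions for illustration, following the approximate scheme described in the docstring above and not AFML's proportional version.

```python
def uniqueness_weights(intervals):
    """intervals: list of (start, end) with start <= end. Returns weights with mean 1."""
    raw = []
    for s_i, e_i in intervals:
        # Two closed intervals overlap when each one starts before the other ends.
        overlap = sum(1 for s_j, e_j in intervals if s_j <= e_i and e_j >= s_i)
        raw.append(1.0 / overlap)  # overlap >= 1 because every event overlaps itself
    scale = len(raw) / sum(raw)    # rescale so the weights sum to the number of events
    return [w * scale for w in raw]


# Three crowded events and one isolated event.
events = [(0, 5), (3, 8), (4, 6), (20, 25)]
weights = uniqueness_weights(events)
print(weights)
```

The three overlapping events each count three overlaps and share the same reduced weight, while the isolated event keeps the largest weight; after rescaling the weights still sum to the number of events.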
y_test_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:0b51bdd96ea2313f3a6566cd272c918161d87a704a5b4d7fdab557dca65bdac7
3
- size 3640
y_train_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9ddf60272a325d8ac0328830ed5a7c326a2f4553506a8d75587c8580f025b847
3
- size 22216
y_val_lstm.npy DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:a7518e2c4f02f2496ac9f5043a9eded9d7fb0e5b37c6c40f5c66dfee6a7bfef4
3
- size 1656